Xinyu Huang (黄新宇)
I am a final-year Ph.D. student at the School of Computer Science, Fudan University, advised by
Prof. Rui Feng and Prof. Yuejie Zhang. From March to September 2025, I am visiting the MMLab at Nanyang Technological University under the supervision of Prof. Ziwei Liu.
From 2020 to 2024, I researched visual perception and created the Recognize Anything Model (RAM) Family: a series of open-source, powerful image perception models that exceed CLIP by more than 20 points on fine-grained perception. This work was done at OPPO Research Institute & IDEA, where I was fortunate to collaborate with Youcai Zhang, Prof. Yandong Guo, and Prof. Lei Zhang.
Since 2024, I have focused on the end-to-end construction of large multimodal models (such as GPT-4o-style models), working with a talented team at TikTok.
I expect to graduate in September 2025 and am open to both academic and industry research positions. Please feel free to download my Resume, and do not hesitate to email me if you're interested :)
Email / Scholar / Github / Zhihu
Research (* indicates equal contribution)
Recognize Anything Plus Model (RAM++)
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
arXiv, 2023
arXiv / code
RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
Recognize Anything Model (RAM)
Recognize Anything: A Strong Image Tagging Model
Youcai Zhang*, Xinyu Huang*, Jinyu Ma*, Zhaoyang Li*, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang
CVPR 2024, Multimodal Foundation Models Workshop
project page / arXiv / demo / code
RAM is an image tagging model that can recognize any common category with high accuracy.
Tag2Text Vision-Language Model
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Lei Zhang
ICLR 2024
project page / arXiv / demo / code
Tag2Text is a vision-language model guided by image tagging, which supports tagging and comprehensive captioning simultaneously.
IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training
Xinyu Huang, Youcai Zhang, Ying Cheng, Weiwei Tian, Ruiwei Zhao, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Xiaobo Zhang
ACM MM, 2022
arXiv / code
We propose IDEA to provide more explicit textual supervision (including multiple valuable tags and texts composed of those tags) for visual models.
Simple and Robust Loss Design for Multi-Label Learning with Missing Labels
Youcai Zhang*, Yuhao Cheng*, Xinyu Huang*, Fei Wen, Rui Feng, Yaqian Li, Yandong Guo
arXiv, 2021
arXiv / code
Multi-label learning in the presence of missing labels (MLML) is a challenging problem. We propose two simple yet effective methods via robust loss design.
Recognize Anything Family
Project Creator/Owner
3.1K+ stars!
We provide the Recognize Anything Model (RAM) Family, demonstrating superior image recognition ability!
RAM-Grounded-SAM
Project Co-Leader
15.9K+ stars!
The RAM Family combined with Grounded-SAM can automatically recognize, detect, and segment anything in an image, showcasing powerful image recognition capabilities!