Sheng Liu

I am an applied scientist at Amazon Prime Video. I work on computer vision and natural language processing.

I obtained my Ph.D. from the University at Buffalo, SUNY, where I was advised by Prof. Junsong Yuan. Before that, I worked at Nanyang Technological University. I received my Bachelor's degree from Xi'an Jiaotong University, where I was a member of the Special Class for the Gifted Young.

Email  /  CV  /  Google Scholar  /  LinkedIn

News
  • 2023.02:   Our paper was accepted to CVPR 2023.
  • 2022.05:   I joined Amazon Prime Video as an applied scientist.
  • 2022.05:   I graduated 🎓! Thank you, Prof. Yuan!
  • 2022.02:   Our paper was accepted to CVPR 2022.
  • 2021.10:   Our paper was accepted to AAAI 2022.

Research

At Amazon, I work on image generation, 3D scene reconstruction, and large language models (LLMs) to support Virtual Product Placement for Prime Video content. Our work on image compositing and structure-from-motion was published at CVPR'23 and CVPR'22, respectively. I also have hands-on experience with LLMs, including aligning them with human preferences via direct preference optimization (DPO) and building retrieval-augmented generation (RAG) pipelines. Additionally, I proposed and co-led the development of a computer vision solution for real-time virtual product placement on Twitch.
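
To give a flavor of the preference-alignment work, below is a minimal sketch of the DPO objective in PyTorch. The function name and the use of summed per-sequence log-probabilities are illustrative assumptions, not Amazon's implementation.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of the DPO loss (illustrative, not production code).
    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Each input is a tensor of per-sequence log-probabilities (summed over
        # tokens) for the preferred ("chosen") and dispreferred ("rejected")
        # responses, under the trainable policy and a frozen reference model.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # DPO widens the margin between the two log-ratios, scaled by beta;
        # -logsigmoid turns that margin into the standard loss.
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

    # Example: a batch of 4 preference pairs with random log-probabilities.
    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))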

My Ph.D. research focused on vision-and-language (VL), including vision-and-language pre-training, visual question answering, video captioning, visual grounding, and machine translation. I also worked on neural rendering, using neural radiance fields (NeRF) to learn kinematic formulas and create 3D avatars.

Selected Publications
LEMaRT: Label Efficient Masked Region Transform for Image Harmonization
Sheng Liu, Cong Phuoc Huynh, Cong Chen, Maxim Arap, Raffay Hamid
CVPR, 2023
paper

We designed a self-supervised pre-training method for image harmonization that outperforms existing methods while using less than 50% of the labeled training data they require.

DepthSfM: Depth Guided Sparse Structure from Motion
Sheng Liu, Xiaohan Nie, Raffay Hamid
CVPR, 2022
paper / video / poster / data

Leveraging depth priors enables DepthSfM to faithfully reconstruct 3D scene structures from movies, TV shows, and Internet photo collections. DepthSfM can reconstruct parts of Los Angeles from a 2-second clip of Bosch. Check out this video to see the results.

OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Pre-training
Sheng Liu, Kevin Lin, Lijuan Wang, Junsong Yuan, Zicheng Liu
AAAI, 2022
paper / video / data

We introduced open-vocabulary visual instance search (OVIS), which aims to search for and localize visual instances using arbitrary textual queries, and developed a large-scale vision-and-language pre-trained model for OVIS. Check out this 1-minute demo video where we compare our model with Google and Bing: our model uses only the visual information of images, while Google and Bing also leverage textual metadata. It's quite intriguing! 😜

Learning Kinematic Formulas from Multiple View Videos
Liangchen Song*, Sheng Liu*, Celong Liu, Zhong Li, Yuqi Ding, Yi Xu, Junsong Yuan
ACM MM, 2022
* indicates equal contribution
paper

We proposed a framework that learns kinematic formulas, e.g., the kinematic equations for objects in free fall, in an unsupervised manner by leveraging neural radiance fields (NeRF).
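
For the free-fall example above, the target formula is the standard kinematic equation (textbook physics, not notation taken from the paper):

    y(t) = y_0 + v_0 t - \frac{1}{2} g t^2, \qquad g \approx 9.81 \,\mathrm{m/s^2}

Here y_0 and v_0 are the initial height and velocity; the framework recovers such relations from multi-view videos without supervision.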

NeCH: Neural Clothed Human Model
Sheng Liu*, Liangchen Song*, Yi Xu, Junsong Yuan
VCIP, 2022
* indicates equal contribution
paper

We proposed a neural clothed human model that learns neural radiance fields (NeRF) to represent animatable 3D avatars.

SibNet: Sibling Convolutional Encoder for Video Captioning
Sheng Liu, Zhou Ren, Junsong Yuan
TPAMI, 2021
paper

This is the journal version of our paper presented at ACM MM'18. We designed a two-branch visual encoder for video captioning: one branch encodes high-level semantic information, while the other encodes low-level content information of videos.
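
A minimal sketch of this two-branch design in PyTorch is shown below; the module names, dimensions, and fusion by concatenation are illustrative assumptions rather than the exact SibNet architecture.

    import torch
    import torch.nn as nn

    # Illustrative sketch of a two-branch ("sibling") video encoder,
    # not the exact SibNet architecture.
    class SiblingEncoder(nn.Module):
        def __init__(self, feat_dim=2048, hid_dim=512):
            super().__init__()
            def branch():
                # Temporal (1D) convolutions over per-frame CNN features.
                return nn.Sequential(
                    nn.Conv1d(feat_dim, hid_dim, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1),
                )
            self.content_branch = branch()   # low-level visual content
            self.semantic_branch = branch()  # high-level semantics

        def forward(self, frame_feats):
            # frame_feats: (batch, time, feat_dim) per-frame features.
            x = frame_feats.transpose(1, 2)  # -> (batch, feat_dim, time)
            fused = torch.cat([self.content_branch(x),
                               self.semantic_branch(x)], dim=1)
            return fused.transpose(1, 2)     # (batch, time, 2 * hid_dim)

    # Example: 8 frames of 2048-d features feed a downstream captioning decoder.
    out = SiblingEncoder()(torch.randn(2, 8, 2048))  # shape: (2, 8, 1024)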

SibNet: Sibling Convolutional Encoder for Video Captioning
Sheng Liu, Zhou Ren, Junsong Yuan
ACM MM, 2018 (Oral)
paper

We designed a two-branch visual encoder for video captioning: one branch encodes high-level semantic information, while the other encodes low-level content information of videos.

Services
  • Conference reviewer: NeurIPS, ICLR, CVPR, ECCV, ICCV, AAAI
  • Journal reviewer: TPAMI, TIP, TCSVT