
Allen School student Mohit Shridhar earns NVIDIA Fellowship for his work in grounding language for vision-based robots

Mohit Shridhar in front of a mountain

Mohit Shridhar, a Ph.D. student working with Allen School professor Dieter Fox, has been named a 2022-2023 NVIDIA Graduate Fellow for his research in building generalizable systems for human-robot collaboration. Shridhar’s work is focused on connecting language to perception and action for vision-based robotics.

Shridhar aims to use deep learning to connect abstract concepts to concrete physical actions, with long-term reasoning, in pursuit of robot butlers. The Fellowship will help him continue his work in building robots that learn through embodied interactions rather than from static datasets. His own creation, CLIPort, a language-conditioned imitation-learning agent, combines precise spatial reasoning with generalizable semantic representations for vision and language. The framework fuses two streams, a semantic pathway and a spatial pathway, where the semantic stream uses an internet-pretrained vision-language model to bootstrap learning. This end-to-end framework can solve a variety of language-specified tabletop tasks, from packing unseen objects to folding clothes, with centimeter-level precision.
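The two-stream idea can be illustrated with a minimal sketch. This is not the actual CLIPort code; all names, shapes, and the fusion rule here are illustrative assumptions: a semantic pathway scores pixels against a language-goal embedding (standing in for a pretrained vision-language model), a spatial pathway preserves fine geometric detail, and the two are fused into a dense map of action scores.

```python
def semantic_stream(image, text_embedding):
    """Stand-in for per-pixel similarity to a CLIP-style language-goal embedding."""
    return [[sum(c * t for c, t in zip(px, text_embedding)) for px in row]
            for row in image]

def spatial_stream(image):
    """Stand-in for a fully convolutional pathway over raw pixels."""
    return [[sum(px) / len(px) for px in row] for row in image]

def cliport_like_policy(image, text_embedding):
    sem = semantic_stream(image, text_embedding)
    spa = spatial_stream(image)
    # Fuse the two pathways and act at the highest-scoring pixel.
    best, best_score = (0, 0), float("-inf")
    for r, (srow, prow) in enumerate(zip(sem, spa)):
        for c, score in enumerate(s + p for s, p in zip(srow, prow)):
            if score > best_score:
                best, best_score = (r, c), score
    return best  # (row, col) of the chosen pick/place location

# A toy 2x2 "image" of RGB pixels and a language embedding favoring red.
image = [[(0.1, 0.2, 0.1), (0.9, 0.1, 0.1)],
         [(0.2, 0.2, 0.2), (0.3, 0.3, 0.3)]]
text_embedding = (1.0, 0.0, 0.0)
print(cliport_like_policy(image, text_embedding))  # → (0, 1), the reddest pixel
```

The real system learns both pathways from demonstrations; the point of the sketch is only that semantic (language-aligned) and spatial evidence are combined per pixel before choosing an action.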

“Mohit’s CLIPort work is the first to show the power of combining general language and image understanding models with fine-grained robot manipulation capabilities,” said Fox, who leads the Allen School’s Robotics & State Estimation Lab and is senior director of robotics research at NVIDIA.

To communicate tasks to these robot butlers, Shridhar developed the Action Learning From Realistic Environments and Directives (ALFRED) dataset, from which agents learn to map natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED consists of 25,000 natural language directives, including high-level instructions like “rinse off a mug and place it in the coffee maker” and lower-level directions like “walk to the coffee maker on the right.” ALFRED’s tasks are more complex in sequence length, action space and language than those in previous vision-and-language task datasets.
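A hypothetical sketch of what one ALFRED-style annotation might look like can make the instruction-to-action mapping concrete. The field names and action strings below are illustrative assumptions, not the dataset's actual schema:

```python
# One imagined training example: a high-level goal, step-by-step low-level
# directions, and the ground-truth action sequence an agent should learn
# to produce from those instructions plus its egocentric camera frames.
sample = {
    "goal": "rinse off a mug and place it in the coffee maker",
    "low_level_instructions": [
        "walk to the coffee maker on the right",
        "pick up the mug next to the sink",
        "rinse the mug under the faucet",
        "put the mug in the coffee maker",
    ],
    "actions": ["MoveAhead", "RotateRight", "PickupObject", "PutObject"],
}

# The supervised objective, loosely: predict actions from (instructions, frames).
print(len(sample["low_level_instructions"]), "instructions,",
      len(sample["actions"]), "actions")
```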

Taking the next step beyond communicating tasks to robots, Shridhar wants them to reason about long-term actions without directly dealing with the complexities of the physical world. An example he gives is telling an agent to make an appetizer with sliced apples. ALFWorld, a simulator that enables agents to learn abstract, “textual” policies in an interactive TextWorld, trains the robot, without any physical interaction, to check the fruit bowl for apples and look in the drawers for a knife. Before ALFWorld, agents lacked the infrastructure to both reason abstractly and execute concretely.
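A toy example can show what a purely “textual” policy means. This is an illustrative sketch, not ALFWorld's actual API: the policy reads a text observation and emits a text command, with no physical control involved.

```python
def textual_policy(observation: str) -> str:
    """Hand-written stand-in for a learned text-to-text policy."""
    if "fruit bowl" in observation and "apple" in observation:
        return "take apple from fruit bowl"
    if "drawer" in observation:
        return "open drawer"
    return "look"

obs = "You see a fruit bowl with an apple and a closed drawer."
print(textual_policy(obs))  # → "take apple from fruit bowl"
```

In the real framework the policy is learned rather than hand-written, and the abstract text-level plan can later be grounded in a physically simulated or real environment.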

Shridhar intends to deploy ALFRED-trained models in household environments where a mobile manipulator can be commanded to perform tasks such as putting two plates on the dining table.

“I hope to build collaborative butler robots that aid and better human living,” Shridhar said.

Before coming to the Allen School, Shridhar received his Bachelor’s in Engineering from the National University of Singapore. He has interned at Microsoft Research, NVIDIA and an augmented reality startup. 

Shridhar is one of only 10 students recognized by the Graduate Fellowship Program for innovative research in Graphics Processing Unit (GPU) computing. Previous Allen School recipients of the NVIDIA Fellowship include Anqi Li (2020) and Daniel Gordon (2019).

Read more about the 2022-2023 NVIDIA Graduate Fellowship awards here.

Congratulations, Mohit!