PhD Research Intern, Visual Object Understanding
Zoox
Foster City, CA, USA
USD 9,500 / month
About Our Internship Program
Zoox’s internship program offers hands-on experience with cutting-edge technology, mentorship from some of the industry’s brightest minds, and the opportunity to make meaningful contributions to real projects. We seek interns who demonstrate strong academic performance, engagement beyond the classroom, intellectual curiosity, and a genuine interest in Zoox’s mission.
Project Overview
The Perception Attributes team builds the agent semantics layer of Zoox's perception stack. Our models classify what obstacles mean — detecting emergency vehicle lights, pedestrian gestures, turn signals, and dozens of other behavioral signals that inform how the AV responds. This work sits at the intersection of safety-critical autonomy and cutting-edge ML: our models run on every Zoox vehicle, and our outputs directly influence decisions like yielding to emergency vehicles and interacting with construction workers. The team is small, moves fast, and collaborates closely with ML researchers across the AI org.
During the internship, you will work on one of the most exciting open problems in AV perception: using modern foundation models — large vision-language models, multimodal transformers, and audio-visual architectures — to dramatically expand the semantic understanding of our perception stack. Current approaches require months of data collection and labeling to add a single new attribute class. The research goal is to change that fundamentally, using VLMs and language-aligned representations to make our models more generalizable, queryable, and data-efficient. The work spans dataset construction, model design, and evaluation — with direct implications for how Zoox handles novel emergency vehicles, complex pedestrian behavior, and safety-critical edge cases as we scale to new cities.
Requirements:
Currently working towards a Ph.D. or other advanced degree in a relevant engineering program
Good academic standing
Able to commit to a 12-week internship beginning in late May or June of 2026
At least one previous industry internship, co-op, or project completed in a relevant area
Ability to relocate to the Bay Area, California (or Boston, Massachusetts) for the duration of the internship
Interns at Zoox may not use any proprietary information they work on in their thesis or in any work published with their university, and may not distribute it to anyone outside of Zoox
Qualifications (It’s helpful if you meet a majority of the following qualifications, but it isn’t a requirement):
- Strong background in computer vision and deep learning
- Experience training and evaluating ML models in PyTorch
- Familiarity with vision transformers, contrastive learning, or knowledge distillation
Bonus Qualifications:
- Experience with vision-language models (CLIP, SigLIP, LLaVA, or similar)
- Familiarity with knowledge distillation or multimodal learning
- Experience with large-scale dataset construction or data pipelines