MIT and IBM researchers have developed a context-based training method that enables AI models to identify personalized objects—like one’s own pet—by teaching them to localize based on situational cues rather than memorized labels, improving accuracy by up to 21% and expanding real-world applications from assistive tech to environmental monitoring.

Study Highlights Contextual Learning as Path Toward Human-Like Perception

Researchers at MIT, the MIT-IBM Watson AI Lab, and the Weizmann Institute of Science have developed a new training method that teaches vision-language models (VLMs) to identify and localize personalized objects—a capability current models like GPT-5 struggle with. While today’s multimodal AIs can easily detect general categories like “a dog,” they often fail to distinguish between unique instances, such as one’s own pet. The new approach helps AI systems track individual objects, like a specific French Bulldog named Bowser, by using contextual understanding rather than rote memorization.

The team built a specialized dataset from video-tracking data showing the same object across multiple frames, forcing models to rely on environmental context rather than preexisting knowledge. To prevent “cheating,” researchers replaced object labels with pseudonyms—changing “tiger” to “Charlie,” for example—so the AI couldn’t simply match known visual patterns to names. This technique, as noted on news.mit.edu, encourages the model to focus on visual and situational cues to locate the target across varying backgrounds, lighting, and perspectives.
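To make the pseudonym idea concrete, here is a minimal sketch of how such training samples might be assembled from video-tracking annotations. The function name, field names, and pseudonym list are illustrative assumptions, not the authors’ actual code or data format.

```python
import random

# Hypothetical pool of neutral pseudonyms standing in for real category names,
# so the model cannot simply match a familiar label ("tiger") to memorized visuals.
PSEUDONYMS = ["Charlie", "Bowser", "Nina", "Rex", "Milo"]

def make_personalized_sample(track, num_context=3):
    """Build one few-shot localization sample from a video object track.

    `track` is assumed to be a list of per-frame annotations, each a dict like
    {"image": "frame_0012.jpg", "box": (x1, y1, x2, y2), "category": "tiger"},
    with at least num_context + 1 frames.
    """
    name = random.choice(PSEUDONYMS)                # swap the real label for a pseudonym
    frames = random.sample(track, num_context + 1)  # a few context frames plus one query frame
    context, query = frames[:num_context], frames[-1]

    # Context entries introduce the named object in varied scenes;
    # the final entry asks the model to localize it in a new frame.
    prompt = [{"image": f["image"], "text": f"This is {name}, located at {f['box']}."}
              for f in context]
    prompt.append({"image": query["image"],
                   "text": f"Find {name} in this image and return its bounding box."})
    target = str(query["box"])                      # supervision: the box in the query frame
    return {"prompt": prompt, "target": target}
```

A dataset of such samples would then feed a standard instruction-tuning run on the vision-language model; the paper’s actual prompt format and sampling scheme may differ.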

Results were striking: models retrained using this dataset achieved up to 21% higher accuracy in personalized object localization while maintaining their general vision-language performance. This breakthrough enables AIs to recognize specific items or beings—like a child’s backpack, a species of animal in the wild, or a misplaced household object—without retraining from scratch. The research also has potential applications in assistive technology, allowing systems to help visually impaired users identify personal belongings or track specific individuals.

MIT postdoc Jehanzeb Mirza, the paper’s senior author, said the work “reframes few-shot personalized object localization as an instruction-tuning problem.” The study, to be presented at the International Conference on Computer Vision, highlights how AI can be taught to learn from context more like humans do—bridging the gap between general recognition and personal awareness.
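The instruction-tuning framing also suggests how a tuned model might be used in practice: supply a few reference photos of the personal object, give it a name, and ask for its location in a new image. The sketch below illustrates that interaction only; `PersonalizedVLM`, its `generate()` method, and the reply format are assumptions for illustration, not the paper’s interface.

```python
from typing import List, Tuple

class PersonalizedVLM:
    """Placeholder for a vision-language model tuned with the contextual method."""
    def generate(self, images: List[str], instruction: str) -> str:
        # A real model would return text such as "142,88,310,260" for the query image.
        raise NotImplementedError

def locate_my_dog(model: PersonalizedVLM,
                  reference_photos: List[str],
                  new_photo: str) -> Tuple[int, int, int, int]:
    """Ask the model to find a named personal object in a new photo."""
    instruction = (
        "The first three images show my dog, Bowser. "
        "Return Bowser's bounding box in the last image as x1,y1,x2,y2."
    )
    reply = model.generate(images=reference_photos + [new_photo],
                           instruction=instruction)
    x1, y1, x2, y2 = (int(v) for v in reply.split(","))
    return x1, y1, x2, y2
```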

read more at news.mit.edu