A few years ago, robots needed detailed maps and precise commands to navigate. Now, vision language models like Google’s Gemini, trained on images, video, and text, allow robots to interpret visual and spoken instructions, such as following a whiteboard sketch to reach a new destination. (Source: Image by RR)

Gemini Language Model Allows Robot to Understand, Execute Complex Commands

Google DeepMind has revealed that its robot, equipped with the latest version of the Gemini large language model, has been successfully performing tasks such as guiding people to specific locations and assisting around the office in Mountain View, California. The Gemini model allows the robot to understand and execute commands by integrating video and text processing, enabling it to navigate and interact with its environment more effectively. According to a story at wired.com, this advancement showcases the potential of large language models to extend beyond digital applications and perform useful physical tasks.

Demis Hassabis, CEO of Google DeepMind, noted that Gemini’s multimodal capabilities, which include handling both visual and auditory inputs, could unlock new abilities for robots. In a new research paper, the team reported that the robot achieved up to 90 percent reliability in navigation tasks, even with complex commands. The team says this makes interacting with the robot noticeably more natural, underscoring the usability gains that come from pairing advanced AI with robotics.

The use of large language models like Gemini in robotics is part of a broader trend in academic and industry research aimed at enhancing robot capabilities. At the International Conference on Robotics and Automation, numerous papers discussed the application of vision language models to robotics. This growing interest is further fueled by significant investments in startups, such as Physical Intelligence and Skild AI, which aim to combine AI advances with real-world training to develop robots with general problem-solving abilities.

Traditionally, robots required detailed maps and specific commands to navigate their environment. However, large language models trained on images, video, and text—known as vision language models—provide useful information about the physical world, allowing robots to answer perceptual questions and follow visual instructions. The researchers plan to expand their testing to different types of robots and explore more complex interactions, demonstrating the continued evolution and potential of AI-powered robotics.
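As a rough illustration of that pattern, and not DeepMind’s actual robot stack, the sketch below sends a single camera frame plus a natural-language instruction to a vision language model through Google’s public google-generativeai Python SDK and asks it to pick the next high-level motion. The model name, prompt wording, and action vocabulary are assumptions made for the example.

```python
# Illustrative sketch only: the general "camera frame + instruction -> action"
# pattern described above, using the public google-generativeai SDK.
# This is NOT DeepMind's robot system; the model name, prompt, and action
# set are placeholder assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name


def suggest_navigation_step(camera_frame_path: str, instruction: str) -> str:
    """Ask a vision language model for the next high-level navigation action."""
    frame = Image.open(camera_frame_path)  # current view from the robot's camera
    prompt = (
        "You are guiding a mobile robot. Given the camera image and the "
        f"instruction '{instruction}', reply with exactly one of: "
        "FORWARD, TURN_LEFT, TURN_RIGHT, STOP."
    )
    # The SDK accepts a list of parts (images and text) as a multimodal prompt.
    response = model.generate_content([frame, prompt])
    return response.text.strip()


if __name__ == "__main__":
    # Hypothetical usage: one camera frame plus a spoken request from a person.
    action = suggest_navigation_step("frame.jpg", "Take me to the whiteboard")
    print(action)
```

Constraining the model’s reply to a small, fixed action vocabulary is one simple way to turn free-form text into something a robot controller can act on; a real system would layer mapping, memory, and safety checks on top of a step like this.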

read more at wired.com