Gemini Language Model Allows Robot to Understand, Execute Complex Commands
Google DeepMind has revealed that a robot equipped with the latest version of its Gemini large language model has been successfully performing tasks such as guiding people to specific locations and assisting around the office in Mountain View, California. Gemini allows the robot to understand and execute commands by integrating video and text processing, enabling it to navigate and interact with its environment more effectively. According to a report at wired.com, this advancement showcases the potential of large language models to extend beyond digital applications and perform useful physical tasks.
Demis Hassabis, CEO of Google DeepMind, noted that Gemini’s multimodal capabilities, which include handling both visual and auditory inputs, could unlock new abilities for robots. In a new research paper, the team reported that the robot achieved up to 90 percent reliability in navigation tasks, even when given complex commands. The result points to more natural and usable human-robot interaction when advanced AI models are paired with robotic systems.
The use of large language models like Gemini in robotics is part of a broader trend in academic and industry research aimed at enhancing robot capabilities. At the International Conference on Robotics and Automation, numerous papers discussed the application of vision language models to robotics. This growing interest is further fueled by significant investments in startups, such as Physical Intelligence and Skild AI, which aim to combine AI advances with real-world training to develop robots with general problem-solving abilities.
Traditionally, robots required detailed maps and specific commands to navigate their environment. However, large language models trained on images, video, and text—known as vision language models—provide useful information about the physical world, allowing robots to answer perceptual questions and follow visual instructions. The researchers plan to expand their testing to different types of robots and explore more complex interactions, demonstrating the continued evolution and potential of AI-powered robotics.
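To make the contrast concrete, the sketch below shows one way a robot controller might pair a camera frame with a natural-language request and ask a vision-language model for its next navigation step, instead of looking up a pre-built map. This is a minimal illustration, not DeepMind's actual system: the model call (`query_vlm`), the action set, and all names here are hypothetical placeholders.

```python
# Hypothetical sketch: instruction-following navigation via a vision-language
# model (VLM), rather than a detailed map plus hard-coded commands.

from dataclasses import dataclass

# A small discrete action set the controller can execute (assumed for illustration).
ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

@dataclass
class Observation:
    frame_jpeg: bytes   # latest camera image from the robot
    instruction: str    # e.g. "Take me somewhere I can draw"

def query_vlm(image: bytes, prompt: str) -> str:
    """Placeholder for a multimodal model call; a real system would send the
    image and prompt to a model such as Gemini and return its text reply."""
    return "turn_left"  # canned reply so the sketch runs end to end

def next_action(obs: Observation) -> str:
    # Combine the user's request with the current camera view in one prompt
    # and constrain the model's reply to a known action vocabulary.
    prompt = (
        "You control a mobile office robot. Given the camera view and the "
        f"user request '{obs.instruction}', reply with exactly one of: "
        + ", ".join(ACTIONS)
    )
    reply = query_vlm(obs.frame_jpeg, prompt).strip().lower()
    return reply if reply in ACTIONS else "stop"  # fall back safely on unexpected output

if __name__ == "__main__":
    obs = Observation(frame_jpeg=b"...", instruction="Take me to the whiteboard")
    print(next_action(obs))  # -> "turn_left" with the canned placeholder reply
```

In this framing, the model's general knowledge of the visual world stands in for the hand-built map, which is what lets the robot answer perceptual questions and follow instructions it was never explicitly programmed for.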
read more at wired.com