AI Gets Interpretive with the Dall-E Image Producing Program

Rob Toews of wrote a column last week on AI that ended with this statement: “Things are only going to get more amazing from here.”

That is an understatement based on OpenAI’s new program model.

Earlier this month, OpenAI—the research organization behind last summer’s much-hyped language model GPT-3—released a new AI model named DALL-E. While it has generated less buzz than GPT-3 did, DALL-E has even more profound implications for the future of AI.

DALL-E uses text captions as input and produces original images as output. (The name is a tribute to the surrealist artist Salvador Dalí and the adorable Pixar robot WALL-E.)

For instance, when fed phrases as diverse as “a pentagonal green clock,” “a sphere made of fire” or “a mural of a blue pumpkin on the side of a building,” DALL-E is able to generate shockingly accurate visual renderings. (It is worth taking a few minutes to play around with some examples yourself.)

Why is DALL-E important?

As the first step towards a new AI paradigm known as “multimodal AI,” DALL-E represents the future of AI. Multimodal AI systems can interpret, synthesize, and translate between multiple informational modalities—in DALL-E’s case, language and imagery. DALL-E is not the first, but is the most advanced.

OpenAI co-founder Ilya Sutskever summed it up well:

“The world isn’t just text. Humans don’t just talk: we also see. A lot of important context comes from looking.”

Most AI systems in existence today deal with only one type of data. NLP models (e.g., GPT-3) handle only text; computer vision models (e.g., facial recognition systems) handle only images. This is a far less rich form of intelligence than what the human brain achieves effortlessly.

Humans continuously receive and integrate information from not one but five senses—we understand the world around us through a combination of sight, sound, touch, smell and taste. And we communicate information back to the world in a variety of ways—speech, text, body language, facial expression, music.

By pairing an understanding of natural language with an ability to generate corresponding visual representations—in other words, by being able to both “read” and “see”—DALL-E is a powerful demonstration of multimodal AI’s potential.

Toews predicts that future AI systems will “engage seamlessly across audio, video, speech, images, written text, haptics, and beyond.”