Google was caught flat-footed when OpenAI suddenly released ChatGPT to the public a year ago, and the search giant has been furiously playing catch-up ever since. On Wednesday, Google announced its powerful new Gemini large language model (LLM), which it says is the first built from the ground up to process not just words but also sounds and images. Gemini was developed in part by the formidable brains at Google DeepMind, with involvement from across the organization. Based on what I saw in a press briefing earlier this week, Gemini could put Google back at the front of the current AI arms race.
Google is releasing a family of Gemini models: a large Ultra model for complex AI tasks, a midsize Pro model for more general work, and a smaller Nano model designed to run on mobile phones and the like. (In fact, Google plans to build Gemini into the Android OS for one of its phones next year.) The Ultra model “exceeds current state-of-the-art results” on 30 of the 32 benchmarks commonly used to test LLMs, Google says. It also scored 90% on a harder test called Massive Multitask Language Understanding (MMLU), which assesses a model’s comprehension across 57 subject areas, including math, physics, history, and medicine. Google says it’s the first LLM to outperform human experts on the test.
The models were pretrained (allowed to process large amounts of training data on their own) on images, audio, and code as well as text. A Google spokesperson tells me the new models were pretrained using “data” from YouTube but wouldn’t say whether they were pretrained by actually “watching” videos, which would be a major breakthrough. (OpenAI’s GPT-4 model is also multimodal and can accept image and voice prompts.)
Models that can see and hear are a big step forward in terms of functionality. When running on an Android phone, Gemini (the Nano version) could use the device’s camera and microphones to process images and sounds from the real world. Or, if Nano performs anything like the larger models, it might be used to identify and reason about real-world objects it “sees” through the lenses of a future augmented-reality headset (developed by Google or one of its hardware partners). That’s something Apple’s iPhone and Vision Pro mixed-reality headset probably won’t be able to deliver next year, though Meta is hard at work on XR headsets that perform this sort of visual computing.
During the press briefing Tuesday, Google screened a video showing Gemini reasoning over a set of images. In the video, a person placed an orange and a fidget toy on a table in front of a lens connected to Gemini. Gemini immediately identified both objects and responded with a clever commonality between the two items: “Citrus can be calming and so can the spin of a fidget toy,” it said aloud. In another video, Gemini was shown a math test on which a student had handwritten their calculations for a problem; it then identified and explained the errors in the student’s work.
In the near term, Gemini’s powers can be experienced through Google’s Bard chatbot. Google says Bard will now be powered by the mid-tier Gemini Pro model, which it expects will give the chatbot better learning and reasoning skills. Bard will upgrade to the more powerful Gemini Ultra model next year, says Sissie Hsiao, Google’s VP/GM of Assistant and Bard. Developers and enterprise customers will be able to access and build on Gemini Pro via an API served from Google Cloud starting December 13, a spokesperson said.
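Google hasn’t yet spelled out what that developer experience will look like. But if the Gemini Pro API follows the pattern of the company’s existing generative AI Python client, a call to the model could look roughly like the sketch below. The package name, the “gemini-pro” model identifier, and the method names here are my assumptions based on Google’s current tooling, not anything Google has confirmed.

```python
# Hypothetical sketch: calling Gemini Pro from Python, assuming Google exposes it
# through a client library modeled on its existing generative AI SDK. The package,
# model name ("gemini-pro"), and API key flow are assumptions, not confirmed details.
import google.generativeai as genai

# Authenticate with an API key issued through Google Cloud (assumed flow).
genai.configure(api_key="YOUR_API_KEY")

# Select the mid-tier Gemini Pro model and send a text prompt.
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Find a commonality between an orange and a fidget spinner."
)
print(response.text)
```

Enterprise customers would presumably reach the same model through Vertex AI, Google Cloud’s existing managed machine-learning platform, rather than a standalone API key, though Google hasn’t confirmed those details either.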