When Google announced Gemini Omni at I/O on Monday, I rolled my eyes a little. We have been hearing about the next great AI video model every six weeks for the last two years, and most of them turn out to be impressive demos that fall apart the moment you try to use them for real work. I almost did not bother trying this one.
Then a friend who runs a small video production studio messaged me at midnight saying I had to see what he had generated. He sent me a fifteen-second clip of a coffee being poured in slow motion, shot from three different angles, with a soft jazz score underneath. I asked him how long the actual production took. He said about four minutes. That got my attention.
I spent the next three days putting Gemini Omni through everything I could think of, from quick social clips to more involved creative experiments. Here is what I actually found, the good and the not so good.
Day One: The Honeymoon Phase
The first afternoon was, frankly, magical. I started with simple text prompts because that is the easiest way to get a sense of any new model. A cinematic shot of rain on a window with a city skyline in the background. A drone flyover of a Greek island village at sunset. A close-up of someone's hands kneading bread dough.
All three came back looking like they had been pulled from a high-budget travel show. The lighting was consistent. The physics looked right. The camera moves felt intentional rather than glitchy.
This is the part where most new AI tools have already lost me, because the demo magic vanishes after the third or fourth generation when you start asking for anything specific. Gemini Omni held up. I generated about thirty clips that first afternoon and would estimate that maybe four of them had obvious problems. The rest ranged from "usable with a small edit" to "I would put this in a paid client deliverable."
What Actually Surprised Me
A few things caught me off guard, in a good way.
The conversational editing is the headline feature in every press writeup, but you do not really understand how much it changes the workflow until you use it. I generated a clip of a person walking down a street and then just typed "make it golden hour instead of midday." The model returned essentially the same shot, same person, same street, with the lighting completely changed. Then I asked it to add a slow push-in on the subject. Same story: the shot stayed intact and only the camera move changed. No regenerating from scratch, no losing the parts I liked.
The multimodal input is also more useful than I expected. I tried something a bit unusual: I uploaded a photo of my office, then a thirty-second voice memo describing the mood of a video I wanted, then asked the model to combine the two. It generated a video of the office that visually matched my space but had the energy and pacing I described in the audio. This is not how I have ever prompted a creative tool before. It took some adjustment.
The third thing was the speed. I am used to AI video tools where you submit a generation, get a coffee, come back, refine, get another coffee. Gemini Omni Flash is fast enough that the feedback loop felt like real-time creative work. You think, you prompt, you see, you adjust. The whole afternoon felt like sketching, not waiting.
Where It Falls Apart
I want to be honest because there are real limitations, and the early reviews glossing over them are doing nobody any favours.
Text inside the video is still a disaster. I tried generating a clip of a storefront with a specific business name on the sign. Every attempt produced garbled, unreadable lettering. This is not unique to Gemini Omni; every AI video model has this problem in May 2026. But it does mean you cannot use the tool to generate anything that needs readable on-screen text. You have to add that in post.
Generation length is also limited. Anything beyond about ten seconds starts to degrade noticeably. The lighting drifts, character consistency breaks down, and the camera move can start to feel like it is improvising. For short-form social content this is fine, but for anything longer you are stitching clips together in a normal editor, and the moment you do that, the seams between AI-generated clips become visible.
Character consistency across multiple generations is the biggest practical problem. If I generate a clip of "a woman in a red coat walking through a market" and then ask for a follow-up clip of "the same woman entering a cafe," the second clip will feature a noticeably different woman. Same outfit, similar build, but different face. This makes it very hard to use Gemini Omni for any kind of narrative work involving a recurring character.
And the SynthID watermark, while invisible, is permanent. Every clip is detectably AI-generated by anyone who runs it through Google's detection tools. I think this is the right ethical default, but if you were hoping to pass off AI footage as real for any reason, that ship has sailed.
The Thing Nobody Is Talking About
Here is what I found most interesting after three days of testing, and what I have not seen anyone write about yet.
The model is best when you treat it as a collaborator rather than a tool. The prompts that produced my worst results were the ones where I had a very specific output in my head and tried to instruct the model precisely to that output. The prompts that produced my best results were the ones where I gave it a starting direction and then iterated with it, letting it surprise me, taking what it gave back and pushing in directions I had not originally planned.
This is a different relationship to a creative tool than most of us have had with software. It is closer to working with another person who has their own opinions than to operating a piece of equipment.
That probably sounds woolly, but it changes the workflow in real ways. I stopped writing detailed multi-paragraph prompts and started writing short intent statements. I let the model take a swing and then directed from there. The output quality went up noticeably.
If you want to compare notes with what other early users are finding, the running compilation at Gemini Omni has been gathering hands-on reports, benchmarks, and prompt examples from people in the first wave of testing. Looking through it, I noticed many of my observations were echoed by others, which made me feel a little less like I was imagining things.
Final Take After Three Days
Would I use Gemini Omni for client work tomorrow? For certain types of client work, yes. Mood reels, concept pitches, paid social variants, internal training content, product visualisations from a single reference photo. These all became significantly cheaper to produce in the last week.
Would I use it to replace a real production? No, and I do not think anyone serious would. The character consistency limitations alone rule out most narrative work, and the watermarking rules out any application that depends on passing the footage off as real.
What this tool actually changes is the floor. A solo creator can now produce work that a year ago would have required a small team. A small marketing department can now generate variant testing material that previously needed an outside production agency. The very top end of video production is unchanged. The middle and bottom of the market just shifted under everyone's feet.
I will keep using it. I have a long list of experiments I did not get to. The interesting question is not whether this version of Gemini Omni is good enough for your use case. The interesting question is what the version six months from now will be capable of, and what that means for everyone who makes a living adjacent to video work.