Thirty Seconds of Scene 1 - Rex Marks The Spot

Back in March I wrote about cracking character consistency — generate one start frame, let a video model carry it into motion, and the end frame comes free because both frames come from the same generation. That fixed storyboard panels. It did not give me a scene.

A scene is shots. Shots are cuts. And every cut is a fresh chance for a character to forget what her face looks like.

So this week I tried two tools I hadn't touched before — mitte.ai and Google Flow — with one question: can either of them hold a character across a whole scene, not just a single clip?

Short answer: better than I expected, and still not enough.

mitte.ai

mitte is a front-end bolted onto a pile of current video models — Seedance 2, Veo, Kling, a dozen others. I fed it Scene 1, the one where Gabe and Nina are trying to leave for date night while the storm rolls in and the kids stall them at the door.

I gave Seedance 2 the four character turnarounds and the storyboard panels, one shot at a time, and asked for six shots: the establishing wide, Nina speed-walking through the room half-dressed, the two-shot of the parents, the over-the-shoulder of the kids, Mia's close-up, and Gabe finally caving. Then I stitched them.

Six shots, about thirty seconds, Seedance 2 via mitte.ai. No audio — that gets added later.
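For anyone wanting to reproduce the stitching step, here is a minimal sketch of one common way to do it, using ffmpeg's concat demuxer. This is an assumption, not necessarily how the cut above was assembled: the clip filenames are hypothetical, and stream copy (`-c copy`) only works cleanly when every clip shares the same codec, resolution, and frame rate. The script only builds the list file text and the command; it does not run ffmpeg.

```python
import shlex


def concat_list(clips):
    """Return the text of an ffmpeg concat-demuxer list file, one clip per line, in order."""
    lines = []
    for clip in clips:
        # The list format wraps each path in single quotes;
        # a literal single quote inside a path is written as '\''
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path, output_path):
    """Build the ffmpeg argument list that joins the listed clips without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # read the input as a concat list file
        "-safe", "0",     # allow paths outside the list file's directory
        "-i", str(list_path),
        "-c", "copy",     # copy streams as-is (assumes matching clip formats)
        "-an",            # drop any audio track; sound is added later
        str(output_path),
    ]


if __name__ == "__main__":
    # Hypothetical filenames for the six shots, in cut order.
    shots = [
        "shot_01_wide.mp4",
        "shot_02_nina.mp4",
        "shot_03_parents.mp4",
        "shot_04_kids_ots.mp4",
        "shot_05_mia_cu.mp4",
        "shot_06_gabe.mp4",
    ]
    print(concat_list(shots), end="")
    print(shlex.join(concat_command("scene1.txt", "scene1.mp4")))
```

Save the printed list as `scene1.txt` and run the printed command. If the clips come from different models and don't match, dropping `-c copy` lets ffmpeg re-encode them instead.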

Watching it back: Mia is Mia in every shot. Leo is Leo. The living room is the same living room — same lamp, same window, same dinosaur toys on the floor. That has never been true before. Six months ago this was the exact thing that broke every time.

It is not free of problems. The tooling is a website — there's no API, so "rendering a scene" means me clicking through a browser, uploading the same reference images over and over, one shot at a time. Seedance's auto-generated audio tripped a content filter twice before I learned to turn it off. And it caps you at two generations at once, so a six-shot scene is a slow afternoon. But the shots themselves hold up. That's the part that matters.

Google Flow

Flow is Google's own thing — same Veo 3.1 under the hood, but built like an actual production tool. It has a Characters panel: upload a character once and it stays in the project. After a session of re-uploading turnarounds into mitte, that felt like a real workflow instead of a chore.

The first shot came out gorgeous and wrong. I'd given Flow the characters but not the room, and Veo decided the scene didn't need a television — a problem when the whole point of the shot is three kids watching one.

Beautiful. Also missing the entire premise.

The fix was to hand it the storyboard panel as the literal first frame instead of a loose reference. Do that and it behaves — same room, same blocking, the camera holds.

Same shot, with the storyboard panel handed in as the first frame. The TV comes back.

So Flow can do it. The catch is the meter. Veo's premium setting burns credits fast: I got about four shots out of it before Google told me I was out. There's a cheaper, faster model, Omni Flash, that's genuinely close in quality at a quarter of the cost, and I finished the parents' two-shot on it.

Gabe and Nina, rendered on Omni Flash: a quarter of the cost of the premium Veo setting, and hard to tell apart from it.

Both leave a mark, though. The expensive Veo models stamp a "Veo" watermark into the corner of every frame; Omni Flash leaves only a small sparkle. Neither is something I want in a finished film.
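To put the two throttles from this week in rough numbers, here is a back-of-envelope sketch. It leans on figures that are only approximate in the text ("about four" premium shots per credit allotment, "a quarter the cost" for Omni Flash, two generations at once on mitte), and the function names are made up for illustration. Real credit pricing may not scale this cleanly.

```python
import math


def shots_on_cheaper_model(premium_shots, times_cheaper):
    """Shots the same credits would buy on a model that costs 1/times_cheaper as much."""
    return premium_shots * times_cheaper


def batches_needed(shots, concurrent_cap):
    """Rounds of generation needed when only concurrent_cap shots can run at once."""
    return math.ceil(shots / concurrent_cap)


if __name__ == "__main__":
    # Assumed from the post: ~4 premium Veo shots per allotment, Omni Flash at 1/4 the cost.
    print(shots_on_cheaper_model(premium_shots=4, times_cheaper=4))
    # Assumed from the post: six shots in the scene, two generations at a time.
    print(batches_needed(shots=6, concurrent_cap=2))
```

On those assumptions, the credits that ran out after four premium shots would stretch to roughly sixteen on the cheaper model, and a six-shot scene takes at least three rounds of waiting, before any retries.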

Where this actually leaves us

I want to be careful here, because the easy version of this post is "we did it." We didn't.

What I have is thirty seconds. Six clips taped together. It's the best-looking thirty seconds this project has produced, and the consistency problem that ate the whole spring looks, for the first time, basically handled. But thirty seconds of an animatic is not a film, and the gap between the two is most of the actual work — performances that carry a scene, timing, edits that mean something, shots that talk to each other instead of just sitting next to each other.

The tools still take liberties I didn't ask for. They still can't be scripted — everything is a person clicking. They still flinch at the kids: Veo's safety filter rejected my first prompt outright for describing children too specifically, and I had to launder the language to get a frame out of it. None of that is film-ready.

But I've been doing this since February, and I can see the slope. In February, rigging a single character was a four-day fight. In March, two consecutive panels wouldn't agree on hair color. This week I got six shots of four characters to hold together across cuts, on tools that didn't exist in usable form when I started. I'm not going to pretend that's a movie. I'm also not going to pretend it isn't moving.

Next: pushing past the single scene, and figuring out which of these — or which combination — is worth committing to.