The hardest part of making this movie has not been generating individual shots. Modern video models do that reasonably well. The hardest part is consistency. I would render a scene, watch it back, and find that Mia had different hair in every cut, that the living room had rearranged itself between two shots, that the TV had drifted across the room, or that a character who was supposed to be in the scene was just not in frame. I was finding all of this by eye, then patching it by hand. Every patch was more credits, more time, and more attention burned.
So I sat down and rebuilt the whole thing as a self-correcting pipeline. The shape is simple. Lock the source of truth at the top. Pass every intermediate asset through a validator before the next stage is allowed to use it. Cap spend so a bad loop cannot run away. Here is what that turned into, the failure modes that bit me along the way, and what it cost.
The shape of the problem
A scene is a sequence of shots cut together. Each shot is generated from a storyboard panel plus a motion prompt. The panel anchors the composition and the characters; the video model animates motion on top. So when a shot looks wrong, the wrongness has come from one of three places:
- The character references themselves are wrong, drifted, or missing.
- The storyboard panel was drawn off-model in the first place.
- The video model misinterpreted the panel during generation.
All three used to happen silently, and I would only discover them by eye after the fact.
The architecture
Four pieces, each one validated against the piece above it.
1. Asset Bible
The single locked source of truth. Seven character turnarounds, four location plates, and a per-shot manifest in JSON that says, for every shot, which characters are present, what each one is wearing, the key props that must survive into the final image, the camera intent, and the storyboard panel URL. Lives at r2:rex-assets/asset-bible/.
The cast:
The location plates (family living room, magic minivan, Jurassic swamp, cave hideout):
These are what every downstream stage gets measured against.
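To make the per-shot manifest concrete, here is a minimal sketch of what one entry might look like. The field names and the `missing_fields` helper are assumptions for illustration, not the project's actual schema; the example values for shot 1A are taken from the description of that shot later in this post, and the panel URL is left blank because it is not given here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ShotManifestEntry:
    """One shot's manifest entry (hypothetical field names)."""
    shot_id: str
    location: str                 # which location plate the shot uses
    characters: dict[str, str]    # character name -> wardrobe for this shot
    key_props: list[str]          # props that must survive into the final image
    camera_intent: str
    panel_url: str


def missing_fields(entry: ShotManifestEntry) -> list[str]:
    """Return the names of empty fields so gaps surface before generation."""
    return [name for name, value in asdict(entry).items() if not value]


entry = ShotManifestEntry(
    shot_id="1A",
    location="living_room",
    characters={
        "Mia": "magenta star top and jeans",
        "Leo": "green dinosaur pajamas",
    },
    key_props=["TV", "dinosaur toys"],
    camera_intent="slow push-in",
    panel_url="",  # placeholder: the real URL is not given in this post
)

print(json.dumps(asdict(entry), indent=2))
print(missing_fields(entry))
```

Keeping wardrobe inside the per-shot entry, rather than on the character, is what lets the same character wear different clothes in different scenes.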
2. Storyboard panels
One per shot, generated with image-to-image so the locked turnarounds are literally the visual reference. Per-shot composition, lighting, characters in place, props in place. This is the composition layer of the scene.
3. Shot validator
A vision model that scores every asset against the bible. For a still panel or a video keyframe it produces structured scores for character identity (face, hair, and build versus the turnaround, decoupled from clothing), wardrobe (versus the manifest, since the same character wears different clothes in different scenes), location (versus the location plate), continuity with the adjacent shot, and visible artifacts.
A validation run costs three or four cents using Gemini 2.5 Flash. The whole nine-shot Scene 1 audit was about three cents total. Catching a bad shot with a vision check that costs a few cents before stitching, instead of after, is the entire economic argument for this system.
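The structured scores could be modelled roughly as below. This is a sketch under assumptions: the 0-to-1 scale, the `0.8` threshold, the field names, and the convention that `None` marks a check that does not apply to a shot are all illustrative choices, not the validator's real output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShotScores:
    """Scores in [0, 1] for one asset; None means the check does not apply."""
    identity: Optional[float]      # face, hair, build versus the turnaround
    wardrobe: Optional[float]      # clothing versus the manifest
    location: Optional[float]      # versus the location plate
    continuity: Optional[float]    # versus the adjacent shot
    artifacts: Optional[float] = None  # assumed: higher means cleaner


def passes(scores: ShotScores, threshold: float = 0.8) -> bool:
    """Pass only if every applicable score meets the (assumed) threshold."""
    applicable = [v for v in vars(scores).values() if v is not None]
    return bool(applicable) and all(v >= threshold for v in applicable)
```

Scoring identity and wardrobe as separate fields is what keeps a costume change from being reported as a character mismatch.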
4. Cost governor and orchestrator
The closed loop. The orchestrator runs gated stages per shot. The cost governor wraps every paid action in hard limits: a per-run budget cap, a per-shot retry cap, a no-progress detector that aborts a shot that keeps failing without improving, and a kill-switch file at scripts/pipeline/STOP that halts everything.
The budget-halt demo: a scene whose up-front estimate was $3.51, configured with a per-run cap of $0.50, stopped cleanly at $0.39 spent with the message "per-run cap $0.50 would be exceeded, halted." Zero of nine shots got past the cap. This matters because of the failure mode I am explicitly insuring against: an automated pipeline running away to a thousand dollars while I am asleep. That class of bug does not exist in this pipeline.
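A minimal sketch of how those four limits could sit behind a single check that runs before every paid action. The class and method names, the `patience` parameter, and the exact no-progress rule (the last few scores failing to beat the earlier best) are assumptions; only the halt message and the STOP file path come from the pipeline described above.

```python
from pathlib import Path


class BudgetHalt(Exception):
    """Raised when a paid action must not run."""


class CostGovernor:
    def __init__(self, run_cap: float, max_retries: int, stop_file: Path,
                 patience: int = 2):
        self.run_cap = run_cap
        self.max_retries = max_retries
        self.stop_file = stop_file
        self.patience = patience          # attempts allowed without improvement
        self.spent = 0.0
        self.attempts: dict[str, int] = {}
        self.history: dict[str, list[float]] = {}

    def authorize(self, shot_id: str, estimated_cost: float) -> None:
        """Call before every paid action; raises instead of letting it run."""
        if self.stop_file.exists():
            raise BudgetHalt(f"kill switch {self.stop_file} present, halted")
        if self.spent + estimated_cost > self.run_cap:
            raise BudgetHalt(
                f"per-run cap ${self.run_cap:.2f} would be exceeded, halted")
        if self.attempts.get(shot_id, 0) >= self.max_retries:
            raise BudgetHalt(f"shot {shot_id}: retry cap reached")
        scores = self.history.get(shot_id, [])
        if len(scores) > self.patience:
            recent = max(scores[-self.patience:])
            earlier = max(scores[:-self.patience])
            if recent <= earlier:
                raise BudgetHalt(f"shot {shot_id}: no progress, aborted")
        self.attempts[shot_id] = self.attempts.get(shot_id, 0) + 1

    def record(self, shot_id: str, actual_cost: float, score: float) -> None:
        """Call after a paid action with what it cost and how it scored."""
        self.spent += actual_cost
        self.history.setdefault(shot_id, []).append(score)
```

With `run_cap=0.50` and $0.39 already recorded, a call such as `authorize("1A", 2.90)` raises `BudgetHalt` before any money moves. Checking the estimate up front, rather than the bill afterwards, is what makes the cap hard.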
The validation principle
The mistake I made earlier in this project was that I built a validator and only pointed it at the end of the line, at video shots. Storyboard panels rolled through, trusted because they existed. So when Mia was drawn wrong in the panel, the video model faithfully animated the wrong Mia.
The corrected principle, now codified in CLAUDE.md and enforced in the orchestrator's code, is this: validate at every stage. References, then panels, then video, then stitched scene, and at every arrow between stages there is a blocking gate. Output cannot move forward until it has passed validation against the layer above. If it fails, it gets regenerated within the cost governor's retry cap. If it cannot pass after a few attempts, the orchestrator escalates rather than continuing.
The hard rule: do not generate from an artifact you have not validated. A storyboard panel is not trusted because it exists. It is trusted because the validator confirmed it matches the locked references.
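The blocking-gate rule can be sketched as a loop in which a stage's output only becomes the next stage's input after its validator returns true. Everything named here (`run_gated_stage`, `run_shot`, `Escalation`, the default of three attempts) is hypothetical and stands in for the orchestrator's real code.

```python
from typing import Callable, Optional

Asset = str  # stand-in for a path or URL to a generated asset


class Escalation(Exception):
    """Raised when a stage cannot produce a validated asset in its retries."""


def run_gated_stage(
    name: str,
    generate: Callable[[Optional[Asset]], Asset],
    validate: Callable[[Asset], bool],
    upstream: Optional[Asset],
    max_attempts: int = 3,
) -> Asset:
    """Regenerate until the validator passes, or escalate."""
    for _ in range(max_attempts):
        candidate = generate(upstream)
        if validate(candidate):
            return candidate
    raise Escalation(f"{name}: no valid output after {max_attempts} attempts")


def run_shot(stages, max_attempts: int = 3) -> Optional[Asset]:
    """Run (name, generate, validate) stages in order for one shot.

    Each stage only ever receives an asset that passed the previous gate.
    """
    asset: Optional[Asset] = None
    for name, generate, validate in stages:
        asset = run_gated_stage(name, generate, validate, asset, max_attempts)
    return asset
```

Because `run_shot` raises rather than returning a failed asset, there is no code path in this sketch where video generation sees an unvalidated panel.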
What broke
Four real problems worth telling you about.
Jenny was missing from the canonical roster
I had been operating from a project doc that listed six characters. The team had actually locked seven turnarounds back on February 7. Jenny, the teenage babysitter, was on R2 the whole time. The first build of the Asset Bible left her out because I trusted the wrong list. The fix was adding Jenny back to the canonical roster, and the standing rule is now to enumerate characters by running rclone lsd r2:rex-assets/characters/, never from memory.
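A sketch of enforcing that rule in code: list the character directories with `rclone lsd`, parse the names, and diff them against whatever roster a document claims. The function names are hypothetical, and the assumption that each character has a directory named after them under `r2:rex-assets/characters/` is mine. The parser relies on `rclone lsd` printing size, date, time, and count columns before the directory name.

```python
import subprocess


def parse_lsd(output: str) -> set[str]:
    """Extract directory names from `rclone lsd` output.

    Each line has four leading columns (size, date, time, count),
    then the name, which may itself contain spaces.
    """
    names = set()
    for line in output.splitlines():
        parts = line.split(None, 4)
        if len(parts) == 5:
            names.add(parts[4].strip())
    return names


def list_characters(remote: str = "r2:rex-assets/characters/") -> set[str]:
    """Enumerate character directories on the remote (needs rclone installed)."""
    result = subprocess.run(
        ["rclone", "lsd", remote],
        capture_output=True, text=True, check=True,
    )
    return parse_lsd(result.stdout)


def missing_from_doc(doc_roster: set[str], remote_roster: set[str]) -> set[str]:
    """Characters that exist on the remote but are absent from the doc's list."""
    return remote_roster - doc_roster
```

Run against a six-name document roster, `missing_from_doc` would have returned the seventh name immediately instead of leaving it to be noticed later.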
Ruben and Jetplane were generated instead of looked up
Same root cause, twice over. The canonical approved turnarounds for both of them lived at r2:rex-assets/characters/<name>/<name>_turnaround_APPROVED.png. The bible build never checked those paths. It generated a substitute Ruben (an AI re-interpretation that looked sort of right) and rendered a pale low-poly Jetplane straight from the rigging .blend file. Both ended up in the bible. The fix was a quick R2-to-R2 copy of the canonical files into place.
The lesson is the same as with Jenny: the source of truth lives in one specific place. Check that place, and do not generate a substitute when the real thing exists.
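The look-up-before-generate rule might be written as below. The path pattern is the one quoted above; `resolve_turnaround` and the injected `exists` and `generate` callables are hypothetical, passed in so the sketch stays independent of any particular storage client.

```python
from typing import Callable


def approved_turnaround_path(name: str) -> str:
    """Canonical location of a character's approved turnaround."""
    return f"r2:rex-assets/characters/{name}/{name}_turnaround_APPROVED.png"


def resolve_turnaround(
    name: str,
    exists: Callable[[str], bool],
    generate: Callable[[str], str],
) -> tuple[str, str]:
    """Return (source, path): use the approved file if it exists.

    Generation is only a fallback, and the result is labelled
    "generated" so a substitute can never pass as canonical.
    """
    path = approved_turnaround_path(name)
    if exists(path):
        return ("canonical", path)
    return ("generated", generate(name))
```

Returning the source label alongside the path means a bible build can refuse, or at least flag, any entry that did not come from the approved location.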
The task daemon was killing live agents
This was the most painful bug. Every agent I launched would die after one to five minutes. I thought it was flakiness for a while. I finally tailed the daemon log and found this signature repeating every five minutes:
Cleaning up inactive done task task=32
Terminated Claude process task=32 pid=3125131
Task finished id=323 success=false
Task #32 was a website tweak that had been marked done 109 days ago. The daemon was running a cleanup pass on it every five minutes, and the cleanup was killing whatever live process happened to currently have the recycled PID that #32's record was still pointing at. Every five minutes a live agent would be killed and the kill would be logged against #32.
The permanent fix is policy. Delete finished tasks; do not let them sit in the "done" state with stale PID records. Closed tasks stay as kill triggers; deleted tasks are gone for good.
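To illustrate why the policy works, here is a toy model of the failure and the fix. It is not the daemon's code: the `Task` shape and both function names are invented, and the live task with id 999 is made up. Only task 32 and the PID come from the log above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    status: str           # "running" or "done"
    pid: Optional[int]    # OS process id recorded when the task launched


def kill_targets(tasks: list[Task]) -> list[int]:
    """PIDs that a cleanup pass over done tasks would signal.

    The OS may have recycled these PIDs for unrelated live processes.
    """
    return [t.pid for t in tasks if t.status == "done" and t.pid is not None]


def prune_finished(tasks: list[Task]) -> list[Task]:
    """The policy fix: delete finished tasks so no stale PID survives."""
    return [t for t in tasks if t.status != "done"]


tasks = [
    Task(32, "done", 3125131),     # stale record from the old website tweak
    Task(999, "running", 3125131), # hypothetical live agent on the reused PID
]
print(kill_targets(tasks))                  # the stale record's PID is targeted
print(kill_targets(prune_finished(tasks)))  # nothing left to target
```

A PID alone does not identify a process over time, which is why removing the stale record is safer than trusting it.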
The storyboard panels were off-model
This was the most consequential one. I had locked the right turnarounds, built the orchestrator, gated the validator on every video shot, and rendered the first real Scene 1 shot 1A. Mia did not look like Mia. She was drawn as a pajama-clad toddler with her hair in a tight bun. The locked turnaround shows her as an older child with voluminous curly dark hair worn down and a magenta casual top and jeans. Two different characters.
The old (v3) Scene 1 panel for shot 1A:
The video had faithfully reproduced the wrong Mia. The error was baked in at the panel stage. I had been validating the video, not the panels. So I had been generating video off panels nobody had checked.
This was the deepest lesson of the build. The bug was not in the video model. It was that the entire video stage was running on a base nobody had verified.
The pipeline now has a panel validation gate. Every panel must pass identity-against-turnaround, wardrobe-against-manifest, location, and characters-present before any video stage is allowed to touch it. I regenerated all nine Scene 1 panels with image-to-image, using the locked turnarounds as references, and ran the validator on each. Eight passed clean. One (1I, the parents at the front door) failed for "location drift," which turned out to be a manifest bug, not a panel error: the shot is at the front entryway, not the main living room, and the manifest had it labeled living_room. The agent caught it, fixed the manifest, the panel revalidated, and now all nine pass.
The same shot 1A panel, regenerated on-model:
The validator's audit on the v4 panel set:
| Shot | Pass | Identity | Wardrobe | Location | Continuity |
|---|---|---|---|---|---|
| 1A | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1B | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1C | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1D | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1E | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1F | PASS | 1.00 | n/a | 1.00 | 1.00 |
| 1G | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1H | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
| 1I | PASS | 1.00 | 1.00 | 1.00 | 1.00 |
Total cost of that audit: about three cents.
A real shot, finally
The first render off the broken v3 panel, with my early "static, subtle, calm" prompting:
Mia is wrong, nothing happens, the TV has imagery on its back face. It reproduces the panel faithfully, because the panel itself was just bad.
The render after the panel was regenerated on-model and the prompt was rewritten to actually describe motion:
Mia has curly hair worn down, the magenta star top, and jeans. Leo has the green dinosaur pajamas. Jenny is on her phone in the armchair. Nina and Gabe are in date-night formalwear in the background. The TV reads as a normal TV. The camera does a slow cinematic push-in over twelve seconds; the kids shift on the couch, lightning flashes outside the window, the dinosaur toys hold their positions on the rug.
It is one shot, not a full scene, and the character action is subtle rather than dramatic. Image-to-video from a single start frame inherently limits motion to small in-frame action and camera moves. To get a kid lunging for a toy or the parents dramatically crossing the room, I will need to use the start-plus-end-frame mode (both mitte and Flow call it "Frames mode") and generate matching end panels.
But this is the first piece of footage on the project that was generated off a foundation that had actually been verified. And it shows.
Costs
Direct API and credit spend across the build:
| Item | Cost |
|---|---|
| Cost governor (code only) | $0.00 |
| Asset Bible build, including Gemini gens for Ruben and Jetplane, plus four location selections | ~$0.50 |
| Validator build and first proof run scoring Scene 1 video | ~$0.04 |
| Validator fix run (wardrobe and continuity rework) | ~$0.04 |
| Orchestrator build, plus budget-halt demo and closed-loop demo against existing clips | $7.02 of a $15 cap (recorded against existing clips, so representative rather than fully real) |
| Panel regeneration v4 (nine panels, Gemini image-to-image) | ~$0.40 |
| Panel 1I fix (turned out to be a manifest edit) | ~$0.00 |
| Validator audit on the v4 panel set | ~$0.03 |
| Mitte Seedance video renders (1A v1 off the broken panel, v2 with TV-correction prompting, v3 on the validated v4 panel) | 8,709 mitte credits, about $8.71 |
| Total direct, validatable spend | ~$16.75 |
That figure does not include the Anthropic API cost of running the agents themselves (the agents are what write the code, generate the panels, drive the validator, and so on). That cost is real but routed through a separate billing line. The table above does fully capture every dollar that hit a generation or vision API and every mitte credit spent.
The cost-per-shot story is worth pulling out. A video render is about $2.90 at twelve seconds, 720p, 16:9, Seedance 2 on Fast. A validator pass on a panel is about $0.04. A panel regeneration in Gemini is about $0.04. Catching a bad panel with two $0.04 validator calls and re-rolling it for $0.04, instead of finding the problem after $2.90 of video has been generated, is the move. That economic argument is what makes a self-validating pipeline actually pay off instead of just being a code aesthetic.
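The arithmetic behind that argument, worked through with the per-unit figures quoted above. The two scenarios are a simplification I am assuming for illustration: one bad panel, fixed with a single regeneration, and in the "after render" case the problem is found by eye with no validator spend.

```python
RENDER_COST = 2.90    # one twelve-second video render
VALIDATE_COST = 0.04  # one validator pass on a panel
REGEN_COST = 0.04     # one panel regeneration


def cost_caught_at_panel() -> float:
    """Validate, regenerate, revalidate, then render once from the good panel."""
    return 2 * VALIDATE_COST + REGEN_COST + RENDER_COST


def cost_caught_after_render() -> float:
    """Render from the bad panel, regenerate the panel, then render again."""
    return RENDER_COST + REGEN_COST + RENDER_COST


saving = cost_caught_after_render() - cost_caught_at_panel()
print(f"caught at panel:     ${cost_caught_at_panel():.2f}")
print(f"caught after render: ${cost_caught_after_render():.2f}")
print(f"saved per bad panel: ${saving:.2f}")
```

Under these assumptions the gate turns a wasted render into $0.08 of validation, so the saving per bad panel is one render minus the two validator passes.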
What's next
The foundation is sound for the first time. Every panel under the video stage has passed validation. The orchestrator structurally cannot run video generation on an unvalidated panel. The bible is correct (all seven characters, four location plates), the manifest is correct (front entryway labeled correctly), and the validator decouples identity from wardrobe, so a tuxedo does not get flagged as wrong because the turnaround happens to show a t-shirt.
The next thing to add is start-plus-end-frame mode so character action becomes a thing the model can actually render. That requires generating matching end panels for each shot, which is more Gemini image-to-image work, and switching the orchestrator's generator wrapper to pass both the start and end panel to mitte.
I will write that step up when it lands.