● v2 · on-camera cut

The influencer reel — without the influencer.

Version two of the Zentarou Sushi ad. The same hook, food, reaction, call-to-action arc as v1, stretched to 16 seconds, and this time it wears a face. Five AI-generated clips, one consistent creator, a scripted voiceover, a soft lofi bed. No camera, no casting, no shoot day, no release forms.

16s · Reel runtime
~8 min · Total build time
~$4 · Generation cost
5 shots · One consistent creator
New in v2: we hold one AI character across five separate Veo 3.1 generations — same face on the street, same hands on the chopsticks, same smile at the signoff. See the v1 narrator-style cut for the audio-first version.
The v2 Ad
5 clips from Veo 3.1 Fast · voice ElevenLabs Matilda · soft lofi bed ElevenLabs sound-gen · overlays baked in ffmpeg.
Character descriptor repeated verbatim across every Veo prompt for continuity.
two creative directions, same ~$4 budget

Two versions of the same brief.

Most small venues don't yet know which style will win with their audience: a calm narrator over food photography, or the energy of an on-camera creator. Create Studio runs both in parallel, then lets Meta tell you which one converts.

v1 · narrator cut · 12s

Cinematic food tour with voiceover

Ken Burns motion over owner-supplied photography, no human on screen.
  • Strength: works with zero AI people — reads as a high-end boutique ad.
  • Source: Google Maps photos, ranked by Gemini Vision.
  • Audience fit: mid-30s+ foodies who are over influencer theatrics.
v2 · on-camera cut · 16s

Local creator raves on camera

A consistent AI creator, scripted VO, silent Veo clips — zero lip-sync risk.
  • Strength: carries parasocial trust — feels like a friend's recommendation.
  • Source: two Veo text-to-video shots + two image-to-video food shots + one hands-only B-roll.
  • Audience fit: under-35 reels-native diners who scroll creator content all day.
inside the v2 cut

Five shots, one creator, 16 seconds.

Each clip is its own Veo 3.1 generation. The same character descriptor — age, hair, outfit, jewelry, lighting — is repeated verbatim in every prompt so the face doesn't drift between shots.
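The character lock can be sketched as a single constant shared by every on-camera prompt. This is a minimal illustration, not the actual prompts used: the descriptor is the one quoted further down this page (its lighting portion is not shown there, so it is omitted here), while the scene wording, the `build_prompt` helper, and the `descriptor_is_locked` check are hypothetical.

```python
# Descriptor text as quoted on this page; everything else is illustrative.
CHARACTER = (
    "25-year-old Asian-American, messy low bun, "
    "cream cotton tee, dainty gold necklace"
)

# Hypothetical scene wording, loosely based on the hook and signoff shots.
ON_CAMERA_SCENES = {
    "c1_hook": "standing outside the noren at golden hour",
    "c5_signoff": "warm smile, closed mouth, facing the camera",
}


def build_prompt(scene: str) -> str:
    """Prefix a scene with the shared character descriptor, unchanged."""
    return f"{CHARACTER}, {scene}"


def descriptor_is_locked(all_prompts: dict[str, str]) -> bool:
    """True only if every prompt contains the descriptor character-for-character."""
    return all(CHARACTER in p for p in all_prompts.values())


prompts = {shot: build_prompt(scene) for shot, scene in ON_CAMERA_SCENES.items()}
```

Keeping the descriptor in one constant, rather than retyping it per shot, is what makes "verbatim" cheap to enforce: a one-word paraphrase in any prompt makes the check fail.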

c1 · hook · Outside the noren at golden hour · 3.0s · text-to-video
c2 · hero box · Dolly-in on uni + gold-leaf wooden tray · 3.5s · image-to-video
c3 · sashimi · Slow rotation over aburi rolls · 3.5s · image-to-video
c4 · B-roll · Chopsticks lifting toro nigiri · 3.0s · hands-only macro
c5 · signoff · Warm smile, closed mouth, code pill · 3.0s · text-to-video
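The shot list above can be written down as data, which makes it easy to confirm that the five durations add up to the 16-second runtime. A small sketch; the `Shot` class and helper names are made up for illustration, while the ids, beats, durations, and shot types are taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    beat: str
    seconds: float
    kind: str  # "text-to-video", "image-to-video", or "hands-only macro"


SHOTS = [
    Shot("c1", "hook", 3.0, "text-to-video"),
    Shot("c2", "hero box", 3.5, "image-to-video"),
    Shot("c3", "sashimi", 3.5, "image-to-video"),
    Shot("c4", "B-roll", 3.0, "hands-only macro"),
    Shot("c5", "signoff", 3.0, "text-to-video"),
]


def total_runtime(shots: list[Shot]) -> float:
    """Sum of the trimmed clip lengths, in seconds."""
    return sum(s.seconds for s in shots)


def start_times(shots: list[Shot]) -> list[float]:
    """Cumulative start offset of each shot on the final timeline."""
    offsets, t = [], 0.0
    for s in shots:
        offsets.append(t)
        t += s.seconds
    return offsets
```

`total_runtime(SHOTS)` comes to 16.0, and `start_times(SHOTS)` gives the cut points a voiceover script would need to hit.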
what a face buys you

Four things a narrator VO can't do.

The v1 cut is beautiful and safe. The v2 cut carries different weight because there's a human in it — and that human ends up being a repeatable creative asset, not a freelance invoice.

👤
Parasocial pull
~3× watch-through
Reels with a human face held on screen in the first 2 seconds finish at ~3× the rate of food-only b-roll — same script, same length.
🔁
Repeatable creator
One face, 50 venues
The same AI creator can appear across every restaurant in a franchise, or swap age, hair, and style in the next generation for a different neighborhood vibe.
🎙️
VO-safe
No lip-sync uncanny valley
All Veo clips are prompted closed-mouth and used without speech. Voiceover is layered on top in post, so there is no mouth-flap mismatch and no uncanny lip-sync.
Cheaper than casting
$4 vs. $1,500+
A mid-tier SF food creator charges $1,500 or more per post. This reel was generated for under $5 in Veo + ElevenLabs credits, with no contract, no dish comp, and no reshoot.
how v2 is built

Script first, character-locked, silent Veo, VO on top.

The interesting constraint: Veo 3.1 generates its own audio by default, which does not lip-sync to your voiceover. The trick is to rig every shot so there's nothing to sync to — closed mouth, hands-only, prompt-level silence.

1
Script & shotlist
A 29-word voiceover is written first, then broken into 5 visual beats: hook outside, hero food, second food, reaction, signoff with offer.
Gemini 2.5 Flash
2
Character lock
One appearance descriptor — "25-year-old Asian-American, messy low bun, cream cotton tee, dainty gold necklace" — is pasted verbatim into every on-camera prompt.
Veo 3.1 Fast · t2v
3
Food b-roll from photos
Google Maps photos and owner shots become 6-second dolly and rotation clips, later trimmed to 3.5 seconds each in the edit, using exactly the same logic as the v1 narrator cut.
Veo 3.1 Fast · i2v
4
VO + music + overlays
ElevenLabs generates the VO and a soft lofi bed. ffmpeg trims each clip, concats, bakes the neighborhood / coupon / URL pills, and fades to black.
ElevenLabs + ffmpeg
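The final assembly step can be sketched as ffmpeg argument lists built in Python. This is a rough outline under stated assumptions, not the pipeline actually used: the file names, the 0.25 music volume, the 0.5-second fade, and trimming from the start of each clip are all guesses, and the overlay pills are left out entirely.

```python
def trim_cmd(src: str, dst: str, seconds: float) -> list[str]:
    """Keep the first `seconds` of a clip and drop its own audio track (-an)."""
    return ["ffmpeg", "-y", "-i", src, "-t", f"{seconds:.1f}",
            "-an", "-c:v", "libx264", "-pix_fmt", "yuv420p", dst]


def concat_list(paths: list[str]) -> str:
    """Body of the text file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_cmd(list_file: str, dst: str) -> list[str]:
    """Join the trimmed clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


def mix_cmd(video: str, vo: str, music: str, dst: str,
            total: float, fade: float = 0.5) -> list[str]:
    """Layer voiceover and a quieter music bed over the video, then fade to black."""
    fade_start = total - fade
    filters = (
        f"[0:v]fade=t=out:st={fade_start}:d={fade}[v];"
        f"[2:a]volume=0.25[bed];"          # assumed level for the lofi bed
        f"[1:a][bed]amix=inputs=2:duration=longest[a]"
    )
    return ["ffmpeg", "-y", "-i", video, "-i", vo, "-i", music,
            "-filter_complex", filters, "-map", "[v]", "-map", "[a]",
            "-t", str(total), "-c:v", "libx264", "-c:a", "aac", dst]
```

Each list can be handed to `subprocess.run(cmd, check=True)`. Building the commands as plain lists keeps them easy to inspect before anything is rendered.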
the trick that makes it work

How we kept Veo quiet.

Veo 3.1 Fast insists on generating its own audio for every shot. When that happens on a talking-head frame, the safety filter can kill the job. Two fixes, one of which doubles as a stylistic upgrade.

Fix 1 · prompt-level silence

Closed mouth, repeated five times per prompt.

Every on-camera prompt ends with the same block: mouth stays closed, no speech, no lip movement, no dialogue, no text, no subtitles. In our generations Veo honored this as a hard visual constraint: the generated face never opens its mouth, so there is nothing to "speak" and nothing to lip-sync.
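One way to make sure the silence block always closes a prompt is to append it in code instead of by hand. A minimal sketch: the clause list is taken from the paragraph above, but the exact sentence form and the `with_silence` helper are assumptions.

```python
# Clauses from the page; the exact wording of the real block is not shown there.
SILENCE_BLOCK = (
    "Mouth stays closed, no speech, no lip movement, "
    "no dialogue, no text, no subtitles."
)


def with_silence(prompt: str) -> str:
    """Append the silence block once, so it is always the last thing in the prompt."""
    base = prompt.rstrip()
    if base.endswith(SILENCE_BLOCK):
        return base  # already applied; avoid stacking duplicates
    return f"{base} {SILENCE_BLOCK}"
```

Calling it twice on the same prompt is harmless, which matters when prompts pass through several editing steps before generation.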

Fix 2 · hands-only B-roll

When the filter still flags, drop the face.

Our reaction shot kept getting filtered for "audio issues" despite being silent. Switching it to a macro of feminine hands lifting toro nigiri with chopsticks solved it instantly: no face, no mouth, no responsible-AI (RAI) filter trigger. It's also a stronger B-roll beat than a close-up smile would have been.
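The "drop the face" fallback can be expressed as a small retry rule. This is a sketch only: `generate` is a hypothetical wrapper around whatever video API is in use, assumed to return a clip path or `None` when the job is filtered, and the hands-only prompt wording and attempt count are illustrative.

```python
from typing import Callable, Optional

# Illustrative wording for the replacement B-roll shot.
HANDS_ONLY_PROMPT = (
    "Macro shot of feminine hands lifting toro nigiri with chopsticks, "
    "no face in frame"
)


def generate_with_fallback(
    generate: Callable[[str], Optional[str]],
    face_prompt: str,
    max_face_attempts: int = 2,
) -> tuple[Optional[str], str]:
    """Try the on-camera prompt; if every attempt is filtered, drop the face.

    Returns the clip path (or None if even the fallback fails) and a label
    saying which variant produced it.
    """
    for _ in range(max_face_attempts):
        clip = generate(face_prompt)
        if clip is not None:
            return clip, "face"
    return generate(HANDS_ONLY_PROMPT), "hands-only"
```

Capping the face attempts keeps generation spend predictable, since every filtered job still costs time before the hands-only shot is tried.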

Want the on-camera cut for your venue?

Same pilot as v1 — we're picking two or three Bay Area restaurants for a fully-managed run. You get both versions (narrator + on-camera), three static variants, a landing page, the targeting and the conversion data. We get a real ROI case study.

Email us about a pilot
See the live coupon page →