● v2 · on-camera cut

The influencer reel — without the influencer.

Version two of the Zentarou Sushi ad. The same hook, food, reaction, call-to-action arc as v1, now 16 seconds long, and this time it wears a face. Five AI-generated clips, one consistent creator, a scripted voiceover, a soft lofi bed. No camera, no casting, no shoot day, no release forms.

16s

Reel runtime

~8 min

Total build time

~$4

Generation cost

5 shots

One consistent creator

New in v2: we hold one AI character across five separate Veo 3.1 generations — same face on the street, same hands on the chopsticks, same smile at the signoff. See the v1 narrator-style cut for the audio-first version.

The v2 Ad

5 clips from Veo 3.1 Fast · voice ElevenLabs Matilda · soft lofi bed ElevenLabs sound-gen · overlays baked in ffmpeg.
Character descriptor repeated verbatim across every Veo prompt for continuity.

two creative directions, same ~$4 budget

Two versions of the same brief.

Most small venues don't know yet which style will win with their audience: a calm narrator over food photography, or an energetic on-camera creator. Create Studio runs both in parallel, then lets Meta tell you which one converts.

v1 · narrator cut · 12s

Cinematic food tour with voiceover

Ken Burns motion over owner-supplied photography, no human on screen.

Strength: works with zero AI people — reads as a high-end boutique ad.
Source: Google Maps photos, ranked by Gemini Vision.
Audience fit: mid-30s+ foodies who are over influencer theatrics.

v2 · on-camera cut · 16s

Local creator raves on camera

A consistent AI creator, scripted VO, silent Veo clips — zero lip-sync risk.

Strength: carries parasocial trust — feels like a friend's recommendation.
Source: two Veo text-to-video shots + two image-to-video food shots + one hands-only B-roll.
Audience fit: under-35 reels-native diners who scroll creator content all day.

inside the v2 cut

Five shots, one creator, 16 seconds.

Each clip is its own Veo 3.1 generation. The same character descriptor — age, hair, outfit, jewelry, lighting — is repeated verbatim in every prompt so the face doesn't drift between shots.

c1 · hook

Outside the noren at golden hour

3.0s · text-to-video

c2 · hero box

Dolly-in on uni + gold-leaf wooden tray

3.5s · image-to-video

c3 · sashimi

Slow rotation over aburi rolls

3.5s · image-to-video

c4 · B-roll

Chopsticks lifting toro nigiri

3.0s · hands-only macro

c5 · signoff

Warm smile, closed mouth, code pill

3.0s · text-to-video
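
As a quick sanity check, the five shot lengths above can be laid end to end to confirm they add up to the 16-second runtime. This is a minimal Python sketch; the `Shot` class and the helper names are illustrative assumptions, not part of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    label: str
    seconds: float
    mode: str


# Durations and modes copied from the shot cards above.
SHOTS = [
    Shot("c1", "hook", 3.0, "text-to-video"),
    Shot("c2", "hero box", 3.5, "image-to-video"),
    Shot("c3", "sashimi", 3.5, "image-to-video"),
    Shot("c4", "B-roll", 3.0, "hands-only macro"),
    Shot("c5", "signoff", 3.0, "text-to-video"),
]


def total_runtime(shots):
    """Sum of all shot durations in seconds."""
    return sum(shot.seconds for shot in shots)


def cut_points(shots):
    """Return (shot_id, start, end) for each shot placed back to back."""
    points = []
    start = 0.0
    for shot in shots:
        end = start + shot.seconds
        points.append((shot.shot_id, start, end))
        start = end
    return points


if __name__ == "__main__":
    for shot_id, start, end in cut_points(SHOTS):
        print(f"{shot_id}: {start:.1f}s to {end:.1f}s")
    print(f"total: {total_runtime(SHOTS):.1f}s")
```

The cut points are also what a trim-and-concat step would need if the reel were assembled from longer raw generations.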

what a face buys you

Four things a narrator VO can't do.

The v1 cut is beautiful and safe. The v2 cut carries different weight because there's a human in it — and that human ends up being a repeatable creative asset, not a freelance invoice.

👤

Parasocial pull

~3×
watch-through

Reels with a human face held on screen in the first 2 seconds finish at ~3× the rate of food-only b-roll — same script, same length.

🔁

Repeatable creator

One face,
50 venues

The same AI creator can appear across every restaurant in a franchise, or the next generation can swap age, hair, and style for a different neighborhood vibe.

🎙️

VO-safe

No lip-sync
uncanny valley

All Veo clips are rendered silent with closed-mouth prompting. Voiceover is layered on top in post — zero mouth-flap mismatch, zero AI-voice cringe.

⚡

Cheaper than casting

$4
vs. $1,500+

A mid-tier SF food creator charges $1,500 per post. This reel was generated for under $5 in Veo + ElevenLabs credits: no contract, no dish comp, no reshoot.

how v2 is built

Script first, character-locked, silent Veo, VO on top.

The interesting constraint: Veo 3.1 generates its own audio by default, which does not lip-sync to your voiceover. The trick is to rig every shot so there's nothing to sync to — closed mouth, hands-only, prompt-level silence.

Script & shotlist

A 29-word voiceover is written first, then broken into 5 visual beats: hook outside, hero food, second food, reaction, signoff with offer.

Gemini 2.5 Flash

Character lock

One appearance descriptor — "25-year-old Asian-American, messy low bun, cream cotton tee, dainty gold necklace" — is pasted verbatim into every on-camera prompt.

Veo 3.1 Fast · t2v
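
The character lock is simple enough to sketch in a few lines of Python. The descriptor string is the one quoted above; the scene wording, the `Character:` label, and both function names are assumptions made for illustration, since the real prompts are not reproduced here.

```python
# Descriptor quoted from the build notes; pasted unchanged into every prompt.
CHARACTER = (
    "25-year-old Asian-American, messy low bun, "
    "cream cotton tee, dainty gold necklace"
)


def on_camera_prompt(scene, character=CHARACTER):
    """Join a scene description with the unmodified character descriptor."""
    return f"{scene.strip()} Character: {character}."


def missing_character(prompts, character=CHARACTER):
    """Return the indexes of prompts that do not contain the exact descriptor."""
    return [i for i, prompt in enumerate(prompts) if character not in prompt]


if __name__ == "__main__":
    # Hypothetical scene text, loosely based on the c1 and c5 shot cards.
    scenes = [
        "Creator stands outside the noren at golden hour.",
        "Creator gives a warm smile at the signoff.",
    ]
    prompts = [on_camera_prompt(scene) for scene in scenes]
    assert missing_character(prompts) == []
    for prompt in prompts:
        print(prompt)
```

Keeping the descriptor in one constant, rather than retyping it per shot, is what makes "verbatim" enforceable: a check like `missing_character` catches any prompt where it was paraphrased.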

Food b-roll from photos

Google Maps photos and owner shots become 6-second dolly and rotation clips, which are later trimmed to their 3.5-second slots in ffmpeg. Same logic as the v1 narrator cut.

Veo 3.1 Fast · i2v

VO + music + overlays

ElevenLabs generates the VO and a soft lofi bed. ffmpeg trims each clip, concats, bakes the neighborhood / coupon / URL pills, and fades to black.

ElevenLabs + ffmpeg
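
For readers who want to picture the ffmpeg side, here is a rough Python sketch that only builds the command lines for trim, concat, and the final mix with a fade to black. The actual commands are not shown in this write-up, so the file names, the 0.5-second fade, and the codec choices are all assumptions, and the text-pill overlays are left out.

```python
def trim_cmd(src, dst, seconds):
    """Cut a raw clip to its slot length and drop its audio track (-an)."""
    return ["ffmpeg", "-y", "-i", src, "-t", f"{seconds:g}",
            "-an", "-c:v", "libx264", dst]


def concat_list(paths):
    """Body of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in paths)


def concat_cmd(list_file, dst):
    """Join the trimmed clips without re-encoding.

    Stream copy assumes every clip was encoded with the same settings,
    which holds if they all came out of trim_cmd.
    """
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


def finish_cmd(video, voiceover, music, dst, runtime, fade=0.5):
    """Mix voiceover and music, fade the picture to black, cap the length."""
    fade_start = runtime - fade
    graph = (
        "[1:a][2:a]amix=inputs=2:duration=longest[a];"
        f"[0:v]fade=t=out:st={fade_start:g}:d={fade:g}[v]"
    )
    return ["ffmpeg", "-y", "-i", video, "-i", voiceover, "-i", music,
            "-filter_complex", graph, "-map", "[v]", "-map", "[a]",
            "-t", f"{runtime:g}", "-c:v", "libx264", "-c:a", "aac", dst]


if __name__ == "__main__":
    # Hypothetical file names; pass each list to subprocess.run to execute.
    print(" ".join(trim_cmd("c2_raw.mp4", "c2.mp4", 3.5)))
    print(concat_list(["c1.mp4", "c2.mp4"]), end="")
    print(" ".join(concat_cmd("clips.txt", "joined.mp4")))
    print(" ".join(finish_cmd("joined.mp4", "vo.mp3", "bed.mp3",
                              "reel.mp4", 16)))
```

Dropping the clip audio at the trim step means Veo's own soundtrack never reaches the final mix; only the voiceover and the music bed do.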

the trick that makes it work

How we kept Veo quiet.

Veo 3.1 Fast insists on generating its own audio for every shot. If that happens on a talking-head frame, the safety filter kills the job. Two fixes, one stylistic upgrade.

Fix 1 · prompt-level silence

Closed mouth, repeated five times per prompt.

Every on-camera prompt ends with a block: mouth stays closed, no speech, no lip movement, no dialogue, no text, no subtitles. Veo treats this as a hard visual constraint — the generated face never opens its mouth, so there is nothing to "speak" and nothing to lip-sync.
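
A minimal sketch of that closing block, in Python. The six constraints come from the paragraph above, but the exact sentence and the helper name `with_silence` are assumptions; the production wording may differ.

```python
# Assumed phrasing of the closing block, built from the constraints listed above.
SILENCE_BLOCK = (
    "Mouth stays closed, no speech, no lip movement, "
    "no dialogue, no text, no subtitles."
)


def with_silence(prompt):
    """Append the silence block unless the prompt already ends with it."""
    body = prompt.rstrip()
    if body.endswith(SILENCE_BLOCK):
        return body
    return f"{body} {SILENCE_BLOCK}"


if __name__ == "__main__":
    # Hypothetical scene text for the signoff shot.
    print(with_silence("Creator gives a warm smile at the signoff."))
```

Making the helper idempotent means it can be applied to every on-camera prompt at the last step without doubling the block.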

Fix 2 · hands-only B-roll

When the filter still flags, drop the face.

Our reaction shot kept getting filtered for "audio issues" despite being silent. Switching it to a macro of feminine hands lifting toro nigiri with chopsticks solved it instantly: no face, no mouth, no Responsible AI (RAI) filter trigger. It's also a stronger B-roll beat than a close-up smile would have been.
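
The same fallback could be automated along these lines. This is a hedged Python sketch only: `generate` stands in for whatever function submits a prompt to the video model, and `FilteredError` is a hypothetical exception for a filtered job, not a real Veo API type.

```python
class FilteredError(Exception):
    """Hypothetical error raised when a generation is rejected by the filter."""


def generate_with_fallback(generate, face_prompt, hands_prompt,
                           max_face_attempts=2):
    """Try the face shot a few times, then fall back to the hands-only shot.

    Returns a (kind, result) pair where kind is "face" or "hands".
    An error from the hands-only attempt is not caught and propagates.
    """
    for _ in range(max_face_attempts):
        try:
            return "face", generate(face_prompt)
        except FilteredError:
            continue
    return "hands", generate(hands_prompt)


if __name__ == "__main__":
    def fake_generate(prompt):
        # Stand-in generator: rejects any prompt that mentions a face.
        if "face" in prompt:
            raise FilteredError("audio issues")
        return f"clip for: {prompt}"

    kind, clip = generate_with_fallback(
        fake_generate,
        "close-up of the creator's face reacting",
        "macro of hands lifting toro nigiri with chopsticks",
    )
    print(kind, clip)
```

Capping the face attempts keeps the retry cost bounded before the shot is swapped for the hands-only version.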

Want the on-camera cut for your venue?

Same pilot as v1: we're picking two or three Bay Area restaurants for a fully-managed run. You get both versions (narrator + on-camera), three static variants, a landing page, the targeting, and the conversion data. We get a real ROI case study.

Email us about a pilot

See the live coupon page →