Technical | 11 min read

How We Built a Face-Insertion AI Video Pipeline for Kids

How we built a kids video pipeline with IP-Adapter face insertion, FFmpeg assembly, and queues that survive traffic spikes.

By Subham Mahapatra

Co-founder & Engineering, Brixloop

Personalised kids video sounds simple on a pitch deck. Parents upload a photo, type a prompt, and get a story with their child in it. In practice it also has to be safe for kids, fast enough for a normal Tuesday, and still standing when a TikTok mention sends traffic up 40x. We built that for Tiny Tales Videos. This post is the architecture we actually run in production, not the diagram we drew in week one.

What we were actually building

The product promise was narrow on purpose: a parent describes a story, the child's face stays consistent across scenes, and the output feels age-appropriate. That narrow promise hides a wide engineering surface. You need scripting, image generation with identity preservation, narration, video assembly, billing boundaries, and a queue that does not melt when render time spikes.

The failure mode we cared about most was not bad art. It was bad operations: parents waiting without status, chargebacks from unclear previews, and margin disappearing when every viral spike hit the same FIFO queue.

The core pipeline

At a high level the system is linear, with human decision points where trust matters:

  1. Parent prompt and child photo intake
  2. Story and scene scripting (structured, not open-ended)
  3. Image synthesis with identity-preserving face insertion per scene
  4. Frame assembly, transitions, and encoding
  5. Narration matched to scene pacing
  6. Preview delivery, parent confirmation, then paid final render

Each stage is independently queueable. That matters because image steps and video encode steps have different cost profiles and different failure rates. You do not want a slow FFmpeg job blocking a cheap preview path.
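As a minimal sketch of what "independently queueable" can look like, the snippet below gives each stage its own queue and its own worker count, so a slow stage only backs up its own inbox. It uses Python's standard `queue` and `threading` modules as a stand-in for a real job broker; the function names, stage names, and worker counts are assumptions for illustration, not the production setup.

```python
import queue
import threading

def run_stage_worker(inbox, outbox, handler):
    """Pull jobs from one stage's queue, process them, and pass results downstream."""
    while True:
        job = inbox.get()
        try:
            outbox.put(handler(job))
        finally:
            inbox.task_done()

def build_pipeline(handlers, workers_per_stage):
    """Create one queue per stage and start that stage's workers.

    handlers: dict of stage name -> function(job) -> job, in pipeline order.
    workers_per_stage: dict of stage name -> worker count (defaults to 1).
    Returns (queues, results), where results receives finished jobs.
    """
    names = list(handlers)
    queues = {name: queue.Queue() for name in names}
    results = queue.Queue()
    for i, name in enumerate(names):
        # The last stage writes to the results queue instead of another stage.
        outbox = queues[names[i + 1]] if i + 1 < len(names) else results
        for _ in range(workers_per_stage.get(name, 1)):
            threading.Thread(
                target=run_stage_worker,
                args=(queues[name], outbox, handlers[name]),
                daemon=True,  # workers exit with the process in this sketch
            ).start()
    return queues, results
```

Giving the image stage more workers than the script stage is then a one-line change to `workers_per_stage`, and a stalled encode only fills the assembly queue rather than the whole system.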

Stack we shipped with

  • GPT-4 for story structure and scene breakdown from parent prompts
  • Stable Diffusion with IP-Adapter for face insertion across scenes
  • FFmpeg for assembly, transitions, and output encoding
  • ElevenLabs for narration tuned to scene length
  • Next.js dashboard for preview, confirm, and paid render flows
  • AWS-backed async workers with provider fallbacks

Face insertion without the uncanny valley

Generic face-swap approaches broke quickly on children's content. Proportions drifted, identity shifted between scenes, and safety got harder to reason about when the model had too much creative freedom.

IP-Adapter gave us a reference-image conditioning path: the child's photo is a constraint in generation, not a post-processing sticker on top of a generic character. We paired that with a constrained scene library so outputs stay on-brand and age-appropriate instead of fully open-ended diffusion.

Practically, that meant locking visual motifs early, limiting pose variance per scene type, and testing identity consistency on a fixed evaluation set of child photos before we touched parent-facing UI polish.
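One way to make "testing identity consistency on a fixed evaluation set" concrete is to compare a face embedding of the reference photo against embeddings of each generated scene. The sketch below assumes the embeddings already exist as plain lists of floats (produced by whatever face-embedding model you use); the function names and the 0.6 threshold are hypothetical and would need tuning on real data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def identity_consistency(reference, scene_embeddings, threshold=0.6):
    """Score every scene against the reference photo's embedding.

    Returns (lowest similarity, indices of scenes below the threshold).
    The threshold is an assumed placeholder, not a recommended value.
    """
    if not scene_embeddings:
        raise ValueError("need at least one scene embedding")
    sims = [cosine_similarity(reference, e) for e in scene_embeddings]
    failing = [i for i, s in enumerate(sims) if s < threshold]
    return min(sims), failing
```

Run over the same set of photos after every model or prompt change, the lowest score per story acts as a simple regression signal for identity drift.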

Parent trust is a product feature

We separated preview from final render early. Parents see a lower-cost preview path first, confirm what they are buying, then trigger the full pipeline. That single product decision did more for chargeback rate than any model upgrade.

  • Clear progress states (queued, rendering scenes, assembling, ready)
  • Explicit ETA ranges instead of silent spinners
  • A visible diff between preview quality and final render quality
  • One-click support context (job ID, stage, last error) for ops

Support tickets dropped when uncertainty dropped. Most parents are fine waiting if they know the system is working.
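The progress states and ETA ranges above can be modelled as a small status object. This is a sketch under assumptions: the four state names come from the list above, but the per-stage durations, the class name, and the field names are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Stage(Enum):
    QUEUED = "queued"
    RENDERING_SCENES = "rendering scenes"
    ASSEMBLING = "assembling"
    READY = "ready"

# Assumed (low, high) duration in seconds for each stage; real values
# would come from measured per-stage latency.
STAGE_SECONDS = {
    Stage.QUEUED: (10, 60),
    Stage.RENDERING_SCENES: (60, 180),
    Stage.ASSEMBLING: (20, 45),
    Stage.READY: (0, 0),
}
ORDER = list(Stage)  # Enum iteration follows definition order

def eta_range(current):
    """Sum the low and high estimates for the current and remaining stages."""
    remaining = ORDER[ORDER.index(current):]
    low = sum(STAGE_SECONDS[s][0] for s in remaining)
    high = sum(STAGE_SECONDS[s][1] for s in remaining)
    return low, high

@dataclass
class JobStatus:
    job_id: str
    stage: Stage
    last_error: Optional[str] = None

    def parent_message(self):
        """Status line shown to the parent instead of a silent spinner."""
        if self.stage is Stage.READY:
            return "Your video is ready."
        low, high = eta_range(self.stage)
        return f"Status: {self.stage.value}. About {low} to {high} seconds left."

    def support_context(self):
        """The fields ops needs when a parent opens a ticket."""
        return {"job_id": self.job_id, "stage": self.stage.value,
                "last_error": self.last_error}
```

Showing a range rather than a single number keeps the estimate honest when a stage runs long.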

Throughput and cost under viral load

The hardest production problem was queue design under spike traffic. Naive FIFO queueing collapses margin when provider rate limits kick in mid-campaign. Everyone waits longer, retries stack, and you pay twice for the same scene.

What worked for us:

  • Parallel workers per slow stage, not one global queue
  • Asset caching for reusable backgrounds, props, and scene templates
  • Provider fallbacks when upstream APIs throttle
  • Graceful degradation: preview tier stays available even when final-render capacity is saturated
  • Per-job cost ceilings so a single bad prompt cannot burn unlimited GPU
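Two of the items above, provider fallbacks and per-job cost ceilings, combine naturally in one function. The sketch below is illustrative only: the exception names, the `JobBudget` class, the cents-based pricing, and the assumption that a throttled call is not billed are all hypothetical, and real providers signal throttling in their own ways.

```python
class ProviderThrottled(Exception):
    """Raised by a render function when the upstream API rate-limits us."""

class CostCeilingExceeded(Exception):
    """Raised when a job would spend more than its allowed budget."""

class JobBudget:
    def __init__(self, ceiling_cents):
        self.ceiling_cents = ceiling_cents
        self.spent_cents = 0

    def charge(self, cents):
        """Reserve spend up front; refuse if it would cross the ceiling."""
        if self.spent_cents + cents > self.ceiling_cents:
            raise CostCeilingExceeded(
                f"{self.spent_cents + cents} > {self.ceiling_cents} cents")
        self.spent_cents += cents

    def refund(self, cents):
        self.spent_cents -= cents

def render_with_fallback(scene, providers, budget):
    """Try providers in order; providers is a list of (name, cost_cents, render_fn).

    Returns (provider name, render result). Raises ProviderThrottled if every
    provider throttles, or CostCeilingExceeded if the budget runs out first.
    """
    for name, cost_cents, render in providers:
        budget.charge(cost_cents)
        try:
            return name, render(scene)
        except ProviderThrottled:
            # Assumption: throttled requests are not billed, so release the reservation.
            budget.refund(cost_cents)
    raise ProviderThrottled("all providers throttled")
```

Because the budget is checked before each attempt, a retry storm stops at the ceiling instead of paying repeatedly for the same scene.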

We instrumented per-stage latency before launch. Bottlenecks that look theoretical in staging show up immediately under real traffic. The first viral curve taught us more than three weeks of local load testing.
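Per-stage latency instrumentation does not need much machinery to start with. Here is a minimal sketch using only the standard library; the `StageTimer` class and its nearest-rank p95 are assumptions for illustration, and a production system would more likely ship these samples to a metrics backend.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Collects wall-clock durations per pipeline stage."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can use a fake clock
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, stage):
        """Time the wrapped block and record it, even if the block raises."""
        start = self._clock()
        try:
            yield
        finally:
            self.samples[stage].append(self._clock() - start)

    def p95(self, stage):
        """Nearest-rank 95th percentile for a stage, or None with no samples."""
        values = sorted(self.samples.get(stage, []))
        if not values:
            return None
        index = max(0, math.ceil(0.95 * len(values)) - 1)
        return values[index]
```

Usage is a `with timer.measure("image"):` block around each stage handler. Tracking a tail percentile rather than the mean is what surfaces the stage that falls over first under spike traffic.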

Safety and child-appropriate output

Kids products need conservative defaults. We constrained story templates, blocked open-ended character generation, and added automated checks on script content before any image job started. Human review hooks exist for edge cases, but the goal was to make unsafe paths rare in the first place.

If you are scoping something similar, treat safety constraints as pipeline inputs, not a moderation screen at the end.
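To illustrate the "pipeline input" idea, the sketch below refuses to enqueue any image job until the script passes a check. The template names and blocked terms are placeholders, and a keyword list is only a stand-in for whatever automated checks a real product would run; none of these names come from the actual system.

```python
import re

# Hypothetical allow-list of story templates and a placeholder block-list.
ALLOWED_TEMPLATES = {"bedtime", "space_adventure", "animal_friends"}
BLOCKED_TERMS = {"weapon", "blood"}

def check_script(template, scenes):
    """Return a list of problems; an empty list means the script may proceed."""
    problems = []
    if template not in ALLOWED_TEMPLATES:
        problems.append(f"template not allowed: {template}")
    for i, text in enumerate(scenes):
        words = set(re.findall(r"[a-z']+", text.lower()))
        hits = sorted(words & BLOCKED_TERMS)
        if hits:
            problems.append(f"scene {i}: blocked terms {hits}")
    return problems

def enqueue_image_jobs(template, scenes, enqueue):
    """Gate the image stage: nothing is queued unless every check passes.

    enqueue is any callable that accepts one scene. On failure the problems
    are returned so the caller can route the job to human review.
    """
    problems = check_script(template, scenes)
    if problems:
        return problems
    for scene in scenes:
        enqueue(scene)
    return []
```

The point of the shape is ordering: the check runs before any GPU time is spent, so an unsafe script never produces an image that then has to be moderated.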

What we would do differently

  • Separate preview and production model tiers from day one (do not share queues)
  • Build parent-facing progress UX in week two, not week five
  • Define a golden set of child photos for regression testing before marketing spend
  • Negotiate provider rate limits with expected viral multiples, not average Tuesday traffic

Where to start if you are scoping a build

Start with queue architecture and billing boundaries before you polish the prompt UI. Models improve every quarter. Margin structure and parent trust do not fix themselves later.

If you want help scoping a generative media product, read the Tiny Tales Videos case study or our services page, or send an inquiry with your volume targets and timeline.