Engineering · February 20, 2026 · 8 min read

Why we built our own inference stack (and why it's cheaper)

Running your own GPU pool looks insane on day 1 and obvious by day 180. Here's the honest behind-the-scenes of why hosted APIs couldn't meet our margin targets, and what it bought us on the way out.

VideoGenAI Team
Engineering · Infra
!TL;DR
→ Day-1 plan was hosted APIs + markup. It lasted six weeks before the math stopped working.
→ Self-hosting pushed utilisation to 2-3× what hosted APIs price in.
→ The non-obvious win wasn't cost: it was the freedom to *route between models* per prompt.
→ We'd skip the hosted-API phase entirely if we did it again.

When we started in December 2024, the assumption was simple: pay a hosted video API, resell at a markup, focus the team on product and marketing. That plan lasted about six weeks. Here's what happened, with actual numbers.

The math didn't work

Hosted APIs price per generated second. To stay competitive with Kling at our target retail, we'd have needed a negative margin. You can do that for six weeks as a marketing spend, but you can't do it for a business.

The math
  $3.60   Kling retail, 30s 1080p
- $2.80   Our hosted-API cost at the same settings
= $0.80   Apparent margin

Looks fine at a glance. But the hosted-API fee is only the direct cost: on top of it you also owe support, payment fees, failed-render refunds, idle time on our own workers, and observability. When we added those in:

!Heads up

Actual contribution margin on a hosted-API reseller model was -6% in our first month. Each render lost us money. A free tier on top of that would have been lethal.
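The gap between apparent and contribution margin is easy to reproduce. A back-of-envelope sketch, using the retail and API figures from above; the individual overhead line items are illustrative placeholders, not our real ledger:

```python
# Back-of-envelope contribution margin for the hosted-API reseller model.
# RETAIL and API_COST are from the post; the overhead breakdown below is
# hypothetical, chosen only to illustrate how a $0.80 apparent margin
# turns negative once per-render overheads are attributed.
RETAIL = 3.60      # Kling-competitive retail, 30s 1080p
API_COST = 2.80    # hosted-API charge at the same settings

overheads = {
    "support": 0.30,                 # illustrative per-render share
    "payment_fees": 0.15,
    "failed_render_refunds": 0.25,
    "idle_workers": 0.20,
    "observability": 0.12,
}

apparent_margin = RETAIL - API_COST
contribution = apparent_margin - sum(overheads.values())

print(f"apparent margin:     ${apparent_margin:.2f}")
print(f"contribution margin: ${contribution:.2f} "
      f"({contribution / RETAIL:+.0%} of retail)")
```

With these placeholder overheads the contribution margin lands around -6% of retail, which is the shape of the problem: every knob except the API fee was already squeezed.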

Owning the pool

We spin up GPUs in Frankfurt, pool them across tenants, and batch aggressively. Our utilisation runs 2-3× what a hosted API charges for, because we never pay for idle time.

24%    Hosted-API average utilisation (what the vendor prices in)
71%    Our pool utilisation (batching + queue shaping)
+2.9×  Effective cost delta (same hardware, better packing)

Two things make that work:

  1. Queue shaping. Short clips and long clips go into different lanes so one 60-second render doesn't hold up eight 6-second ones.
  2. Warm weights. We keep the top 3-4 models resident in GPU memory. A cold start kills utilisation; keeping weights warm means we almost never pay that penalty.
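Queue shaping is the simpler of the two to show. A minimal sketch, with illustrative names and thresholds (the real scheduler is more involved): jobs split into lanes by requested clip length, and the dispatcher drains the short lane with a bias so long renders still make progress without blocking short ones.

```python
from collections import deque

# Two-lane queue shaping, minimal sketch. The cutoff and bias values
# are hypothetical, not our production settings.
SHORT_LANE_MAX_SECONDS = 10

short_lane: deque = deque()
long_lane: deque = deque()
_served_short = 0  # short jobs served since the last long job

def enqueue(job_id: str, seconds: int) -> None:
    """Route a render job to a lane by requested clip length."""
    lane = short_lane if seconds <= SHORT_LANE_MAX_SECONDS else long_lane
    lane.append((job_id, seconds))

def next_job(short_bias: int = 4):
    """Serve up to `short_bias` short jobs per long job, so one
    60-second render never holds up a run of 6-second ones."""
    global _served_short
    if short_lane and (_served_short < short_bias or not long_lane):
        _served_short += 1
        return short_lane.popleft()
    _served_short = 0
    return long_lane.popleft() if long_lane else None
```

With a 60-second job queued first and three 6-second jobs behind it, the dispatcher serves the three short jobs before the long one, which is exactly the reordering that keeps the short lane's latency flat.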

The non-obvious benefit: routing

Self-hosting also let us do something hosted APIs *structurally cannot*: route each prompt to the best-fit model. On a hosted API you're stuck with whatever model that vendor serves.

We benchmark the pool weekly against a rotating set of 60 prompts and swap silently underneath you. If Kling ships an update that makes it the new best for a category, your renders use it that same week: you did nothing, and your bill didn't go up.
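The routing layer itself reduces to a small lookup once the weekly benchmark has run. A hedged sketch, where the category names, model names, scores, and the toy classifier are all illustrative stand-ins, not our live leaderboard:

```python
# Per-prompt routing sketch: weekly benchmark scores per (category, model)
# decide which model serves each prompt. All names and scores here are
# hypothetical placeholders.
WEEKLY_SCORES = {
    ("product_shot", "kling"): 0.81,
    ("product_shot", "model_b"): 0.74,
    ("character_motion", "kling"): 0.69,
    ("character_motion", "model_b"): 0.77,
}

def classify(prompt: str) -> str:
    """Stand-in for the real prompt classifier."""
    if "walk" in prompt or "run" in prompt:
        return "character_motion"
    return "product_shot"

def route(prompt: str) -> str:
    """Pick the best-scoring model for this prompt's category."""
    category = classify(prompt)
    candidates = {m: s for (c, m), s in WEEKLY_SCORES.items() if c == category}
    return max(candidates, key=candidates.get)
```

The point of this shape: when the weekly benchmark refreshes, only the score table changes. Callers never see the swap, which is what "your renders use it that same week" means in practice.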

"The model is the product. The routing is the business."
– Our infra lead, roughly every Monday

What we'd do differently

We'd skip the hosted-API phase entirely. The setup cost for self-hosting was real (two months, one engineer full-time, $14k of first-month GPU reservations before a single paying user), but the unit economics never looked back and the routing layer is now our moat.

✓ Owning inference turned a -6% margin into a +52% margin
✓ Routing is the feature hosted APIs can't copy
✓ Frankfurt region bought us EU data residency on day one
Didn't need 8×H100s from day one: two nodes were plenty to start
Didn't need bespoke model training: routing pre-trained ones is 80% of the win

What's next

We're experimenting with speculative decoding for short clips and fractional-step distillation for the common 8-second format. If the early numbers hold, we expect another 30-40% off our entry-tier clip price, *already* the cheapest in the industry, by Q3 2026.

✓ Win

Users don't need to do anything to benefit: these roll out silently, and your existing token balance stretches further.

Boring answer: the fastest way to see the difference is to render five of your real prompts on our free tokens and compare to whatever you're paying now.

Run a benchmark on your own footage
