Why we built our own inference stack (and why it's cheaper)
Running your own GPU pool looks insane on day 1 and obvious by day 180. Here's the honest behind-the-scenes of why hosted APIs couldn't meet our margin targets, and what it bought us on the way out.
When we started in December 2024 the assumption was simple: pay a hosted video API, resell at a markup, focus the team on product and marketing. That plan lasted about six weeks. Here's what happened, with actual numbers.
The math didn't work
Hosted APIs price per generated second. To stay competitive with Kling at our target retail, we'd have needed a negative margin. You can do that for six weeks as a marketing spend, but you can't do it for a business.
Looks fine at a glance. But the hosted API's per-second price is only your cost of goods: you also owe support, payment fees, failed-render refunds, idle time on your own workers, and observability. When we added those in:
Actual contribution margin on a hosted-API reseller model was -6% in our first month. Each render lost us money. A free tier on top of that would have been lethal.
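The arithmetic is simple enough to sketch. The numbers below are illustrative placeholders, not our actual rates; they happen to land at the -6% figure above.

```python
# Hypothetical contribution-margin check for a hosted-API reseller model.
# All dollar figures are illustrative, not our real pricing.

def contribution_margin(retail_per_sec: float,
                        api_cost_per_sec: float,
                        overhead_per_sec: float) -> float:
    """Margin per generated second, as a fraction of retail."""
    return (retail_per_sec - api_cost_per_sec - overhead_per_sec) / retail_per_sec

# Example: retail $0.10/s, hosted API $0.08/s, and another $0.026/s of
# support, payment fees, refunds, idle workers, and observability.
m = contribution_margin(0.10, 0.08, 0.026)
print(f"{m:.0%}")  # prints "-6%"
```

The point of writing it down: the overhead line is the one resellers forget, and it's the one that flips the sign.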
Owning the pool
We spin up GPUs in Frankfurt, pool them across tenants, and batch aggressively. Our utilisation runs 2-3× what a hosted API effectively charges you for, because we never pay for idle time.
Two things make that work:
- Queue shaping. Short clips and long clips go into different lanes so one 60-second render doesn't hold up eight 6-second ones.
- Warm weights. We keep the top 3-4 models resident. A cold-start penalty kills utilisation; we almost never pay that penalty.
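The queue-shaping half is the easier one to show. Here's a minimal sketch of two-lane dispatch; the lane names and the 10-second cutoff are assumptions for illustration, not our production values.

```python
# Minimal two-lane queue shaping: short and long clips never share a lane,
# so one 60-second render can't hold up a run of 6-second ones.
import heapq
import itertools

SHORT_CUTOFF_SEC = 10  # assumed cutoff; clips at or under this go in the fast lane

class ShapedQueue:
    def __init__(self):
        self._lanes = {"short": [], "long": []}
        self._seq = itertools.count()  # FIFO tiebreaker within a lane

    def submit(self, job_id: str, duration_sec: float) -> str:
        lane = "short" if duration_sec <= SHORT_CUTOFF_SEC else "long"
        heapq.heappush(self._lanes[lane], (next(self._seq), job_id))
        return lane

    def next_job(self, lane: str):
        """Workers are pinned to a lane, so long renders never block short ones."""
        q = self._lanes[lane]
        return heapq.heappop(q)[1] if q else None

q = ShapedQueue()
q.submit("clip-a", 6)   # -> short lane
q.submit("clip-b", 60)  # -> long lane
q.submit("clip-c", 6)   # -> short lane
print(q.next_job("short"), q.next_job("long"))  # prints "clip-a clip-b"
```

Pinning workers to lanes is the design choice that matters: it trades a little peak throughput for predictable latency on the short-clip lane, which is where most renders live.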
The non-obvious benefit: routing
Self-hosting also let us do something hosted APIs *structurally cannot*: route each prompt to the best-fit model. On a hosted API you're stuck with whatever model that vendor serves.
We benchmark the pool weekly against a rotating set of 60 prompts and swap models silently underneath you. If Kling ships an update that makes it the new best for a category, your renders use it that same week; you did nothing, and your bill didn't go up.
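In spirit, the router is just a lookup over the latest benchmark table. A sketch, assuming per-category scores refreshed weekly; the category names, model names, and scores below are illustrative only.

```python
# Benchmark-driven routing sketch: pick the best-scoring model per prompt
# category, falling back to a default for unbenchmarked categories.

WEEKLY_SCORES = {
    # category -> {model: benchmark score, higher is better} (illustrative)
    "product-shot": {"kling": 0.81, "model-b": 0.77},
    "talking-head": {"kling": 0.70, "model-b": 0.84},
}

def route(category: str, default: str = "kling") -> str:
    """Return the best-scoring model for this prompt category."""
    scores = WEEKLY_SCORES.get(category)
    if not scores:
        return default  # no benchmark data yet: fall back
    return max(scores, key=scores.get)

print(route("product-shot"))  # prints "kling"
print(route("talking-head"))  # prints "model-b"
```

When a weekly benchmark run flips a category's leader, only the table changes; callers never see the swap, which is exactly the "silent" behaviour described above.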
"The model is the product. The routing is the business." – Our infra lead, roughly every Monday
What we'd do differently
We'd skip the hosted-API phase entirely. The setup cost for self-hosting was real: two months, one engineer full-time, and $14k of first-month GPU reservations before a single paying user. But the unit economics never looked back, and the routing layer is now our moat.
What's next
We're experimenting with speculative decoding for short clips and fractional-step distillation for the common 8-second format. If the early numbers hold, we expect another 30-40% off the *already* cheapest entry-tier clip in the industry by Q3 2026.
Users don't need to do anything to benefit β these roll out silently and your existing token balance stretches further.
Boring answer: the fastest way to see the difference is to render five of your real prompts on our free tokens and compare to whatever you're paying now.