In May 2025 Stripe reported that its detection rate for card-testing fraud on large businesses went from 59% to 97% — overnight, with no increase in false positives. The thing that moved the number wasn’t a better fraud model. It was a foundation model trained on tens of billions of transactions that compresses each payment into a single representation, capturing (in their words) signals “no human, and no previous model, could track.”
Two months earlier, Netflix described retiring hundreds of specialized recommendation models in favor of one foundation model, explicitly inspired by how large language models learn.
It is tempting to read both moves as a way to save engineering time. There’s a deeper reason: both companies had watched the bitter lesson play out and acted on it. Training one model on all of the data beats training many models on slices of it. That is the whole idea, and most enterprises are still on the other side of it.
What feature engineering throws away
Here’s the old way, the one nearly every data team runs today. A question arrives — who’s going to churn? The team pulls the relevant tables, then engineers features: days since last login, count of support tickets, rolling 30-day spend. A handful of numbers, chosen by hand, meant to summarize a customer well enough to predict one thing.
Every one of those features is a human guess about what matters, and every guess is lossy. The team is compressing a customer’s entire history into a dozen columns. Then it does so again, differently, for the fraud model, and again for the credit model. The same underlying reality is re-summarized by hand, once per question.
A signal that doesn’t fit into someone’s hand-built feature never makes it into the model, and a better feature store can’t fix that. It’s the ceiling of doing the work by hand instead of letting the data speak for itself.
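To make the lossiness concrete, here is a minimal sketch of the three features named above. The event layout (date, event type, amount), the function name, and the two sample customers are all hypothetical, chosen only to illustrate the point. The two histories differ in the order of a support ticket and a purchase, yet they collapse to the same three numbers.

```python
from datetime import date, timedelta

def churn_features(events, as_of):
    """Collapse one customer's history into three hand-picked numbers.

    `events` is a list of (event_date, event_type, amount) tuples.
    This layout is an assumption made for the example.
    """
    logins = [d for d, kind, _ in events if kind == "login"]
    window_start = as_of - timedelta(days=30)
    return {
        "days_since_last_login": (as_of - max(logins)).days if logins else None,
        "support_ticket_count": sum(1 for _, kind, _ in events if kind == "ticket"),
        "spend_30d": sum(
            amount
            for d, kind, amount in events
            if kind == "purchase" and window_start < d <= as_of
        ),
    }

as_of = date(2025, 6, 30)

# Customer A: raised a ticket, then bought again.
customer_a = [
    (date(2025, 6, 10), "ticket", 0.0),
    (date(2025, 6, 20), "purchase", 50.0),
    (date(2025, 6, 25), "login", 0.0),
]

# Customer B: bought, then raised a ticket.
customer_b = [
    (date(2025, 6, 10), "purchase", 50.0),
    (date(2025, 6, 20), "ticket", 0.0),
    (date(2025, 6, 25), "login", 0.0),
]

features_a = churn_features(customer_a, as_of)
features_b = churn_features(customer_b, as_of)
```

Both customers come out as 5 days since last login, 1 ticket, and 50.0 in 30-day spend. Whatever the ordering might have told a model is gone before training starts.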
Model the business, not the question
The alternative is to stop modeling questions and model the business.
Train one model on all of your operational data, and let it learn its own representation of your customers, accounts, and events, instead of being handed a dozen features someone guessed at.
In practice, this means a foundation model trained across your data learns a dense representation of each entity from the raw event stream: every interaction, transaction, and state change, in sequence. Churn, fraud, and credit risk are then read-outs from that shared representation, not separate models built from separate hand-made features. The expensive, valuable part (learning what your data actually encodes) happens once, over everything, and every question draws on all of it.
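The shape of that arrangement can be sketched in a few lines. This is a toy under stated assumptions, not a description of any real system: `encode` is a fixed hashing function standing in for a learned encoder, the head weights are set by hand rather than fitted, and every name here is hypothetical. What it shows is the structure: the entity is encoded once, and each question is a small read-out on the same vector.

```python
import math
import zlib

DIM = 8  # toy embedding width; a real model would use far more dimensions

def encode(events):
    """Stand-in for a learned encoder: ordered event sequence -> one vector.

    Consecutive event pairs are hashed into buckets, so the result depends
    on the sequence and not only on counts. A real foundation model would
    learn this mapping from data instead of hashing.
    """
    vec = [0.0] * DIM
    for prev, curr in zip(events, events[1:]):
        bucket = zlib.crc32(f"{prev}>{curr}".encode()) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def make_head(weights, bias=0.0):
    """A read-out: one linear layer plus a sigmoid over the shared vector."""
    def head(embedding):
        z = sum(w * x for w, x in zip(weights, embedding)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return head

# Hand-set weights purely for illustration; in practice these would be fitted.
heads = {
    "churn": make_head([0.5, -0.2, 0.1, 0.0, 0.3, -0.4, 0.2, 0.1]),
    "fraud": make_head([-0.3, 0.6, 0.0, 0.2, -0.1, 0.1, 0.4, -0.2]),
    "credit_risk": make_head([0.1, 0.1, -0.5, 0.3, 0.0, 0.2, -0.1, 0.4]),
}

events = ["signup", "login", "purchase", "ticket", "login"]
embedding = encode(events)  # computed once per entity
scores = {name: head(embedding) for name, head in heads.items()}  # one read-out per question
```

Adding a fourth question means adding a fourth entry to `heads`. The encoder, which is the expensive part, is untouched.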
This is why the representation beats the feature set. It isn’t limited to what a person thought to compute. It can encode an interaction between two events that no analyst would join, because it learned that the interaction predicts something.
The LLM parallel, made precise
The analogy to language models is real, but it’s worth being exact about it, because the exact version is the one that holds.
LLMs generalize because they’re trained on the distribution of essentially all language: not on a task, but on the underlying structure, from which tasks fall out. The claim here is not that Lua is an LLM, or that some model pretrained on the public internet will understand your business. It’s that your operational data is its own vast distribution, and the same approach of modeling the whole thing and getting the tasks for free works on it.
The objection writes itself: one company doesn’t have the internet’s worth of data. Often it does. A large enterprise’s transaction logs alone can run to hundreds of billions or trillions of tokens, which is the scale of the corpora used to train general-purpose language models. Stripe’s tens of billions of transactions is one company’s payments. The substrate has been sitting in the warehouse the whole time. Almost nobody has been modeling it as one distribution.
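A back-of-the-envelope version of that arithmetic, with every input an illustrative assumption and not a measurement: the transaction count is one reading of “tens of billions,” and the tokens-per-transaction figure is a guess at how many fields (amount, merchant, time, device, and so on) one serialized record might carry.

```python
def corpus_tokens(num_records, tokens_per_record):
    """Rough corpus size: records times tokens per serialized record."""
    return num_records * tokens_per_record

# Both inputs are hypothetical, chosen only to show the order of magnitude.
transactions = 20_000_000_000   # assumed: "tens of billions" of transactions
tokens_per_transaction = 25     # assumed: fields per record once serialized

total_tokens = corpus_tokens(transactions, tokens_per_transaction)
print(f"{total_tokens:,} tokens")  # prints "500,000,000,000 tokens"
```

Under those assumptions a single transaction log lands at half a trillion tokens, before counting any other table in the warehouse.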
So you end up with one model, not one per question
The consolidation everyone notices is just the consequence. When one model underlies everything, you maintain one model, not a fleet of pipelines decaying on separate schedules. Improve the base and every use case improves at once. This is the part Netflix wrote about plainly: hundreds of models became one, because the costs and inconsistencies of the old way became impossible to carry.
It also changes the economics of asking the next question. The first use case pays the cost of learning your data. Every use case after is a fine-tune of a model that already understands your business, so the second prediction is cheaper than the first, and the tenth is cheaper (and better) still.
Where this goes next
One model per company is the foundation the rest of our writing builds on. From here we go deeper: what a model actually learns from tabular and event data, how we ran our benchmark so you can judge it yourself, and how the whole thing runs inside your VPC without data egress.
The data was always the asset. The old way just kept handing the model a summary of it.