
29/01/2025
Why is everyone freaking out about DeepSeek?
It took about a month for the finance world to start freaking out about DeepSeek, but when it did, it took more than half a trillion dollars — or one entire Stargate — off Nvidia’s market cap. It wasn’t just Nvidia, either: Tesla, Google, Amazon, and Microsoft tanked.
DeepSeek’s two AI models, released in quick succession, put it on par with the best available from American labs, according to Scale AI CEO Alexandr Wang. And DeepSeek seems to be working within constraints that mean it trained much more cheaply than its American peers. One of its recent models is said to have cost just $5.6 million for its final training run, roughly the salary a senior American AI expert can command. Last year, Anthropic CEO Dario Amodei said the cost of training models ranged from $100 million to $1 billion. OpenAI’s GPT-4 cost more than $100 million, according to CEO Sam Altman. DeepSeek seems to have just upended our idea of how much AI costs, with potentially enormous implications across the industry.
This has all happened over just a few weeks. On Christmas Day, DeepSeek released V3, a general-purpose large language model that caused a lot of buzz. Its reasoning model, R1, released last week, has been called “one of the most amazing and impressive breakthroughs I’ve ever seen” by Marc Andreessen, VC and adviser to President Donald Trump. The advances from DeepSeek’s models show that “the AI race will be very competitive,” says Trump’s AI and crypto czar David Sacks. Both models are partially open source: the weights are public, but the training data isn’t.
DeepSeek’s successes call into question whether billions of dollars in compute are actually required to win the AI race. The conventional wisdom has been that big tech will dominate AI simply because it has the spare cash to chase advances. Now, it looks like big tech has simply been lighting money on fire. Figuring out how much the models actually cost is a little tricky because, as Scale AI’s Wang points out, DeepSeek may not be able to speak honestly about what kind and how many GPUs it has as a result of US sanctions.
Even if critics are correct and DeepSeek isn’t being truthful about what GPUs it has on hand (napkin math suggests the optimization techniques it used mean it probably is), it won’t take long for the open-source community to find out, according to Hugging Face’s head of research, Leandro von Werra. His team started working over the weekend to replicate and open-source the R1 recipe, and once researchers can create their own version of the model, “we’re going to find out pretty quickly if numbers add up.”
What is DeepSeek?
Led by CEO Liang Wenfeng, the two-year-old DeepSeek is China’s premier AI startup. It spun out from a hedge fund founded by engineers from Zhejiang University and is focused on “potentially game-changing architectural and algorithmic innovations” to build artificial general intelligence (AGI) — or at least, that’s what Liang says. Unlike OpenAI, it also claims to be profitable.
In 2021, Liang started buying thousands of Nvidia GPUs (just before the US put sanctions on chips) and launched DeepSeek in 2023 with the goal to “explore the essence of AGI,” or AI that’s as intelligent as humans. Liang follows a lot of the same lofty talking points as OpenAI CEO Altman and other industry leaders. “Our destination is AGI,” Liang said in an interview, “which means we need to study new model structures to realize stronger model capability with limited resources.”
So, that’s exactly what DeepSeek did. With a few innovative technical approaches that allowed its models to run more efficiently, the team claims the final training run for V3, the base model R1 is built on, cost just $5.6 million, and R1 itself is roughly 95 percent cheaper to use than OpenAI’s o1. Instead of starting from scratch, DeepSeek built R1 on top of that existing base model, and it also released smaller distilled versions of R1 that use open-source models such as Meta’s Llama as foundations. While the company’s training data mix isn’t disclosed, DeepSeek did mention it used synthetic data, or artificially generated information (which might become more important as AI labs seem to hit a data wall).
Without the training data, it isn’t exactly clear how much of a “copy” this is of o1 — did DeepSeek use o1 to train R1? Around the time that the first paper was released in December, Altman posted that “it is (relatively) easy to copy something that you know works” and “it is extremely hard to do something new, risky, and difficult when you don’t know if it will work.” So the claim is that DeepSeek isn’t going to create new frontier models; it’s simply going to replicate old models. OpenAI investor Joshua Kushner also seemed to say that DeepSeek “was trained off of leading US frontier models.”
R1 used two key optimization tricks, former OpenAI policy researcher Miles Brundage told The Verge: more efficient pre-training and reinforcement learning on chain-of-thought reasoning. DeepSeek found smarter ways to use cheaper GPUs to train its AI, and part of what helped was using a new-ish technique for requiring the AI to “think” step by step through problems using trial and error (reinforcement learning) instead of copying humans. This combination allowed the model to achieve o1-level performance while using way less computing power and money.
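To give a sense of what “reinforcement learning on chain-of-thought reasoning” means in practice, here is a minimal, illustrative Python sketch. It is not DeepSeek’s actual training code, and the tag format and scoring values are assumptions. The idea is that sampled reasoning traces are scored by simple rules (did the model show its work, did it land on the right answer) rather than by how closely they imitate a human-written solution, and a reinforcement learning update then nudges the model toward the higher-scoring samples.

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward for one sampled chain-of-thought completion.
    The model is scored on outcomes (format plus a correct final answer),
    not on how closely it copies a human-written solution."""
    score = 0.0
    # Format reward: reasoning wrapped in <think> tags, answer in <answer> tags
    # (hypothetical tags, used here only for illustration).
    if re.search(r"<think>.+</think>\s*<answer>.+</answer>", completion, re.DOTALL):
        score += 0.5
    # Accuracy reward: the final answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0
    return score

# Trial and error: sample several completions for the same prompt, score each,
# then a policy-gradient update would push the model toward the higher-reward
# samples. That update step is omitted here.
samples = [
    "<think>2 + 2 equals 4 because ...</think><answer>4</answer>",
    "<think>just guessing</think><answer>5</answer>",
]
print([reward(s, "4") for s in samples])  # [1.5, 0.5]
```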
“DeepSeek v3 and also DeepSeek v2 before that are basically the same sort of models as GPT-4, but just with more clever engineering tricks to get more bang for their buck in terms of GPUs,” Brundage said.
To be clear, other labs employ these techniques (DeepSeek used “mixture of experts,” which activates only parts of the model for certain queries; GPT-4 reportedly did that, too). The DeepSeek version innovated on this concept by creating more finely tuned expert categories and developing a more efficient way for them to communicate, which made the training process itself more efficient. The DeepSeek team also developed something called DeepSeekMLA (multi-head latent attention), which dramatically reduced the memory required to run AI models by compressing how the model stores and retrieves information.
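To make the “activates only parts of the model” idea concrete, here is a toy mixture-of-experts layer in Python using PyTorch. It is a generic top-k router, not DeepSeek’s actual architecture (which uses many fine-grained experts plus shared experts and its own load-balancing scheme), and the layer sizes and k=2 routing are arbitrary assumptions. The point is simply that each token runs through only a couple of experts instead of the whole network.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a small router picks the top-k experts
    for each token, so only a fraction of the parameters run per query."""

    def __init__(self, dim: int = 64, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        scores = self.router(x)                            # (tokens, num_experts)
        weights, indices = scores.topk(self.k, dim=-1)     # keep only the top-k experts
        weights = weights.softmax(dim=-1)                  # mixing weights per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)          # 16 tokens, 64-dim hidden states
print(TopKMoELayer()(tokens).shape)   # torch.Size([16, 64])
```

DeepSeek’s multi-head latent attention is a separate optimization aimed at shrinking the attention cache rather than the feed-forward experts, so it isn’t shown in this sketch.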
What is shocking the world isn’t just the architecture that led to these models but the fact that DeepSeek was able to replicate OpenAI’s achievements so rapidly, within months rather than the year-plus gap typically seen between major AI advances, Brundage added.
OpenAI positioned itself as uniquely capable of building advanced AI, and this public image just won the support of investors to build the world’s biggest AI data center infrastructure. But DeepSeek’s quick replication shows that technical advantages don’t last long — even when companies try to keep their methods secret.
“These close sourced companies, to some degree, they obviously live off people thinking they’re doing the greatest things and that’s how they can maintain their valuation. And maybe they overhyped a little bit to raise more money or build more projects,” von Werra says. “Whether they overclaimed what they have internally, nobody knows, obviously it’s to their advantage.”