On January 27, 2025, Nvidia lost $600 billion in market value. One day. Six hundred billion. The largest single-day loss for one company in U.S. stock market history.

The cause wasn't a recession, a scandal, or a product failure. It was a research paper from a company in Hangzhou, China, that most Americans had never heard of. DeepSeek, a company with 160 employees funded by a Chinese hedge fund, had released a model that performed comparably to OpenAI's o1 on reasoning benchmarks. The reported training cost: roughly $6 million. OpenAI spent an estimated $100 million training GPT-4. Meta used approximately ten times the compute for Llama 3.1.

The implications hit Wall Street like a truck. If you could build a competitive AI model for roughly 1/17th the cost, what was all that GPU spending for? Why would anyone pay $30,000 for an H100 if the future was efficiency, not brute force? The entire thesis behind Nvidia's trillion-dollar valuation wobbled.

This is the story of how a hedge fund quantitative trader, working under U.S. chip export restrictions with older hardware that American labs would consider obsolete, built one of the most consequential AI labs in the world.

## The Hedge Fund Origins

You can't understand DeepSeek without understanding High-Flyer, the hedge fund that created it. Liang Wenfeng co-founded High-Flyer in February 2016 in Hangzhou. Liang had been trading since 2008, when he was still at Zhejiang University, right when the financial crisis was rewriting the rules of global markets. By 2016, he'd concluded that the future of quantitative trading was AI.

High-Flyer started using GPU-dependent deep learning models for stock trading on October 21, 2016. Before that, they'd used CPU-based linear models, the standard approach for quant funds at the time. By the end of 2017, most of the firm's trading was AI-driven. By 2021, it was all AI, all the time.

This matters because High-Flyer's core competency wasn't just running AI models. It was running them efficiently.
In quantitative trading, compute cost directly eats into returns. Every dollar spent on GPUs is a dollar not earned on trades. Liang's team had years of practice squeezing maximum performance out of minimum hardware.

In 2019, High-Flyer built its first computing cluster, Fire-Flyer, at a cost of 200 million yuan (roughly $28 million). It contained 1,100 GPUs interconnected at 200 gigabits per second. They ran it hard, retired it after just 18 months, and started building Fire-Flyer 2 in 2021 with a budget of 1 billion yuan (roughly $140 million).

That 2021 timing was critical. Before the U.S. restricted chip sales to China, Liang reportedly acquired 10,000 Nvidia A100 GPUs. These weren't the top-of-the-line DGX versions. They were PCIe A100s, the cheaper variant with lower interconnect bandwidth. But Liang had 10,000 of them, and he was about to put them to a use nobody expected.

## From Trading Floor to AGI Lab

On April 14, 2023, High-Flyer announced it was launching an AGI research lab. Three months later, on July 17, 2023, that lab was spun off into an independent company: DeepSeek. The name is revealing: "deep" for deep learning, "seek" for searching, exploring, seeking truth. The Chinese name translates more literally as "deep search." Liang Wenfeng became CEO of both High-Flyer and DeepSeek.

Venture capitalists passed. They didn't think a hedge fund's AI side project could compete with OpenAI, Google, and Meta, and they couldn't see a quick exit. So Liang funded DeepSeek entirely through High-Flyer's profits. No outside investors. No board pressure. No quarterly expectations.

That independence turned out to be DeepSeek's biggest advantage. Without VCs demanding growth metrics and commercialization timelines, the team could focus entirely on research. Liang explicitly stated that DeepSeek wouldn't pursue immediate commercialization. The company was a research lab first, a product company never (or at least not yet).
## The Hiring Strategy

DeepSeek's hiring approach breaks every Silicon Valley convention. They don't require years of experience. They recruit fresh graduates from top Chinese universities, people with raw talent and energy but no industry credentials. They also recruit outside traditional computer science: mathematicians, physicists, even poets, to broaden the knowledge domains the models can draw from.

The New York Times reported that dozens of DeepSeek researchers have affiliations with People's Liberation Army laboratories and the "Seven Sons of National Defence," the seven Chinese universities with close military ties. This has raised concerns in Washington, but the practical reality is more mundane: China's best engineering talent often flows through these institutions because they're where the best AI research programs are.

By 2025, DeepSeek had about 160 employees total. For comparison, OpenAI has over 1,500, Anthropic around 1,000, and Google DeepMind over 2,500. DeepSeek is doing comparable work with a fraction of the headcount.

How? The team publishes prolifically and openly. Their papers appear on arXiv regularly, often with surprisingly detailed technical descriptions of their methods. While Western labs have become increasingly secretive (OpenAI published almost nothing about GPT-4's architecture), DeepSeek shares everything. Their research papers read like tutorials: here's what we did, here's how we did it, here's the code.

## The Architecture: Why Mixture of Experts Changes Everything

DeepSeek's technical breakthrough centers on the Mixture of Experts (MoE) architecture. Understanding MoE is essential to understanding why DeepSeek's models are so cheap to run.

A standard dense transformer activates all its parameters for every token it processes. A dense model the size of GPT-4's estimated 1.8 trillion parameters would run every one of them for every word you type. That's like turning on every light in a skyscraper to find your keys in one room.
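To make the sparse alternative concrete, here is a toy sketch in Python with NumPy (an illustration of top-k expert routing in general, not code from DeepSeek's models): a router scores eight small expert networks for each token and runs only the best two, so most parameters never execute.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Toy top-k Mixture of Experts routing for a single token.

    x: (d,) token hidden state
    experts: list of (d, d) expert weight matrices
    gate_w: (d, n_experts) router weights
    Only k of the n experts run; the rest stay dormant.
    """
    logits = x @ gate_w                     # one router score per expert
    top = np.argsort(logits)[-k:]           # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts only
    # Combine just the chosen experts' outputs, weighted by the router.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal(d)

y = moe_layer(x, experts, gate_w, k=2)
print(y.shape)  # output has the same shape as x, but only 2 of 8 experts ran
```

With eight experts and k=2, only a quarter of the expert parameters touch any given token, the same principle, at vastly larger scale, that lets a model carry far more knowledge than it pays for per token.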
MoE models work differently. They contain many "expert" sub-networks, each specialized in different types of knowledge or reasoning. When a token comes in, a routing mechanism decides which experts are relevant and activates only those. The rest stay dormant.

DeepSeek-V3, released in December 2024, has 671 billion total parameters but activates only 37 billion per token: about 5.5% of the model active at any given time. You get the knowledge breadth of a 671B-parameter model with the compute cost of a 37B-parameter model. The efficiency gains are enormous.

But MoE by itself isn't new. Sparsely gated MoE layers appeared in Google research in 2017, and Google's Switch Transformer paper demonstrated the approach at trillion-parameter scale in 2021. What DeepSeek did differently was make MoE training stable and efficient at unprecedented scale. Their technical papers describe multi-head latent attention (MLA), which compresses the key-value cache during inference, cutting memory requirements dramatically. They use FP8 mixed-precision training, running much of the computation at 8-bit floating point instead of the standard 16-bit, which halves memory usage and roughly doubles throughput. They developed custom communication libraries (hfreduce) that outperform Nvidia's standard NCCL for their specific cluster topology.

These optimizations, individually incremental, compound. The result: training DeepSeek-V3 cost approximately $5.6 million in GPU hours, for a model that benchmarks within striking distance of GPT-4.

## The Chip Constraint That Became an Advantage

This is the most counterintuitive part of the story. The U.S. export restrictions on advanced AI chips to China were supposed to slow Chinese AI development. The logic was straightforward: without access to Nvidia's latest H100 and A100 chips, Chinese labs would fall behind.

DeepSeek turned the constraint into a feature. Their Fire-Flyer 2 cluster used 5,000 PCIe A100 GPUs, not the high-end DGX variants with NVLink interconnects that Western labs prefer.
The PCIe version has lower memory bandwidth, lower interconnect speed, and less compute per chip. American researchers would consider it suboptimal hardware. But because DeepSeek couldn't throw hardware at the problem, they were forced to innovate on software. Their distributed training infrastructure, including a custom file system (3FS), custom communication libraries (hfreduce), and optimized data pipelines, squeezed more out of each GPU than most Western labs get from better hardware.

DeepSeek's published research on their Fire-Flyer infrastructure reported 96% GPU utilization across the cluster. Most training clusters in the West achieve 30-50% utilization. The gap isn't about hardware. It's about how much effort you put into using the hardware you have.

This raises a question that's making American policymakers nervous: are chip export restrictions actually slowing Chinese AI, or are they just making Chinese AI more efficient?

## The R1 Moment

DeepSeek-R1, released on January 20, 2025, was the model that changed everything. R1 is a reasoning model, comparable to OpenAI's o1. It "thinks" through problems step by step, showing its reasoning chain before arriving at an answer. On mathematical reasoning benchmarks, it scored competitively with o1. On coding tasks, it performed comparably. On general reasoning, it was close.

Released under the MIT License. Fully open weights. No usage restrictions. Free.

The model shot to the top of the iOS App Store in the United States, surpassing ChatGPT. Within a week, it had millions of downloads. The January 27 stock market reaction followed immediately.

But the R1 moment wasn't just about one model. It was about what that model represented: proof that you don't need $100 billion datacenters and exclusive access to the latest chips to build competitive AI. You need smart people, efficient software, and enough hardware to get started.

## What Came After

DeepSeek didn't stop at R1.
In March 2025, they released DeepSeek-V3-0324, an updated version of their base model. In May 2025, DeepSeek-R1-0528 followed. In August 2025, DeepSeek V3.1 dropped, featuring a hybrid architecture with thinking and non-thinking modes and surpassing prior models by over 40% on benchmarks like SWE-bench. V3.1-Terminus came in September, followed by V3.2-Exp, which introduced DeepSeek Sparse Attention, a more efficient attention mechanism. The pace of releases would be impressive for any lab. For a 160-person operation? It's remarkable.

DeepSeek also expanded its reach globally, establishing a presence in Africa, where its affordable, less power-hungry AI solutions found natural demand. Startups in Nairobi started building on DeepSeek models. The company's open-weight approach and low compute requirements made it accessible in regions where Western AI APIs are expensive and reliable internet connectivity isn't guaranteed.

## The Censorship Complication

This wouldn't be an honest assessment without addressing the elephant in the room. DeepSeek's models comply with Chinese government censorship requirements. Ask about Tiananmen Square, Taiwan's sovereignty, or Xinjiang, and the models either refuse or give Party-line responses. The R1-0528 release in May 2025 was noted for following Chinese Communist Party ideology more tightly than prior versions. The censorship is baked into the model's RLHF training, not just a system-prompt filter that developers can remove.

For many Western developers and enterprises, this is a dealbreaker. For research and code generation, it's largely irrelevant: the models don't censor math or Python. But for any application involving natural language on politically sensitive topics, you need to know what you're working with.

Some developers have addressed this by fine-tuning DeepSeek's base models with their own RLHF data, effectively removing the censorship for their deployments. The MIT license permits this.
Whether the resulting models perform as well on general tasks is debated, but the option exists.

## What DeepSeek Means for the Industry

DeepSeek's impact goes beyond one company or one country.

It made a strong case that the relationship between compute spending and model quality is logarithmic, not linear. Doubling your GPU budget doesn't double your model quality. At some point, better algorithms matter more than bigger clusters. DeepSeek found that point earlier than anyone expected.

It showed that open publication of methods accelerates the entire field. While Western labs hoarded their techniques, DeepSeek published everything, and the global research community built on it. Fine-tuned derivatives of DeepSeek models now power applications on every continent.

It forced a strategic reassessment in Washington. Chip export controls were supposed to be America's trump card in the AI race. DeepSeek suggested the card might not be as powerful as assumed. The debate about export policy continues, but the certainty is gone.

And it gave every startup founder in the world a new benchmark for efficiency. If 160 people with constrained hardware can build models that compete with labs ten times their size, what's your excuse?

Liang Wenfeng started with a simple observation: AI models could be trained more efficiently than anyone bothered to try. He was right. And the entire industry is still reckoning with what that means.