On December 26, 2024, Chinese AI startup DeepSeek, known for challenging leading AI vendors with innovative open-source technology, released its latest ultra-large model, DeepSeek-V3. The model, available through Hugging Face under the company's licensing terms, comprises 671 billion parameters but relies on a mixture-of-experts architecture that activates only a subset of those parameters for each token, allowing it to handle tasks both accurately and efficiently. According to benchmarks shared by DeepSeek, the model outperforms leading open-source models, including Meta's Llama 3.1-405B, and closely matches the performance of proprietary models from Anthropic and OpenAI.
DeepSeek's release of this model reinforces its goal of bridging the gap between closed and open-source AI. Founded as a spin-off of the Chinese quantitative hedge fund High-Flyer Capital Management, the company aims to advance toward artificial general intelligence (AGI), in which models can understand or learn any intellectual task that a human can.
DeepSeek-V3 retains the fundamental architecture of its predecessor, DeepSeek-V2, built around multi-head latent attention (MLA) and the DeepSeekMoE framework. This keeps training and inference efficient by activating only 37 billion of the 671 billion total parameters for each token processed.
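To make that activation figure concrete, here is a minimal, illustrative sketch of top-k mixture-of-experts routing in Python. The dimensions, router, and gating scheme are simplified assumptions, not DeepSeek's actual implementation; the point is that only the selected experts' weights participate in the computation for a given token, which is how a 671-billion-parameter model can run with 37 billion active parameters per token.

```python
# Illustrative sketch of mixture-of-experts routing (not DeepSeek's code):
# a router scores every expert per token and only the top-k experts run.
import numpy as np

def moe_forward(token_vec, experts, router_weights, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    scores = router_weights @ token_vec              # affinity per expert
    top_k = np.argsort(scores)[-k:]                  # pick the k best experts
    gates = np.exp(scores[top_k])
    gates /= gates.sum()                             # normalize gate weights
    # Only the selected experts' parameters are used for this token.
    return sum(g * (experts[i] @ token_vec) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((n_experts, d))
out = moe_forward(rng.standard_normal(d), experts, router, k=2)
print(out.shape)  # (16,) -- computed with only 2 of the 8 experts
```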
Beyond this foundation, DeepSeek-V3 introduces two notable innovations. The first is an auxiliary-loss-free load-balancing strategy, which keeps the workload evenly distributed across experts without the performance penalty that auxiliary balancing losses typically introduce. The second is multi-token prediction (MTP), which lets the model predict multiple future tokens simultaneously, improving training efficiency and tripling generation speed to 60 tokens per second.
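The auxiliary-loss-free idea can be illustrated with a small sketch: instead of adding a balancing term to the training loss, a per-expert bias steers routing away from overloaded experts. The update rule and variable names below are illustrative assumptions, not DeepSeek's published method, and MTP is omitted for brevity.

```python
# Hedged sketch of auxiliary-loss-free load balancing: a routing bias,
# adjusted from observed expert load, replaces a balance term in the loss.
import numpy as np

def balanced_top_k(scores, bias, k=2):
    """Select experts using biased scores (bias affects routing only)."""
    return np.argsort(scores + bias)[-k:]

def update_bias(bias, expert_counts, step=0.01):
    """Push bias down for overloaded experts, up for underused ones."""
    target = expert_counts.mean()
    return bias - step * np.sign(expert_counts - target)

rng = np.random.default_rng(0)
n_experts, n_tokens = 8, 1000
bias = np.zeros(n_experts)
counts = np.zeros(n_experts)
for _ in range(n_tokens):
    scores = rng.standard_normal(n_experts)
    for e in balanced_top_k(scores, bias):
        counts[e] += 1
    bias = update_bias(bias, counts)
print(counts)  # expert loads stay roughly even, with no auxiliary loss term
```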
During pre-training, DeepSeek-V3 consumed 14.8 trillion high-quality tokens and underwent a two-stage context-length extension, first to 32,000 and then to 128,000 tokens. Post-training included supervised fine-tuning (SFT) and reinforcement learning (RL) to align the model with human preferences while balancing accuracy and output length. Thanks to a series of hardware and algorithmic optimizations, the full training run took roughly 2.788 million GPU hours and cost about $5.57 million, a fraction of the hundreds of millions typically spent to pre-train large language models.
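A quick back-of-the-envelope calculation shows how those two figures relate. The per-GPU-hour rental rate used below is an assumption chosen to reproduce the reported total; the article itself only cites the totals.

```python
# Sanity check of the reported training cost under an assumed rental rate.
gpu_hours = 2_788_000      # ~2788K GPU hours reported for the full run
cost_per_gpu_hour = 2.0    # assumed rental price in USD (not from the article)
print(f"${gpu_hours * cost_per_gpu_hour / 1e6:.2f}M")  # -> $5.58M, in line with the cited ~$5.57M
```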
DeepSeek-V3 is now touted as the strongest open-source model available. In comparative evaluations it outperformed other prominent open models such as Llama 3.1-405B and Qwen 2.5-72B, and on most benchmarks it even surpassed the proprietary GPT-4o, trailing OpenAI's model only in a handful of cases.
DeepSeek's results illustrate how quickly open-source AI is closing in on proprietary models. That trend benefits the industry by reducing the likelihood of a single dominant AI player and giving businesses a wider range of options for their technology needs.
The code for DeepSeek-V3 is available on GitHub under an MIT license, while the model itself is released under the company's own model license. Enterprises can try the model through DeepSeek Chat, a ChatGPT-like interface, and access the API for commercial use. Until February 8, API pricing matches that of DeepSeek-V2; after that, it rises to $0.27 per million input tokens and $1.10 per million output tokens.
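For teams budgeting against the post-February-8 rates, a small helper like the one below can estimate monthly spend. The workload numbers are hypothetical and exist only to show the arithmetic.

```python
# Estimate API spend under the post-February-8 prices quoted above.
INPUT_PRICE = 0.27 / 1_000_000   # USD per input token
OUTPUT_PRICE = 1.10 / 1_000_000  # USD per output token

def monthly_cost(input_tokens, output_tokens):
    """Return the monthly bill in USD for the given token volumes."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: 50M input tokens and 10M output tokens per month.
print(f"${monthly_cost(50_000_000, 10_000_000):,.2f}")  # -> $24.50
```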