From 374460b71ffefc65dc19ee377aaa0315fd3bcb0b Mon Sep 17 00:00:00 2001 From: Allie Ehrhart Date: Mon, 17 Feb 2025 21:01:16 +0000 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..bb833a3 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper introduced multiple models, but the main ones among them are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of reasoning within a `<think>` tag, before answering with a final summary.
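To make that concrete, here is a minimal sketch (my own illustration, not code from the paper) of how you might split an R1-style response into its reasoning trace and final answer, assuming the reasoning is wrapped in `<think>...</think>` tags:

```python
import re

def split_r1_response(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer in an R1-style reply.

    Assumes the model emits its reasoning inside <think>...</think>,
    followed by the user-facing summary.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_r1_response(
    "<think>2 + 2 is 4, double-checking... yes.</think>The answer is 4."
)
print(answer)  # "The answer is 4."
```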
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero achieves excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from. +First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing. +Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did model distillation for several Qwen and Llama models on the reasoning traces to obtain distilled-R1 models. A compact summary of these stages follows below.
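For quick reference, here is the same pipeline condensed into a small, purely descriptive Python structure; it encodes the paper's stages as data rather than implementing them, and the sample counts are the approximate figures quoted above:

```python
# The R1 recipe condensed to (stage, method, data / reward signals).
R1_STAGES = [
    ("Cold-start SFT", "supervised fine-tuning",
     "a few thousand CoT samples, on DeepSeek-V3-Base"),
    ("First RL stage", "GRPO",
     "rule-based rewards: reasoning accuracy + thinking-tag formatting"),
    ("Rejection sampling + SFT", "supervised fine-tuning",
     "~600k sampled reasoning traces + ~200k general samples (800k total), on V3-Base"),
    ("Second RL stage", "GRPO",
     "reasoning rewards + helpfulness/harmlessness"),
]

for i, (stage, method, data) in enumerate(R1_STAGES, start=1):
    print(f"{i}. {stage}: {method} -- {data}")
```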
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
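As a rough illustration of what generating distillation data can look like (the model names, prompting, and generation settings here are placeholders, not DeepSeek's actual setup), the teacher simply writes out reasoning traces that become ordinary SFT targets for the student:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "deepseek-ai/DeepSeek-R1"  # placeholder teacher; in practice far too large for one GPU

tokenizer = AutoTokenizer.from_pretrained(TEACHER_ID)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID, device_map="auto")

def make_distillation_example(question: str) -> dict:
    """Have the teacher produce a full reasoning trace for one prompt."""
    inputs = tokenizer(question, return_tensors="pt").to(teacher.device)
    output = teacher.generate(**inputs, max_new_tokens=2048)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # The (prompt, completion) pairs are later used for plain supervised fine-tuning
    # of the smaller student model (e.g. a Qwen or Llama checkpoint); no RL is applied
    # to the student.
    return {"prompt": question, "completion": completion}
```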
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected `<think>` / `<answer>` format, and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
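Here is a minimal, purely illustrative sketch of what such rule-based rewards could look like; the weights, the exact tags, and the exact-match accuracy check are my simplifying assumptions, not DeepSeek's published implementation (which also includes a language-consistency term, omitted here for brevity):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> before the answer, else 0.0."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Exact-match check on the text after the think block; real setups verify math/code instead."""
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    return 1.0 if answer == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an arbitrary choice for illustration.
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>7 * 6 = 42</think>42", "42"))  # 2.0
```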
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses. +2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency. +3. Rewards are adjusted relative to the group's average, essentially measuring how much better each response is compared to the others. +4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior. (A minimal sketch of the group-relative advantage computation in step 3 follows below.)
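Steps 3 and 4 are the "group relative" part: each response's reward is normalized against the mean and standard deviation of its own group, following the formula in the DeepSeekMath paper. A minimal sketch, with sample (rather than population) standard deviation as a small simplification:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all responses scored equally: no learning signal
    return [(r - mu) / sigma for r in rewards]

# Four sampled responses to one prompt, scored by rule-based rewards like the ones above.
print(group_relative_advantages([2.0, 1.0, 0.0, 1.0]))  # ~[1.22, 0.0, -1.22, 0.0]
```

These per-response advantages are then plugged into a PPO-style clipped objective with a KL penalty toward the reference policy, which is what keeps the updates small.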
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
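For reference, the TRL integration looks roughly like the snippet below, which mirrors the style of TRL's documented example at the time of writing; treat the argument names as approximate since they may shift between versions, and the model, dataset, and toy reward function are placeholders:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

def reward_num_unique_chars(completions, **kwargs):
    """Toy rule-based reward; swap in accuracy/format checks for reasoning training."""
    return [float(len(set(c))) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",        # small placeholder policy model
    reward_funcs=reward_num_unique_chars,    # rule-based reward, no learned critic
    args=GRPOConfig(output_dir="grpo-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```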
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.
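One way to see this "sharpening vs. new capability" distinction is to compare single-sample accuracy with pass@k: if the base model already solves a problem within k samples, RL mostly moves that solution to the top rather than adding new solvable problems. A small sketch with hypothetical per-problem sample results:

```python
def single_sample_accuracy(results: list[list[bool]]) -> float:
    """Fraction of problems solved by the first sampled answer."""
    return sum(samples[0] for samples in results) / len(results)

def pass_at_k(results: list[list[bool]]) -> float:
    """Fraction of problems solved by at least one of the k sampled answers."""
    return sum(any(samples) for samples in results) / len(results)

# Hypothetical results: each inner list holds k=4 sampled answers for one problem.
base_model = [[False, True, False, False], [False, False, False, True], [False] * 4]
rl_model   = [[True, True, False, False], [True, False, False, True], [False] * 4]

# RL lifts single-sample accuracy, but pass@k (what the model *can* solve) is unchanged.
print(single_sample_accuracy(base_model), pass_at_k(base_model))  # 0.0  ~0.67
print(single_sample_accuracy(rl_model), pass_at_k(rl_model))      # ~0.67 ~0.67
```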
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running through llama.cpp:
+
29 layers seemed to be the sweet spot given this setup.
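If you want to reproduce this kind of setup programmatically, a sketch via the llama-cpp-python bindings could look like the following; the model path is a placeholder for the first Unsloth GGUF shard, and the KV-cache quantization options are left out since they vary by version:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder path to the GGUF shards
    n_gpu_layers=29,   # partial offload: the 29 layers that fit on the single H100
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```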
+
Performance:
+
A r/localllama user reported being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming rig. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
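For completeness, here is roughly what calling this model through the Ollama Python client looks like; the model tag is the 70B distill as listed on Ollama at the time of writing, so double-check it against your local `ollama list`:

```python
import ollama

# Streamed generation, so you can watch the <think> section arrive before the answer.
stream = ollama.chat(
    model="deepseek-r1:70b",  # 70.6B-parameter distill, 4-bit quantized by default
    messages=[{"role": "user", "content": "Explain GRPO in two sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```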
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube). +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs. +The Illustrated DeepSeek-R1 - by Jay Alammar. +Explainer: What's R1 & Everything Else? - Tim Kellogg. +DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +GitHub - deepseek-ai/DeepSeek-R1. +deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques. +DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective. +DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling. +DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. +DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25). +- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file