Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a tag before answering with a final summary.
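To make the output shape concrete, here is a minimal sketch of splitting such a response into its reasoning block and final answer. It assumes the reasoning is wrapped in `<think>...</think>` tags, which is how the released R1 models are commonly reported to format their output; treat the parsing details as illustrative rather than official.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, final_answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>
    before the final summary (an assumption about the output format).
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    final_answer = response[match.end():].strip()
    return reasoning, final_answer

example = "<think>2 + 2 = 4, and 4 * 3 = 12.</think>The answer is 12."
reasoning, answer = split_reasoning(example)
print(reasoning)  # 2 + 2 = 4, and 4 * 3 = 12.
print(answer)     # The answer is 12.
```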
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and several RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline deviates from the usual one:

The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.
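As a rough sketch of the idea (not DeepSeek's actual tooling): the teacher generates reasoning traces for a set of prompts, and those traces become ordinary supervised fine-tuning data for the student. `teacher_generate` below is a hypothetical placeholder for whatever API serves the teacher model.

```python
from typing import Callable

def build_distillation_set(
    prompts: list[str],
    teacher_generate: Callable[[str], str],  # hypothetical hook to the teacher model
) -> list[dict]:
    """Collect teacher completions (including reasoning traces) as SFT examples."""
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)  # e.g. "<think>...</think> final answer"
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# The student is then trained with plain supervised fine-tuning (next-token
# cross-entropy) on these examples; the distilled-R1 models are described as
# being trained this way, without an additional RL stage.
dummy_teacher = lambda p: f"<think>reasoning about: {p}</think> final answer"
print(build_distillation_set(["What is 2 + 2?"], dummy_teacher))
```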
Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
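To make this concrete, here is a minimal sketch of what such a rule-based reward could look like. The specific checks, weights, and the reference-answer comparison are illustrative assumptions, not DeepSeek's actual reward code.

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward combining correctness, format, and language checks.

    The individual rules and weights are illustrative assumptions.
    """
    reward = 0.0

    # Format: the chain-of-thought should sit inside think tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # Correctness: compare the part after the think block to a reference answer.
    final_part = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    if reference_answer.strip() in final_part:
        reward += 1.0

    # Language consistency (crude proxy): an ASCII-only prompt should get a
    # mostly-ASCII response.
    if prompt.isascii() and response and sum(c.isascii() for c in response) < 0.9 * len(response):
        reward -= 0.5

    return reward

print(rule_based_reward(
    "What is 7 * 6?",
    "<think>7 * 6 = 42</think> The answer is 42.",
    "42",
))  # 1.5
```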
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior (a small numerical sketch of these steps follows below).
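A minimal numerical sketch of steps 1-4, assuming a single prompt with a group of toy rewards. The group-relative advantage (reward minus the group mean, divided by the group standard deviation) follows the GRPO formulation; the clipped surrogate below is schematic, works on whole-sequence log-probabilities, and omits the KL penalty and the actual model update.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Step 3: normalize each reward against its own group (no critic needed)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(
    advantages: np.ndarray,
    new_logprobs: np.ndarray,  # log-prob of each sampled response under the updated policy
    old_logprobs: np.ndarray,  # log-prob of each sampled response when it was sampled
    clip_eps: float = 0.2,
) -> float:
    """Step 4: PPO-style clipped objective that keeps updates small (KL term omitted)."""
    ratios = np.exp(new_logprobs - old_logprobs)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# Steps 1-2: suppose the model sampled four responses to one prompt and the
# rule-based reward scored them like this.
rewards = np.array([1.5, 0.5, 1.5, -0.5])
advantages = group_relative_advantages(rewards)
print(advantages)  # responses above the group mean get positive advantage

old_lp = np.array([-1.2, -0.9, -1.5, -1.1])
new_lp = np.array([-1.0, -1.0, -1.3, -1.2])
print(clipped_surrogate(advantages, new_lp, old_lp))  # quantity to maximize
```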
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the required syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, the improvement seems attributable to boosting the correct response from TopK rather than to an enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
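One way to see the "TopK" point concretely is to compare pass@K (is the correct answer anywhere among K samples?) with a majority-style metric over the same samples. The simulation below uses made-up numbers and is only meant to illustrate the shape of the argument: sharpening the distribution raises majority accuracy a lot while barely moving pass@K.

```python
import numpy as np

rng = np.random.default_rng(0)

def pass_at_k(correct: np.ndarray) -> float:
    """Fraction of problems where at least one of the K samples is correct."""
    return float(correct.any(axis=1).mean())

def majority_at_k(correct: np.ndarray) -> float:
    """Fraction of problems where more than half of the K samples are correct."""
    k = correct.shape[1]
    return float((correct.sum(axis=1) > k / 2).mean())

# Made-up setup: 1000 problems, K=16 samples each. 80% of problems are
# solvable by the base model at all. The "base" policy samples the correct
# answer rarely (p=0.2); the "RL-tuned" policy samples it often (p=0.7) on
# the same solvable set -- the distribution is sharpened, not expanded.
solvable = rng.random(1000) < 0.8
base = (rng.random((1000, 16)) < 0.2) & solvable[:, None]
tuned = (rng.random((1000, 16)) < 0.7) & solvable[:, None]

print("pass@16  base:", pass_at_k(base), " tuned:", pass_at_k(tuned))          # nearly identical
print("maj@16   base:", majority_at_k(base), " tuned:", majority_at_k(tuned))  # much higher after tuning
```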
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.

671B via Llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
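For reference, a rough sketch of this kind of setup through the llama-cpp-python bindings. The model path is a placeholder, `n_gpu_layers=29` mirrors the partial offloading above, and the exact options for a 4-bit quantized KV-cache vary between llama.cpp builds, so treat the arguments as assumptions to check against your version.

```python
from llama_cpp import Llama

# Placeholder path to the Unsloth 1.58-bit UD-IQ1_S GGUF file(s).
llm = Llama(
    model_path="path/to/DeepSeek-R1-UD-IQ1_S.gguf",
    n_gpu_layers=29,  # partial offload: 29 layers on the GPU, the rest on CPU
    n_ctx=4096,       # modest context window to keep memory usage in check
    # KV-cache quantization flags differ between versions; consult your build
    # for the 4-bit K/V cache type options.
)

out = llm("Question: what is 7 * 6? Think step by step.", max_tokens=256)
print(out["choices"][0]["text"])
```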
Performance:

An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671b fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
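If you want to reproduce a similar run, a minimal sketch against Ollama's local HTTP API could look like the following. The model tag `deepseek-r1:70b` and the request fields are my assumptions about current Ollama conventions, so verify them against your installed version.

```python
import json
import urllib.request

# Assumes `ollama pull deepseek-r1:70b` has been run and the Ollama server is
# listening on its default local address.
payload = {
    "model": "deepseek-r1:70b",
    "prompt": "What is 7 * 6? Think step by step.",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```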
Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.