From 830844f15c6c3bfb78c0c739de0bf941ef67d374 Mon Sep 17 00:00:00 2001 From: Armando Le Grand Date: Sun, 9 Feb 2025 20:23:26 +0000 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..e78e91f --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper introduced multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't cover here.
+
DeepSeek-R1 builds on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
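To make the format concrete, here is a minimal sketch (my own illustration, not code from the paper) of splitting an R1-style response into its chain-of-thought and final answer, assuming the reasoning is wrapped in `<think>...</think>` tags:

```python
import re

def split_r1_response(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer in an R1-style output.

    Assumes the reasoning is wrapped in <think>...</think> tags, with the
    final summary following the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_r1_response(
    "<think>The user asks for 2 + 2, which is 4.</think>\nThe answer is 4."
)
print(answer)  # "The answer is 4."
```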
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting how their training pipeline deviates from the usual approach:
+
The usual training strategy: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from. +First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as requiring the chain-of-thought to appear inside thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing. +Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did model distillation for several Qwen and Llama models on the reasoning traces to get the distilled-R1 models. (A compact pseudocode sketch of the main stages follows below.)
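The sketch below compresses these stages into hedged pseudocode. The `sft` and `grpo` helpers are placeholders standing in for full training runs, not DeepSeek's actual code:

```python
# Hedged pseudocode of the multi-stage pipeline above; the helpers only record
# what happens at each stage rather than performing any training.
def sft(model: str, data: str) -> str:
    return f"SFT({model}, data={data})"

def grpo(model: str, rewards: str) -> str:
    return f"GRPO({model}, rewards={rewards})"

base = "DeepSeek-V3-Base"

cold_start = sft(base, "a few thousand cold-start CoT samples")       # stage 1
reasoner = grpo(cold_start, "rule-based: correctness + formatting")   # stage 2
sft_mix = "600k rejection-sampled reasoning + 200k general samples"   # stage 3
general = sft(base, sft_mix)                                          # stage 4
deepseek_r1 = grpo(general, "reasoning + helpfulness/harmlessness")   # stage 5

print(deepseek_r1)
```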
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in conventional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
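As an illustration, a rule-based reward along these lines could look like the following minimal sketch. This is my own simplification of the checks described in the paper, not DeepSeek's actual reward function, and the weights are arbitrary:

```python
import re

def rule_based_reward(prompt_lang: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward combining correctness, format, and language consistency."""
    reward = 0.0
    final_part = completion.split("</think>")[-1]

    # 1. Correctness: the text after the thinking block must contain the reference answer.
    if reference_answer.strip() in final_part:
        reward += 1.0

    # 2. Format: the reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that an English prompt gets a CJK-free answer.
    if prompt_lang == "en" and not re.search(r"[\u4e00-\u9fff]", final_part):
        reward += 0.25

    return reward

print(rule_based_reward("en", "<think>3 * 7 = 21</think>\nThe answer is 21.", "21"))  # 1.75
```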
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses. +2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency. +3. Rewards are adjusted relative to the group's average, essentially measuring how much better each response is compared to the others. +4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior. (A minimal sketch of the group-relative advantage from step 3 follows below.)
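This is my own illustration of that advantage computation, based on the GRPO formulation in the DeepSeekMath paper; the clipping and KL-penalty terms of the full objective are omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each response's reward against its own group (step 3 above).

    GRPO uses the group statistics as the baseline instead of a learned critic:
    A_i = (r_i - mean(r)) / std(r).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled responses to the same prompt, scored by a rule-based reward:
rewards = [1.75, 0.5, 1.0, 0.0]
print(group_relative_advantages(rewards))
# The best response gets a positive advantage, the worst a negative one.
```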
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the thinking-tag syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
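To give a sense of what that looks like in practice, here is a hedged sketch of a GRPO run with TRL's GRPOTrainer. The exact API may differ between TRL versions, and the model, dataset, and reward function are toy placeholders:

```python
# Requires: pip install trl datasets
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_format(completions, **kwargs):
    # Toy rule-based reward: 1.0 if the completion contains a closed thinking block.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model so the sketch stays cheap
    reward_funcs=reward_format,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=dataset,
)
trainer.train()
```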
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the variety of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL methods such as PPO and GRPO can produce significant performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.
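A toy numerical illustration of that claim (my own example, not from either paper): the same candidate answers exist before and after RL, and only the probability mass shifts.

```python
# The correct answer is already among the model's top candidates before RL; RL
# mostly moves probability mass onto it, improving greedy (pass@1) accuracy
# without adding answers that were unreachable before.
before = {"correct": 0.25, "wrong_a": 0.30, "wrong_b": 0.25, "wrong_c": 0.20}
after = {"correct": 0.70, "wrong_a": 0.15, "wrong_b": 0.10, "wrong_c": 0.05}

def greedy_is_correct(dist: dict[str, float]) -> bool:
    return max(dist, key=dist.get) == "correct"

print(greedy_is_correct(before), greedy_is_correct(after))  # False True
```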
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
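For reference, here is roughly how such a partially offloaded run can be set up through the llama-cpp-python bindings. This is a hedged sketch: the model path is illustrative, the prompt is a placeholder, and the KV-cache quantization options from the original run are omitted:

```python
# Requires: pip install llama-cpp-python (built with CUDA support)
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,  # offload 29 layers to the H100, keep the rest on the CPU
    n_ctx=4096,       # modest context window to keep the KV-cache manageable
    n_threads=26,     # match the available CPU cores
)

output = llm(
    "How many r's are in the word strawberry?",  # in practice, apply the DeepSeek-R1 chat template
    max_tokens=256,
)
print(output["choices"][0]["text"])
```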
+
Performance:
+
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B through Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
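For completeness, here is a hedged sketch of querying that model programmatically through the Ollama Python client. The model tag is an assumption; use whatever `ollama list` reports on your machine:

```python
# Requires: pip install ollama, with an Ollama server running locally
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",  # assumed tag for the 4-bit 70B distill
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
)

# The reply contains the <think> block followed by the final answer.
print(response["message"]["content"])
```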
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube) +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs +The Illustrated DeepSeek-R1 - by Jay Alammar +Explainer: What's R1 & Everything Else? - Tim Kellogg +DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +GitHub - deepseek-ai/DeepSeek-R1 +deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques. +DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective. +DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling. +DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. +DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25). +- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used en route to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file