From 830844f15c6c3bfb78c0c739de0bf941ef67d374 Mon Sep 17 00:00:00 2001 From: Armando Le Grand Date: Sun, 9 Feb 2025 20:23:26 +0000 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..e78e91f --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper introduced multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't cover here.
+
DeepSeek-R1 builds on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
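To make the format concrete, here is a minimal sketch (my own illustration, not code from the paper) of splitting an R1-style response into its chain-of-thought and final answer, assuming the reasoning is wrapped in `<think>...</think>` tags:

```python
import re

def split_r1_response(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer in an R1-style output.

    Assumes the reasoning is wrapped in <think>...</think> tags, with the
    final summary following the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_r1_response(
    "<think>The user asks for 2 + 2, which is 4.</think>\nThe answer is 4."
)
print(answer)  # "The answer is 4."
```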
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting how their training pipeline deviates from the usual approach:
+
The usual training strategy: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from. +First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as requiring the chain-of-thought to appear inside thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing. +Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did model distillation for several Qwen and Llama models on the reasoning traces to get the distilled-R1 models. (A compact pseudocode sketch of the main stages follows below.)
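The sketch below compresses these stages into hedged pseudocode. The `sft` and `grpo` helpers are placeholders standing in for full training runs, not DeepSeek's actual code:

```python
# Hedged pseudocode of the multi-stage pipeline above; the helpers only record
# what happens at each stage rather than performing any training.
def sft(model: str, data: str) -> str:
    return f"SFT({model}, data={data})"

def grpo(model: str, rewards: str) -> str:
    return f"GRPO({model}, rewards={rewards})"

base = "DeepSeek-V3-Base"

cold_start = sft(base, "a few thousand cold-start CoT samples")       # stage 1
reasoner = grpo(cold_start, "rule-based: correctness + formatting")   # stage 2
sft_mix = "600k rejection-sampled reasoning + 200k general samples"   # stage 3
general = sft(base, sft_mix)                                          # stage 4
deepseek_r1 = grpo(general, "reasoning + helpfulness/harmlessness")   # stage 5

print(deepseek_r1)
```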
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in conventional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
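As an illustration, a rule-based reward along these lines could look like the following minimal sketch. This is my own simplification of the checks described in the paper, not DeepSeek's actual reward function, and the weights are arbitrary:

```python
import re

def rule_based_reward(prompt_lang: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward combining correctness, format, and language consistency."""
    reward = 0.0
    final_part = completion.split("</think>")[-1]

    # 1. Correctness: the text after the thinking block must contain the reference answer.
    if reference_answer.strip() in final_part:
        reward += 1.0

    # 2. Format: the reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that an English prompt gets a CJK-free answer.
    if prompt_lang == "en" and not re.search(r"[\u4e00-\u9fff]", final_part):
        reward += 0.25

    return reward

print(rule_based_reward("en", "<think>3 * 7 = 21</think>\nThe answer is 21.", "21"))  # 1.75
```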
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses. +2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency. +3. Rewards are adjusted relative to the group's average, essentially measuring how much better each response is compared to the others. +4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior. (A minimal sketch of the group-relative advantage from step 3 follows below.)
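This is my own illustration of that advantage computation, based on the GRPO formulation in the DeepSeekMath paper; the clipping and KL-penalty terms of the full objective are omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each response's reward against its own group (step 3 above).

    GRPO uses the group statistics as the baseline instead of a learned critic:
    A_i = (r_i - mean(r)) / std(r).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled responses to the same prompt, scored by a rule-based reward:
rewards = [1.75, 0.5, 1.0, 0.0]
print(group_relative_advantages(rewards))
# The best response gets a positive advantage, the worst a negative one.
```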
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the thinking-tag syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
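To give a sense of what that looks like in practice, here is a hedged sketch of a GRPO run with TRL's GRPOTrainer. The exact API may differ between TRL versions, and the model, dataset, and reward function are toy placeholders:

```python
# Requires: pip install trl datasets
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_format(completions, **kwargs):
    # Toy rule-based reward: 1.0 if the completion contains a closed thinking block.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model so the sketch stays cheap
    reward_funcs=reward_format,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=dataset,
)
trainer.train()
```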
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the variety of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL methods such as PPO and GRPO can produce significant performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.
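A toy numerical illustration of that claim (my own example, not from either paper): the same candidate answers exist before and after RL, and only the probability mass shifts.

```python
# The correct answer is already among the model's top candidates before RL; RL
# mostly moves probability mass onto it, improving greedy (pass@1) accuracy
# without adding answers that were unreachable before.
before = {"correct": 0.25, "wrong_a": 0.30, "wrong_b": 0.25, "wrong_c": 0.20}
after = {"correct": 0.70, "wrong_a": 0.15, "wrong_b": 0.10, "wrong_c": 0.05}

def greedy_is_correct(dist: dict[str, float]) -> bool:
    return max(dist, key=dist.get) == "correct"

print(greedy_is_correct(before), greedy_is_correct(after))  # False True
```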
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
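For reference, here is roughly how such a partially offloaded run can be set up through the llama-cpp-python bindings. This is a hedged sketch: the model path is illustrative, the prompt is a placeholder, and the KV-cache quantization options from the original run are omitted:

```python
# Requires: pip install llama-cpp-python (built with CUDA support)
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,  # offload 29 layers to the H100, keep the rest on the CPU
    n_ctx=4096,       # modest context window to keep the KV-cache manageable
    n_threads=26,     # match the available CPU cores
)

output = llm(
    "How many r's are in the word strawberry?",  # in practice, apply the DeepSeek-R1 chat template
    max_tokens=256,
)
print(output["choices"][0]["text"])
```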
+
Performance:
+
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B through Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
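For completeness, here is a hedged sketch of querying that model programmatically through the Ollama Python client. The model tag is an assumption; use whatever `ollama list` reports on your machine:

```python
# Requires: pip install ollama, with an Ollama server running locally
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",  # assumed tag for the 4-bit 70B distill
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
)

# The reply contains the <think> block followed by the final answer.
print(response["message"]["content"])
```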
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube) +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs +The Illustrated DeepSeek-R1 - by Jay Alammar +Explainer: What's R1 & Everything Else? - Tim Kellogg +DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +GitHub - deepseek-ai/DeepSeek-R1 +deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques. +DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective. +DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling. +DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. +DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25). +- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used en route to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file