From 42ae13f03c00c8ed4d85facf0f76263cd45816d9 Mon Sep 17 00:00:00 2001 From: cortezdrs23678 Date: Sun, 9 Feb 2025 22:08:32 +0800 Subject: [PATCH] Update 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..b028cd4 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
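As a concrete illustration, here is a minimal sketch (not DeepSeek's code; the tag format is taken from the R1 chat convention described above) of how you might split such an output into the reasoning trace and the final answer:

```python
import re

def split_r1_output(text: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, answer).

    Assumes the reasoning is wrapped in <think>...</think> and the
    final summary follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer

example = "<think>2 + 2 is 4, since ...</think>The answer is 4."
print(split_r1_output(example))
```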
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is fascinating how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models. A condensed outline of these stages is sketched below.
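To summarize the flow, here is a purely illustrative outline of these stages in Python. The helper functions are placeholders standing in for the procedures described above, not DeepSeek's actual code.

```python
# Placeholder stage functions; each stands in for a full training procedure.
def sft(base_model, data):            # supervised fine-tuning
    return base_model
def grpo_rl(model, reward_rules):     # GRPO-based RL stage
    return model
def rejection_sample(model, prompts): # keep only high-quality generations
    return []

def train_r1(v3_base, cold_start_cot, general_data, prompts):
    m = sft(v3_base, cold_start_cot)                      # 1. cold-start SFT
    m = grpo_rl(m, ["accuracy", "format"])                # 2. reasoning-focused RL
    reasoning_data = rejection_sample(m, prompts)         # ~600k reasoning samples
    m = sft(v3_base, reasoning_data + general_data)       # 3. SFT on ~800k samples
    m = grpo_rl(m, ["accuracy", "format",
                    "helpfulness", "harmlessness"])       # 4. final RL stage
    return m                                              # DeepSeek-R1
```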
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is usually a larger model than the student.
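A minimal, hypothetical sketch of that idea: the teacher's sampled reasoning traces simply become supervised fine-tuning data for the student. `teacher_generate` and `sft_train` are placeholders, not real APIs.

```python
def distill(prompts, teacher_generate, student, sft_train):
    """Build an SFT dataset from teacher outputs, then fine-tune the student on it."""
    sft_dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)  # e.g. R1's <think>...</think> reasoning + answer
        sft_dataset.append({"prompt": prompt, "completion": trace})
    return sft_train(student, sft_dataset)  # ordinary supervised fine-tuning
```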
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for accuracy but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected thinking/answer format, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
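As an illustration, a rule-based reward along these lines could look like the following minimal sketch. The exact rules, weights, and tag names here are assumptions for illustration, not DeepSeek's implementation.

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency."""
    reward = 0.0

    # 1. Correctness: exact match against a known reference (e.g., a math answer).
    answer = completion.split("</think>")[-1].strip()
    if answer == reference_answer:
        reward += 1.0

    # 2. Format: reasoning must be wrapped in <think>...</think>.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that the answer uses the same script
    #    as the prompt (here: "contains CJK characters or not").
    def has_cjk(s: str) -> bool:
        return bool(re.search(r"[\u4e00-\u9fff]", s))
    if has_cjk(prompt) == has_cjk(answer):
        reward += 0.5

    return reward

print(rule_based_reward("What is 2 + 2?", "<think>2 + 2 = 4</think>4", "4"))  # 2.0
```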
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses.
+2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior. A small sketch of the group-relative advantage computation follows below.
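To make step 3 concrete, here is a minimal sketch of the group-relative advantage computation. The reward values are made up and this is not DeepSeek's code.

```python
import numpy as np

# Rewards for a group of sampled responses to the same prompt (made-up numbers).
group_rewards = np.array([1.0, 0.5, 2.0, 0.0])

# GRPO-style advantage: how much better each response is than the group average,
# normalized by the group's standard deviation (no learned critic involved).
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # responses above the group mean get positive advantages

# The policy update then scales each response's log-probability gradient by its
# advantage, with PPO-style clipping and a KL penalty keeping the update small.
```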
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
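For reference, a minimal GRPO fine-tuning setup with TRL might look roughly like the sketch below. It follows TRL's documented `GRPOTrainer` pattern, but treat the exact argument names as version-dependent; the length-based reward function and the small Qwen base model are just placeholders for illustration.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset works; this small TL;DR set appears in TRL's examples.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions close to 200 characters.
# In an R1-style setup this would check correctness, formatting, language, etc.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

training_args = GRPOConfig(output_dir="grpo-demo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder small model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```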
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the range of correct answers the model can produce) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge. The pass@k-style sketch below illustrates the distinction.
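To illustrate the difference between "boosting the correct answer to the top" and raw capability, here is a small sketch using the standard unbiased pass@k estimator (Chen et al., 2021). The numbers are made up, and this metric is not taken from the R1 paper itself.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: out of 64 sampled answers to a problem, 10 are correct.
# pass@1 reflects how often a single sample succeeds (what RL tends to boost),
# while pass@64 reflects whether the capability exists somewhere in the distribution.
print(round(pass_at_k(64, 10, 1), 3))   # ~0.156
print(round(pass_at_k(64, 10, 64), 3))  # 1.0
```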
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released while I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get roughly 3.5 to 4.25 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected compared to the mostly CPU-powered run of the 671B model that I showed above.
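If you want to poke at a model like this programmatically, a minimal sketch using the Ollama Python client might look like the following. The model tag and the assumption that you have already pulled it locally are mine, not from the original post.

```python
import ollama  # pip install ollama; assumes `ollama pull deepseek-r1:70b` was run

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 12 * 13? Think step by step."}],
)

# R1-style models emit their reasoning inside <think>...</think> before the answer.
print(response["message"]["content"])
```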
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file