From 42ae13f03c00c8ed4d85facf0f76263cd45816d9 Mon Sep 17 00:00:00 2001 From: cortezdrs23678 Date: Sun, 9 Feb 2025 22:08:32 +0800 Subject: [PATCH] Update 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..b028cd4 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
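As a concrete illustration, here is a minimal sketch (not DeepSeek's code; the tag format is taken from the R1 chat convention described above) of how you might split such an output into the reasoning trace and the final answer:

```python
import re

def split_r1_output(text: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, answer).

    Assumes the reasoning is wrapped in <think>...</think> and the
    final summary follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer

example = "<think>2 + 2 is 4, since ...</think>The answer is 4."
print(split_r1_output(example))
```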
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is fascinating how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models. A condensed outline of these stages is sketched below.
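To summarize the flow, here is a purely illustrative outline of these stages in Python. The helper functions are placeholders standing in for the procedures described above, not DeepSeek's actual code.

```python
# Placeholder stage functions; each stands in for a full training procedure.
def sft(base_model, data):            # supervised fine-tuning
    return base_model
def grpo_rl(model, reward_rules):     # GRPO-based RL stage
    return model
def rejection_sample(model, prompts): # keep only high-quality generations
    return []

def train_r1(v3_base, cold_start_cot, general_data, prompts):
    m = sft(v3_base, cold_start_cot)                      # 1. cold-start SFT
    m = grpo_rl(m, ["accuracy", "format"])                # 2. reasoning-focused RL
    reasoning_data = rejection_sample(m, prompts)         # ~600k reasoning samples
    m = sft(v3_base, reasoning_data + general_data)       # 3. SFT on ~800k samples
    m = grpo_rl(m, ["accuracy", "format",
                    "helpfulness", "harmlessness"])       # 4. final RL stage
    return m                                              # DeepSeek-R1
```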
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is usually a larger model than the student.
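A minimal, hypothetical sketch of that idea: the teacher's sampled reasoning traces simply become supervised fine-tuning data for the student. `teacher_generate` and `sft_train` are placeholders, not real APIs.

```python
def distill(prompts, teacher_generate, student, sft_train):
    """Build an SFT dataset from teacher outputs, then fine-tune the student on it."""
    sft_dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)  # e.g. R1's <think>...</think> reasoning + answer
        sft_dataset.append({"prompt": prompt, "completion": trace})
    return sft_train(student, sft_dataset)  # ordinary supervised fine-tuning
```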
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for accuracy but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected thinking/answer format, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
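As an illustration, a rule-based reward along these lines could look like the following minimal sketch. The exact rules, weights, and tag names here are assumptions for illustration, not DeepSeek's implementation.

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency."""
    reward = 0.0

    # 1. Correctness: exact match against a known reference (e.g., a math answer).
    answer = completion.split("</think>")[-1].strip()
    if answer == reference_answer:
        reward += 1.0

    # 2. Format: reasoning must be wrapped in <think>...</think>.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that the answer uses the same script
    #    as the prompt (here: "contains CJK characters or not").
    def has_cjk(s: str) -> bool:
        return bool(re.search(r"[\u4e00-\u9fff]", s))
    if has_cjk(prompt) == has_cjk(answer):
        reward += 0.5

    return reward

print(rule_based_reward("What is 2 + 2?", "<think>2 + 2 = 4</think>4", "4"))  # 2.0
```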
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses.
+2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior. A small sketch of the group-relative advantage computation follows below.
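To make step 3 concrete, here is a minimal sketch of the group-relative advantage computation. The reward values are made up and this is not DeepSeek's code.

```python
import numpy as np

# Rewards for a group of sampled responses to the same prompt (made-up numbers).
group_rewards = np.array([1.0, 0.5, 2.0, 0.0])

# GRPO-style advantage: how much better each response is than the group average,
# normalized by the group's standard deviation (no learned critic involved).
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # responses above the group mean get positive advantages

# The policy update then scales each response's log-probability gradient by its
# advantage, with PPO-style clipping and a KL penalty keeping the update small.
```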
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
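For reference, a minimal GRPO fine-tuning setup with TRL might look roughly like the sketch below. It follows TRL's documented `GRPOTrainer` pattern, but treat the exact argument names as version-dependent; the length-based reward function and the small Qwen base model are just placeholders for illustration.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset works; this small TL;DR set appears in TRL's examples.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions close to 200 characters.
# In an R1-style setup this would check correctness, formatting, language, etc.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

training_args = GRPOConfig(output_dir="grpo-demo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder small model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```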
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the range of correct answers the model can produce) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge. The pass@k-style sketch below illustrates the distinction.
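To illustrate the difference between "boosting the correct answer to the top" and raw capability, here is a small sketch using the standard unbiased pass@k estimator (Chen et al., 2021). The numbers are made up, and this metric is not taken from the R1 paper itself.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: out of 64 sampled answers to a problem, 10 are correct.
# pass@1 reflects how often a single sample succeeds (what RL tends to boost),
# while pass@64 reflects whether the capability exists somewhere in the distribution.
print(round(pass_at_k(64, 10, 1), 3))   # ~0.156
print(round(pass_at_k(64, 10, 64), 3))  # 1.0
```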
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released while I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get roughly 3.5 to 4.25 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected compared to the mostly CPU-powered run of the 671B model that I showed above.
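If you want to poke at a model like this programmatically, a minimal sketch using the Ollama Python client might look like the following. The model tag and the assumption that you have already pulled it locally are mine, not from the original post.

```python
import ollama  # pip install ollama; assumes `ollama pull deepseek-r1:70b` was run

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 12 * 13? Think step by step."}],
)

# R1-style models emit their reasoning inside <think>...</think> before the answer.
print(response["message"]["content"])
```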
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file