Notes from a hands-on run of Open-R1's GRPO implementation, based on Qwen-1.5B-Instruct.

# **Environment Setup**

Hugging Face's open-source open-r1 project can be used to run GRPO training.

Run the following commands in a terminal to install the required environment:

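```bash
git clone https://github.com/huggingface/open-r1/
cd open-r1
# Editable install with the dev extras (the repo ships setup.py; there is no set_up.py script)
pip install -e ".[dev]"
```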

---

# **Introduction to GRPOTrainer**

This tutorial mainly relies on [Hugging Face](https://github.com/huggingface/trl/)'s `trl` library (similar in style to `transformers`) and uses its `GRPOTrainer` for training.

`GRPOTrainer` implements **Group Relative Policy Optimization (GRPO)**, a reinforcement learning training method first proposed in the paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300).
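
To make the "group relative" part concrete, here is a minimal sketch of the advantage computation that GRPO is built around: several completions are sampled for the same prompt, each one is scored, and every reward is normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from `trl`, whose internals may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group.

    rewards: list of floats, one per sampled completion of the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two were rewarded, two were not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average get a positive advantage and are reinforced, while those below average get a negative one. When all completions in a group receive the same reward, every advantage is zero and that group contributes no learning signal.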

---



# **GRPOTrainer Example**

Below is an official `GRPOTrainer` example:

```python
from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_func(completions, **kwargs):
    # Dummy reward function that rewards completions with more unique letters.
    return [float(len(set(completion))) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    train_dataset=dataset,
)

trainer.train()