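**Group Relative Policy Optimization (GRPO)** is a reinforcement learning training method first proposed in the paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). `GRPOTrainer` is the trainer class in the `trl` library that implements it.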
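As a rough illustration of the "group relative" idea in GRPO: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. The sketch below is a minimal, library-free version of that normalization, not the `trl` implementation. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the statistics of its own group.

    `rewards` holds the rewards of all completions sampled for ONE prompt.
    `eps` (an assumed value) guards against division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 3.0, 3.0, 5.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average get a positive advantage and those below it get a negative one, so no separate value model is needed to provide a baseline.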
---

# **GRPOTrainer Example**
The following is an official `GRPOTrainer` example:
```python
from datasets import load_dataset
from trl import GRPOTrainer

def reward_func(completions, **kwargs):
    # Dummy reward function that rewards completions with more unique letters.
    return [float(len(set(completion))) for completion in completions]