Reinforcement Learning from Human Feedback (RLHF) and Large Language Models (LLMs): The Magic Sauce behind ChatGPT
How does OpenAI train LLMs using Feedback from Human Reviewers?
If you've ever been intrigued by the world of generative AI and natural language processing, you've probably come across two buzzwords: Reinforcement Learning from Human Feedback (RLHF) and Large Language Models (LLMs). But what do these terms mean? And how do they come together to create advanced models like ChatGPT? Here's a deep dive into the concepts and their synergy.
The Genesis of RLHF: Two Fields, One Goal
In the beginning, reinforcement learning (RL) and natural language processing (NLP) were two distinct fields, each progressing independently. Reinforcement learning revolves around training a model to make informed decisions by rewarding successful outcomes and penalizing erroneous ones; it has primarily been used in applications like games and simulated environments. Natural language processing, on the other hand, is all about understanding and interpreting human language.
However, the inherent complexity of human language posed significant hurdles to applying RL in NLP. This is where RLHF comes into play: a strategy that leverages the strengths of reinforcement learning to improve natural language models.
Reinforcement Learning and Natural Language Processing: An Unusual Blend
The magic began when researchers started experimenting with combining RL and NLP. They realized that RL techniques could be used to amplify NLP's efficacy, leading to the inception of "Reinforcement Learning from Human Feedback", or RLHF.
To paint a clearer picture, let's imagine training a dog. You reward it with a treat when it correctly responds to the command "sit". Over time, the dog learns to associate the reward (positive reinforcement) with the correct action, thus improving its behaviour. The same principle is applied in RLHF, but in this case, the subject is a model, and the task is generating human-like text.
Phases of RLHF
The RLHF process is akin to teaching an intelligent parrot to communicate. There are four key stages:
Pre-Training: The parrot first learns to mimic human language by listening to various conversations, much like how Large Language Models (LLMs) such as GPT-4 learn by analyzing vast amounts of text data.
Supervised Fine-Tuning: You then actively guide the parrot to speak more coherently, similar to how human trainers provide curated example responses that are used to fine-tune the model (a minimal code sketch of this stage follows the list).
Reward Modelling: When the parrot gives two different responses, you reward the better one; in the model's case, these comparisons are used to train a separate reward model that predicts which responses humans prefer.
Reinforcement Learning from Human Feedback (RLHF): The parrot learns to improve its conversational skills over time through iterative feedback and interaction.
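To make the supervised fine-tuning stage concrete, here is a minimal sketch using the HuggingFace transformers library, assuming a handful of hand-written prompt/response demonstrations; the model choice (GPT2), the examples, and the hyperparameters are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM behaves the same way here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical demonstration data: each string is a prompt plus the desired response.
examples = [
    "Q: What is the capital of France?\nA: The capital of France is Paris.",
    "Q: Name a primary colour.\nA: Red is a primary colour.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM objective: the labels are the input ids themselves,
        # so the model learns to reproduce the demonstrated response.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice this stage runs over many thousands of curated demonstrations, but the objective stays the same: ordinary next-token prediction on the responses we want the model to imitate.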
Role of Reward Modelling
Reward modelling is pivotal to RLHF. To create a reward model, human reviewers rank responses to various prompts based on quality and relevance. From these rankings, the reward model learns to assign higher scores to better responses, and the language model is progressively steered toward generating more responses of that kind, enhancing its overall performance.
A significant aspect of reward modelling is the creation of comparison data. Here, multiple responses to the same prompt are compared based on their appropriateness. For instance, given the prompt "What is the capital of France?", the response "The capital of France is Paris" would rank higher than "The capital of France is Berlin". This ranking tells the reward model which responses deserve higher rewards.
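A common way to train such a reward model is with a pairwise ranking loss over the preferred and rejected responses. The sketch below shows just that loss in isolation, assuming the reward model has already produced a scalar score for each response; the scores are made-up numbers rather than outputs of a real model.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: push the chosen response's score
    # above the rejected response's score.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Made-up scores for the "Paris" vs. "Berlin" answers from the example above.
score_paris = torch.tensor([2.3])
score_berlin = torch.tensor([-0.7])
print(pairwise_loss(score_paris, score_berlin))  # small loss: the ranking is already respected
```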
However, evaluating the quality of responses can be subjective and depends on context. A humorous response might be suitable for a casual conversation but could be inappropriate in a formal context. Thus, achieving a balance in response quality can be challenging.
Reinforcement Learning from Human Feedback (RLHF)
RLHF represents the final stage, where the model learns from feedback. The model generates multiple responses to a given prompt, and human evaluators rank these responses based on quality. This ranking is then used to update the model's parameters, leading to the generation of better responses in the future.
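As a rough illustration of this stage, the sketch below samples several candidate responses for a single prompt with GPT2 and converts a placeholder "human" ranking into scalar rewards; the ranking and the reward scale are arbitrary assumptions, and the actual parameter update (typically done with PPO) is left out.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several candidate responses for the same prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    max_new_tokens=20,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Pretend a human ranked the candidates from best to worst (placeholder indices).
ranking = [0, 2, 1]

# Map ranks to scalar rewards: the best candidate gets the highest reward.
rewards = [0.0] * len(candidates)
for rank, idx in enumerate(ranking):
    rewards[idx] = float(len(candidates) - 1 - rank)
# These rewards would then drive a PPO-style update of the model's parameters.
```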
The Issue of Hallucinations
In the realm of LLMs, "hallucination" refers to instances where the model generates information that is not grounded in the original input or in fact. There are two primary hypotheses for why this occurs:
Lack of causal understanding: LLMs may not understand the causal relationship between their inputs and outputs, leading them to generate ungrounded or out-of-context responses.
Mismatch of knowledge: There may be a gap between what the human labeller knows and what the model knows, leading the model to produce outputs that make it appear knowledgeable about concepts it doesn't fully understand.
To mitigate this, one proposed solution is to design a better reward function in the RLHF process. The model could be penalized for generating "hallucinations", encouraging it to stick more closely to its training data.
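Mechanically, such a penalty could be folded into the reward signal. The sketch below is one hypothetical way to do it: base_reward stands in for the learned reward model's score, is_supported for some factuality check, and the penalty weight is arbitrary.

```python
def shaped_reward(base_reward: float, is_supported: bool, penalty: float = 2.0) -> float:
    """Combine a learned reward score with a hallucination penalty (illustrative)."""
    # Start from the reward model's score for the response...
    reward = base_reward
    # ...and subtract a fixed penalty when a (hypothetical) factuality check
    # flags the response as unsupported.
    if not is_supported:
        reward -= penalty
    return reward

# A fluent but unsupported answer loses most of its reward.
print(shaped_reward(base_reward=1.5, is_supported=False))  # -0.5
```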
Progressing to RLHF
RLHF is best introduced when the model struggles with certain types of prompts even after repeated supervised fine-tuning. Consider a language model trained to write poems. Initially, the model is fine-tuned on a dataset of poems, and if it generates satisfactory poems most of the time, you simply continue with supervised fine-tuning.
However, if the model keeps struggling with, say, haikus, the next step could be to create comparison data, where each example consists of a prompt, a better response, and a worse response. If computational resources allow, this comparison data is used to train a reward model, and RLHF is then applied to fine-tune the model so that it responds better to haiku prompts.
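For illustration, the comparison data for the haiku case could be stored as simple prompt/chosen/rejected records like the ones below; the field names follow a common convention but are an assumption here, and the example texts are invented.

```python
# Illustrative comparison records for haiku prompts (made-up data).
comparison_data = [
    {
        "prompt": "Write a haiku about autumn.",
        "chosen": "Crisp leaves drift and fall / a thin mist clings to the field / geese call out, then gone",
        "rejected": "Autumn is a season when leaves fall from the trees and it gets colder.",
    },
    {
        "prompt": "Write a haiku about the sea.",
        "chosen": "Grey waves fold on stone / salt wind combs the beach grass flat / gulls hang in the light",
        "rejected": "The sea is big and blue and has many fish in it all year round.",
    },
]
```

Each record then supplies one training pair for the reward model, in the same spirit as the pairwise loss sketched earlier.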
RLHF Notebook: Try it out yourself!
We have created a notebook that shows you how to fine-tune a GPT2 model using RLHF. We use a dataset of response pairs generated by the falcon-7b model, choose the "best" response of each pair, and train a reward model on these comparisons. The reward model is then used, with a PPO training policy, to fine-tune the GPT2 model. We do all of this with the trl library from HuggingFace. Try it out using the notebook above!
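To give a flavour of what the notebook does, here is a heavily simplified sketch of a single PPO step with trl. It is not the notebook's exact code: the reward is a hard-coded placeholder (in the notebook it comes from the trained reward model), and the API shown follows older trl releases, so names may differ in newer versions.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Tiny batch so a single (query, response, reward) triple is a full PPO batch.
config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = "What is the capital of France?"
query_tensor = tokenizer.encode(query, return_tensors="pt")[0]

# Sample a response from the current policy.
response_tensor = ppo_trainer.generate(
    [query_tensor],
    return_prompt=False,
    do_sample=True,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)[0]

# Placeholder reward; in the notebook this score comes from the trained reward model.
reward = [torch.tensor(1.0)]

# One PPO optimisation step on this (query, response, reward) triple.
stats = ppo_trainer.step([query_tensor], [response_tensor], reward)
```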
Final Thoughts: The Power of RLHF and LLMs
RLHF and LLMs constitute the magic sauce behind models like ChatGPT. While the challenges they pose are non-trivial, the significant advancements they have enabled in the field of NLP are undeniable. By continuously refining these techniques and addressing their shortcomings, we move one step closer to building powerful tools for solving complex real-world problems.
Through the stages of RLHF, combined with the raw power of large-scale pre-training, LLMs can generate human-like text, comprehend the nuances of language, and learn to improve over time based on human feedback. This magic sauce isn't just transforming how we approach LLMs; it's redefining the realms of possibility in this field.