What sorts of tasks are good targets for reinforcement learning?

In short, any task where you can define what counts as good and bad. If you can measure it, RunRL can help you improve it; we've seen this work across a wide range of tasks.
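To make "if you can measure it" concrete, here is a minimal, hypothetical sketch of a programmatic reward function. The `(prompt, completion) -> float` signature and the JSON-validity check are illustrative assumptions, not RunRL's exact interface:

```python
# Hypothetical reward function: score each completion between 0 and 1.
# This particular check rewards outputs that parse as JSON and contain
# a required field; swap in whatever measurement matters for your task.
import json

def reward(prompt: str, completion: str) -> float:
    # `prompt` is available here for prompt-dependent checks as well.
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparseable output gets no reward
    # Partial credit for valid JSON, full credit if the required field is present.
    return 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.5
```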

What base model should I use?

A lot of RL's recent success comes from how good base models have gotten: on many benchmarks, Llama2-7B was better than the original LLaMA-65B, Llama3-8B was better than Llama2-70B, and Qwen3 is better than all of them.
In general, it makes sense to go with the smallest base model that can readily learn your task, since larger models come with higher training and inference costs.
That said, there are a few important considerations. A good rule of thumb: if a model can already do a task 25% of the time, there exists some combination of RL post-training and test-time scaling that gets it to 100%.
So you probably want to start with a model that can already somewhat do your task. In some cases, you'll want a model that has undergone some supervised fine-tuning, especially if the output format is too tricky to learn reliably from in-context examples alone. This is another reason evals are important: they tell you whether a candidate model is already in that "can somewhat do it" range.
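If it helps, here's a hypothetical sketch of how you might measure that baseline pass rate before committing to a base model. `generate_fn` and `is_correct` are stand-ins for your own inference call and task-specific grader, not part of any particular API:

```python
# Hypothetical baseline eval: estimate how often a candidate model already
# solves the task, to apply the "~25% means RL can likely get you there" heuristic.
from typing import Callable

def baseline_pass_rate(
    prompts: list[str],
    generate_fn: Callable[[str], str],       # your model inference call
    is_correct: Callable[[str, str], bool],  # your task-specific grader
    samples_per_prompt: int = 4,
) -> float:
    """Fraction of sampled completions the model gets right across all prompts."""
    successes, total = 0, 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate_fn(prompt)
            successes += is_correct(prompt, completion)
            total += 1
    return successes / total
```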

Classic warning on reward hacking

A good thing to keep in mind with RL is that if there exists a really stupid solution that gets high reward, the model will likely think of it.
One common way for this to happen is by defining a reward function that is independent of the prompt. For instance, we ask a model to write good email responses, and we reward it purely on email quality. Well, the model might then learn a single good email and always output that one, regardless of the email it's responding to!
A way to get around this is to add an LLM-as-judge term to the reward: for instance, we make a request to Anthropic's Claude 3.7 Sonnet and ask it to rate whether the email response was relevant to the previous email.
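A minimal sketch of such a judge term, using Anthropic's Python SDK; the judge prompt, the 1-to-10 scale, and the exact model id are assumptions for illustration, not a prescribed setup:

```python
# Hypothetical LLM-as-judge reward term for the email-response example.
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
JUDGE_MODEL = "claude-3-7-sonnet-latest"  # substitute whichever Claude model id you use

def relevance_reward(incoming_email: str, drafted_reply: str) -> float:
    """Ask a judge model whether the reply is relevant to the email it answers."""
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1 to 10 how relevant this reply is to the email it "
                "responds to. Answer with a single number.\n\n"
                f"Email:\n{incoming_email}\n\nReply:\n{drafted_reply}"
            ),
        }],
    )
    match = re.search(r"\d+", response.content[0].text)
    score = int(match.group()) if match else 1
    return score / 10.0  # normalize to [0, 1] so it can be combined with other terms
```

You would then combine this relevance term with the original quality score (for example, weighting the two equally), so a single canned reply can no longer max out the reward.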

What's the difference between RL and supervised fine-tuning (SFT)?

The difference is twofold. First, supervised fine-tuning is, as its name suggests, supervised: you already need examples of the task being done well to train on.

Second, a model trained with SFT can't do better than the data it's trained on; a model trained with RL can, in theory, do arbitrarily well!