RL Reward Hacking

Learn what Reward Hacking is in Reinforcement Learning and how to counter it.

The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat: when the RL algorithm learns a trick or exploits a loophole to increase the reward without actually doing the task at hand, this is called "Reward Hacking".

Reward hacking is the reason models learn to modify unit tests to pass coding challenges, and these behaviors are critical blockers for real-world deployment. Wikipedia lists some other good examples.

Can you counter reward hacking? Yes! In our free gpt-oss RL notebook we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common failure modes. We saw the model edit the timing function, outsource the work to other libraries, cache the results, and outright cheat. After countering these, our model generates genuinely optimized matrix multiplication kernels, not clever cheats.

🏆 Reward Hacking Overview

Some common examples of reward hacking during RL include:

Laziness

RL learns to use NumPy, PyTorch, and other libraries, which call optimized CUDA kernels. We can stop the RL algorithm from calling pre-optimized code by inspecting whether the generated code imports non-standard Python libraries, as in the sketch below.
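
As an illustration, here is a minimal sketch of such an import check using Python's `ast` module. The allowlist `ALLOWED_MODULES` and the helper name `uses_disallowed_imports` are hypothetical choices for this sketch, not the notebook's exact implementation:

```python
import ast

# Hypothetical allowlist of pure-Python standard-library modules;
# numpy, torch, ctypes, etc. are deliberately absent.
ALLOWED_MODULES = {"math", "itertools", "random"}

def uses_disallowed_imports(code: str) -> bool:
    # Reject generated code that imports anything outside the allowlist.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return True  # unparseable code fails the check outright
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # treat relative imports as disallowed
        else:
            continue
        if any(name.split(".")[0] not in ALLOWED_MODULES for name in names):
            return True
    return False

# A "lazy" solution that outsources the work to NumPy is caught:
lazy = "import numpy as np\ndef matmul(A, B): return np.matmul(A, B)"
print(uses_disallowed_imports(lazy))  # True
```

Note that a static check like this only catches literal import statements; dynamic imports via `__import__` are better handled by the globals restriction described under Cheating below.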

Caching & Cheating

RL learns to cache the output of previous runs, and it also learns to find the expected answer by inspecting Python global variables.

We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully, over multiple loops and trials, as sketched below.
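
A minimal sketch of this idea, assuming square matrices and a pure-Python candidate; the helper names `make_fake_matrix` and `benchmark` and the loop/trial counts are illustrative, not the notebook's exact harness:

```python
import random
import time

def make_fake_matrix(n=128):
    # Large throwaway matrix of unpredictable values, used to wipe any
    # cached/memoized results between timing trials.
    return [[random.random() for _ in range(n)] for _ in range(n)]

def benchmark(candidate_matmul, A, B, loops=3, trials=5):
    # Time the candidate over several independent trials, busting the
    # cache in between, so a single cached run cannot dominate the reward.
    timings = []
    for _ in range(trials):
        fake = make_fake_matrix(len(A))
        candidate_matmul(fake, fake)  # cache-busting call on fake data
        start = time.perf_counter()
        for _ in range(loops):
            result = candidate_matmul(A, B)
        timings.append((time.perf_counter() - start) / loops)
    return min(timings), result
```

Averaging inside each trial reduces timer noise, while the cache-busting call between trials means a memoized answer for the real inputs cannot simply be replayed.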

Cheating

RL learns to edit the timing function so it reports 0 elapsed time. We can stop the RL algorithm from using global or cached variables by restricting its locals and globals. Since we use exec to create the function, we have to save the output to an empty dict. We also disallow global variable access via types.FunctionType(f.__code__, {}), as sketched below.
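
A minimal sketch of this sandboxing step; the `matmul` source string here is a stand-in for model-generated code:

```python
import types

generated_code = """
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
"""

# exec the generated source into an empty dict so it cannot see or mutate
# the benchmark's own globals (e.g. the timing function).
namespace = {}
exec(generated_code, namespace)
f = namespace["matmul"]

# Rebuild the function with an empty globals dict: any attempt to read a
# module-level variable (cached results, a patched timer) raises NameError.
# Builtins such as sum and zip remain available via the interpreter fallback.
sandboxed = types.FunctionType(f.__code__, {})

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(sandboxed(A, B))  # [[19, 22], [43, 50]]
```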
