RL’s Painful Signal Problem
2:15:37–2:16:11 · 34s
Eric underscores Dwarkesh’s critique of policy-gradient RL: if your policy almost never samples the right answer, you may get virtually no learning signal for a long time.
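To make the critique concrete, here is a minimal sketch, assuming a toy setting that is not from the conversation itself: a softmax policy over a handful of discrete actions, exactly one of which earns reward 1, trained with plain REINFORCE and no baseline. The names `reinforce_gradient` and `prob_any_success` are hypothetical helpers for illustration only. The point it shows is that when no sample in a batch hits the rewarded action, every term in the gradient estimate is multiplied by zero, so the update is exactly zero.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, correct, batch_size, rng):
    """Monte Carlo REINFORCE estimate of d E[reward] / d logits (no baseline).

    Reward is 1 for the `correct` action and 0 otherwise (toy assumption).
    Returns the averaged gradient estimate and the number of rewarded samples.
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    hits = 0
    for _ in range(batch_size):
        action = rng.choices(range(n), weights=probs)[0]
        reward = 1.0 if action == correct else 0.0
        hits += int(reward)
        # d log pi(action) / d logit_j = 1[j == action] - probs[j]
        for j in range(n):
            indicator = 1.0 if j == action else 0.0
            grad[j] += reward * (indicator - probs[j])
    return [g / batch_size for g in grad], hits


def prob_any_success(p_correct, batch_size):
    """Chance that at least one of `batch_size` independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** batch_size


if __name__ == "__main__":
    rng = random.Random(0)
    # The correct action (index 9) starts with a vanishingly small probability.
    logits = [0.0] * 9 + [-50.0]
    grad, hits = reinforce_gradient(logits, correct=9, batch_size=64, rng=rng)
    print("rewarded samples:", hits)
    print("gradient estimate:", grad)
    print("P(any success), p=1e-6, batch=64:", prob_any_success(1e-6, 64))
```

Under these assumptions, a policy that assigns probability 1e-6 to the right answer sees a rewarded sample in a batch of 64 only about 6 times in 100,000 batches, and every other batch contributes no gradient at all. Baselines and advantage normalization reduce variance but do not change this: if all rewards in a batch are identical, the centered signal is still zero.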