{"id":886,"date":"2025-03-05T22:50:13","date_gmt":"2025-03-05T22:50:13","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=886"},"modified":"2025-03-05T23:00:59","modified_gmt":"2025-03-05T23:00:59","slug":"the-interface-between-reinforcement-learning-theory-and-language-model-post-training","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2025\/03\/05\/the-interface-between-reinforcement-learning-theory-and-language-model-post-training\/","title":{"rendered":"The Interface Between Reinforcement Learning Theory and Language Model Post-Training"},"content":{"rendered":"\n<p class=\"\">We have another technical blog post, this time by <a href=\"https:\/\/people.cs.umass.edu\/~akshay\/\">Akshay Krishnamurthy<\/a> and <a href=\"https:\/\/audhuang.github.io\/\">Audrey Huang<\/a>, about how ideas from reinforcement learning theory can inspire new algorithms for language model post-training. <\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"\">Over the last several years, we have seen an explosion of interest and research activity into <em>generative models<\/em>\u2014particularly large language models like ChatGPT, Claude, and Gemini\u2014which operate via textual inputs and outputs and can be used for a variety of general-purpose tasks like question-answering, creative writing, and reasoning. At a high level, training these models comprises two phases: (1) in the <em>pre-training<\/em> phase, the model is trained on a large corpus of text to predict each token (word) given the previous tokens in each document, (2) in the <em>post-training<\/em> phase, a variety of techniques are deployed to <em>align<\/em> the model, making it suitable for downstream use. 
For instance, alignment techniques are used to control or steer the model away from producing inappropriate or offensive content, which is essential for safe deployment.&nbsp;<\/p>\n\n\n\n<p class=\"\">One of the standard approaches for language model alignment is known as <em>Reinforcement Learning from Human Feedback<\/em> (RLHF). The idea is to treat the language model as a decision-making policy and use techniques from reinforcement learning (RL) to optimize for desirable outcomes, where the notion of desirability is derived from a dataset of outcomes curated with human feedback. These RLHF approaches are pervasive; they are employed in the training of essentially every language model. This new application of RL presented an exciting opportunity for the RL research community to translate, refine, and deploy their ideas toward improving language model alignment, and progress in this direction has been rather rapid. In this blog post, we will discuss some recent advances\u2014focusing on theoretical developments\u2014in reinforcement learning for language model alignment.&nbsp;<\/p>\n\n\n\n<p class=\"\">As an outline, most of this blog post will focus on the most standard setting for RLHF. We will start with some background to set the stage, highlight the central challenge of <em>overoptimization<\/em>, and then present a new algorithm, <a href=\"https:\/\/arxiv.org\/abs\/2407.13399\">chi-squared preference optimization<\/a>, that we (in joint work with <a href=\"https:\/\/whzhan99.github.io\/\">Wenhao Zhan<\/a>, <a href=\"https:\/\/tengyangxie.github.io\/\">Tengyang Xie<\/a>, <a href=\"https:\/\/jasondlee88.github.io\/\">Jason Lee<\/a>, <a href=\"https:\/\/wensun.github.io\/\">Wen Sun<\/a>, and <a href=\"https:\/\/dylanfoster.net\/\">Dylan Foster<\/a>) developed to mitigate this issue. 
To wrap up, we\u2019ll briefly highlight some other work at the interface of RL theory and LLM post-training, and close with some parting thoughts.&nbsp;<\/p>\n\n\n\n<p class=\"\"><strong>Background<\/strong><\/p>\n\n\n\n<p class=\"\">The most basic formulation of RLHF considers single-turn chat scenarios where there is a space <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> of possible prompts and a space <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> of possible responses. In RL parlance, the prompt is the state of the environment and the response is the action. There are two main ingredients: a pre-trained language model policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%3A+%5Cmathcal%7BX%7D+%5Cto+%5CDelta%28%5Cmathcal%7BY%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}: &#92;mathcal{X} &#92;to &#92;Delta(&#92;mathcal{Y})\" class=\"latex\" \/> which (stochastically) maps prompts to responses, and a dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D+%3D+%5C%7B+%28x_i%2C+y_i%5E%2B%2C+y_i%5E-%29+%5C%7D_%7Bi%3D1%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D = &#92;{ (x_i, y_i^+, y_i^-) &#92;}_{i=1}^n\" class=\"latex\" \/> comprising prompts <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_i\" class=\"latex\" \/> along with preferred and dispreferred responses <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y_i%5E%2B&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_i^+\" class=\"latex\" \/> and <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=y_i%5E-&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_i^-\" class=\"latex\" \/>. For mathematical analysis, it is often assumed that this preference dataset is generated via the following process:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">Prompts <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/> are drawn from some prompt distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_x\" class=\"latex\" \/>,<\/li>\n\n\n\n<li class=\"\">Two responses <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y_1%2Cy_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_1,y_2\" class=\"latex\" \/> are drawn independently from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28%5Ccdot%5Cmid%7B%7Dx%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}(&#92;cdot&#92;mid{}x)\" class=\"latex\" \/>,&nbsp;<\/li>\n\n\n\n<li class=\"\">These responses are ordered as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28y%5E%2B%2Cy%5E-%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(y^+,y^-)\" class=\"latex\" \/> based on the <em>Bradley-Terry<\/em> model parametrized by an unknown reward function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r%5E%5Cstar%28x%2Cy%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r^&#92;star(x,y)\" class=\"latex\" \/>: <\/li>\n<\/ol>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5CPr%5By+%5Csucc+y%27%5D+%3D+%5Cfrac%7B%5Cexp%28r%5E%5Cstar%28x%2Cy%29%29%7D%7B%5Cexp%28r%5E%5Cstar%28x%2Cy%29%29+%2B+%5Cexp%28r%5E%5Cstar%28x%2Cy%27%29%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;Pr[y &#92;succ y&#039;] = &#92;frac{&#92;exp(r^&#92;star(x,y))}{&#92;exp(r^&#92;star(x,y)) + &#92;exp(r^&#92;star(x,y&#039;))}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">Given this dataset, we aim to learn a policy that has high reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Carg%5Cmax_%7B%5Cpi%7D+%5Cmathbb%7BE%7D_%7Bx%5Csim+D_x%2C+y+%5Csim+%5Cpi%28%5Ccdot%5Cmid+x%29%7D%5B+r%5E%5Cstar%28x%2Cy%29+%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;arg&#92;max_{&#92;pi} &#92;mathbb{E}_{x&#92;sim D_x, y &#92;sim &#92;pi(&#92;cdot&#92;mid x)}[ r^&#92;star(x,y) ]\" class=\"latex\" \/>.\u00a0<\/p>\n\n\n\n<p class=\"\">A natural approach, first proposed by <a href=\"https:\/\/arxiv.org\/abs\/1706.03741\">Christiano et al (2017)<\/a>, is to use the preference dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D\" class=\"latex\" \/> to fit an estimated reward function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/> and then find a policy that has a high reward according to the estimated reward function. 
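<\/p>\n\n\n\n<p class=\"\">To make the setup concrete, here is a minimal sketch (a hypothetical single-prompt setting with a small discrete response space and made-up reward values, not the authors\u2019 code) of the assumed generative process together with the reward-fitting step: preferences are sampled under the Bradley-Terry model with a known reward, and an estimate is then recovered by maximum likelihood, i.e., by minimizing the logistic loss.<\/p>

```python
import math
import random

rng = random.Random(0)

# Hypothetical setup: a single prompt, five candidate responses,
# a known "true" reward r_star, and pi_ref uniform over responses.
responses = list(range(5))
r_star = {y: 0.5 * y for y in responses}

def bt_prob(r, y, y_other):
    """Bradley-Terry probability that y is preferred to y_other under reward r."""
    return math.exp(r[y]) / (math.exp(r[y]) + math.exp(r[y_other]))

# Steps 1-3 of the generative process: draw two responses i.i.d. from
# pi_ref and order them by a Bradley-Terry comparison under r_star.
data = []
for _ in range(8000):
    y1, y2 = rng.choice(responses), rng.choice(responses)
    if rng.random() < bt_prob(r_star, y1, y2):
        data.append((y1, y2))  # stored as (y^+, y^-)
    else:
        data.append((y2, y1))

# Fit r_hat by gradient descent on the average negative log-likelihood
# (logistic loss) of the observed orderings.
r_hat = {y: 0.0 for y in responses}
for _ in range(150):
    grad = {y: 0.0 for y in responses}
    for y_pos, y_neg in data:
        p = bt_prob(r_hat, y_pos, y_neg)  # model prob. of the observed order
        grad[y_pos] -= 1.0 - p            # d(-log p)/d r_hat[y^+]
        grad[y_neg] += 1.0 - p            # d(-log p)/d r_hat[y^-]
    for y in responses:
        r_hat[y] -= 2.0 * grad[y] / len(data)
```

<p class=\"\">Because the Bradley-Terry model identifies rewards only up to an additive constant per prompt, it is the fitted reward differences, rather than the raw values, that should approximate those of the true reward.<\/p>\n\n\n\n<p class=\"\">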
In practice, this is done using a reinforcement learning algorithm to optimize the KL-regularized objective<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Chat%7B%5Cpi%7D_%7B%5Cmathrm%7BRLHF%7D%7D+%3D+%5Cmathrm%7Bargmax%7D_%7B%5Cpi%7D+%5Cmathbb%7BE%7D_%7Bx%5Csim+D_x%2C+y+%5Csim+%5Cpi%28%5Ccdot%5Cmid+x%29%7D%5B+%5Chat%7Br%7D%28x%2Cy%29+%5D+-+%5Cbeta+D_%7B%5Cmathrm%7BKL%7D%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D+%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;hat{&#92;pi}_{&#92;mathrm{RLHF}} = &#92;mathrm{argmax}_{&#92;pi} &#92;mathbb{E}_{x&#92;sim D_x, y &#92;sim &#92;pi(&#92;cdot&#92;mid x)}[ &#92;hat{r}(x,y) ] - &#92;beta D_{&#92;mathrm{KL}}(&#92;pi || &#92;pi_{&#92;mathrm{ref}} )\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_%7B%5Cmathrm%7BKL%7D%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D+%29+%3D+%5Cmathbb%7BE%7D_%7Bx+%5Csim+D_x%7D%5B+D_%7B%5Cmathrm%7BKL%7D%7D%28%5Cpi%28%5Ccdot+%5Cmid+x%29+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28%5Ccdot+%5Cmid+x%29%29+%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_{&#92;mathrm{KL}}(&#92;pi || &#92;pi_{&#92;mathrm{ref}} ) = &#92;mathbb{E}_{x &#92;sim D_x}[ D_{&#92;mathrm{KL}}(&#92;pi(&#92;cdot &#92;mid x) || &#92;pi_{&#92;mathrm{ref}}(&#92;cdot &#92;mid x)) ]\" class=\"latex\" \/> is the average (over prompts) KL divergence between the policies\u2019 response distributions. We refer to this method as \u201cstandard RLHF.\u201d<\/p>\n\n\n\n<p class=\"\">One issue with this approach is that, by using reinforcement learning for optimization, it inherits the brittleness and instability of deep reinforcement learning. 
To address this, <a href=\"https:\/\/arxiv.org\/abs\/2305.18290\">Rafailov et al (2023)<\/a> observed a certain duality between policies and rewards and used it to derive a much simpler method, called Direct Preference Optimization (DPO). The idea is that for any reward function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r\" class=\"latex\" \/>, the optimal policy for the KL-regularized objective above (with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r\" class=\"latex\" \/> instead of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/>) has a closed-form solution,<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Cpi_r%28y+%5Cmid+x%29+%3D+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28y+%5Cmid+x%29+%5Cexp%5Cleft%28r%28x%2Cy%29%2F%5Cbeta%5Cright%29%2FZ%28x%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;pi_r(y &#92;mid x) = &#92;pi_{&#92;mathrm{ref}}(y &#92;mid x) &#92;exp&#92;left(r(x,y)\/&#92;beta&#92;right)\/Z(x),\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Z%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Z(x)\" class=\"latex\" \/> is a normalizing constant that ensures <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_r%28%5Ccdot+%5Cmid+x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_r(&#92;cdot &#92;mid x)\" class=\"latex\" \/> is a distribution. 
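<\/p>\n\n\n\n<p class=\"\">As a quick numerical sanity check on this closed form, the sketch below (a toy three-response example with made-up rewards and reference probabilities) verifies that the exponentially tilted policy attains a higher KL-regularized objective value than randomly drawn alternatives.<\/p>

```python
import math
import random

rng = random.Random(1)

def kl_objective(pi, pi_ref, r, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref), for a single fixed prompt."""
    return sum(pi[y] * (r[y] - beta * math.log(pi[y] / pi_ref[y]))
               for y in pi if pi[y] > 0)

def kl_optimal_policy(pi_ref, r, beta):
    """Closed-form maximizer: pi_r(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = {y: pi_ref[y] * math.exp(r[y] / beta) for y in pi_ref}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Toy problem (hypothetical numbers).
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
r = {"a": 1.0, "b": 0.2, "c": -0.5}
beta = 0.7

pi_star = kl_optimal_policy(pi_ref, r, beta)
best = kl_objective(pi_star, pi_ref, r, beta)

# No randomly drawn policy should beat the closed form.
for _ in range(1000):
    w = [rng.random() + 1e-12 for _ in pi_ref]
    s = sum(w)
    pi = dict(zip(pi_ref, (v / s for v in w)))
    assert kl_objective(pi, pi_ref, r, beta) <= best + 1e-9
```

<p class=\"\">The tilting also shifts mass toward high-reward responses: in this toy example the optimal policy puts about 0.81 probability on the top response, versus 0.5 under the reference policy.<\/p>\n\n\n\n<p class=\"\">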
Rafailov et al rearranged this expression to <em>parameterize<\/em> reward functions by policies and then used this parametrization to fit a reward function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/> to the preference data. This essentially amounts to solving a supervised learning problem with a particularly parameterized function class\/architecture, but it directly produces a policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Cpi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;pi}\" class=\"latex\" \/> avoiding the need for complicated reinforcement learning subroutines.\u00a0<\/p>\n\n\n\n<p class=\"\">Unfortunately, both standard RLHF and DPO have been observed to suffer from a phenomenon referred to as <em>overoptimization<\/em> (e.g., in <a href=\"https:\/\/proceedings.mlr.press\/v202\/gao23h\/gao23h.pdf\">Gao et al (2023)<\/a>), where the policy degrades in quality, rather than improves, during the optimization process. As we will see in the next section, one explanation for overoptimization is that it arises from a certain statistical inefficiency of both methods, which can be addressed via a novel algorithm design.\u00a0<\/p>\n\n\n\n<p class=\"\"><strong>Overoptimization hurts performance in RLHF<\/strong><\/p>\n\n\n\n<p class=\"\">Overoptimization can be understood by connecting the RLHF setting to a subfield of reinforcement learning theory known as <em>offline reinforcement learning<\/em>. Although RL typically concerns an agent interacting with an environment in an <em>online<\/em> manner, it can be more practical\/feasible to learn in an <em>offline<\/em> manner, from data that was previously collected by some other decision-making policy (this is also a useful subroutine in online methods). 
Since we are unable to interact with the environment in these settings and the dataset may not contain information\/demonstrations of near-optimal behavior, a natural desideratum is to do the best we can with the data that we have, i.e., find a policy whose performance is competitive with the best policy \u201csupported\u201d by the data. Recent developments in the theory of offline RL have formalized such guarantees via a notion of \u201csingle-policy concentrability\u201d (whose definition is not essential for understanding this blog post).<\/p>\n\n\n\n<p class=\"\">The fundamental challenge in offline RL is a mismatch between what we can numerically optimize (an estimate of policy performance, computed from the data we have) and what we actually care about (the true policy performance), resulting in an instance of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Goodhart%27s_law\">Goodhart\u2019s Law<\/a>. To see this in more detail, observe that the RLHF setting described above is a special case of offline RL because the dataset is collected a priori by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/> and no other information about the ground truth reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r^&#92;star\" class=\"latex\" \/> is available. 
Standard RLHF optimizes the estimated reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/> as a surrogate for the true reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r^&#92;star\" class=\"latex\" \/>, resulting in <em>overfitting<\/em> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/> while achieving poor performance as measured via <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=r%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"r^&#92;star\" class=\"latex\" \/>. Indeed, this is precisely what is observed experimentally by Gao et al where it is referred to as overoptimization. Accordingly, this viewpoint suggests a statistical mechanism behind overoptimization: it is equivalent to the known challenge of overfitting in offline RL.<\/p>\n\n\n\n<p class=\"\">To address the overfitting challenge (and achieve guarantees based on single policy concentrability), the offline RL literature has developed algorithms based on the <em>principle of pessimism<\/em>\u2014which quantify reward uncertainty and maximize a high confidence lower bound on reward (thus guaranteeing a certain amount of reward). Pessimism can be seen as a form of regularization which forces the optimization process to stay in the region where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D+%5Capprox+r%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r} &#92;approx r^&#92;star\" class=\"latex\" \/>, avoiding overfitting. 
Even though existing RLHF methods (including standard RLHF and DPO) employ KL-regularization to prevent deviating from the data collection policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/>, the fact that these methods overfit suggests that they are not adequately regularized. Indeed, in our paper (Proposition A.1), we construct an example showing that regularization with KL-divergence is not sufficient to achieve single-policy concentrability guarantees, thus identifying a formal limitation of existing RLHF methods.<\/p>\n\n\n\n<p class=\"\"><strong>Deep dive into Chi-squared preference optimization<\/strong><\/p>\n\n\n\n<p class=\"\">Although regularization with KL-divergence is insufficient, it turns out that regularization with the <em><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-divergence<\/em>&#8212;defined as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_%7B%5Cchi%5E2%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%29+%3D+%5Cfrac%7B1%7D%7B2%7D+%5Cmathbb%7BE%7D_%7Bx+%5Csim+D_x%2Cy+%5Csim+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28%5Ccdot+%5Cmid%7B%7D+x%29%7D%5B+%28%5Cpi%28y%7Cx%29%2F%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28y%7Cx%29+-+1%29%5E2%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_{&#92;chi^2}(&#92;pi || &#92;pi_{&#92;mathrm{ref}}) = &#92;frac{1}{2} &#92;mathbb{E}_{x &#92;sim D_x,y &#92;sim &#92;pi_{&#92;mathrm{ref}}(&#92;cdot &#92;mid{} x)}[ (&#92;pi(y|x)\/&#92;pi_{&#92;mathrm{ref}}(y|x) - 1)^2]\" class=\"latex\" \/>&#8212;is! 
The <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-divergence is a stronger regularizer: we have <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_%7B%5Cmathrm%7BKL%7D%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%29+%5Cleq+2D_%7B%5Cchi%5E2%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_{&#92;mathrm{KL}}(&#92;pi || &#92;pi_{&#92;mathrm{ref}}) &#92;leq 2D_{&#92;chi^2}(&#92;pi || &#92;pi_{&#92;mathrm{ref}})\" class=\"latex\" \/>, but, more importantly, the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-divergence more accurately captures the uncertainty about a policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/>\u2019s reward when data is collected from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/>. 
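<\/p>\n\n\n\n<p class=\"\">Both divergences are straightforward to compute for discrete distributions. The sketch below (toy distributions with made-up probabilities) shows how the quadratic dependence on the density ratio makes the chi-squared divergence penalize excursions away from the reference policy far more heavily than KL does.<\/p>

```python
import math

def kl_div(p, q):
    """KL divergence between discrete distributions on a common support."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def chi2_div(p, q):
    """Chi-squared divergence with the 1/2 normalization used above:
    (1/2) * E_{y~q}[(p(y)/q(y) - 1)^2]."""
    return 0.5 * sum(q[y] * (p[y] / q[y] - 1.0) ** 2 for y in q)

# A policy that concentrates on a response that is rare under pi_ref:
# the density ratio pi/pi_ref reaches 97, and chi^2 (quadratic in the
# ratio) penalizes this far more heavily than KL (logarithmic in it).
pi_ref = {"a": 0.98, "b": 0.01, "c": 0.01}
pi = {"a": 0.02, "b": 0.97, "c": 0.01}

print(kl_div(pi, pi_ref))    # approximately 4.36
print(chi2_div(pi, pi_ref))  # approximately 46.55
```

<p class=\"\">Here KL is about 4.4 while the chi-squared divergence is roughly an order of magnitude larger, reflecting the squared density ratio.<\/p>\n\n\n\n<p class=\"\">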
To see this in a simplified setup, suppose we have \u201cnon-preference\u201d data of the form <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_%7Brew%7D+%3D+%5C%7B%28x%5Ei%2Cy%5Ei%2Cr%5E%5Cstar%28x%5Ei%2Cy%5Ei%29%29%5C%7D_%7Bi%3D1%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_{rew} = &#92;{(x^i,y^i,r^&#92;star(x^i,y^i))&#92;}_{i=1}^n\" class=\"latex\" \/> where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5Ei+%5Csim+D_x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^i &#92;sim D_x\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y%5Ei+%5Csim+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28%5Ccdot%5Cmid%7B%7Dx%5Ei%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y^i &#92;sim &#92;pi_{&#92;mathrm{ref}}(&#92;cdot&#92;mid{}x^i)\" class=\"latex\" \/>. If we fit a reward estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{r}\" class=\"latex\" \/> via least squares over some function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BR%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{R}\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Chat%7Br%7D+%5Cgets+%5Cmathrm%7Bargmin%7D_%7Br+%5Cin+%5Cmathcal%7BR%7D%7D+%5Csum_%7Bi%3D1%7D%5En+%28r%28x%5Ei%2Cy%5Ei%29+-+r%5E%5Cstar%28x%5Ei%2Cy%5Ei%29%29%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;hat{r} &#92;gets &#92;mathrm{argmin}_{r &#92;in &#92;mathcal{R}} &#92;sum_{i=1}^n (r(x^i,y^i) - r^&#92;star(x^i,y^i))^2\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">we can expect to have low in-distribution risk, say, <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D_%7Bx+%5Csim+D_x%2C+y+%5Csim+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28%5Ccdot+%5Cmid+x%29%7D%5B+%28%5Chat%7Br%7D%28x%2Cy%29+-+r%5E%5Cstar%28x%2Cy%29%29%5E2%5D+%5Cleq+%5Cvarepsilon%5E2_%7B%5Cmathrm%7Bstat%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}_{x &#92;sim D_x, y &#92;sim &#92;pi_{&#92;mathrm{ref}}(&#92;cdot &#92;mid x)}[ (&#92;hat{r}(x,y) - r^&#92;star(x,y))^2] &#92;leq &#92;varepsilon^2_{&#92;mathrm{stat}}\" class=\"latex\" \/>. Considering some policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/>, the difference between its true reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D_%5Cpi%5Br%5E%5Cstar%28x%2Cy%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}_&#92;pi[r^&#92;star(x,y)]\" class=\"latex\" \/> and its estimated reward <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D_%7B%5Cpi%7D%5B%5Chat%7Br%7D%28x%2Cy%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}_{&#92;pi}[&#92;hat{r}(x,y)]\" class=\"latex\" \/> can be bounded as<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Cmathbb%7BE%7D_%5Cpi%5B+r%5E%5Cstar%28x%2Cy%29+-+%5Chat%7Br%7D%28x%2Cy%29+%5D+%5C%5C%3D+%5Cmathbb%7BE%7D_x%5Cleft%5B+%5Csum_y+%5Cfrac%7B%5Cpi%28y%5Cmid%7B%7Dx%29%7D%7B%5Csqrt%7B%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28y+%5Cmid%7B%7Dx%29%7D%7D%5Ccdot%5Csqrt%7B%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28y+%5Cmid%7B%7Dx%29%7D+%5Ccdot+%28r%5E%5Cstar%28x%2Cy%29+-+%5Chat%7Br%7D%28x%2Cy%29+%29%5Cright%5D%5C%5C++%5Cleq+%5Csqrt%7B+%5Cmathbb%7BE%7D_x%5Cleft%5B%5Csum_y+%5Cfrac%7B%5Cpi%5E2%28y%5Cmid%7B%7Dx%29%7D%7B%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%28y+%5Cmid%7B%7Dx%29%7D%5Cright%5D%7D%5Ccdot%5Cvarepsilon_%7B%5Cmathrm%7Bstat%7D%7D%5C%5C+%3D+%5Csqrt%7B+%5Cmathbb%7BE%7D_%7B%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%7D+%5Cleft%5B+%5Cfrac%7B%5Cpi%5E2%28y%5Cmid%7B%7Dx%29%7D%7B%5Cpi%5E2_%7B%5Cmathrm%7Bref%7D%7D%28y%5Cmid%7B%7Dx%29%7D+%5Cright%5D%7D+%5Ccdot+%5Cvarepsilon_%7B%5Cmathrm%7Bstat%7D%7D%5C%5C+%3D+%5Csqrt%7B2D_%7B%5Cchi%5E2%7D%28%5Cpi+%7C%7C%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%29+%2B+1%7D+%5Ccdot+%5Cvarepsilon_%7B%5Cmathrm%7Bstat%7D%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;mathbb{E}_&#92;pi[ r^&#92;star(x,y) - &#92;hat{r}(x,y) ] &#92;&#92;= &#92;mathbb{E}_x&#92;left[ &#92;sum_y &#92;frac{&#92;pi(y&#92;mid{}x)}{&#92;sqrt{&#92;pi_{&#92;mathrm{ref}}(y &#92;mid{}x)}}&#92;cdot&#92;sqrt{&#92;pi_{&#92;mathrm{ref}}(y &#92;mid{}x)} &#92;cdot (r^&#92;star(x,y) - &#92;hat{r}(x,y) )&#92;right]&#92;&#92;  &#92;leq &#92;sqrt{ &#92;mathbb{E}_x&#92;left[&#92;sum_y &#92;frac{&#92;pi^2(y&#92;mid{}x)}{&#92;pi_{&#92;mathrm{ref}}(y &#92;mid{}x)}&#92;right]}&#92;cdot&#92;varepsilon_{&#92;mathrm{stat}}&#92;&#92; = &#92;sqrt{ &#92;mathbb{E}_{&#92;pi_{&#92;mathrm{ref}}} &#92;left[ &#92;frac{&#92;pi^2(y&#92;mid{}x)}{&#92;pi^2_{&#92;mathrm{ref}}(y&#92;mid{}x)} &#92;right]} &#92;cdot &#92;varepsilon_{&#92;mathrm{stat}}&#92;&#92; = &#92;sqrt{2D_{&#92;chi^2}(&#92;pi ||&#92;pi_{&#92;mathrm{ref}}) + 1} &#92;cdot 
&#92;varepsilon_{&#92;mathrm{stat}} \" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">In other words, the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-divergence controls the accuracy of our estimate for policy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/>\u2019s reward when the reward function is trained on data collected by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/>. It correctly captures the uncertainty in the reward function, which is the main requirement for appropriate regularization in offline RL. Using essentially the above calculation, one can show that solving the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularized RLHF objective:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Chat%7B%5Cpi%7D_%7B%5Cchi%5E2%7D+%3D+%5Cmathrm%7Bargmax%7D_%7B%5Cpi%7D+%5Cmathbb%7BE%7D_%7Bx%5Csim+D_x%2C+y+%5Csim+%5Cpi%28%5Ccdot%5Cmid+x%29%7D%5B+%5Chat%7Br%7D%28x%2Cy%29+%5D+-+%5Cbeta+D_%7B%5Cchi%5E2%7D%28%5Cpi+%7C%7C+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D+%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;hat{&#92;pi}_{&#92;chi^2} = &#92;mathrm{argmax}_{&#92;pi} &#92;mathbb{E}_{x&#92;sim D_x, y &#92;sim &#92;pi(&#92;cdot&#92;mid x)}[ &#92;hat{r}(x,y) ] - &#92;beta D_{&#92;chi^2}(&#92;pi || &#92;pi_{&#92;mathrm{ref}} ) \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">and appropriately tuning <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta\" class=\"latex\" \/> leads to single-policy concentrability guarantees, and thus overcomes the theoretical limitation of KL-regularized approaches.\u00a0<\/p>\n\n\n\n<p class=\"\">Based on this observation, the main contribution of the paper is a \u201cdirect\u201d variant of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularization, analogous to DPO. The derivation also sheds some light on the favorable statistical properties of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularization. Recall that the DPO derivation uses the closed form solution to the KL-regularized objective, that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Cpi%7D_%7B%5Cmathrm%7BKL%7D%7D+%5Cpropto+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%5Ccdot%5Cexp%28%5Chat%7Br%7D%2F%5Cbeta%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;pi}_{&#92;mathrm{KL}} &#92;propto &#92;pi_{&#92;mathrm{ref}}&#92;cdot&#92;exp(&#92;hat{r}\/&#92;beta)\" class=\"latex\" \/>. Unfortunately, with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularization there is no closed form, but we approximately have that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Cpi%7D_%7B%5Cchi%5E2%7D+%5Cpropto+%5Cpi_%7B%5Cmathrm%7Bref%7D%7D%5Ccdot%5Chat%7Br%7D%2F%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;pi}_{&#92;chi^2} &#92;propto &#92;pi_{&#92;mathrm{ref}}&#92;cdot&#92;hat{r}\/&#92;beta\" class=\"latex\" \/>. 
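<\/p>\n\n\n\n<p class=\"\">The contrast between the two tiltings is easy to see numerically. In the sketch below (uniform reference policy, made-up positive rewards, beta equal to 1), the exponential tilting from the KL solution is compared with the linear tilting that shapes the chi-squared solution; the latter\u2019s exact form involves an additional offset and clipping, which the sketch omits.<\/p>

```python
import math

def normalize(weights):
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def kl_tilt(pi_ref, r, beta):
    """KL-regularized solution: pi(y) proportional to pi_ref(y) * exp(r(y)/beta)."""
    return normalize({y: pi_ref[y] * math.exp(r[y] / beta) for y in pi_ref})

def linear_tilt(pi_ref, r, beta):
    """Rough shape of the chi^2-regularized solution for positive rewards:
    pi(y) proportional to pi_ref(y) * r(y) / beta (offset/clipping omitted)."""
    return normalize({y: pi_ref[y] * r[y] / beta for y in pi_ref})

pi_ref = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
r_hat = {"a": 8.0, "b": 1.0, "c": 1.0}  # hypothetical estimated rewards

pi_kl = kl_tilt(pi_ref, r_hat, beta=1.0)
pi_chi2 = linear_tilt(pi_ref, r_hat, beta=1.0)

print(round(pi_kl["a"], 4))    # 0.9982: nearly all mass on the top response
print(round(pi_chi2["a"], 4))  # 0.8: substantial mass left on alternatives
```
<p class=\"\">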
From this we can see that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularization is much less greedy, or equivalently much more heavy-tailed, than KL-regularization: it does not aggressively overfit to responses that have a high estimated reward. And although there is no closed-form solution to the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularized objective, we can still mostly follow the derivation of DPO to obtain our main algorithm: a \u201cdirect\u201d method based on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularization\u2014called <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-preference optimization or <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO\u2014which avoids RL-style optimization and provably achieves single-policy concentrability guarantees.<\/p>\n\n\n\n<p class=\"\">As a final note, we have run preliminary experiments with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO on the TLDR summarization task. 
Matching our theoretical predictions, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO exhibits significantly less distribution shift from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/>, which leads to performance gains over DPO across a range of training epochs and regularization parameter settings. Notably, the performance gap between <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO and DPO grows as regularization decreases and training length increases, indicating that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO effectively mitigates distribution shift.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcV9nOC6ZgitEQtfFmvc4mGV5x2TCUUptt6I1Pi2I88h-G6fQkA7WSJuMK9VAXjVbeLzzrFVv9pLQE7m5EDzdA_AhSjbhuI_nkWuJBTfKiejGrM1fgZiQT8YYsoQGOr-sc46_hlwg?key=9twe7oFFjnRuzIjVPzIiGf-C\" alt=\"\" style=\"width:360px;height:auto\"\/><\/figure>\n<\/div>\n\n\n<p class=\"\">At the same time, the fact that we do not observe large performance gains indicates that statistical overfitting is not the whole story, and suggests many avenues for further investigation. 
For a theoretical audience, perhaps the most interesting of these are (a) that the way preference data is collected in standard benchmarks does not precisely conform to our mathematical setup (in particular, it is not clear that the responses are sampled from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi_%7B%5Cmathrm%7Bref%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi_{&#92;mathrm{ref}}\" class=\"latex\" \/>), and (b) that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO seems to induce a more challenging optimization landscape, in part due to the heavy-tailed nature of the ideal <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi^2\" class=\"latex\" \/>-regularized distribution. The latter point raises interesting research questions regarding computational-statistical tradeoffs of direct alignment objectives, and whether we can design algorithms that retain the statistical benefits of pessimism while avoiding optimization challenges, for example, through the use of inference-time computation.\u00a0<\/p>\n\n\n\n<p class=\"\"><strong>Parting thoughts<\/strong><\/p>\n\n\n\n<p class=\"\">As we mentioned in the introduction, the interface between RL theory and LLMs is a very active research area. Directly relevant to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO, there are several other works about <a href=\"https:\/\/arxiv.org\/abs\/2405.16436\">mitigating<\/a> <a href=\"https:\/\/arxiv.org\/pdf\/2412.09544?\">overoptimization<\/a>. 
There are a growing number of theoretical papers trying to <a href=\"https:\/\/arxiv.org\/pdf\/2404.16767\">simplify<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2502.06861\">demystify<\/a>, and <a href=\"https:\/\/arxiv.org\/pdf\/2410.08847\">understand<\/a> standard RLHF and DPO. There are also works that focus on reward modeling, which identify shortcomings with the Bradley-Terry model and develop algorithms based on more <a href=\"https:\/\/arxiv.org\/abs\/2312.00886\">flexible<\/a> <a href=\"https:\/\/arxiv.org\/abs\/2404.03715\">alternatives<\/a>. Finally, a direction we are currently quite excited about involves developing LLM post-training methods that deliberately gather novel information via <a href=\"https:\/\/arxiv.org\/abs\/2405.21046\"><em>online exploration<\/em><\/a>.\u00a0<\/p>\n\n\n\n<p class=\"\">To summarize, we believe there is tremendous potential for a diversity of theoretical perspectives to have an impact in language model post-training and generative AI more broadly. New formalizations and connections with other areas can lead to deeper understanding and novel algorithmic interventions. At the same time, clean mathematical testbeds, while useful, are unlikely to capture the full complexity of modern generative AI. As we\u2019ve learned through our experience working on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cchi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;chi\" class=\"latex\" \/>PO, it is important to stay grounded in the empirics to understand when and how the formalisms might break down. The upshot is that iterating between theory and practice produces a seemingly endless stream of interesting questions and opportunities.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have another technical blog post, this time by Akshay Krishnamurthy and Audrey Huang, about how ideas from reinforcement learning theory can inspire new algorithms for language model post-training. 
Over the last several years, we have seen an explosion of interest and research activity into generative models\u2014particularly large language models like ChatGPT, Claude, and Gemini\u2014which [&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-886","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/886","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=886"}],"version-history":[{"count":68,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/886\/revisions"}],"predecessor-version":[{"id":954,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/886\/revisions\/954"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=886"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=886"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=886"}],"curies":[{"name":"wp","href":"https:\/\
/api.w.org\/{rel}","templated":true}]}}