{"id":1153,"date":"2025-11-26T15:50:48","date_gmt":"2025-11-26T15:50:48","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=1153"},"modified":"2025-11-26T16:10:14","modified_gmt":"2025-11-26T16:10:14","slug":"watermarking-language-models","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2025\/11\/26\/watermarking-language-models\/","title":{"rendered":"Watermarking language models"},"content":{"rendered":"\n<p>Watermarking of language models is a challenging and important problem, where theory and algorithms play an essential role. <a href=\"https:\/\/web.stanford.edu\/~rohithk\/\">Rohith Kuditipudi<\/a> guides us through some of the latest advancements in the area.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The provenance problem<\/strong><\/h2>\n\n\n\n<p>Language models today are able to mass-produce fluent, human-like text. This reality poses unprecedented challenges for establishing text provenance while also placing renewed emphasis on its importance. Reliable tools for distinguishing human versus machine-generated text\u2014and attributing text to particular language models\u2014can empower individuals like online platform moderators and teachers to enact and enforce policies on language model usage; moreover, these tools can better enable model providers themselves (e.g., OpenAI) to track the (mis)use of their models, e.g., to scrub synthetic text from the training data of future language models.<\/p>\n\n\n\n<p>A natural approach to developing such a tool is to collect many examples of text generated by humans versus language models and learn a classifier. Unfortunately, the difficulty of this classification problem scales directly with the quality of the language models. 
In particular, model providers explicitly optimize language models to maximize the log-likelihood of human-generated text; the better these providers are able to optimize this objective, the harder it becomes to tell apart machine-generated text. Moreover, machine-learned classifiers are prone to issues such as distribution shift and miscalibration, issues which can have serious consequences in the wild (e.g., falsely accusing a student of plagiarism).<\/p>\n\n\n\n<p>In this blog post, we will cover our work [KTHL&#8217;23] on an alternative approach to text provenance: watermarking. To achieve provenance, a watermark is a signal embedded within some generated content\u2014in our case, text from a language model\u2014that encodes the source of the content. We consider a setting where an untrusted third party user queries a language model by sending prompts to a trusted provider: the model provider generates text from their language model with a watermark\u2014which they compute using a (secret) key\u2014so that they may later identify the source of the text if the user publishes it. 
Essentially, watermarking circumvents the difficulty of classifying human versus machine-generated text by introducing side-information: the idea is that even if the distribution of unwatermarked text is <em>identical<\/em> to that of watermarked text, the joint distributions of the texts and the watermark key will be easily distinguishable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Statistical Watermarking<\/h2>\n\n\n\n<p><strong>Problem setup<\/strong><\/p>\n\n\n\n<p>Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BV%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{V}\" class=\"latex\" \/> be a discrete set, i.e., the vocabulary, and let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p+%3A+%5Cmathcal%7BV%7D%5E%2A+%5Cto+%5CDelta%28%5Cmathcal%7BV%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p : &#92;mathcal{V}^* &#92;to &#92;Delta(&#92;mathcal{V})\" class=\"latex\" \/> be an autoregressive language model which maps a string of arbitrary length to a distribution over the vocabulary, with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p%28%5Ccdot+%5Cmid+x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p(&#92;cdot &#92;mid x)\" class=\"latex\" \/> denoting the distribution of the next token given the prefix <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x+%5Cin+%5Cmathcal%7BV%7D%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x &#92;in &#92;mathcal{V}^*\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CXi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Xi\" class=\"latex\" \/> denote the space in which the elements of the watermark key sequence lie. 
The following protocol defines our problem setting:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The model provider generates a random watermark key <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi+%5Csim+%5Cnu&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi &#92;sim &#92;nu\" class=\"latex\" \/> for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnu+%5Cin+%5CDelta%28%5CXi%5E%2A%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nu &#92;in &#92;Delta(&#92;Xi^*)\" class=\"latex\" \/>;<\/li>\n\n\n\n<li>The user sends a prompt <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x+%5Cin+%5Cmathcal%7BV%7D%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x &#92;in &#92;mathcal{V}^*\" class=\"latex\" \/> to the model provider;<\/li>\n\n\n\n<li>The model provider generates text <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y+%5Cin+%5Cmathcal%7BV%7D%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y &#92;in &#92;mathcal{V}^*\" class=\"latex\" \/> by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y+%3D+%5Cmathtt%7Bgenerate%7D%28x%2C+%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y = &#92;mathtt{generate}(x, &#92;xi)\" class=\"latex\" \/>;<\/li>\n\n\n\n<li>The user publishes text <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D+%5Cin+%5Cmathcal%7BV%7D%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y} &#92;in &#92;mathcal{V}^*\" class=\"latex\" \/>, which may be either (i) (an edited version of) the generated text <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> or (ii) text independent of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" 
class=\"latex\" \/> (e.g., text that they wrote themselves);<\/li>\n\n\n\n<li>The provider determines if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is watermarked\u2014i.e., if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> depends on the watermark key sequence\u2014by computing a p-value <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bp%7D+%3D+%5Cmathtt%7Bdetect%7D%28%5Ctilde%7BY%7D+%2C+%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{p} = &#92;mathtt{detect}(&#92;tilde{Y} , &#92;xi)\" class=\"latex\" \/> with respect to the null hypothesis that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is independent of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> (i.e., not watermarked).<\/li>\n<\/ol>\n\n\n\n<p>We collectively refer to the tuple <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Cmathtt%7Bgenerate%7D%2C%5Cmathtt%7Bdetect%7D%2C%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;mathtt{generate},&#92;mathtt{detect},&#92;xi)\" class=\"latex\" \/> as a watermarking scheme (i.e., we define a scheme by specifying generate and detect methods along with the distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/>). We posit that a good scheme should satisfy at least three desiderata. 
First, it should be <em>distortion-free<\/em>: marginalizing over the watermark key sequence, each call to generate is equal in distribution to a sample from the original language model, i.e., the distribution<\/p>\n\n\n\n<p class=\"has-text-align-center has-medium-font-size\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P%28%5Cmathbf%7Btext%7D%29+%3D+%5Cint_%5Cxi+%5Cmathbf%7B1%7D%5C%7B%5Cmathbf%7Btext%7D+%3D+%5Cmathtt%7Bgenerate%7D%28%5Cmathbf%7Bprompt%7D%2C%5Cxi%29%5C%7Dd%5Cnu%28%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P(&#92;mathbf{text}) = &#92;int_&#92;xi &#92;mathbf{1}&#92;{&#92;mathbf{text} = &#92;mathtt{generate}(&#92;mathbf{prompt},&#92;xi)&#92;}d&#92;nu(&#92;xi)\" class=\"latex\" \/><\/p>\n\n\n\n<p>is equal to the original language model\u2019s sampling distribution. Second, it should be <em>reliable<\/em> by enabling exact control of false positive (Type I) errors. Finally, it should be <em>robust<\/em>: the provider should be able to detect the watermark even if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is not an exact copy of watermarked text.<\/p>\n\n\n\n<p><strong>Designing a watermarking scheme<\/strong><\/p>\n\n\n\n<p>When we sample text from an autoregressive language model, under the hood the computer generates each token by mapping a random seed to a sample from the model\u2019s output distribution.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1898\" height=\"594\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?fit=678%2C212&amp;ssl=1\" alt=\"\" class=\"wp-image-1186\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?w=1898&amp;ssl=1 1898w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?resize=300%2C94&amp;ssl=1 300w, 
https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?resize=1024%2C320&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?resize=768%2C240&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?resize=1536%2C481&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.08-PM.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><figcaption class=\"wp-element-caption\">Sampling from a language model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The basic premise of statistical watermarking is to use a watermark key in place of a random seed and define a transformation from the key to tokens that enables downstream identification of watermarked token sequences.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1928\" height=\"666\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?fit=678%2C234&amp;ssl=1\" alt=\"\" class=\"wp-image-1187\" style=\"width:677px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?w=1928&amp;ssl=1 1928w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?resize=300%2C104&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?resize=1024%2C354&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?resize=768%2C265&amp;ssl=1 768w, 
https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?resize=1536%2C531&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-1.36.01-PM.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><figcaption class=\"wp-element-caption\">Watermarking a language model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>In our scheme, we take <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> to be a (long) i.i.d. sequence of numbers drawn uniformly at random in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5B0%2C1%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"[0,1]\" class=\"latex\" \/>. To generate watermarked text, we map elements from a random subsequence of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> to samples from the language model\u2019s output distribution using a technique known as inverse transform sampling, which we illustrate below. The distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> ensures the resulting watermarked text will be distortion-free (so long as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> is long enough that we do not need to reuse any elements). 
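<\/p>\n\n\n\n<p>To make the sampling step concrete, here is a minimal sketch of how one uniform key element can select one token (the function name and the use of NumPy are ours for illustration; the actual implementation may differ in details):<\/p>

```python
import numpy as np

def its_next_token(probs, u, perm):
    """Inverse transform sampling of a single token.

    probs: the model's next-token distribution over the vocabulary
    u:     one uniform-[0, 1] element of the watermark key
    perm:  a fixed random permutation of token ids (the vocabulary shuffle)
    """
    # cumulative distribution over the shuffled vocabulary
    cdf = np.cumsum(np.asarray(probs)[perm])
    # smallest index whose cumulative probability covers u
    idx = min(int(np.searchsorted(cdf, u)), len(cdf) - 1)
    return int(perm[idx])
```

<p>Marginalizing over a uniform key element, each token is selected with exactly its model probability, which is what makes this step distortion-free.<\/p>\n\n\n\n<p>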
Crucially, by generating text in this way we introduce some correlation between the identities of the watermarked tokens in the vocabulary and the subsequence of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> we used to generate the tokens. (As a preprocessing step, we shuffle the identities of tokens in the vocabulary to ensure there will be no correlation for regular text.)\u00a0<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"651\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?fit=678%2C276&amp;ssl=1\" alt=\"\" class=\"wp-image-1159\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?resize=300%2C122&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?resize=1024%2C417&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?resize=768%2C312&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?resize=1536%2C625&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-3.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"788\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?fit=678%2C334&amp;ssl=1\" alt=\"\" class=\"wp-image-1160\" 
srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?resize=300%2C148&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?resize=1024%2C504&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?resize=768%2C378&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?resize=1536%2C756&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-4.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><figcaption class=\"wp-element-caption\">Our watermarking scheme with inverse transform sampling.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>We use this correlation to define a test statistic for detecting watermarked text downstream. The test statistic compares the indices of tokens in a putative watermarked text (normalized by the vocabulary size) to subsequences of the watermark key. We would expect the value of the test statistic to be smaller for watermarked versus unwatermarked text (since the token indices of watermarked text are positively correlated with the subsequence of the watermark key which we used to generate the text). 
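<\/p>\n\n\n\n<p>As a rough sketch (the function and variable names are ours for illustration), the statistic normalizes the token indices, slides a window across the key sequence, and keeps the best match:<\/p>

```python
import numpy as np

def align_stat(tokens, key, vocab_size):
    """Sliding-window l1 statistic; smaller values look more watermarked.

    tokens:     token indices of the putative watermarked text
    key:        the full watermark key sequence (floats in [0, 1])
    vocab_size: |V|, used to normalize token indices into [0, 1]
    """
    y = np.asarray(tokens, dtype=float) / vocab_size
    key = np.asarray(key, dtype=float)
    m = len(y)
    # compare against every length-m window of the key
    return min(float(np.abs(y - key[i:i + m]).sum())
               for i in range(len(key) - m + 1))
```

<p>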
Because we do not know which subsequence of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> we may have used to generate the text, we must compare the text against all possible subsequences.<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cphi%28Y%2C%5Cxi%29+%3D+%5Cmin_i+%7C%7CY%2F%7CV%7C+-+%5Cxi_%7Bi%3Ai%2B%5Cmathtt%7Blen%7D%28Y%29%7D%7C%7C_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;phi(Y,&#92;xi) = &#92;min_i ||Y\/|V| - &#92;xi_{i:i+&#92;mathtt{len}(Y)}||_1\" class=\"latex\" \/><\/p>\n\n\n\n<p>We can obtain exact p-values from <em>any<\/em> such test statistic using a permutation test: we generate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=T&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"T\" class=\"latex\" \/> independent copies <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%5Cxi_t%5C%7D_%7Bt%3D1%7D%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{&#92;xi_t&#92;}_{t=1}^T\" class=\"latex\" \/> of the watermark key (by rerunning the same procedure we used to sample the original key) and compare the value of the original test statistic <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cphi%28%5Ctilde%7BY%7D%2C%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;phi(&#92;tilde{Y},&#92;xi)\" class=\"latex\" \/> with the collection <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%5Cphi%28%5Ctilde%7BY%7D%2C%5Cxi_t%29%5C%7D_%7Bt%3D1%7D%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{&#92;phi(&#92;tilde{Y},&#92;xi_t)&#92;}_{t=1}^T\" class=\"latex\" \/>. 
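<\/p>\n\n\n\n<p>This permutation test can be sketched as follows (a simplified illustration; phi stands in for any test statistic and sample_key for redrawing a key from the same distribution as the original):<\/p>

```python
def permutation_pvalue(phi, text, key, sample_key, T=99):
    """Exact p-value for the null that `text` is independent of `key`.

    phi:        test statistic (smaller = more watermark-like)
    sample_key: draws an independent copy of the watermark key
    """
    obs = phi(text, key)
    null = [phi(text, sample_key()) for _ in range(T)]
    # rank of the observed statistic among T + 1 exchangeable values
    return (1 + sum(v <= obs for v in null)) / (T + 1)
```

<p>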
If <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is not watermarked (i.e., if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is independent of the original key), then by symmetry the rank of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cphi%28%5Ctilde%7BY%7D%2C%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;phi(&#92;tilde{Y},&#92;xi)\" class=\"latex\" \/> is uniformly distributed over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B1%2F%28T+%2B+1%29%2C+2%2F%28T+%2B+1%29%2C+.+.+.+%2C+1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{1\/(T + 1), 2\/(T + 1), . . . , 1}\" class=\"latex\" \/> since each <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Ctilde%7BY%7D%2C%5Cxi_t%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;tilde{Y},&#92;xi_t)\" class=\"latex\" \/> constitutes an exchangeable copy of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Ctilde%7BY%7D%2C%5Cxi%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;tilde{Y},&#92;xi)\" class=\"latex\" \/>. This rank therefore amounts to a p-value with respect to the null hypothesis that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{Y}\" class=\"latex\" \/> is not watermarked, enabling exact control over false positive errors from the test.<\/p>\n\n\n\n<p>Already, this test is robust to a user swapping some fraction of tokens in a watermarked text with synonyms, since the remaining token identities will still exhibit correlation with the watermark key. 
However, it is not robust to inserting or deleting tokens (e.g., inserting a filler word) since these types of edits will cause misalignment between the entire text and the key. We address this issue in the paper by using techniques for robust sequence alignment to align a putative watermarked text to the watermark key sequence. Namely, we define a variation of Levenshtein \u201cdistance\u201d between token sequences and watermark keys and use it in place of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cell_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;ell_1\" class=\"latex\" \/> distance in the previous equation. The Levenshtein distance between two strings is the minimum number of insertions, deletions, and substitutions needed to transform one into the other; we can compute our variant efficiently using dynamic programming, substantially enhancing the robustness of our test to more general kinds of editing.<\/p>\n\n\n\n<p>The main distinguishing feature of statistical watermarking in general is reliability: as we will see shortly, we can identify watermarked text from just a handful of tokens while provably controlling the likelihood of a false positive. What distinguishes our watermarking scheme in particular is that it is both distortion-free <em>and<\/em> robust. Other schemes [KGW+&#8217;23; AK&#8217;23; CGZ&#8217;23]\u2014at least, all those prior and contemporaneous to ours\u2014determine the next watermarked token based on a hash of the most recent tokens for some window size (illustrated below). The provider commits to a random hash function beforehand, which serves as the watermark key. 
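<\/p>\n\n\n\n<p>Such a hashing-based scheme can be sketched roughly as follows (a simplification in the spirit of [KGW+&#8217;23]; the window size, green-list fraction, and bias below are illustrative parameters of our own choosing, not values from any particular paper):<\/p>

```python
import hashlib

import numpy as np

def biased_logits(logits, prev_tokens, secret, k=2, gamma=0.5, delta=2.0):
    """Bias logits toward a pseudorandom 'green list' of tokens
    seeded by a keyed hash of the last k tokens."""
    window = ",".join(map(str, prev_tokens[-k:]))
    seed = hashlib.sha256(f"{secret}|{window}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed[:8], "little"))
    green = rng.choice(len(logits), size=int(gamma * len(logits)), replace=False)
    out = np.asarray(logits, dtype=float).copy()
    out[green] += delta  # detection later counts how many tokens are green
    return out
```

<p>Editing any token inside the hash window changes the seed, and with it the green list used for the next token.<\/p>\n\n\n\n<p>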
Using a hash function as the watermark key introduces a direct trade-off between robustness and distortion: increasing the window size hurts robustness (since replacing any one of the tokens in the previous window will obfuscate the watermark in the current token), but small window sizes increase the likelihood of a hash collision, skewing the distribution of watermarked text away from the original language model. (We give some examples of this skewed behavior in our paper.)<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"724\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?fit=678%2C307&amp;ssl=1\" alt=\"\" class=\"wp-image-1161\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?resize=300%2C136&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?resize=1024%2C463&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?resize=768%2C348&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?resize=1536%2C695&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-5.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><figcaption class=\"wp-element-caption\">Hashing-based watermarks.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Evaluating its effectiveness<\/h2>\n\n\n\n<p><strong>In theory&#8230;<\/strong><\/p>\n\n\n\n<p>We can reliably detect watermarked text from just a few tokens. 
In particular, suppose we generate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> watermarked tokens <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> using a key <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> of length <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>, and let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi&#039;\" class=\"latex\" \/> be an independent copy of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> (e.g., one of the copies we use in the permutation test). We show in the paper that&nbsp;<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BP%7D%28%5Cphi%28Y%2C%5Cxi%29+%3E+%5Cphi%28Y%2C%5Cxi%27%29%29+%5Cleq+n+%5Cexp%28-%5COmega%28m%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{P}(&#92;phi(Y,&#92;xi) &gt; &#92;phi(Y,&#92;xi&#039;)) &#92;leq n &#92;exp(-&#92;Omega(m))\" class=\"latex\" \/><\/p>\n\n\n\n<p>In other words, applying a basic concentration argument (i.e., a Hoeffding bound), we should expect to obtain p-values for watermarked text that are exponentially small in the length of the text. 
The factor of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> in the bound appears because we compare <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> to all length-<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> subsequences of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> (we apply a union bound over the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> possible such subsequences).&nbsp;<\/p>\n\n\n\n<p>One way to understand the effectiveness of watermarking is by analogy to a randomly generated error-correcting code. We generate m watermarked tokens from a \u201ccodeword\u201d of length <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> that is a subsequence of the randomly generated \u201ccodebook\u201d <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/>. This process is inherently noisy, both because the tokens are not perfectly correlated with the initial codeword and because the user may edit the text before we observe it. But so long as the edits preserve some sufficiently large fraction of the tokens, we should be able to align the text to the original codeword to detect the watermark.<\/p>\n\n\n\n<p>One caveat is that watermarking is only effective on language models with sufficient entropy. 
In the above analysis, we treated the per-token entropy of the language model as a hidden constant. As entropy tends to zero it becomes more difficult to detect the watermark. This trade-off is inevitable for any distortion-free watermark (not just ours). As an extreme example, if a user asks a language model to recite the alphabet, then the (correct) output will be the same regardless of whether we apply a watermark. To formalize this intuition, let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;xi\" class=\"latex\" \/> be the watermark key and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y%27+%5Cperp+%5Cxi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y&#039; &#92;perp &#92;xi\" class=\"latex\" \/> be regular text sampled from an unwatermarked language model. If <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> is distortion-free (meaning <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y+%5Csim+Y%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y &#92;sim Y&#039;\" class=\"latex\" \/>), then<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_%7B%5Ctext%7BKL%7D%7D%28%28Y%2C%5Cxi%29+%7C%7C+%28Y%27%2C%5Cxi%29%29+%3D+I%28Y%3B%5Cxi%29+%5Cleq+H%28Y%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_{&#92;text{KL}}((Y,&#92;xi) || (Y&#039;,&#92;xi)) = I(Y;&#92;xi) &#92;leq H(Y)\" class=\"latex\" \/>.<\/p>\n\n\n\n<p>This bound implies we cannot reliably tell apart watermarked versus non-watermarked text (i.e., <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> versus <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=Y%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y&#039;\" class=\"latex\" \/>) from a low-entropy language model even with the watermark key.<\/p>\n\n\n\n<p><strong>&#8230;and in practice<\/strong><\/p>\n\n\n\n<p>We benchmark our watermarks by using the Llama 7B language model to generate completions of random C4 prompts (Figure 5). Throughout, we use \u201cITS\u201d to refer to the inverse transform sampling watermark we have described above and \u201cITS-edit\u201d to refer to the more robust version thereof using Levenshtein distance. We also tried using exponential minimum sampling (proposed by [AK&#8217;23]) as an alternative to inverse transform sampling\u2014i.e., \u201cEXP\u201d and \u201cEXP-edit\u201d. We compare the effectiveness of our watermarks against the prior work of [KGW+&#8217;23]. That said, the comparison is not quite apples to apples since the watermarks of [KGW+&#8217;23] are not distortion-free; in particular, they bias the logits of the language model by a certain amount (larger <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> corresponds to more bias) in a direction based on a hash of previous tokens.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"241\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-2.15.00-PM-4.png?resize=678%2C241&#038;ssl=1\" alt=\"\" class=\"wp-image-1215\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-2.15.00-PM-4.png?w=1286&amp;ssl=1 1286w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-2.15.00-PM-4.png?resize=300%2C107&amp;ssl=1 300w, 
https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/Screenshot-2025-09-18-at-2.15.00-PM-4.png?resize=1024%2C365&amp;ssl=1 1024w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n\n\n\n<p>All watermarks are reliably detectable from fewer than 50 tokens, which in practice amounts to a few short sentences of text. The strongest watermarks (i.e., EXP and KGW-2.0) are robust to randomly substituting large fractions of tokens. However, only EXP-edit is robust to randomly inserting and deleting tokens. In an attempt to simulate more realistic paraphrasing (e.g., a human editing watermarked text), we tried translating watermarked text to various languages (French and Russian) and back to English and found that the EXP-edit watermark is often still detectable. That said, in practice a user can easily evade detection by generating watermarked text in another language (e.g., French) and then translating it to English; even today, we are not aware of any distortion-free watermark that is robust to this kind of evasion attack.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1512\" height=\"632\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?fit=678%2C283&amp;ssl=1\" alt=\"\" class=\"wp-image-1163\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?w=1512&amp;ssl=1 1512w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?resize=300%2C125&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?resize=1024%2C428&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?resize=768%2C321&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2025\/09\/image-7.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw,
678px\" \/><figcaption class=\"wp-element-caption\">Attempting to evade watermark detection via roundtrip translation.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Next steps<\/h2>\n\n\n\n<p>So far in this blog post, we\u2019ve covered the basics of statistical watermarking and seen how effective these watermarks can be. That said, as we\u2019ve already alluded to earlier, there are a number of open questions and challenges remaining.&nbsp;<\/p>\n\n\n\n<p>The watermarks we have proposed are distortion-free and robust. However, detection is expensive; in particular, its runtime scales linearly in the length of the watermark key, which imposes a hard cap on the total number of distortion-free watermarked tokens we can generate. Thus, our watermarks exhibit a trade-off between detection runtime complexity and distortion. Revisiting the earlier analogy to error-correcting codes, the most naive strategy of decoding using a look-up table entails that the runtime of decoding scales linearly with the number of codewords. A recent and exciting line of work has developed families of <em>pseudorandom error-correcting codes<\/em> that are indistinguishable from random codes (like ours) yet admit efficient decoding algorithms [CG&#8217;24; GM&#8217;24; GG&#8217;24]. When applied to watermarking, in principle these codes enable schemes that are simultaneously distortion-free, and computationally efficient, and robust to a (sufficiently small) constant fraction of substitutions, insertions and deletions. In practice, the robustness of these schemes is not great; put differently, there remains much room for future work on improved constructions of pseudorandom codes!<\/p>\n\n\n\n<p>Of course, stepping back one might argue that measuring robustness with respect to the fraction of tokens that are substituted, inserted, or deleted in watermarked text does not capture the kinds of evasion attacks to which we would like to be robust in the real world. 
For example, the attack we mentioned earlier of generating watermarked text in French and then translating to English (perhaps using a much smaller and\/or cheaper model to do the translation step) would result in almost the entire translated token sequence differing from the original, despite being fairly cheap to implement. Other work has proposed recursively paraphrasing watermarked text (again, using a much cheaper model) to evade detection while maintaining the quality of the text. It remains a major open question to what extent robustness to these more general kinds of automated attacks is achievable.<\/p>\n\n\n\n<p>Finally, the kinds of watermarks we have discussed in this blog post are all inapplicable to open-weight and open-source models, since they must be applied by the model provider at generation time. Even setting aside robustness, developing reliable and distortion-free (for some appropriate definition of distortion) watermarks for these kinds of models is an exciting and important direction for future work.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p>[KTHL&#8217;23] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. (2023) Robust Distortion-free Watermarks for Language Models.<br>[KGW+&#8217;23] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. (2023) A Watermark for Large Language Models.<br>[AK&#8217;23] Scott Aaronson and Hendrik Kirchner. (2023) Watermarking GPT Outputs.<br>[CGZ&#8217;24] Miranda Christ, Sam Gunn, and Or Zamir. (2024) Undetectable Watermarks for Language Models.<br>[CG&#8217;24] Miranda Christ and Sam Gunn. (2024) Pseudorandom Error-Correcting Codes.<br>[GM&#8217;24] Noah Golowich and Ankur Moitra. (2024) Edit Distance Robust Watermarks for Language Models.<br>[GG&#8217;24] Surendra Ghentiyala and Venkatesan Guruswami.
(2024) New Constructions of Pseudorandom Codes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Watermarking of language models is a challenging and important problem, where theory and algorithms play an essential role. Rohith Kuditipudi guides us through some of the latest advancements in the area. The provenance problem Language models today are able to mass produce fluent, human-like text. This reality poses unprecedented challenges for establishing text provenance while [&hellip;]<\/p>\n","protected":false},"author":19,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-1153","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/1153","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=1153"}],"version-history":[{"count":68,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/1153\/revisions"}],"predecessor-version":[{"id":1242,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/1153\/revisions\/1242"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=
1153"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=1153"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=1153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}