{"id":751,"date":"2024-09-11T13:00:00","date_gmt":"2024-09-11T13:00:00","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=751"},"modified":"2024-09-12T02:17:54","modified_gmt":"2024-09-12T02:17:54","slug":"one-inclusion-graphs-and-the-optimal-sample-complexity-of-pac-learning-part-2","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2024\/09\/11\/one-inclusion-graphs-and-the-optimal-sample-complexity-of-pac-learning-part-2\/","title":{"rendered":"One-Inclusion Graphs and the Optimal Sample Complexity of PAC Learning: Part 2"},"content":{"rendered":"\n<p>We&#8217;re back with the second blog post by&nbsp;<a href=\"https:\/\/ishaqadenali.github.io\/\">Ishaq Aden-Ali<\/a>,&nbsp;<a href=\"https:\/\/yeshwanth94.github.io\/\">Yeshwanth Cherapanamjeri<\/a>,&nbsp;<a href=\"https:\/\/ashettyv.github.io\/\">Abhishek Shetty<\/a>, and&nbsp;<a href=\"https:\/\/sites.google.com\/view\/nikitazhivotovskiy\/\">Nikita Zhivotovskiy<\/a>, continuing on the optimal sample complexity of PAC learning. If you missed the first post, check it out <a href=\"https:\/\/www.let-all.com\/blog\/2024\/06\/06\/one-inclusion-graphs-and-the-optimal-sample-complexity-of-pac-learning-part-1\/\">here<\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<p>In the last blog post, we saw the transductive model of learning, the one-inclusion graph (OIG) algorithm and how this led to optimal in-expectation bounds for learning VC classes. But, we also saw that the OIG algorithm itself does not achieve non-trivial high probability risk bounds. In this blog post, we will see how to modify the one-inclusion graph algorithm to achieve high probability bounds. 
Towards this, we will see a connection to online learning and a technique known as online-to-batch.<\/p>\n\n\n\n<p>Further, we will connect this back to the notion of exchangeability which, as we mentioned before, will play a key role.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>High Probability PAC Bounds<\/strong><\/h2>\n\n\n\n<p>From the previous blog post, recall that the one-inclusion graph algorithm achieved optimal results in expectation over the training data. That is, the algorithm achieved <\/p>\n\n\n\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D_%7BS%7D+%5Cleft%5B+%5CPr_%7BX+%5Csim+P%7D+%5Cleft%5B%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D%28X%29+%5Cneq+f%5E%5Cstar%28X%29+%5Cright%5D%5Cright%5D++%5Cle+%5Cfrac%7Bd%7D%7Bn%2B1%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}_{S} &#92;left[ &#92;Pr_{X &#92;sim P} &#92;left[&#92;widehat{f}_{&#92;mathrm{OIG}}(X) &#92;neq f^&#92;star(X) &#92;right]&#92;right]  &#92;le &#92;frac{d}{n+1}.\" class=\"latex\" \/><\/p>\n\n\n\n<p>It is natural to ask whether this would imply optimal results in the PAC model. 
That is, when we require <\/p>\n\n\n\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr_%7BS%7D+%5Cleft%5B+%5CPr_%7BX+%5Csim+P%7D+%5Cleft%5B%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D%28X%29+%5Cneq+f%5E%5Cstar%28X%29+%5Cright%5D++%5Cgeq+%5Cepsilon+%5Cright%5D+%5Cle+%5Cdelta+.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr_{S} &#92;left[ &#92;Pr_{X &#92;sim P} &#92;left[&#92;widehat{f}_{&#92;mathrm{OIG}}(X) &#92;neq f^&#92;star(X) &#92;right]  &#92;geq &#92;epsilon &#92;right] &#92;le &#92;delta .\" class=\"latex\" \/><\/p>\n\n\n\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> are parameters. Unfortunately, we saw in the previous blog post that directly implementing the one-inclusion graph algorithm provably does not beat just applying Markov&#8217;s inequality to the expected risk, that is, we only get <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon+%3D+d+%2F+%28n+%5Cdelta%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon = d \/ (n &#92;delta)\" class=\"latex\" \/>. <\/p>\n\n\n\n<p>However, we will now describe a general framework that enables transferring many of these same guarantees to the PAC setting as well. The framework is a variant of the online-to-batch technique to convert strong predictors in the <em>online<\/em> setting to the batch setting of PAC learning. 
A key aspect of this online-to-batch conversion is that rather than building on worst case online guarantees, we will need to rely on the <em>average<\/em> performance of the online algorithm in order to get meaningful PAC guarantees.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Online-to-batch: Deterministic Error Bound<\/h2>\n\n\n\n<p>To describe the online-to-batch framework, we will use the same notation with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> being input and label spaces, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> being a function class of mappings from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/>. 
In the online setting, a learner observes a series of samples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28x_1%2C+y_1%29%2C+%5Cdots%2C+%28x_n%2C+y_n%29+%5Cin+%5Cmathcal%7BX%7D+%5Ctimes+%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(x_1, y_1), &#92;dots, (x_n, y_n) &#92;in &#92;mathcal{X} &#92;times &#92;mathcal{Y}\" class=\"latex\" \/> with the guarantee that the sequence is <em>realizable<\/em> with respect to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>, i.e. there exists a function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star} &#92;in &#92;mathcal{F}\" class=\"latex\" \/> such that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y_i+%3D+f%5E%7B%5Cstar%7D+%28x_i%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_i = f^{&#92;star} (x_i)\" class=\"latex\" \/>. For each <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/>, the task of the learner is to predict the label of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_i\" class=\"latex\" \/> given <em>only<\/em> the data points up to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i+-+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i - 1\" class=\"latex\" \/> and is rewarded if they get it right or penalized if the prediction is incorrect. 
That is, a learner, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" class=\"latex\" \/>, aims to minimize:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BI%7D+%5Cleft%5C%7By_i+%5Cneq+%5Cmathcal%7BA%7D+%28x_i%3B+%5C%7B%28x_j%2C+y_j%29%5C%7D_%7Bj+%3D+1%7D%5E%7Bi+-+1%7D%29+%5Cright%5C%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathbb{I} &#92;left&#92;{y_i &#92;neq &#92;mathcal{A} (x_i; &#92;{(x_j, y_j)&#92;}_{j = 1}^{i - 1}) &#92;right&#92;}.\" class=\"latex\" \/><\/p>\n\n\n\n<p>Note that the key difference between the online and the PAC setting is that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28x_t%2C+y_t%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(x_t, y_t) \" class=\"latex\" \/> can depend fully on the predictions of the learner in the past and on the previous examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28x_1%2C+y_1%29%2C+%5Cdots%2C+%28x_%7Bt-1%7D%2C+y_%7Bt-1%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(x_1, y_1), &#92;dots, (x_{t-1}, y_{t-1})\" class=\"latex\" \/>. This setting corresponds to one in which an adversary, at each time <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"t\" class=\"latex\" \/>, picks <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28x_t+%2C+y_t%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(x_t , y_t)\" class=\"latex\" \/> with full knowledge of the entire history of the interaction so far, potentially with the aim of maximizing the number of mistakes made by the learner. 
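To make this online protocol concrete, here is a minimal Python sketch of the interaction, using the classical halving algorithm (majority vote over the hypotheses still consistent with the revealed labels) as the learner. The finite class of one-dimensional thresholds below is our own illustrative choice, not something from the post; halving is just one example of a mistake-bounded strategy.

```python
from math import log2

def halving_predict(version_space, x):
    """Predict by majority vote over hypotheses still consistent with the past."""
    votes = sum(h(x) for h in version_space)
    return 1 if 2 * votes >= len(version_space) else 0

def run_online(hypotheses, stream):
    """Online protocol: predict, observe the label, count mistakes, shrink the version space."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        if halving_predict(version_space, x) != y:
            mistakes += 1
        # keep only hypotheses consistent with the revealed label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes

# A small finite class: thresholds h_t(x) = 1{x >= t} on {0, ..., 15}.
thresholds = [(lambda x, t=t: int(x >= t)) for t in range(16)]
f_star = thresholds[7]                 # realizability: labels come from the class
xs = [3, 12, 7, 9, 0, 15, 6, 8, 7, 2]
stream = [(x, f_star(x)) for x in xs]

m = run_online(thresholds, stream)
print(m, "mistakes; halving makes at most", log2(len(thresholds)))
```

On a mistake, at least half of the version space is eliminated, which is why halving makes at most log2 of the class size many mistakes on any realizable sequence.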
A learner that achieves low error in this setting <em>learns<\/em> the class in a very strong sense.<\/p>\n\n\n\n<p>Suppose now that we have access to a good prediction strategy for the online setting. Formally, consider a strategy, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" class=\"latex\" \/>, that satisfies the following for any <em>realizable<\/em> sequence, for some <em>deterministic and fixed<\/em> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=M+%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"M &gt; 0\" class=\"latex\" \/>:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BI%7D+%5Cleft%5C%7By_i+%5Cneq+%5Cmathcal%7BA%7D+%28x_i%3B+%5C%7B%28x_j%2C+y_j%29%5C%7D_%7Bj+%3D+1%7D%5E%7Bi+-+1%7D%29+%5Cright%5C%7D+%5Cleq+M.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathbb{I} &#92;left&#92;{y_i &#92;neq &#92;mathcal{A} (x_i; &#92;{(x_j, y_j)&#92;}_{j = 1}^{i - 1}) &#92;right&#92;} &#92;leq M.\" class=\"latex\" \/><\/p>\n\n\n\n<p>Can we now use this algorithm to derive one for the PAC setting? To see how such results transfer to the PAC setting, consider now the simpler setting where the sequence <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+Y_1%29%2C+%5Cdots%2C+%28X_n%2C+Y_n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, Y_1), &#92;dots, (X_n, Y_n)\" class=\"latex\" \/> is such that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_t\" class=\"latex\" \/> are drawn i.i.d. 
from a distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y_t+%3D+f%5E%7B%5Cstar%7D+%28X_t%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y_t = f^{&#92;star} (X_t)\" class=\"latex\" \/> for some function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star} &#92;in &#92;mathcal{F}\" class=\"latex\" \/>. Now, suppose we were to run our algorithm <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" class=\"latex\" \/> on this randomly generated sequence and obtain a series of predictions <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi%7D+%28%5Ccdot%29+%3D+%5Cmathcal%7BA%7D+%28%5Ccdot+%3B+%5C%7B%28X_j%2C+Y_j%29%5C%7D_%7Bj+%3D+1%7D%5E%7Bi+-+1%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i} (&#92;cdot) = &#92;mathcal{A} (&#92;cdot ; &#92;{(X_j, Y_j)&#92;}_{j = 1}^{i - 1})\" class=\"latex\" \/>. How would these perform on the true underlying distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>? 
As a first attempt towards this, consider the <em>average<\/em> risk of the classifiers <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i}\" class=\"latex\" \/>, where we use the fact that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" class=\"latex\" \/> is a strong <em>online<\/em> predictor:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+%3D+%5Csum_%7Bi+%3D+1%7D%5En+%5Cleft%5B+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+-+%5Cmathbb%7BI%7D+%5Cleft%5C%7BY_i+%5Cneq+f%5E%7Bi+-+1%7D+%28X_i%29+%5Cright%5C%7D+%5Cright%5D+%2B+%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BI%7D+%5Cleft%5C%7BY_i+%5Cneq+f%5E%7Bi+-+1%7D+%28X_i%29+%5Cright%5C%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathcal{R} (f^{i - 1}) = &#92;sum_{i = 1}^n &#92;left[ &#92;mathcal{R} (f^{i - 1}) - &#92;mathbb{I} &#92;left&#92;{Y_i &#92;neq f^{i - 1} (X_i) &#92;right&#92;} &#92;right] + &#92;sum_{i = 1}^n &#92;mathbb{I} &#92;left&#92;{Y_i &#92;neq f^{i - 1} (X_i) &#92;right&#92;} \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cleq+M+%2B+%5Csum_%7Bi+%3D+1%7D%5En+%5Cleft%5B+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+-+%5Cmathbb%7BI%7D+%5Cleft%5C%7B+Y_i+%5Cneq+f%5E%7Bi+-+1%7D+%28X_i%29+%5Cright%5C%7D+%5Cright%5D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;leq M + &#92;sum_{i = 1}^n &#92;left[ &#92;mathcal{R} (f^{i - 1}) - &#92;mathbb{I} &#92;left&#92;{ Y_i &#92;neq f^{i - 1} (X_i) &#92;right&#92;} &#92;right] \" class=\"latex\" \/><\/p>\n\n\n\n<p>The second term above corresponds to the <em>generalization<\/em> <em>error<\/em>. Given the above decomposition, we are left with two additional issues. 
The first is that the above would simply bound the <em>average<\/em> risk of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i}\" class=\"latex\" \/>. The second issue is obtaining a bound on the Generalization Term above. As we will see, the first issue is easy to resolve. However, for the second, it is not immediately clear why the Generalization Term is small. We first see that it is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0\" class=\"latex\" \/> <em>in expectation<\/em>. To see why, note that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_i+%2C+Y_i%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_i , Y_i)\" class=\"latex\" \/> is independent of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%28X_1%2C+Y_1%29+%5Cdots+%28X_%7Bi-1%7D+%2CY_%7Bi-1%7D%29%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{(X_1, Y_1) &#92;dots (X_{i-1} ,Y_{i-1})&#92;}\" class=\"latex\" \/>, and thus evaluating the loss on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_i%2C+Y_i%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_i, Y_i)\" class=\"latex\" \/> recovers the risk in expectation. However, why should it <em>concentrate<\/em>? In the expression, each term in the sum is the difference between the <em>population risk<\/em> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi-1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i-1}\" class=\"latex\" \/> and the loss evaluated on a <em>single<\/em> fresh sample drawn from the distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>. 
Since the sample is drawn <em>independently<\/em> in each step, this can be utilized to establish the concentration of the generalization term.<\/p>\n\n\n\n<p><strong>Lemma<\/strong>: Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_1%2C+%5Cdots%2C+W_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_1, &#92;dots, W_n\" class=\"latex\" \/> be a sequence of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C+1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0, 1&#92;}\" class=\"latex\" \/> random variables adapted to a filtration <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BZ%7D_1+%5Csubseteq+%5Cmathcal%7BZ%7D_2+%5Csubseteq+%5Cdots+%5Csubseteq+%5Cmathcal%7BZ%7D_%7Bn+-+1%7D+%5Csubseteq+%5Cmathcal%7BZ%7D_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Z}_1 &#92;subseteq &#92;mathcal{Z}_2 &#92;subseteq &#92;dots &#92;subseteq &#92;mathcal{Z}_{n - 1} &#92;subseteq &#92;mathcal{Z}_n\" class=\"latex\" \/>. 
Then, we have:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+W_i+%5Cleq+4+%5Cleft%28+%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BE%7D+%5BW_i+%5Cmid+%5Cmathcal%7BZ%7D_%7Bi+-+1%7D%5D+%2B+%5Clog+%282+%2F+%5Cdelta%29+%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n W_i &#92;leq 4 &#92;left( &#92;sum_{i = 1}^n &#92;mathbb{E} [W_i &#92;mid &#92;mathcal{Z}_{i - 1}] + &#92;log (2 \/ &#92;delta) &#92;right)\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+W_i+%5Cgeq+%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BE%7D+%5BW_i+%5Cmid+%5Cmathcal%7BZ%7D_%7Bi+-+1%7D%5D+-+2+%5Clog+%282+%2F+%5Cdelta%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n W_i &#92;geq &#92;frac{1}{2}&#92;sum_{i = 1}^n &#92;mathbb{E} [W_i &#92;mid &#92;mathcal{Z}_{i - 1}] - 2 &#92;log (2 \/ &#92;delta)\" class=\"latex\" \/><\/p>\n\n\n\n<p>with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/>.<\/p>\n\n\n\n<p>As a consequence, we get:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+%5Cleq+5+%28M+%2B+%5Clog+%281+%2F+%5Cdelta%29%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathcal{R} (f^{i - 1}) &#92;leq 5 (M + &#92;log (1 \/ &#92;delta)),\" class=\"latex\" \/><\/p>\n\n\n\n<p>with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/>. 
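The lemma can be sanity-checked numerically. The following sketch is our own: it draws adapted 0/1 sequences whose conditional means depend on the past (the particular form of the conditional mean is an arbitrary illustrative choice) and checks how often both inequalities hold with the stated constants.

```python
import random
from math import log

def one_trial(rng, n=200, delta=0.1):
    """Draw an adapted {0,1} sequence whose conditional means depend on the past,
    then check both inequalities of the lemma on this trial."""
    w_sum = 0
    cond_mean_sum = 0.0
    prev = 0
    for _ in range(n):
        p = 0.1 + 0.4 * prev   # E[W_i | past]; this particular form is an arbitrary choice
        cond_mean_sum += p
        w = 1 if rng.random() < p else 0
        w_sum += w
        prev = w
    upper = 4 * (cond_mean_sum + log(2 / delta))
    lower = 0.5 * cond_mean_sum - 2 * log(2 / delta)
    return lower <= w_sum <= upper

rng = random.Random(0)
results = [one_trial(rng) for _ in range(500)]
print(sum(results), "of", len(results), "trials satisfy both bounds")
```

Since the lemma only promises the bounds with probability at least 1 - delta, the meaningful check is that the fraction of passing trials is at least that large.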
Hence, beyond the average risk of the classifiers <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/> being small <em>in expectation<\/em>, it is also small with <em>high probability<\/em>. To construct a predictor from the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/>, we can simply take the <em>majority<\/em> vote. To see why <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bf%7D+%3D+%5Cmathrm%7BMaj%7D+%28%5C%7Bf%5Ei%5C%7D_%7Bi+%3D+1%7D%5En%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{f} = &#92;mathrm{Maj} (&#92;{f^i&#92;}_{i = 1}^n)\" class=\"latex\" \/> has small risk, observe the following:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BR%7D+%28%5Chat%7Bf%7D%29+%3D+%5Cmathbb%7BE%7D_%7B%28X%2C+Y%29+%5Csim+%5Cmathcal%7BD%7D%7D+%5B%5Cmathbb%7BI%7D+%5Cleft%5C%7B%5Chat%7Bf%7D+%28X%29+%5Cneq+Y%5Cright%5C%7D%5D+%5Cleq+%5Cfrac%7B2%7D%7Bn%7D+%5Cmathbb%7BE%7D_%7B%28X%2C+Y%29+%5Csim+%5Cmathcal%7BD%7D%7D+%5Cleft%5B%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BI%7D+%5Cleft%5C%7B+f%5Ei+%28X%29+%5Cneq+Y%5Cright%5C%7D%5Cright%5D+%5Cleq+10+%5Cleft%28%5Cfrac%7BM+%2B+%5Clog+%281+%2F+%5Cdelta%29%7D%7Bn%7D+%5Cright%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{R} (&#92;hat{f}) = &#92;mathbb{E}_{(X, Y) &#92;sim &#92;mathcal{D}} [&#92;mathbb{I} &#92;left&#92;{&#92;hat{f} (X) &#92;neq Y&#92;right&#92;}] &#92;leq &#92;frac{2}{n} &#92;mathbb{E}_{(X, Y) &#92;sim &#92;mathcal{D}} &#92;left[&#92;sum_{i = 1}^n &#92;mathbb{I} &#92;left&#92;{ f^i (X) &#92;neq Y&#92;right&#92;}&#92;right] &#92;leq 10 &#92;left(&#92;frac{M + &#92;log (1 \/ &#92;delta)}{n} &#92;right),\" class=\"latex\" \/><\/p>\n\n\n\n<p>as the majority 
classifier <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bf%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{f}\" class=\"latex\" \/> only makes an error when more than half of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/> make an error. This online-to-batch conversion was first discovered by Nick Littlestone a few decades ago [Lit89]. <sup data-fn=\"1554d7be-50a0-4bdb-9951-a1191af31329\" class=\"fn\"><a href=\"#1554d7be-50a0-4bdb-9951-a1191af31329\" id=\"1554d7be-50a0-4bdb-9951-a1191af31329-link\">1<\/a><\/sup> Given the above result, we would be done if there existed good prediction strategies for the online setting. Unfortunately, this is not the case.<\/p>\n\n\n\n<p>Even for simple hypothesis classes, such as linear thresholds in one dimension, <em>any<\/em> online prediction strategy must incur error <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5COmega+%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Omega (n)\" class=\"latex\" \/> on <em>some<\/em> realizable sequence.<sup data-fn=\"b6995c1e-06cf-4a95-b720-746d7d76d7ef\" class=\"fn\"><a href=\"#b6995c1e-06cf-4a95-b720-746d7d76d7ef\" id=\"b6995c1e-06cf-4a95-b720-746d7d76d7ef-link\">2<\/a><\/sup>  This precludes the possibility of obtaining good PAC guarantees using <em>worst-case<\/em> guarantees of online learning algorithms. 
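As an illustration of the conversion described above, the following self-contained sketch (our own construction, with a halving-style learner over a finite threshold class as an illustrative assumption) runs an online learner over an i.i.d. realizable stream, snapshots each intermediate predictor, and returns the majority vote as the batch predictor.

```python
import random

def make_thresholds(m):
    # a finite class of one-dimensional thresholds h_t(x) = 1{x >= t}
    return [(lambda x, t=t: int(x >= t)) for t in range(m)]

def online_to_batch(hypotheses, stream):
    """Run a halving-style online learner over the stream, snapshot the predictor
    f^{i-1} before each round, and return the majority vote over all snapshots."""
    vs = list(hypotheses)
    snapshots = []
    for x, y in stream:
        frozen = list(vs)
        # f^{i-1}: majority vote over the version space before seeing (x, y)
        snapshots.append(lambda z, h=frozen: int(2 * sum(g(z) for g in h) >= len(h)))
        vs = [g for g in vs if g(x) == y]          # halving update on the revealed label
    return lambda z: int(2 * sum(f(z) for f in snapshots) >= len(snapshots))

rng = random.Random(1)
H = make_thresholds(50)
f_star = H[23]                                     # realizable: labels come from the class
xs = [rng.randrange(50) for _ in range(200)]
f_hat = online_to_batch(H, [(x, f_star(x)) for x in xs])

# evaluate the batch predictor over the whole domain (uniform test distribution)
risk = sum(f_hat(x) != f_star(x) for x in range(50)) / 50
print("risk of the majority-vote predictor:", risk)
```

Only the handful of early snapshots, trained on very little data, can err at any fixed point, so the majority vote over all snapshots has small risk, mirroring the bound above.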
In the next section, we discuss how this limitation can be overcome in the PAC setting, by exploiting the <em>exchangeability<\/em> of the input sequences, allowing us to take advantage of the <em>beyond worst-case<\/em> nature of our sequences.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Online-to-batch: PAC Error Bound<\/h2>\n\n\n\n<p>Given the poor worst-case error guarantees of online learning algorithms, it is natural to conclude that the online-to-batch framework cannot yield non-trivial conclusions for the PAC setting. However, a more careful analysis of the argument from the previous section shows we only require an online learning algorithm that performs well on sequences <em>generated i.i.d. from a distribution<\/em>. One might think that one immediately needs to worry about &#8220;average-case&#8221; instances that come from a distribution and this would potentially lead to a PAC guarantee. But it turns out in most cases, the choice of the instances <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_1%2C+%5Cdots%2C+x_n+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_1, &#92;dots, x_n \" class=\"latex\" \/> is less important than the <em>order<\/em> in which they are presented to the learner. But for the online-to-batch framework in the previous section, the fact that the error bound <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"M\" class=\"latex\" \/> works for any sequence (in any order) of examples is crucial. 
Another way of seeing this is to note that we had to bound<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathbb%7BI%7D++%5Cleft%5C%7BY_i+%5Cneq+f%5E%7Bi+-+1%7D+%28X_i%29%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathbb{I}  &#92;left&#92;{Y_i &#92;neq f^{i - 1} (X_i)&#92;right&#92;}\" class=\"latex\" \/><\/p>\n\n\n\n<p>with high probability over the choice of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_i\" class=\"latex\" \/>. If one tried to apply standard martingale type arguments to this, one would be required to bound<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%5B%5Cmathbb%7BI%7D+%5Cleft%5C%7BY_i+%5Cneq+f%5E%7Bi+-+1%7D+%28X_i%29+%5Cright%5C%7D+%7C+%28X_1%2C+Y_1%29%2C%5Cdots%2C+%28X_%7Bi-1%7D+%2C+Y_%7Bi-1%7D%29+%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}[&#92;mathbb{I} &#92;left&#92;{Y_i &#92;neq f^{i - 1} (X_i) &#92;right&#92;} | (X_1, Y_1),&#92;dots, (X_{i-1} , Y_{i-1}) ]\" class=\"latex\" \/><\/p>\n\n\n\n<p>in terms of the VC dimension. Note that in this argument, we would need to condition on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2CY_1%29%2C+%5Cdots+%28X_%7Bi-1%7D+%2C+Y_%7Bi-1%7D%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1,Y_1), &#92;dots (X_{i-1} , Y_{i-1}) \" class=\"latex\" \/>. This would nullify any simple hope that one might have to use the average case nature of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_i\" class=\"latex\" \/>.<\/p>\n\n\n\n<p>But, let us try to recall what type of guarantees we have already seen from transductive learning. 
Note that one can interpret the OIG as an algorithm <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" class=\"latex\" \/> that gives a guarantee akin to<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B1%7D%7Bi%21%7D+%5Csum_%7B+%5Cpi+%7D+%5Cmathbb%7BI%7D+%5Cleft%5B+Y_%7B%5Cpi%28i%29%7D+%5Cneq+%5Cmathcal%7BA%7D+%28+X_%7B+%5Cpi%28i%29+%7D+%3B+%5C%7B+%28X_%7B+%5Cpi%28+k%29+%7D+%2C+Y_%7B%5Cpi%28k%29%7D+%29+%5C%7D_%7B+%5Cpi%28k%29+%5Cneq+%5Cpi%28i%29+%7D+%29+%5Cright%5D+%5Cleq+%5Cfrac%7Bd%7D%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{1}{i!} &#92;sum_{ &#92;pi } &#92;mathbb{I} &#92;left[ Y_{&#92;pi(i)} &#92;neq &#92;mathcal{A} ( X_{ &#92;pi(i) } ; &#92;{ (X_{ &#92;pi( k) } , Y_{&#92;pi(k)} ) &#92;}_{ &#92;pi(k) &#92;neq &#92;pi(i) } ) &#92;right] &#92;leq &#92;frac{d}{i}\" class=\"latex\" \/><\/p>\n\n\n\n<p>The sum is over all permutations. But this seems to be insufficient for what we need: to naively apply the martingale argument, we don&#8217;t get to average over the choice of the test point again.<\/p>\n\n\n\n<p>But the above way of thinking about the problem is still useful. The first high level point is that, in order to evaluate the loss and use the average case nature of the instances, ideally we would like to condition on as little as possible. This would allow us to preserve more randomness in order to use the average-case nature of the problem. So, what should we condition on?<\/p>\n\n\n\n<p>The first observation is a key aspect that mathematically characterizes why i.i.d. sequences are unlikely to behave like worst-case ones. This is the notion of <em>exchangeability<\/em>, i.e. that i.i.d. sequences behave as if presented in random order. 
Recall that a sequence of random variables <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+%5Cdots%2C+X_n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, &#92;dots, X_n)\" class=\"latex\" \/> is <em>exchangeable<\/em> if for any permutation <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/>, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_%7B%5Cpi+%281%29%7D%2C+%5Cdots%2C+X_%7B%5Cpi+%28n%29%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_{&#92;pi (1)}, &#92;dots, X_{&#92;pi (n)})\" class=\"latex\" \/> has the same distribution as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+%5Cdots%2C+X_n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, &#92;dots, X_n)\" class=\"latex\" \/>. Since each element of an i.i.d. sequence is independently drawn from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>, these sequences are also <em>exchangeable<\/em>. 
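The definition can be checked directly for a small discrete i.i.d. sequence. The sketch below is our own: it computes the probability of every ordered outcome under an arbitrary illustrative pmf and verifies that the probability is unchanged under every permutation of the coordinates, which is exactly exchangeability.

```python
from itertools import product, permutations

# an arbitrary pmf on a small alphabet (illustrative choice)
p = {"a": 0.5, "b": 0.3, "c": 0.2}
n = 3

def joint(seq):
    """Probability that (X_1, ..., X_n) equals seq, for X_i i.i.d. with pmf p."""
    prob = 1.0
    for s in seq:
        prob *= p[s]
    return prob

# exchangeability: permuting the coordinates leaves the joint distribution unchanged
for seq in product(p, repeat=n):
    for perm in permutations(range(n)):
        permuted = tuple(seq[i] for i in perm)
        assert abs(joint(seq) - joint(permuted)) < 1e-12

print("verified exchangeability for all", 3 ** n, "ordered outcomes")
```

For the i.i.d. case the check succeeds simply because a product of pmf values does not depend on the order of its factors, which is the content of the observation in the text.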
This hints at the fact that in order to evaluate quantities under an i.i.d. sequence, it might suffice to condition on the values <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%28x_1%2C+y_1%29%2C+%5Cdots%2C+%28x_n%2C+y_n%29%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{(x_1, y_1), &#92;dots, (x_n, y_n)&#92;}\" class=\"latex\" \/> in the sequence <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+Y_1%29%2C+%5Cdots%2C+%28X_n%2C+Y_n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, Y_1), &#92;dots, (X_n, Y_n)\" class=\"latex\" \/> but not their <em>order<\/em>.<\/p>\n\n\n\n<p>We have observed that for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_i\" class=\"latex\" \/>, the order does not matter. But this alone is not sufficient. Ideally, whatever we decide to condition on, we would still like to be able to evaluate the loss <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BI%7D+%5B+Y_j+%5Cneq+A%28X_j+%3B+%5C%7B+%28X_1%2C+Y_1%29+%2C+%5Cdots+%28X_%7Bj-1%7D+%2C+Y_%7Bj-1%7D+%29+%5C%7D+%29+%5D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{I} [ Y_j &#92;neq A(X_j ; &#92;{ (X_1, Y_1) , &#92;dots (X_{j-1} , Y_{j-1} ) &#92;} ) ] \" class=\"latex\" \/>. This, a priori, could depend on the order of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+Y_1%29+%2C+%5Cdots+%28X_%7Bj-1%7D+%2C+Y_%7Bj-1%7D+%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, Y_1) , &#92;dots (X_{j-1} , Y_{j-1} )\" class=\"latex\" \/> since the algorithm is allowed to use the data set in whatever order it would like. But wait! We noted earlier that the OIG algorithm <em>does not<\/em> depend on the order, i.e. it is permutation invariant. 
Thus, it seems like we should condition on the set of data points <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7BX_1%2C+%5Cdots+%2C+X_i%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{X_1, &#92;dots , X_i&#92;}\" class=\"latex\" \/>, and importantly not on their order.<\/p>\n\n\n\n<p>To formalize this notion, we introduce some notation that clarifies the conditioning in our argument. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/> be a uniformly random permutation of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B1%2C+%5Cdots%2C+n%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{1, &#92;dots, n&#92;}\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/> be some prediction algorithm, depending only on the first <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/> examples, that we will choose later. Further, define<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BZ%7D_i+%3D+%5C%7B%5Cpi+%281%29%2C+%5Cdots%2C+%5Cpi+%28i%29%5C%7D+%5Ctext%7B+and+%7D+W_i+%3D+%5Cmathbb%7BI%7D+%5Cleft%5C%7Bf%5Ei+%28x_%7B%5Cpi+%28i+%2B+1%29%7D%29+%5Cneq+y_%7B%5Cpi%28i%2B1%29%7D+%5Cright%5C%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Z}_i = &#92;{&#92;pi (1), &#92;dots, &#92;pi (i)&#92;} &#92;text{ and } W_i = &#92;mathbb{I} &#92;left&#92;{f^i (x_{&#92;pi (i + 1)}) &#92;neq y_{&#92;pi(i+1)} &#92;right&#92;}.\" class=\"latex\" \/><\/p>\n\n\n\n<p>The random quantity <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BZ%7D_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Z}_i\" class=\"latex\" \/> in the above expression concerns the specific elements in the first <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/> positions of <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7B%5Cpi+%281%29%7D%2C+%5Cdots%2C+x_%7B%5Cpi+%28n%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{&#92;pi (1)}, &#92;dots, x_{&#92;pi (n)}\" class=\"latex\" \/> but crucially not on their order nor the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28i+%2B+1%29%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(i + 1)^{th}\" class=\"latex\" \/> element. In order for the above conditioning to be useful, ideally <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_i\" class=\"latex\" \/> depends only on the set of elements and not the order.<\/p>\n\n\n\n<p>This would happen if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/> is permutation invariant. Hence, conditioning on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BZ%7D_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Z}_i\" class=\"latex\" \/>, the value of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_i\" class=\"latex\" \/> is determined as it does not depend on the <em>ordering<\/em> of the first <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/> elements.<\/p>\n\n\n\n<p>We could now apply martingale arguments to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_i\" class=\"latex\" \/>. 
In order to do this, we need to evaluate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D+%5Cleft%5BW_%7Bi+-+1%7D+%5Cmid+%5Cmathcal%7BZ%7D_i+%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E} &#92;left[W_{i - 1} &#92;mid &#92;mathcal{Z}_i &#92;right]\" class=\"latex\" \/>. As we explain next, we have:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D+%5Cleft%5B+W_%7Bi+-+1%7D+%5Cmid+%5Cmathcal%7BZ%7D_i+%5Cright%5D+%5Cleq+M_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E} &#92;left[ W_{i - 1} &#92;mid &#92;mathcal{Z}_i &#92;right] &#92;leq M_i\" class=\"latex\" \/><\/p>\n\n\n\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=M_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"M_i\" class=\"latex\" \/> is the error of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi+-+1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i - 1}\" class=\"latex\" \/> on a randomly chosen test point in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B+x_%7B%5Cpi%281%29%7D+%2C+%5Cdots+%2C+x_%7B%5Cpi%28i%29%7D%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{ x_{&#92;pi(1)} , &#92;dots , x_{&#92;pi(i)}&#92;}\" class=\"latex\" \/>.<\/p>\n\n\n\n<p>This is due to the fact that conditioned on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BZ%7D_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Z}_i\" class=\"latex\" \/>, the test point for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_%7Bi+-+1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_{i - 1}\" class=\"latex\" \/>, <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7B%5Cpi+%28i%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{&#92;pi (i)}\" class=\"latex\" \/>, is chosen uniformly among the elements in the set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7Bx_%7B%5Cpi+%281%29%7D%2C+%5Cdots%2C+x_%7B%5Cpi+%28i%29%7D%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{x_{&#92;pi (1)}, &#92;dots, x_{&#92;pi (i)}&#92;}\" class=\"latex\" \/>. The key observation is that this exactly corresponds to the error of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ei&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^i\" class=\"latex\" \/> in the transductive model. Thus, we just need to chose a permutation-invariant function that works in the transductive model. As we saw earlier, this is exactly the one-inclusion graph algorithm for which we have the bound:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D+%5Cleft%5B+W_%7Bi+-+1%7D+%5Cmid+%5Cmathcal%7BZ%7D_i+%5Cright%5D+%5Cleq+%5Cfrac%7Bd%7D%7Bi%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E} &#92;left[ W_{i - 1} &#92;mid &#92;mathcal{Z}_i &#92;right] &#92;leq &#92;frac{d}{i}.\" class=\"latex\" \/><\/p>\n\n\n\n<p>Furthermore, since each <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=W_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"W_i\" class=\"latex\" \/> is a function only of the <em>set<\/em> of elements <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7Bx_%7B%5Cpi+%281%29%7D%2C+%5Cdots%2C+x_%7B%5Cpi+%28i%29%7D%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{x_{&#92;pi (1)}, &#92;dots, x_{&#92;pi (i)}&#92;}\" class=\"latex\" \/>, and the test point <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7B%5Cpi+%28i+%2B+1%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{&#92;pi (i + 1)}\" class=\"latex\" \/>, we may apply our concentration inequality again to obtain the following result for the risks:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi+%3D+1%7D%5En+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+%5Cleq+%5Cfrac%7B1%7D%7Bn%7D+%5Csum_i+%5Cfrac%7Bd%7D%7Bi%7D+%2B+%5Cfrac%7B%5Clog%281%2F%5Cdelta%29%7D%7Bn%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i = 1}^n &#92;mathcal{R} (f^{i - 1}) &#92;leq &#92;frac{1}{n} &#92;sum_i &#92;frac{d}{i} + &#92;frac{&#92;log(1\/&#92;delta)}{n}\" class=\"latex\" \/><\/p>\n\n\n\n<p>with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/>. But, this does not seem like it improves over the bound for ERM that we discussed earlier.<\/p>\n\n\n\n<p>One final trick that we need to fix this issue is considering <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{i}\" class=\"latex\" \/> only for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i%3E+n%2F2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i&gt; n\/2\" class=\"latex\" \/>. 
This still allows us to average over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%2F2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\/2\" class=\"latex\" \/> predictors, but we have<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B1%7D%7Bn%7D+%5Csum_%7Bi%3En%2F2%7D+%5Cmathcal%7BR%7D+%28f%5E%7Bi+-+1%7D%29+%5Cleq+%5Cfrac%7B1%7D%7Bn%7D+%5Csum_%7Bi%3En%2F2%7D+%5Cfrac%7Bd%7D%7Bi%7D+%2B+%5Cfrac%7B%5Clog%281%2F%5Cdelta%29%7D%7Bn%7D+%5Cleq+O+%5Cleft%28%5Cfrac%7Bd+%2B+%5Clog%281%2F%5Cdelta%29+%7D%7Bn%7D+%5Cright%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{1}{n} &#92;sum_{i&gt;n\/2} &#92;mathcal{R} (f^{i - 1}) &#92;leq &#92;frac{1}{n} &#92;sum_{i&gt;n\/2} &#92;frac{d}{i} + &#92;frac{&#92;log(1\/&#92;delta)}{n} &#92;leq O &#92;left(&#92;frac{d + &#92;log(1\/&#92;delta) }{n} &#92;right).\" class=\"latex\" \/><\/p>\n\n\n\n<p>Hence, by simply exploiting the <em>permutation-invariance<\/em> of the distribution generating the input sequence, we are able to obtain a high-probability concentration inequality for the mistake bound that is substantially better than the worst-case bound incurred for arbitrary realizable input sequences. This mistake bound along with the discussion in the previous section now yields PAC learners with optimal high-probability guarantees. 
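<\/p>\n\n\n\n<p>The gain from this final trick is easy to check numerically. The following Python sketch (our illustration, not from the original argument) compares the average of the per-round bound d\/i over all n rounds against the average over only the last n\/2 rounds.<\/p>\n\n\n\n

```python
import math

# Numeric check of the tail-averaging trick: the average of d/i over all
# n rounds behaves like d*log(n)/n, while the average over only the last
# n/2 rounds is O(d/n), with no logarithmic factor.
d, n = 10, 100_000
full_avg = sum(d / i for i in range(1, n + 1)) / n
tail_avg = sum(d / i for i in range(n // 2 + 1, n + 1)) / (n // 2)

# Normalize by d/n to expose the constants: the first ratio is about
# log(n), the second is about 2*log(2), independent of n.
print(full_avg * n / d, math.log(n))
print(tail_avg * n / d, 2 * math.log(2))
```

\n\n\n\n<p>Restricting to the second half of the rounds thus removes the logarithmic factor while only affecting constants.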
<\/p>\n\n\n\n<p class=\"has-text-align-left\"><strong>Theorem 3<\/strong> [Optimal Sample Complexity of PAC Learning, [AACSZ23a]] Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> be a domain and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a class of functions from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/> with VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%2C+%5Cdelta+%5Cin+%280%2C1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon, &#92;delta &#92;in (0,1)\" class=\"latex\" \/>. Let<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%3D+%5COmega%5Cleft%28+%5Cfrac%7Bd+%2B+%5Clog+%5Cleft%28+1+%2F+%5Cdelta+%5Cright%29+%7D%7B%5Cepsilon%7D+%5Cright%29.+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n = &#92;Omega&#92;left( &#92;frac{d + &#92;log &#92;left( 1 \/ &#92;delta &#92;right) }{&#92;epsilon} &#92;right). 
\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-left\"><br> Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%5Cleft%5C%7B+%28x_i+%2C+f%5E%7B%5Cstar%7D+%28x_i%29+%29+%5Cright%5C%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = &#92;left&#92;{ (x_i , f^{&#92;star} (x_i) ) &#92;right&#92;} \" class=\"latex\" \/> be a sample of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> drawn i.i.d. from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D+%5Cin+%5Cmathcal%7BF%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star} &#92;in &#92;mathcal{F} \" class=\"latex\" \/>. There is an prediction strategy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bf%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{f}\" class=\"latex\" \/> such that, with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/> over the choice of the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>, we have<\/p>\n\n\n\n<p class=\"has-text-align-left\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr_%7Bx+%5Csim+D+%7D+%5Cleft%5B+%5Chat%7Bf%7D%28x%29+%5Cneq+f%5E%7B%5Cstar%7D%28x%29+%5Cright%5D+%5Cleq+%5Cepsilon.+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr_{x &#92;sim D } &#92;left[ &#92;hat{f}(x) &#92;neq f^{&#92;star}(x) &#92;right] &#92;leq 
&#92;epsilon. \" class=\"latex\" \/><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n\n\n\n<p>In this pair of blog posts, we discussed the problem of binary classification in both the PAC and transductive models of learning. We talked about an elegant learning algorithm in the transductive learning model based on the idea of one-inclusion graphs. We then saw how the classic notion of online-to-batch conversion can be modified in a simple way in order to port strong learning guarantees from the transductive model to the PAC model in an optimal way. Though the focus of this blog post was on binary classification, the ideas we discussed naturally extend to other settings such as multiclass classification and regression, and thus these techniques can be seen as a unifying perspective on optimal PAC bounds in various models of realizable learning. For more details on this, check out the full paper [AACSZ23a] (and maybe also [AACSZ23b]).<\/p>\n\n\n\n<p>An interesting direction for future study is whether the techniques presented here could lead to proofs of optimal PAC bounds in the agnostic setting, where all known such bounds go through so-called chaining techniques. It would be interesting to develop techniques that unify approaches to obtaining optimal PAC bounds in both the realizable and agnostic cases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[Lit89] Nick Littlestone: From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269\u2013284. Morgan Kaufmann, 1989.<\/li>\n\n\n\n<li>[AACSZ23a] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. Optimal PAC bounds without uniform convergence. In Foundations of Computer Science (FOCS), 2023.<\/li>\n\n\n\n<li>[AACSZ23b] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. The one-inclusion graph algorithm is not always optimal. 
In Conference on Learning Theory (COLT), 2023.<\/li>\n<\/ul>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"1554d7be-50a0-4bdb-9951-a1191af31329\">The argument presented here is slightly simpler. In his original argument, Littlestone used a small holdout set to pick the best hypothesis in the sequence instead of taking a majority vote. <a href=\"#1554d7be-50a0-4bdb-9951-a1191af31329-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><li id=\"b6995c1e-06cf-4a95-b720-746d7d76d7ef\">The notion of complexity, analogous to the VC dimension, that captures this setting is known as the Littlestone dimension. It turns out that even for simple classes such as the set of thresholds, this complexity is unbounded. <a href=\"#b6995c1e-06cf-4a95-b720-746d7d76d7ef-link\" aria-label=\"Jump to footnote reference 2\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>We&#8217;re back with the second blog post by&nbsp;Ishaq Aden-Ali,&nbsp;Yeshwanth Cherapanamjeri,&nbsp;Abhishek Shetty, and&nbsp;Nikita Zhivotovskiy, continuing on the optimal sample complexity of PAC learning. If you missed the first post, check it out here. In the last blog post, we saw the transductive model of learning, the one-inclusion graph (OIG) algorithm and how this led to optimal [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"[{\"id\":\"1554d7be-50a0-4bdb-9951-a1191af31329\",\"content\":\"The argument presented here is slightly simpler. 
In his original argument, Littlestone used a small holdout set to pick the best hypothesis in the sequence instead of taking a majority vote.\"},{\"id\":\"b6995c1e-06cf-4a95-b720-746d7d76d7ef\",\"content\":\"The notion of complexity, analogous to the VC dimension, that captures this setting is known as the Littlestone dimension. It turns out that even for simple classes such as the set of thresholds, this complexity is unbounded.\"}]"},"categories":[4],"tags":[],"class_list":["post-751","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/751","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=751"}],"version-history":[{"count":4,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/751\/revisions"}],"predecessor-version":[{"id":839,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/751\/revisions\/839"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=751"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=751"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=751"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}