{"id":661,"date":"2024-06-06T17:03:57","date_gmt":"2024-06-06T17:03:57","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=661"},"modified":"2024-08-06T22:02:34","modified_gmt":"2024-08-06T22:02:34","slug":"one-inclusion-graphs-and-the-optimal-sample-complexity-of-pac-learning-part-1","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2024\/06\/06\/one-inclusion-graphs-and-the-optimal-sample-complexity-of-pac-learning-part-1\/","title":{"rendered":"One-Inclusion Graphs and the Optimal Sample Complexity of PAC Learning: Part 1"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">We&#8217;re back again with another post! In case you missed the first two in the series, on <a href=\"https:\/\/www.let-all.com\/blog\/2024\/03\/13\/calibration-for-decision-making-a-principled-approach-to-trustworthy-ml\/\">calibration<\/a> and <a href=\"https:\/\/www.let-all.com\/blog\/2024\/04\/29\/the-curious-landscape-of-multiclass-learning\/\">multi-class classification<\/a>, please check them out.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Today, we have the first in a two-parter by <a href=\"https:\/\/ishaqadenali.github.io\/\">Ishaq Aden-Ali<\/a>, <a href=\"https:\/\/yeshwanth94.github.io\/\">Yeshwanth Cherapanamjeri<\/a>, <a href=\"https:\/\/ashettyv.github.io\/\">Abhishek Shetty<\/a>, and <a href=\"https:\/\/sites.google.com\/view\/nikitazhivotovskiy\/\">Nikita Zhivotovskiy<\/a>, again on sample-efficient methods for PAC learning. In the first post, they revisit and discuss a <a href=\"https:\/\/arxiv.org\/abs\/2212.09270\">refutation<\/a> of the strong version of Warmuth&#8217;s one-inclusion graph conjecture.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Introduction<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">One of the key contributions of computer science is the formalization of the concept of learning. These definitions try to get at the heart of the question of what patterns can be recognized from examples. In this (and the next) blog post, we will discuss a probabilistic formalization of this question through the Probably Approximately Correct (PAC) model of learning, and we will try to understand how many examples are needed to learn a pattern with high probability over the data generating process, i.e. the <em>sample complexity<\/em>. Towards this goal of characterizing the optimal sample complexity, we will see connections between the PAC model and a model of learning known as transductive learning. Further, we will characterize learning in the transductive setting using an elegant combinatorial argument.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we will connect high probability PAC bounds to transductive learning using a general connection with online learning using a simple argument connecting beyond worst-case analysis of sequential decision-making can lead to sharp bounds in an a priori offline learning setting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Probably Approximately Correct (PAC) Learning<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The basic mathematical setup for learning involves a set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> and a set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> which we will refer to as the domain and the label set respectively. The domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> is the set of all possible examples and the label set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> is the set of all possible labels. An example to keep in mind is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> is the set of all images and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> is the set of all possible labels for the images (e.g. cat, dog, etc.). For simplicity, we will focus on the case where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D+%3D%5Cleft%5C%7B+0%2C1+%5Cright%5C%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y} =&#92;left&#92;{ 0,1 &#92;right&#92;} \" class=\"latex\" \/>. This is referred to as binary classification. Later in the post, we will comment on how the ideas apply to more general settings such as when <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> has larger cardinality or even when <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> is a continuous set such as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cleft%5B+0%2C1+%5Cright%5D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;left[ 0,1 &#92;right] \" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The key modeling assumption is the presumption of the existence of a distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> over the domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> and that the examples are drawn independently from this distribution. This distribution models the idea that the examples are produced from some underlying data generating process which, though unknown, is fixed and does not change over time. The learner has access to labelled examples drawn from this distribution. This set of examples is referred to as the training set.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since one cannot expect to learn arbitrary labelling functions, simplicity of the labelling function is usually encoded by assuming that the labelling function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star}\" class=\"latex\" \/> is chosen from a class of functions <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>, known as the hypothesis class. This setting is referred to as the realizable setting since we are assuming that the labels are <em>realized<\/em> by some function in the hypothesis class. One can consider extensions to allow for label noise or even arbitrary labels but in this post we will focus on the realizable setting.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The learner, thus, has access to a training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%7B%28x_1%2C+f%5E%7B%5Cstar%7D%28x_1%29%29%2C+%5Cldots%2C+%28x_m%2C+f%5E%7B%5Cstar%7D%28x_m%29%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = {(x_1, f^{&#92;star}(x_1)), &#92;ldots, (x_m, f^{&#92;star}(x_m))}\" class=\"latex\" \/> drawn i.i.d. from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>. The aim of the learner is to output, with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/> over the choice of the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> (i.e. probably), a hypothesis <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> <sup data-fn=\"b3657ab7-3d2d-426f-b123-d87386a1ae31\" class=\"fn\"><a href=\"#b3657ab7-3d2d-426f-b123-d87386a1ae31\" id=\"b3657ab7-3d2d-426f-b123-d87386a1ae31-link\">1<\/a><\/sup> which is approximately correct with respect to an unseen example from the same distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>, i.e.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BR%7D%28h%29+%3D+%5CPr_%7BX+%5Csim+%5Cmathcal%7BD%7D%7D%5Bh%28X%29+%5Cneq+f%5E%7B%5Cstar%7D%28X%29%5D+%5Cleq+%5Cepsilon.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{R}(h) = &#92;Pr_{X &#92;sim &#92;mathcal{D}}[h(X) &#92;neq f^{&#92;star}(X)] &#92;leq &#92;epsilon.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\"><br>We will refer to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BR%7D%28h%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{R}(h)\" class=\"latex\" \/> as the <em>risk<\/em> of the function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> (under <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> with respect to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>). This model of learning is known as Probably Approximately Correct (PAC) learning. The choice of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> controls the probability of failure while <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> controls the accuracy of the output hypothesis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Empirical Risk Minimization<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In the above model, given that there is a consistent hypothesis respecting the labels, a natural algorithm would be to find a function consistent with the labels seen so far. That is, output <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7Bf%7D+%5Cin+%5Cleft%5C%7B+f+%5Cin+%5Cmathcal%7BF%7D%3A+f%28x_i%29+%3D+y_i+%5C%3B+%5Cforall+i+%5Cin+%5Bn%5D+%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{f} &#92;in &#92;left&#92;{ f &#92;in &#92;mathcal{F}: f(x_i) = y_i &#92;; &#92;forall i &#92;in [n] &#92;right&#92;}\" class=\"latex\" \/>. Note that this set is non-empty because <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star}\" class=\"latex\" \/> is in this set. Any algorithm that returns such a function is referred to as an Empirical Risk Minimization (ERM) algorithm. The main challenge in analyzing the algorithm is that this set is typically very large and the different functions in this set can make different predictions on the unseen points. Since we are choosing an arbitrary realizable function, we would like to argue any of them will typically make good predictions on unseen points.<br><br>The idea is to argue that if it is likely that a candidate function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7Bf%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{f}\" class=\"latex\" \/> disagrees with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star}\" class=\"latex\" \/> on the test example, then it is likely that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7Bf%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{f}\" class=\"latex\" \/> would have disagreed with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star}\" class=\"latex\" \/> on the training examples as well and thus the algorithm would have output a different function. This hints at the idea of exchangeability of the training examples with the test example playing an important role in the behavior of learning algorithms. Though this is indeed the case, typical analyses of ERM do not discuss exchangeability directly but rather go through independence and concentration arguments. What we will see later is that exchangeability will play a crucial role in our understanding of optimal sample complexity of PAC learning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But, getting back to the ERM algorithm, we need a formal notion of complexity of the hypothesis class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> to bound the error of the ERM algorithm. This is formalized by the Vapnik-Chervonenkis (VC) dimension. Intuitively, the VC dimension controls how many &#8220;different&#8221; functions the hypothesis class can express. We will not go into the details of this notion of complexity right now, but we will return it later in the post (Definition 1 and Theorem 4) and see how it plays a crucial role in the analysis of optimal learning algorithms albeit capturing a slightly different aspect of the complexity of a function class. Now, we will state the sample complexity of the ERM algorithm in the PAC model.<\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\"><strong>Theorem<\/strong> 1 [VC Theorem, See e.g. [SB14]] Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> be a domain and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a class of functions from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/> with VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%2C+%5Cdelta+%5Cin+%280%2C1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon, &#92;delta &#92;in (0,1)\" class=\"latex\" \/>. Let<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><br><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%3D+%5COmega%5Cleft%28+%5Cfrac%7Bd+%5Clog%5Cleft%28+1%2F+%5Cepsilon+%5Cright%29+%2B+%5Clog+%5Cleft%28+1+%2F+%5Cdelta+%5Cright%29+%7D%7B%5Cepsilon%7D+%5Cright%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n = &#92;Omega&#92;left( &#92;frac{d &#92;log&#92;left( 1\/ &#92;epsilon &#92;right) + &#92;log &#92;left( 1 \/ &#92;delta &#92;right) }{&#92;epsilon} &#92;right).\" class=\"latex\" \/><br><\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\">Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%5Cleft%5C%7B+%28x_i+%2C+f%5E%7B%5Cstar%7D+%28x_i%29+%29+%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = &#92;left&#92;{ (x_i , f^{&#92;star} (x_i) ) &#92;right&#92;}\" class=\"latex\" \/> be a sample of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> drawn i.i.d. from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D+%5Cin+%5Cmathcal%7BF%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star} &#92;in &#92;mathcal{F} \" class=\"latex\" \/>. Then, with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/> over the choice of the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>, any <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> that is consistent with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> satisfies<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr_%7BX+%5Csim+D+%7D+%5Cleft%5B+f%28X%29+%5Cneq+f%5E%7B%5Cstar%7D%28X%29+%5Cright%5D+%5Cleq+%5Cepsilon.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr_{X &#92;sim D } &#92;left[ f(X) &#92;neq f^{&#92;star}(X) &#92;right] &#92;leq &#92;epsilon.\" class=\"latex\" \/><br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Optimal Sample Complexity of PAC Learning<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><br>A natural question is to ask whether the sample complexity of the ERM algorithm is tight. Using simple arguments, one can show lower bounds on the sample complexity of <em>any<\/em> learning algorithm in terms of the parameters <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> appearing in Theorem 1.<br><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Theorem<\/strong> 2 [Sample Complexity Lower Bound, See e.g. [SB14]] Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> be a domain and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a class of functions from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/> with VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%2C+%5Cdelta+%5Cin+%280%2C1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon, &#92;delta &#92;in (0,1)\" class=\"latex\" \/>. Then, any algorithm that outputs a hypothesis <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> that has error at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/> over the choice of the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>, must require a sample size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> satisfying<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%3D+%5COmega+%5Cleft%28+%5Cfrac%7Bd+%2B+%5Clog+%5Cleft%28+1+%2F+%5Cdelta+%5Cright%29+%7D%7B%5Cepsilon%7D+%5Cright%29.+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n = &#92;Omega &#92;left( &#92;frac{d + &#92;log &#92;left( 1 \/ &#92;delta &#92;right) }{&#92;epsilon} &#92;right). \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\"><br>Note that the upper and lower bounds differ by a factor of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clog%281%2F%5Cepsilon%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;log(1\/&#92;epsilon)\" class=\"latex\" \/>. Given this mismatch, the first question that one might ask is which of the bounds is tight. Unfortunately, for the ERM algorithm, the upper bound is indeed tight. But, one might wonder if there is a better algorithm than ERM. This question remained open for several decades until the breakthrough works of Simon [Sim15] and Hanneke [Han16] who showed that the lower bound is tight by presenting an algorithm that achieves this sample complexity. In the rest of the current and the following post, we will see why the sample complexity from Theorem 2 is natural and present a general technique recently discovered by [AACSZ23a] to achieve this sample complexity in the PAC model, leading up to the following theorem.<br><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Theorem 3<\/strong> [Optimal Sample Complexity of PAC Learning] Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> be a domain and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a class of functions from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/> with VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%2C+%5Cdelta+%5Cin+%280%2C1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon, &#92;delta &#92;in (0,1)\" class=\"latex\" \/>. Let<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%3D+%5COmega%5Cleft%28+%5Cfrac%7Bd+%2B+%5Clog+%5Cleft%28+1+%2F+%5Cdelta+%5Cright%29+%7D%7B%5Cepsilon%7D+%5Cright%29.+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n = &#92;Omega&#92;left( &#92;frac{d + &#92;log &#92;left( 1 \/ &#92;delta &#92;right) }{&#92;epsilon} &#92;right). \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\"><br> Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%5Cleft%5C%7B+%28x_i+%2C+f%5E%7B%5Cstar%7D+%28x_i%29+%29+%5Cright%5C%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = &#92;left&#92;{ (x_i , f^{&#92;star} (x_i) ) &#92;right&#92;} \" class=\"latex\" \/> be a sample of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> drawn i.i.d. from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cstar%7D+%5Cin+%5Cmathcal%7BF%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;star} &#92;in &#92;mathcal{F} \" class=\"latex\" \/>. There is an prediction strategy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bf%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{f}\" class=\"latex\" \/> such that, with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1+-+%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1 - &#92;delta\" class=\"latex\" \/> over the choice of the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>, we have<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr_%7Bx+%5Csim+D+%7D+%5Cleft%5B+%5Chat%7Bf%7D%28x%29+%5Cneq+f%5E%7B%5Cstar%7D%28x%29+%5Cright%5D+%5Cleq+%5Cepsilon.+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr_{x &#92;sim D } &#92;left[ &#92;hat{f}(x) &#92;neq f^{&#92;star}(x) &#92;right] &#92;leq &#92;epsilon. \" class=\"latex\" \/><br><\/p>\n\n\n\n<p class=\"has-text-align-left wp-block-paragraph\"><br><br>A key feature of the techniques we will see in this post is that they allow us to generalize from binary classification to more general settings, leading to a unified perspective on optimal PAC learning algorithms.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Transductive Learning and the One-Inclusion Graph Algorithm<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\"><br>In order to motivate Theorem 3, we will take a detour and discuss a learning setting known as <em>transductive learning<\/em>. We will view transductive learning as a game where a player is competing against an adversary. As before we will will fix a function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>. The game proceeds as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">The adversary picks a set of examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7Bx_1%2C+%5Cdots%2C+x_%7Bn%2B1%7D+%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{x_1, &#92;dots, x_{n+1} }\" class=\"latex\" \/>, a labelling function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star &#92;in &#92;mathcal{F}\" class=\"latex\" \/>, and produces the labelled set of examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%28x_1%2C+f%5E%5Cstar%28x_1%29%29+%2C+%5Cdots%2C+%28x_%7Bn%2B1%7D%2C+f%5E%5Cstar%28x_%7Bn%2B1%7D%29%29+%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{(x_1, f^&#92;star(x_1)) , &#92;dots, (x_{n+1}, f^&#92;star(x_{n+1})) }\" class=\"latex\" \/><\/li>\n\n\n\n<li class=\"\">A uniformly random permutation\/shuffle is applied to these labelled examples to produce the <em>random sequence<\/em> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%28X_1%2C+f%5E%5Cstar%28X_1%29%29%2C+%5Cdots%2C+%28X_%7Bn%2B1%7D%2C+f%5E%5Cstar%28X_%7Bn%2B1%7D%29%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{(X_1, f^&#92;star(X_1)), &#92;dots, (X_{n+1}, f^&#92;star(X_{n+1}))}\" class=\"latex\" \/>. Note that the examples are exchangeable.<\/li>\n\n\n\n<li class=\"\">The player receives the first <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> <em>labelled examples<\/em> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%28%28X_1%2C+f%5E%5Cstar%28X_1%29%29%2C+%5Cdots%2C+%28X_n%2Cf%5E%5Cstar%28X_n%29%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = ((X_1, f^&#92;star(X_1)), &#92;dots, (X_n,f^&#92;star(X_n)))\" class=\"latex\" \/> as a training set as well as the last <em>unlabelled example<\/em> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{n+1}\" class=\"latex\" \/> as a test point&#8221;.<\/li>\n\n\n\n<li class=\"\">The player now needs to produce a prediction <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7BY%7D_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{Y}_{n+1}\" class=\"latex\" \/> for the unknown label <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_%7Bn%2B1%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_{n+1})\" class=\"latex\" \/> of the test point using whatever information they can glean from the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> with the goal of minimizing the <em>expected error<\/em><\/li>\n<\/ul>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr%5Cleft%5B%5Chat%7BY%7D_%7Bn%2B1%7D+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29%5Cright%5D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr&#92;left[&#92;hat{Y}_{n+1} &#92;neq f^&#92;star(X_{n+1})&#92;right].\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>Here, the probability is taken with respect to the randomness of the uniform permutation (and potentially any randomness of the learning algorithm). While this might seem quite strange at first glance, this setting focuses on the on exchangeability of the sample instead of independence as in the PAC model.<br>Furthermore, as we will see shortly, there is a natural combinatorial interpretation of this problem that leads to an elegant solution! To build some intuition, let&#8217;s consider a simple toy example.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As before, we will focus on the case of binary classification <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D+%3D+%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y} = &#92;{0,1&#92;}\" class=\"latex\" \/>. Let the domain be <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D+%3D+%5C%7Ba%2C+b%2C+c+%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X} = &#92;{a, b, c &#92;}\" class=\"latex\" \/> and the function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> consist of two functions: the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbf%7B0%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbf{0}\" class=\"latex\" \/> function and the function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f_%7Ba%2Cc%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f_{a,c}\" class=\"latex\" \/> that takes on the value <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/> if and only if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x+%5Cin+%5C%7Ba%2C+c%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x &#92;in &#92;{a, c&#92;}\" class=\"latex\" \/>. Now, assume the adversary picks two examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_1+%3D+a+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_1 = a \" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_2+%3D+c&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_2 = c\" class=\"latex\" \/> and some function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star &#92;in &#92;mathcal{F}\" class=\"latex\" \/>. Notice that in this case, the player can always win! Indeed, once the player knows the the value <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_1)\" class=\"latex\" \/>, they can fully determine the value of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_2)\" class=\"latex\" \/> regardless of the realization of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1\" class=\"latex\" \/>: if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_1%29+%3D+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_1) = 1\" class=\"latex\" \/>, then we know <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar+%3D+f_%7Ba%2Cc%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star = f_{a,c}\" class=\"latex\" \/> and if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_1%29+%3D+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_1) = 0\" class=\"latex\" \/>, then we know <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar+%3D+%5Cmathbf%7B0%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star = &#92;mathbf{0}\" class=\"latex\" \/>. This example suggests how we should take advantage of the information about the target function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> (i.e. its evaluation on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+%5Cdots%2C+X_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, &#92;dots, X_n\" class=\"latex\" \/>) in the training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>: look at all possible patterns that <em>any<\/em> function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f &#92;in &#92;mathcal{F}\" class=\"latex\" \/> can produce on the examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+%5Cdots%2C+X_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, &#92;dots, X_n\" class=\"latex\" \/> and the test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{n+1}\" class=\"latex\" \/>. This set of patterns is called the <em>projection<\/em> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> onto <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7BX_1%2C+%5Cdots%2C+X_%7Bn%2B1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{X_1, &#92;dots, X_{n+1}}\" class=\"latex\" \/> and is defined as<br><\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D%7C_%7BX_1%2C+%5Cdots%2C+X_%7Bn%2B1%7D%7D+%3D+%5Cleft%5C%7B%28f%28X_1%29%2C+%5Cdots%2C+f%28X_%7Bn%2B1%7D%29%29+%7C+f+%5Cin+%5Cmathcal%7BF%7D+%5Cright%5C%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}|_{X_1, &#92;dots, X_{n+1}} = &#92;left&#92;{(f(X_1), &#92;dots, f(X_{n+1})) | f &#92;in &#92;mathcal{F} &#92;right&#92;}.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>We can now rephrase our strategy for the toy example we considered above: once we compute the projection, we find all the patterns in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D%7C_%7BX_1%2C+X_2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}|_{X_1, X_2}\" class=\"latex\" \/> that are consistent with the label <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_1)\" class=\"latex\" \/> and notice that there can only be one such pattern in this case! This turns out to be a very simple case, but it already hints at how the &#8220;complexity&#8221; (or lack thereof) of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> affects the game. More generally, there could be (at most) two different length <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n+1\" class=\"latex\" \/> strings in the projection that agree with the labels in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> on the first <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> examples. In other words, the values of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_1%29%2C+%5Cdots%2C+f%5E%5Cstar%28X_n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_1), &#92;dots, f^&#92;star(X_n)\" class=\"latex\" \/> might not preclude the possibility that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar%28X_%7Bn%2B1%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star(X_{n+1})\" class=\"latex\" \/> could either be <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0\" class=\"latex\" \/> or <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/>. This is unfortunate, but the show must go on! We must commit to <em>some<\/em> choice for our prediction <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7BY%7D_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{Y}_{n+1}\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Lets take a step back and rephrase what we have learned from the above (we&#8217;ll see why this rephrasing is a good idea very soon). Given the examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+%5Cdots%2C+X_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, &#92;dots, X_n\" class=\"latex\" \/> and test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{n+1}\" class=\"latex\" \/>, we can build a graph from the projection <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D%7C_%7BX_1%2C+%5Cdots%2C+X_%7Bn%2B1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}|_{X_1, &#92;dots, X_{n+1}}\" class=\"latex\" \/> as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Each string <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f+%5Cin+%5Cmathcal%7BF%7D%7C_%7BX_1%2C+%5Cdots%2C+X_%7Bn%2B1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f &#92;in &#92;mathcal{F}|_{X_1, &#92;dots, X_{n+1}}\" class=\"latex\" \/> is a vertex, i.e. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V+%3D+%5Cmathcal%7BF%7D%7C_%7BX_1%2C+%5Cdots%2C+X_%7Bn%2B1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V = &#92;mathcal{F}|_{X_1, &#92;dots, X_{n+1}}\" class=\"latex\" \/>.<\/li>\n\n\n\n<li class=\"\">We connect an edge between two vertices <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%2C+g+%5Cin+V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f, g &#92;in V\" class=\"latex\" \/> if and only if they disagree on exactly 1 bit (Hamming distance 1).<br><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><br>This graph is called the <em>one-inclusion graph<\/em> (OIG). Notice that for any fixed set of points <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7Bx_1%2C+%5Cdots%2C+x_n+%2C+x_%7Bn%2B1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{x_1, &#92;dots, x_n , x_{n+1}}\" class=\"latex\" \/>, the player will always consider the <em>same<\/em> one-inclusion graph up to permutations in the patterns\/strings that correspond to vertices. That is different orders of the training set give rise to isomorphic graphs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We already noticed that if there is only a single vertex <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> that is consistent with the labels in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>, this vertex must be the labelling determined by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> and we are done! Otherwise, there are two vertices <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%2C+g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f, g\" class=\"latex\" \/> in the graph that are consistent with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> (one of which corresponds to the labels <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> assigns to the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n+1\" class=\"latex\" \/> points) and we must pick between them. We can view this deterministic decision as an <em>orientation<\/em> of the edge between <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g\" class=\"latex\" \/> in this graph. If the edge points to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> and the last bit of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/> (i.e. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28X_%7Bn%2B1%7D%29+%3D+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(X_{n+1}) = 1\" class=\"latex\" \/>), we predict <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/>. Otherwise, we predict <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0\" class=\"latex\" \/>. Thus, any reasonable <sup data-fn=\"1e95e62d-99cc-4e8f-8e20-9b13d818fd54\" class=\"fn\"><a href=\"#1e95e62d-99cc-4e8f-8e20-9b13d818fd54\" id=\"1e95e62d-99cc-4e8f-8e20-9b13d818fd54-link\">2<\/a><\/sup> deterministic strategy (i.e. mappings from a training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%7B%28X_1%2C+f%5E%5Cstar%28X_1%29%7D%2C+%5Cdots%2C+%28X_n%2Cf%5E%5Cstar%28X_n%29%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S = {(X_1, f^&#92;star(X_1)}, &#92;dots, (X_n,f^&#92;star(X_n)))\" class=\"latex\" \/> and test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{n+1}\" class=\"latex\" \/> to a label in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/>) for the player can be viewed as an orientation of the relevant one-inclusion graph.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">How should the player orient the graph so as to minimize their error? Since the error is measured with respect to a uniformly random permutation, and since the player&#8217;s strategy does not depend on the order of the dataset, its not too hard to see that the error they suffer is<br><\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr%5Cleft%5B%5Chat%7BY%7D_%7Bn%2B1%7D+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29%5Cright%5D+%3D+%5Cfrac%7B1%7D%7Bn%2B1%7D+%5Csum_%7Bi%3D1%7D%5E%7Bn%2B1%7D+%5Cmathbf%7B1%7D%5Cleft%5C%7B%5Chat%7By%7D_i+%5Cneq+x_i%5Cright%5C%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr&#92;left[&#92;hat{Y}_{n+1} &#92;neq f^&#92;star(X_{n+1})&#92;right] = &#92;frac{1}{n+1} &#92;sum_{i=1}^{n+1} &#92;mathbf{1}&#92;left&#92;{&#92;hat{y}_i &#92;neq x_i&#92;right&#92;}.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In words, this is the fraction of times that the player makes a mistake on a uniformly random choice of test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_i\" class=\"latex\" \/>. But this is equivalent to the fraction of times that we orient the graph away from the vertex <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>, i.e. the out-degree of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>! So we can further write<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr%5Cleft%5B%5Chat%7BY%7D_%7Bn%2B1%7D+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29%5Cright%5D+%3D+%5Cfrac%7B%5Cmathrm%7Bout-degree%7D%28f%5E%5Cstar%29%7D%7Bn%2B1%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr&#92;left[&#92;hat{Y}_{n+1} &#92;neq f^&#92;star(X_{n+1})&#92;right] = &#92;frac{&#92;mathrm{out-degree}(f^&#92;star)}{n+1}.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>The player thus wants to minimize the out-degree of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>. Of course, they don&#8217;t know what <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> is. The best thing they can do is minimize the maximum out-degree of the graph. This is precisely the one-inclusion graph strategy proposed by Haussler, Littlestone, and Warmuth [HLW94].<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is now one final question that remains: how small can the maximum out-degree of the one-inclusion graph be? It is a priori possible for some vertices in the graph to have degree <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>, so one might be tempted to think that in the worst case the maximum outdegree can be as large as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5COmega%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Omega(n)\" class=\"latex\" \/>. In order to bound the maximum out-degree of the one-inclusion graph, we need to capture the simplicity of graph as function of the function class. To do this, we first recall the notion of the VC dimension of a function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>, which we briefly saw in the statement of Theorem 1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Definition 1 [VC Dimension]<\/strong> Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> be a domain and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a class of functions from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1&#92;}\" class=\"latex\" \/>. The VC dimension of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> is the largest integer <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> such that there exists a set of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> points <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_1%2C+%5Cdots%2C+x_d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_1, &#92;dots, x_d\" class=\"latex\" \/> such that for any labelling of these points i.e. choice <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y_i+%5Cin+%5Cleft%5C%7B+0%2C1+%5Cright%5C%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_i &#92;in &#92;left&#92;{ 0,1 &#92;right&#92;} \" class=\"latex\" \/> for each <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_i\" class=\"latex\" \/>, there exists a function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f+%5Cin+%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f &#92;in &#92;mathcal{F}\" class=\"latex\" \/> that realizes this labelling i.e. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x_i%29+%3D+y_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x_i) = y_i\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Typically, the VC dimension is used in learning theory (for example in the proof of Theorem 1) to argue that the number of functions that a class can express on any dataset is small (through the celebrated Sauer-Shelah lemma). In the language of one-inclusion graphs, this amounts to arguing that the number of <em>vertices<\/em> of the one-inclusion graph is small. But as we noted above, the error in the transductive setting (and as we will see in the PAC setting as well) is bounded by out-degree of the best orientation of the one-inclusion graph. This turns out to be related to the <em>edge density<\/em> of the one-inclusion graph (this general fact can be argued using the Hall marriage theorem and was first presented by Alon and Tarsi [AT92]). Surprisingly, Haussler, Littlestone, and Warmuth [HLW94] proved a beautiful combinatorial result that states that the edge density (and thus the maximum out-degree) can always be made to be smaller than <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>, the VC dimension of the class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Theorem <\/strong>4 [Orientations of One-Inclusion Graphs] Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> be a hypothesis class with VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7Bx_1%2C+%5Cdots+%2C+x_n%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{x_1, &#92;dots , x_n}\" class=\"latex\" \/> be a set of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> points from the domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> and let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G\" class=\"latex\" \/> be the induced one-inclusion graph of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/> on this set. Then, there exists an orientation of the edges of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G\" class=\"latex\" \/> such that the maximum out-degree of any vertex is at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This result amounts to saying that one-inclusion graphs are very sparse. This can be seen as a version of the isoperimetry theorem on the Boolean hypercube and provides a complementary view on the VC dimension compared to the Sauer-Shelah lemma.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Though, in this post we are not focussed on the computational efficiency of our algorithms, we remark that the minimizing orientation can in fact be found efficiently (using flow algorithms) in the size of the one-inclusion graph but this graph can have size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%5E%7BO%28d%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n^{O(d)}\" class=\"latex\" \/> (this again follows from the Sauer-Shelah lemma).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Going back to our game, we can combine the above and conclude that the player has a final error bound<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr%5Cleft%5B%5Chat%7BY%7D_%7Bn%2B1%7D+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29%5Cright%5D+%5Cleq+%5Cfrac%7Bd%7D%7Bn%2B1%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr&#92;left[&#92;hat{Y}_{n+1} &#92;neq f^&#92;star(X_{n+1})&#92;right] &#92;leq &#92;frac{d}{n+1}.\" class=\"latex\" \/><br><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>This result tells us that the VC dimension is a quantitative measure of complexity of the class that guarantees how well the player can perform. In fact, it is the quantitative measure of complexity for this game given the matching lower bounds of [LLS01] (this is the transductive analogue of Theorem 2).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We reiterate that a key aspect to note about the one-inclusion graph itself is that it does not depends on the order of the training examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_1%2C+%5Cdots+%2C+x_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_1, &#92;dots , x_n\" class=\"latex\" \/> since the changing the order leads to an isomorphic graph. Since the minimum over orientations of maximum out-degree is a <em>graph property<\/em> i.e. does not depend on the labelling of the vertices, one can hope to implement the one-inclusion graph algorithm in a permutation invariant way.For example, one can a priori order the domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> in some arbitrary way and always consider the training examples in that order. This fact that there is a prediction strategy that is permutation invariant and has the above error bound will be crucial later in the analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The above argument was specialized to binary function classes, however it naturally extends to many other settings e.g. multiclass classification where instead of a graph, we have a hypergraph [RBR09, DS14]. In all these extensions, the one-inclusion strategy still works: we can minimize the maximum out-degree of the relevant one-inclusion (hyper)graph to achieve an upper bound on the error that scales with the maximum out-degree. Unfortunately, in more general settings it is not always clear whether the maximum out-degree can be made independent of the sample size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> like in the binary case above, but the one-inclusion graph strategy still provides a unifying perspective on in-expectation error of learning algorithms.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">From Transductive learning to expected risk bounds<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><br>While the transductive model is interesting in its own right, our original goal was to understand how well we could do in the PAC model. The OIG algorithm only provides a single prediction for a test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{n+1}\" class=\"latex\" \/>, but this implicitly defines a function over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/>: whenever we want to evaluate the label of a test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>, we can build and orient the relevant OIG that can constructed from the dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> and test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>.Moreover, we get a bound on the error of this function for an <em>average<\/em> training sample <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> (rather than with high probability) by applying the analysis above! To see this, let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;widehat{f}_{&#92;mathrm{OIG}}\" class=\"latex\" \/> be the function implicitly defined by the OIG algorithm. Because the examples <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+%5Cdots%2C+X_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, &#92;dots, X_n\" class=\"latex\" \/> that we received in our training sample and the test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X+%3D+X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X = X_{n+1}\" class=\"latex\" \/> are i.i.d., <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_1%2C+%5Cdots%2C+X_%7Bn%7D%2CX_%7Bn%2B1%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_1, &#92;dots, X_{n},X_{n+1})\" class=\"latex\" \/> has the same distribution as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X_%7B%5Cpi%281%29%7D%2C+%5Cdots%2C+X_%7B%5Cpi%28n%29%7D%2C+X_%7B%5Cpi%28n%2B1%29%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X_{&#92;pi(1)}, &#92;dots, X_{&#92;pi(n)}, X_{&#92;pi(n+1)})\" class=\"latex\" \/> for any permutation <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi%3A+%5Bn%2B1%5D+%5Cto+%5Bn%2B1%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi: [n+1] &#92;to [n+1]\" class=\"latex\" \/>, so we can imagine the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_i\" class=\"latex\" \/> as being permuted by a uniformly random permutation <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpi&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;pi\" class=\"latex\" \/>. So, we can bound the error just like in the transductive setting:<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D_%7BS%7D+%5Cleft%5B+%5CPr_%7BX+%5Csim+P%7D+%5Cleft%5B%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D%28X%29+%5Cneq+f%5E%5Cstar%28X%29+%5Cright%5D%5Cright%5D+%3D+%5Cmathbb%7BE%7D_%7BX_1%2C+%5Cdots%2C+X_n%7D+%5Cleft%5B%5Cmathbb%7BE%7D_%7BX_%7Bn%2B1%7D%7D%5Cleft%5B%5Cmathbf%7B1%7D%5Cleft%5C%7B%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D%28X_%7Bn%2B1%7D%29+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29+%5Cright%5C%7D%5Cright%5D%5Cright%5D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}_{S} &#92;left[ &#92;Pr_{X &#92;sim P} &#92;left[&#92;widehat{f}_{&#92;mathrm{OIG}}(X) &#92;neq f^&#92;star(X) &#92;right]&#92;right] = &#92;mathbb{E}_{X_1, &#92;dots, X_n} &#92;left[&#92;mathbb{E}_{X_{n+1}}&#92;left[&#92;mathbf{1}&#92;left&#92;{&#92;widehat{f}_{&#92;mathrm{OIG}}(X_{n+1}) &#92;neq f^&#92;star(X_{n+1}) &#92;right&#92;}&#92;right]&#92;right] \" class=\"latex\" \/><br><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%3D+%5Cmathbb%7BE%7D_%7B%5Cpi%7D%5Cleft%5B%5Cmathbb%7BE%7D_%7BX_%7B%5Cpi%281%29%7D%2C+%5Cdots%2C+X_%7B%5Cpi%28n%29%7D%7D+%5Cleft%5B%5Cmathbb%7BE%7D_%7BX_%7B%5Cpi%28n%2B1%29%7D%7D%5Cleft%5B%5Cmathbf%7B1%7D%5Cleft%5C%7B%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D%28X_%7Bn%2B1%7D%29+%5Cneq+f%5E%5Cstar%28X_%7Bn%2B1%7D%29+%5Cright%5C%7D%5Cright%5D%5Cright%5D%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"= &#92;mathbb{E}_{&#92;pi}&#92;left[&#92;mathbb{E}_{X_{&#92;pi(1)}, &#92;dots, X_{&#92;pi(n)}} &#92;left[&#92;mathbb{E}_{X_{&#92;pi(n+1)}}&#92;left[&#92;mathbf{1}&#92;left&#92;{&#92;widehat{f}_{&#92;mathrm{OIG}}(X_{n+1}) &#92;neq f^&#92;star(X_{n+1}) &#92;right&#92;}&#92;right]&#92;right]&#92;right]\" class=\"latex\" \/><br><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cle+%5Cfrac%7Bd%7D%7Bn%2B1%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;le &#92;frac{d}{n+1}.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>To get the final inequality above, we condition on the <em>values<\/em> of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+%5Cdots%2C+X_n%2C+X_%7Bn%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, &#92;dots, X_n, X_{n+1}\" class=\"latex\" \/>, and then take the expectation over the uniform random permutation. Thus, the OIG algorithm gave us a learning strategy (that is very different from ERM) that gets sharp in-expectation bound in a setting similiar to the PAC setting.It is thus a natural question to ask whether the OIG algorithm produces a bona fide PAC learning algorithm with sharp error guarantees. In fact, Manfred Warmuth conjectured that the OIG algorithm achieves a strong PAC bound that matches the general lower bound on the error <em>any<\/em> algorithm must incur.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Conjecture 1<\/strong> [War04] The one-inclusion graph algorithm of [HLW94] always achieves the lower bound. That is, after receiving<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5COmega%5Cleft%28%5Cfrac%7Bd%2B+%5Clog%5Cleft%28%5Cfrac%7B1%7D%7B%5Cdelta%7D%5Cright%29%7D%7B%5Cepsilon%7D+%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Omega&#92;left(&#92;frac{d+ &#92;log&#92;left(&#92;frac{1}{&#92;delta}&#92;right)}{&#92;epsilon} &#92;right)\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">examples, its error is at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1-%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1-&#92;delta\" class=\"latex\" \/>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Some bad news<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><br>Unfortunately, Warmuth&#8217;s conjecture is false in the strictest sense as was shown in [AACS23b].<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Theorem <\/strong>5 [Informal] For any integer <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d+%3E+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d &gt; 1\" class=\"latex\" \/>, sample size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> and confidence parameter <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta+%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta &gt; 0\" class=\"latex\" \/>, there is a hypothesis class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D_d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}_d\" class=\"latex\" \/> that has VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>, a distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P\" class=\"latex\" \/>, a target function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> and a valid instantiation of the one-inclusion graph algorithm such that, when it receives an i.i.d. sample from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P\" class=\"latex\" \/> of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> (labelled by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>), with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> the one-inclusion graph algorithm <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;widehat{f}_{&#92;mathrm{OIG}}\" class=\"latex\" \/> incurs error<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BR%7D%28++%5Cwidehat%7Bf%7D_%7B%5Cmathrm%7BOIG%7D%7D++%29+%5Cgeq+c+%5Ccdot+%5Cfrac%7Bd%7D%7B%5Cdelta+n%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{R}(  &#92;widehat{f}_{&#92;mathrm{OIG}}  ) &#92;geq c &#92;cdot &#92;frac{d}{&#92;delta n} \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">for some universal constant <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=c+%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"c &gt; 0\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><code>By<\/code> valid, we mean the algorithm will <em>always<\/em> predict with a one-inclusion graph that has out-degree at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. In particular, the above theorem says that it is not enough to just minimize the out-degree of the one-inclusion graph when predicting, which is in sharp contrast with the bounds we can get on average!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that this hints at a fundamental difference between the PAC and the transductive setting. In the PAC setting, we need to worry about the behavior of the algorithm across different datasets while the transductive (and also the expectation version of the PAC setting) only requires us to worry about the behavior of the algorithm on (random permutations of) a single dataset. Intuitively, the lower bound is proven by coordinating&#8221; the mistakes of the one-inclusion graphs that the algorithm predicts with. Recall that the one-inclusion graph algorithm makes a mistake on a test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/> iff the relevant edge in the graph is oriented away from the vertex that corresponds to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/>. We show that there is a &#8220;hard&#8221; distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P\" class=\"latex\" \/> such that with probability at least <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;delta\" class=\"latex\" \/> we get a &#8220;bad&#8221; training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S_%7B%5Ctext%7BBAD%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S_{&#92;text{BAD}}\" class=\"latex\" \/> such that (roughly) a <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d%2F%28%5Cdelta+n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\/(&#92;delta n)\" class=\"latex\" \/> fraction of the possible unseen test points <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x+%5Cin+%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x &#92;in &#92;mathcal{X}\" class=\"latex\" \/> will have a one-inclusion graph that points <em>away<\/em> from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^&#92;star\" class=\"latex\" \/> on the relevant edge. Though the high level idea is simple, it is instructive to note an important subtlety. Recall that if we swap the test point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/> with any training point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%27+%5Cin+S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x&#039; &#92;in S\" class=\"latex\" \/> in the training set, we still have the same one-inclusion graph. Moreover, we <em>must<\/em> use the same orientation for this graph when predicting on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x&#039;\" class=\"latex\" \/> in order for our in-expectation guarantee to hold. Thus, flipping edges in an OIG will affect the maximum out-degree for a one-inclusion graph that can be used by the algorithm when it receives a different training sample <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S&#039;\" class=\"latex\" \/> that differs from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> in a single entry. So we must be careful to coordinate these errors to ensure no one-inclusion graph ends up with maximum out-degree larger than <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> after we are done flipping edges.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">In the next episode of \u2026<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Unfortunately, we will have to end this blog post on this pessimistic note. But, stay tuned for the next blog post where we will see how we can combine techniques from online learning with the exchangeability that were key to the analysis of the OIG to design simple prediction strategies that achieve optimal high probability PAC bounds.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">References<\/h1>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">[AACSZ23a] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. Optimal PAC bounds without uniform convergence, In Foundations of Computer Science (FOCS), 2023.<\/li>\n\n\n\n<li class=\"\">[AACSZ23b] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. The one-inclusion graph algorithm is not always optimal. In Conference on Learning Theory (COLT), 2023.<\/li>\n\n\n\n<li class=\"\">[AT92] Noga Alon and Michael Tarsi. Colorings and orientations of graphs. Combinatorica, 12(2):125\u2013134, 1992.<\/li>\n\n\n\n<li class=\"\">[DS14] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory (COLT), 2014.<\/li>\n\n\n\n<li class=\"\">[Han16] Steve Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319\u20131333, 2016.<\/li>\n\n\n\n<li class=\"\">[HLW94] David Haussler, Nick Littlestone, and Manfred Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2):248\u2013292, 1994.<\/li>\n\n\n\n<li class=\"\">[Lit89] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, 1989.<\/li>\n\n\n\n<li class=\"\">[LLS01] Yi Li, Philip Long, and Aravind Srinivasan. The one-inclusion graph algorithm is near-optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3):1257\u20131261, 2001.<\/li>\n\n\n\n<li class=\"\">[RBR09] Benjamin Rubinstein, Peter Bartlett, and Hyam Rubinstein. Shifting: One-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37\u201359, 2009.<\/li>\n\n\n\n<li class=\"\">[SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.<\/li>\n\n\n\n<li class=\"\">[Sim15] Hans Simon. An almost optimal PAC algorithm. In Conference on Learning Theory (COLT), 2015.<\/li>\n\n\n\n<li class=\"\">[War04] Manfred Warmuth. The optimal PAC algorithm. In International Conference on Computational Learning Theory (COLT), 2004.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Footnotes<\/h1>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"b3657ab7-3d2d-426f-b123-d87386a1ae31\">As stated the learner is not required to output a hypothesis in the class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>. Adding such a requirement would lead to a setting called <em>proper<\/em> learning but in this post we will allow our learners to be <em>improper<\/em> without further comment. <a href=\"#b3657ab7-3d2d-426f-b123-d87386a1ae31-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><li id=\"1e95e62d-99cc-4e8f-8e20-9b13d818fd54\">By reasonable, we mean the algorithm will never output a bit that is not possible, i.e. in the case that there is a single vertex to choose between it always picks the correct bit. <a href=\"#1e95e62d-99cc-4e8f-8e20-9b13d818fd54-link\" aria-label=\"Jump to footnote reference 2\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>We&#8217;re back again with another post! In case you missed the first two in the series, on calibration and multi-class classification, please check them out. Today, we have the first in a two-parter by Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy, again on sample-efficient methods for PAC learning. In the first post, they [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"[{\"id\":\"b3657ab7-3d2d-426f-b123-d87386a1ae31\",\"content\":\"As stated the learner is not required to output a hypothesis in the class $latex \\\\mathcal{F}$. Adding such a requirement would lead to a setting called <em>proper<\\\/em> learning but in this post we will allow our learners to be <em>improper<\\\/em> without further comment.\"},{\"id\":\"1e95e62d-99cc-4e8f-8e20-9b13d818fd54\",\"content\":\"By reasonable, we mean the algorithm will never output a bit that is not possible, i.e. in the case that there is a single vertex to choose between it always picks the correct bit.\"}]","jetpack_post_was_ever_published":false},"categories":[4],"tags":[],"class_list":["post-661","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/661","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=661"}],"version-history":[{"count":5,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/661\/revisions"}],"predecessor-version":[{"id":812,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/661\/revisions\/812"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=661"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=661"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=661"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}