One-Inclusion Graphs and the Optimal Sample Complexity of PAC Learning: Part 1

We’re back again with another post! In case you missed the first two in the series, on calibration and multi-class classification, please check them out.

Today, we have the first in a two-parter by Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy, again on sample-efficient methods for PAC learning. In the first post, they revisit and discuss a refutation of the strong version of Warmuth’s one-inclusion graph conjecture.

Introduction

One of the key contributions of computer science is the formalization of the concept of learning. These definitions try to get at the heart of the question of what patterns can be recognized from examples. In this (and the next) blog post, we will discuss a probabilistic formalization of this question through the Probably Approximately Correct (PAC) model of learning, and we will try to understand how many examples are needed to learn a pattern with high probability over the data generating process, i.e. the sample complexity. Towards this goal of characterizing the optimal sample complexity, we will see connections between the PAC model and a model of learning known as transductive learning. Further, we will characterize learning in the transductive setting using an elegant combinatorial argument.

Finally, we will connect high-probability PAC bounds to transductive learning via a general connection with online learning, using a simple argument showing how beyond-worst-case analysis of sequential decision-making can lead to sharp bounds in an a priori offline learning setting.

Probably Approximately Correct (PAC) Learning

The basic mathematical setup for learning involves a set \mathcal{X} and a set \mathcal{Y}, which we will refer to as the domain and the label set respectively. The domain \mathcal{X} is the set of all possible examples and the label set \mathcal{Y} is the set of all possible labels. An example to keep in mind: \mathcal{X} is the set of all images and \mathcal{Y} is the set of all possible labels for the images (e.g. cat, dog, etc.). For simplicity, we will focus on the case where \mathcal{Y} =\left\{ 0,1 \right\} . This is referred to as binary classification. Later in the post, we will comment on how the ideas apply to more general settings such as when \mathcal{Y} has larger cardinality or even when \mathcal{Y} is a continuous set such as \left[ 0,1 \right] .

The key modeling assumption is that there is a distribution \mathcal{D} over the domain \mathcal{X} and that the examples are drawn independently from it. This distribution models the idea that the examples are produced by some underlying data generating process which, though unknown, is fixed and does not change over time. The learner has access to labelled examples drawn from this distribution. This set of examples is referred to as the training set.

Since one cannot expect to learn arbitrary labelling functions, simplicity of the labelling function is usually encoded by assuming that the labelling function f^{\star} is chosen from a class of functions \mathcal{F}, known as the hypothesis class. This setting is referred to as the realizable setting since we are assuming that the labels are realized by some function in the hypothesis class. One can consider extensions to allow for label noise or even arbitrary labels but in this post we will focus on the realizable setting.

The learner, thus, has access to a training set S = {(x_1, f^{\star}(x_1)), \ldots, (x_n, f^{\star}(x_n))} drawn i.i.d. from \mathcal{D}. The aim of the learner is to output, with probability at least 1 - \delta over the choice of the training set S (i.e. probably), a hypothesis h [1] which is approximately correct with respect to an unseen example from the same distribution \mathcal{D}, i.e.

\mathcal{R}(h) = \Pr_{X \sim \mathcal{D}}[h(X) \neq f^{\star}(X)] \leq \epsilon.


We will refer to \mathcal{R}(h) as the risk of the function h (under \mathcal{D} with respect to f^\star). This model of learning is known as Probably Approximately Correct (PAC) learning. The choice of \delta controls the probability of failure while \epsilon controls the accuracy of the output hypothesis.

Empirical Risk Minimization

In the above model, given that there is a consistent hypothesis respecting the labels, a natural algorithm would be to find a function consistent with the labels seen so far. That is, output \tilde{f} \in \left\{ f \in \mathcal{F}: f(x_i) = y_i \; \forall i \in [n] \right\}. Note that this set is non-empty because f^{\star} is in this set. Any algorithm that returns such a function is referred to as an Empirical Risk Minimization (ERM) algorithm. The main challenge in analyzing this algorithm is that the set is typically very large and different functions in this set can make different predictions on unseen points. Since we are choosing an arbitrary consistent function, we would like to argue that any of them will typically make good predictions on unseen points.
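To make the ERM rule concrete, here is a minimal Python sketch for a finite hypothesis class; the threshold class at the bottom is just an illustrative, made-up example and not anything specific to the discussion above.

```python
# A minimal sketch of ERM in the realizable setting.
# Hypotheses are represented as plain Python functions over a small domain.

def erm(hypothesis_class, sample):
    """Return any hypothesis consistent with the labelled sample.

    `sample` is a list of (x, y) pairs with y = f_star(x); in the realizable
    setting at least one consistent hypothesis is guaranteed to exist.
    """
    for h in hypothesis_class:
        if all(h(x) == y for x, y in sample):
            return h
    raise ValueError("no consistent hypothesis: the sample is not realizable")

# Illustrative example: threshold functions on {0, ..., 9} (VC dimension 1).
F = [lambda x, t=t: int(x >= t) for t in range(11)]
f_star = F[4]                                    # unknown target, here x >= 4
sample = [(x, f_star(x)) for x in (1, 6, 8, 2)]  # training set
h = erm(F, sample)
print([h(x) for x in range(10)])                 # agrees with f_star on the sample
```

Note that the returned hypothesis need not equal f^{\star}; it is only guaranteed to agree with it on the training points, which is exactly the gap the analysis below has to address.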

The idea is to argue that if it is likely that a candidate function \tilde{f} disagrees with f^{\star} on the test example, then it is likely that \tilde{f} would have disagreed with f^{\star} on the training examples as well and thus the algorithm would have output a different function. This hints at the idea of exchangeability of the training examples with the test example playing an important role in the behavior of learning algorithms. Though this is indeed the case, typical analyses of ERM do not discuss exchangeability directly but rather go through independence and concentration arguments. What we will see later is that exchangeability will play a crucial role in our understanding of optimal sample complexity of PAC learning.

But, getting back to the ERM algorithm, we need a formal notion of complexity of the hypothesis class \mathcal{F} to bound the error of the ERM algorithm. This is formalized by the Vapnik-Chervonenkis (VC) dimension. Intuitively, the VC dimension controls how many “different” functions the hypothesis class can express. We will not go into the details of this notion of complexity right now, but we will return to it later in the post (Definition 1 and Theorem 4) and see how it plays a crucial role in the analysis of optimal learning algorithms, albeit capturing a slightly different aspect of the complexity of a function class. Now, we will state the sample complexity of the ERM algorithm in the PAC model.

Theorem 1 [VC Theorem, See e.g. [SB14]] Let \mathcal{X} be a domain and \mathcal{F} be a class of functions from \mathcal{X} to \{0,1\} with VC dimension d. Let \epsilon, \delta \in (0,1). Let


n = \Omega\left( \frac{d \log\left( 1/ \epsilon \right) + \log \left( 1 / \delta \right) }{\epsilon} \right).

Let S = \left\{ (x_i , f^{\star} (x_i) ) \right\}_{i=1}^{n} be a sample of size n drawn i.i.d. from \mathcal{D} with f^{\star} \in \mathcal{F}. Then, with probability at least 1 - \delta over the choice of the training set S, any f \in \mathcal{F} that is consistent with S satisfies

\Pr_{X \sim \mathcal{D}} \left[ f(X) \neq f^{\star}(X) \right] \leq \epsilon.

Optimal Sample Complexity of PAC Learning


A natural question is to ask whether the sample complexity of the ERM algorithm is tight. Using simple arguments, one can show lower bounds on the sample complexity of any learning algorithm in terms of the parameters d, \epsilon and \delta appearing in Theorem 1.

Theorem 2 [Sample Complexity Lower Bound, See e.g. [SB14]] Let \mathcal{X} be a domain and \mathcal{F} be a class of functions from \mathcal{X} to \{0,1\} with VC dimension d. Let \epsilon, \delta \in (0,1). Then, any algorithm that outputs a hypothesis h that has error at most \epsilon with probability at least 1 - \delta over the choice of the training set S, must require a sample size n satisfying

n = \Omega \left( \frac{d + \log \left( 1 / \delta \right) }{\epsilon} \right).


Note that the upper and lower bounds differ by a factor of \log(1/\epsilon). Given this mismatch, the first question that one might ask is which of the bounds is tight. Unfortunately, for the ERM algorithm, the upper bound is indeed tight. But, one might wonder if there is a better algorithm than ERM. This question remained open for several decades until the breakthrough works of Simon [Sim15] and Hanneke [Han16] who showed that the lower bound is tight by presenting an algorithm that achieves this sample complexity. In the rest of the current and the following post, we will see why the sample complexity from Theorem 2 is natural and present a general technique recently discovered by [AACSZ23a] to achieve this sample complexity in the PAC model, leading up to the following theorem.

Theorem 3 [Optimal Sample Complexity of PAC Learning] Let \mathcal{X} be a domain and \mathcal{F} be a class of functions from \mathcal{X} to \{0,1\} with VC dimension d. Let \epsilon, \delta \in (0,1). Let

n = \Omega\left( \frac{d + \log \left( 1 / \delta \right) }{\epsilon} \right).


Let S = \left\{ (x_i , f^{\star} (x_i) ) \right\}_{i=1}^{n} be a sample of size n drawn i.i.d. from \mathcal{D} with f^{\star} \in \mathcal{F}. There is a prediction strategy \hat{f} such that, with probability at least 1 - \delta over the choice of the training set S, we have

\Pr_{X \sim \mathcal{D}} \left[ \hat{f}(X) \neq f^{\star}(X) \right] \leq \epsilon.



A key feature of the techniques we will see in this post is that they allow us to generalize from binary classification to more general settings, leading to a unified perspective on optimal PAC learning algorithms.

Transductive Learning and the One-Inclusion Graph Algorithm


In order to motivate Theorem 3, we will take a detour and discuss a learning setting known as transductive learning. We will view transductive learning as a game where a player is competing against an adversary. As before, we will fix a function class \mathcal{F}. The game proceeds as follows:

  • The adversary picks a set of examples {x_1, \dots, x_{n+1} }, a labelling function f^\star \in \mathcal{F}, and produces the labelled set of examples {(x_1, f^\star(x_1)) , \dots, (x_{n+1}, f^\star(x_{n+1})) }
  • A uniformly random permutation/shuffle is applied to these labelled examples to produce the random sequence {(X_1, f^\star(X_1)), \dots, (X_{n+1}, f^\star(X_{n+1}))}. Note that the examples are exchangeable.
  • The player receives the first n labelled examples S = ((X_1, f^\star(X_1)), \dots, (X_n,f^\star(X_n))) as a training set as well as the last unlabelled example X_{n+1} as a “test point”.
  • The player now needs to produce a prediction \hat{Y}_{n+1} for the unknown label f^\star(X_{n+1}) of the test point using whatever information they can glean from the training set S with the goal of minimizing the expected error

\Pr\left[\hat{Y}_{n+1} \neq f^\star(X_{n+1})\right].


Here, the probability is taken with respect to the randomness of the uniform permutation (and potentially any randomness of the learning algorithm). While this might seem quite strange at first glance, this setting focuses on the exchangeability of the sample instead of independence as in the PAC model.
Furthermore, as we will see shortly, there is a natural combinatorial interpretation of this problem that leads to an elegant solution! To build some intuition, let’s consider a simple toy example.

As before, we will focus on the case of binary classification \mathcal{Y} = \{0,1\}. Let the domain be \mathcal{X} = \{a, b, c \} and the function class \mathcal{F} consist of two functions: the \mathbf{0} function and the function f_{a,c} that takes on the value 1 if and only if x \in \{a, c\}. Now, assume the adversary picks two examples x_1 = a and x_2 = c and some function f^\star \in \mathcal{F}. Notice that in this case, the player can always win! Indeed, once the player knows the value f^\star(X_1), they can fully determine the value of f^\star(X_2) regardless of the realization of X_1: if f^\star(X_1) = 1, then we know f^\star = f_{a,c} and if f^\star(X_1) = 0, then we know f^\star = \mathbf{0}. This example suggests how we should take advantage of the information about the target function f^\star (i.e. its evaluation on X_1, \dots, X_n) in the training set S: look at all possible patterns that any function f \in \mathcal{F} can produce on the examples X_1, \dots, X_n and the test point X_{n+1}. This set of patterns is called the projection of \mathcal{F} onto {X_1, \dots, X_{n+1}} and is defined as

\mathcal{F}|_{X_1, \dots, X_{n+1}} = \left\{(f(X_1), \dots, f(X_{n+1})) | f \in \mathcal{F} \right\}.


We can now rephrase our strategy for the toy example we considered above: once we compute the projection, we find all the patterns in \mathcal{F}|_{X_1, X_2} that are consistent with the label f^\star(X_1) and notice that there can only be one such pattern in this case! This turns out to be a very simple case, but it already hints at how the “complexity” (or lack thereof) of \mathcal{F} affects the game. More generally, there could be (at most) two different length n+1 strings in the projection that agree with the labels in S on the first n examples. In other words, the values of f^\star(X_1), \dots, f^\star(X_n) might not preclude the possibility that f^\star(X_{n+1}) could either be 0 or 1. This is unfortunate, but the show must go on! We must commit to some choice for our prediction \hat{Y}_{n+1}.
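As a sanity check, here is a short Python sketch that computes the projection for the toy class above (with functions represented as dictionaries, an encoding chosen purely for illustration) and filters it by the observed label; exactly one pattern survives, just as in the discussion.

```python
# Toy example: domain {a, b, c}, class F = {zero function, f_{a,c}}.
zero = {"a": 0, "b": 0, "c": 0}
f_ac = {"a": 1, "b": 0, "c": 1}
F = [zero, f_ac]

def projection(F, points):
    """All distinct label patterns the class can produce on the given points."""
    return {tuple(f[x] for x in points) for f in F}

X1, X2 = "a", "c"                      # the adversary's examples (in some order)
patterns = projection(F, [X1, X2])     # {(0, 0), (1, 1)}

y1 = f_ac[X1]                          # observed label of X1, here with f_star = f_{a,c}
consistent = [p for p in patterns if p[0] == y1]
print(consistent)                      # exactly one pattern survives: [(1, 1)]
```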

Let’s take a step back and rephrase what we have learned from the above (we’ll see why this rephrasing is a good idea very soon). Given the examples X_1, \dots, X_n and test point X_{n+1}, we can build a graph from the projection \mathcal{F}|_{X_1, \dots, X_{n+1}} as follows:

  • Each string f \in \mathcal{F}|_{X_1, \dots, X_{n+1}} is a vertex, i.e. V = \mathcal{F}|_{X_1, \dots, X_{n+1}}.
  • We connect two vertices f, g \in V by an edge if and only if they disagree on exactly one bit (Hamming distance 1).


This graph is called the one-inclusion graph (OIG). Notice that for any fixed set of points {x_1, \dots, x_n , x_{n+1}}, the player will always consider the same one-inclusion graph, up to a permutation of the coordinates of the patterns/strings that correspond to vertices. That is, different orders of the training set give rise to isomorphic graphs.
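Here is a sketch of this construction for a finite class, again with functions represented as dictionaries; the small class at the bottom is a made-up example whose one-inclusion graph happens to be a star.

```python
from itertools import combinations

def one_inclusion_graph(F, points):
    """Build the one-inclusion graph of a (finite) class F on the given points.

    F is a list of dicts mapping domain elements to {0, 1}; the vertices are
    the distinct patterns in the projection and edges join patterns that
    differ in exactly one coordinate.
    """
    vertices = {tuple(f[x] for x in points) for f in F}
    edges = {frozenset((u, v))
             for u, v in combinations(vertices, 2)
             if sum(a != b for a, b in zip(u, v)) == 1}
    return vertices, edges

# Illustrative class: all "at most one 1" labellings of three points (VC dim 1).
points = ["a", "b", "c"]
F = [{"a": 0, "b": 0, "c": 0},
     {"a": 1, "b": 0, "c": 0},
     {"a": 0, "b": 1, "c": 0},
     {"a": 0, "b": 0, "c": 1}]
V, E = one_inclusion_graph(F, points)
print(len(V), len(E))   # 4 vertices, 3 edges (a star centred at (0, 0, 0))
```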

We already noticed that if there is only a single vertex f that is consistent with the labels in S, this vertex must be the labelling determined by f^\star and we are done! Otherwise, there are two vertices f, g in the graph that are consistent with S (one of which corresponds to the labels f^\star assigns to the n+1 points) and we must pick between them. We can view this deterministic decision as an orientation of the edge between f and g in this graph. If the edge points to f and the last bit of f is 1 (i.e. f(X_{n+1}) = 1), we predict 1. Otherwise, we predict 0. Thus, any reasonable [2] deterministic strategy (i.e. a mapping from a training set S = ((X_1, f^\star(X_1)), \dots, (X_n, f^\star(X_n))) and test point X_{n+1} to a label in \{0,1\}) for the player can be viewed as an orientation of the relevant one-inclusion graph.

How should the player orient the graph so as to minimize their error? Since the error is measured with respect to a uniformly random permutation, and since the player’s strategy does not depend on the order of the dataset, it’s not too hard to see that the error they suffer is

\Pr\left[\hat{Y}_{n+1} \neq f^\star(X_{n+1})\right] = \frac{1}{n+1} \sum_{i=1}^{n+1} \mathbf{1}\left\{\hat{y}_i \neq f^\star(x_i)\right\},

where \hat{y}_i denotes the prediction the player makes when x_i is the test point and the remaining n points form the training set.

In words, this is the fraction of times that the player makes a mistake on a uniformly random choice of test point x_i. But the player errs on a test point x_i precisely when the relevant edge is oriented away from the vertex corresponding to f^\star, so the number of such mistakes is exactly the out-degree of that vertex! So we can further write

\Pr\left[\hat{Y}_{n+1} \neq f^\star(X_{n+1})\right] = \frac{\text{out-degree}(f^\star)}{n+1}.


The player thus wants to minimize the out-degree of f^\star. Of course, they don’t know what f^\star is. The best thing they can do is minimize the maximum out-degree of the graph. This is precisely the one-inclusion graph strategy proposed by Haussler, Littlestone, and Warmuth [HLW94].
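The following sketch spells out this strategy end to end for a tiny finite class. The orientation step here is a brute-force search over all orientations, which is only feasible for toy examples but makes the idea transparent; an efficient flow-based alternative is sketched further below.

```python
from itertools import combinations, product

def one_inclusion_graph(F, points):
    """Vertices are projected patterns; edges join patterns at Hamming distance 1."""
    vertices = sorted({tuple(f[x] for x in points) for f in F})
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if sum(a != b for a, b in zip(u, v)) == 1]
    return vertices, edges

def min_max_outdegree_orientation(vertices, edges):
    """Brute force over all 2^|E| orientations; only viable for tiny graphs."""
    best, best_deg = [], float("inf")
    for choice in product((0, 1), repeat=len(edges)):
        outdeg = {v: 0 for v in vertices}
        oriented = []
        for (u, v), c in zip(edges, choice):
            tail, head = (u, v) if c == 0 else (v, u)
            outdeg[tail] += 1
            oriented.append((tail, head))
        deg = max(outdeg.values(), default=0)
        if deg < best_deg:
            best, best_deg = oriented, deg
    return best, best_deg

def oig_predict(F, labelled, test_point):
    """One-inclusion prediction for `test_point` given labelled (x, y) pairs."""
    points = [x for x, _ in labelled] + [test_point]
    vertices, edges = one_inclusion_graph(F, points)
    orientation, _ = min_max_outdegree_orientation(vertices, edges)
    observed = tuple(y for _, y in labelled)
    consistent = [v for v in vertices if v[:-1] == observed]
    if len(consistent) == 1:
        return consistent[0][-1]          # the label of the test point is forced
    f, g = consistent                     # two candidate patterns; follow the orientation
    head = next(h for t, h in orientation if {t, h} == {f, g})
    return head[-1]                       # predict with the pattern the edge points to

# Toy run with the illustrative "at most one 1" class from before.
F = [{"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 0, "c": 0},
     {"a": 0, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 1}]
print(oig_predict(F, [("a", 0), ("b", 0)], "c"))   # predicted label for c
```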

There is now one final question that remains: how small can the maximum out-degree of the one-inclusion graph be? It is a priori possible for some vertices in the graph to have degree n, so one might be tempted to think that in the worst case the maximum out-degree can be as large as \Omega(n). In order to bound the maximum out-degree of the one-inclusion graph, we need to capture the simplicity of the graph as a function of the complexity of the function class. To do this, we first recall the notion of the VC dimension of a function class \mathcal{F}, which we briefly saw in the statement of Theorem 1.

Definition 1 [VC Dimension] Let \mathcal{X} be a domain and \mathcal{F} be a class of functions from \mathcal{X} to \{0,1\}. The VC dimension of \mathcal{F} is the largest integer d such that there exists a set of d points x_1, \dots, x_d such that for any labelling of these points, i.e. any choice of y_i \in \left\{ 0,1 \right\} for each x_i, there exists a function f \in \mathcal{F} that realizes this labelling, i.e. f(x_i) = y_i for all i.
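For a finite class over a finite domain, the VC dimension can be computed by brute force, which is a handy sanity check on toy examples. Here is a sketch (exponential in the domain size, so only for small classes); the threshold class at the bottom is again just an illustration.

```python
from itertools import combinations

def vc_dimension(F, domain):
    """Brute-force VC dimension of a finite class F (dicts from domain to {0,1})."""
    d = 0
    for k in range(1, len(domain) + 1):
        shattered_some = False
        for points in combinations(domain, k):
            patterns = {tuple(f[x] for x in points) for f in F}
            if len(patterns) == 2 ** k:   # every labelling of these k points is realized
                shattered_some = True
                break
        if shattered_some:
            d = k
        else:
            break                         # if no set of size k is shattered, none larger is
    return d

# Thresholds on {0, ..., 4}: f_t(x) = 1 iff x >= t.  VC dimension is 1.
domain = list(range(5))
F = [{x: int(x >= t) for x in domain} for t in range(6)]
print(vc_dimension(F, domain))   # 1
```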

Typically, the VC dimension is used in learning theory (for example in the proof of Theorem 1) to argue that the number of functions that a class can express on any dataset is small (through the celebrated Sauer-Shelah lemma). In the language of one-inclusion graphs, this amounts to arguing that the number of vertices of the one-inclusion graph is small. But as we noted above, the error in the transductive setting (and as we will see in the PAC setting as well) is bounded by the maximum out-degree of the best orientation of the one-inclusion graph. This turns out to be controlled by the edge density of (subgraphs of) the one-inclusion graph (this general fact can be argued using Hall's marriage theorem and was first presented by Alon and Tarsi [AT92]). Surprisingly, Haussler, Littlestone, and Warmuth [HLW94] proved a beautiful combinatorial result showing that the edge density is at most d, the VC dimension of the class \mathcal{F}, and thus the maximum out-degree of the best orientation can always be made at most d.

Theorem 4 [Orientations of One-Inclusion Graphs] Let \mathcal{F} be a hypothesis class with VC dimension d. Let {x_1, \dots , x_n} be a set of n points from the domain \mathcal{X} and let G be the induced one-inclusion graph of \mathcal{F} on this set. Then, there exists an orientation of the edges of G such that the maximum out-degree of any vertex is at most d.

This result amounts to saying that one-inclusion graphs are very sparse. It can be seen as a version of an isoperimetric inequality on the Boolean hypercube and provides a complementary view on the VC dimension compared to the Sauer-Shelah lemma.

Though in this post we are not focused on the computational efficiency of our algorithms, we remark that the minimizing orientation can in fact be found in time polynomial in the size of the one-inclusion graph (using flow algorithms), but this graph can have size n^{O(d)} (this again follows from the Sauer-Shelah lemma).
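For completeness, here is a sketch of one standard flow-based reduction for the orientation step; this illustrates the general technique rather than the specific construction used in the papers, and it assumes the networkx package is available. It checks whether an orientation with maximum out-degree at most k exists and, if so, returns one.

```python
import networkx as nx

def orient_with_outdegree_at_most(vertices, edges, k):
    """Try to orient `edges` so that every vertex has out-degree <= k.

    Returns a list of directed edges (tail, head) on success, or None if no
    such orientation exists.  Standard reduction to maximum flow: each edge
    sends one unit of flow to the endpoint chosen as its tail, and each
    vertex can absorb at most k units.
    """
    G = nx.DiGraph()
    for i, (u, v) in enumerate(edges):
        G.add_edge("s", ("e", i), capacity=1)     # one unit per edge
        G.add_edge(("e", i), u, capacity=1)
        G.add_edge(("e", i), v, capacity=1)
    for v in vertices:
        G.add_edge(v, "t", capacity=k)            # out-degree budget of each vertex
    value, flow = nx.maximum_flow(G, "s", "t")
    if value < len(edges):
        return None                               # no orientation with max out-degree <= k
    orientation = []
    for i, (u, v) in enumerate(edges):
        tail = u if flow[("e", i)].get(u, 0) == 1 else v
        orientation.append((tail, v if tail == u else u))
    return orientation

# Tiny check on a 4-cycle: max out-degree 1 is achievable, 0 is not.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(orient_with_outdegree_at_most(V, E, 1) is not None)   # True
print(orient_with_outdegree_at_most(V, E, 0) is None)       # True
```

Combined with a binary search over k, this gives an orientation minimizing the maximum out-degree in time polynomial in the size of the graph, matching the remark above.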

Going back to our game, we can combine the above and conclude that the player has a final error bound

\Pr\left[\hat{Y}_{n+1} \neq f^\star(X_{n+1})\right] \leq \frac{d}{n+1}.


This result tells us that the VC dimension is a quantitative measure of complexity of the class that guarantees how well the player can perform. In fact, it is the quantitative measure of complexity for this game given the matching lower bounds of [LLS01] (this is the transductive analogue of Theorem 2).

We reiterate that a key aspect of the one-inclusion graph itself is that it does not depend on the order of the training examples x_1, \dots , x_n, since changing the order leads to an isomorphic graph. Since the minimum over orientations of the maximum out-degree is a graph property, i.e. it does not depend on the labelling of the vertices, one can hope to implement the one-inclusion graph algorithm in a permutation-invariant way. For example, one can a priori order the domain \mathcal{X} in some arbitrary way and always consider the training examples in that order. The fact that there is a prediction strategy that is permutation invariant and has the above error bound will be crucial later in the analysis.

The above argument was specialized to binary function classes; however, it naturally extends to many other settings, e.g. multiclass classification, where instead of a graph we have a hypergraph [RBR09, DS14]. In all these extensions, the one-inclusion strategy still works: we can minimize the maximum out-degree of the relevant one-inclusion (hyper)graph to achieve an upper bound on the error that scales with the maximum out-degree. Unfortunately, in more general settings it is not always clear whether the maximum out-degree can be made independent of the sample size n as in the binary case above, but the one-inclusion graph strategy still provides a unifying perspective on the in-expectation error of learning algorithms.

From Transductive Learning to Expected Risk Bounds


While the transductive model is interesting in its own right, our original goal was to understand how well we could do in the PAC model. The OIG algorithm only provides a single prediction for a test point X_{n+1}, but this implicitly defines a function over \mathcal{X}: whenever we want to evaluate the label of a test point x, we can build and orient the relevant OIG constructed from the dataset S and test point x. Moreover, we get a bound on the error of this function for an average training sample S (rather than with high probability) by applying the analysis above! To see this, let \widehat{f}_{\mathrm{OIG}} be the function implicitly defined by the OIG algorithm. Because the examples X_1, \dots, X_n that we received in our training sample and the test point X = X_{n+1} are i.i.d., (X_1, \dots, X_{n},X_{n+1}) has the same distribution as (X_{\pi(1)}, \dots, X_{\pi(n)}, X_{\pi(n+1)}) for any permutation \pi: [n+1] \to [n+1], so we can imagine the X_i as being permuted by a uniformly random permutation \pi. So, we can bound the error just like in the transductive setting:

\mathbb{E}_{S} \left[ \Pr_{X \sim \mathcal{D}} \left[\widehat{f}_{\mathrm{OIG}}(X) \neq f^\star(X) \right]\right] = \mathbb{E}_{X_1, \dots, X_{n+1}} \left[\mathbf{1}\left\{\widehat{f}_{\mathrm{OIG}}(X_{n+1}) \neq f^\star(X_{n+1}) \right\}\right]
= \mathbb{E}_{X_1, \dots, X_{n+1}}\left[\mathbb{E}_{\pi}\left[\mathbf{1}\left\{\widehat{f}_{\mathrm{OIG}}(X_{\pi(n+1)}) \neq f^\star(X_{\pi(n+1)}) \right\}\right]\right]
\le \frac{d}{n+1},

where in the second equality \widehat{f}_{\mathrm{OIG}} is computed from the permuted training sample (X_{\pi(1)}, \dots, X_{\pi(n)}).


To get the final inequality above, we condition on the values of X_1, \dots, X_{n+1} and then take the expectation over the uniformly random permutation, which is exactly the transductive guarantee from the previous section. Thus, the OIG algorithm gives us a learning strategy (very different from ERM) that achieves a sharp in-expectation bound in a setting similar to the PAC setting. It is thus natural to ask whether the OIG algorithm produces a bona fide PAC learning algorithm with sharp error guarantees. In fact, Manfred Warmuth conjectured that the OIG algorithm achieves a strong PAC bound that matches the general lower bound on the error any algorithm must incur.

Conjecture 1 [War04] The one-inclusion graph algorithm of [HLW94] always achieves the lower bound. That is, after receiving

\Omega\left(\frac{d+ \log\left(\frac{1}{\delta}\right)}{\epsilon} \right)

examples, its error is at most \epsilon with probability at least 1-\delta.

Some bad news


Unfortunately, Warmuth’s conjecture is false in the strictest sense, as was shown in [AACSZ23b].

Theorem 5 [Informal] For any integer d > 1, sample size n and confidence parameter \delta > 0, there is a hypothesis class \mathcal{H}_d that has VC dimension d, a distribution P, a target function f^\star and a valid instantiation of the one-inclusion graph algorithm such that, when it receives an i.i.d. sample from P of size n (labelled by f^\star), with probability at least \delta the one-inclusion graph algorithm \widehat{f}_{\mathrm{OIG}} incurs error

\mathcal{R}(  \widehat{f}_{\mathrm{OIG}}  ) \geq c \cdot \frac{d}{\delta n}

for some universal constant c > 0.

By valid, we mean that the algorithm always predicts with an orientation of the one-inclusion graph whose maximum out-degree is at most d. In particular, the above theorem says that it is not enough to just minimize the out-degree of the one-inclusion graph when predicting, which is in sharp contrast with the bounds we can get on average!

Note that this hints at a fundamental difference between the PAC and the transductive setting. In the PAC setting, we need to worry about the behavior of the algorithm across different datasets, while the transductive setting (and also the in-expectation version of the PAC setting) only requires us to worry about the behavior of the algorithm on (random permutations of) a single dataset. Intuitively, the lower bound is proven by “coordinating” the mistakes of the one-inclusion graphs that the algorithm predicts with. Recall that the one-inclusion graph algorithm makes a mistake on a test point x iff the relevant edge in the graph is oriented away from the vertex that corresponds to f^\star. We show that there is a “hard” distribution P such that, with probability at least \delta, we get a “bad” training set S_{\text{BAD}} such that (roughly) a d/(\delta n) fraction of the possible unseen test points x \in \mathcal{X} have a one-inclusion graph that points away from f^\star on the relevant edge. Though the high-level idea is simple, it is instructive to note an important subtlety. Recall that if we swap the test point x with any training point x' \in S, we still have the same one-inclusion graph. Moreover, we must use the same orientation for this graph when predicting on x and x' in order for our in-expectation guarantee to hold. Thus, flipping edges in an OIG will affect the maximum out-degree of a one-inclusion graph that can be used by the algorithm when it receives a different training sample S' that differs from S in a single entry. So we must be careful to coordinate these errors to ensure that no one-inclusion graph ends up with maximum out-degree larger than d after we are done flipping edges.

In the next episode of …

Unfortunately, we have to end this blog post on this pessimistic note. But stay tuned for the next post, where we will see how to combine techniques from online learning with the exchangeability arguments that were key to the analysis of the OIG to design simple prediction strategies that achieve optimal high-probability PAC bounds.

References

  • [AACSZ23a] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. Optimal PAC bounds without uniform convergence. In Foundations of Computer Science (FOCS), 2023.
  • [AACSZ23b] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. The one-inclusion graph algorithm is not always optimal. In Conference on Learning Theory (COLT), 2023.
  • [AT92] Noga Alon and Michael Tarsi. Colorings and orientations of graphs. Combinatorica, 12(2):125–134, 1992.
  • [DS14] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory (COLT), 2014.
  • [Han16] Steve Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.
  • [HLW94] David Haussler, Nick Littlestone, and Manfred Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
  • [Lit89] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, 1989.
  • [LLS01] Yi Li, Philip Long, and Aravind Srinivasan. The one-inclusion graph algorithm is near-optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3):1257–1261, 2001.
  • [RBR09] Benjamin Rubinstein, Peter Bartlett, and Hyam Rubinstein. Shifting: One-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37–59, 2009.
  • [SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [Sim15] Hans Simon. An almost optimal PAC algorithm. In Conference on Learning Theory (COLT), 2015.
  • [War04] Manfred Warmuth. The optimal PAC algorithm. In International Conference on Computational Learning Theory (COLT), 2004.

Footnotes

  1. As stated the learner is not required to output a hypothesis in the class \mathcal{F}. Adding such a requirement would lead to a setting called proper learning but in this post we will allow our learners to be improper without further comment. ↩︎
  2. By reasonable, we mean the algorithm will never output a bit that is not possible, i.e. in the case that there is a single vertex to choose between it always picks the correct bit. ↩︎
