{"id":553,"date":"2024-04-29T16:56:27","date_gmt":"2024-04-29T16:56:27","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=553"},"modified":"2024-04-29T17:06:06","modified_gmt":"2024-04-29T17:06:06","slug":"the-curious-landscape-of-multiclass-learning","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2024\/04\/29\/the-curious-landscape-of-multiclass-learning\/","title":{"rendered":"The Curious Landscape of Multiclass Learning"},"content":{"rendered":"\n<p class=\"\">Welcome to the second installment of the Learning Theory Alliance blog post series! In case you missed the first installment on calibration, check it out <a href=\"https:\/\/www.let-all.com\/blog\/2024\/03\/13\/calibration-for-decision-making-a-principled-approach-to-trustworthy-ml\/\">here<\/a>.<\/p>\n\n\n\n<p class=\"\">This time, <a href=\"https:\/\/www.cs.princeton.edu\/~nbrukhim\/\">Nataly Brukhim<\/a> and <a href=\"https:\/\/web.stanford.edu\/~cpabbara\/\">Chirag Pabbaraju<\/a> tell us about some exciting breakthrough results on multiclass PAC learning.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"\">The gold standard of learning theory is the classical PAC model defined by Valiant in 1984 [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1968.1972\">Val84<\/a>]. 
In particular, it allows us to reason about two fundamental questions: <strong><em><span style=\"text-decoration: underline;\">what<\/span><\/em><\/strong> can be learned, and <strong><em><span style=\"text-decoration: underline;\">how<\/span><\/em><\/strong> to learn it.<\/p>\n\n\n\n<p class=\"\">Generally in this setting, we are given some domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> (like images) and a label space <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> (&#8220;dog&#8221; or &#8220;cat&#8221;?). The problem of learning a given class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5Cmathcal%7BY%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;mathcal{Y}^&#92;mathcal{X}\" class=\"latex\" \/> then boils down to finding a hypothesis <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h+%5Cin+%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h &#92;in &#92;mathcal{H}\" class=\"latex\" \/> after having observed training data drawn from some distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/> over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/>, such that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> has a small classification error with high 
probability, with respect to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">For <em>binary<\/em> classification tasks, we get an especially clean and refined mathematical picture, in which these two questions are essentially resolved:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">The question of <strong><em>what<\/em><\/strong> can be learned is answered via the Vapnik-Chervonenkis (VC) dimension [<a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-319-21852-6_3\">VC68<\/a>, <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/76359.76371\">BEHW89<\/a>], which completely characterizes learnability. A bit more concretely, the VC dimension is a combinatorial measure that quantifies the sample complexity needed to learn any given function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5C%7B0%2C1%5C%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;{0,1&#92;}^&#92;mathcal{X}\" class=\"latex\" \/> of binary classifiers.<\/li>\n\n\n\n<li class=\"\">The question of <strong><em>how<\/em><\/strong> to learn is resolved by a rich algorithmic and analytical toolbox. This includes, for example, the elegantly simple algorithmic principle of Empirical Risk Minimization (ERM), the clever boosting methodology, and sample compression scheme guarantees (we will elaborate more on these tools later on).<\/li>\n<\/ul>\n\n\n\n<p class=\"\">One might expect similar ideas to hold for multiclass classification tasks as well. After all, it is the most natural generalization of the binary setting to allow for more than two labels. Surprisingly, this is not at all the case! 
The story turns out to be much more complex and enigmatic when it comes to multiclass learning.<\/p>\n\n\n\n<p class=\"\">For example, the main question of <strong><em>what<\/em><\/strong> is learnable (i.e., extending the VC characterization of learnability) had been a longstanding open problem whose study dates back to the 1980s. Seminal works [<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/B9780934613644500463\">NT88<\/a>, <a href=\"https:\/\/link.springer.com\/article\/10.1007\/BF00114804\">Nat89<\/a>, <a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/130385.130423\">BDCHL95<\/a>] identified that the Natarajan dimension characterizes PAC learnability when the number of classes is bounded. However, it remained unknown whether it also captures learnability when the number of classes is unbounded.<\/p>\n\n\n\n<p class=\"\">Similarly, the question of <strong><em>how<\/em><\/strong> also reveals a stark contrast between binary and multiclass classification.<br>One important example is the standard ERM principle, which in fact fails to hold in the multiclass case [<a href=\"https:\/\/proceedings.mlr.press\/v19\/daniely11a\/daniely11a.pdf\">DSBDSS11<\/a>]. Boosting theory also does not easily extend from the binary case [<a href=\"https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/17f5e6db87929fb55cebeb7fd58c1d41-Abstract.html\">BHMS21<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2307.00642\">BDMM23<\/a>], and it had been unknown whether learnable classes admit finite-sized sample compression schemes in the multiclass setting, as is true in the binary one.<\/p>\n\n\n\n<p class=\"\">On the whole, the clean mathematical picture of binary classification did not generalize, revealing a landscape of multiclass learning far more puzzling than initially expected. In the past couple of years, several advances were achieved, resolving some of these longstanding questions. 
In this post, we\u2019re going to describe these results with the goal of demystifying multiclass learning and possibly shedding light on broader phenomena in learning theory.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Question of <em>What<\/em> is Learnable and the DS Dimension<\/h2>\n\n\n\n<p class=\"\">Many important machine learning tasks require classification into many target classes: in image object recognition, the number of classes is the number of possible objects. In language models, the number of classes scales with the dictionary size. In protein folding prediction, the goal is to predict the 3D structures of proteins based on their 1D amino acid sequence.<\/p>\n\n\n\n<p class=\"\">The fundamental question of <em>what is learnable<\/em> can be formalized as follows. For a domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/>, and label space <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/>, we ask whether a given class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5Cmathcal%7BY%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;mathcal{Y}^&#92;mathcal{X}\" class=\"latex\" \/> is <em>PAC learnable<\/em>.<\/p>\n\n\n\n<p class=\"\"><strong>Definition (PAC learnability):<\/strong> We say that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/> is PAC learnable if there is a learning rule <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}\" 
class=\"latex\" \/> for which the following holds. For any distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}\" class=\"latex\" \/>, target <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%5E%5Cstar+%5Cin+%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h^&#92;star &#92;in &#92;mathcal{H}\" class=\"latex\" \/>, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%2C%5Cdelta+%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon,&#92;delta &gt; 0\" class=\"latex\" \/> when given training set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%7Bm%7D%7D_%7B+%7B%5Cepsilon%2C+%5Cdelta%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{{m}}_{ {&#92;epsilon, &#92;delta}}\" class=\"latex\" \/> samples from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%5Cmathcal%7BD%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{&#92;mathcal{D}}\" class=\"latex\" \/> labeled by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%5E%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h^&#92;star\" class=\"latex\" \/>, then the output of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BA%7D%28S%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{A}(S)\" class=\"latex\" \/> is a classifier <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f_S%3A+%5Cmathcal%7BX%7D+%5Crightarrow+%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f_S: &#92;mathcal{X} &#92;rightarrow &#92;mathcal{Y}\" class=\"latex\" \/> that 
satisfies: <\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CPr_%7BS%7D%5Cleft%5B+%5CPr_%7Bx+%5Csim+%5Cmathcal%7BD%7D%7D%5Cleft%5B+f_S%28x%29+%5Cne+h%5E%5Cstar%28x%29%5Cright%5D+%5Cle+%5Cepsilon+%5Cright%5D+%5Cge+1-%5Cdelta.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Pr_{S}&#92;left[ &#92;Pr_{x &#92;sim &#92;mathcal{D}}&#92;left[ f_S(x) &#92;ne h^&#92;star(x)&#92;right] &#92;le &#92;epsilon &#92;right] &#92;ge 1-&#92;delta.\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">The seminal result of Vapnik and Chervonenkis (1971) asserts that the finiteness of the VC dimension characterizes the sample complexity for the binary case, i.e., <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%5Cmathcal%7BY%7D%7C+%3D+2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"|&#92;mathcal{Y}| = 2\" class=\"latex\" \/>. The VC dimension is a combinatorial measure of the complexity of a function class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/>. 
The definition of the VC dimension is not directly necessary to understand this blog post, but we state it here for the curious reader.<\/p>\n\n\n\n<p class=\"\"><strong>Definition (VC dimension):<\/strong> We say that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%5Cin+%5Cmathcal%7BX%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S &#92;in &#92;mathcal{X}^n\" class=\"latex\" \/> is <em>VC-shattered<\/em> by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5C%7B0%2C1%5C%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;{0,1&#92;}^&#92;mathcal{X}\" class=\"latex\" \/> if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D%7C_S+%3D+%5C%7B0%2C1%5C%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}|_S = &#92;{0,1&#92;}^n\" class=\"latex\" \/>, where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D%7C_S+%3D+%5C%7Bh%7C_S+%3A+h+%5Cin+%5Cmathcal%7BH%7D%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}|_S = &#92;{h|_S : h &#92;in &#92;mathcal{H}&#92;}\" class=\"latex\" \/><sup data-fn=\"d9aa3d0e-7639-41e4-99cc-7d16f00a0f23\" class=\"fn\"><a href=\"#d9aa3d0e-7639-41e4-99cc-7d16f00a0f23\" id=\"d9aa3d0e-7639-41e4-99cc-7d16f00a0f23-link\">1<\/a><\/sup>. 
The VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_%7BVC%7D%28%5Cmathcal%7BH%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_{VC}(&#92;mathcal{H})\" class=\"latex\" \/> is the maximum size of a VC-shattered sequence.<\/p>\n\n\n\n<p class=\"\">The characterization of learnability in the binary case is then given by the following tight bound on the sample complexity needed to PAC-learn any class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/> [<a href=\"https:\/\/link.springer.com\/book\/10.1007\/978-1-4757-2440-0\" data-type=\"link\" data-id=\"https:\/\/link.springer.com\/book\/10.1007\/978-1-4757-2440-0\">Vap95<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/1507.00473\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/abs\/1507.00473\">Han15<\/a>]:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_%7B+%5Cepsilon%2C+%5Cdelta%7D+%3D+%5CTheta%5Cleft%28%5Cfrac%7Bd_%5Ctext%7BVC%7D%28%5Cmathcal%7BH%7D%29+%2B+%5Clog%281%2F%5Cdelta%29%7D%7B%5Cepsilon%7D%5Cright%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_{ &#92;epsilon, &#92;delta} = &#92;Theta&#92;left(&#92;frac{d_&#92;text{VC}(&#92;mathcal{H}) + &#92;log(1\/&#92;delta)}{&#92;epsilon}&#92;right).\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">In the late 1980&#8217;s, Natarajan identified a natural generalization of the VC dimension. 
Later, [<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/130385.130423\">BDCHL95<\/a>] showed that it exhibits the following bounds: <\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=C_1+%5Cleft%28%5Cfrac%7Bd_%5Ctext%7BN%7D%28%5Cmathcal%7BH%7D%29+%2B+%5Clog%28%5Cfrac%7B1%7D%7B%5Cdelta%7D%29%7D%7B%5Cepsilon%7D%5Cright%29+%5Cle+%7B%7Bm%7D%7D_%7B+%7B%5Cepsilon%2C+%5Cdelta%7D%7D+%5Cle+C_2+%5Cleft%28%5Cfrac%7Bd_%5Ctext%7BN%7D%28%5Cmathcal%7BH%7D%29%5Clog%28%7C%5Cmathcal%7BY%7D%7C%29+%5Clog%28%5Cfrac%7B1%7D%7B%5Cepsilon%7D%29+%2B+%5Clog%28%5Cfrac%7B1%7D%7B%5Cdelta%7D%29%7D%7B%5Cepsilon%7D%5Cright%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"C_1 &#92;left(&#92;frac{d_&#92;text{N}(&#92;mathcal{H}) + &#92;log(&#92;frac{1}{&#92;delta})}{&#92;epsilon}&#92;right) &#92;le {{m}}_{ {&#92;epsilon, &#92;delta}} &#92;le C_2 &#92;left(&#92;frac{d_&#92;text{N}(&#92;mathcal{H})&#92;log(|&#92;mathcal{Y}|) &#92;log(&#92;frac{1}{&#92;epsilon}) + &#92;log(&#92;frac{1}{&#92;delta})}{&#92;epsilon}&#92;right),\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=C_1%2CC_2+%3E0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"C_1,C_2 &gt;0\" class=\"latex\" \/> are absolute constants, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_N&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_N\" class=\"latex\" \/> is the Natarajan dimension. Notice that the Natarajan dimension exhibits similar bounds to the binary case, up to a factor of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7C%5Cmathcal%7BY%7D%7C%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;log(|&#92;mathcal{Y}|)\" class=\"latex\" \/>. 
However, this factor can be huge or possibly infinite for tasks that do not admit an a priori reasonable bound on the number of classes, rendering the existing Natarajan-based bounds vacuous. It is therefore a natural question to ask whether we can get rid of the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7C%5Cmathcal%7BY%7D%7C%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;log(|&#92;mathcal{Y}|)\" class=\"latex\" \/> factor?<\/p>\n\n\n\n<p class=\"\">In 2014, Daniely and Shalev-Shwartz [<a href=\"https:\/\/arxiv.org\/abs\/1405.2420\">DSS14<\/a>] studied general algorithmic principles of the multiclass setting. Their analysis led them to identify a new notion of dimension, later named the DS dimension. Initially, they proved that a finite DS dimension is a necessary condition for PAC learnability, as is true for Natarajan dimension. They then conjectured that a finite DS dimension is also a sufficient condition, but they too left the full characterization of learnability open.<\/p>\n\n\n\n<p class=\"\">The answer finally arrived in 2022, when [<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>] showed that the DS dimension is indeed the correct characterization for multiclass learnability.<\/p>\n\n\n\n<p class=\"\">We will now introduce the DS dimension. Two useful concepts that we will need are the notions of a <em>one-inclusion graph<\/em> and a <em>pseudo-cube<\/em>. The idea here is to translate a learning problem<br>to the language of graph theory. 
For any <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n &gt; 0\" class=\"latex\" \/> and set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H%5Csubseteq+%5Cmathcal%7BY%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H&#92;subseteq &#92;mathcal{Y}^n\" class=\"latex\" \/>, the <em>one-inclusion graph<\/em> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H\" class=\"latex\" \/> is a hypergraph <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BG%7D%28H%29%3D%28V%2CE%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{G}(H)=(V,E)\" class=\"latex\" \/> that is defined as follows. The vertex-set is simply <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V%3DH&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V=H\" class=\"latex\" \/>. 
For the edge set, we have that two nodes are connected by an edge when they differ in their labeling of exactly one coordinate.<\/p>\n\n\n\n<p class=\"\">Then, an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>-pseudo-cube is a <em>finite<\/em> set of vectors <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%5Csubseteq+%5Cmathcal%7BY%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H &#92;subseteq &#92;mathcal{Y}^n\" class=\"latex\" \/> such that for every <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h+%5Cin+H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h &#92;in H\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i+%5Cle+n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i &#92;le n\" class=\"latex\" \/> there is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%27+%5Cin+H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h&#039; &#92;in H\" class=\"latex\" \/> that agrees with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> on all coordinates but <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/>. 
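As a concrete illustration (our addition, not from the original references): the pairwise edges of a one-inclusion graph can be enumerated directly from this definition. The sketch below handles only the pairwise case; in the general hypergraph, a hyperedge groups together all vertices that agree outside a single coordinate.

```python
from itertools import combinations

def one_inclusion_edges(H):
    """Pairwise edges of the one-inclusion graph of a finite H subset of Y^n:
    two vertices are joined when they differ in exactly one coordinate.
    Each edge is returned as (h, g, i), recording the differing coordinate i."""
    edges = []
    for h, g in combinations(sorted(H), 2):
        differing = [i for i in range(len(h)) if h[i] != g[i]]
        if len(differing) == 1:
            edges.append((h, g, differing[0]))
    return edges

# A tiny example over Y = {0, 1, 2} with n = 2 coordinates.
H = {(0, 0), (0, 1), (1, 1)}
for h, g, i in one_inclusion_edges(H):
    print(h, "--", g, "on coordinate", i)
```

Here `(0, 0)` and `(0, 1)` are joined on coordinate 1, and `(0, 1)` and `(1, 1)` on coordinate 0, while `(0, 0)` and `(1, 1)` differ in both coordinates and so share no edge.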
It can also be viewed as a finite subset of a one-inclusion graph on <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%5Csubseteq+%5Cmathcal%7BY%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H &#92;subseteq &#92;mathcal{Y}^n\" class=\"latex\" \/> with edges on all coordinates.<\/p>\n\n\n\n<p class=\"\">Notice that when <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D+%3D+%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y} = &#92;{0,1&#92;}\" class=\"latex\" \/>, the two notions &#8220;boolean cube&#8221; and &#8220;pseudo-cube&#8221; coincide, but when <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clvert+%5Cmathcal%7BY%7D%5Crvert+%3E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;lvert &#92;mathcal{Y}&#92;rvert &gt;2\" class=\"latex\" \/>, that is no longer the case. Intuitively, a pseudo-cube can be thought of as a generalization of the boolean cube, where every element in the pseudo-cube <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H\" class=\"latex\" \/> has a &#8220;neighbor&#8221; on <span style=\"text-decoration: underline\">every<\/span> coordinate.<\/p>\n\n\n\n<p class=\"\">As a simple example, let&#8217;s say the domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> is of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>, and assume we have <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=3&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"3\" class=\"latex\" \/> labels: <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D+%3D+%5C%7B0%2C1%2C2%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y} = &#92;{0,1,2&#92;}\" class=\"latex\" \/>. We can represent any class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/> by a graph, where each node corresponds to some <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h+%5Cin+%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h &#92;in &#92;mathcal{H}\" class=\"latex\" \/>. Each edge is colored to denote which coordinate changes:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"514\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=678%2C514&#038;ssl=1\" alt=\"\" class=\"wp-image-588\" style=\"width:361px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=1024%2C776&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=300%2C227&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=768%2C582&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?w=1336&amp;ssl=1 1336w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">The above graph is an example of a <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>-pseudo-cube corresponding to a set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H\" class=\"latex\" \/> of six <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>-dimensional vectors. If any of these were removed, the remaining set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H%27+%5Csubsetneq+H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H&#039; &#92;subsetneq H\" class=\"latex\" \/> would no longer form a pseudo-cube since it would violate the definition above. <\/p>\n\n\n\n<p class=\"\">Alternatively, if another <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>-dimensional vector were added, say <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%281%2C+0%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(1, 0)\" class=\"latex\" \/>, the new set would also form a <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>-pseudo-cube since each element (or node) in the new set also has a &#8220;neighbor&#8221; on every coordinate. Notice that this time we also have hyper-edges that contain more than <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/> nodes. 
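The pseudo-cube condition is easy to test mechanically. The Python sketch below is our own illustration (the specific six vectors from the figure are not reproduced; the three-label example at the end is a hypothetical pseudo-cube of our choosing): it checks that every element has, on every coordinate, a neighbor agreeing with it everywhere else.

```python
def is_pseudo_cube(H):
    """Check the pseudo-cube property for a finite H subset of Y^n:
    every h in H must have, for every coordinate i, a 'neighbor' h' in H
    that differs from h at i and agrees with h on all other coordinates."""
    H = set(H)
    n = len(next(iter(H)))
    for h in H:
        for i in range(n):
            has_neighbor = any(
                g[i] != h[i] and all(g[j] == h[j] for j in range(n) if j != i)
                for g in H
            )
            if not has_neighbor:
                return False
    return True

# The full boolean cube {0,1}^2 is a pseudo-cube...
assert is_pseudo_cube({(0, 0), (0, 1), (1, 0), (1, 1)})
# ...but removing any vertex breaks the property.
assert not is_pseudo_cube({(0, 0), (0, 1), (1, 0)})
# A hypothetical 2-pseudo-cube over Y = {0,1,2} that is not a boolean cube.
assert is_pseudo_cube({(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0)})
```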
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"523\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?resize=678%2C523&#038;ssl=1\" alt=\"\" class=\"wp-image-635\" style=\"width:375px\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?resize=1024%2C790&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?resize=300%2C231&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?resize=768%2C592&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?w=1514&amp;ssl=1 1514w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g3-1.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">We are now ready to define the DS dimension. We say that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%5Cin+%5Cmathcal%7BX%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S &#92;in &#92;mathcal{X}^n\" class=\"latex\" \/> is <em>DS-shattered<\/em> by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D%5Csubseteq+%5Cmathcal%7BY%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}&#92;subseteq &#92;mathcal{Y}^&#92;mathcal{X}\" class=\"latex\" \/> if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D%7C_S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}|_S\" class=\"latex\" \/> contains an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>-dimensional pseudo-cube. 
The DS dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_%7BDS%7D%28%5Cmathcal%7BH%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_{DS}(&#92;mathcal{H})\" class=\"latex\" \/> is the maximum size of a DS-shattered set.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Multiclass Characterization<\/h3>\n\n\n\n<p class=\"\">The result showing that the DS dimension indeed characterizes learnability gives the following upper bound on the sample complexity [<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>]:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%7Bm%7D%7D_%7B+%7B%5Cepsilon%2C+%5Cdelta%7D%7D+%3D+%5CTilde%7BO%7D%5Cleft%28%5Cfrac%7Bd_%7BDS%7D%5E%7B1.5%7D%7D%7B%5Cepsilon%7D%5Cright%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{{m}}_{ {&#92;epsilon, &#92;delta}} = &#92;Tilde{O}&#92;left(&#92;frac{d_{DS}^{1.5}}{&#92;epsilon}&#92;right),\" class=\"latex\" \/> <\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_%7BDS%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_{DS}\" class=\"latex\" \/> is the DS dimension of the class. Combined with the previously known lower bound [<a href=\"https:\/\/arxiv.org\/abs\/1405.2420\">DSS14<\/a>] of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CTilde%7B%5COmega%7D%5Cleft%28%5Cfrac%7Bd_%7BDS%7D%7D%7B%5Cepsilon%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Tilde{&#92;Omega}&#92;left(&#92;frac{d_{DS}}{&#92;epsilon}&#92;right)\" class=\"latex\" \/>, we get the complete characterization of the multiclass sample complexity.<\/p>\n\n\n\n<p class=\"\">The proof of the upper bound above involves many components and uses a variety of classic algorithmic ideas such as boosting, sample compression, transductive learning, ERM, and shifting. 
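To make DS-shattering concrete on tiny examples, here is an illustrative brute-force check (our own sketch, exponential in the input size and only meant to mirror the definition): given the finite set of behavior patterns of a class on a sequence of points, it searches for a subset forming a pseudo-cube.

```python
from itertools import combinations

def contains_pseudo_cube(patterns):
    """Brute-force check of DS-shattering on a finite set of patterns in Y^n:
    does some subset of `patterns` form an n-dimensional pseudo-cube?
    Exponential-time; intended only to illustrate the definition."""
    pats = sorted(set(patterns))
    n = len(pats[0])

    def is_pseudo_cube(H):
        # Every h needs, on every coordinate i, a neighbor in H differing
        # only at coordinate i.
        return all(
            any(g[i] != h[i] and all(g[j] == h[j] for j in range(n) if j != i)
                for g in H)
            for h in H for i in range(n)
        )

    for size in range(2, len(pats) + 1):
        for H in combinations(pats, size):
            if is_pseudo_cube(set(H)):
                return True
    return False

# The boolean cube hides inside this pattern set, so the two points are shattered.
print(contains_pseudo_cube([(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]))
# No subset of a path of three patterns satisfies the neighbor condition.
print(contains_pseudo_cube([(0, 0), (0, 1), (1, 1)]))
```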
In addition to these classic tools, the proof also relies on resolving a new intermediate learning problem, which is interesting in its own right: namely, a setting where instead of predicting a single label for a given input, the goal is to provide a short list of predictions. The setting is called &#8220;list PAC learning&#8221;.<\/p>\n\n\n\n<p class=\"\">Intuitively, list-learning extends the standard PAC model by allowing the learning algorithm more freedom, and is a useful relaxation of the problem. One can also easily imagine scenarios where list-learning is itself a practical goal. For example, it can offer physicians a list of likely diagnoses, it can provide businesses a list of consumers&#8217; preferences, and so on. This new setting has since been studied in several subsequent works [<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3564246.3585190\">CP23<\/a>, <a href=\"https:\/\/proceedings.mlr.press\/v195\/moran23a\/moran23a.pdf\">MSTY23<\/a>]. Next, we&#8217;ll see how list-learning is a key step towards proving our desired bound.<\/p>\n\n\n\n<p class=\"\">The starting point of the proof is a combinatorial abstraction of the learning problem by translating it into the language of hypergraph theory. This results in algorithmic problems on the objects referred to as &#8220;one-inclusion graphs&#8221;, which we defined earlier. The one-inclusion graph leads to a simple but useful toy model for transduction in machine learning. <\/p>\n\n\n\n<p class=\"\">The toy model is as follows. Imagine a learning game played over a one-inclusion graph <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28V%2C+E%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(V, E)\" class=\"latex\" \/>. The input to the problem is an edge. 
The input edge <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=e&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"e\" class=\"latex\" \/> is generated by first fixing some node <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/>, and then choosing <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=e&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"e\" class=\"latex\" \/> to be a uniformly random edge containing <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/>. The goal is to output a vertex <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=u&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"u\" class=\"latex\" \/> that is equal to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/> with as high probability as possible.  <\/p>\n\n\n\n<p class=\"\">Now, notice that learning algorithms in this toy model are in fact <strong><em>orientations<\/em><\/strong> of the graph. Let&#8217;s see a simple example. 
Consider the following one-inclusion graph, similar to the ones we saw above (though it is not a pseudo-cube): <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"352\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g4.png?resize=678%2C352&#038;ssl=1\" alt=\"\" class=\"wp-image-636\" style=\"width:430px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g4.png?resize=1024%2C532&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g4.png?resize=300%2C156&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g4.png?resize=768%2C399&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/g4.png?w=1148&amp;ssl=1 1148w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">Observe that we simply <strong><em>oriented<\/em><\/strong> the edges of the graph, by choosing a single &#8220;representative&#8221; for each edge, to which it points. An orientation is equivalent to a learning algorithm in the toy model, since it tells us which node to predict. Let&#8217;s assume that we chose <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/> to be the node <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%281%2C2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(1,2)\" class=\"latex\" \/>. 
Then the pictured orientation would make a mistake if given the darker edge, and be correct if given the brighter edge.<\/p>\n\n\n\n<p class=\"\">Intuitively, the more edges are oriented away from a node, the more mistakes the learner will make if that node is the chosen <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/>. Therefore, a useful observation here is that bounding the error of the learner corresponds to bounding the largest <strong>out-degree<\/strong> of the orientation. That way, we guarantee that no matter which <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/> was chosen, the worst-case error is at most the maximal out-degree. <\/p>\n\n\n\n<p class=\"\">But how does that toy model relate to our original problem? We can think of an input edge as revealing the values of all coordinates, or data points, labeled by the &#8220;ground truth&#8221; node <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v^*\" class=\"latex\" \/>. Given the values of all coordinates except one, we try to predict the remaining one, which plays the role of our test point. Then, by simple symmetrization arguments, &#8220;good orientations&#8221; of the one-inclusion graph can actually be shown to yield non-trivial learning algorithms for our original task. 
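<\/p>\n\n\n\n<p class=\"\">To make the toy model concrete, here is a minimal Python sketch (our own illustration on a hypothetical 4-cycle, not any of the graphs pictured above): an orientation is just a choice of representative endpoint per edge, and the worst-case error over choices of the target node is governed by out-degrees.<\/p>

```python
# A toy one-inclusion graph: a hypothetical 4-cycle (not the class from the
# figures above). Vertices are label vectors; edges join vectors that differ
# in exactly one coordinate.
vertices = [(0, 0), (0, 1), (1, 1), (1, 0)]
edges = [((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (1, 0)), ((1, 0), (0, 0))]

# An orientation picks a "representative" endpoint for each edge: the node
# the learner predicts when shown that edge.
orientation = {e: e[1] for e in edges}

def error_probability(v_star):
    # Nature fixes v*, then reveals a uniformly random edge containing v*;
    # the learner errs exactly when that edge is oriented away from v*.
    incident = [e for e in edges if v_star in e]
    mistakes = sum(1 for e in incident if orientation[e] != v_star)
    return mistakes / len(incident)

# The error at v* is out-degree(v*) / degree(v*), so the worst case is
# controlled by the maximal out-degree of the orientation.
worst = max(error_probability(v) for v in vertices)
print(worst)  # here every node has degree 2 and out-degree 1, so 0.5
```

<p class=\"\">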
<\/p>\n\n\n\n<p class=\"\">So, we now want to solve the simpler problem of finding a good orientation for the one-inclusion graph <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28V%2C+E%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(V, E)\" class=\"latex\" \/> induced by our hypothesis class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H\" class=\"latex\" \/>. Recall that we assume the class has a bounded DS dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_%7BDS%7D+%3C+%5Cinfty&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_{DS} &lt; &#92;infty\" class=\"latex\" \/>. Based on the properties of the DS dimension, it turns out that we can actually construct a good orientation with a maximal out-degree that is at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7Bd_%7BDS%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{d_{DS}}\" class=\"latex\" \/>. <\/p>\n\n\n\n<p class=\"\">Translating this orientation back to our original problem, we get a (very) weak learner which has error <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1-%5Cfrac%7B1%7D%7Bd_%7BDS%7D%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1-&#92;frac{1}{d_{DS}+1}\" class=\"latex\" \/>. This error is pretty high, but the crucial point is that it is uniformly bounded away from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/>. So we managed to get a non-trivial error, but it is not clear how to proceed. How can we improve it? <\/p>\n\n\n\n<p class=\"\">It may be tempting to try to improve the error by boosting. 
But standard boosting turns out to be useless in our context, since the error is so high.<\/p>\n\n\n\n<p class=\"\">It is at this point that list learning comes to the rescue. We can show that the very-weak learner actually yields a list-learner where the list is of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Capprox+d_%7BDS%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;approx d_{DS}\" class=\"latex\" \/>. Then, the idea is to &#8220;map&#8221; each point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x+%5Cin+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x &#92;in X\" class=\"latex\" \/> to a <em>short<\/em> list of labels, which allows us to eliminate the vast majority of a priori possible labels. Instead of all of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/>, we can safely use the learned list <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmu%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mu(x)\" class=\"latex\" \/> as the effective &#8220;local&#8221; label space for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>. Therefore, lists can be thought of as tools for alphabet reduction.<\/p>\n\n\n\n<p class=\"\">To conclude the proof, it is natural to try to reduce the learning task to one in which the number of labels is bounded. But, did we just reduce the infinite alphabet case to the finite case? The short answer is no, since the class restricted to our learned list may not contain the target function and may even be vacuously empty. 
The solution instead is based on learning algorithms that operate on the one-inclusion graph and have a &#8220;local&#8221; property that avoids the problem we just raised. <\/p>\n\n\n\n<p class=\"\">The full algorithm is best thought of as a sample compression scheme, which demonstrates that a small subsample of the training data suffices to determine a hypothesis that correctly classifies all examples in it. We elaborate more on sample compression schemes later in this blog post. <\/p>\n\n\n\n<p class=\"\">Thus, we conclude the high-level outline of the proof for the upper bound on sample complexity via the DS dimension. While this proof is intricate, containing many details and components that are beyond the scope of this post, we hope that this overview captures the main ideas and sparks your interest in exploring the paper further.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">DS Dimension vs. Natarajan Dimension<\/h3>\n\n\n\n<p class=\"\">So we saw that the DS dimension characterizes multiclass learnability. But what about the Natarajan dimension, which had been the leading candidate to be the correct answer since the 1990\u2019s?<br>Could it be essentially equivalent to the DS dimension? Perhaps it scales with the DS dimension up to certain factors?<\/p>\n\n\n\n<p class=\"\">The answer is no! The work by [<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>] shows an arbitrarily large gap between the DS and the Natarajan dimension, proving that the Natarajan dimension in fact does not characterize learnability in the general case.<\/p>\n\n\n\n<p class=\"\">To understand this result, we first define the Natarajan dimension. 
We say that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%5Cin+%5Cmathcal%7BX%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S &#92;in &#92;mathcal{X}^n\" class=\"latex\" \/> is <em>N-shattered<\/em> by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5Cmathcal%7BY%7D%5E%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;mathcal{Y}^&#92;mathcal{X}\" class=\"latex\" \/> if the restriction of the class to these points contains an <em><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>-Natarajan cube<\/em>. A set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%5Csubseteq+%5Cmathcal%7BY%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H &#92;subseteq &#92;mathcal{Y}%5En\" class=\"latex\" \/> is an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>-Natarajan cube if for every <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i+%5Cin+%5Bn%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i &#92;in [n]\" class=\"latex\" \/> there exist <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=a_i%2Cb_i+%5Cin+%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"a_i,b_i &#92;in &#92;mathcal{Y}\" class=\"latex\" \/> with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=a_i+%5Cne+b_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"a_i &#92;ne b_i\" class=\"latex\" \/> such that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%3D+%5CPi_%7Bi%3D1%7D%5En+%5C%7Ba_i%2Cb_i%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H = &#92;Pi_{i=1}^n &#92;{a_i,b_i&#92;}\" class=\"latex\" \/>. 
The Natarajan dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d_%7BN%7D%28%5Cmathcal%7BH%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d_{N}(&#92;mathcal{H})\" class=\"latex\" \/> is the maximum size of an N-shattered sequence.<\/p>\n\n\n\n<p class=\"\">Let&#8217;s look at a simple example of a domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>, and label space <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D+%3D+%5C%7B0%2C1%2C2%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y} = &#92;{0,1,2&#92;}\" class=\"latex\" \/>. We can see that the Natarajan cube is a simple generalization of the boolean cube:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"222\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?resize=678%2C222&#038;ssl=1\" alt=\"\" class=\"wp-image-595\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?resize=1024%2C335&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?resize=300%2C98&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?resize=768%2C252&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?resize=1536%2C503&amp;ssl=1 
1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?w=1942&amp;ssl=1 1942w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.42.48%E2%80%AFPM.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n\n\n\n<p class=\"\">It may help to notice that containing a 2-Natarajan cube is equivalent to containing a <strong>square<\/strong> as a sub-graph of the corresponding one-inclusion graph. This intuition will help explain the separation result for higher dimensions. Let&#8217;s start with the simplest case we can show between the Natarajan and DS dimensions: <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/> vs. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2\" class=\"latex\" \/>. 
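<\/p>\n\n\n\n<p class=\"\">For small finite classes, the Natarajan dimension can be computed by brute force directly from this definition. A minimal sketch (our own illustration; the two toy classes below are hypothetical examples, not the ones from the figures):<\/p>

```python
from itertools import combinations, product

def natarajan_dimension(H):
    # Brute-force d_N for a finite class H, given as a set of label tuples.
    n = len(next(iter(H)))
    best = 0
    for k in range(1, n + 1):
        for coords in combinations(range(n), k):
            # Restrict every hypothesis to the chosen coordinates.
            proj = {tuple(h[i] for i in coords) for h in H}
            labels = [sorted({p[j] for p in proj}) for j in range(k)]
            # Look for pairs a_i != b_i whose full product lies in the projection.
            for pairs in product(*(combinations(l, 2) for l in labels)):
                if all(w in proj for w in product(*pairs)):
                    best = max(best, k)
                    break
    return best

# The boolean square over two points is N-shattered on both coordinates...
square = {(0, 0), (0, 1), (1, 0), (1, 1)}
print(natarajan_dimension(square))  # 2

# ...but this 4-element class over three labels contains no 2-Natarajan cube.
path = {(0, 0), (0, 1), (2, 1), (2, 2)}
print(natarajan_dimension(path))  # 1
```

<p class=\"\">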
In fact, we already saw it before:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"514\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=678%2C514&#038;ssl=1\" alt=\"\" class=\"wp-image-588\" style=\"width:361px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=1024%2C776&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=300%2C227&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?resize=768%2C582&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-17-at-3.19.16%E2%80%AFPM.png?w=1336&amp;ssl=1 1336w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">Here, as we said earlier, the DS dimension is 2. But the Natarajan dimension is only 1. Intuitively, this can be seen from the fact that there is no sub-graph that is a square. <\/p>\n\n\n\n<p class=\"\">And what about a larger separation? How large can this gap be?<\/p>\n\n\n\n<p class=\"\">Before we continue, we introduce a connection between learning and topology. 
It turns out that there is an interesting equivalence between hypothesis classes <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D+%5Csubseteq+%5Cmathcal%7BY%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H} &#92;subseteq &#92;mathcal{Y}^n\" class=\"latex\" \/> and topological objects called &#8220;colored simplicial complexes&#8221;, which are combinatorial abstractions of triangulations of topological spaces. Let&#8217;s see our previous example, now as a colored simplicial complex:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"504\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?resize=678%2C504&#038;ssl=1\" alt=\"\" class=\"wp-image-597\" style=\"width:424px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?resize=1024%2C761&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?resize=300%2C223&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?resize=768%2C571&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?resize=1536%2C1141&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?w=1682&amp;ssl=1 1682w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">In 2 dimensions, a simplicial complex is simply a graph. Here, each node corresponds to a certain label, at a certain position. The circle node appears as the first symbol, and the square appears as the second symbol. 
There is an edge between two nodes if they correspond to a &#8220;word&#8221;, a hypothesis in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">Importantly, it can be shown that a 2-Natarajan cube also corresponds to an embedded square in a simplicial complex. So, we can see that the above simplicial complex does not have a square as a sub-graph, and therefore we get the 1 vs. 2 separation, as claimed. That is, the DS dimension is 2 but the Natarajan dimension is only 1.<\/p>\n\n\n\n<p class=\"\">Now let&#8217;s try to get a separation of 1 vs. 3.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"536\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?resize=678%2C536&#038;ssl=1\" alt=\"\" class=\"wp-image-600\" style=\"width:444px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?resize=1024%2C810&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?resize=300%2C237&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?resize=768%2C607&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?resize=1536%2C1214&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?w=1738&amp;ssl=1 1738w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/simp2.png?w=1356&amp;ssl=1 1356w\" sizes=\"auto, (max-width: 678px) 100vw, 678px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\">This is an example of a 3-dimensional pseudo-cube with Natarajan dimension 1. 
The &#8220;elements&#8221; in this larger pseudo-cube are the triangles (there are 54 of them). Each such triangle corresponds to a &#8220;word&#8221;, a hypothesis in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/>, over 3 coordinates. The labels are the vertices (there are <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctfrac%7B3+%5Ccdot+54%7D%7B6%7D+%3D+27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tfrac{3 &#92;cdot 54}{6} = 27\" class=\"latex\" \/> of them). The vertices are colored by 3 colors: circle-blue, triangle-green and square-pink. In each word, the circle-blue vertex appears as the first symbol, the triangle-green vertex as the second, and the square-pink vertex as the third. Opposite sides of the hexagon are identified, as the picture indicates. The pseudo-cube property holds because each triangle has three neighboring triangles that are obtained by switching one vertex from each color. The Natarajan dimension is 1 because, again, there is no square (a cycle of length four) in the graph whose vertices have alternating colors.<\/p>\n\n\n\n<p class=\"\">Notice that the complexity increased pretty quickly in this example &#8211; we only needed 6 elements to prove a gap of 1-to-2, and for 1-to-3 we already need 54. What about 1-to-4? This is much more difficult. A computer search found such a 1-to-4 gap, corresponding to a class of size 118098! Should we give up on a systematic way to find a larger gap?<\/p>\n\n\n\n<p class=\"\">Let&#8217;s go back to the connection we mentioned between learning and topology. Interestingly, this equivalence turned out to be a lot richer than it initially seemed to be. In particular, notions similar to our dimensions have also been studied in algebraic topology. 
A Natarajan dimension of 1 corresponds to a local combinatorial criterion for hyperbolicity. A DS dimension of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> corresponds to the notion of a <em>pseudo-manifold<\/em>.<\/p>\n\n\n\n<p class=\"\">Building on this unique and beautiful connection, and using a deep and involved result in algebraic topology by Januszkiewicz and \u015awi\u0105tkowski, 2003, it was proved in [<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>] that the gap can be unbounded! That is, there exists a class with infinite DS dimension, but a Natarajan dimension of 1.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Question of <em>How<\/em> to Learn<\/h2>\n\n\n\n<p class=\"\">The VC\/DS dimensions characterize what classes can be PAC learned. But how does one learn these classes? In the binary classification setting, a variety of tools including Empirical Risk Minimization (ERM), boosting, as well as sample compression schemes, each yield successful learning algorithms. As it turns out, each of these tools ends up having some deficiencies in terms of how they extend to the multiclass setting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Empirical Risk Minimization and Uniform Convergence<\/h3>\n\n\n\n<p class=\"\">Finite VC dimension in the binary classification setting, which is a necessary condition for PAC learning, also endows a hypothesis class with a very special property &#8212; that of <em>uniform convergence<\/em>. Uniform convergence guarantees that the empirical loss of <em>every<\/em> hypothesis in the class on a sufficiently large (but finite) sample is close to its population loss with respect to the entire distribution. 
Because of this, the ERM algorithm, which simply outputs <em>any<\/em> hypothesis in the class that minimizes the error on the training dataset, is a valid learning algorithm.<\/p>\n\n\n\n<p class=\"\">In the setting of multiclass learning, if the Natarajan dimension of the class <em>as well as<\/em> the number of labels is finite, then the class still satisfies uniform convergence. However, the situation drastically changes when we allow an unbounded number of labels. In this case, it is possible to construct learnable hypothesis classes, which do not satisfy uniform convergence, meaning that for these classes, two different hypotheses from the class that each minimize the empirical risk could differ by an arbitrarily large factor in their population risk [<a href=\"https:\/\/proceedings.mlr.press\/v19\/daniely11a\/daniely11a.pdf\">DSBDSS11<\/a>]! Thus, we can no longer always rely on the ERM principle for learning in the multiclass setting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Boosting<\/h3>\n\n\n\n<p class=\"\">Another common approach to obtain a PAC learner is via <em>boosting<\/em>. Namely, in the binary setting, having a <em>weak<\/em> learner that simply has error bounded away from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac12&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac12\" class=\"latex\" \/> (i.e., better than random guessing) is sufficient to obtain a learner that has error <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/> (namely, a <em>strong<\/em> learner), via algorithms like AdaBoost [<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0890540185711364\">Fre95<\/a>]. 
If the learning task is instead over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/> labels, we could imagine that having a classifier that does slightly better than random guessing, i.e., has error slightly better than <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1-%5Cfrac1k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1-&#92;frac1k\" class=\"latex\" \/>, would similarly allow us to obtain a strong learner via boosting. However, such a learner turns out to be too weak to plug into standard boosting algorithms [Chapter 10 in <a href=\"https:\/\/direct.mit.edu\/books\/oa-monograph\/5342\/BoostingFoundations-and-Algorithms\">SF12<\/a>, <a href=\"https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/17f5e6db87929fb55cebeb7fd58c1d41-Abstract.html\">BHMS21<\/a>, <a href=\"https:\/\/proceedings.mlr.press\/v195\/brukhim23a\/brukhim23a.pdf\">BHM23<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2307.00642\">BDMM23<\/a>]. Essentially, one still ends up requiring a weak learner with error below <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac12&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac12\" class=\"latex\" \/>, which is harder to obtain as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/> gets large. Therefore, boosting too does not seem to give us a PAC learner in the multiclass setting, at least in a straightforward way.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Sample Compression<\/h3>\n\n\n\n<p class=\"\">Finally, another standard way to obtain a learner is via <em>sample compression<\/em>. 
Roughly speaking, a hypothesis class admits a sample compression scheme if, for every sample labeled by a hypothesis from the class, it is possible to retain only a small subsample, using which the labels on the entire sample can be inferred. As a standard example, consider the binary hypothesis class of axis-aligned rectangles in two dimensions. Suppose we are given a training dataset labeled by some unknown rectangle, as in the figure below.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"602\" height=\"386\" src=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/rectangles.png?resize=602%2C386&#038;ssl=1\" alt=\"\" class=\"wp-image-604\" style=\"width:349px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/rectangles.png?w=602&amp;ssl=1 602w, https:\/\/i0.wp.com\/www.let-all.com\/blog\/wp-content\/uploads\/2024\/04\/rectangles.png?resize=300%2C192&amp;ssl=1 300w\" sizes=\"auto, (max-width: 602px) 100vw, 602px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"\"><\/p>\n\n\n\n<p class=\"\">Observe how we can get rid of all samples except the left-most, right-most, top-most and the bottom-most circle, and simply draw a rectangle with these four points determining the boundary &#8212; this correctly classifies all the points, including the ones we threw away.<\/p>\n\n\n\n<p class=\"\">More concretely, a sample compression scheme <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Ckappa%2C+%5Crho%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;kappa, &#92;rho)\" class=\"latex\" \/> for a hypothesis class <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/> consists of:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li 
class=\"\">A <strong>compressor<\/strong> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ckappa%3A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;kappa:\" class=\"latex\" \/> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cunderbrace%7B%28%5Cmathcal%7BX%7D+%5Ctimes+%5Cmathcal%7BY%7D%29%5E%2A%7D_%7B%5Ctext%7Btraining+sample%7D%7D+%5Cto+%5Cunderbrace%7B%28%5Cmathcal%7BX%7D+%5Ctimes+%5Cmathcal%7BY%7D%29%5E%2A.%7D_%7B%5Ctext%7Bcompressed+subsample%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;underbrace{(&#92;mathcal{X} &#92;times &#92;mathcal{Y})^*}_{&#92;text{training sample}} &#92;to &#92;underbrace{(&#92;mathcal{X} &#92;times &#92;mathcal{Y})^*.}_{&#92;text{compressed subsample}}\" class=\"latex\" \/><\/li>\n\n\n\n<li class=\"\">A <strong>reconstructor<\/strong> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Crho%3A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;rho:\" class=\"latex\" \/> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cunderbrace%7B%28%5Cmathcal%7BX%7D+%5Ctimes+%5Cmathcal%7BY%7D%29%5E%2A%7D_%7B%5Ctext%7Bcompressed+subsample%7D%7D+%5Cto+%5Cunderbrace%7B%5Cmathcal%7BY%7D%5E%5Cmathcal%7BX%7D.%7D_%7B%5Ctext%7Bfunction%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;underbrace{(&#92;mathcal{X} &#92;times &#92;mathcal{Y})^*}_{&#92;text{compressed subsample}} &#92;to &#92;underbrace{&#92;mathcal{Y}^&#92;mathcal{X}.}_{&#92;text{function}}\" class=\"latex\" \/><\/li>\n<\/ol>\n\n\n\n<p class=\"\">For any sequence <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S+%5Cin+%28%5Cmathcal%7BX%7D+%5Ctimes+%5Cmathcal%7BY%7D%29%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S &#92;in (&#92;mathcal{X} &#92;times &#92;mathcal{Y})^*\" class=\"latex\" \/> realizable by the class <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BH%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{H}\" class=\"latex\" \/>, these functions should satisfy:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">If <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ckappa%28S%29%3DS%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;kappa(S)=S&#039;\" class=\"latex\" \/>, then all elements in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S%27&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S&#039;\" class=\"latex\" \/> should also belong to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/>.<\/li>\n\n\n\n<li class=\"\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Crho%28S%27%29%28x%29%3Dy&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;rho(S&#039;)(x)=y\" class=\"latex\" \/> for all <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28x%2C+y%29+%5Cin+S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(x, y) &#92;in S\" class=\"latex\" \/>.<\/li>\n<\/ol>\n\n\n\n<p class=\"\">The first condition requires that the output of the compressor only has elements from the original sample itself, and not outside of it. The second condition requires that the output of the reconstructor correctly labels all of the training sample, including the points that were not retained in the subsample. Finally, the size of the compression scheme is defined as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s%28m%29%3D%5Cmax_%7B%7CS%7C%3Dm%2C+S%27%3D%5Ckappa%28S%29%7D%28%7CS%27%7C%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s(m)=&#92;max_{|S|=m, S&#039;=&#92;kappa(S)}(|S&#039;|)\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">It turns out that sample compression schemes are intrinsically tied to learnability. 
Namely, it can be shown [<a href=\"https:\/\/citeseerx.ist.psu.edu\/document?repid=rep1&amp;type=pdf&amp;doi=e5ec4059989296a8d1e2d8cad880352617462d3b\">LW86<\/a>] that the output of the reconstructor in a compression scheme of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s%28m%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s(m)\" class=\"latex\" \/> is a PAC learner having sample complexity <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BO%7D%28s%28m%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{O}(s(m))\" class=\"latex\" \/> &#8212; this holds true across both binary and multiclass learning. At a high level, this is so because, if the output of the reconstructor did in fact have high population-level risk, then this would be at odds with its property that it is correctly inferring all of the labels on the training dataset from only a small subsample &#8212; a &#8220;generalization-like&#8221; property.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Binary PAC Learnability <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CRightarrow+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Rightarrow \" class=\"latex\" \/> Compression<\/h3>\n\n\n\n<p class=\"\">Perhaps curiously, notice that the size of the compression in the example of two-dimensional rectangles, i.e., 4, also happens to be the VC dimension of the class. 
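To make the rectangle example concrete, here is a small sketch (our own illustration, not code from any of the cited papers) of a compressor/reconstructor pair for axis-aligned rectangles in the plane. The compressor keeps at most 4 extreme positive points, and the reconstructor outputs the indicator of their minimal enclosing rectangle:

```python
# Toy sample compression scheme for axis-aligned rectangles (illustrative
# sketch). A sample is a list of ((x, y), label) pairs with labels in {0, 1}.

def compress(sample):
    """Keep the leftmost, rightmost, bottommost, and topmost positive points."""
    positives = [(x, y) for (x, y), label in sample if label == 1]
    if not positives:
        return []
    kept = {
        min(positives, key=lambda p: p[0]),  # leftmost
        max(positives, key=lambda p: p[0]),  # rightmost
        min(positives, key=lambda p: p[1]),  # bottommost
        max(positives, key=lambda p: p[1]),  # topmost
    }
    return [((x, y), 1) for (x, y) in kept]  # a subsample of the input

def reconstruct(compressed):
    """Return the indicator of the minimal rectangle enclosing the kept points."""
    if not compressed:
        return lambda point: 0
    xs = [x for (x, _), _ in compressed]
    ys = [y for (_, y), _ in compressed]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: 1 if lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y else 0

# A sample realizable by the rectangle [0, 2] x [0, 2]:
sample = [((0, 1), 1), ((1, 0), 1), ((2, 1), 1), ((1, 2), 1),
          ((1, 1), 1), ((3, 3), 0), ((-1, 0), 0)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)  # both conditions hold on this sample
```

On any realizable sample, the reconstructed rectangle contains every positive point (it is the smallest rectangle containing the kept extremes) and is contained in the true rectangle, so it excludes every negative point; hence both conditions above hold with compression size at most 4.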
It is a longstanding conjecture [<a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-540-45167-9_60\">War03<\/a>] that every binary hypothesis class having VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> admits a sample compression scheme of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>. A weaker version of this question amounts to asking: for any hypothesis class, can we always obtain a sample compression scheme having size that is a constant (depending only on the VC dimension of the class) independent of the training dataset size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/>? Note that to learn a hypothesis class up to an error <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>, the size of the training dataset that one needs to draw scales inversely with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>; thus, a constant-sized compression scheme would allow us to compress such datasets to a size independent of the error.<\/p>\n\n\n\n<p class=\"\">For binary hypothesis classes, the answer turns out to be yes! 
For every binary class of VC dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/>, [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] construct a sample compression scheme of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BO%7D%28d+%5Ccdot+d%5E%2A%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{O}(d &#92;cdot d^*)\" class=\"latex\" \/>, where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d^*\" class=\"latex\" \/> is the dual VC dimension of the class. The dual VC dimension can be bounded above by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2%5E%7BO%28d%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2^{O(d)}\" class=\"latex\" \/>, and thus the size of the compression is also <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2%5E%7BO%28d%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"2^{O(d)}\" class=\"latex\" \/>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Multiclass PAC Learnability <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CnRightarrow+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nRightarrow \" class=\"latex\" \/> Compression<\/h3>\n\n\n\n<p class=\"\">Recall how we saw that ERM and boosting don&#8217;t extend in a straightforward way to give a learner in the multiclass setting. 
The way that [<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>] obtain a learner for multiclass classes of DS dimension <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> is indeed via the construction of a sample compression scheme of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7BO%7D%28d%5E%7B1.5%7D+%5Ccdot+%5Ctext%7Bpolylog%7D%28m%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{O}(d^{1.5} &#92;cdot &#92;text{polylog}(m))\" class=\"latex\" \/>. Could we perhaps get rid of the dependence on the training set size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> from this compression scheme, and hope to obtain a result similar to that of [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] in the multiclass setting &#8212; namely a sample compression scheme of size only a finite function of the DS dimension?<\/p>\n\n\n\n<p class=\"\">The compression scheme of [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] crucially relies on the existence of <em>proper<\/em> PAC learners for binary classes &#8212; these are learners that output a hypothesis from within the class being learned. For example, ERM is an instance of a proper PAC learner. We saw above that ERM is not always a PAC learner for multiclass classes. In fact, an even stronger result holds true in the multiclass setting: there exist learnable classes that provably cannot be learned by <em>any<\/em> proper PAC learner, let alone ERM [<a href=\"https:\/\/arxiv.org\/abs\/1405.2420\">DSS14<\/a>]. Another important component in the compression scheme of [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] is an upper bound on the dual VC dimension in terms of the VC dimension. 
As it turns out, one can easily exploit the combinatorial definition of the DS dimension to construct simple classes that have DS dimension 1 but unbounded dual DS dimension. Thus, even if we were able to obtain a bound in terms of the dual DS dimension, such a bound would be vacuous.<\/p>\n\n\n\n<p class=\"\">But never mind these obstacles &#8212; maybe an entirely different compression scheme from that of [<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] still works for the multiclass case? As it turns out, the answer is no! [<a href=\"https:\/\/proceedings.mlr.press\/v237\/pabbaraju24a\/pabbaraju24a.pdf\">Pab24<\/a>] shows that it is simply not possible to have a sample compression scheme that is only a finite function of the DS dimension, and is independent of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">Concretely, they construct a multiclass hypothesis class that has DS dimension 1, such that any sample compression scheme that compresses samples of size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> realizable by this class to size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s%28m%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s(m)\" class=\"latex\" \/> <em>must<\/em> satisfy <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s%28m%29%3D%5COmega%28%28%5Clog+m%29%5E%7B1-o%281%29%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s(m)=&#92;Omega((&#92;log m)^{1-o(1)})\" class=\"latex\" \/>. Namely, the size of the compression must necessarily <em>increase<\/em> with the sample size. 
This establishes yet another separation between the paradigms of binary and multiclass learning: while constant-sized sample compression schemes exist for learnable binary classes, they provably do not exist for learnable multiclass classes!<\/p>\n\n\n\n<p class=\"\">The construction of the multiclass class that witnesses the above sample compression lower bound is inspired by a result that shows the same lower bound for <em>partial<\/em> hypothesis classes [<a href=\"https:\/\/www.computer.org\/csdl\/proceedings-article\/focs\/2022\/205500a658\/1BtfBwueRgs\">AHHM22<\/a>]. Partial hypothesis classes consist of hypotheses that map the domain <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{X}\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B0%2C1%2C%5Cstar%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{0,1,&#92;star&#92;}\" class=\"latex\" \/> &#8212; that is, the hypothesis is allowed to say &#8220;I don&#8217;t know&#8221; on certain points in the domain, corresponding to the label <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;star\" class=\"latex\" \/>. The authors of [<a href=\"https:\/\/www.computer.org\/csdl\/proceedings-article\/focs\/2022\/205500a658\/1BtfBwueRgs\">AHHM22<\/a>] interpret a deep graph-theoretic result related to a problem known as the Alon-Saks-Seymour problem in the language of partial hypothesis classes, and construct a partial class that has VC dimension 1 but does not have a finite-sized sample compression scheme. This partial class is the building block of the multiclass hypothesis class that realizes the sample compression lower bound. 
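The partial-to-total completion used here can be pictured with a toy sketch (all names hypothetical; None stands for the star label): each partial hypothesis fills its unknown values with a fresh label, unique to that hypothesis, that no other hypothesis ever uses.

```python
# Toy sketch of disambiguating a partial class into a total multiclass class:
# each hypothesis is a dict mapping domain points to a label, with None
# marking the "star" (unknown) values.

def disambiguate(partial_class):
    """Fill each hypothesis's unknown values with a label unique to it."""
    total_class = []
    for name, partial_h in enumerate(partial_class):
        fresh = ("fresh", name)  # a label used by this hypothesis alone
        total_class.append({x: (y if y is not None else fresh)
                            for x, y in partial_h.items()})
    return total_class

# Two partial hypotheses on the domain {0, 1, 2}; None marks "I don't know".
partial = [{0: 0, 1: None, 2: 1}, {0: None, 1: 1, 2: None}]
total = disambiguate(partial)
# Every star is now a concrete label that no other hypothesis shares.
```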
The main idea is to &#8220;fill in&#8221; (or <em>disambiguate<\/em>) the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;star\" class=\"latex\" \/>&#8216;s in each hypothesis in the partial class with labels in a way that the DS dimension of the resulting class is not too large. Such a class inherits the sample compression lower bound of the partial class. Thus, the task boils down to constructing a disambiguating class having low DS dimension. With the power of potentially infinite labels at hand, this can be easily achieved using a &#8220;disambiguate using your name&#8221; strategy &#8212; every partial hypothesis in the class fills in the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cstar&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;star\" class=\"latex\" \/>&#8216;s with a unique label which is never used by any other hypothesis again. It can be readily checked that this operation does not increase the DS dimension of the class, yielding the desired lower bound.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"\">In summary, we have seen how there are stark differences in the landscape of learnability as we transition from the binary to the multiclass setting. We saw how ERM, boosting and sample compression &#8212; popular tools to obtain learning algorithms in the binary classification setting &#8212; behave in a markedly different way in the multiclass setting. 
We also saw that learnability in the multiclass setting is characterized by the <em>DS dimension<\/em>, a quantity that reduces to the VC dimension in the binary setting, but is combinatorially much richer in the multiclass setting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Footnotes<\/h3>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"d9aa3d0e-7639-41e4-99cc-7d16f00a0f23\">The projection <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%7C_S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h|_S\" class=\"latex\" \/> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>-size set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"S\" class=\"latex\" \/> is thought of as the map from <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Bn%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"[n]\" class=\"latex\" \/> to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{Y}\" class=\"latex\" \/> defined by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i+%5Cmapsto+h%28x_i%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i &#92;mapsto h(x_i)\" class=\"latex\" \/>. 
<a href=\"#d9aa3d0e-7639-41e4-99cc-7d16f00a0f23-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><\/ol>\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">[<a href=\"https:\/\/www.computer.org\/csdl\/proceedings-article\/focs\/2022\/205500a658\/1BtfBwueRgs\">AHHM22<\/a>] Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability<br>of partial concept classes, 2022.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/arxiv.org\/abs\/2203.01550\">BCDMY22<\/a>] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A<br>characterization of multiclass learnability, 2022.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/130385.130423\">BDCHL95<\/a>] Shai Ben-David, Nicolo Cesa-Bianchi, David Haussler, and Philip M. Long. Characterizations of learnability for classes of {0,&#8230;,n}-valued functions, 1995.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/arxiv.org\/abs\/2307.00642\">BDMM23<\/a>] Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting:<br>Simple and intuitive weak learning criteria, 2023.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/76359.76371\">BEHW89<\/a>] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.<br>Learnability and the Vapnik-Chervonenkis dimension, 1989.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/proceedings.mlr.press\/v195\/brukhim23a\/brukhim23a.pdf\">BHM23<\/a>] Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting, 2023.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/17f5e6db87929fb55cebeb7fd58c1d41-Abstract.html\">BHMS21<\/a>] Nataly Brukhim, Elad Hazan, Shay Moran, and Robert E. Schapire. 
Multiclass boosting<br>and the cost of weak learning, 2021.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3564246.3585190\">CP23<\/a>] Moses Charikar and Chirag Pabbaraju. A characterization of list learnability, 2023.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/proceedings.mlr.press\/v19\/daniely11a\/daniely11a.pdf\">DSBDSS11<\/a>] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass<br>learnability and the ERM principle, 2011.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/arxiv.org\/abs\/1405.2420\">DSS14<\/a>] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems, 2014.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0890540185711364\">Fre95<\/a>] Yoav Freund. Boosting a weak learning algorithm by majority, 1995.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/arxiv.org\/abs\/1507.00473\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/abs\/1507.00473\">Han15<\/a>] Steve Hanneke. The optimal sample complexity of PAC learning, 2015.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/citeseerx.ist.psu.edu\/document?repid=rep1&amp;type=pdf&amp;doi=e5ec4059989296a8d1e2d8cad880352617462d3b\">LW86<\/a>] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability, 1986.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/proceedings.mlr.press\/v195\/moran23a\/moran23a.pdf\">MSTY23<\/a>] Shay Moran, Ohad Sharon, Iska Tsubari, and Sivan Yosebashvili. List online classification, 2023.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2890490\">MY16<\/a>] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes, 2016.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/link.springer.com\/article\/10.1007\/BF00114804\">Nat89<\/a>] Balas K. Natarajan. 
On learning sets and functions, 1989.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/B9780934613644500463\">NT88<\/a>] Balas K. Natarajan and Prasad Tadepalli. Two new frameworks for learning, 1988.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/proceedings.mlr.press\/v237\/pabbaraju24a\/pabbaraju24a.pdf\">Pab24<\/a>] Chirag Pabbaraju. Multiclass learnability does not imply sample compression, 2024.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/direct.mit.edu\/books\/oa-monograph\/5342\/BoostingFoundations-and-Algorithms\">SF12<\/a>] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms, 2012.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/link.springer.com\/book\/10.1007\/978-1-4757-2440-0\" data-type=\"link\" data-id=\"https:\/\/link.springer.com\/book\/10.1007\/978-1-4757-2440-0\">Vap95<\/a>] Vladimir Vapnik. The nature of statistical learning theory, 1995.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-319-21852-6_3\">VC68<\/a>] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative<br>frequencies of events to their probabilities, 1968.<\/li>\n\n\n\n<li class=\"\">[<a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-540-45167-9_60\">War03<\/a>] Manfred K Warmuth. Compressing to VC dimension many points, 2003.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to the second installment of the Learning Theory Alliance blog post series! In case you missed the first installment on calibration, check it out here. This time, Nataly Brukhim and Chirag Pabbaraju tell us about some exciting breakthrough results on multiclass PAC learning. 
The gold standard of learning theory is the classical PAC model [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"[{\"id\":\"d9aa3d0e-7639-41e4-99cc-7d16f00a0f23\",\"content\":\"The projection $latex h|_S$ of $latex h$ to $latex n$-size set $latex S$ is thought of as the map from $latex [n]$ to $latex \\\\mathcal{Y}$ defined by $latex i \\\\mapsto h(x_i)$.\"}]"},"categories":[4],"tags":[],"class_list":["post-553","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=553"}],"version-history":[{"count":5,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/553\/revisions"}],"predecessor-version":[{"id":658,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/553\/revisions\/658"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=553"},{"taxonomy":"post_tag","embeddable":true,"h
ref":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}