{"id":844,"date":"2024-12-20T14:48:08","date_gmt":"2024-12-20T14:48:08","guid":{"rendered":"https:\/\/www.let-all.com\/blog\/?p=844"},"modified":"2024-12-20T14:48:08","modified_gmt":"2024-12-20T14:48:08","slug":"structure-agnostic-causal-estimation","status":"publish","type":"post","link":"https:\/\/www.let-all.com\/blog\/2024\/12\/20\/structure-agnostic-causal-estimation\/","title":{"rendered":"Structure-Agnostic Causal Estimation"},"content":{"rendered":"\n<p class=\"\">We have another new technical blog post, courtesy <a href=\"https:\/\/jkjin.com\/\">Jikai Jin<\/a> and <a href=\"https:\/\/vsyrgkanis.com\/index.html\">Vasilis Syrgkanis<\/a>, about optimality of double machine learning for causal inference.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">An introduction to causal inference<\/h2>\n\n\n\n<p class=\"\">Causal inference deals with the fundamental question of \u201cwhat if\u201d, trying to estimate\/predict the counterfactual outcome that one does not directly observe. For instance, one may want to understand the effect of a new medicine on a population of patients. For each patient, we never simultaneously observe the outcome under the new medicine (treatment) and the outcome under the baseline treatment (control). This makes causal inference a challenging task, and the ground-truth causal parameter of interest is identifiable only under additional assumptions on the data generating process.<\/p>\n\n\n\n<p class=\"\">The most central quantity of interest in the causal inference literature is the Average Treatment Effect (ATE). To mathematically define the ATE, we will use the language of potential outcomes. 
We posit that nature generates two potential outcomes <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y_i%280%29%2C+Y_i%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y_i(0), Y_i(1)\" class=\"latex\" \/>, where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y_i%28d%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y_i(d)\" class=\"latex\" \/> can be thought of as the outcome we would have observed from unit <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i\" class=\"latex\" \/>, had we treated them with treatment <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d%5Cin+%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d&#92;in &#92;{0,1&#92;}\" class=\"latex\" \/>. Then the ATE is defined as the average difference of these two potential outcomes in the population:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta+%3D+%5Cmathbb%7BE%7D_%7BP_0%7D%5BY%281%29-Y%280%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta = &#92;mathbb{E}_{P_0}[Y(1)-Y(0)]\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">Unless otherwise specified, we will always use a subscript <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0\" class=\"latex\" \/> to denote the ground-truth quantity. The main problem is that for each unit we do not observe both potential outcomes. 
Rather, we observe the potential outcome for the assigned treatment, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y_i+%3D+Y_i%28D_i%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y_i = Y_i(D_i)\" class=\"latex\" \/>.&nbsp;<\/p>\n\n\n\n<p class=\"\">The first key question in causal inference is the <em>identification question<\/em>: can we write the ATE, which depends on the distribution of unobserved quantities, as a function of the distribution of observed random variables? Many techniques have been developed in causal inference that solve the identification question under various assumptions on the data generating process and the kinds of variables that are observed. For the interested reader, one can search for terms such as identification by conditioning, instrumental variables, proximal causal inference, difference-in-differences, regression discontinuity and synthetic controls, and refer to related textbooks [AP09,CHK+24].<\/p>\n\n\n\n<p class=\"\">For the purpose of this blog we will focus on identification by conditioning, which has been well-studied in the literature and very frequently used in the practice of causal inference. 
This identification approach makes the assumption that, once we condition on a large enough set of observed characteristics <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> (typically referred to as \u201ccontrol variables\u201d or \u201cconfounders\u201d), the treatment is assigned as if it were a randomized trial; a condition typically referred to as the conditional ignorability assumption:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7BY%280%29%2CY%281%29%5C%7D+%5Cperp+D+%5Cmid+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{Y(0),Y(1)&#92;} &#92;perp D &#92;mid X\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">Under this assumption, the ATE is identifiable via the well-known g-formula:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta+%3D+%5Cmathbb%7BE%7D%5Bg%281%2CX%29-g%280%2CX%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta = &#92;mathbb{E}[g(1,X)-g(0,X)]\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"\">where the function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28d%2Cx%29+%3D+%5Cmathbb%7BE%7D%5BY%5Cmid+X%3Dx%2C+D%3Dd%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(d,x) = &#92;mathbb{E}[Y&#92;mid X=x, D=d]\" class=\"latex\" \/> is a regression function and is thus uniquely determined by the distribution of observed data. Intuitively, this formula says: train a predictive model that predicts the outcome from the treatment and the control variables, and then take the average difference of the predictions of this model as you flip the treatment variable on or off. 
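As a minimal sketch of this recipe (scikit-learn's gradient boosting is used here only as a stand-in for any predictive model; the data-generating process and variable names below are illustrative, not from the post):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))                 # control variables / confounders
propensity = 1 / (1 + np.exp(-X[:, 0]))     # treatment probability depends on X
D = rng.binomial(1, propensity)             # observed treatment assignment
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)  # observed outcome; true ATE is 2

# g-formula: regress Y on (D, X), then average the difference of the
# model's predictions with the treatment flipped on vs. off
g = GradientBoostingRegressor(random_state=0).fit(np.column_stack([D, X]), Y)
ate_plugin = np.mean(
    g.predict(np.column_stack([np.ones(n), X]))
    - g.predict(np.column_stack([np.zeros(n), X]))
)
print(ate_plugin)
```

Note that this plug-in estimate inherits whatever bias the fitted regression carries, which is the issue the orthogonal estimators discussed later are designed to mitigate.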
This quantity is also strongly related to the <a href=\"https:\/\/christophm.github.io\/interpretable-ml-book\/pdp.html\">partial dependence plot<\/a>, used frequently in interpretable machine learning. It corresponds to the difference between the values of the partial dependence plot of the outcome on the treatment when the treatment takes value one versus value zero.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The causal machine learning paradigm<\/h2>\n\n\n\n<p class=\"\">The second key question in causal inference is the <em>estimation question<\/em>: given <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> samples of the observed variables, how should we estimate the ATE? In other words, we need to translate the identification strategy into an estimation strategy. For instance, in the context of identification by conditioning, note that even though our goal is to estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta\" class=\"latex\" \/>, to achieve that we also need to estimate the complicated non-parametric regression function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g\" class=\"latex\" \/>. Such auxiliary functions, whose estimation is required in order to estimate the target parameter of interest, are referred to as nuisance functions. 
The requirement to estimate complicated nuisance functions in a flexible manner arises in most identification strategies in causal inference, and this is exactly where machine learning techniques can be of great help, giving rise to the Causal Machine Learning paradigm.<\/p>\n\n\n\n<p class=\"\">At a high level, causal machine learning is an emerging research area that incorporates machine learning (ML) techniques into statistical problems that emerge in causal inference. In the past decade, ML has gained tremendous success on numerous tasks, such as image classification, language processing, and video games. These problems typically possess intrinsic structure that one can exploit. In image classification problems, for example, semantically meaningful objects can typically be found locally as a combination of pixels, and this suggests that using convolutional neural networks, rather than standard feed-forward neural networks, might lead to better results. The idea of causal machine learning is to leverage the ability of ML techniques to adapt to intrinsic notions of dimension when learning the complex nuisance quantities that arise in causal identification strategies.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Double\/debiased machine learning: an overview<\/h2>\n\n\n\n<p class=\"\">What makes causal ML different from ML? To answer this question, it is instructive to revisit an extremely popular algorithm in causal ML: double\/debiased machine learning (DML) [CCD+17] (variants of the ideas we will present below have also appeared in the targeted learning literature [LR11], but for simplicity of exposition we adopt the DML paradigm in this blogpost).&nbsp;<\/p>\n\n\n\n<p class=\"\">Suppose that we are given i.i.d. 
data <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%28X_i%2CD_i%2CY_i%29%5C%7D_%7Bi%3D1%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{(X_i,D_i,Y_i)&#92;}_{i=1}^n\" class=\"latex\" \/> where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_i\" class=\"latex\" \/> is a high-dimensional covariate vector, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D_i\" class=\"latex\" \/> is a binary treatment variable and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y_i\" class=\"latex\" \/> is an outcome of interest. Without loss of generality, we can describe the data generating process of these variables via the following nonparametric regression equations:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y+%3D+g%28D%2CX%29+%2B+%5Cepsilon%2C+%5Cquad+%5Cmathbb%7BE%7D%5B%5Cepsilon%5Cmid+D%2CX%5D%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y = g(D,X) + &#92;epsilon, &#92;quad &#92;mathbb{E}[&#92;epsilon&#92;mid D,X]=0\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D%3Dp%28X%29%2B%5Ceta%2C+%5Cquad+%5Cmathbb%7BE%7D%5B%5Ceta+%5Cmid+X%5D%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D=p(X)+&#92;eta, &#92;quad &#92;mathbb{E}[&#92;eta &#92;mid X]=0\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28d%2Cx%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(d,x)\" class=\"latex\" \/> is known as the outcome regression and <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=p%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p(x)\" class=\"latex\" \/> is known as the propensity score. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_0\" class=\"latex\" \/> be the distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28X%2CD%2CY%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(X,D,Y)\" class=\"latex\" \/>. Then the ATE problem asks us to estimate the quantity <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_0+%3D+E%5Bg%281%2CX%29+-+g%280%2CX%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta_0 = E[g(1,X) - g(0,X)]\" class=\"latex\" \/>.&nbsp;<\/p>\n\n\n\n<p class=\"\">The ATE is just one example of a broad class of causal parameter estimation problems, for which the ground-truth parameter <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta_0\" class=\"latex\" \/> satisfies some moment equation<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%5Bm%28Z%2C%5Ctheta_0%2Ch_0%28X%29%29%5D%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}[m(Z,&#92;theta_0,h_0(X))]=0\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m(&#92;cdot)\" class=\"latex\" \/> is some moment function, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Z&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Z\" class=\"latex\" \/> is the observed data, <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> is a subvector of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Z&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Z\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h_0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h_0\" class=\"latex\" \/> is the ground-truth nuisance function. In the case of ATE, we can for example choose <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Z%3DX&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Z=X\" class=\"latex\" \/>, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%3D%28g%2Cp%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h=(g,p)\" class=\"latex\" \/>&nbsp; and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m%5Cleft%28Z%2C+%5Ctheta%2C+h%28X%29%5Cright%29%3Dg%281%2C+X%29-g%280%2C+X%29-%5Ctheta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m&#92;left(Z, &#92;theta, h(X)&#92;right)=g(1, X)-g(0, X)-&#92;theta\" class=\"latex\" \/>. 
Given this expression, a naive approach for estimating <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta_0\" class=\"latex\" \/> can be derived by first using ML to fit an estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bh%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{h}\" class=\"latex\" \/> of the ground-truth nuisance functions <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h_0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h_0\" class=\"latex\" \/>, and then solve the empirical moment equation<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En+m%28Z_i%2C%5Ctheta%2C%5Chat%7Bh%7D%28X_i%29%29+%3D+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{1}{n}&#92;sum_{i=1}^n m(Z_i,&#92;theta,&#92;hat{h}(X_i)) = 0\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">However, the resulting estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}\" class=\"latex\" \/> would be biased if the nuisance estimates are biased. The latter happens quite often in practice, since ML typically requires using regularization to prevent the model from overfitting. 
As a result, it would be desirable if the quality of our estimate were more robust to nuisance estimation errors.<\/p>\n\n\n\n<p class=\"\">The key observation is that this would be the case if a <em>Neyman orthogonality<\/em> condition holds, namely&nbsp;<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%5Cleft%5B%5Cpartial_h+m%5Cleft%28Z%2C+%5Ctheta_0%2C+h_0%28X%29%5Cright%29%5Cright%5D%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}&#92;left[&#92;partial_h m&#92;left(Z, &#92;theta_0, h_0(X)&#92;right)&#92;right]=0\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cpartial_h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;partial_h\" class=\"latex\" \/> denotes the functional derivative with respect to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">Intuitively, this condition implies that the induced error is less sensitive to misspecification of the nuisance functions. 
Then a simple Taylor expansion implies that the estimation error of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D_%7BDML%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}_{DML}\" class=\"latex\" \/>, the solution of the empirical moment equation, has only second-order dependence on the nuisance errors.<\/p>\n\n\n\n<p class=\"\">In the case of the ATE, the moment function<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m%28Z%2C+%5Ctheta%2C+h%28X%29%29%3D%28g%281%2C+X%29-g%280%2C+X%29%29%2B%5Cfrac%7BD%28Y-g%281%2C+X%29%29%7D%7Bp%28X%29%7D-%5Cfrac%7B%281-D%29%28Y-g%280%2C+X%29%29%7D%7B1-p%28X%29%7D-%5Ctheta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m(Z, &#92;theta, h(X))=(g(1, X)-g(0, X))+&#92;frac{D(Y-g(1, X))}{p(X)}-&#92;frac{(1-D)(Y-g(0, X))}{1-p(X)}-&#92;theta\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">satisfies this requirement, where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Z%3D%28X%2CD%2CY%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Z=(X,D,Y)\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%3D%28g%2Cp%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h=(g,p)\" class=\"latex\" \/>. 
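As a concrete sketch, this doubly robust (AIPW) moment can be evaluated pointwise, and setting its empirical average minus theta to zero gives a closed-form estimate (the function name and the numbers below are illustrative):

```python
import numpy as np

def aipw_score(Y, D, g1, g0, p):
    """Doubly robust (AIPW) score; its population mean equals the ATE
    when the nuisance functions are correct.

    g1, g0: estimates of the outcome regressions g(1, X) and g(0, X)
    p: estimate of the propensity score p(X)
    """
    return (g1 - g0) + D * (Y - g1) / p - (1 - D) * (Y - g0) / (1 - p)

# Setting (1/n) * sum_i [score_i - theta] = 0 yields theta_hat = mean score:
Y = np.array([1.0, 0.0, 1.0, 1.0])
D = np.array([1, 0, 1, 0])
g1 = np.array([0.9, 0.8, 0.9, 0.7])
g0 = np.array([0.2, 0.1, 0.3, 0.6])
p = np.array([0.5, 0.4, 0.6, 0.5])
theta_hat = np.mean(aipw_score(Y, D, g1, g0, p))
print(theta_hat)
```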
Given a dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7B%28X_i%2CD_i%2CY_i%29%5C%7D_%7Bi%3D1%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;{(X_i,D_i,Y_i)&#92;}_{i=1}^n\" class=\"latex\" \/>, we can split it into two datasets <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D_1%3D%5C%7B%28X_i%2CD_i%2CY_i%29%5C%7D_%7Bi%3D1%7D%5E%7Bn%2F2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}_1=&#92;{(X_i,D_i,Y_i)&#92;}_{i=1}^{n\/2}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D_2%3D%5C%7B%28X_i%2CD_i%2CY_i%29%5C%7D_%7Bi%3Dn%2F2%2B1%7D%5E%7Bn%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}_2=&#92;{(X_i,D_i,Y_i)&#92;}_{i=n\/2+1}^{n}\" class=\"latex\" \/> each with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%2F2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\/2\" class=\"latex\" \/> samples. Then DML consists of the following two stages:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">Use our favorite ML method (e.g. Lasso, random forest, neural network etc.) 
to estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(&#92;cdot)\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p(&#92;cdot)\" class=\"latex\" \/> on the first dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}_1\" class=\"latex\" \/>.<\/li>\n\n\n\n<li class=\"\">Solve for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta\" class=\"latex\" \/> from the empirical moment equation over the second dataset <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BD%7D_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{D}_2\" class=\"latex\" \/><\/li>\n<\/ol>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bi%3Dn%2F2%2B1%7D%5En+m%28Z_i%2C%5Ctheta%2C%5Chat%7Bh%7D%28X_i%29%29%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{i=n\/2+1}^n m(Z_i,&#92;theta,&#92;hat{h}(X_i))=0\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bh%7D%3D%28%5Chat%7Bg%7D%2C%5Chat%7Bp%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{h}=(&#92;hat{g},&#92;hat{p})\" class=\"latex\" \/> is our first-stage nuisance estimate.<\/p>\n\n\n\n<p class=\"\">Note that the main reason why DML would improve over the naive approach is that the moment function is chosen to satisfy the Neyman orthogonality property. 
By contrast, one can easily verify that the moment function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m%5Cleft%28Z%2C+%5Ctheta%2C+h%28X%29%5Cright%29%3Dg%281%2C+X%29-g%280%2C+X%29-%5Ctheta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m&#92;left(Z, &#92;theta, h(X)&#92;right)=g(1, X)-g(0, X)-&#92;theta\" class=\"latex\" \/> is <em>not<\/em> Neyman orthogonal.<\/p>\n\n\n\n<p class=\"\">It is well-known that for the case of the ATE, the DML approach also possesses the <em>double robustness property<\/em>; a property that dates back to the seminal work of [RRZ94]. In fact, the resulting estimator is the well-known doubly robust estimator [RRZ94] with the extra element of sample splitting when estimating the nuisance functions. Specifically, if our first-stage nuisance estimates have mean-square errors <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_p\" class=\"latex\" \/> respectively, then under mild regularity assumptions, the DML estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D_%7BDML%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}_{DML}\" class=\"latex\" \/> satisfies<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cleft%7C%5Chat%7B%5Ctheta%7D_%7BDML%7D-%5Ctheta_0%5Cright%7C+%5Cleq+C%5Cleft%28%5Cepsilon_g+%5Cepsilon_p%2Bn%5E%7B-1+%2F+2%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;left|&#92;hat{&#92;theta}_{DML}-&#92;theta_0&#92;right| &#92;leq C&#92;left(&#92;epsilon_g &#92;epsilon_p+n^{-1 \/ 2}&#92;right)\" class=\"latex\" \/><\/p>\n\n\n\n<p 
class=\"\">with high probability. Intuitively, because the estimation error of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D_%7BDML%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}_{DML}\" class=\"latex\" \/> stems from the misspecification of nuisance functions in the moment equation, by Taylor\u2019s formula, it would contain the term <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g%5E%7B%5Calpha%7D%5Cepsilon_p%5E%7B%5Cbeta%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g^{&#92;alpha}&#92;epsilon_p^{&#92;beta}\" class=\"latex\" \/> if and only if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%5Cleft%5B%5Cpartial_g%5E%7B%5Calpha%7D%5Cpartial_p%5E%7B%5Cbeta%7D+m%5Cleft%28Z%2C+%5Ctheta_0%2C+h_0%28X%29%5Cright%29%5Cright%5D%5Cneq+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}&#92;left[&#92;partial_g^{&#92;alpha}&#92;partial_p^{&#92;beta} m&#92;left(Z, &#92;theta_0, h_0(X)&#92;right)&#92;right]&#92;neq 0\" class=\"latex\" \/>. By calculating the functional derivatives, it is then easy to check that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g%5Cepsilon_p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g&#92;epsilon_p\" class=\"latex\" \/> is the dominating term. In particular, Neyman orthogonality implies that all first-order error terms vanish.<\/p>\n\n\n\n<p class=\"\">Importantly, this guarantee is <em>structure-agnostic<\/em>: this rate does not rely on any structural assumptions on the nuisance functions. What we need to assume is merely access to black-box ML estimates with some mean-squared error bounds. 
This is the reason why DML is widely adopted in practice: while there exist alternative estimators that can achieve improved error rates under structural assumptions on the non-parametric components, these assumptions can easily be violated, making these estimators cumbersome to deploy.<\/p>\n\n\n\n<p class=\"\">The problems that causal ML studies are not new. In the non-parametric estimation literature, there have been extensive results that focus on non-parametric efficiency and optimal rates for estimating causal quantities, under structural assumptions on the model such as smoothness of the non-parametric parts of the data generating process [RLM17,KBRW22]. However, the causal ML approach takes a more structure-agnostic view on the estimation of these nuisance quantities, and essentially assumes only access to a good black-box oracle that provides us with relatively accurate estimates. This naturally gives rise to the structure-agnostic minimax optimality framework.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The structure-agnostic framework<\/h2>\n\n\n\n<p class=\"\">We have seen that the key characteristic that differentiates the causal ML approach to estimation (e.g. the DML approach) from the traditional approaches is its structure-agnostic nature. In this section, we discuss the structure-agnostic framework that allows us to compare the performance of structure-agnostic estimators. This framework was originally proposed by [BKW23].<\/p>\n\n\n\n<p class=\"\">To keep things simple, we restrict ourselves to the same setting as the previous section. 
Now suppose we have nuisance estimates <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bh%7D%3D%28%5Chat%7Bg%7D%2C%5Chat%7Bp%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{h}=(&#92;hat{g},&#92;hat{p})\" class=\"latex\" \/> with mean-square errors <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_p\" class=\"latex\" \/>. The structure-agnostic minimax optimality framework asks the following question: if we don\u2019t make any further restriction on the data generating process other than the fact that we have access to estimates for the nuisance functions whose mean-squared errors are upper bounded by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_p\" class=\"latex\" \/>, then what is the best estimation rate that is achievable by any estimation method?<\/p>\n\n\n\n<p class=\"\">To formalize this, we define the uncertainty set as the set containing all distributions that are consistent with the given estimators:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D_%7B%5Cepsilon_g%2C+%5Cepsilon_p%7D+%3D+%5CBig%5C%7B+%28P_X%2C+p%2C+g%29+%5C%3B%7C%5C%3B+%5C%7Cg%28d%2C+X%29+-+%5Chat%7Bg%7D%28d%2C+X%29%5C%7C_%7BP_X%2C+2%7D%5E2+%5Cleq+%5Cepsilon_g%2C+%5C%3B+d+%5Cin+%5C%7B0%2C1%5C%7D%2C+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" 
alt=\"&#92;mathcal{F}_{&#92;epsilon_g, &#92;epsilon_p} = &#92;Big&#92;{ (P_X, p, g) &#92;;|&#92;; &#92;|g(d, X) - &#92;hat{g}(d, X)&#92;|_{P_X, 2}^2 &#92;leq &#92;epsilon_g, &#92;; d &#92;in &#92;{0,1&#92;}, \" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cleft%5C%7C+p%28X%29+-+%5Chat%7Bp%7D%28X%29+%5Cright%5C%7C_%7BP_X%2C+2%7D%5E2+%5Cleq+%5Cepsilon_p%2C+%5C%3B+0+%5Cleq+p%28x%29%2C+%5C%3B+g%28d%2C+x%29+%5Cleq+1%2C+%5C%3B+%5Cforall+x+%5Cin+%5Cmathcal%7BX%7D%2C+%5C%3B+d+%5Cin+%5C%7B0%2C1%5C%7D+%5CBig%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;left&#92;| p(X) - &#92;hat{p}(X) &#92;right&#92;|_{P_X, 2}^2 &#92;leq &#92;epsilon_p, &#92;; 0 &#92;leq p(x), &#92;; g(d, x) &#92;leq 1, &#92;; &#92;forall x &#92;in &#92;mathcal{X}, &#92;; d &#92;in &#92;{0,1&#92;} &#92;Big&#92;}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_X\" class=\"latex\" \/> is the marginal distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/>. Here we restrict ourselves to the case where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Y\" class=\"latex\" \/> are binary. This additional constraint would only strengthen our minimax lower bounds presented in this blog. 
In this case,&nbsp; each tuple <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28P_X%2Cp%2Cg%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(P_X,p,g)\" class=\"latex\" \/> uniquely determines a distribution over observational data. For any set <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}\" class=\"latex\" \/>, we define the minimax <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1-%5Cgamma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1-&#92;gamma\" class=\"latex\" \/> quantile risk for estimating the ATE by<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathfrak%7BM%7D_%7Bn%2C+%5Cgamma%7D%5E%7BA+T+E%7D%28%5Cmathcal%7BF%7D%29%3D%5Cinf+_%7B%5Chat%7B%5Ctheta%7D%3A%28%5Cmathcal%7BX%7D+%5Ctimes%5Cmathcal%7BD%7D+%5Ctimes+%5Cmathcal%7BY%7D%29%5En+%5Cmapsto+%5Cmathbb%7BR%7D%7D+%5Csup+_%7Bs%3D%5Cleft%28P_X%5E%2A%2C+p%5E%2A%2Cg%5E%2A%5Cright%29+%5Cin+%5Cmathcal%7BF%7D%7D+Q_%7BP_s%2C1-%5Cgamma%7D%5Cleft%28%5Cleft%7C%5Chat%7B%5Ctheta%7D-%5Ctheta_s%5E%7B%5Cmathrm%7BATE%7D%7D%5Cright%7C%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathfrak{M}_{n, &#92;gamma}^{A T E}(&#92;mathcal{F})=&#92;inf _{&#92;hat{&#92;theta}:(&#92;mathcal{X} &#92;times&#92;mathcal{D} &#92;times &#92;mathcal{Y})^n &#92;mapsto &#92;mathbb{R}} &#92;sup _{s=&#92;left(P_X^*, p^*,g^*&#92;right) &#92;in &#92;mathcal{F}} Q_{P_s,1-&#92;gamma}&#92;left(&#92;left|&#92;hat{&#92;theta}-&#92;theta_s^{&#92;mathrm{ATE}}&#92;right|&#92;right)\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_s\" class=\"latex\" \/> and <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_s%5E%7B%5Cmathrm%7BATE%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta_s^{&#92;mathrm{ATE}}\" class=\"latex\" \/> are the data distribution and the ATE induced by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s\" class=\"latex\" \/>, respectively, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Q_%7BP%2C1-%5Cgamma%7D%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Q_{P,1-&#92;gamma}(&#92;cdot)\" class=\"latex\" \/> is the quantile function under data distribution <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P\" class=\"latex\" \/>. Clearly, our previous discussion of DML implies that the worst-case risk is at most <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g%5Cepsilon_p%2Bn%5E%7B-1%2F2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g&#92;epsilon_p+n^{-1\/2}\" class=\"latex\" \/>. This framework precisely captures the main idea behind causal ML estimators that we described in the previous section.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Main results<\/h2>\n\n\n\n<p class=\"\">In this section, we introduce our main results on structure-agnostic lower bounds [JS24]. Prior to our work, the only known structure-agnostic lower bounds were established in [BKW23]. In their paper, it is shown that DML is optimal for estimating a set of functionals of interest, which relate to the ATE but do not include the ATE functional itself.<\/p>\n\n\n\n<p class=\"\">Our first result establishes the optimality of DML for estimating the ATE: the doubly robust estimator with sample splitting achieves the statistically optimal rate. 
As discussed in the previous section, the DML estimator for the ATE is given by<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D%5E%7B%5Cmathrm%7BATE%7D%7D%3D%5Cfrac%7B1%7D%7Bn%7D+%5Csum_%7Bi%3D1%7D%5En%5Cleft%5B%5Chat%7Bg%7D%5Cleft%281%2CX_i%5Cright%29-%5Chat%7Bg%7D%5Cleft%280%2CX_i%5Cright%29%2B%5Cfrac%7BD_i-%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%7D%7B%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%5Cleft%281-%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%5Cright%29%7D%5Cleft%28Y_i-%5Chat%7Bg%7D%5Cleft%28D_i%2CX_i%5Cright%29%5Cright%29%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}^{&#92;mathrm{ATE}}=&#92;frac{1}{n} &#92;sum_{i=1}^n&#92;left[&#92;hat{g}&#92;left(1,X_i&#92;right)-&#92;hat{g}&#92;left(0,X_i&#92;right)+&#92;frac{D_i-&#92;hat{p}&#92;left(X_i&#92;right)}{&#92;hat{p}&#92;left(X_i&#92;right)&#92;left(1-&#92;hat{p}&#92;left(X_i&#92;right)&#92;right)}&#92;left(Y_i-&#92;hat{g}&#92;left(D_i,X_i&#92;right)&#92;right)&#92;right]\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">and has the structure-agnostic rate of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g%5Cepsilon_p%2Bn%5E%7B-1%2F2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g&#92;epsilon_p+n^{-1\/2}\" class=\"latex\" \/>. 
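<\/p>\n\n\n\n<p class=\"\">To make the estimator concrete, here is a minimal Python sketch of the doubly robust (AIPW) estimate above. The nuisance estimates <code>g_hat<\/code> and <code>p_hat<\/code> are hypothetical stand-ins (our own names) for models fitted on a held-out fold, as required by sample splitting:<\/p>\n\n\n\n

```python
import numpy as np

def dml_ate(Y, D, X, g_hat, p_hat):
    """Doubly robust (AIPW) estimate of the ATE.

    g_hat(d, X): estimates of the outcome regression E[Y | D=d, X].
    p_hat(X):    estimates of the propensity score P(D=1 | X).
    Both are assumed to be fitted on a separate sample (sample splitting).
    """
    g1, g0 = g_hat(1, X), g_hat(0, X)
    p = p_hat(X)
    # Residual of the observed outcome against the fitted regression.
    resid = Y - np.where(D == 1, g1, g0)
    # Regression difference plus the inverse-propensity-weighted correction:
    # for D=1 the weight is 1/p, for D=0 it is -1/(1-p).
    summand = g1 - g0 + (D - p) / (p * (1 - p)) * resid
    return float(np.mean(summand))
```

\n\n\n\n<p class=\"\">The correction term is what yields double robustness: the bias of the estimator involves only the product of the two nuisance errors, which is the source of the rate above.<\/p>\n\n\n\n<p class=\"\">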
We now establish a matching lower bound.<\/p>\n\n\n\n<p class=\"\"><strong>Theorem 1.<\/strong> Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathrm%7Bsupp%7D%28X%29%3D%5B0%2C1%5D%5EK&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathrm{supp}(X)=[0,1]^K\" class=\"latex\" \/>, and let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctilde%7B%5Cmathcal%7BF%7D%7D_%7B%5Cepsilon_g%2C+%5Cepsilon_p%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;tilde{&#92;mathcal{F}}_{&#92;epsilon_g, &#92;epsilon_p}\" class=\"latex\" \/> contain all distributions in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BF%7D_%7B%5Cepsilon_g%2C+%5Cepsilon_p%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{F}_{&#92;epsilon_g, &#92;epsilon_p}\" class=\"latex\" \/> whose marginal distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> is uniform. For any constant <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1%2F2%3C%5Cgamma%3C1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\/2&lt;&#92;gamma&lt;1\" class=\"latex\" \/>, if our nuisance estimates <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Chat%7Bg%7D%2C%5Chat%7Bp%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;hat{g},&#92;hat{p})\" class=\"latex\" \/> take values in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Bc%2C1-c%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"[c,1-c]\" class=\"latex\" \/>, where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=c%5Cin%280%2C1%2F2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"c&#92;in(0,1\/2)\" class=\"latex\" \/> is a constant, then<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathfrak%7BM%7D_%7Bn%2C+%5Cgamma%7D%5E%7BA+T+E%7D%5Cleft%28%5Ctilde%7B%5Cmathcal%7BF%7D%7D_%7B%5Cepsilon_g%2C%5Cepsilon_p%7D%5Cright%29%3D%5COmega%5Cleft%28%5Cepsilon_g%5Cepsilon_p%2Bn%5E%7B-1+%2F+2%7D%5Cright%29+.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathfrak{M}_{n, &#92;gamma}^{A T E}&#92;left(&#92;tilde{&#92;mathcal{F}}_{&#92;epsilon_g,&#92;epsilon_p}&#92;right)=&#92;Omega&#92;left(&#92;epsilon_g&#92;epsilon_p+n^{-1 \/ 2}&#92;right) .\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">Interestingly, knowing the marginal distribution of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> would not change the statistical limit.<\/p>\n\n\n\n<p class=\"\">We also consider another important causal parameter, the average treatment effect on the treated (ATT), defined by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta%5E%7BATT%7D%3D%5Cmathbb%7BE%7D%5Cleft%5BY%281%29-Y%280%29+%5Cmid+D%3D1%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta^{ATT}=&#92;mathbb{E}&#92;left[Y(1)-Y(0) &#92;mid D=1&#92;right]\" class=\"latex\" \/>. 
Under conditional ignorability, it can be written as<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta%5E%7BATT%7D%3D%5Cmathbb%7BE%7D%5Cleft%5BY-g_0%280%2C+X%29+%5Cmid+D%3D1%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta^{ATT}=&#92;mathbb{E}&#92;left[Y-g_0(0, X) &#92;mid D=1&#92;right]\" class=\"latex\" \/>.<\/p>\n\n\n\n<p class=\"\">The DML estimate of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta%5E%7BATT%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta^{ATT}\" class=\"latex\" \/> is&nbsp;<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D_%7BDML%7D%3D%5Cleft%28%5Csum_%7Bi%3D1%7D%5En+D_i%5Cright%29%5E%7B-1%7D%5Csum_%7Bi%3D1%7D%5En%5Cleft%5BD_i%5Cleft%28Y_i-%5Chat%7Bg%7D%5Cleft%280%2CX_i%5Cright%29%5Cright%29-%5Cfrac%7B%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%7D%7B1-%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%7D%5Cleft%281-D_i%5Cright%29%5Cleft%28Y_i-%5Chat%7Bg%7D%5Cleft%280%2C+X_i%5Cright%29%5Cright%29%5Cright%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}_{DML}=&#92;left(&#92;sum_{i=1}^n D_i&#92;right)^{-1}&#92;sum_{i=1}^n&#92;left[D_i&#92;left(Y_i-&#92;hat{g}&#92;left(0,X_i&#92;right)&#92;right)-&#92;frac{&#92;hat{p}&#92;left(X_i&#92;right)}{1-&#92;hat{p}&#92;left(X_i&#92;right)}&#92;left(1-D_i&#92;right)&#92;left(Y_i-&#92;hat{g}&#92;left(0, X_i&#92;right)&#92;right)&#92;right]\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">and can be shown to achieve the same <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon_g%5Cepsilon_p%2Bn%5E%7B-1%2F2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon_g&#92;epsilon_p+n^{-1\/2}\" class=\"latex\" \/> rate as for ATE. 
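<\/p>\n\n\n\n<p class=\"\">Analogously, the ATT estimate above can be sketched in a few lines of Python; <code>g0_hat<\/code> and <code>p_hat<\/code> are hypothetical nuisance estimates (our own names) fitted on a held-out fold:<\/p>\n\n\n\n

```python
import numpy as np

def dml_att(Y, D, X, g0_hat, p_hat):
    """Doubly robust estimate of the ATT.

    g0_hat(X): estimates of the control outcome regression E[Y | D=0, X].
    p_hat(X):  estimates of the propensity score P(D=1 | X).
    Both are assumed to be fitted on a separate sample (sample splitting).
    """
    resid0 = Y - g0_hat(X)
    p = p_hat(X)
    # Treated units contribute their residual directly; control units are
    # reweighted by p/(1-p) to debias the control outcome regression.
    summand = D * resid0 - p / (1 - p) * (1 - D) * resid0
    return float(np.sum(summand) / np.sum(D))
```

\n\n\n\n<p class=\"\">Note the normalization by the number of treated units rather than by n, mirroring the conditioning on D=1 in the definition of the ATT.<\/p>\n\n\n\n<p class=\"\">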
We also show that this rate is unimprovable:<\/p>\n\n\n\n<p class=\"\"><strong>Theorem 2.<\/strong> In the same setting as Theorem 1, we have<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathfrak%7BM%7D_%7Bn%2C+%5Cgamma%7D%5E%7BA+T+T%7D%5Cleft%28%5Ctilde%7B%5Cmathcal%7BF%7D%7D_%7B%5Cepsilon_g%2C%5Cepsilon_p%7D%5Cright%29%3D%5COmega%5Cleft%28%5Cepsilon_g%5Cepsilon_p%2Bn%5E%7B-1+%2F+2%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathfrak{M}_{n, &#92;gamma}^{A T T}&#92;left(&#92;tilde{&#92;mathcal{F}}_{&#92;epsilon_g,&#92;epsilon_p}&#92;right)=&#92;Omega&#92;left(&#92;epsilon_g&#92;epsilon_p+n^{-1 \/ 2}&#92;right)\" class=\"latex\" \/>.&nbsp;<\/p>\n\n\n\n<p class=\"\">Finally, we can also extend Theorem 1 to the weighted ATE (WATE) defined as<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta%5E%7BWATE%7D%3D%5Cmathbb%7BE%7D_%7BP_0%7D%5Bw%28X%29%28Y%281%29-Y%280%29%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta^{WATE}=&#92;mathbb{E}_{P_0}[w(X)(Y(1)-Y(0))]\" class=\"latex\" \/>,<\/p>\n\n\n\n<p class=\"\">which arises in policy evaluation [AW21]. Here <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w(x)\" class=\"latex\" \/> is a uniformly bounded weight function but is not required to be non-negative. 
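<\/p>\n\n\n\n<p class=\"\">As with the ATE, a doubly robust estimate of the WATE simply reweights each unit by w(X_i); here is a minimal Python sketch under the same hypothetical nuisance estimates as before:<\/p>\n\n\n\n

```python
import numpy as np

def dml_wate(Y, D, X, w, g_hat, p_hat):
    """Doubly robust estimate of the weighted ATE with weight function w.

    Identical to the AIPW estimator of the ATE, except that each unit's
    summand is multiplied by the (possibly signed) weight w(X_i).
    """
    g1, g0 = g_hat(1, X), g_hat(0, X)
    p = p_hat(X)
    resid = Y - np.where(D == 1, g1, g0)
    summand = g1 - g0 + (D - p) / (p * (1 - p)) * resid
    return float(np.mean(w(X) * summand))
```

\n\n\n\n<p class=\"\">Taking w identically equal to one recovers the ATE estimator.<\/p>\n\n\n\n<p class=\"\">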
The following theorem addresses the minimax structure-agnostic rate for estimating WATE:<\/p>\n\n\n\n<p class=\"\"><strong>Theorem 3.<\/strong> In the same setting as Theorem 1, we have<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathfrak%7BM%7D_%7Bn%2C+%5Cgamma%7D%5E%7BW+A+T+E%7D%5Cleft%28%5Ctilde%7B%5Cmathcal%7BF%7D%7D_%7B%5Cepsilon_g%2C%5Cepsilon_p%7D%5Cright%29%3D%5COmega%5Cleft%28%5C%7Cw%5C%7C_%7BL%5E2%5Cleft%28P_X%5Cright%29%7D+%5Cepsilon_g+%5Cepsilon_p%2B%5C%7Cw%5C%7C_%7BL%5E%7B%5Cinfty%7D%5Cleft%28P_X%5Cright%29%7D+n%5E%7B-1+%2F+2%7D%5Cright%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathfrak{M}_{n, &#92;gamma}^{W A T E}&#92;left(&#92;tilde{&#92;mathcal{F}}_{&#92;epsilon_g,&#92;epsilon_p}&#92;right)=&#92;Omega&#92;left(&#92;|w&#92;|_{L^2&#92;left(P_X&#92;right)} &#92;epsilon_g &#92;epsilon_p+&#92;|w&#92;|_{L^{&#92;infty}&#92;left(P_X&#92;right)} n^{-1 \/ 2}&#92;right),\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"\">where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_X\" class=\"latex\" \/> is the uniform distribution over <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathrm%7Bsupp%7D%28X%29%3D%5B0%2C1%5D%5EK&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathrm{supp}(X)=[0,1]^K\" class=\"latex\" \/>. 
Moreover, this rate is achieved by the DML estimator<\/p>\n\n\n\n<p class=\"has-text-align-center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7B%5Ctheta%7D%5E%7B%5Cmathrm%7BWATE%7D%7D%3D%5Cfrac%7B1%7D%7Bn%7D+%5Csum_%7Bi%3D1%7D%5En+w%5Cleft%28X_i%5Cright%29%5Cleft%5B%5Chat%7Bg%7D%5Cleft%281%2C+X_i%5Cright%29-%5Chat%7Bg%7D%5Cleft%280%2CX_i%5Cright%29%2B%5Cfrac%7BD_i-%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%7D%7B%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%5Cleft%281-%5Chat%7Bp%7D%5Cleft%28X_i%5Cright%29%5Cright%29%7D%5Cleft%28Y_i-%5Chat%7Bg%7D%5Cleft%28D_i%2C+X_i%5Cright%29%5Cright%29%5Cright%5D+.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{&#92;theta}^{&#92;mathrm{WATE}}=&#92;frac{1}{n} &#92;sum_{i=1}^n w&#92;left(X_i&#92;right)&#92;left[&#92;hat{g}&#92;left(1, X_i&#92;right)-&#92;hat{g}&#92;left(0,X_i&#92;right)+&#92;frac{D_i-&#92;hat{p}&#92;left(X_i&#92;right)}{&#92;hat{p}&#92;left(X_i&#92;right)&#92;left(1-&#92;hat{p}&#92;left(X_i&#92;right)&#92;right)}&#92;left(Y_i-&#92;hat{g}&#92;left(D_i, X_i&#92;right)&#92;right)&#92;right] .\" class=\"latex\" \/><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion and discussions<\/h2>\n\n\n\n<p class=\"\">In this blogpost, we introduced the setting and main results of our recent paper [JS24], that establishes the optimality of the celebrated DML algorithm, and in particular the doubly robust estimator with sample splitting, in a structure-agnostic framework for two important causal parameters: the ATE and the ATT, as well as the weighted version of the former. For practitioners, the main takeaway is that if no particular structural insights are available, then it might be better to use DML rather than more refined estimators that leverage potentially brittle assumptions on the non-parametric components of the data generating process.&nbsp;<\/p>\n\n\n\n<p class=\"\">[AW21] Susan Athey and Stefan Wager. Policy learning with observational data. 
Econometrica 89.1 (2021): 133-161.<\/p>\n\n\n\n<p class=\"\">[BKW23] Sivaraman Balakrishnan, Edward H Kennedy, and Larry Wasserman. The fundamental limits of structure-agnostic functional estimation. arXiv preprint arXiv:2305.04116, 2023.<\/p>\n\n\n\n<p class=\"\">[CCD+17] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double\/debiased\/Neyman machine learning of treatment effects. American Economic Review, 107(5):261\u2013265, 2017.<\/p>\n\n\n\n<p class=\"\">[JS24] Jikai Jin and Vasilis Syrgkanis. Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation. arXiv preprint arXiv:2402.14264, 2024.<\/p>\n\n\n\n<p class=\"\">[KBRW22] Edward H Kennedy, Sivaraman Balakrishnan, James M Robins, and Larry Wasserman. Minimax rates for heterogeneous causal effect estimation. The Annals of Statistics 52.2 (2024): 793-816.<\/p>\n\n\n\n<p class=\"\">[RLM17] James M Robins, Lingling Li, and Rajarshi Mukherjee. Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951\u20131987, 2017.<\/p>\n\n\n\n<p class=\"\">[RRZ94] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. &#8220;Estimation of regression coefficients when some regressors are not always observed.&#8221; Journal of the American Statistical Association 89.427 (1994): 846-866.<\/p>\n\n\n\n<p class=\"\">[LR11] Mark J. van der Laan and Sherri Rose. Targeted learning: causal inference for observational and experimental data (Vol. 4). New York: Springer, 2011.<\/p>\n\n\n\n<p class=\"\">[AP09] Joshua D. Angrist and J\u00f6rn-Steffen Pischke. Mostly harmless econometrics: An empiricist&#8217;s companion. Princeton University Press, 2009.<\/p>\n\n\n\n<p class=\"\">[CHK+24] Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, Vasilis Syrgkanis (2024). 
Applied Causal Inference Powered by ML and AI.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have another new technical blog post, courtesy Jikai Jin and Vasilis Syrgkanis, about optimality of double machine learning for causal inference. An introduction to causal inference Causal inference deals with the fundamental question of \u201cwhat if\u201d, trying to estimate\/predict the counterfactual outcome that one does not directly observe. For instance, one may want to [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","om_disable_all_campaigns":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[4],"tags":[],"class_list":["post-844","post","type-post","status-publish","format-standard","hentry","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/844","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/comments?post=844"}],"version-history":[{"count":38,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/844\/revisions"}],"predecessor-version":[{"id":885,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/posts\/844\/revisions\/885"}],"wp:attachment":[{"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/media?parent=844"}],"
wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/categories?post=844"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.let-all.com\/blog\/wp-json\/wp\/v2\/tags?post=844"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}