24 September, 2018

Example: partially linear model

\[ y_i = \theta d_i + f(x_i) + \epsilon_i \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|d,x] = 0\)
  • Nuisance parameter \(f()\)
  • E.g. Donohue and Levitt (2001)

Example: Matching

  • Binary treatment \(d_i \in \{0,1\}\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Interested in average treatment effect : \(\theta = \Er[y_i(1) - y_i(0)]\)
  • Covariates \(x_i\)
  • Assume unconfoundedness : \(d_i \indep y_i(1), y_i(0) | x_i\)
  • E.g. Connors et al. (1996)

Example: Matching

  • Estimable formulae for the ATE : \[ \begin{align*} \theta = & \Er\left[\frac{y_i d_i}{\Pr(d = 1 | x_i)} - \frac{y_i (1-d_i)}{1-\Pr(d=1|x_i)} \right] \\ \theta = & \Er\left[\Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\right] \\ \theta = & \Er\left[ \begin{array}{l} d_i \frac{y_i - \Er[y_i | d_i = 1, x_i]}{\Pr(d=1|x_i)} - (1-d_i)\frac{y_i - \Er[y_i | d_i = 0, x_i]}{1-\Pr(d=1|x_i)} \\ + \Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\end{array}\right] \end{align*} \]
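The three formulae (inverse propensity weighting, regression adjustment, and the doubly robust combination) can be compared on simulated data. The following is a minimal sketch, assuming the true propensity score and conditional means are known; in practice they would be estimated. The function name `ate_estimates` and the data generating process are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def ate_estimates(y, d, pscore, mu1, mu0):
    """Sample analogues of the three ATE formulae.

    pscore = Pr(d=1|x_i), mu1 = E[y|d=1,x_i], mu0 = E[y|d=0,x_i],
    each evaluated at every observation.
    """
    ipw = np.mean(y * d / pscore - y * (1 - d) / (1 - pscore))
    reg = np.mean(mu1 - mu0)
    dr = np.mean(d * (y - mu1) / pscore
                 - (1 - d) * (y - mu0) / (1 - pscore)
                 + mu1 - mu0)
    return ipw, reg, dr

# Hypothetical DGP: unconfounded given x, true ATE = 2.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(size=n)
pscore = 0.25 + 0.5 * x                 # Pr(d=1|x), bounded away from 0 and 1
d = (rng.uniform(size=n) < pscore).astype(float)
y0 = np.sin(2 * np.pi * x) + rng.normal(size=n)
y1 = y0 + 2.0
y = d * y1 + (1 - d) * y0               # observed outcome y_i = y_i(d_i)
mu0 = np.sin(2 * np.pi * x)             # oracle E[y|d=0,x]
mu1 = mu0 + 2.0                         # oracle E[y|d=1,x]
ipw, reg, dr = ate_estimates(y, d, pscore, mu1, mu0)
```

With oracle nuisance functions all three should be close to 2; the formulae differ in how sensitive they are to errors in the estimated nuisance functions.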

Example: IV

\[ \begin{align*} y_i = & \theta d_i + f(x_i) + \epsilon_i \\ d_i = & g(x_i, z_i) + u_i \end{align*} \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|x,z] = 0\), \(\Er[u|x,z]=0\)
  • Nuisance parameters \(f()\), \(g()\)
  • E.g. Angrist and Krueger (1991)

Example: LATE

  • Binary instrument \(z_i \in \{0,1\}\)
  • Potential treatments \(d_i(0), d_i(1) \in \{0,1\}\), \(d_i = d_i(z_i)\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Covariates \(x_i\)
  • \((y_i(1), y_i(0), d_i(1), d_i(0)) \indep z_i | x_i\)
  • Local average treatment effect: \[ \begin{align*} \theta = & \Er\left[\Er[y_i(1) - y_i(0) | x, d_i(1) > d_i(0)]\right] \\ = & \Er\left[\frac{\Er[y|z=1,x] - \Er[y|z=0,x]} {\Er[d|z=1,x]-\Er[d|z=0,x]} \right] \end{align*} \]
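The second expression for the LATE suggests a plug-in estimator. The following is a rough sketch, assuming a discrete covariate so that the conditional expectations are simple cell means; the function name `late_by_cells` and the simulated design (compliers and never-takers only) are assumptions made for illustration.

```python
import numpy as np

def late_by_cells(y, d, z, x):
    """Average of cell-wise Wald ratios, weighted by the frequency of each x cell."""
    estimate = 0.0
    for value in np.unique(x):
        cell = x == value
        on, off = cell & (z == 1), cell & (z == 0)
        reduced_form = y[on].mean() - y[off].mean()   # E[y|z=1,x] - E[y|z=0,x]
        first_stage = d[on].mean() - d[off].mean()    # E[d|z=1,x] - E[d|z=0,x]
        estimate += cell.mean() * reduced_form / first_stage
    return estimate

# Hypothetical DGP: 60% compliers, 40% never-takers, complier effect 2 + x.
rng = np.random.default_rng(1)
n = 40_000
x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
complier = rng.uniform(size=n) < 0.6
d = (complier & (z == 1)).astype(float)   # d_i = d_i(z_i)
y = x + d * (2.0 + x) + rng.normal(size=n)
late_hat = late_by_cells(y, d, z, x)      # population value is E[2 + x] = 2.5
```

With high dimensional or continuous \(x\), the four conditional expectations would instead be estimated by a machine learning method.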

General setup

  • Parameter of interest \(\theta \in \R^{d_\theta}\)

  • Nuisance parameter \(\eta \in T\)

  • Moment conditions \[ \Er[\psi(W;\theta_0,\eta_0) ] = 0 \in \R^{d_\theta} \] with \(\psi\) known

  • Estimate \(\hat{\eta}\) using some machine learning method

  • Estimate \(\hat{\theta}\) from \[ \En[\psi(w_i;\hat{\theta},\hat{\eta}) ] = 0 \]

Example: partially linear model

\[ y_i = \theta_0 d_i + f_0(x_i) + \epsilon_i \]

  • Compare the estimates from

    1. \(\En[d_i(y_i - \tilde{\theta} d_i - \hat{f}(x_i)) ] = 0\)

    2. \(\En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0\)

    where \(m(x) = \Er[d|x]\) and \(\mu(x) = \Er[y|x]\)
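The difference between the two moment conditions shows up when the nuisance estimates are biased. The sketch below does not fit any machine learning model: it perturbs the true nuisance functions by a fixed amount `delta` as a stand-in for regularization bias, which is an assumption made purely for illustration, as are the functional forms in the simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0, delta = 200_000, 1.0, 0.1
x = rng.uniform(size=n)
m0 = x                          # m(x) = E[d|x]
f0 = np.sin(2 * np.pi * x)      # f_0(x)
d = m0 + rng.normal(size=n)
y = theta0 * d + f0 + rng.normal(size=n)
mu0 = theta0 * m0 + f0          # mu(x) = E[y|x]

# Deliberately biased nuisance "estimates" (stand-ins for regularized ML fits).
f_hat = f0 + delta * x
mu_hat = mu0 + delta * x
m_hat = m0 + delta

# Moment condition 1: bias is first order in the nuisance error.
theta_naive = np.mean(d * (y - f_hat)) / np.mean(d ** 2)

# Moment condition 2: bias is second order in the nuisance errors.
d_res = d - m_hat
theta_orth = np.mean(d_res * (y - mu_hat)) / np.mean(d_res ** 2)
```

In this design the population bias of the first estimator is proportional to `delta`, while that of the second is proportional to `delta ** 2`.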

Lessons from the example

  • Need an extra condition on the moments, Neyman orthogonality: \[ \partial_\eta \Er[\psi(W;\theta_0,\eta_0)](\eta-\eta_0) = 0 \]

  • Want estimators that converge faster than \(n^{-1/4}\) in the prediction norm, \[ \sqrt{\En[(\hat{\eta}(x_i) - \eta(x_i))^2]} \lesssim_P n^{-1/4} \]

  • Also want estimators that satisfy something like \[ \sqrt{n} \En[(\eta(x_i)-\hat{\eta}(x_i))\epsilon_i] = o_p(1) \]
    • Sample splitting will make this easier


Introduction to machine learning

Some prediction examples

Machine learning is tailored for prediction, so let’s look at some data and see how well it works.

Predicting house prices

  • Example from Mullainathan and Spiess (2017)
  • Training on 10000 observations from AHS
  • Predict log house price using 150 variables
  • Holdout sample of 41808

AHS variables

##  Min.   : 0.00   1: 5773   1:10499   1:11928   -7: 1851   1:51513  
##  1st Qu.:11.56   2:13503   2: 1124   2:39037   1 :49353   2:  295  
##  Median :12.10   3:15408   3:  202   9:  843   2 :  604            
##  Mean   :12.06   4:17124   4:  103                                 
##  3rd Qu.:12.61             7:39880                                 
##  Max.   :15.48                                                     
##  -1:49868   -8:  133   -8:  133   -8:  133   -8:  133   -8:  133  
##  1 :  927   -7:   50   -7:   50   -7:   50   -7:   50   -7:   50  
##  2 : 1013   1 :  446   1 :  813   1 : 8689   1 :   61   1 :41895  
##             2 :51179   2 :50812   2 :42936   2 :51564   2 : 9730  
##  NEWC       DISH      WASH      DRY       NUNIT2    BURNER     COOK     
##  -9:50485   1:42221   1:50456   1:49880   1:44922   -6:51567   1:51567  
##  1 : 1323   2: 9587   2: 1352   2: 1928   2: 2634   1 :   87   2:  241  
##                                           3: 2307   2 :  154            
##                                           4: 1945                       
##  OVEN      
##  -6:51654  
##  1 :  127  
##  2 :   27  

Performance of different algorithms in predicting housing values

            In-sample MSE   In-sample R^2   Out-of-sample MSE   Out-of-sample R^2
OLS         0.589           0.473           0.674               0.417
Tree        0.675           0.396           0.758               0.345
Lasso       0.603           0.460           0.656               0.433
Forest      0.166           0.851           0.632               0.454
Ensemble    0.216           0.807           0.625               0.460

Predicting pipeline revenues

  • Data on US natural gas pipelines
    • Combination of FERC Form 2, EIA Form 176, and other sources, compiled by me
    • 1996-2016, 236 pipeline companies, 1219 company-year observations
  • Predict: \(y =\) profits from transmission of natural gas
  • Covariates: year, capital, discovered gas reserves, well head gas price, city gate gas price, heating degree days, state(s) that each pipeline operates in

##   transProfit         transPlant_bal_beg_yr   cityPrice      
##  Min.   : -31622547   Min.   :0.000e+00     Min.   : 0.4068  
##  1st Qu.:   2586031   1st Qu.:2.404e+07     1st Qu.: 3.8666  
##  Median :  23733170   Median :1.957e+08     Median : 5.1297  
##  Mean   :  93517513   Mean   :7.772e+08     Mean   : 5.3469  
##  3rd Qu.: 129629013   3rd Qu.:1.016e+09     3rd Qu.: 6.5600  
##  Max.   :1165050214   Max.   :1.439e+10     Max.   :12.4646  
##  NA's   :2817         NA's   :2692          NA's   :1340     
##    wellPrice     
##  Min.   :0.0008  
##  1st Qu.:2.1230  
##  Median :3.4370  
##  Mean   :3.7856  
##  3rd Qu.:5.1795  
##  Max.   :9.6500  
##  NA's   :2637

Predicting pipeline revenues : methods

  • OLS : 67 covariates (year dummies and state(s) create a lot)
  • Lasso
  • Random forests
  • Randomly choose 75% of sample to fit the models, then look at prediction accuracy in remaining 25%

Training sample

                OLS     Lasso   Random forest   Neural network
Relative MSE    0.103   0.030   0.062           0.012
Relative MAE    0.260   0.137   0.185           0.094

Hold-out sample

                OLS     Lasso   Random forest   Neural network
Relative MSE    0.142   0.076   0.116           0.079
Relative MAE    0.291   0.170   0.236           0.230


  • Lasso solves a penalized (regularized) regression problem \[ \hat{\beta} = \argmin_\beta \En [ (y_i - x_i'\beta)^2 ] + \frac{\lambda}{n} \norm{ \hat{\Psi} \beta}_1 \]
  • Penalty parameter \(\lambda\)
  • Diagonal matrix \(\hat{\Psi} = diag(\hat{\psi})\)
  • Dimension of \(x_i\) is \(p\) and implicitly depends on \(n\)
    • can have \(p >> n\)
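The penalized problem above can be solved by cyclic coordinate descent with soft thresholding. The following is a minimal sketch, not the implementation used for the results in these slides; the function names, the default of unit penalty loadings \(\hat{\psi}_j = 1\), the fixed number of sweeps, and the value of `lam` in the example are all assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, psi=None, n_iter=200):
    """Minimize mean((y - X b)^2) + (lam / n) * sum_j psi_j |b_j| by coordinate descent."""
    n, p = X.shape
    psi = np.ones(p) if psi is None else np.asarray(psi, dtype=float)
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # x_j'x_j for each column
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]     # partial residual excluding x_j
            rho = X[:, j] @ resid
            # First order condition in b_j: threshold x_j'r at lam * psi_j / 2.
            beta[j] = soft_threshold(rho, lam * psi[j] / 2) / col_ss[j]
            resid -= X[:, j] * beta[j]
    return beta

# Hypothetical sparse design: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:2] = [2.0, -1.0]
y = X @ beta0 + rng.normal(size=n)
beta_hat = lasso_cd(X, y, lam=80.0)        # lam chosen arbitrarily for illustration
```

Setting `lam=0` reduces the problem to least squares, and a sufficiently large `lam` sets every coefficient to zero.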

Statistical properties of Lasso

  • Model : \[ y_i = x_i'\beta_0 + \epsilon_i \]
    • \(\Er[x_i \epsilon_i] = 0\)
    • \(\beta_0 \in \R^p\)
    • \(p\), \(\beta_0\), \(x_i\), and \(s\) implicitly depend on \(n\)
    • \(\log p = o(n^{1/3})\)
      • \(p\) may increase with \(n\) and can have \(p>n\)
  • Sparsity \(s\)
    • Exact : \(\norm{\beta_0}_0 = s = o(n)\)
    • Approximate : \(|\beta_{0,j}| < Aj^{-a}\), \(a > 1/2\), \(s \propto n^{1/(2a)}\)

Rate of convergence

  • With \(\lambda = 2c \sqrt{n} \Phi^{-1}(1-\gamma/(2p))\) \[ \sqrt{\En[(x_i'(\hat{\beta}^{lasso} - \beta_0))^2 ] } \lesssim_P \sqrt{ (s/n) \log (p) }, \]

\[ \norm{\hat{\beta}^{lasso} - \beta_0}_2 \lesssim_P \sqrt{ (s/n) \log (p) }, \]


\[ \norm{\hat{\beta}^{lasso} - \beta_0}_1 \lesssim_P \sqrt{ (s^2/n) \log (p) } \]

  • Constant \(c>1\)

    • Small \(\gamma \to 0\) with \(n\), and \(\log(1/\gamma) \lesssim \log(p)\)

    • Rank like condition on \(x_i\)

  • near-oracle rate
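The penalty level \(\lambda = 2c \sqrt{n} \Phi^{-1}(1-\gamma/(2p))\) is simple to compute. A small sketch using only the Python standard library follows; the defaults `c=1.1` and `gamma=0.05` are assumed values for illustration (the slides only require \(c>1\) and \(\gamma \to 0\)).

```python
from math import sqrt
from statistics import NormalDist

def lasso_penalty(n, p, c=1.1, gamma=0.05):
    """lambda = 2 c sqrt(n) Phi^{-1}(1 - gamma / (2 p))."""
    return 2 * c * sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))

# The penalty grows only slowly (roughly like sqrt(log p)) with the number of covariates.
lam_small_p = lasso_penalty(n=1000, p=10)
lam_large_p = lasso_penalty(n=1000, p=10_000)
```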

Rate of convergence

  • When cross-validation is used to choose \(\lambda\), the known bounds are worse
    • With Gaussian errors: \(\sqrt{\En[(x_i'(\hat{\beta}^{lasso} - \beta_0))^2 ] } \lesssim_P \sqrt{ (s/n) \log (p) } \log(pn)^{7/8}\)
    • Without Gaussian errors: \(\sqrt{\En[(x_i'(\hat{\beta}^{lasso} - \beta_0))^2 ] } \lesssim_P \left( \frac{s \log(pn)^2}{n} \right)^{1/4}\)
    • Chetverikov, Liao, and Chernozhukov (2016)

Other statistical properties

  • Inference on \(\beta\): not the goal in our motivating examples
    • Difficult, but some recent results
    • See Lee et al. (2016), Taylor and Tibshirani (2017), Caner and Kock (2018)
  • Model selection: not the goal in our motivating examples
    • Under stronger conditions, Lasso correctly selects the nonzero components of \(\beta_0\)
    • See Belloni and Chernozhukov (2011)


  • Two steps :

    1. Estimate \(\hat{\beta}^{lasso}\)

    2. \({\hat{\beta}}^{post} =\) OLS regression of \(y\) on components of \(x\) with nonzero \(\hat{\beta}^{lasso}\)

  • Same rates of convergence as Lasso
  • Under some conditions post-Lasso has lower bias
    • If Lasso selects correct model, post-Lasso converges at the oracle rate
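The second step is an ordinary least squares refit on the selected columns. A minimal sketch, assuming the Lasso coefficients are already available as a vector; the function name `post_lasso`, the tolerance `tol`, and the example numbers are hypothetical.

```python
import numpy as np

def post_lasso(X, y, beta_lasso, tol=1e-10):
    """OLS of y on the columns of X whose Lasso coefficient is nonzero."""
    support = np.flatnonzero(np.abs(beta_lasso) > tol)
    beta_post = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_post[support] = coef
    return beta_post

# Hypothetical example: Lasso found the right support but shrank the coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] - 2.0 * X[:, 2]                    # noise-free for clarity
beta_lasso = np.array([0.5, 0.0, -1.5, 0.0])   # shrunken toward zero
beta_post = post_lasso(X, y, beta_lasso)       # refit removes the shrinkage
```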

Random forests

Regression trees

  • Regress \(y_i \in \R\) on \(x_i \in \R^p\)
  • Want to estimate \(\Er[y | x]\)
  • Locally constant estimate \[ \hat{t}(x) = \sum_{m=1}^M c_m 1\{x \in R_m \} \]
  • Rectangular regions \(R_m\) determined by tree

[Figures omitted: simulated data and the estimated tree]

Tree algorithm

  • For each region, solve \[ \min_{j,s} \left[ \min_{c_1} \sum_{i: x_{i,j} \leq s, x_i \in R} (y_i - c_1)^2 + \min_{c_2} \sum_{i: x_{i,j} > s, x_i \in R} (y_i - c_2)^2 \right] \]
  • Repeat with \(R = \{x:x_{j^*} \leq s^*\} \cap R\) and \(R = \{x:x_{j^*} > s^*\} \cap R\), where \((j^*, s^*)\) solves the problem above
  • Stop when \(|R| =\) some chosen minimum size
  • Prune tree \[ \min_{tree \subset T} \sum (\hat{f}(x)-y)^2 + \alpha|\text{terminal nodes in tree}| \]
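The split search in the first step is a brute-force minimization over variables \(j\) and thresholds \(s\). The sketch below covers only that step for a single region; the recursion, the stopping rule, and pruning are omitted, and the function name `best_split` is an assumption.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse) minimizing the two-region sum of squared errors."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        values = np.unique(X[:, j])
        for s in values[:-1]:              # splitting at the maximum leaves one side empty
            left = X[:, j] <= s
            # The inner minimizers c_1 and c_2 are the two region means.
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Tiny example: the outcome jumps when the first variable crosses 0.2.
X = np.array([[0.1, 5.0], [0.2, 3.0], [0.8, 4.0], [0.9, 6.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
j_star, s_star, sse_star = best_split(X, y)
```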

Random forests

  • Average randomized regression trees
  • Trees randomized by
    • Bootstrap or subsampling
    • Randomize branches: \[ \min_{j \in S,s} \left[ \min_{c_1} \sum_{i: x_{i,j} \leq s, x_i \in R} (y_i - c_1)^2 + \min_{c_2} \sum_{i: x_{i,j} > s, x_i \in R} (y_i - c_2)^2 \right] \] where \(S\) is random subset of \(\{1, ..., p\}\)
  • Variance reduction
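Both sources of randomization can be combined in a few lines. This is a small illustrative sketch rather than a production implementation: trees are grown without pruning, splits are placed at midpoints between adjacent sorted values, and all function names and tuning values (`n_trees`, `n_features`, `min_leaf`) are assumptions.

```python
import numpy as np

def grow_tree(X, y, rng, n_features, min_leaf=5):
    """Recursively grow a regression tree; each split searches a random subset S."""
    n, p = X.shape
    if n < 2 * min_leaf:
        return float(y.mean())                       # leaf: constant prediction
    best = None
    for j in rng.choice(p, size=n_features, replace=False):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for k in range(min_leaf, n - min_leaf + 1):
            if xs[k - 1] == xs[k]:
                continue                             # cannot split between ties
            sse = (((ys[:k] - ys[:k].mean()) ** 2).sum()
                   + ((ys[k:] - ys[k:].mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, (xs[k - 1] + xs[k]) / 2)
    if best is None:
        return float(y.mean())
    _, j, s = best
    left = X[:, j] <= s
    return (j, s,
            grow_tree(X[left], y[left], rng, n_features, min_leaf),
            grow_tree(X[~left], y[~left], rng, n_features, min_leaf))

def predict_tree(tree, x):
    """Follow splits until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] <= s else right
    return tree

def random_forest(X, y, n_trees=20, n_features=1, min_leaf=5, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        trees.append(grow_tree(X[idx], y[idx], rng, n_features, min_leaf))
    return trees

def predict_forest(trees, X):
    """Average the predictions of the individual trees."""
    return np.array([np.mean([predict_tree(t, x) for t in trees]) for x in X])

# Hypothetical step-function example: E[y|x] jumps from 0 to 1 at x_1 = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.normal(size=200)
forest = random_forest(X, y)
pred = predict_forest(forest, np.array([[0.1, 0.5], [0.9, 0.5]]))
```

Averaging over bootstrap samples and random split candidates is what delivers the variance reduction relative to a single deep tree.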

Rate of convergence: regression tree

  • \(x \in [0,1]^p\), \(\Er[y|x]\) Lipschitz in \(x\)
  • Crude calculation for a single tree: let \(R_i\) denote the node that contains \(x_i\), \(m\) the number of observations per node, and \(L\) the Lipschitz constant \[ \begin{align*} \Er(\hat{t}(x_i) - \Er[y|x_i])^2 = & \overbrace{\Er(\hat{t}(x_i) - \Er[y|x\in R_i])^2}^{variance} + \overbrace{(\Er[y|x \in R_i] - \Er[y|x])^2}^{bias^2} \\ = & O_p(1/m) + O\left(L^2 \left(\frac{m}{n}\right)^{2/p}\right) \end{align*} \] optimal \(m = O(n^{2/(2+p)})\) gives \[ \Er[(\hat{t}(x_i) - \Er[y|x_i])^2] = O_p(n^{\frac{-2}{2+p}}) \]
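The optimal \(m\) follows from balancing the two terms. As a rough worked step, treating the bounds as exact and writing \(m\) for the number of observations per node and \(L\) for the Lipschitz constant (constants are indicative only):

\[ \begin{align*} \frac{d}{dm}\left[ \frac{1}{m} + L^2 \left(\frac{m}{n}\right)^{2/p} \right] = & -\frac{1}{m^2} + \frac{2 L^2}{p} m^{2/p - 1} n^{-2/p} = 0 \\ \Rightarrow \; m^{(p+2)/p} = & \frac{p}{2L^2} n^{2/p} \\ \Rightarrow \; m = & \left(\frac{p}{2L^2}\right)^{p/(p+2)} n^{2/(p+2)} \end{align*} \]

Substituting back, both the variance term \(1/m\) and the squared bias term are of order \(n^{-2/(2+p)}\). For \(p=1\) this is \(n^{-2/3}\), while for \(p=10\) it is only \(n^{-1/6}\), which illustrates the curse of dimensionality.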

Rate of convergence: random forest

  • Result from Biau (2012)
  • Assume \(\Er[y|x]=\Er[y|x_{(s)}]\), \(x_{(s)}\) subset of \(s\) variables, then \[ \Er[(\hat{r}(x_i) - \Er[y|x_i])^2] = O_p\left(\frac{1}{m\log(n/m)^{s/2p}}\right) + O_p\left(\left(\frac{m}{n}\right)^{\frac{0.75}{s\log 2}} \right) \] or with optimal \(m\) \[ \Er[(\hat{r}(x_i) - \Er[y|x_i])^2] = O_p(n^{\frac{-0.75}{s\log 2+0.75}}) \]

Other statistical properties

  • Pointwise asymptotic normality : Wager and Athey (2018)

Simulation study

  • Partially linear model
  • DGP :
    • \(x_i \in \R^p\) with \(x_{ij} \sim U(0,1)\)
    • \(d_i = m(x_i) + v_i\)
    • \(y_i = d_i\theta + f(x_i) + \epsilon_i\)
    • \(m()\), \(f()\) either linear or step functions
  • Estimate by OLS, Lasso, and random forest
    • Lasso & random forest use orthogonal moments \[ \En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0 \]

Neural Networks

  • Target function \(f: \R^p \to \R\)
    • e.g. \(f(x) = \Er[y|x]\)
  • Approximate with single hidden layer neural network : \[ \hat{f}(x) = \sum_{j=1}^r \beta_j (a_j'a_j \vee 1)^{-1} \psi(a_j'x + b_j) \]
    • Activation function \(\psi\)
      • Examples: Sigmoid \(\psi(t) = 1/(1+e^{-t})\), Tanh \(\psi(t) = \frac{e^t -e^{-t}}{e^t + e^{-t}}\), Rectified linear (ReLU) \(\psi(t) = t 1(t\geq 0)\)
    • Weights \(a_j\)
    • Bias \(b_j\)
  • Able to approximate any \(f\), Hornik, Stinchcombe, and White (1989)
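The single hidden layer formula translates directly into a forward pass. A minimal sketch, assuming the weights are stacked row-wise in a matrix `A`; the function names and array layout are illustrative choices, and no training step is shown.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def single_layer_net(x, A, b, beta, activation=sigmoid):
    """Evaluate sum_j beta_j (a_j'a_j v 1)^{-1} psi(a_j'x + b_j).

    x: (n, p) inputs; A: (r, p) with rows a_j; b: (r,) biases; beta: (r,) output weights.
    """
    scale = 1.0 / np.maximum((A ** 2).sum(axis=1), 1.0)   # (a_j'a_j v 1)^{-1}
    hidden = activation(x @ A.T + b)                       # (n, r) hidden units
    return hidden @ (beta * scale)

# One hidden unit with a'a = 25: output is 50 * sigmoid(0) / 25.
x = np.zeros((1, 2))
A = np.array([[3.0, 4.0]])
out = single_layer_net(x, A, b=np.array([0.0]), beta=np.array([50.0]))
```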

Deep Neural Networks

  • Many hidden layers
    • \(x^{(0)} = x\)
    • \(x^{(\ell)}_j = \psi(a_j^{(\ell)\prime} x^{(\ell-1)} + b_j^{(\ell)})\)
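The layer recursion can be written as a short loop. A sketch under the assumption that each layer is stored as a pair `(A, b)`, with the rows of `A` holding the weight vectors \(a_j^{(\ell)}\); the function name `deep_forward` is hypothetical.

```python
import numpy as np

def deep_forward(x, layers, activation=np.tanh):
    """Apply x^(l) = psi(A^(l) x^(l-1) + b^(l)) for each layer in turn.

    layers: list of (A, b) with A of shape (width_out, width_in) and b of shape (width_out,).
    """
    h = np.asarray(x, dtype=float)     # x^(0) = x
    for A, b in layers:
        h = activation(A @ h + b)
    return h

# Two hidden layers mapping R^2 -> R^3 -> R^1 with arbitrary illustrative weights.
layers = [
    (np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.zeros(3)),
    (np.array([[0.5, -0.5, 0.25]]), np.array([0.1])),
]
h_out = deep_forward([0.2, -0.4], layers)
```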

Rate of convergence

  • Chen and White (1999)
  • \(f(x) = \Er[y|x]\) with Fourier representation \[ f(x) = \int e^{i a'x} d\sigma_f(a) \] where \(\int (\sqrt{a'a} \vee 1) d|\sigma_f|(a) < \infty\)
  • Network sieve : \[ \begin{align*} \mathcal{G}_n = \{ & g: g(x) = \sum_{j=1}^{r_n} \beta_j (a_j'a_j \vee 1)^{-1} \psi(a_j'x + b_j), \\ & \norm{\beta}_1 \leq B_n \} \end{align*} \]

Rate of convergence

  • Estimate \[ \hat{f} = \argmin_{g \in \mathcal{G}_n} \En [(y_i - g(x_i))^2] \]

  • For fixed \(p\), if \(r_n^{2(1+1/(1+p))} \log(r_n) = O(n)\), \(B_n \geq\) some constant \[ \Er[(\hat{f}(x) - f(x))^2] = O\left((n/\log(n))^{\frac{-(1 + 2/(p+1))} {2(1+1/(p+1))}}\right) \]

Simulation Study

  • Same setup as for random forests earlier
  • Partially linear model
  • DGP :
    • \(x_i \in \R^p\) with \(x_{ij} \sim U(0,1)\)
    • \(d_i = m(x_i) + v_i\)
    • \(y_i = d_i\theta + f(x_i) + \epsilon_i\)
    • \(m()\), \(f()\) either linear or step functions
  • Estimate by OLS, Neural network with & without cross-fitting
    • Using orthogonal moments \[ \En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0 \]

Using machine learning to estimate causal effects

Double debiased machine learning

  • Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, et al. (2018), Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2017)

  • Parameter of interest \(\theta \in \R^{d_\theta}\)

  • Nuisance parameter \(\eta \in T\)

  • Moment conditions \[ \Er[\psi(W;\theta_0,\eta_0) ] = 0 \in \R^{d_\theta} \] with \(\psi\) known

  • Estimate \(\hat{\eta}\) using some machine learning method

  • Estimate \(\hat{\theta}\) using cross-fitting


  • Randomly partition into \(K\) subsets \((I_k)_{k=1}^K\)
  • \(I^c_k = \{1, ..., n\} \setminus I_k\)
  • \(\hat{\eta}_k =\) estimate of \(\eta\) using \(I^c_k\)
  • Estimator: \[ \begin{align*} 0 = & \frac{1}{K} \sum_{k=1}^K \frac{K}{n} \sum_{i \in I_k} \psi(w_i;\hat{\theta},\hat{\eta}_k) \\ 0 = & \frac{1}{K} \sum_{k=1}^K \En_k[ \psi(w_i;\hat{\theta},\hat{\eta}_k)] \end{align*} \]
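For the partially linear model with the orthogonal score, cross-fitting amounts to residualizing \(y\) and \(d\) fold by fold and then regressing residual on residual. The sketch below uses least squares as a placeholder learner, which is an assumption made to keep the example self-contained; any machine learning method with the same fit-and-predict shape could be substituted. The names `ols_fit_predict` and `dml_plm` are hypothetical.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Placeholder learner: least squares with an intercept."""
    Z_train = np.column_stack([np.ones(len(X_train)), X_train])
    Z_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
    return Z_test @ coef

def dml_plm(y, d, X, K=5, learner=ols_fit_predict, seed=0):
    """Cross-fitted estimate of theta in y = theta d + f(x) + e, with a standard error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)    # random partition I_1, ..., I_K
    y_res, d_res = np.empty(n), np.empty(n)
    for I_k in folds:
        I_c = np.setdiff1d(np.arange(n), I_k)        # complement of fold k
        y_res[I_k] = y[I_k] - learner(X[I_c], y[I_c], X[I_k])   # y - mu_hat_k(x)
        d_res[I_k] = d[I_k] - learner(X[I_c], d[I_c], X[I_k])   # d - m_hat_k(x)
    # With equal fold sizes, averaging the fold moments equals the pooled moment.
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - theta * d_res)
    J = -np.mean(d_res ** 2)
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / n)
    return theta, se

# Hypothetical linear DGP with theta_0 = 1.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
d = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
y = 1.0 * d + X @ np.array([0.0, 1.0, -1.0]) + rng.normal(size=n)
theta_hat, se_hat = dml_plm(y, d, X)
```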


  • Linear score \[ \psi(w;\theta,\eta) = \psi^a(w;\eta) \theta + \psi^b(w;\eta) \]
  • Near Neyman orthogonality: \[ \lambda_n := \sup_{\eta \in \mathcal{T}_n} \norm{\partial_\eta \Er\left[\psi(W;\theta_0,\eta_0)\right][\eta-\eta_0] } \leq \delta_n n^{-1/2} \]


  • Rate conditions: for \(\delta_n \to 0\) and \(\Delta_n \to 0\), we have \(\Pr(\hat{\eta}_k \in \mathcal{T}_n) \geq 1-\Delta_n\) and \[ \begin{align*} r_n := & \sup_{\eta \in \mathcal{T}_n} \norm{ \Er[\psi^a(W;\eta)] - \Er[\psi^a(W;\eta_0)]} \leq \delta_n \\ r_n' := & \sup_{\eta \in \mathcal{T}_n} \Er\left[ \norm{ \psi(W;\theta_0,\eta) - \psi(W;\theta_0,\eta_0)}^2 \right]^{1/2} \leq \delta_n \\ \lambda_n' := & \sup_{r \in (0,1), \eta \in \mathcal{T}_n} \norm{ \partial_r^2 \Er\left[\psi(W;\theta_0, \eta_0 + r(\eta - \eta_0)) \right]} \leq \delta_n/\sqrt{n} \end{align*} \]
  • Moments exist and other regularity conditions

Proof outline:

  • Let \(\hat{J} = \frac{1}{K} \sum_{k=1}^K \En_k [\psi^a(w_i;\hat{\eta}_k)]\), \(J_0 = \Er[\psi^a(w_i;\eta_0)]\), \(R_{n,1} = \hat{J}-J_0\)

  • Show: \[ \small \begin{align*} \sqrt{n}(\hat{\theta} - \theta_0) = & -\sqrt{n} J_0^{-1} \En[\psi(w_i;\theta_0,\eta_0)] \\ & + (J_0^{-1} - \hat{J}^{-1}) \left(\sqrt{n} \En[\psi(w_i;\theta_0,\eta_0)] + \sqrt{n}R_{n,2}\right) \\ & - \sqrt{n}J_0^{-1}\underbrace{\left(\frac{1}{K} \sum_{k=1}^K \En_k[ \psi(w_i;\theta_0,\hat{\eta}_k)] - \En[\psi(w_i;\theta_0,\eta_0)]\right)}_{R_{n,2}} \end{align*} \]

  • Show \(\norm{R_{n,1}} = O_p(n^{-1/2} + r_n)\)

  • Show \(\norm{R_{n,2}}= O_p(n^{-1/2} r_n' + \lambda_n + \lambda_n')\)

Proof outline: Lemma 6.1

Lemma 6.1

  1. If \(\Pr(\norm{X_m} > \epsilon_m | Y_m) \to_p 0\), then \(\Pr(\norm{X_m}>\epsilon_m) \to 0\).

  2. If \(\Er[\norm{X_m}^q/\epsilon_m^q | Y_m] \to_p 0\) for \(q\geq 1\), then \(\Pr(\norm{X_m}>\epsilon_m) \to 0\).

  3. If \(\norm{X_m} = O_p(A_m)\) conditional on \(Y_m\) (i.e. for any \(\ell_m \to \infty\), \(\Pr(\norm{X_m} > \ell_m A_m | Y_m) \to_p 0\)), then \(\norm{X_m} = O_p(A_m)\) unconditionally

Proof outline: \(R_{n,1}\)

\[ R_{n,1} = \hat{J}-J_0 = \frac{1}{K} \sum_k \left( \En_k[\psi^a(w_i;\hat{\eta}_k)] - \Er[\psi^a(W;\eta_0)] \right) \]

  • \(\norm{\En_k[\psi^a(w_i;\hat{\eta}_k)] - \Er[\psi^a(W;\eta_0)]} \leq U_{1,k} + U_{2,k}\) where \[ \begin{align*} U_{1,k} = & \norm{\En_k[\psi^a(w_i;\hat{\eta}_k)] - \Er[\psi^a(W;\hat{\eta}_k)| I^c_k]} \\ U_{2,k} = & \norm{ \Er[\psi^a(W;\hat{\eta}_k)| I^c_k] - \Er[\psi^a(W;\eta_0)]} \end{align*} \]

Proof outline: \(R_{n,2}\)

  • \(R_{n,2} = \frac{1}{K} \sum_{k=1}^K \En_k\left[ \psi(w_i;\theta_0,\hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0) \right]\)
  • \(\sqrt{n} \norm{\En_k\left[ \psi(w_i;\theta_0,\hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0) \right]} \leq U_{3,k} + U_{4,k}\) where

\[ \small \begin{align*} U_{3,k} = & \norm{ \frac{1}{\sqrt{n}} \sum_{i \in I_k} \left( \psi(w_i;\theta_0, \hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0) - \Er[ \psi(w_i;\theta_0, \hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0)] \right) } \\ U_{4,k} = & \sqrt{n} \norm{ \Er[ \psi(w_i;\theta_0, \hat{\eta}_k) | I_k^c] - \Er[\psi(w_i;\theta_0,\eta_0)]} \end{align*} \]

  • \(U_{4,k} = \sqrt{n} \norm{f_k(1)}\) where

\[ f_k(r) = \Er[\psi(W;\theta_0,\eta_0 + r(\hat{\eta}_k - \eta_0)) | I^c_k] - \Er[\psi(W;\theta_0,\eta_0)] \]

Asymptotic normality

\[ \sqrt{n} \sigma^{-1} (\hat{\theta} - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \bar{\psi}(w_i) + O_p(\rho_n) \leadsto N(0,I) \]

  • \(\rho_n := n^{-1/2} + r_n + r_n' + n^{1/2} (\lambda_n +\lambda_n') \lesssim \delta_n\)

  • Influence function \[\bar{\psi}(w) = -\sigma^{-1} J_0^{-1} \psi(w;\theta_0,\eta_0)\]

  • \(\sigma^2 := J_0^{-1} \Er[\psi(w;\theta_0,\eta_0) \psi(w;\theta_0,\eta_0)'](J_0^{-1})'\)
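With a linear score, the estimator and the plug-in version of \(\sigma^2\) have closed forms. A minimal sketch, assuming the score components \(\psi^a\) and \(\psi^b\) have already been evaluated at the cross-fitted nuisance estimates; the function name `linear_score_estimate` and the array layout are assumptions.

```python
import numpy as np

def linear_score_estimate(psi_a, psi_b):
    """Solve mean(psi_a) theta + mean(psi_b) = 0 and estimate sigma^2.

    psi_a: (n, k, k) array of psi^a(w_i; eta_hat); psi_b: (n, k) array of psi^b(w_i; eta_hat).
    Returns theta_hat (k,) and sigma2_hat (k, k), the asymptotic variance of
    sqrt(n) (theta_hat - theta_0).
    """
    n = psi_b.shape[0]
    J = psi_a.mean(axis=0)
    theta = -np.linalg.solve(J, psi_b.mean(axis=0))
    psi = np.einsum("nij,j->ni", psi_a, theta) + psi_b    # score at theta_hat
    J_inv = np.linalg.inv(J)
    sigma2 = J_inv @ (psi.T @ psi / n) @ J_inv.T
    return theta, sigma2

# Simplest case: psi(w; theta) = w - theta, so theta_hat is the sample mean
# and sigma2_hat is the sample variance.
w = np.array([1.0, 2.0, 3.0, 4.0])
theta_hat, sigma2_hat = linear_score_estimate(-np.ones((4, 1, 1)), w[:, None])
```

A standard error for component \(j\) of \(\hat{\theta}\) is then the square root of the \(j\)-th diagonal entry of `sigma2_hat / n`.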

Creating orthogonal moments

  • Need \[ \partial_\eta\Er\left[\psi(W;\theta_0,\eta_0)\right][\eta-\eta_0] \approx 0 \]

  • Given some model, how do we find a suitable \(\psi\)?

Orthogonal scores via concentrating-out

  • Original model: \[ (\theta_0, \beta_0) = \argmax_{\theta, \beta} \Er[\ell(W;\theta,\beta)] \]
  • Define \[ \eta(\theta) = \beta(\theta) = \argmax_\beta \Er[\ell(W;\theta,\beta)] \]
  • First order condition from \(\max_\theta \Er[\ell(W;\theta,\beta(\theta))]\) is \[ 0 = \Er\left[ \underbrace{\frac{\partial \ell}{\partial \theta} + \frac{\partial \ell}{\partial \beta} \frac{d \beta}{d \theta}}_{\psi(W;\theta,\beta(\theta))} \right] \]
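As a worked illustration, concentrating out the nuisance function in the partially linear model recovers the orthogonal moment condition used earlier. Take the least squares objective (the factor \(1/2\) is an assumed normalization), with \(m(x) = \Er[d|x]\) and \(\mu(x) = \Er[y|x]\):

\[ \begin{align*} \ell(W;\theta,\beta) = & -\frac{1}{2}\left(y - \theta d - \beta(x)\right)^2 \\ \beta_\theta(x) = & \Er[y - \theta d | x] = \mu(x) - \theta m(x), \qquad \frac{d \beta_\theta}{d \theta} = -m(x) \\ \psi(W;\theta,\beta_\theta) = & \left(y - \theta d - \beta_\theta(x)\right) d - \left(y - \theta d - \beta_\theta(x)\right) m(x) \\ = & \left(d - m(x)\right)\left(y - \mu(x) - \theta (d - m(x))\right) \end{align*} \]

The last line is the score in the second moment condition of the partially linear example.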

Orthogonal scores via projection

  • Original model: \(m: \mathcal{W} \times \R^{d_\theta} \times \R^{d_h} \to \R^{d_m}\) \[ \Er[m(W;\theta_0,h_0(Z))|R] = 0 \]
  • Let \(A(R)\) be \(d_\theta \times d_m\) moment selection matrix, \(\Omega(R)\) \(d_m \times d_m\) weighting matrix, and \[ \begin{align*} \Gamma(R) = & \partial_{v'} \Er[m(W;\theta_0,v)|R]|_{v=h_0(Z)} \\ G(Z) = & \Er[A(R)'\Omega(R)^{-1} \Gamma(R)|Z] \Er[\Gamma(R)'\Omega(R)^{-1} \Gamma(R) |Z]^{-1} \\ \mu_0(R) = & A(R)'\Omega(R)^{-1} - G(Z) \Gamma(R)'\Omega(R)^{-1} \end{align*} \]
  • \(\eta = (\mu, h)\) and \[ \psi(W;\theta, \eta) = \mu(R) m(W;\theta, h(Z)) \]

Example: average derivative

  • \(x,y \in \R^1\), \(\Er[y|x] = f_0(x)\), \(p(x) =\) density of \(x\)

  • \(\theta_0 = \Er[f_0'(x)]\)

  • Joint objective \[ \min_{\theta,f} \Er\left[ (y - f(x))^2 + (\theta - f'(x))^2 \right] \]

  • \(f_\theta(x) = \Er[y|x] - \theta \partial_x \log p(x) + f_\theta''(x) + f_\theta'(x) \partial_x \log p(x)\)

  • Concentrated objective: \[ \min_\theta \Er\left[ (y - f_\theta(x))^2 + (\theta - f_\theta'(x))^2 \right] \]

  • First order condition at \(f_\theta = f_0\) gives \[ 0 = \Er\left[ (y - f_0(x))\partial_x \log p(x) + (\theta - f_0'(x)) \right] \]

Example : average derivative with endogeneity

  • \(x,y \in \R^1\), \(p(x) =\) density of \(x\)
  • Model : \(\Er[y - f(x) | z] = 0\), \(\theta_0 = \Er[f_0'(x)]\)

  • Joint objective: \[ \min_{\theta,f} \Er\left[ \Er[y - f(x)|z]^2 + (\theta - f'(x))^2 \right] \]

  • \(f_\theta(x) = (T^* T)^{-1}\left((T^*\Er[y|z])(x) - \theta \partial_x \log p(x)\right)\)
    • where \(T:\mathcal{L}^2_{p} \to \mathcal{L}^2_{\mu_z}\) with \((T f)(z) = \Er[f(x) |z]\)
    • and \(T^*:\mathcal{L}^2_{\mu_z} \to \mathcal{L}^2_{p}\) with \((T^* g)(x) = \Er[g(z) |x]\)
  • Orthogonal moment condition : \[ 0 = \Er\left[ \Er[y - f(x) | z] (T (T^* T)^{-1} \partial_x \log p)(z) + (\theta - f'(x)) \right] \]

Example: average elasticity

  • Demand \(D(p)\), quantities \(q\), instruments \(z\) \[\Er[q-D(p) |z] = 0\]

  • Average elasticity \(\theta = \Er[D'(p)/D(p) | z ]\)

  • Joint objective : \[ \min_{\theta,D} \Er\left[ \Er[q - D(p)|z]^2 + (\theta - D'(p)/D(p))^2 \right] \]

Example: control function

\[ \begin{align*} 0 = & \Er[d - p(x,z) | x,z] \\ 0 = & \Er[y - x\beta - g(p(x,z)) | x,z] \end{align*} \]

Treatment heterogeneity

  • Potential outcomes model
    • Treatment \(d \in \{0,1\}\)
    • Potential outcomes \(y(1), y(0)\)
    • Covariates \(x\)
    • Unconfoundedness or instruments
  • Objects of interest:
    • Conditional average treatment effect \(s_0(x) = \Er[y(1) - y(0) | x]\)
    • Range and other measures of spread of conditional average treatment effect
    • Most and least affected groups

Fixed, finite groups

  • \(G_1, ..., G_K\) finite partition of the support of \(x\)

  • Estimate \(\Er[y(1) - y(0) | x \in G_k]\) as above

  • pros: easy inference, reveals some heterogeneity

  • cons: a poorly chosen partition hides some heterogeneity, and searching over partitions invalidates inference

Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • Chernozhukov, Demirer, et al. (2018)

  • Use machine learning to find partition with sample splitting to allow easy inference

  • Randomly partition sample into auxiliary and main samples

  • Use any method on auxiliary sample to estimate \[S(x) = \widehat{\Er[y(1) - y(0) | x]}\] and \[B(x) = \widehat{\Er[y(0)|x]}\]

Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • Define \(G_k = 1\{\ell_{k-1} \leq S(x) \leq \ell_k\}\)
  • Use main sample to regress with weights \((P(x)(1-P(x)))^{-1}\), where \(P(x) = \Pr(d=1|x)\) is the propensity score, \[ y = \alpha_0 + \alpha_1 B(x) + \sum_k \gamma_k (d-P(x)) 1(G_k) + \epsilon \]

  • \(\hat{\gamma}_k \to_p \Er[y(1) - y(0) | G_k]\)

Best linear projection of CATE

  • Randomly partition sample into auxiliary and main samples

  • Use any method on auxiliary sample to estimate \[S(x) = \widehat{\Er[y(1) - y(0) | x]}\] and \[B(x) = \widehat{\Er[y(0)|x]}\]

  • Use main sample to regress with weights \((P(x)(1-P(x)))^{-1}\) \[ y = \alpha_0 + \alpha_1 B(x) + \beta_0 (d-P(x)) + \beta_1 (d-P(x))(S(x) - \Er[S(x)]) + \epsilon \]

  • \(\hat{\beta}_0, \hat{\beta}_1 \to_p \argmin_{b_0,b_1} \Er[(s_0(x) - b_0 - b_1 (S(x)-\Er[S(x)]))^2]\)
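The weighted regression on the main sample is a weighted least squares problem. The sketch below is illustrative only: the proxies `B` and `S` are set equal to the true functions instead of being estimated on an auxiliary sample, the sample mean of `S` stands in for \(\Er[S(x)]\), and the function name `blp_cate` and the simulated experiment are assumptions.

```python
import numpy as np

def blp_cate(y, d, P, B, S):
    """Weighted regression of y on 1, B, (d - P), (d - P)(S - mean(S)).

    Weights are 1 / (P (1 - P)). Returns the coefficients (beta_0, beta_1).
    """
    w = 1.0 / (P * (1.0 - P))
    S_c = S - S.mean()                     # sample mean stands in for E[S(x)]
    Z = np.column_stack([np.ones_like(y), B, d - P, (d - P) * S_c])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return coef[2], coef[3]

# Hypothetical randomized experiment with P(x) = 0.5 and CATE s_0(x) = 1 + 2x.
rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
P = np.full(n, 0.5)
d = (rng.uniform(size=n) < P).astype(float)
s0 = 1.0 + 2.0 * x
y = x + d * s0 + rng.normal(size=n)
b0, b1 = blp_cate(y, d, P, B=x, S=s0)      # proxies equal the truth here
```

With a perfect proxy, `b0` should be close to the average treatment effect (1 in this design) and `b1` close to 1; a proxy unrelated to \(s_0(x)\) would give `b1` near 0.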

Inference on CATE

  • Inference on \(\Er[y(1) - y(0) | x] = s_0(x)\) is challenging when \(x\) is high dimensional and/or there are few restrictions on \(s_0\)

  • Pointwise results for random forests : Wager and Athey (2018), Athey, Tibshirani, and Wager (2016)

  • Recent review of high dimensional inference : Belloni, Chernozhukov, Chetverikov, Hansen, et al. (2018)

Random forest asymptotic normality

  • Wager and Athey (2018)

  • \(\mu(x) = \Er[y|x]\)

  • \(\hat{\mu}(x)\) estimate from honest random forest

    • honest \(=\) trees independent of outcomes being averaged

    • sample-splitting or trees formed using another outcome

  • Then \[ \frac{\hat{\mu}(x) - \mu(x)}{\hat{\sigma}_n(x)} \leadsto N(0,1) \]
    • \(\hat{\sigma}_n(x) \to 0\) slower than \(n^{-1/2}\)

Random forest asymptotic normality

  • Pointwise result, how to do inference on:
    • \(H_0: \mu(x_1) = \mu(x_2)\)
    • \(\{x: \mu(x) \geq 0 \}\)
    • \(\Pr(\mu(x) \leq 0)\)

Uniform inference

  • Belloni, Chernozhukov, Chetverikov, Hansen, et al. (2018)
  • Belloni, Chernozhukov, Chetverikov, and Wei (2018)


