\documentclass[11pt,reqno]{amsart}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
%\usepackage{epstopdf}
\usepackage{hyperref}
\usepackage[left=1in,right=1in,top=0.9in,bottom=0.9in]{geometry}
\usepackage{multirow}
\usepackage{verbatim}
\usepackage{fancyhdr}
\usepackage{harvard}
%\usepackage[small,compact]{titlesec}
%\usepackage{pxfonts}
%\usepackage{isomath}
\usepackage{mathpazo}
%\usepackage{arev} % (Arev/Vera Sans)
%\usepackage{eulervm} %_ (Euler Math)
%\usepackage{fixmath} % (Computer Modern)
%\usepackage{hvmath} %_ (HV-Math/Helvetica)
%\usepackage{tmmath} %_ (TM-Math/Times)
%\usepackage{cmbright}
%\usepackage{ccfonts} \usepackage[T1]{fontenc}
%\usepackage[garamond]{mathdesign}
\usepackage{color}
\usepackage[normalem]{ulem}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{conjecture}{Conjecture}[section]
\newtheorem{corollary}{Corollary}[section]
\newtheorem{lemma}{Lemma}[section]
\newtheorem{proposition}{Proposition}[section]
\newtheorem{assumption}{}[section]
\renewcommand{\theassumption}{A\arabic{assumption}}
\theoremstyle{definition}
\newtheorem{definition}{Definition}[section]
\newtheorem{step}{Step}[section]
\newtheorem{remark}{Comment}[section]
\newtheorem{example}{Example}[section]
\newtheorem*{example*}{Example}
\linespread{1.1}
\pagestyle{fancy}
%\renewcommand{\sectionmark}[1]{\markright{#1}{}}
\fancyhead{}
\fancyfoot{}
%\fancyhead[LE,LO]{\tiny{\thepage}}
\fancyhead[CE,CO]{\tiny{\rightmark}}
\fancyfoot[C]{\small{\thepage}}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancypagestyle{plain}{%
\fancyhf{} % clear all header and footer fields
\fancyfoot[C]{\small{\thepage}} % except the center
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}}
\makeatletter
\renewcommand{\@maketitle}{
\null
\begin{center}%
\rule{\linewidth}{1pt}
{\Large \textbf{\textsc{\@title}}} \par
{\small \textsc{Paul Schrimpf}} \par
{\small \textsc{\@date}} \par
{\small \textsc{University of British Columbia}} \par
{\small \textsc{Economics 628: Topics in econometrics}} \par
\rule{\linewidth}{1pt}
\end{center}%
\par \vskip 0.9em
}
\makeatother
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\argmin}{\operatornamewithlimits{arg\,min}}
\def\inprobLOW{\rightarrow_p}
\def\inprobHIGH{\,{\buildrel p \over \rightarrow}\,}
\def\inprob{\,{\inprobHIGH}\,}
\def\indist{\,{\buildrel d \over \rightarrow}\,}
\def\F{\mathbb{F}}
\def\R{\mathbb{R}}
\newcommand{\gmatrix}[1]{\begin{pmatrix} {#1}_{11} & \cdots &
{#1}_{1n} \\ \vdots & \ddots & \vdots \\ {#1}_{m1} & \cdots &
{#1}_{mn} \end{pmatrix}}
\newcommand{\iprod}[2]{\left\langle {#1} , {#2} \right\rangle}
\newcommand{\norm}[1]{\left\Vert {#1} \right\Vert}
\newcommand{\abs}[1]{\left\vert {#1} \right\vert}
\renewcommand{\det}{\mathrm{det}}
\newcommand{\rank}{\mathrm{rank}}
\newcommand{\spn}{\mathrm{span}}
\newcommand{\row}{\mathrm{Row}}
\newcommand{\col}{\mathrm{Col}}
\renewcommand{\dim}{\mathrm{dim}}
\newcommand{\prefeq}{\succeq}
\newcommand{\pref}{\succ}
\newcommand{\seq}[1]{\{{#1}_n \}_{n=1}^\infty }
\renewcommand{\to}{{\rightarrow}}
\providecommand{\En}{\mathbb{E}_n}
\providecommand{\Gn}{\mathbb{G}_n}
\providecommand{\Er}{{\mathrm{E}}}
\renewcommand{\Pr}{{\mathrm{P}}}
\providecommand{\set}[1]{\left\{#1\right\}}
\providecommand{\plim}{\operatornamewithlimits{plim}}
\newcommand\indep{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\setbox0\hbox{$#1#2$}%
\copy0\kern-\wd0\mkern4mu\box0}}
\renewcommand{\cite}{\citeasnoun}
\title{Treatment heterogeneity}
\date{\today}
\begin{document}
\maketitle
Consider some experimental treatment, such as taking a drug or
attending a job training program. It is very likely that different
people respond differently to the treatment. For example, with the
training program, some people may earn the same whether or not they
receive the training, while other people's earnings may be much
greater with the training than without. Recognizing this seemingly
simple fact greatly affects how we can interpret instrumental variable
estimates of the effect of the treatment.
\section{Context}
In addition to being empirically relevant, treatment heterogeneity has
been important in the development of econometric thought.
The thing that distinguishes econometrics from statistics more than
anything else is that econometrics focuses far more on estimating
causal relationships from observational data. Traditional econometrics
focuses on combining economic theory with observational data to infer
causal effects. Simultaneous equation methods to estimate e.g.\ demand and
supply, and the Heckman selection model to estimate e.g.\ the effect
of education on earnings are canonical examples of this
approach. Roughly in the 1980s, some researchers grew increasingly
skeptical of this approach. Their view was that many traditional
econometric models made assumptions that were too strong. There was a
recognition that some of the basic assumptions of idealized economic
theory may not hold. Moreover, many traditional econometric
models invoked functional form and distributional assumptions for
tractability. These assumptions are difficult to defend. Additionally,
people became aware that these assumptions can lead to erroneous
estimates. An influential paper by \cite{lalonde1986} compared the
estimated effect of a job training program obtained from a randomized
experiment to various non-experimental estimates. He found that the
non-experimental estimates were sensitive to auxiliary assumptions,
and often did not agree with the experimental estimate. Results such
as this led some economists to reject the traditional approach to
econometrics and instead think of causality as only what could be
estimated in an idealized experiment. This approach to econometrics is
sometimes called the reduced form approach. The traditional approach
to econometrics is called the structural approach.

Naturally, there has been some tension between adherents to each of
these two approaches. This tension has helped spur progress in both
approaches. Since the 1980s, reduced form advocates have greatly
clarified exactly what they are estimating. Meanwhile, structural
advocates have greatly relaxed functional form and distributional
assumptions, and clarified to what extent identification comes from
data and to what extent identification comes from other
assumptions. Many of the advances on both fronts came from thinking
about models with heterogeneous treatment effects.
\section{Setup}
I am going to shamelessly use Imbens's slides on IV with treatment
heterogeneity in lecture, so I will follow that notation here. We have
a cross section of observations indexed by $i$. There is a treatment
$W_i \in \mathcal{W}$. To begin with we will focus on binary
treatments, so $W_i \in \{0,1\}$.
Later, we will look at multi-valued treatments. Associated
with each treatment is a potential outcome, $Y_i(W_i)$, where
$Y_i:\mathcal{W} \to \mathcal{Y}$. $Y_i$ is a function from treatments
to potential outcomes. We only observe one of these outcomes,
$Y_i(W_i)$, but we are interested in the effect of treatment, which is
just the difference in potential outcomes,
\[ Y_i(1) - Y_i(0). \]
Of course, we cannot estimate $Y_i(1) - Y_i(0)$ for each individual
without some unrealistically strong assumptions. However, we can come
up with reasonable assumptions to estimate e.g.\
\[ \Er[Y_i(1) - Y_i(0) ] = \mathrm{ATE} \]
This quantity is called the average treatment effect, and is often
abbreviated ATE. A related quantity of interest is the average effect
of treatment for those that receive treatment.
\[ \Er[Y_i(1) - Y_i(0) | W_i = 1 ] = \mathrm{ATT} \]
This is called the average effect of treatment on the treated.
When could we estimate the ATE and ATT? Well, the simplest case is if
we have a randomized experiment. That is, suppose $W_i$ is randomly
assigned, independent of $Y_i(1)$ and $Y_i(0)$. Then
\[ \Er[Y_i(1) | W_i = 1] = \Er[Y_i(1)] \]
and
\[ \Er[Y_i(0) | W_i = 0] = \Er[Y_i(0)]. \]
So we can estimate the average treatment effect by\footnote{I use
$\En[x_i] = \frac{1}{n} \sum_{i=1}^n x_i$ to denote the empirical
expectation of $x_i$. }
\[ \En[Y_i(1)|W_i = 1] - \En[Y_i(0) | W_i = 0 ]. \]
Also, it is easy to see that the average treatment effect is the same
as the average treatment effect for the treated.
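To make this concrete, here is a small simulation sketch (my own illustration, with made-up parameters, not part of the original notes) showing that under random assignment the difference in sample means recovers the ATE even when treatment effects are heterogeneous:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y0 = rng.normal(0, 1, n)             # potential outcome Y_i(0)
y1 = y0 + rng.normal(2, 1, n)        # Y_i(1); heterogeneous effects, true ATE = 2
w = rng.integers(0, 2, n)            # randomly assigned treatment
y = np.where(w == 1, y1, y0)         # only one potential outcome is observed

ate_hat = y[w == 1].mean() - y[w == 0].mean()
print(ate_hat)                       # close to the true ATE of 2
```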
If we do not have a randomized experiment, but we do have an
instrument $Z_i$ such that $Z$ affects $W$ but not $Y$, then with some
assumptions, we can estimate the average treatment effect using IV. In
particular, suppose
\[ Y_i = \beta_0 + \beta_1 W_i + \epsilon_i. \]
Also assume that $Z_i$ is independent of the potential outcomes and
potential treatments.
\begin{assumption}[Independence] \label{a:indep}
Let $Z_i \in \mathcal{Z}$ and $W_i: \mathcal{Z} \to \{0,1\}$. Then
$Z_i$ is independent of $(Y_i(0),Y_i(1), W_i)$, which we
denote \[ Z_i \indep \left(Y_i(0),Y_i(1),W_i \right). \]
\end{assumption}
It is important to emphasize the fact that $W_i$ is a function
of $Z_i$. $Z_i$ affects the observed treatment through this function,
but the distribution of the function is independent of $Z_i$. In
particular, things such as
$W_i(z_1) - W_i(z_2)$ for two particular values of the instrument, $z_1,z_2$
are independent of $Z_i$. Note that this
is a slight change in notation compared to earlier. Earlier, $W_i$ was
just the observed treatment, $Y_i$ was a function of $W_i$, and
$Y_i(W_i)$ was the observed outcome. Now, $W_i$ is also a function and
$W_i(Z_i)$ is the observed treatment. Henceforth, we will let lower case
letters, $y_i = Y_i(W_i(Z_i))$ and $w_i = W_i(Z_i)$, denote the
observed outcome and treatment.
Throughout we have also been assuming the following exclusion.
\begin{assumption}[Exclusion] \label{a:ex}
$Y_i$ is a function of only $W_i(Z_i)$, and not $Z_i$ directly.
\end{assumption}
This assumption is built into our notation, but it is good to state it
explicitly, so that we do not forget that we are making it.
The third assumption that we need is that the instrument is relevant.
\begin{assumption}[Instrument relevance]\label{a:relevance}
$\Er[W_i(z)]$ is a nonconstant function of $z$.
\end{assumption}
Then
\[ \hat{\beta}_1^{IV} = \frac{\En\left[y_i(Z_i - \En[Z_i])\right]}
{\En\left[w_i(Z_i - \En[Z_i])\right]} \]
is a consistent estimate of the average treatment effect and the
average treatment effect on the treated.
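As a quick check (an illustrative sketch with arbitrary made-up numbers, not from the original notes), the sample analogue of this formula recovers $\beta_1$ in a simulated model with a constant treatment effect and a confounded treatment, while OLS does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
z = rng.integers(0, 2, n)                      # binary instrument
u = rng.normal(0, 1, n)                        # unobserved confounder
w = (0.5 * z + u + rng.normal(0, 1, n) > 0.5).astype(float)
eps = u + rng.normal(0, 1, n)                  # correlated with w, not with z
y = 1.0 + 2.0 * w + eps                        # constant effect beta_1 = 2

# sample analogue of En[y (z - En z)] / En[w (z - En z)]
beta1_iv = np.mean(y * (z - z.mean())) / np.mean(w * (z - z.mean()))
beta1_ols = np.cov(y, w)[0, 1] / np.var(w)     # biased upward by the confounder
print(beta1_iv, beta1_ols)
```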
\section{Local average treatment effects}
From the previous paragraph, we see that IV consistently estimates the
average treatment effect when the treatment effect is
homogeneous. What happens when the treatment effect is heterogeneous?
For ease of exposition let's assume that $Z_i$ is also binary. Then
the plim of the IV estimate can be written as\footnote{This uses the
Wald estimate formula for IV with a binary instrument. We went
through this in class, but I'm not going to write it here. See e.g.\
\cite{angristPischke2009} for a derivation.}
\begin{align}
\plim \hat{\beta}^{IV}_1 = & \frac{\Er[Y_i(w_i) | Z_i = 1] -
\Er[Y_i(w_i)|Z_i = 0]} {\Er[w_i|Z_i = 1] - \Er[w_i|Z_i = 0]} \notag
\\
= & \frac{\Er[Y_i(1)w_i + Y_i(0)(1-w_i) | Z_i = 1] -
\Er[Y_i(1)w_i + Y_i(0)(1-w_i)|Z_i = 0]} {\Er[w_i|Z_i = 1] -
\Er[w_i|Z_i = 0]} \notag \\
= & \frac{\Er[Y_i(1)W_i(1) - Y_i(0)W_i(1)] - \Er[Y_i(1)W_i(0) - Y_i(0)W_i(0)]}
{\Er[w_i|Z_i = 1] - \Er[w_i|Z_i = 0]} \notag \\
= & \frac{\Er\left[ \left(Y_i(1) - Y_i(0)\right)
\left(W_i(1)-W_i(0)\right)\right] }
{\Er[w_i|Z_i = 1] - \Er[w_i|Z_i = 0]} \notag \\
= & \frac{\Pr(\Delta W_i =1)\Er\left[Y_i(1) - Y_i(0)| \Delta W_i = 1
\right] - \Pr(\Delta W_i = -1)\Er\left[Y_i(1) - Y_i(0)|
\Delta W_i = -1 \right] }
{\Pr(\Delta W_i = 1) - \Pr(\Delta W_i = -1)} \label{e:biv}
\end{align}
where $\Delta W_i = W_i(1) - W_i(0)$ is the change in treatment when
the instrument changes from $0$ to $1$. The expressions $\Er[Y_i(1) -
Y_i(0) | \Delta W_i = 1]$ and $\Er[Y_i(1) - Y_i(0) | \Delta W_i =
-1]$ are average treatment effects conditional on $W$ changing when
the instrument changes. This is useful because although these
conditional expectations are not the average treatment effect or the
average treatment effect on the treated, they are average treatment
effects for certain subgroups. However, $\beta^{IV}_1$ does not
estimate these conditional expectations separately. It only estimates
the weighted sum in (\ref{e:biv}). Also notice that even if $\Er[Y_i(1) -
Y_i(0) | \Delta W_i = 1]$ and $\Er[Y_i(1) - Y_i(0) | \Delta W_i =
-1]$ are both positive, $\beta_1^{IV}$ can be positive, negative, or
zero. Without a further restriction, the IV estimate might not even
have a meaningful sign.
Fortunately, there is a reasonable restriction that can be made. In
many cases, we think of instruments that have a monotonic effect on
the probability of receiving treatment. For example, in the military
service application of \cite{angrist1990} that we talked about in class,
it is sensible to assume that lower draft numbers only increase the
probability of serving in the military. In other words, it can be
reasonable to assume that $\Delta W_i \geq 0$.
\begin{assumption}[Monotone instrument] \label{a:monot}
$W_i(1) \geq W_i(0)$
\end{assumption}
This means that there
are no people who receive treatment when the instrument is 0, but do
not receive treatment when the instrument is 1. In general, we can
divide the population into four groups:
\begin{enumerate}
\item Always takers always receive treatment, $W_i(1) = W_i(0) = 1$
\item Never takers never receive treatment, $W_i(1) = W_i(0) = 0$
\item Compliers receive treatment only when the instrument is $1$,
$W_i(1) = 1$, $W_i(0) = 0$.
\item Deniers receive treatment only when the instrument is $0$,
$W_i(1) = 0$, $W_i(0) = 1$.
\end{enumerate}
If we assume that there are no deniers, then
\begin{align}
\plim \hat{\beta}^{IV}_1 = & \frac{\Pr(\Delta W_i =1)\Er\left[Y_i(1)
- Y_i(0)| \Delta W_i = 1 \right]}
{\Pr(\Delta W_i = 1)} \notag \\
= & \Er[Y_i(1) - Y_i(0) | \Delta W_i = 1] \label{e:late}.
\end{align}
This expression is what \cite{angristImbens1994} call the local
average treatment effect, abbreviated LATE. It is the average
treatment effect for compliers.
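The following simulation sketch (my own, with made-up parameters) illustrates the point: with heterogeneous effects and a monotone instrument, the Wald ratio recovers the average effect for compliers, which can differ from the ATE:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
z = rng.integers(0, 2, n)
u = rng.uniform(0, 1, n)                 # latent type
w1 = (u < 0.9).astype(float)             # W_i(1)
w0 = (u < 0.5).astype(float)             # W_i(0); no deniers by construction
w = np.where(z == 1, w1, w0)
y0 = rng.normal(0, 1, n)
y1 = y0 + 2 * u                          # heterogeneous effect Y(1) - Y(0) = 2u
y = np.where(w == 1, y1, y0)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (w[z == 1].mean() - w[z == 0].mean())
late_true = (2 * u[(w1 == 1) & (w0 == 0)]).mean()   # compliers: u in [0.5, 0.9)
ate_true = (2 * u).mean()
print(wald, late_true, ate_true)   # wald matches the LATE (~1.4), not the ATE (~1.0)
```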
\subsection{Representativeness of compliers}
One natural question is how similar the compliers are to the rest of
the population. There is no definitive way to answer this question,
but you can get some idea by comparing the compliers to the always
takers and the never takers. We can estimate the portion of always
takers, compliers, and never takers as follows. Let $a$ denote always
takers, $n$ never takers, and $c$ compliers.
\begin{align*}
\Er[w_i|Z_i = 0] = & \Er[w_i | Z_i=0, a]\Pr(a|Z_i=0) + \Er[w_i |
Z_i=0, n]\Pr(n|Z_i=0) + \Er[w_i | Z_i=0, c]\Pr(c|Z_i=0) \\
= & \Pr(a|Z_i=0)
\end{align*}
The second line follows from the fact that by definition compliers and
never takers have $W_i = 0$ when $Z_i = 0$, and always takers have
$w_i = 1$. Now an always taker is just someone with $W_i(1)=W_i(0) =
1$. Assumption \ref{a:indep} says that the function $W_i$ is
independent of $Z_i$. Always takers are defined by $W_i$, therefore
being an always taker (or never taker or complier) is independent of
$Z_i$, and $\Pr(a) = \Pr(a|Z_i)$. Thus,
\[ \Pr(a) = \Er[w_i|Z_i=0]. \]
Identical reasoning\footnote{It might be a useful exercise to write
out the argument.} shows that
\[ \Pr(n) = 1-\Er[w_i | Z_i = 1], \]
and
\[ \Pr(c) = 1 - \Pr(n) - \Pr(a) = \Er[w_i | Z_i=1] - \Er[w_i|Z_i = 0]. \]
This is useful, but our interpretation of any given
local average treatment effect likely depends on $\Pr(c)$. If we know
that the compliers are most of the population ($\Pr(c)$ is near 1),
then we should expect that the LATE is near the ATE (although the
difference can still be arbitrarily large if $\mathcal{Y}$ is
unbounded).
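The three shares are simple sample analogues. A minimal sketch (my own illustration, continuing the made-up design above with $\Pr(a)=0.5$, $\Pr(n)=0.1$, $\Pr(c)=0.4$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
z = rng.integers(0, 2, n)
u = rng.uniform(0, 1, n)
w1 = (u < 0.9).astype(float)    # W_i(1); so Pr(a)=0.5, Pr(c)=0.4, Pr(n)=0.1
w0 = (u < 0.5).astype(float)    # W_i(0)
w = np.where(z == 1, w1, w0)

pr_a = w[z == 0].mean()                      # E[w | Z=0]
pr_n = 1 - w[z == 1].mean()                  # 1 - E[w | Z=1]
pr_c = w[z == 1].mean() - w[z == 0].mean()
print(pr_a, pr_n, pr_c)                      # near 0.5, 0.1, 0.4
```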
We can get an even better idea of how the compliers compare to the rest
of the population by comparing $\Er[Y_i(0) | c]$ with $\Er[Y_i(0)|n]$,
and $\Er[Y_i(1)|c]$ with $\Er[Y_i(1)|a]$. We have already shown that
$\Er[Y_i(1) - Y_i(0)|c]$ is identified. Now we will show that
$\Er[Y_i(0) | c]$ , $\Er[Y_i(0)|n]$,
$\Er[Y_i(1)|c]$, and $\Er[Y_i(1)|a]$ can be identified.
First note that by the independence assumption (\ref{a:indep}),
\begin{align*}
\Er[Y_i(1)|a] = & \Er[Y_i(1)|W_i(1) = W_i(0) = 1] \\
= & \Er[Y_i(1)|W_i(1) = W_i(0) = 1, Z_i = 1] = \Er[Y_i(1)|W_i(1)
= W_i(0) = 1, Z_i = 0]
\end{align*}
Also, since anyone with $w_i = 1$ and $Z_i=0$ is an always taker,
\begin{align*}
\Er[Y_i(1)|W_i(1)= W_i(0) = 1, Z_i = 0] = \Er[Y_i(w_i) | w_i = 1,
Z_i = 0].
\end{align*}
Thus,
\[ \Er[Y_i(1)|a] = \Er[y_i | w_i = 1, Z_i = 0] \]
is identified. Similarly,\footnote{It might be a useful exercise to
write out the argument.}
\[ \Er[Y_i(0)|n] = \Er[y_i | w_i = 0, Z_i = 1] \]
is identified. Now observe that
\begin{align*}
\Er[y_i|w_i=1,Z_i=1] = & \Er[y_i|w_i=1,Z_i=1,c] \Pr(c|w_i=1,Z_i=1) +
\Er[y_i|w_i=1,Z_i=1,a]
\Pr(a|w_i=1,Z_i=1) \\
= & \Er[Y_i(1)|c] \Pr(c|w_i=1,Z_i=1) + \Er[Y_i(1)|a] \Pr(a|w_i=1,Z_i=1)
\end{align*}
$w_i=1$ when $Z_i=1$ only for compliers and always takers, so
\begin{align*}
\Pr(c|w_i=1,Z_i=1) = &
\frac{\Pr(c|Z_i=1)}{\Pr(c|Z_i=1)+\Pr(a|Z_i=1)} \\
= & \frac{\Pr(c)}{\Pr(c)+\Pr(a)}
\end{align*}
Thus,
\begin{align*}
\Er[y_i|w_i=1,Z_i=1] = &
\Er[Y_i(1)|c] \frac{\Pr(c)}{\Pr(c)+\Pr(a)}
+ \Er[Y_i(1)|a] \frac{\Pr(a)}{\Pr(c)+\Pr(a)}
\end{align*}
and
\begin{align*}
\Er[Y_i(1)|c] = & \Er[y_i|w_i=1,Z_i=1] \frac{\Pr(c)+\Pr(a)}{\Pr(c)}
-\Er[Y_i(1)|a] \frac{\Pr(a)}{\Pr(c)} \\
= &\Er[y_i|w_i=1,Z_i=1]\frac{\Pr(c)+\Pr(a)}{\Pr(c)} - \Er[y_i |
w_i = 1, Z_i = 0]\frac{\Pr(a)}{\Pr(c)}.
\end{align*}
Similarly,
\begin{align*}
\Er[Y_i(0) | c] = \Er[y_i|w_i=0,Z_i=0]\frac{\Pr(c)+\Pr(n)}{\Pr(c)}
- \Er[y_i |
w_i = 0, Z_i = 1]\frac{\Pr(n)}{\Pr(c)}.
\end{align*}
So you can estimate and compare $\Er[Y_i(0)|c]$ with $\Er[Y_i(0)|n]$ and
$\Er[Y_i(1)|c]$ with $\Er[Y_i(1)|a]$. For an example of this see
\cite{imbensWooldridge2007}, which we talked about in lecture.
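These formulas translate directly into sample analogues. The sketch below (my own, with made-up parameters chosen so that always takers and compliers genuinely differ) recovers $\Er[Y_i(1)|a]$ and $\Er[Y_i(1)|c]$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
z = rng.integers(0, 2, n)
u = rng.uniform(0, 1, n)
w1, w0 = (u < 0.9).astype(float), (u < 0.5).astype(float)
w = np.where(z == 1, w1, w0)
y1 = 2 * u + rng.normal(0, 0.5, n)    # treated outcome differs across types
y0 = rng.normal(0, 0.5, n)
y = np.where(w == 1, y1, y0)

pr_a = w[z == 0].mean()
pr_c = w[z == 1].mean() - pr_a
ey1_a = y[(w == 1) & (z == 0)].mean()            # E[Y(1) | always taker]
ey1_c = (y[(w == 1) & (z == 1)].mean() * (pr_c + pr_a) / pr_c
         - ey1_a * pr_a / pr_c)                  # E[Y(1) | complier]
truth_a = (2 * u[u < 0.5]).mean()                # about 0.5
truth_c = (2 * u[(u >= 0.5) & (u < 0.9)]).mean() # about 1.4
print(ey1_a, ey1_c, truth_a, truth_c)
```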
\subsection{Multi-valued instruments}
In our analysis of LATE above we assumed that the instrument is
binary. If the instrument takes on multiple values, say $Z_i \in
\mathcal{Z}$ then for any pair $z_0,z_1 \in \mathcal{Z}$, we could
repeat the analysis above to show that
\[ LATE(z_0,z_1) = \Er[Y_i(1)-Y_i(0) | W_i(z_1)=1, W_i(z_0) = 0] \]
is identified. Also, as above we could define populations of
compliers, always takers, and never takers for each $z_0,z_1$. Of
course, to do this we need assumption \ref{a:monot} to hold for each
$z_0, z_1$, i.e.\ $W_i(z_0) \leq W_i(z_1)$.
What does $\hat{\beta}^{IV}_1$ estimate when $Z$ is multi-valued?
Well, in general you can't get a nice interpretable expression for
it. However, with some further assumptions you can show that the IV
estimate is a weighted average of $LATE(z_0,z_1)$ across different
values of $z_0$ and $z_1$. \cite{angristImbens1994} state this result
for when $Z$ has discrete support. In the next section, we will give
an analogous result for continuously distributed $Z$.
\section{Continuous instruments and marginal treatment effects}
This section is largely based on \cite{heckmanVytlacil1999} and
\cite{heckmanVytlacil2007b}.
The more structural approach to treatment effects typically treats
treatment assignment as a selection problem. That is, it assumes that
treatment is determined by a latent index,
\[ W_i(Z_i) = 1\{\nu(Z_i) - U_i \geq 0 \}, \]
where $\nu:\mathcal{Z} \to \R$, $U_i$ is some real valued random
variable, and $U_i \indep Z_i$. It is easy to see that a latent index
model implies the
monotonicity assumption (\ref{a:monot}) of LATE. For any $z_0,z_1$,
either $\nu(z_0) \leq \nu(z_1)$ or $\nu(z_0) \geq \nu(z_1)$, and then
either
$W_i(z_0) \leq W_i(z_1)$
or
$W_i(z_0) \geq W_i(z_1)$
for all $i$. On the other hand, it is not clear that the assumptions
of LATE imply the existence of such an index model. In fact, early
papers on LATE emphasized that the LATE framework does not include a
potentially restrictive latent index assumption. You might think that
the latent index model is completely unrestrictive since you can
always let $\nu(z) = P(W_i(Z_i)=1|Z_i=z)$ and make $U$
uniform. However, such a $U$ need not be independent of
$Z$. Nonetheless, it turns out that the four LATE assumptions imply
the existence of a latent index model with $U_i \indep Z_i$. This
result was shown by \cite{vytlacil2002}. This is a useful observation
because there are some results that are easier to show directly from
the LATE assumptions, and other results that are easier to show from
the latent index selection assumption.
Let $\pi(z) = \Pr(w_i=1|Z_i=z)$. As in the previous section we can
define
\begin{align*}
LATE(p_0,p_1 ) = & \frac{\Er[y_i|\pi(Z_i)=p_1] -
\Er[y_i|\pi(Z_i)=p_0]}{p_1 - p_0}
\end{align*}
and we should expect that this is the average treatment effect for a
certain group of compliers. However, this group is a bit complicated
because it involves all $z_1,z_0$ such that $\pi(z_1) = p_1$, and
$\pi(z_0) = p_0$. We can get a more tractable expression by using the
latent index assumption. Notice that
\begin{align*}
\Er[y_i|\pi(Z_i) = p] = & p \Er[Y_i(1) | \pi(Z_i) = p, w_i=1] + (1-p)
\Er[Y_i(0) | \pi(Z_i) = p, w_i=0]
\\
= & \int_{0}^p \Er[Y_i(1)|\tilde{U}_i = u] du +
\int_{p}^1 \Er[Y_i(0)|\tilde{U}_i = u] du
\end{align*}
where $\tilde{U}_i = F_U(U_i)$ is uniformly distributed (if we assume
$U_i$ is absolutely continuous with respect to Lebesgue measure,
something that \cite{vytlacil2002}
shows we can do without loss of generality) and
$U_i \leq \nu(Z_i)$ iff $\tilde{U}_i \leq \pi(Z_i)$. Then,
\begin{align}
LATE(p_0,p_1 ) = & \frac{\int_{p_0}^{p_1} \Er[Y_i(1) -
Y_i(0)|\tilde{U}_i=p]dp} {p_1 - p_0} \notag \\
= & \Er[\Delta Y_i | p_0 \leq \tilde{U}_i \leq p_1 ]. \label{e:ilate}
\end{align}
So $LATE(p_0,p_1)$ is the average treatment effect for people with
$\tilde{U}_i$ between $p_0$ and $p_1$. We can estimate $LATE(p_0,p_1)$
only when we observe $z_0$ and $z_1$ such that $\pi(z_0) = p_0$ and
$\pi(z_1)=p_1$. Also, $LATE(0,1)$ is the average treatment
effect. This is the well known result that in selection models, we can
identify the average treatment effect only if we have an exclusion
with ``large support'' i.e.\ $\exists z_0,z_1\in \mathcal{Z}$ such
that $\pi(z_0) = 0$ and $\pi(z_1) = 1$.
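A simulation sketch of equation (\ref{e:ilate}) (my own illustration, with a made-up $MTE(u) = 1 + 2u$ and two arbitrary propensity values) confirms that $LATE(p_0,p_1)$ averages the MTE over $[p_0,p_1]$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
z = rng.integers(0, 2, n)
p = np.where(z == 1, 0.8, 0.3)            # pi(Z_i): two observed propensities
u = rng.uniform(0, 1, n)                  # the uniform latent variable U-tilde
w = (u <= p).astype(float)                # latent index selection
y0 = rng.normal(0, 1, n)
y = y0 + w * (1 + 2 * u + rng.normal(0, 1, n))   # MTE(u) = 1 + 2u

late_hat = (y[p == 0.8].mean() - y[p == 0.3].mean()) / (0.8 - 0.3)
# average of MTE(u) = 1 + 2u over u in [0.3, 0.8] is 2.1
print(late_hat)
```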
The expression in the integrand of (\ref{e:ilate}),
\[ MTE(p) = \Er[\Delta Y_i | \tilde{U}_i = p ] \]
is called the marginal treatment effect. It is
the effect of treatment for people with $\tilde{U}_i = p$, i.e.\ those
with $U_i = \nu(z)$ where $\pi(z) = p$. These people are
indifferent between receiving treatment or not. We can write pretty
much any other possible treatment effect of interest as an integral of
the marginal treatment effect. For example,
\[ ATE = \int_0^1 MTE(p) dp. \]
We can identify the marginal treatment effect as follows. If we take the
limit as $p_1$ approaches $p_0$ of $LATE(p_0,p_1)$, we get
\begin{align*}
LIV(p_0) = & \lim_{p_1 \to p_0} \frac{\Er[y_i|\pi(Z_i)=p_1] -
\Er[y_i|\pi(Z_i)=p_0]}{p_1 - p_0}
\end{align*}
\cite{heckmanVytlacil1999} call this the local instrumental variables
estimate. It is clear that
\begin{align*}
LIV(p) = & \Er[\Delta Y_i | \tilde{U}_i = p] = MTE(p),
\end{align*}
so LIV is an estimate of MTE.
\subsection{$\beta^{IV}$ as a weighted average of MTE}
We can show that $\beta_1^{IV}$ estimates a weighted average of
marginal treatment effects. Suppose we use some function of $Z_i$,
$g(Z_i)$ as an instrument. Then,
\begin{align*}
\beta_1^{IV}(g) = & \frac{\Er[y_i(g(z_i) - \Er[g(z_i)])]}
{\Er[w_i(g(z_i) - \Er[g(z_i)])]}.
\end{align*}
Following \cite{heckmanVytlacil2007b}, we will deal with the numerator
and denominator separately. Let $\tilde{g}(Z_i) = g(Z_i) -
\Er[g(Z_i)]$. Note that
\begin{align*}
\Er[y_i(g(z_i) - \Er[g(z_i)])] = & \Er\left[\left(Y_i(0) +
w_i(Y_i(1)-Y_i(0)) \right)\tilde{g}(z_i) \right]
\\
= & \Er\left[ w_i\left(Y_i(1)-Y_i(0) \right) \tilde{g}(z_i) \right]
\text{ (independence of $z_i$ and $Y_i(0)$)} \\
= & \Er\left[1\{\tilde{U}_i \leq \pi(z_i)\} \left(\Delta Y_i \right)
\tilde{g}(z_i)\right] \\
= & \Er_U\left[ \Er_Y[\Delta Y_i |
\tilde{U}_i=u ] \Er_Z[ \tilde{g}(z_i) | \pi(z_i) \geq u]
\Pr_Z(\tilde{U}_i \leq \pi(z_i)) \right] \\
= & \int_0^1 MTE(u) \Er_Z[ \tilde{g}(z_i) | \pi(z_i) \geq u] \Pr_Z(u
\leq \pi(z_i)) du
\end{align*}
where the subscripts on expectations and probabilities are simply to
emphasize what the expectation is being taken over. Finally, observe
that $Cov(g(z),W) = Cov(g(z),\pi(z))$, so
\begin{align}
\beta^{IV}(g) = \int_0^1 MTE(u) \omega_g(u) du \label{e:miv}
\end{align}
where
\[ \omega_g(u) = \frac{\Er_Z[ \tilde{g}(z_i) | \pi(z_i) \geq u] \Pr_Z(u
\leq \pi(z_i))}{Cov(g(z),\pi(z))}. \]
It can be shown that these weights integrate to one. Also, if $g(z) =
\pi(z)$,
it is easy to see that the weights are positive. Also, since these
weights depend only on the distribution of $z$ and $w$, they are
estimable. We could estimate
these weights to get some idea of which weighted average of marginal
treatment effects IV is estimating. A final interesting observation is
that $\beta^{IV}(g)$ depends on $g$. In the traditional IV setup, the
choice of $g$ affects efficiency, but it does not affect what is being
estimated.
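A numerical sketch of equation (\ref{e:miv}) (my own, with a made-up three-valued instrument and $MTE(u) = 1 + 2u$; the midpoint grid is just one convenient way to approximate the integral) checks that the weights integrate to one and that $\beta^{IV}(g)$ matches the weighted average of the MTE:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
z = rng.choice([0.2, 0.5, 0.9], n)         # identify z with pi(z) directly
u = rng.uniform(0, 1, n)
w = (u <= z).astype(float)
y = rng.normal(0, 1, n) + w * (1 + 2 * u)  # MTE(u) = 1 + 2u
g = z                                       # instrument g(z) = pi(z)

beta_iv = np.mean(y * (g - g.mean())) / np.mean(w * (g - g.mean()))

# weights omega_g(u) evaluated on a midpoint grid
gt = g - g.mean()
cov_g_pi = np.mean(gt * z)
step = 0.01
grid = np.arange(step / 2, 1, step)
omega = np.array([gt[z >= s].mean() * (z >= s).mean() / cov_g_pi
                  if (z >= s).any() else 0.0 for s in grid])
weighted_mte = ((1 + 2 * grid) * omega * step).sum()
print((omega * step).sum(), beta_iv, weighted_mte)
```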
\section{Policy relevant treatment effects}
This section is based largely on \cite{chv2010}. In the previous
sections we have focused on identifying the effect of administering
some treatment. If we think about evaluating some potential policy,
the average treatment effect is the effect of the policy that forces
everyone to receive treatment. This is often not the most realistic or
relevant policy. The majority of policies do not force people to
undergo some treatment. Instead, policies typically provide
some incentive to receive treatment. For example, attending college is
a treatment that has been widely studied. However, no one thinks that
any government would or should force everyone to attend college. In
light of that, although it may be an interesting thing to think about,
the average treatment effect of college does not have much practical
relevance. The policy interventions with respect to college that we
see, such as direct subsidies and subsidized loans, change the
incentives to go to college. Our current setup gives us a nice way to
think about such policies.
Suppose we observe some baseline policy and want to evaluate an
alternative policy. Define the policy relevant treatment effect as
\[ PRTE = \frac{\Er[y_i|alt] - \Er[y_i|base]}{\Er[w_i|alt] -
\Er[w_i|base]}. \]
More generally, we might be interested in the four conditional
expectations in this expression separately. We define the PRTE as this
ratio so that it has the same form as other treatment effects. It is
the effect of the policy per person induced to receive treatment.
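As a sketch of this definition (my own illustration, with a made-up latent index model where a policy raises the propensity from $0.5$ to $0.7$), the PRTE equals the average MTE over the newly induced group:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
u = rng.uniform(0, 1, n)
y0 = rng.normal(0, 1, n)
gain = 1 + 2 * u + rng.normal(0, 1, n)     # E[gain | u] = MTE(u) = 1 + 2u

def policy(pi):
    """Mean outcome and treatment rate when everyone faces propensity pi."""
    w = (u <= pi).astype(float)
    return (y0 + w * gain).mean(), w.mean()

ey_b, ew_b = policy(0.5)                   # baseline
ey_a, ew_a = policy(0.7)                   # alternative raises the propensity
prte = (ey_a - ey_b) / (ew_a - ew_b)
print(prte)    # average MTE over the induced group, u in (0.5, 0.7], is 2.2
```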
If we assume that the policy only affects $\pi(z)$ and not the
distribution of $Y_i,W_i$ then we can use our observation of the
baseline policy to extrapolate what will happen in the alternate
policy. Let $\pi_b(z)$ denote the baseline probability of treatment and
$\pi_a(z)$ the alternative probability of treatment. Also, let
\[ F_{\pi(z)}(p) = \Pr(\pi(z) \leq p) \]
be the cdf of $\pi(z)$. Then
\begin{align*}
\Er[y_i|base] = & \int_0^1 \Er[y_i|base,\pi(z)=p] dF_{\pi_b(z)}(p) \\
= & \int_0^1 \Er[w_iY_i(1) + (1-w_i) Y_i(0)|\pi(z)=p]dF_{\pi_b(z)}(p)
\\
= & \int_0^1 \left(\int_0^1 1\{p\geq u\}\Er[Y_i(1)|\tilde{U}_i=u] du\right)
dF_{\pi_b(z)}(p) + \\
& + \int_0^1\left( \int_0^1 1\{p < u\}\Er[Y_i(0)|\tilde{U}_i=u] du\right)
dF_{\pi_b(z)}(p)
\end{align*}
The same expression with $F_{\pi_a(z)}$ in place of $F_{\pi_b(z)}$
gives $\Er[y_i|alt]$, so the PRTE is identified from the marginal
treatment effect and the distributions of $\pi_a(z)$ and $\pi_b(z)$.
\section{Series estimation}
\begin{assumption}\label{a.4}
The errors' conditional second moments are uniformly integrable,
\[ \sup_{x \in \mathcal{X}} \Er\left[\epsilon_i^2 1\{\epsilon_i^2 >
M\}|x_i = x \right] \to 0 \text{ as } M \to \infty, \]
and the approximation error obeys $|r(x_i)|\leq \ell_k c_k =
o(\sqrt{n/\xi_k})$.
\end{assumption}
\begin{theorem}[Pointwise asymptotic normality\label{tv2}]
Suppose \ref{a.1}-\ref{a.4} hold. If $R_{1n} \inprob 0$ and $\ell_k
c_k \to 0$, then
\[ \sqrt{n} \frac{\alpha' (\hat{\beta} - \beta)}{\norm{\alpha'
\Omega^{1/2}} } \indist N(0,1), \]
where $\Omega = Q^{-1} \Er[\epsilon_i^2 p(x_i)p(x_i)'] Q^{-1}$.
\end{theorem}
\begin{proof}
See \cite{chernozhukov2009}.
\end{proof}
Note that we can take $\alpha = p(x)$ to obtain
\[ \sqrt{n} \frac{p(x)' (\hat{\beta} - \beta)}{\norm{p(x)'
\Omega^{1/2}} } \indist N(0,1). \]
If additionally $\frac{\sqrt{n} r(x)}{\norm{p(x)'
\Omega^{1/2}} } \to 0$, then we have
\[ \sqrt{n} \frac{p(x)'\hat{\beta} - g(x)}{\norm{p(x)'
\Omega^{1/2}} } \indist N(0,1). \]
This is why the theorem is labeled pointwise asymptotic normality.
Another thing to notice about theorem \ref{tv2} is that it is always
true that a $N(0,1)$ has the same distribution as $\frac{\alpha'
\Omega^{1/2}}{\norm{\alpha' \Omega^{1/2}}} N(0,I_k)$. If $k$ were
fixed we would have
\[\sqrt{n} \Omega^{-1/2}(\hat{\beta} - \beta) \indist N(0,I_k). \]
We cannot get this sort of result here because $k$ is increasing with
$n$. However, to emphasize this parallel, we could have stated the
result of theorem \ref{tv2} as
\[ \sqrt{n} \frac{p(x)'\hat{\beta} - g(x)}{\norm{p(x)'
\Omega^{1/2}} } \indist
\frac{p(x)'\Omega^{1/2}}{\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k. \]
We could also state this result as
\[ \abs{
\sqrt{n} \frac{p(x)'\hat{\beta} - g(x)}{\norm{p(x)'
\Omega^{1/2}} } - \frac{p(x)'\Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k } = o_p(1), \]
for some $\mathcal{N}_k \sim N(0,I_k)$. When we look at the uniform limit
distribution, we will get a result with this form, so it is useful to
draw attention to the similarity. We did not originally state the
theorem in this form to emphasize that theorem \ref{tv2} is really a
result of applying a standard central limit theorem.
To obtain a uniform linearization and asymptotic distribution, we need
a stronger assumption on the errors.
\begin{assumption}\label{a.5}
The errors are conditionally sub-Gaussian, which means
\[ \sup_{x \in \mathcal{X}} \Er\left[ e^{\epsilon_i^2/2} |x_i = x
\right] < \infty. \]
Additionally for $\alpha(x) \equiv p(x)/\norm{p(x)}$, we have
\[ \norm{\alpha(x_1) - \alpha(x_2)} \leq \xi_{1k} \norm{x_1 -
x_2} \]
with $\xi_{1k} \lesssim k^a$ for some $a<\infty$.
\end{assumption}
\begin{lemma}[uniform linearization \label{lv3}]
Suppose that \ref{a.1}-\ref{a.5} hold. Then uniformly in $x \in
\mathcal{X}$,
\[ \sqrt{n}\alpha(x)'(\hat{\beta} - \beta) = \alpha(x)'\Gn\left[p(x_i)
(\epsilon_i + r(x_i)) \right] + R_{1n} \]
where
\[ R_{1n} \lesssim_p \sqrt{\frac{\xi_k^2 (\log n)^2}{n}}
\left(1+\ell_k c_k \sqrt{k \log n} \right) \]
and
\[ \alpha(x)'\Gn[p(x_i)r(x_i)] = R_{2n} \lesssim_p \ell_k c_k \log n \]
\end{lemma}
\begin{proof}
See \cite{chernozhukov2009}.
\end{proof}
\begin{theorem}[uniform rate \label{tv3}]
Under \ref{a.1}-\ref{a.5} we have
\[ \sup_{x \in \mathcal{X}} \abs{\alpha(x)'\Gn\left[p(x_i)
(\epsilon_i + r(x_i)) \right]} \lesssim_p (\log n)^{3/2} \]
so
\[ \sup_{x \in \mathcal{X}} \abs{\hat{g}(x) - g(x)} \lesssim_p
\frac{\xi_k}{\sqrt{n}} \left( (\log n)^{3/2} + R_{1n} + R_{2n}
\right) + \ell_k c_k \]
\end{theorem}
\begin{proof}
See \cite{chernozhukov2009}.
\end{proof}
Finally, we state a uniform convergence in distribution result.
\begin{theorem}[strong approximation \label{tv4}]
Suppose \ref{a.1}-\ref{a.5} hold and $R_{1n} = o_p(a_n^{-1})$, and
that
\[ a_n^3 k^4 \xi_k^2 (1+\ell_k^3 c_k^3) (\log n)^2 / n \to 0 \]
Then for some $\mathcal{N}_k \sim N(0,I_k)$ we have
\[ \sup_{\alpha \in S^{k-1}} \abs{
\sqrt{n} \frac{\alpha'(\hat{\beta} - \beta)}
{\norm{\alpha'\Omega^{1/2}}} - \frac{\alpha'
\Omega^{1/2}}{\norm{\alpha' \Omega^{1/2}}} \mathcal{N}_k }
= o_p(a_n^{-1}) \]
\end{theorem}
As with the pointwise limit theorem \ref{tv2}, if we replace $\alpha$
with $p(x)$ and we have $\sup_{x \in \mathcal{X}} \sqrt{n}
\frac{\abs{r(x)}} {\norm{p(x)'\Omega^{1/2}}} = o_P(a_n^{-1})$, then
theorem \ref{tv4} implies that
\[ \sup_{x \in \mathcal{X}} \abs{
\sqrt{n} \frac{\hat{g}(x) - g(x)}
{\norm{p(x)'\Omega^{1/2}}} - \frac{p(x)'
\Omega^{1/2}}{\norm{p(x)' \Omega^{1/2}}} \mathcal{N}_k }
= o_p(a_n^{-1}). \]
Note that unlike the pointwise asymptotic distribution (\ref{tv2}),
the uniform limiting theory is not a traditional weak convergence
result. For a given $x$, regardless of $k$, $\frac{p(x)'
\Omega^{1/2}}{\norm{p(x)' \Omega^{1/2}}} \mathcal{N}_k$ has a
standard normal distribution. However, as a function of $x$, the
Gaussian process $\frac{p(x)' \Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k$ changes with $k$. Theorem
\ref{tv4} says nothing about whether $\frac{p(x)' \Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k$ ever converges to a fixed
Gaussian process, so in particular, the theorem does not show weak
convergence. Nonetheless, for any $k$, $\frac{p(x)' \Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k$ is a tractable process and
we can find its distribution either analytically or through
simulation. This is enough to perform inference.
To get some idea of how this approximating process behaves, figure
\ref{fig:covfuncp} shows the covariance function of $\frac{p(x)'
\Omega^{1/2}} {\norm{p(x)'\Omega^{1/2}}} \mathcal{N}_k$ for $d=1$ and
polynomials for various $k$. That is, it plots
\[ \mathrm{Cov}\left(\frac{p(x_1)'
\Omega^{1/2}} {\norm{p(x_1)'\Omega^{1/2}}} \mathcal{N}_k , \frac{p(x_2)'
\Omega^{1/2}} {\norm{p(x_2)'\Omega^{1/2}}} \mathcal{N}_k \right) \]
as a function of $x_1$ and $x_2$. When $x_1 = x_2$ the variance is
always one. When $x_1 \neq x_2$, the covariance approaches $0$ as $k$
increases. The approximating processes eventually converge to white
noise. However, we cannot perform inference based on white noise as
the limiting distribution because we have not shown how quickly the
approximating processes approach white noise.
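This covariance function is easy to compute numerically: since
$\Omega^{1/2}\mathcal{N}_k$ has variance $\Omega$, the covariance equals
$p(x_1)'\Omega p(x_2)$ divided by the product of the two norms. A minimal
numerical sketch, assuming $x$ uniform on $[0,1]$, homoskedastic
unit-variance errors (so $\Omega = E[p(x)p(x)']^{-1}$), and a power
polynomial basis; the function names are my own:

```python
import numpy as np

def poly_basis(x, k):
    # p(x) = (1, x, ..., x^{k-1}) evaluated at each point in x
    return np.vander(x, k, increasing=True)

def cov_function(k, ngrid=200):
    # Covariance of Z(x) = p(x)' Omega^{1/2} N_k / ||p(x)' Omega^{1/2}||, i.e.
    # p(x1)' Omega p(x2) / (||p(x1)' Omega^{1/2}|| ||p(x2)' Omega^{1/2}||).
    # Assumes x ~ U[0,1] and Var(eps) = 1, so Omega = E[p p']^{-1}.
    xg = np.linspace(0.0, 1.0, ngrid)
    P = poly_basis(xg, k)
    Q = P.T @ P / ngrid              # Riemann approximation of E[p(x) p(x)']
    A = P @ np.linalg.solve(Q, P.T)  # p(x1)' Omega p(x2) on the grid
    s = np.sqrt(np.diag(A))
    return A / np.outer(s, s)        # normalized, so the diagonal is 1
```

Plotting `cov_function(k)` as a surface over $(x_1,x_2)$ for increasing
$k$ reproduces the pattern just described: unit variance on the diagonal,
with off-diagonal covariances shrinking toward zero.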
\begin{figure}\caption{Covariance function for
polynomials \label{fig:covfuncp}}
\begin{tabular}{cccc}
$k = 2$ & $k=4$ & $k=6$ & $k=8$ \\
\includegraphics[width=0.25\linewidth]{figures/poly02} &
\includegraphics[width=0.25\linewidth]{figures/poly04} &
\includegraphics[width=0.25\linewidth]{figures/poly06} &
\includegraphics[width=0.25\linewidth]{figures/poly08} \\
$k = 10$ & $k=13$ & $k=17$ & $k=20$ \\
\includegraphics[width=0.25\linewidth]{figures/poly10} &
\includegraphics[width=0.25\linewidth]{figures/poly13} &
\includegraphics[width=0.25\linewidth]{figures/poly17} &
\includegraphics[width=0.25\linewidth]{figures/poly20}
\end{tabular}
\end{figure}
Figure \ref{fig:covfunct} shows the same thing for Fourier
series.
\begin{figure}\caption{Covariance function for Fourier
series \label{fig:covfunct}}
\begin{tabular}{cccc}
$k = 2$ & $k=4$ & $k=6$ & $k=8$ \\
\includegraphics[width=0.25\linewidth]{figures/trig02} &
\includegraphics[width=0.25\linewidth]{figures/trig04} &
\includegraphics[width=0.25\linewidth]{figures/trig06} &
\includegraphics[width=0.25\linewidth]{figures/trig08} \\
$k = 10$ & $k=13$ & $k=17$ & $k=20$ \\
\includegraphics[width=0.25\linewidth]{figures/trig10} &
\includegraphics[width=0.25\linewidth]{figures/trig13} &
\includegraphics[width=0.25\linewidth]{figures/trig17} &
\includegraphics[width=0.25\linewidth]{figures/trig20}
\end{tabular}
\end{figure}
A uniform confidence band for $g(x)$ of level $1-\alpha$ is a pair of
functions, $l_k(x)$ and $u_k(x)$, such that
\[ \Pr\left(l_k(x) \leq g(x) \leq u_k(x) \; \forall x \in
\mathcal{X}\right) = 1-\alpha. \]
There are a number of ways to construct such bands, but it is standard
to focus on bands of the form
\[ (l_k(x),u_k(x)) = \hat{g}(x) \pm \kappa(1-\alpha)
\frac{\norm{p(x)'\Omega^{1/2}}}{\sqrt{n}} \]
where $\kappa(1-\alpha)$ is chosen so as to get the correct coverage
probability. From theorem \ref{tv4}, we know that $\sqrt{n}
\frac{\hat{g}(x) - g(x)} {\norm{p(x)'\Omega^{1/2}}} $ is uniformly
close to $\frac{p(x)'\Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}}\mathcal{N}_k$. Let
\[ Z_n(x) \equiv \frac{p(x)'\Omega^{1/2}}
{\norm{p(x)'\Omega^{1/2}}}\mathcal{N}_k. \]
We can set $\kappa(1-\alpha)$
to be the $(1-\alpha)$ quantile of
\[ \sup_{x \in \mathcal{X}} \abs{Z_n(x)}. \]
There are analytic results for this, which are useful for comparing
the widths of these confidence bands to confidence bands from other
methods or of other estimators. However, in practice, it is easier to
use simulation. Thus, you could compute confidence bands by:
\begin{enumerate}
\item Estimate $\hat{g}(x)$.
\item Estimate $\hat{\Omega} = \En[p(x_i)p(x_i)']^{-1}
\En[\hat{\epsilon}_i^2p(x_i)p(x_i)'] \En[p(x_i)p(x_i)']^{-1}$
\item Simulate a large number of draws, say $z_1, ..., z_R$ from
$N(0,I_k)$, set
\[ Z_{n,r}(x) = \frac{p(x)'\hat{\Omega}^{1/2}}
{\norm{p(x)'\hat{\Omega}^{1/2}}}z_r \]
and find $\sup_{x \in \mathcal{X}} \abs{Z_{n,r}(x)}$
\item Set $\hat{\kappa}(1-\alpha)$ equal to the $1-\alpha$ quantile of
$\sup_{x \in \mathcal{X}} \abs{Z_{n,r}(x)}$ across the $R$ draws
\item The confidence bands are
\[ (\hat{l}_k(x),\hat{u}_k(x)) = \hat{g}(x) \pm \hat{\kappa}(1-\alpha)
\frac{\norm{p(x)'\hat{\Omega}^{1/2}}}{\sqrt{n}}
\]
\end{enumerate}
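The steps above can be sketched in code. A minimal implementation,
assuming a univariate power polynomial basis, i.i.d.\ data, a Cholesky
factor as the choice of $\hat{\Omega}^{1/2}$, and a $1/\sqrt{n}$ scaling
of the band width coming from the $\sqrt{n}$ in theorem \ref{tv4};
`uniform_band` and its arguments are my own names:

```python
import numpy as np

def uniform_band(x, y, k, xgrid, alpha=0.05, R=1000, rng=None):
    # Simulation-based uniform confidence band for series regression.
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    P = np.vander(x, k, increasing=True)       # p(x_i)'
    Pg = np.vander(xgrid, k, increasing=True)  # p(x)' on the evaluation grid
    Qinv = np.linalg.inv(P.T @ P / n)
    beta = Qinv @ (P.T @ y) / n                # step 1: series OLS estimate
    ghat = Pg @ beta
    e = y - P @ beta                           # residuals
    S = (P * (e**2)[:, None]).T @ P / n        # E_n[eps_i^2 p(x_i) p(x_i)']
    Omega = Qinv @ S @ Qinv                    # step 2: sandwich estimate
    B = Pg @ np.linalg.cholesky(Omega)         # p(x)' Omega^{1/2} (Cholesky choice)
    norms = np.linalg.norm(B, axis=1)          # ||p(x)' Omega^{1/2}||
    Z = rng.standard_normal((k, R))            # step 3: R draws from N(0, I_k)
    sups = np.max(np.abs(B @ Z) / norms[:, None], axis=0)
    kappa = np.quantile(sups, 1 - alpha)       # step 4: (1-alpha) quantile
    half = kappa * norms / np.sqrt(n)          # step 5: half-width of the band
    return ghat - half, ghat + half
```

Any square root of $\hat{\Omega}$ works here, since only
$p(x)'\hat{\Omega}p(x')$ enters the distribution of the simulated
process; Cholesky is just a cheap choice.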
Note that all of our results above treated $\Omega$ as known. One can
show that the results go through when $\hat{\Omega}$ is used instead;
see \cite{clr2009}.
Throughout, we have carried along constants $c_k$, $\ell_k$, etc.\ that
depend on various details of the problem. \cite{wangYang2009} obtain
similar results for spline regression, but they make explicit
assumptions about what $c_k$, $\ell_k$, etc.\ will be.
All of our results have been for $\hat{g}(x)$ and not $\frac{\partial
\hat{g}} {\partial x_j}(x)$. However, the result in theorem
\ref{tv4} also applies to
\[ \sup_{x \in \mathcal{X}} \abs{
\sqrt{n} \frac{p^j(x)'(\hat{\beta} - \beta)}
{\norm{p^j(x)'\Omega^{1/2}}} - \frac{p^j(x)'
\Omega^{1/2}}
{\norm{p^j(x)' \Omega^{1/2}}} \mathcal{N}_k } = o_p(a_n^{-1}) \]
where $p^j(x) = \frac{\partial p}{\partial x_j}(x)$. If we redefine
the approximation error as
\[ r(x) = p^j(x)' \beta - \frac{\partial g}{\partial x_j}(x) \]
then we just need to control this approximation error instead. If I
recall correctly, we will generally get $c_k = k^{-\frac{s - m}{d}}$
when we approximate the $m$th derivative. I believe $\ell_k$ will not
change, but I am not at all certain. Finally, $\xi_k$ must be redefined
as $\sup_{x \in \mathcal{X}} \norm{p^j(x)}$, which grows like
$k^{1/2 + m}$ for splines and $k^{1+2m}$ for polynomials. I am not
sure about Fourier series, but I suspect $k^{1/2+m}$ as well. It is
easy to show that $k^{1/2+m}$ works, but it may be possible to get a
sharper bound. In any case, both $\xi_k$ and $c_k$ are worse when
estimating derivatives instead of functions themselves. Because of
this, we will get a slower rate of convergence when estimating
derivatives.
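For illustration, the plug-in derivative estimate
$p^j(x)'\hat{\beta}$ is straightforward to compute with a power
polynomial basis, where $p^j$ is the term-by-term derivative of the
basis. A sketch (the function names are my own):

```python
import numpy as np

def poly_basis(x, k):
    return np.vander(x, k, increasing=True)   # (1, x, ..., x^{k-1})

def poly_basis_deriv(x, k):
    # p^1(x): term-by-term derivative (0, 1, 2x, ..., (k-1) x^{k-2})
    x = np.asarray(x, dtype=float)
    D = np.zeros((x.size, k))
    for j in range(1, k):
        D[:, j] = j * x ** (j - 1)
    return D

def series_deriv(x, y, k, xgrid):
    # Fit beta by least squares, then plug into the derivative of the basis.
    beta = np.linalg.lstsq(poly_basis(x, k), y, rcond=None)[0]
    return poly_basis_deriv(xgrid, k) @ beta
```

On noiseless data with $g(x) = x^2$ and $k \geq 3$ the fit is exact, so
the estimated derivative equals $2x$ up to rounding.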
\subsubsection{Kernel regression}
I am running out of time, so just refer to \cite{hansen2009} for
kernel regression. Hansen's notes on nonparametrics have 16
parts. The most relevant is the second part,
\url{http://www.ssc.wisc.edu/~bhansen/718/NonParametrics2.pdf}.
Hansen's notes show the same sort of pointwise asymptotic normality
and uniform convergence rate results as above for series
estimators. Hansen's notes do not cover a uniform limiting
distribution. However, something like theorem \ref{tv4} can be shown
for kernel regression as well. See e.g.\ \cite{clr2009}, although the
result was first shown much earlier.
\subsubsection{Bootstrap}
Someone asked whether you can construct uniform confidence bands using
the bootstrap. Yes, you can, but only if you bootstrap in the correct
way. It has not been proven that the standard nonparametric bootstrap
works (i.e.\ resampling observations with replacement). However,
certain variants of the bootstrap do work. For kernel regression,
\cite{hardleMarron1991} propose using a wild bootstrap
procedure. \cite{claeskensVanKeilegom2003} show that a smoothed
bootstrap procedure is consistent for local polynomial regression. I
do not know of any analogous result for series regression. However, I
am fairly certain that a combination of the arguments in
\cite{chernozhukov2009} and \cite{ccms2011} would show consistency of
another smoothed bootstrap procedure.
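To fix ideas, here is a sketch of the wild bootstrap resampling step
(with Rademacher multipliers) applied to series regression. To be clear,
this only illustrates the mechanics: as just noted, consistency of a
bootstrap for uniform bands with series estimators has not been
established, and the cited papers treat kernel and local polynomial
regression. The names are my own:

```python
import numpy as np

def wild_bootstrap_sups(x, y, k, xgrid, R=500, rng=None):
    # Wild bootstrap: y*_i = ghat(x_i) + e_i * v_i with v_i = +/-1 (Rademacher),
    # recording sup_x |g*(x) - ghat(x)| for each bootstrap replication.
    rng = np.random.default_rng(0) if rng is None else rng
    P = np.vander(x, k, increasing=True)
    Pg = np.vander(xgrid, k, increasing=True)
    H = np.linalg.pinv(P)            # maps y to the OLS coefficients
    beta = H @ y
    fit, resid = P @ beta, y - P @ beta
    sups = np.empty(R)
    for r in range(R):
        v = rng.choice([-1.0, 1.0], size=y.size)
        sups[r] = np.max(np.abs(Pg @ (H @ (fit + resid * v) - beta)))
    return sups
```

Multiplying residuals by mean-zero, unit-variance weights preserves the
conditional heteroskedasticity of the errors, which is why the wild
bootstrap is attractive in this setting.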
%\subsection{Weighted average derivatives}
%\section{Partial identification}
\newpage
\bibliographystyle{econometrica}
\bibliography{../628}
\end{document}