\documentclass[a4paper,10pt]{article}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{setspace}
\usepackage{harvard}
\usepackage{aer}
\usepackage{fullpage}
\usepackage{hyperref}
\usepackage{graphicx}
\newcommand{\indep}{\perp\!\!\!\perp}
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\argmin}{\operatornamewithlimits{arg\,min}}
\newcommand{\plim}{\operatornamewithlimits{plim}}
\newcommand{\citefull}[1]{\citename{#1} \citeyear{#1}}
\newcommand{\citeparagraph}[1]{\medskip\noindent\textbf{{\citename{#1} \citeyear{#1}}}}
\newcommand{\cov}{\text{Cov}}
\newcommand{\var}{\text{Var}}
\newcommand{\rank}{\text{rank}}
% \newcommand{\det}{\text{det}}
\def\inprobLOW{\rightarrow_p}
\def\inprobHIGH{\,{\buildrel p \over \rightarrow}\,}
\def\as{\,{\buildrel a.s. \over \rightarrow}\,}
\def\asu{\,{\buildrel a.s.u. \over \rightarrow}\,}
\def\inprob{\,{\inprobHIGH}\,}
\def\indist{\,{\buildrel d \over \rightarrow}\,}
% defined environments
\newtheorem{thm}{Theorem} %[section]
\newtheorem{cor}[thm]{Corollary}
\newtheorem{lem}[thm]{Lemma}
\newtheorem{prop}[thm]{Proposition}
\theoremstyle{remark}
\newtheorem{rem}[thm]{Remark}
\newtheorem{ex}[thm]{Example}
\theoremstyle{definition}
\newtheorem{defn}[thm]{Definition}
\title{14.385 Recitation 7}
\author{Paul Schrimpf}
\begin{document}
\maketitle
\section{Midterm}
Average was 97.62.
\subsection{Bootstrap}
In part (a), the best answers to the question of whether the
bootstrap is a good idea were: yes, especially if you bootstrap a
pivotal statistic; or: yes, but it would be better to use a
parametric bootstrap.
\subsection{Test Scores}
\emph{You want to evaluate the effect of an after school program on test
scores. You have a data set with the following information: test
scores, $y$, whether the child attended the after school program, $d$,
and some other covariates, $x$, for example,
family income and whether the child lives with one or both
parents. Participation in the after school program was
completely voluntary, but before the program began, the school
sent a random subset of students detailed information extolling the
benefits of the program. You observe an indicator for whether a
student received this information, $z$.}
\begin{itemize}
\item[\emph{(a)}] \emph{(15 min) Describe how you could estimate the
effect of the program. Propose a test for the hypothesis that the
program had zero effect.} \\
An answer is IV.
\item[\emph{(b)}] \emph{(15 min) Due to No Child Left Behind, administrators
especially care about the effect of the program on the low end of
test scores. Describe how you could estimate this effect. Briefly
discuss how you could compute your estimator.} \\
The intended answer was quantile IV. Most people got that, but not
everyone wrote down the objective function correctly. The moment
condition for quantile IV is:
\[ E\left[ \tau - \mathbf{1}(y \leq x\beta(\tau) + \alpha(\tau)d) |
x,z \right] = 0 \]
which gives us the objective function
\[ Q(\alpha,\beta) = \left(\frac{1}{n} \sum_i \left(\tau - \mathbf{1}(y_i \leq
  x_i\beta(\tau) + \alpha(\tau)d_i) \right) w_i \right)' A \left(\frac{1}{n}
  \sum_i \left(\tau - \mathbf{1}(y_i \leq x_i\beta(\tau) + \alpha(\tau)d_i)
  \right) w_i \right) \]
The estimates can be computed by using quasi-Bayesian methods as on
the last problem set.
\item[\emph{(c)}] \emph{(15 min) Suppose a large portion of students
received a perfect score on the test. How would you modify your
estimator(s)?}
% One way to modify IV is to use a control function approach instead.
% Let's suppose that
% \begin{align*}
% y = & x\beta + \alpha d + \epsilon \\
% d = & \mathbf{1}(z\gamma + x\psi + u > 0)
% \end{align*}
% where $\epsilon$ and $u$ are jointly normal. We can write:
% \begin{align*}
% y = & x\beta + \alpha d + E[\epsilon|x,z,d] + v\\
% = & x\beta + \alpha d + d \rho_1 \frac{\phi(z\gamma +
% x\psi)}{1-\Phi(z\gamma + x\psi)} + (1-d) \rho_0 \frac{\phi(z\gamma +
% x\psi)}{\Phi(z\gamma + x\psi)} + v
% \end{align*}
The way to modify quantile IV is tricky. The invariance principle
that works for quantile regression does not apply here. The
reason is that we must condition on $z$ instead of $d$, as we would
in exogenous quantile regression. Let's start from the first order
condition,
\[ \tau = P(y^* \leq x\beta(\tau) + \alpha(\tau)d \,|\,
x,z) \]
where $y^*$ is the uncensored test score. The censored test score,
$y = \min\{y^*,100\}$ is less than or equal to $y^*$, so
\[ \tau \leq P(y \leq x\beta(\tau) + \alpha(\tau)d |
x,z) \]
Similarly, we know that
\[ 1-\tau = P(y^* > x\beta(\tau) + \alpha(\tau)d |
x,z) \]
and
\[ 1-\tau \geq P(y > x\beta(\tau) + \alpha(\tau)d |x,z) \]
These two conditional moment inequalities can be combined into an
objective function and set inference can be done.
\end{itemize}
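As a sketch of how the quantile-IV objective from part (b) could be computed: the snippet below evaluates the GMM criterion on simulated data. The data-generating process, instrument list, and identity weighting matrix are all illustrative assumptions, not part of the problem.

```python
import numpy as np

# Simulated data loosely matching the exam setup (all parameter values
# here are illustrative, not from the problem):
rng = np.random.default_rng(0)
n, tau = 500, 0.5
z = rng.binomial(1, 0.5, size=n)                      # information letter
x = rng.normal(size=n)                                # covariate
d = (0.5 * z + 0.2 * x + rng.normal(size=n) > 0).astype(float)  # participation
y = x + 2.0 * d + rng.normal(size=n)                  # test score

def ivqr_objective(alpha, beta, tau, y, x, d, w, A):
    """GMM objective: gbar' A gbar with
    gbar = (1/n) sum_i (tau - 1{y_i <= x_i*beta + alpha*d_i}) w_i."""
    u = tau - (y <= x * beta + alpha * d)             # moment residual per obs
    gbar = (w * u[:, None]).mean(axis=0)              # sample moment vector
    return float(gbar @ A @ gbar)

w = np.column_stack([np.ones(n), x, z])               # instruments: 1, x, z
A = np.eye(w.shape[1])
q_true = ivqr_objective(2.0, 1.0, tau, y, x, d, w, A)
q_far = ivqr_objective(10.0, 1.0, tau, y, x, d, w, A)
```

Because the objective is a step function of $(\alpha,\beta)$, gradient methods fail, which is why quasi-Bayesian (MCMC) computation is attractive here.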
\def\J{\mathcal{J}}
\section{HAC}
To estimate the asymptotic variance of GMM or to do efficient GMM, we
need to estimate $\Omega = \lim \var(\sqrt{n}\hat{g}(\beta_0))$. When
the data are iid, estimation of $\Omega$ is straightforward: we can just
use its sample analog. When the data are autocorrelated, estimation is
more complicated. Newey and West (1987) developed a
heteroskedasticity and autocorrelation consistent (HAC) covariance
estimator.
We have a series $\{z_t\}$, and we want to estimate its long-run
variance, $\mathcal{J} = \lim \var\left(\frac{1}{\sqrt{T}} \sum
  z_t\right)$. If we assume that $z_t$ is covariance stationary, so
that $\cov(z_t,z_{t+k})$ depends only on $k$ and not $t$, and denote
the $k$th autocovariance by $\gamma_k = \cov(z_t,z_{t+k})$, then
$\mathcal{J} = \sum_{k=-\infty}^\infty \gamma_k$.
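As a quick numerical illustration of $\J = \sum_k \gamma_k$, consider an AR(1): for $z_t = \rho z_{t-1} + e_t$ with $\var(e_t)=1$, we have $\gamma_k = \rho^{|k|}/(1-\rho^2)$ and $\J = 1/(1-\rho)^2$. The sketch below (the value of $\rho$ and the truncation point are illustrative) checks that the truncated sum of autocovariances matches the closed form.

```python
import numpy as np

# Long-run variance of an AR(1): z_t = rho * z_{t-1} + e_t, Var(e_t) = 1.
# Autocovariances: gamma_k = rho^|k| / (1 - rho^2); J = 1 / (1 - rho)^2.
rho = 0.7
ks = np.arange(-200, 201)                    # truncate the infinite sum far out
gammas = rho ** np.abs(ks) / (1 - rho ** 2)
J_sum = gammas.sum()
J_closed = 1 / (1 - rho) ** 2
```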
\subsection{A na\"{i}ve approach}
$\mathcal{J}$ is the sum of all the auto-covariances. We can estimate
$T-1$ of them, but not all. What if we just use the ones we can
estimate, \emph{i.e.}
\begin{align*}
  \tilde{\mathcal{J}} = \sum_{k=-(T-1)}^{T-1} \hat{\gamma}_k \, ,\,
  \hat{\gamma}_k = \frac{1}{T} \sum_{j=1}^{T-|k|} z_j z_{j+|k|}
\end{align*}
It turns out that this is very bad.
\begin{align*}
  \tilde{\mathcal{J}} = & \sum_{k=-(T-1)}^{T-1} \hat{\gamma}_k \\
  = & \frac{1}{T} \sum_{k=-(T-1)}^{T-1} \sum_{j=1}^{T-|k|} z_j z_{j+|k|} \\
  = & \frac{1}{T} \left(\sum_{t=1}^T z_t\right)^2 \\
  = & \left(\frac{1}{\sqrt{T}} \sum_{t=1}^T z_t\right)^2 \\
  \indist & N(0,\mathcal{J})^2
\end{align*}
so $\tilde{\mathcal{J}}$ is not consistent; it converges to a distribution
instead of a point. The problem is that we're summing too many
imprecisely estimated covariances.
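The algebraic identity $\tilde{\J} = \left(\frac{1}{\sqrt{T}}\sum_t z_t\right)^2$ can be verified numerically. A minimal sketch (demeaning is omitted to match the zero-mean formulas above; the series here is an arbitrary draw):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
z = rng.normal(size=T)

def gamma_hat(z, k):
    """Sample autocovariance (1/T) * sum_{j=1}^{T-|k|} z_j z_{j+|k|}
    (mean assumed to be zero, as in the text)."""
    k = abs(k)
    T = len(z)
    return (z[:T - k] * z[k:]).sum() / T

# Sum of ALL estimable autocovariances collapses to a single squared sum:
J_tilde = sum(gamma_hat(z, k) for k in range(-(T - 1), T))
identity = (z.sum() / np.sqrt(T)) ** 2
```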
\subsection{Truncated sum of sample covariances}
What if we don't use all the covariances?
\begin{align*}
  \tilde{\mathcal{J}}_2 = & \sum_{k=-S_T}^{S_T} \hat{\gamma}_k
\end{align*}
where $S_T < T-1$. More generally, we can weight the sample
covariances by a kernel, $k_T(j)$:
\begin{align*}
  \hat{\J} = & \sum_{j=-S_T}^{S_T} k_T(j) \hat{\gamma}_j
\end{align*}
This estimator is consistent under conditions on $S_T$ and $k_T(j)$
that we verify below.
\begin{proof}
Decompose the error in $\hat{\J}$ as
\begin{align*}
  \hat{\J} - \J = & -\sum_{|j|>S_T}\gamma_j + \sum_{j=-S_T}^{S_T}(k_T(j)-1)
  \gamma_j + \sum_{j=-S_T}^{S_T} k_T(j) (\hat{\gamma}_j-\gamma_j)
\end{align*}
We can interpret these three terms as follows:
\begin{enumerate}
\item $\sum_{|j|>S_T}\gamma_j$ is truncation error
\item $\sum_{j=-S_T}^{S_T}(k_T(j)-1) \gamma_j$ is error from using
the kernel
\item $\sum_{j=-S_T}^{S_T} k_T(j) (\hat{\gamma}_j-\gamma_j)$ is
  error from estimating the covariances
\end{enumerate}
Terms 1 and 2 are non-stochastic. They represent bias. The third
term is stochastic; it is responsible for uncertainty. We will
face a bias-variance tradeoff.
We want to show that each of these terms goes to zero
\begin{enumerate}
\item Disappears as long as $S_T \rightarrow \infty$, since we
assumed $\sum_{-\infty}^\infty |\gamma_j|<\infty$.
\item $\left|\sum_{j=-S_T}^{S_T}(k_T(j)-1) \gamma_j \right| \leq
  \sum_{j=-S_T}^{S_T}|k_T(j)-1| |\gamma_j|$. This will converge to
  zero as long as $k_T(j) \rightarrow 1$ as $T\rightarrow \infty$
  and $|k_T(j)|\leq 1$ $\forall j$.
\item Notice that for the first two terms we wanted
$S_T$ big enough to eliminate them. Here, we'll want $S_T$
to be small enough.
First, note that $\hat{\gamma}_j \equiv \frac{1}{T}
\sum_{k=1}^{T-j} z_k z_{k+j}$ is not unbiased. $E\hat{\gamma}_j =
\frac{T-j}{T} \gamma_j = \tilde{\gamma}_j$. However, it's clear
that this bias will disappear as $T \rightarrow \infty$.
Let $\xi_{t,j} = z_t z_{t+j} - \gamma_j$, so $\hat{\gamma}_j -
\tilde{\gamma}_j = \frac{1}{T} \sum_{\tau=1}^{T-j} \xi_{\tau,j}$.
We need to show that the sum of $\xi_{t,j}$ goes to zero.
\begin{align*}
E(\hat{\gamma}_j - \tilde{\gamma}_j)^2 = & \frac{1}{T^2} \sum_{k=1}^{T-j}
\sum_{t=1}^{T-j} \cov(\xi_{k,j},\xi_{t,j}) \\
\leq & \frac{1}{T^2} \sum_{k=1}^{T-j} \sum_{t=1}^{T-j} |
\cov(\xi_{k,j},\xi_{t,j}) |
\end{align*}
We need an assumption to guarantee that the covariances of $\xi$
disappear. The assumption that $\xi_{t,j}$ is stationary for all $j$ and
$\sup_j \sum_k |\cov(\xi_{t,j},\xi_{t+k,j})| < C$ for some constant
$C$ implies that
\begin{align*}
\frac{1}{T^2} \sum_{k=1}^{T-j} \sum_{t=1}^{T-j}
|\cov(\xi_{k,j},\xi_{t,j}) | \leq \frac{C}{T}
\end{align*}
By Chebyshev's inequality we have:
\begin{align*}
P(|\hat{\gamma}_j - \tilde{\gamma}_j|>\epsilon) \leq \frac{
E(\hat{\gamma}_j - \tilde{\gamma}_j)^2}{\epsilon^2} \leq
\frac{C}{\epsilon^2 T}
\end{align*}
Then adding these together:
\begin{align*}
P(\sum_{-S_T}^{S_T} |\hat{\gamma}_j -
\tilde{\gamma}_j|>\epsilon) \leq &
\sum_{-S_T}^{S_T} P(|\hat{\gamma}_j -
\tilde{\gamma}_j|>\frac{\epsilon}{2S_T+1}) \\
\leq & \sum_{-S_T}^{S_T}
\frac{E(\hat{\gamma}_j-\tilde{\gamma}_j)^2}{\epsilon^2} (2S_T+1)^2 \\
\leq & \sum_{-S_T}^{S_T} \frac{C}{T} (2S_T+1)^2 \approx
C_1 \frac{S_T^3}{T}
\end{align*}
so, it is enough to assume $\frac{S_T^3}{T} \rightarrow 0$ as $T
\rightarrow \infty$.
\end{enumerate}
\end{proof}
\subsubsection{Positive Definiteness}
Under appropriate conditions on the kernel, $k_T(j)$, the
estimate of the long run variance is guaranteed to be positive
definite.
Assume $k_T(j)$ is an inverse Fourier transform of $K_T(l)$,
\emph{i.e.}\
\begin{align*}
k_T(j) = & \sum_{l=-(T-1)}^{T-1} K_T(l) e^{-i \frac{2\pi j l}{T}}
\end{align*}
\begin{lem}
$\hat{\J}$ is non-negative with probability 1 if and only if
$K_T(l) \geq 0$ and $K_T(l) = K_T(-l)$
\end{lem}
\paragraph{Common Kernels}
\begin{defn}
  \emph{Bartlett kernel} $k_T(j) = k(j/S_T)$ where $k(x) =
  \begin{cases} 1-|x| & |x| \leq 1 \\ 0 & \text{otherwise}
  \end{cases}$ \\
Newey-West (1987) (this is one of the most cited papers in
economics)
\end{defn}
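The lemma can be illustrated numerically: the Bartlett kernel is a positive-definite function, so the resulting estimate is nonnegative for every realization, not just on average. A minimal sketch (the series here are arbitrary white-noise draws; the lengths and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def bartlett_est(z, S):
    """Bartlett-weighted sum of sample autocovariances (zero mean assumed)."""
    T = len(z)
    gamma = lambda k: (z[:T - abs(k)] * z[abs(k):]).sum() / T
    return sum((1 - abs(j) / (S + 1)) * gamma(j) for j in range(-S, S + 1))

# The estimate should be nonnegative for every draw:
vals = [bartlett_est(rng.normal(size=30), S=5) for _ in range(200)]
min_val = min(vals)
```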
\begin{defn}
  \emph{Parzen kernel} $k(x) = \begin{cases}
    1-6 x^2 + 6|x|^3 & 0 \leq |x| \leq 1/2 \\
    2(1-|x|)^3 & 1/2 < |x| \leq 1 \\
    0 & \text{otherwise} \end{cases}$
\end{defn}
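A minimal sketch of a Bartlett-kernel (Newey-West) estimator, checked on a simulated AR(1) whose long-run variance is known in closed form. The bandwidth, sample size, and AR coefficient are illustrative choices, not recommendations.

```python
import numpy as np

def newey_west(z, S):
    """Bartlett-kernel HAC estimate of the long-run variance of a scalar
    series: k_T(j) = 1 - |j|/(S+1) for |j| <= S, 0 otherwise."""
    z = np.asarray(z, dtype=float)
    T = len(z)
    z = z - z.mean()
    gamma = lambda k: (z[:T - k] * z[k:]).sum() / T
    J = gamma(0)
    for j in range(1, S + 1):
        J += 2 * (1 - j / (S + 1)) * gamma(j)  # symmetric weights, so x2
    return J

rng = np.random.default_rng(1)
# AR(1) with rho = 0.5: true long-run variance is 1 / (1 - 0.5)^2 = 4.
rho, T = 0.5, 100_000
e = rng.normal(size=T)
z = np.empty(T)
z[0] = e[0]
for t in range(1, T):
    z[t] = rho * z[t - 1] + e[t]
J_hat = newey_west(z, S=50)
```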
\paragraph{Kiefer \& Vogelsang}
Kiefer \& Vogelsang (2002) consider setting $S_T=T-1$. This makes
$\hat{\J}$ inconsistent (it converges to a distribution). However,
$\hat{\J}$ usually isn't what we care about. We care about testing
$\hat{\beta}$, say by looking at the $t$ statistic. We can use
$\hat{\J}$ with $S_T=T-1$ to compute $t =
\frac{\hat{\beta}}{se(\hat{\beta})}$, which will converge to some
(non-normal) distribution without any nuisance parameters, and we
can use this distribution for testing. The motivation for doing
this is that Newey-West often works poorly in small samples.
\subsection{Parametric HAC Estimation}
Assume $z_t$ is AR(p), estimated by OLS:
\begin{align*}
  z_t = a_1 z_{t-1} + \dots + a_p z_{t-p} + e_t
\end{align*}
then use $\hat{a}(1)=1-\hat{a}_1-\dots-\hat{a}_p$ to construct
$\hat{\mathcal{J}}$,
\begin{align*}
  \hat{\mathcal{J}} = \frac{\hat{\sigma}^2}{\hat{a}(1)^2}
\end{align*}
where $\hat{\sigma}^2 = \frac{1}{T} \sum \hat{e}_t^2$.

Two questions:
\begin{itemize}
\item What $p$? -- model selection criteria, such as BIC (the Bayesian
  information criterion)
\item What if $z_t$ is not AR(p)?
\end{itemize}
The second question is still open. Den Haan and Levin
(1997) showed that if $z_t$ is AR(p), then the parametric
estimator converges faster than the kernel estimators described
above.
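A sketch of the parametric estimator under the AR(p) assumption: fit the autoregression by OLS, then form $\hat{\sigma}^2/\hat{a}(1)^2$. The data-generating process and lag choice below are illustrative.

```python
import numpy as np

def parametric_hac(z, p):
    """Fit AR(p) by OLS (no intercept), return sigma_hat^2 / a_hat(1)^2."""
    z = np.asarray(z, dtype=float)
    T = len(z)
    Y = z[p:]
    # Lag matrix: column k holds z_{t-k} for t = p+1, ..., T
    X = np.column_stack([z[p - k:T - k] for k in range(1, p + 1)])
    a_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e_hat = Y - X @ a_hat
    sigma2 = (e_hat ** 2).mean()
    return sigma2 / (1 - a_hat.sum()) ** 2     # a_hat(1) = 1 - sum of AR coefs

rng = np.random.default_rng(2)
rho, T = 0.5, 100_000
e = rng.normal(size=T)
z = np.empty(T)
z[0] = e[0]
for t in range(1, T):
    z[t] = rho * z[t - 1] + e[t]
J_hat = parametric_hac(z, p=1)   # true value: 1 / (1 - 0.5)^2 = 4
```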
\subsection{Prewhitening}
Nonparametric HAC performs poorly when
the series is persistent. Parametric HAC performs poorly if the
model is wrong. Prewhitening combines the two. From the above
we know that if $e_t$ is white noise with variance $\Sigma$,
then when $A(L) z_t = B(L) e_t$, the long-run variance of $z_t$
is
\[\J_z = {A}(1)^{-1} {B}(1) {\Sigma} {B}(1)' {{A}(1)^{-1}}' \]
Similarly if $e_t$ is not white noise, but has long-run variance
$\J_e$, then
\[\J_z = {A}(1)^{-1} {B}(1) {\J_e} {B}(1)' {{A}(1)^{-1}}' \]
The prewhitened nonparametric estimate of $\J_z$ is then simply:
\[\hat{\J}_z = \hat{A}(1)^{-1} \hat{B}(1) \hat{\J}_e \hat{B}(1)'
{\hat{A}(1)^{-1}}' \]
where $\hat{A}$ and $\hat{B}$ are estimated by OLS or Kalman
filtering, and $\hat{\J}_e$ is estimated by applying nonparametric
HAC to $\hat{e}_t$.
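A minimal scalar AR(1) prewhitening sketch: filter the series, apply kernel HAC to the (nearly white) residuals, then recolor by $\hat{a}(1)^{-2}$. All names and parameter choices are illustrative.

```python
import numpy as np

def bartlett_lrv(e, S):
    """Bartlett-kernel long-run variance of a scalar series (helper)."""
    e = np.asarray(e, dtype=float) - np.mean(e)
    T = len(e)
    gamma = lambda k: (e[:T - k] * e[k:]).sum() / T
    return gamma(0) + 2 * sum((1 - j / (S + 1)) * gamma(j)
                              for j in range(1, S + 1))

def prewhitened_hac(z, S):
    """AR(1) prewhitening: filter z, kernel-HAC on residuals, recolor.
    In the scalar AR(1) case, J_z = J_e / (1 - a_hat)^2."""
    z = np.asarray(z, dtype=float)
    a_hat = (z[1:] @ z[:-1]) / (z[:-1] @ z[:-1])   # OLS slope, no intercept
    e_hat = z[1:] - a_hat * z[:-1]                 # prewhitened residuals
    return bartlett_lrv(e_hat, S) / (1 - a_hat) ** 2

rng = np.random.default_rng(3)
rho, T = 0.9, 100_000                              # persistent series
eps = rng.normal(size=T)
z = np.empty(T)
z[0] = eps[0]
for t in range(1, T):
    z[t] = rho * z[t - 1] + eps[t]
J_hat = prewhitened_hac(z, S=10)   # true value: 1 / (1 - 0.9)^2 = 100
```

Notice that a small bandwidth suffices after prewhitening, because the residuals carry little remaining autocorrelation; a plain kernel estimate of a series this persistent would need a much larger $S_T$.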
\paragraph{Practical Advice} This summer, Mark Watson gave a lecture
on HAC \\
\href{http://nber15.nber.org/c/2008/si2008/TSE/Lecture9.pdf}{http://nber15.nber.org/c/2008/si2008/TSE/Lecture9.pdf}
\\
and this is a short summary of what he recommended. When doing HAC,
you have to choose which of the three methods to use, and then if you
choose ARMA, the lag lengths, or if you choose nonparametric, the
kernel and bandwidth. In this discussion, the goal is to do
inference on $\hat{\beta}$
\begin{itemize}
\item Simulations show large size distortions for all methods (reject
at 5\% level far more than 5\% of time). Tests work worse when
\begin{itemize}
\item Sample size is smaller
\item Data is more persistent (e.g. an AR(1) with coefficient near one)
\end{itemize}
\item If it is the correct model, parametric ARMA works best. Sometimes
theory suggests an ARMA (den Haan and Levin 1997).
\item Kiefer-Vogelsang leads to smaller size distortions, but has
less power than kernel methods
\item For kernel methods:
\begin{itemize}
\item The theoretically optimal\footnote{In the sense that it
minimizes MSE of $\hat{\J}$} kernel is called the
quadratic-spectral (QS) kernel. In practice, all common kernels
perform similarly.
\item For inference, it is not necessarily best to minimize MSE of
$\hat{\J}$
\begin{itemize}
\item See Sun, Phillips, and Jin (2008) for a more formal
  discussion
\item Intuition: suppose $z \sim N(\mu,\sigma^2)$ (think of $z$
as $\sqrt{n}(\beta - \hat{\beta}_0)$) and
$\hat{\sigma}^2$ is an estimate of $\sigma^2$. For testing
$H_0: \mu=0$ at level $\alpha$, we would compute a critical
value, $c$, from the normal distribution such that
$P(|z/\sigma|