\documentclass[a4paper,10pt]{article}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{setspace}
\usepackage{harvard}
\usepackage{aer}
\usepackage{fullpage}
\usepackage{hyperref}
\usepackage{graphicx}
\newcommand{\indep}{\perp\!\!\!\perp}
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\argmin}{\operatornamewithlimits{arg\,min}}
\newcommand{\plim}{\operatornamewithlimits{plim}}
\newcommand{\citefull}[1]{\citename{#1} \citeyear{#1}}
\newcommand{\citeparagraph}[1]{\medskip\noindent\textbf{{\citename{#1} \citeyear{#1}}}}
\newcommand{\cov}{\text{Cov}}
\newcommand{\var}{\text{Var}}
\newcommand{\rank}{\text{rank}}
% \newcommand{\det}{\text{det}}
\def\inprobLOW{\rightarrow_p}
\def\inprobHIGH{\,{\buildrel p \over \rightarrow}\,}
\def\as{\,{\buildrel a.s. \over \rightarrow}\,}
\def\asu{\,{\buildrel a.s.u. \over \rightarrow}\,}
\def\inprob{\,{\inprobHIGH}\,}
\def\indist{\,{\buildrel d \over \rightarrow}\,}
% defined environments
\newtheorem{thm}{Theorem} %[section]
\newtheorem{cor}[thm]{Corollary}
\newtheorem{lem}[thm]{Lemma}
\newtheorem{prop}[thm]{Proposition}
\theoremstyle{remark}
\newtheorem{rem}[thm]{Remark}
\newtheorem{ex}[thm]{Example}
\theoremstyle{definition}
\newtheorem{defn}[thm]{Definition}
%\usepackage[numbered,framed,bw]{mcode}
\usepackage{listings}
\lstset{ %
basicstyle=\footnotesize, % the size of the fonts that are used for the code
numbers=left, % where to put the line-numbers
numberstyle=\footnotesize, % the size of the fonts that are used for the line-numbers
stepnumber=5, % the step between two line-numbers. If it's 1 each line will be numbered
numbersep=5pt, % how far the line-numbers are from the code
%backgroundcolor=\color{white}, % choose the background color. You must add \usepackage{color}
showspaces=false, % show spaces adding particular underscores
showstringspaces=false, % underline spaces within strings
showtabs=false, % show tabs within strings adding particular underscores
frame=single, % adds a frame around the code
tabsize=2, % sets default tabsize to 2 spaces
captionpos=b, % sets the caption-position to bottom
breaklines=false, % sets automatic line breaking
breakatwhitespace=false % sets if automatic breaks should only happen at whitespace
}
\title{14.385 Recitation 8}
\author{Paul Schrimpf}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section{GEL and Bias}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mathematical Programming Approach to BLP}
\subsection{BLP Setup}
In the BLP model person $i$'s utility from choice $j$ is
\begin{align}
U_{ij} = \alpha \ln(y_i - p_j) + x_j \beta + \xi_{jm} + \sum_{l=1}^L
\sigma_l x_{jl} \nu_{il} + \epsilon_{ij}
\end{align}
where $\epsilon_{ij}$ has an extreme value type I distribution.
This implies that the population probability of choice $j$ in market
$m$ is
\begin{align}
P_{jm}(\alpha,\beta,\sigma) = \int \frac{\exp(\alpha \ln(y_i - p_{jm}) +
x_j \beta + \xi_{jm} + \sum_{l=1}^L \sigma_l x_{jl} \nu_{il})}
{\sum_{j'=0}^J \exp(\alpha \ln(y_i - p_{j'm}) + x_{j'} \beta + \xi_{j'm} +
\sum_{l=1}^L \sigma_l x_{j'l} \nu_{il}) } dG_m(\nu_i,y_i)
\end{align}
If $\xi_{jm}$ were observable, we could estimate this model by MLE.
$\xi$ is not observable, so we must do something else. If we thought
$\xi$ was independent of $p$, $y$, and $x$, then we could integrate
it out. However, if firms know $\xi$, then firms will set their
prices to depend on $\xi$. Consequently, we must instrument for at
least prices. If our instruments are $z$, then we have the following
moment condition
\[ E[\xi_{jm} z_{jm}] = 0 \]
Or equivalently if $\delta_{jm} = \xi_{jm} + x_j \beta$,
\begin{align}
E[(\delta_{jm} - x_j \beta)z_{jm}] = 0 \label{momenteq}
\end{align}
where we require $\delta_{jm}$ to rationalize the observed product
shares; that is, $\delta_{jm}$ must satisfy
\begin{align}
\hat{s}_{jm} = \int \frac{\exp(\alpha \ln(y_i - p_{jm}) +
\delta_{jm} + \sum_{l=1}^L \sigma_l x_{jl} \nu_{il})}
{\sum_{j'=0}^J \exp(\alpha \ln(y_i - p_{j'm}) + \delta_{j'm} +
\sum_{l=1}^L \sigma_l x_{j'l} \nu_{il}) }
dG_m(\nu_i,y_i) \label{shareeq}
\end{align}
BLP view (\ref{shareeq}) as an equation that implicitly defines a
function $\delta(\alpha,\sigma)$ mapping the other parameters to the
mean utilities.
They then propose estimating the parameters by minimizing
\begin{align}
\min_{\alpha,\beta,\sigma} (\delta(\alpha,\sigma) - x\beta)' z' W z
(\delta(\alpha,\sigma) - x\beta)
\end{align}
This is a difficult function to minimize because
$\delta(\alpha,\sigma)$ is difficult to compute. Computing
$\delta(\alpha,\sigma)$ requires solving the nonlinear system of
equations (\ref{shareeq}). This general situation -- where you want
to minimize an objective function that depends on another, hard to
compute function -- is quite common in structural econometrics. It
occurs anytime the equilibrium of your model does not have a closed
form solution. Examples include models of games and models of
investment.
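To make the inner computation concrete, here is a minimal Python sketch of BLP's contraction mapping $\delta \leftarrow \delta + \ln \hat{s} - \ln s(\delta)$ for a single market with one random coefficient. The DGP, dimensions, and function names are illustrative (this is not the code from the webpage).

```python
import numpy as np

def predicted_shares(delta, sigma, x, v):
    """Market shares given mean utilities delta (J,), a random-coefficient
    std dev sigma, characteristics x (J,), and consumer draws v (S,).
    The outside good has utility normalized to 0."""
    u = delta[None, :] + sigma * v[:, None] * x[None, :]   # (S, J) utilities
    expu = np.exp(u)
    probs = expu / (1.0 + expu.sum(axis=1, keepdims=True))
    return probs.mean(axis=0)                              # average over consumers

def blp_invert(s_hat, sigma, x, v, tol=1e-10, max_iter=20000):
    """BLP contraction: delta <- delta + log(s_hat) - log(s(delta))."""
    delta = np.log(s_hat) - np.log(1.0 - s_hat.sum())      # plain-logit start
    for _ in range(max_iter):
        new = delta + np.log(s_hat) - np.log(predicted_shares(delta, sigma, x, v))
        if np.max(np.abs(new - delta)) < tol:
            return new
        delta = new
    return delta

rng = np.random.default_rng(0)
x = rng.normal(size=3)
v = rng.normal(size=500)
delta_true = np.array([0.5, -0.2, 0.1])
s_hat = predicted_shares(delta_true, 0.8, x, v)
delta_rec = blp_invert(s_hat, 0.8, x, v)
print(np.max(np.abs(delta_rec - delta_true)))  # numerically zero
```

Because the same simulated consumers generate $\hat{s}$ and evaluate $s(\delta)$, the fixed point recovers $\delta$ exactly; in estimation this inversion must be redone at every trial value of $(\alpha,\sigma)$, which is what makes the nested approach expensive.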
\subsection{Mathematical Programming}
The mathematical programming approach to BLP rewrites the problem as a
constrained optimization problem:
\begin{align}
\min_{\alpha,\beta,\sigma,\delta} & (\delta - x\beta)' z' W z
(\delta - x\beta) \notag \\
& s.t. \notag \\
\hat{s}_{jm} = & \int \frac{\exp(\alpha \ln(y_i - p_{jm}) +
\delta_{jm} + \sum_{l=1}^L \sigma_l x_{jl} \nu_{il})}
{\sum_{j'=0}^J \exp(\alpha \ln(y_i - p_{j'm}) + \delta_{j'm} +
\sum_{l=1}^L \sigma_l x_{j'l} \nu_{il}) }
dG_m(\nu_i,y_i)
\end{align}
This probably does not look any easier to solve. However, constrained
optimization problems are very common, and they have been heavily
studied by computer scientists and mathematicians. Consequently,
there are many good algorithms and software tools for solving
constrained optimization problems. These algorithms are often faster
than BLP's approach of plugging the constraint into the objective.
Part of the reason is that, by not requiring the constraint to be
satisfied at every iteration, mathematical programming algorithms can
adjust all the parameters simultaneously, lowering the objective
function while bringing the constraints closer to being satisfied.
Ken Judd and coauthors have a few papers advocating the use of
mathematical programming in economics.
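Here is a toy illustration of the constrained formulation using Python's scipy (not the AMPL code discussed below). To keep it self-contained I use plain logit shares with a single characteristic, so the share constraint could in fact be inverted analytically; the point is only the structure: optimize over $(\beta,\delta)$ jointly, with the share equations imposed as constraints.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
J = 5
x = rng.normal(size=J)        # product characteristic
z = rng.normal(size=(J, 2))   # instruments
beta_true = 1.5
xi = 0.1 * rng.normal(size=J) # unobserved quality
delta_true = x * beta_true + xi

def shares(delta):
    """Plain logit shares with an outside good normalized to utility 0."""
    e = np.exp(delta)
    return e / (1.0 + e.sum())

s_hat = shares(delta_true)
W = np.eye(2)

def objective(theta):
    """GMM objective (delta - x*beta)' z W z' (delta - x*beta)."""
    beta, delta = theta[0], theta[1:]
    m = z.T @ (delta - x * beta)
    return m @ W @ m

def share_constraint(theta):
    """Predicted shares must equal observed shares (= 0 when feasible)."""
    return shares(theta[1:]) - s_hat

# start from the logit inversion of the shares and beta = 0
theta0 = np.concatenate([[0.0], np.log(s_hat) - np.log(1 - s_hat.sum())])
res = minimize(objective, theta0, method="SLSQP",
               constraints=[{"type": "eq", "fun": share_constraint}])
print(res.x[0])  # close to beta_true = 1.5, up to the xi noise
```

In a real BLP application the constraint would be the simulated share integral and the decision variables would include $(\alpha,\sigma)$; a solver with exact derivatives (as AMPL provides) matters much more there than in this tiny example.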
\subsection{BLP in AMPL}
AMPL (\textbf{A} \textbf{M}athematical \textbf{P}rogramming
\textbf{L}anguage) is a computer language for mathematical
programming. The advantages of AMPL are:
\begin{itemize}
\item Simple, natural syntax for writing down mathematical programs
\item Automatically computes derivatives, which are needed to use the
most efficient solution algorithms
\item Provides a simple interface to a large number of
state-of-the-art solution algorithms
\end{itemize}
AMPL's disadvantages:
\begin{itemize}
\item Its simple syntax and limited set of commands make it painful to
do anything other than write down mathematical programs
\item Very memory intensive
\item Poorly documented
\item Free student version limited to 300 variables and 300
constraints
\end{itemize}
I wrote some code to simulate and estimate a BLP type model in AMPL.
You can find the code on the webpage. Strangely enough, the hardest
part of writing the program was figuring out how to simulate the model.
The problem was that with a poorly chosen DGP, it is quite likely that
no value of the parameters both matches the simulated shares and
satisfies the supply side restrictions.
If you include a supply side in the estimation, then
\[ mc_{jm} = p_{jm} + \frac{s_{jm}}{\partial s_{jm}/\partial
p}(\alpha,\delta,\sigma) \]
It's natural to require $mc \geq 0$. Unfortunately, there is no
guarantee that there exist parameters such that
$\hat{s}_{jm} = P_{jm}(\alpha,\delta,\sigma)$ and $p_{jm} +
\frac{s_{jm}}{\partial s_{jm}/\partial p}(\alpha,\delta,\sigma) \geq
0$. With enough simulations to create $\hat{s}_{jm}$ you are
guaranteed that a solution exists, but with a small number of
simulations, you are not. I used 100 simulations, and had to adjust
the parameters and distribution of covariates to avoid
problems.\footnote{This experience made me wonder: why not treat
$\hat{s}_{jm} = P_{jm}(\cdot)$ as moments instead of constraints? I
suppose the answer is that $\hat{s}_{jm}$ is often computed from
population data, and so is without error. However, there must be some
applications where $\hat{s}_{jm}$ comes from a smaller sample, and
then I think treating everything as a moment would make more sense.}
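To see how negative implied costs can arise, consider a deliberately simplified logit demand (a constant price coefficient $\alpha$, not the $\ln(y-p)$ specification above), where $\partial s_j/\partial p_j = -\alpha s_j(1-s_j)$, so the single-product Bertrand FOC implies $mc_j = p_j - 1/(\alpha(1-s_j))$. A sketch with made-up numbers:

```python
import numpy as np

def implied_mc(p, s, alpha):
    """Implied marginal cost from the single-product Bertrand FOC
    mc_j = p_j + s_j / (ds_j/dp_j), for plain logit demand where
    ds_j/dp_j = -alpha * s_j * (1 - s_j)."""
    return p - 1.0 / (alpha * (1.0 - s))

p = np.array([1.0, 1.2, 0.8])
s = np.array([0.3, 0.2, 0.1])
print(implied_mc(p, s, alpha=4.0))  # all positive: costs are admissible
print(implied_mc(p, s, alpha=1.0))  # all negative: no admissible cost
```

With an insufficiently price-sensitive demand (small $\alpha$) or large shares, the implied markup $1/(\alpha(1-s_j))$ exceeds the price and no nonnegative cost can rationalize the data, which is exactly the failure that a badly chosen DGP produces.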
The model I estimated has an extra set of parameters. Utility is
given by
\begin{align}
U_{ij} = \alpha \ln(y_i - p_{jm}) + x_j \beta + \xi_{jm} + \sum_{l=1}^L
\sigma_l x_{jl} \nu_{il} + \sum_{k=1}^K \sum_{l=1}^L \pi_{kl} x_{jl}
d_{ik} + \epsilon_{ij}
\end{align}
where $d_{ik}$ are observed demographic characteristics. For each
market, $d_{ik}$ were drawn with replacement from the empirical
distribution of the characteristics. The idea here is that our
markets are something like states or years, and we can look at the
census to find out the distribution of demographics in each market.
I also estimated the model with supply side moments. As in Whitney's
notes, I assume Bertrand competition and marginal cost equal to
\begin{align}
mc_{jm} = \exp(c_j \alpha_c + \eta_{jm})
\end{align}
The first order condition for firms is:
\begin{align}
s_{jm}(p,x,\xi;\theta) + (p_{jm} - mc_{jm}) \frac{\partial
s_{jm}}{\partial p} = 0
\end{align}
I have assumed that each firm only produces one product, so that this
equation is slightly simpler than in the lecture notes. Rearranging
the first order condition gives:
\begin{align}
\eta_{jm} = \ln\left(p_{jm} + \frac{s_{jm}}{\partial s_{jm}/\partial
p} \right) - c_j \alpha_c
\end{align}
If we have instruments $z_c$ that are correlated with $c_j$ but
uncorrelated with $\eta_{jm}$, our moments are:
\begin{align}
0 = E\left[z_{c} \left(\ln\left(p_{jm} + \frac{s_{jm}}{\partial
s_{jm}/\partial p} \right) - c_j \alpha_c \right) \right]
\end{align}
\subsubsection{Model File}
\lstinputlisting{blpCode/blpSupply.mod}
\subsubsection{Command File}
\lstinputlisting{blpCode/blp.ampl}
\subsubsection{Data Simulation Commands}
\lstinputlisting{blpCode/simulateData.ampl}
\subsubsection{Results}
\begin{table}\caption{Demand Only, 10 Markets}
\begin{center} \input{blpCode/blpD10.csv.table.tex} \end{center}
\end{table}
\begin{table}\caption{Supply and Demand, 10 Markets}
\begin{center} \input{blpCode/blpSD10.csv.table.tex} \end{center}
\end{table}
\begin{table}\caption{Demand Only, 37 Markets}
\begin{center} \input{blpCode/blpD37.csv.table.tex} \end{center}
\end{table}
\begin{table}\caption{Supply and Demand, 37 Markets}
\begin{center} \input{blpCode/blpSD37.csv.table.tex} \end{center}
\end{table}
\begin{table}\caption{Demand Only, 50 Markets, Fewer Parameters}
\begin{center} \input{blpCode/blpD50.csv.table.tex} \end{center}
\end{table}
\begin{table}\caption{Supply and Demand, 50 Markets, Fewer Parameters}
\begin{center} \input{blpCode/blpSD50.csv.table.tex} \end{center}
\end{table}
These results look rather dismal. In tables 1--4, the only parameters
that are well-estimated (perhaps too well estimated) are the
coefficients on cost shifters. These are also the only exogenous
covariates in the model. In the simulations all the product
characteristics are endogenous. I checked that my instruments are
correlated with the $x$'s and $p$'s, and they are. The results in
tables 5 and 6, where there are fewer parameters and more markets, are
better, but still not all that precise. The estimates of $\sigma$ are
uniformly downward biased. Note that I simply used an identity
weighting matrix. Perhaps optimally weighted GMM would work better. I
guess it's also possible that there's a bug in my code, but the
reasonably good results in tables 5 and 6 give me some confidence.
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Weakly Identified GMM: Estimation of the NKPC}
One popular use of GMM in applied macro has been estimating the
Neo-Keynesian Phillips Curve.\footnote{I wrote this part of these
notes for time series, so they feature a
time series type application. However, the section about weak
identification and identification robust inference is very relevant
for this class.} An important example is Gal\'{i} and
Gertler (1999). This is an interesting paper because it
involves a good amount of macroeconomics, validates a model that many
macroeconomists like, and (best of all for econometricians) has
become a leading example of weak identification in GMM. These notes
describe Gal\'{i} and Gertler's paper, then give a quick overview of
identification robust inference in GMM, and finally describe the
results of identification robust procedures for Gal\'{i} and
Gertler's models.
\subsection{Deriving the NKPC}
You should have seen this in macro, so I'm going to go through it
quickly. Suppose there is a continuum of identical firms that sell
differentiated products to a representative consumer with
Dixit-Stiglitz preferences over the goods. Prices are sticky in the
sense of Calvo (1983). More specifically, each period each firm has
a probability $1-\theta$ of being able to adjust its price. If
$p_t^*$ is the log price chosen by firms that adjust at time
$t$, then the evolution of the log price level will be
\begin{align}
p_{t} = \theta p_{t-1} + (1-\theta)p_t^* \label{pevo}
\end{align}
The first order condition (or maybe a first order approx to the first
order condition) for firms that get to adjust their
price at time $t$ is
\begin{align}
p_t^* = & (1-\beta \theta) \sum_{k=0}^\infty (\beta \theta)^k
E_t[mc_{t+k}^n + \mu] \label{pstar}
\end{align}
where $mc_t^n$ is log nominal marginal cost at time $t$, and $\mu$ is a
markup parameter that depends on consumer preferences. This first
order condition can be rewritten as:
\begin{align}
p_t^* = & (1-\beta \theta) (\mu + mc_t^n) + (1-\beta\theta)\sum_{k=1}^\infty
(\beta \theta)^k E_t[mc_{t+k}^n + \mu] \notag \\
= & (1-\beta \theta) (\mu + mc_t^n) + \beta \theta
E_t p_{t+1}^* \notag
\end{align}
Substituting in $p_t^* = \frac{p_t - \theta p_{t-1}}{1-\theta}$ gives:
\begin{align}
\frac{p_t - \theta p_{t-1}}{1-\theta} = & \frac{\beta
\theta}{1-\theta} E_t[p_{t+1} - \theta p_t] + (1-\beta
\theta)(\mu+mc_t^n) \notag \\
p_t - p_{t-1} = & \beta E_t[p_{t+1} - p_t] +
\frac{(1-\theta)(1-\theta\beta)}{\theta}(\mu+mc_t^n - p_t) \notag \\
\pi_t = & \beta E_t[\pi_{t+1}] +
\lambda (\mu + mc_t^n - p_t) \label{pidyn}
\end{align}
This is the NKPC. Inflation depends on expected inflation and real
marginal costs (more precisely, on the deviation of log real marginal
cost from its steady state; in the steady state, $mc^n = p - \mu$).
\subsection{Estimation}
\paragraph{Using the Output Gap}
Since real marginal costs are difficult to observe, people have noted
that in a model without capital, $mc_t^n - p_t \approx \kappa x_t$
where $x_t$ is the output gap (the difference between current output
and output in a model without price frictions). This suggests
estimating:
\begin{align*}
\beta \pi_t = \pi_{t-1} - \lambda \kappa x_t - \lambda \mu +
\epsilon_t
\end{align*}
When estimating this equation, people generally find that
$\widehat{-\lambda \kappa}$ is positive, contradicting the model.
\paragraph{GG}
Gal\'{i} and Gertler (1999) argued that there are at least two
problems with this approach: (i) the output gap is hard to measure and
(ii) the output gap may not be proportional to real marginal
costs. Gal\'{i}
and Gertler argue that the labor income share is a better proxy for
real marginal costs. With a Cobb-Douglas production function,
\[ Y_t = A_t K_t^{\alpha_k} L_t^{\alpha_l} \]
marginal cost is the ratio of the wage to the marginal product of
labor,
\begin{align*}
MC_t = & \frac{W_t}{P_t (\partial Y_t/\partial L_t)} = \frac{W_t L_t}{P_t
\alpha_l Y_t} \\
= & \frac{1}{\alpha_l} S_{Lt}
\end{align*}
Thus the deviation of log marginal cost from its steady state should
equal the deviation of log labor share from its steady state, $mc_t =
s_t$. This leads to moment conditions:
\begin{align}
E_t[(\pi_t - \lambda s_t - \beta \pi_{t+1}) z_t] & = 0 \\
E_t[(\theta \pi_t - (1-\theta)(1-\beta\theta) s_t - \theta \beta
\pi_{t+1}) z_t] & = 0 \label{ggmoments}
\end{align}
where $z_t$ are any variables in firms' information sets at time $t$.
As instruments, Gal\'{i} and Gertler use four lags of inflation, the
labor income share, the output gap, the long-short interest rate
spread, wage inflation, and commodity price inflation. Gal\'{i} and
Gertler estimate this model and find values of $\beta$ around 0.95,
$\theta$ around 0.85, and $\lambda$ around 0.05. In particular,
$\lambda>0$, in accordance with the theory, unlike when using the
output gap. The estimates of $\theta$ are a bit high. They imply an
average price duration of five to six quarters, which is much higher
than observed in the micro-data of Bils and Klenow (2004).
\subsection{Hybrid Phillips Curve}
The NKPC implies that price setting behavior is purely forward
looking. All inflation inertia comes from price stickiness in this
model. One might be concerned whether this is enough to capture the
observed dynamics of inflation. To answer this question, Gal\'{i}
and Gertler consider a more general model that allows for backward
looking behavior. In particular, they assume that a fraction $\omega$
of firms set prices equal to the optimal price last period
plus an inflation adjustment: $p_t^b = p_{t-1}^* + \pi_{t-1}$. The
rest of the firms behave optimally. This leads to the following inflation
equation:
\begin{align}
\pi_t = & \frac{(1-\omega)(1-\theta)(1-\beta\theta) mc_t + \beta
\theta E_t\pi_{t+1} + \omega \pi_{t-1}} {\theta +
\omega(1-\theta(1-\beta))} \label{HNKPC} \\
= & \lambda mc_t + \gamma^f E_t \pi_{t+1} + \gamma^b \pi_{t-1} \notag
\end{align}
As above, Gal\'{i} and Gertler estimate this equation using GMM. They
find $\hat{\omega} \approx 0.25$ with a standard error of $0.03$,
so a purely forward looking model is rejected. Their estimates of
$\theta$ and $\beta$ are roughly the same as above.
\subsection{Identification Issues}
Gal\'{i} and Gertler note that they can write their moment condition
in many ways, for example the HNKPC could be estimated from either of
the following moment conditions:
\begin{align}
E_t\left[ \left((\theta +
\omega(1-\theta(1-\beta))) \pi_t -
(1-\omega)(1-\theta)(1-\beta\theta) s_t - \beta
\theta \pi_{t+1} - \omega \pi_{t-1}\right) z_t \right] = &
0 \label{m1} \\
E_t \left[ \left( \pi_t -
\frac{(1-\omega)(1-\theta)(1-\beta\theta)} {\theta +
\omega(1-\theta(1-\beta))} s_t - \frac{\beta
\theta } {\theta +
\omega(1-\theta(1-\beta))} \pi_{t+1} - \frac{\omega}{\theta +
\omega(1-\theta(1-\beta))} \pi_{t-1}\right) z_t \right] = &
0 \label{m2}
\end{align}
Estimation based on these two moment conditions gives surprisingly
different results. In particular, (\ref{m1}) leads to an estimate of
$\omega$ of 0.265 with a standard error of 0.031, but (\ref{m2})
leads to an estimate of 0.486 with a standard error of 0.040. If
the model is correctly specified and well-identified, the two
equations should, asymptotically, give the same estimates. The fact
that the estimates differ suggests that either the model is
misspecified or not well identified.
\subsubsection{Analyzing Identification}
There's an old literature about analyzing identification conditions
in rational expectations models. Pesaran (1987) is the classic paper
that everyone seems to cite, but I have not read it. Anyway, the
idea is to solve the rational expectations model (\ref{HNKPC}) to
write it as an autoregression, write down a model for $s_t$ to
complete the system, and then analyze identification using familiar
SVAR or simultaneous equation tools. I will follow Mavroeidis
(2005). Another paper that does this is Nason and Smith (2002).
Solving (\ref{HNKPC}) and writing an equation for $s_t$ gives a
system like:
\begin{align}
\pi_t = & D(L) \pi_{t-1} + A(L) s_t + \epsilon_t \label{pirf} \\
s_t = & \rho(L) s_{t-1} + \phi(L) \pi_{t-1} + v_t \label{srf}
\end{align}
$D(L)$ and $A(L)$ are of order the maximum of $1$ and the orders of
$\rho(L)$ and $\phi(L)$, respectively.
An order condition for identification is that the order of $\rho(L)$
plus the order of $\phi(L)$ is at least two, so that you have at
least two valid instruments for $s_t$ and $\pi_{t+1}$ in
(\ref{HNKPC}). This condition can be tested by estimating
(\ref{srf}) and testing whether the coefficients are 0. Mavroeidis
does this and finds a p-value greater than 30\%, so
non-identification is not rejected. Mavroeidis then picks a wide
range of plausible values for the parameters in the model and
calculates the concentration parameter for these parameters. He
finds that the concentration parameter is often very close to zero.
Recall from 382 that in IV, a low concentration parameter indicates
weak instrument problems.
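The identification problem is easy to reproduce in a stripped-down simulation. If $s_t$ is an AR(1) with coefficient $\rho$ and inflation follows the purely forward looking NKPC $\pi_t = \beta E_t \pi_{t+1} + \lambda s_t$, the rational expectations solution is $\pi_t = c s_t$ with $c = \lambda/(1-\beta\rho)$, and any $(\tilde\lambda,\tilde\beta)$ with $\tilde\lambda + \tilde\beta \rho c = c$ satisfies the moment conditions; for example, the static model $\tilde\beta = 0$, $\tilde\lambda = c$. A sketch (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho, lam, beta = 50_000, 0.9, 0.05, 0.95
# labor share (marginal cost) deviation follows an AR(1)
s = np.zeros(T)
for t in range(1, T):
    s[t] = rho * s[t - 1] + rng.normal(scale=0.1)
# rational expectations solution of pi_t = beta*E_t[pi_{t+1}] + lam*s_t
c = lam / (1.0 - beta * rho)
pi = c * s

def gbar(l, b):
    """Sample moment of (pi_t - l*s_t - b*pi_{t+1}) z_t, z_t = (s_t, pi_{t-1})."""
    resid = pi[1:-1] - l * s[1:-1] - b * pi[2:]
    z = np.column_stack([s[1:-1], pi[:-2]])
    return z.T @ resid / len(resid)

print(np.abs(gbar(lam, beta)).max())  # ~0: the truth satisfies the moments
print(np.abs(gbar(c, 0.0)).max())     # ~0 too: a static model fits as well
```

A whole ridge of $(\tilde\lambda,\tilde\beta)$ values sets the moments to zero, so the objective function has no unique minimum: exactly the non-identification the order condition warns about when $s_t$ has no dynamics beyond a single AR root.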
\subsubsection{Weak Identification in GMM}
As with IV, when a GMM model is weakly identified, the usual
asymptotic approximations work poorly. Fortunately, there are
alternative inference procedures that perform better.
\paragraph{GMM Bias}\footnote{This section is based on Whitney's
notes from 386.} The primary approaches are based on the CUE
(continuously updating estimator) version of GMM. To understand why,
it is useful to write down the approximate finite sample bias of GMM.
If our moment conditions are $g(\beta) = \sum g_i(\beta) / T$ and
$\Omega(\beta) = E[g_i(\beta) g_i(\beta)']$ (in the iid case; for
time series, replace with an appropriate autocorrelation consistent
estimator), CUE minimizes:
\begin{align*} \hat{\beta} = \argmin_\beta
g(\beta)' \Omega(\beta)^{-1} g(\beta)
\end{align*}
That is, rather
than plugging in a preliminary estimate of $\beta$ to find the
weighting matrix, CUE continuously updates the weighting matrix as a
function of $\beta$. Suppose we use a fixed weighting matrix $A$
and do GMM. What is the expectation of the objective function?
Well, for iid data (if observations are correlated, we will get an
even worse bias) we have:
\begin{align*}
E\left[ g(\beta)' A g(\beta) \right] = & E\left[ \sum_{i,j}
g_i(\beta)' A g_j(\beta) /T^2 \right] \\
= & \sum_{i \neq j} E[g_i(\beta)]' A E[g_j(\beta)] / T^2 + \sum_i
E[g_i(\beta)' A g_i(\beta)]/T^2 \\
= & (1-T^{-1}) E[g_i(\beta)]' A E[g_i(\beta)] + tr(A\Omega(\beta))T^{-1}
\end{align*}
The first term is the population objective function, so it is
minimized at $\beta_0$. The second term, however, is not generally
minimized at $\beta_0$, causing $E[\hat{\beta}_T] \neq \beta_0$.
However, if we use $A=\Omega(\beta)^{-1}$, then the second term
becomes $m T^{-1}$, a constant that does not affect the minimizer.
This is sort of what CUE does. It is not exact, since we use
$\hat{\Omega}(\beta)$ instead of $\Omega(\beta)$. Nonetheless, CUE can
be shown to be less biased than two-step GMM. See Newey and Smith (2004).
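The bias comparison is easy to see in a linear IV Monte Carlo (my own sketch, not from Newey and Smith): with several weakish instruments and strong endogeneity, the median two-step GMM estimate is pulled toward OLS while CUE stays much closer to centered.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cue_and_twostep(y, x, z):
    n, K = z.shape
    def omega(b):
        r = y - x * b
        return (z * r[:, None] ** 2).T @ z / n     # Omega-hat(b)
    def Q(b):
        """CUE objective g(b)' Omega(b)^{-1} g(b)."""
        g = z.T @ (y - x * b) / n
        return g @ np.linalg.solve(omega(b), g)
    b_cue = minimize_scalar(Q, bounds=(-4.0, 4.0), method="bounded").x
    # two-step GMM: 2SLS first step, then re-weight with Omega(b1)^{-1}
    W1 = np.linalg.inv(z.T @ z / n)
    zx, zy = z.T @ x / n, z.T @ y / n
    b1 = (zx @ W1 @ zy) / (zx @ W1 @ zx)
    W2 = np.linalg.inv(omega(b1))
    b2 = (zx @ W2 @ zy) / (zx @ W2 @ zx)
    return b_cue, b2

rng = np.random.default_rng(3)
n, K, reps = 200, 10, 200
cue, two = [], []
for _ in range(reps):
    z = rng.normal(size=(n, K))
    v = rng.normal(size=n)
    u = 0.9 * v + np.sqrt(1 - 0.81) * rng.normal(size=n)  # strong endogeneity
    x = z @ np.full(K, 0.08) + v                          # weak-ish first stage
    y = u                                                 # true beta = 0
    bc, b2 = cue_and_twostep(y, x, z)
    cue.append(bc); two.append(b2)
print(np.median(two), np.median(cue))  # two-step pulled toward OLS; CUE less so
```

The one-dimensional minimization hides the main practical cost of CUE: its objective can be badly behaved far from the truth, so in higher dimensions good starting values and global search matter.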
Another view of the bias can be obtained by comparing the first order
conditions of CUE and two-step GMM. The first order condition for
GMM is
\begin{align}
0 = G(\beta)' \hat{\Omega}(\tilde{\beta})^{-1} g(\beta)
\end{align}
where $G(\beta) = \frac{\partial g}{\partial \beta} = \sum \frac{\partial g_i}{\partial \beta}/T$
and $\tilde{\beta}$ is the first step estimate of $\beta$. This term
will have bias because the $i$th observation in the sum used for $G$,
$\hat{\Omega}$, and $g$ will be correlated. Compare this to the first
order condition for CUE:
\begin{align*}
0 = & G(\beta)' \hat{\Omega}(\beta)^{-1} g(\beta) -
\frac{1}{2} g(\beta)' \hat{\Omega}(\beta)^{-1} \left( \sum \left(\frac{\partial
g_i}{\partial \beta} g_i(\beta)' + g_i(\beta) \frac{\partial
g_i}{\partial \beta}' \right) / T \right) \hat{\Omega}(\beta)^{-1}
g(\beta) \\
= & \left[G(\beta) - \left( \sum \frac{\partial
g_i}{\partial \beta}
g_i(\beta)'/T\right) \hat{\Omega}(\beta)^{-1} g(\beta)\right]' \hat{\Omega}(\beta)^{-1}
g(\beta)
\end{align*}
The term in brackets is the projection of $G(\beta)$ onto the space
orthogonal to $g(\beta)$. Hence, the term in brackets is
uncorrelated with $g(\beta)$. This reduces bias.\footnote{There is
still some bias due to parts of $\hat{\Omega}$ being correlated with
$g$ and $G$.}
\paragraph{Identification Robust Inference}
The lower bias of CUE suggests that inference based on CUE might be
more robust to small sample issues than traditional GMM inference.
This is indeed the case. Stock and Wright (2000) showed that under
$H_0: \beta = \beta_0$, $T$ times the CUE objective function converges
in distribution to a $\chi^2_m$, where $m$ is the number of moment
conditions. Moreover, this convergence occurs whether the model is
strongly, weakly\footnote{Defining weak GMM asymptotics involves
introducing a bunch of notation, so I'm not going to go through it.
The idea is essentially the same as in linear models. See Stock and
Wright (2000) for details.}, or non-identified. Some authors
call the CUE objective function the $S$-statistic. Others call it
the $AR$-statistic because in linear models, the $AR$ statistic is
the same as the CUE objective function. The $S$-stat has the same
properties as the $AR$-stat discussed in 382. Importantly, its
degrees of freedom grows with the number of moments, so it may have
lower power in very over identified models. Also, an $S$-stat test
may reject either because $\beta \neq \beta_0$ or because the model
is misspecified. This can lead to empty confidence sets.
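A sketch of an $S$-statistic confidence set in a linear IV model, obtained by inverting the test over a grid: collect the $\beta_0$ values where $T$ times the CUE objective is below the $\chi^2_m$ critical value. The DGP is illustrative.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, m = 500, 3                        # observations, moment conditions
z = rng.normal(size=(n, m))
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n)     # endogenous error
x = z @ np.array([1.0, 0.5, 0.5]) + v
y = 1.0 * x + u                      # true beta = 1

def S(b0):
    """S-statistic: n * g(b0)' Omega(b0)^{-1} g(b0), asy. chi2_m under H0."""
    r = y - x * b0
    g = z.T @ r / n
    omega = (z * r[:, None] ** 2).T @ z / n
    return n * g @ np.linalg.solve(omega, g)

# invert the test over a grid to get the identification robust set
grid = np.linspace(0.0, 2.0, 401)
cset = grid[[S(b) < chi2.ppf(0.95, m) for b in grid]]
print(cset)  # typically a short run of grid points around beta = 1
```

With strong instruments the set is a short interval around the CUE estimate; with weak instruments it becomes wide or unbounded, and with misspecification it can be empty, exactly the behaviors described above.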
Kleibergen (2005) developed an analog of the Lagrange multiplier
statistic that, like the $S$-stat, has the same limiting distribution
regardless of identification. The LM stat is based on the fact that
under $H_0: \beta = \beta_0$, the derivative of the objective
function at $\beta_0$ should be approximately zero. Kleibergen applies
this principle to the CUE objective function. Let $\hat{D}(\beta) =
G(\beta)
- \widehat{acov}\left(G(\beta),g(\beta) \right) \hat{\Omega}(\beta)^{-1} g(\beta)$
(as above, for iid data, $\widehat{acov}\left(G(\beta),g(\beta)\right)
= \sum \frac{\partial g_i}{\partial \beta} g_i(\beta)'/T$).
Kleibergen's statistic is
\begin{align}
KLM = T g(\beta)' \hat{\Omega}(\beta)^{-1} \hat{D}(\beta)
( \hat{D}(\beta)' \hat{\Omega}(\beta)^{-1} \hat{D}(\beta))^{-1} \hat{D}(\beta)'
\hat{\Omega}(\beta)^{-1}
g(\beta) \indist \chi^2_p
\end{align}
It is asymptotically $\chi^2$ with $p=$(number of parameters) degrees
of freedom. The degrees of freedom of KLM does not depend on the
degree of overidentification. This can give it better power
properties than the AR/S stat. However, since it only depends on the
first order condition, the KLM statistic is zero not only at the
minimum of the CUE objective function but also at local minima, local
maxima, and inflection points. This property leads Kleibergen to
consider an identification robust version of Hansen's J-statistic for
testing the overidentifying restrictions. Kleibergen's J is
\begin{align}
J(\beta) = S(\beta) - KLM(\beta) \indist \chi^2_{m-p}
\end{align}
Moreover, $J$ is asymptotically independent of $KLM$, so you can
test using both of them, yielding a joint test with size $\alpha
= \alpha_J + \alpha_K - \alpha_J \alpha_K$.
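For a linear IV model the pieces of $KLM$ can be computed directly. Here is an illustrative sketch (my own toy DGP, one parameter so $p=1$) evaluating the statistic at the true value and at a distant value:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, m = 500, 3                        # observations, moment conditions
z = rng.normal(size=(n, m))
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n)
x = z @ np.array([1.0, 0.5, 0.5]) + v
y = 1.0 * x + u                      # true beta = 1

def klm(b0):
    """Kleibergen's KLM statistic for H0: beta = b0, asymptotically chi2_1."""
    r = y - x * b0
    g = z.T @ r / n                              # moment vector g(b0)
    omega = (z * r[:, None] ** 2).T @ z / n      # Omega-hat(b0)
    G = -(z.T @ x) / n                           # Jacobian of the moments
    acov = -(z * (x * r)[:, None]).T @ z / n     # sum (dg_i/db) g_i' / n
    D = G - acov @ np.linalg.solve(omega, g)     # Kleibergen's D-hat(b0)
    Oi_D = np.linalg.solve(omega, D)
    return n * (g @ Oi_D) ** 2 / (D @ Oi_D)

print(klm(1.0))  # typically below chi2.ppf(0.95, 1) = 3.84 under H0
print(klm(0.0))  # large here: H0 false and the instruments are strong
```

Note the caveat from the text: at stationary points of the CUE objective away from the truth, $klm$ would also be small, which is why one pairs it with the $J$ statistic.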
If you have a great memory, you might also remember Moreira's
conditional likelihood ratio test from covering weak instruments in
382. There's also a GMM version of this test discussed in Kleibergen
(2005).
\subsubsection{Results of Weak Identification Robust Inference for
HNKPC}
\paragraph{Kleibergen and Mavroeidis}
Kleibergen and Mavroeidis (2008) extend Kleibergen's tests described
above, which only allow testing the full set of parameters, to
tests for subsets of parameters. As an application, Kleibergen and
Mavroeidis (2008) simulate a HNKPC model and consider
testing whether the fraction of backward looking firms (which they
call $\alpha$, but GG and I call $\omega$) equals one half.
Figure \ref{kmpc} shows the frequency of rejection for various true
values of $\alpha$. The Wald test badly overrejects when the true
$\alpha$ is 1/2. The KLM and JKLM have the correct size under $H_0$,
but they also have no power against any of the alternatives. It
looks like identification is a serious issue.
\begin{figure}
\caption{Kleibergen and Mavroeidis Power Curve \label{kmpc}}
\includegraphics[width=\linewidth]{km}
\end{figure}
\paragraph{Dufour, Khalaf, and Kichian (2006)}
They use the $AR$ and $K$ statistics to construct confidence sets for
Gal\'{i} and Gertler's model. Figure \ref{dkk} shows the results.
The confidence sets are reasonably informative. The point
estimates imply an average price duration of 2.75 quarters, which is
much closer to the micro-data evidence (Bils and Klenow's average is
1.8 quarters) than Gal\'{i} and Gertler's estimates. Also, although
not clear from this figure, Dufour, Khalaf, and Kichian find that
Gal\'{i} and Gertler's point estimates lie outside their 95\%
confidence sets.
\begin{figure}
\caption{Dufour, Khalaf, and Kichian Confidence Sets\label{dkk}}
\includegraphics[width=\linewidth]{dkk}
\end{figure}
\end{document}