Author:
Paul Schrimpf
Date: 2019-02-01
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This document was created using Weave.jl. The code is available on github. The same document generates both static webpages and associated Jupyter notebooks.
\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]
This assignment will reproduce some of the results of Grieco and McDevitt (2017).
https://vse.syzygy.ca provides a convenient browser-based interface to Julia. Open it and log in. This assignment is in a git repository at https://github.com/UBCECON567/Dialysis. Start by cloning the git repository to your syzygy directory. Open a terminal in syzygy (File -> New -> Terminal). This will open a Linux shell in your browser. To clone the git repository, enter
git clone https://github.com/UBCECON567/Dialysis
This will create a directory called Dialysis
containing all the files related to this assignment.
Clicking on the folder icon near the top left of the screen opens a file browser panel. Use it to open the Dialysis/notebooks
folder. You can complete this assignment by modifying the dialysis.ipynb
notebook. I recommend creating a copy of this notebook, and then working on the copy. You can create a copy by right clicking in the file browser panel. Now open your copy of the notebook.
Notebooks consist of a series of “cells” of either text written in markdown or Julia code. If you double click on any of the text cells, you can see the markdown that created it. To go back to the formatted text, execute the cell by either clicking the play icon on the top of the page or typing ctrl and enter together.
This assignment will try to explain aspects of Julia as needed. However, if at some point you feel lost, you may want to consult some of the following resources. Reading the first few sections of either QuantEcon or Think Julia is recommended.
Think Julia: a detailed introduction to Julia and programming more generally. Long, but recommended, especially if you're new to programming.
From the julia prompt, you can access documentation with ?functionname
. Some packages have better documentation than others.
https://julialang.org/ is the website for Julia
Documentation for core Julia can be found at https://docs.julialang.org/en/v1/. All Julia packages also have a github page. Many of these will include package specific documentation.
Notes on Julia from ECON 628: much of this is part of QuantEcon, but not all.
The Julia Express: a short book with examples of Julia usage.
Like many programming environments (R, Python, etc.), Julia relies on packages for much of its functionality. The following code will download and install all the packages required for this assignment (but the packages will still need to be loaded with using ...
). Execute this cell. It will take some time. While the cell is running, there will be a [*]
to the left of it. This will change to [1]
(or some other number) after the cell is finished running. The number indicates the order in which the cell was executed. You can execute cells out of order. This can be useful during development, but you should always make sure that your notebook works correctly when cells are executed in order before considering it complete (that is, make sure the “Run -> Restart Kernel and Run all Cells” menu option produces the output you want). Don’t worry about understanding the details of the code in this section.
```julia
using Pkg          # loads the Pkg package
Pkg.activate("..") # loads environment specified in "../Project.toml"
                   # This contains a list of packages and their versions
Pkg.instantiate()  # installs all packages in the environment
```
Updating registry at ~/.julia/registries/General
Updating git-repo https://github.com/JuliaRegistries/General.git
Some useful functions for this assignment are in ../src/Dialysis.jl
. We load those functions now.
```julia
using Revise
if (!("../src" ∈ LOAD_PATH))
  push!(LOAD_PATH, "../src") # so that `using` knows where to find Dialysis.jl
end
using Dialysis
```
The functions in Dialysis.jl are organized into a module, just like any other Julia package. Revise.jl is a package to make it easier to develop packages (like Dialysis.jl). In particular, using Revise
will make it so that as soon as you save any modifications to Dialysis.jl, those modifications will be loaded into your Julia session without you doing anything extra.
Now let’s get to work. I originally downloaded the data for this problem set from https://dialysisdata.org/content/dialysis-facility-report-data. As in Grieco and McDevitt (2017) the data comes from Dialysis Facility Reports (DFRs) created under contract to the Centers for Medicare and Medicaid Services (CMS). However, there are some differences. Most notably, this data covers 2006-2014, instead of 2004-2008 as in Grieco and McDevitt (2017) .
The R script https://bitbucket.org/paulschrimpf/econ565/src/master/assignments/production-R/dialysis/downloadDialysisData.R downloads, combines, and cleans the data. Unfortunately, dialysisdata.org has reorganized their website, and the data no longer seems to be available. Similar (likely identical) data is available from https://data.cms.gov/browse?q=dialysis. It might be useful to look at the documentation included with any of the “Dialysis Facility Report Data for FY20XX” zip files. Anyway, the result of the R script is the dialysisFacilityReports.rda
file contained in this git repository. This R data file contains most of the variables used by Grieco and McDevitt (2017).
```julia
using DataFrames      # DataFrames.jl is a package for storing and
                      # interacting with datasets
dialysis = loaddata() # loaddata() is a function I wrote that is part
                      # of Dialysis.jl. It returns a DataFrame
typeof(dialysis)
```
DataFrame
We will begin our analysis with some exploratory statistics and figures. There are at least two reasons for this. First, we want to check for any anomalies in the data, which may indicate an error in our code, our understanding of the data, or the data itself. Second, we should try to see if there are any striking patterns in the data that deserve extra attention. We can get some information about all the variables in the data as follows
```julia
describe(dialysis)
```
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Union… | DataType |
| 1 | provfs | | "012306" | | "682567" | 7015 | | String |
| 2 | year | 2010.23 | 2006.0 | 2010.0 | 2014.0 | | | Float64 |
| 3 | comorbidities | 4.70985 | 0.44444 | 4.69811 | 10.7619 | | 7982 | Float64 |
| 4 | comorbidities_p3 | 4.70985 | 0.44444 | 4.69811 | 10.7619 | | 7982 | Float64 |
| 5 | hemoglobin | 9.77373 | 3.6 | 9.7727 | 18.0 | | 1608 | Float64 |
| 6 | hemoglobin_p3 | 9.77326 | 3.6 | 9.7715 | 18.0 | | 1608 | Float64 |
| 7 | std_mortality | 1.00545 | 0.0 | 0.96694 | 4.97293 | | 2239 | Float64 |
| 8 | std_mortality_p3 | 1.02017 | 0.0 | 0.98193 | 4.76552 | | 2239 | Float64 |
| 9 | std_hosp_days | 0.976877 | 0.0 | 0.932715 | 4.94606 | | 2038 | Float64 |
| 10 | std_hosp_days_p3 | 0.973781 | 0.0 | 0.92961 | 4.94606 | | 2038 | Float64 |
| 11 | std_hosp_admit | 0.997463 | 0.0 | 0.96394 | 3.87802 | | 2071 | Float64 |
| 12 | std_hosp_admit_p3 | 0.995428 | 0.0 | 0.96269 | 4.1462 | | 2071 | Float64 |
| 13 | pct_septic | 11.3111 | 0.0 | 10.7143 | 100.0 | | 1354 | Float64 |
| 14 | pct_septic_p3 | 11.5805 | 0.0 | 11.111 | 100.0 | | 1354 | Float64 |
| 15 | n_hosp_admit | 93.215 | 0.0 | 80.0 | 814.0 | | 1261 | Float64 |
| 16 | n_hosp_admit_p3 | 93.8085 | 0.0 | 80.0 | 840.0 | | 1261 | Float64 |
| 17 | n_hosp_patients | 74.0305 | 0.0 | 65.0 | 539.0 | | 1261 | Float64 |
| 18 | n_hosp_patients_p3 | 74.3494 | 0.0 | 66.0 | 539.0 | | 1261 | Float64 |
| 19 | patient_years_hd | 49.9054 | 0.0 | 43.4 | 333.421 | | 1261 | Float64 |
| 20 | patient_years_hd_p3 | 50.1318 | 0.0 | 43.587 | 333.421 | | 1261 | Float64 |
| 21 | city | | ABBEVILLE | | ZUNI | 3199 | | String |
| 22 | name | | | | ZZ CLOSED TRINITY REGIONAL MEDICAL CENTER DIALYSIS - WEBSTER | 9276 | | String |
| 23 | state | | AK | | WY | 56 | | String |
| 24 | network | 9.64447 | 1 | 9.0 | 18 | | | Int32 |
| 25 | chain_name | | | | WEST PENN ALLEGHENY HEALTH SYST | 126 | | String |
| 26 | profit_status | | For Profit | | Unavailable | 3 | | CategoricalString{UInt8} |
| 27 | stations | 17.9418 | 0.0 | 17.0 | 108.0 | | 3227 | Float64 |
| 28 | total_staff | 14.552 | 0.0 | 13.0 | 235.0 | | 644 | Float64 |
| 29 | dieticiansFT | 0.516599 | 0.0 | 0.0 | 12.0 | | 1114 | Float64 |
| 30 | dieticiansPT | 0.604436 | 0.0 | 1.0 | 20.0 | | 1114 | Float64 |
| 31 | nurseFT | 4.93611 | 0.0 | 4.0 | 103.0 | | 1114 | Float64 |
| 32 | nursePT | 1.00594 | 0.0 | 0.0 | 72.0 | | 1114 | Float64 |
| 33 | ptcareFT | 5.66502 | 0.0 | 5.0 | 140.0 | | 1114 | Float64 |
| 34 | ptcarePT | 0.838336 | 0.0 | 0.0 | 32.0 | | 1114 | Float64 |
| 35 | social_workerFT | 0.567487 | 0.0 | 1.0 | 42.0 | | 5712 | Float64 |
| 36 | social_workerPT | 0.563829 | 0.0 | 1.0 | 19.0 | | 1114 | Float64 |
| 37 | total_staff_l1 | 14.7059 | 0.0 | 13.0 | 235.0 | | 1628 | Float64 |
| 38 | dieticiansFT_l1 | 0.506917 | 0.0 | 0.0 | 25.0 | | 1745 | Float64 |
| 39 | dieticiansPT_l1 | 0.618645 | 0.0 | 1.0 | 25.0 | | 1745 | Float64 |
| 40 | nurseFT_l1 | 4.90774 | 0.0 | 4.0 | 103.0 | | 1745 | Float64 |
| 41 | nursePT_l1 | 1.04822 | 0.0 | 0.0 | 72.0 | | 1745 | Float64 |
| 42 | ptcareFT_l1 | 5.65182 | 0.0 | 5.0 | 225.0 | | 1745 | Float64 |
| 43 | ptcarePT_l1 | 0.872588 | 0.0 | 0.0 | 32.0 | | 1745 | Float64 |
| 44 | social_workerFT_l1 | 0.562001 | 0.0 | 1.0 | 42.0 | | 6176 | Float64 |
| 45 | social_workerPT_l1 | 0.574905 | 0.0 | 1.0 | 25.0 | | 1745 | Float64 |
| 46 | patient_months | 771.833 | 0.0 | 684.0 | 4849.0 | | 151 | Float64 |
| 47 | patient_years_rom | 67.1459 | 0.0 | 57.656 | 441.339 | | | Float64 |
| 48 | pct_fistula | 55.264 | 0.0 | 55.977 | 100.0 | | 1215 | Float64 |
| 49 | pct_female | 44.7421 | 0.0 | 44.8 | 100.0 | | 131 | Float64 |
| 50 | patient_age | 61.4366 | 7.3077 | 61.6972 | 85.2857 | | 131 | Float64 |
| 51 | patient_esrd_years | 4.4062 | 0.3121 | 4.3491 | 19.9425 | | 131 | Float64 |
| 52 | treatment_type | | Hemodialysis | | Unavailable | 4 | | CategoricalString{UInt8} |
| 53 | inspect_date | | 1992-05-20 | | 2015-05-29 | 3273 | 329 | Date |
| 54 | inspect_result | | . | | Unknown | 12 | 2011 | CategoricalString{UInt8} |
| 55 | inspect_cfc_cites | 0.293753 | 0.0 | 0.0 | 10.0 | | 3918 | Float64 |
| 56 | inspect_std_cites | 5.89393 | 0.0 | 4.0 | 153.0 | | 3921 | Float64 |
| 57 | days_since_inspection | 704.886 | -177.0 | 557.0 | 7454.0 | | 329 | Float64 |
| 58 | original_chain_name | | | | WELLSPAN DIALYSIS | 189 | | String |
The meanings of these variables are as follows:
Variable | Definition |
---|---|
provfs | provider identifier |
year | year |
comorbidities | average patient comorbidities |
hemoglobin | average patient hemoglobin level |
std_mortality | standardized mortality ratio |
std_hosp_days | standardized hospitalization days |
std_hosp_admit | standardized hospitalization admittance rate |
pct_septic | percent of patients hospitalized due to septic infection |
n_hosp_admit | number of hospitalizations |
n_hosp_patients | |
patient_years_hd | patient years at risk of hospitalization |
city | city |
name | provider name |
state | state |
chain_name | name of chain if provider is part of one |
profit_status | whether for profit |
stations | number of dialysis stations |
total_staff | total staff |
dieticiansFT | full-time renal dieticians |
dieticiansPT | part-time renal dieticians |
nurseFT | full-time nurses (>32 hours/week) |
nursePT | part-time nurses (<32 hours/week) |
ptcareFT | full-time patient care technicians |
ptcarePT | part-time patient care technicians |
social_workerFT | full-time social workers |
social_workerPT | part-time social workers |
patient_months | number of patient-months treated during the year |
patient_years_rom | patient-years at risk of mortality |
pct_fistula | the percentage of patient months in which the patient received dialysis through arteriovenous (AV) fistulae |
pct_female | percent of female patients |
patient_age | average age of patients |
patient_esrd_years | average number of years patients have had end stage renal disease |
treatment_type | types of treatment provided at facility |
inspect_date | date of most recent inspection |
inspect_result | result of most recent inspection |
inspect_cfc_cites | number of condition for coverage deficiencies in most recent inspection |
inspect_std_cites | number of standard deficiencies in most recent inspection |
days_since_inspection | days since last inspection |
The raw data contain information on many variables in each of the previous 4 years. Staffing variables with no suffix are staff as of January 31 of year, as reported in year + 1. Staffing variables with the "_l1" suffix are staff as of January 31 of year - 1, as reported in year + 1. If there were no reporting errors, the "_l1" variables would equal the lag of the ones without the suffix. However, you might find that this is not the case.
As explained in downloadDialysisData.R, data collected in year Y has information on most variables in years Y-1, Y-2, Y-3, and Y-4. However, for some variables and survey years, only information in years Y-2, Y-3, and Y-4 is included. For such variables, at year Y-1, I use the value reported in survey year Y if it is available. If not, I use the value reported in survey year Y+1. The variables ending with "_p3" instead use the convention of using Y-2 values if available and the Y-1 ones if not. Again, if there were no reporting errors, the variables with and without "_p3" would be the same.
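To see what "no reporting errors" would mean concretely, here is a self-contained toy check of the "_l1" convention (the numbers are hypothetical, not from the data):

```julia
using Statistics
# Toy panel for one facility. With no reporting errors, total_staff_l1
# in year t equals total_staff in year t-1. Here the 2008 report
# disagrees (13.0 vs 12.0), mimicking a reporting error.
years          = [2006, 2007, 2008]
total_staff    = [10.0, 12.0, 11.0]    # staff as of Jan 31 of `year`
total_staff_l1 = [missing, 10.0, 13.0] # staff as of Jan 31 of year-1
lagged = [missing; total_staff[1:end-1]] # hand-built one-year lag
agree = [ismissing(a) || ismissing(b) ? missing : a == b
         for (a, b) in zip(lagged, total_staff_l1)]
println("share agreeing: ", mean(skipmissing(agree)))
```

On the real data you could run the same comparison facility by facility, for example using `panellag(:total_staff, dialysis, :provfs, :year, 1)` from Dialysis.jl in place of the hand-built lag (the lag sign here is an assumption, following the convention that -1 produced a lead when constructing `invest` below).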
There are three variables for the number of patients treated. The data documentation describes patient_months
as
“Prevalent Hemodialysis Patient Months (7a): The monthly prevalent hemodialysis patient count at a facility includes all non-transient patients (home and in-center) who receive hemodialysis as of the last day of that calendar month. Incident patients (those who received ESRD treatment for the first time ever) are included in this count. Row 7a reports the number of prevalent hemodialysis patient months reported at the facility each year. The number of patient months over a time period is the sum of patients reported for the months covered by the time period. An individual patient may contribute up to 12 patient months per year.”
patient_years_rom
is the number of patient years at risk of mortality. patient_years_hd
is the number of patient years at risk of hospitalization. Since hospitalization data is constructed from Medicare records, a patient is considered at risk of hospitalization only when one can be reasonably certain that a hospitalization would be billed to Medicare. Dialysis patients who pay for hospitalization by other means could have unobserved hospitalizations. The data guide explains,
“Ideally, this table includes only patients whose Medicare billing records include all hospitalizations for the period. To achieve this goal, we require that patients reach a certain level of Medicare-paid dialysis bills to be included in hospitalization statistics, or that patients have Medicare-paid inpatient claims during the period. For the purpose of analysis, each patient’s follow-up time is broken into periods defined by time since dialysis initiation. For each patient, months within a given period are included if that month in the period is considered ‘eligible’; a month is deemed eligible if it is within two months of a month having at least $900 of Medicare-paid dialysis claims or at least one Medicare-paid inpatient claim. In setting this criterion, our aim is to achieve completeness of information on hospitalizations for all patients included in the years at risk.”
Not all variables used by Grieco and McDevitt (2017) are included here. Some variables will need to be transformed to be comparable to what is in the paper. For example, net investment in stations in year \(t\) is the difference between the number of stations in year \(t+1\) and the number in year \(t\).
```julia
# sort data by :provfs, :year
# Function names that end with ! indicate that the function will
# modify one (or more) of its inputs. In this case, sort! modifies
# the dialysis DataFrame.
sort!(dialysis, (:provfs, :year))

# Things starting with : are Symbols. Names of variables within a
# DataFrame must be Symbols, so they all start with :
# We can access a single column of a DataFrame by writing
# dialysis[:stations]. This will be a 1-dimensional Array with length
# equal to the number of rows in the dialysis DataFrame.

# panellag is a function defined in Dialysis.jl. It creates lags and
# leads of variables in panel data, inserting missing values where
# appropriate.
# Putting dialysis[:invest] on the left will create a new column in
# the dialysis DataFrame.
dialysis[:invest] = panellag(:stations, dialysis, :provfs, :year, -1) -
    dialysis[:stations];
# The ; prevents the notebook from printing the output of the last
# command. Otherwise, notebooks display the output of the last command
# in each cell.
```
We can also create labor and hiring. Note that the choices of giving 0.5 weight to part-time workers, including social workers, and weighting all types of staff equally are all somewhat arbitrary and may not agree exactly with what Grieco and McDevitt (2017) did.
```julia
dialysis[:labor] = (dialysis[:nurseFT] + 0.5*dialysis[:nursePT] +
                    dialysis[:ptcareFT] + 0.5*dialysis[:ptcarePT] +
                    dialysis[:dieticiansFT] + 0.5*dialysis[:dieticiansPT] +
                    dialysis[:social_workerFT] + 0.5*dialysis[:social_workerPT])
dialysis[:hiring] = panellag(:labor, dialysis, :provfs, :year, -1) -
    dialysis[:labor];
```
Next, create for-profit and chain indicators.
```julia
# create a Boolean for-profit indicator
dialysis[:for_profit] = dialysis[:profit_status].=="For Profit"
# The dot in .== is an example of broadcasting. It's very common to
# want to apply the same function to all elements of an array.
# Broadcasting does exactly this. If A is an array, and f() is a
# function that operates on scalars, then f.(A) will apply f to each
# element of A and return an array of the results. The above .==
# compares each element of the array dialysis[:profit_status] to the
# scalar string "For Profit".

# similarly create indicators for the two largest chains
dialysis[:fresenius] = dialysis[:chain_name].=="FRESENIUS"
dialysis[:davita] = dialysis[:chain_name].=="DAVITA";
```
State inspection rates are a bit more complicated to create.
```julia
using Statistics # for mean, std, and so on

# first make an indicator for inspected in the last year
dialysis[:inspected_this_year] =
    ((dialysis[:days_since_inspection].>=0) .&
     (dialysis[:days_since_inspection].<365))
# then take the mean by state
stateRates = by(dialysis, [:state, :year],
                # by(data, :var, f) will apply f to each group of data
                # with a different value of :var
                df -> mean(skipmissing(df[:inspected_this_year])))
# df -> mean(skipmissing(df[:inspected_this_year])) is a shorthand way
# to define a function. It's equivalent to
#
# function f(df)
#   mean(skipmissing(df[:inspected_this_year]))
# end
#
# skipmissing skips missing values inside a DataFrame. Most arithmetic
# functions will not do what you want if missing values are included.

# rename the variable in the stateRates DataFrame
rename!(stateRates, :x1 => :state_inspection_rate)
# merge the stateRates with the dialysis data
dialysis = join(dialysis, stateRates, on = [:state, :year]);
```
Creating the number of competitors in the same city is somewhat similar. Note that Grieco and McDevitt (2017) use the number of competitors in the same HSA, which would be preferable. However, this dataset does not contain information on HSAs. If you are feeling ambitious, you could try to find data linking city and state to HSA, and use that to calculate competitors in the same HSA.
```julia
dialysis[:city] = uppercase.(dialysis[:city])
comps = by(dialysis, [:city, :year],
           df -> mapreduce((x) -> ifelse(ismissing(x), 0, 1*(x>0)),
                           +, df[:patient_months]))
rename!(comps, :x1 => :competitors)
dialysis = join(dialysis, comps, on = [:city, :year]);
```
Create a table (or multiple tables) similar to Tables 1-3 of Grieco and McDevitt (2017). Comment on any notable differences. The following code will help you get started.
```julia
using Statistics
# at the very least, you will need to change this list
vars = [:patient_years_rom, :labor, :hiring]
# You shouldn't need to change this function, but you can if you want
function summaryTable(df, vars;
                      funcs=[mean, std, x->length(collect(x))],
                      colnames=[:Variable, :Mean, :StDev, :N])
  # In case you want to search for information about the syntax used
  # here, [XXX for XXX] is called a comprehension.
  # The ... is called the splat operator.
  DataFrame([vars [[f(skipmissing(df[v])) for v in vars] for f in funcs]...],
            colnames)
end
summaryTable(dialysis, vars)
```
| | Variable | Mean | StDev | N |
|---|---|---|---|---|
| | Any | Any | Any | Any |
1 | patient_years_rom | 67.1459 | 45.6089 | 48224 |
2 | labor | 13.1744 | 8.34365 | 42512 |
3 | hiring | 0.422357 | 4.30587 | 34665 |
Create some figures to explore the data. Try to be creative. Are there any strange patterns or other obvious problems with the data?
Here are some examples to get started. You may want to look at the StatPlots.jl, Plots.jl, or VegaLite.jl github pages for more examples.
```julia
using StatPlots, Plots
```
This package has been renamed to StatsPlots. The name StatPlots has been deprecated.
Please run
]rm StatPlots
]add StatsPlots
Failure to update will mean that you do not receive new developments.
```julia
Plots.gr(fmt=:png)
dialysis[:patient_years] = dialysis[:patient_months]/12
# missings will mess up corrplot
vars = [:patient_years, :patient_years_hd, :patient_years_rom]
inc = completecases(dialysis[vars])
@df dialysis[inc,:] corrplot([:patient_years :patient_years_hd :patient_years_rom])
```
```julia
function yearPlot(var)
  data = dialysis[completecases(dialysis[[:year, var]]),:]
  scatter(data[:year], data[var], alpha=0.1, legend=:none,
          markersize=3, markerstrokewidth=0.0)
  yearmeans = by(data, :year,
                 mean = var => x->mean(skipmissing(x)),
                 q01 = var => x->quantile(skipmissing(x), 0.01),
                 q10 = var => x->quantile(skipmissing(x), 0.1),
                 q25 = var => x->quantile(skipmissing(x), 0.25),
                 q50 = var => x->quantile(skipmissing(x), 0.50),
                 q75 = var => x->quantile(skipmissing(x), 0.75),
                 q90 = var => x->quantile(skipmissing(x), 0.9),
                 q99 = var => x->quantile(skipmissing(x), 0.99))
  @df yearmeans plot!(:year, :mean, colour = ^(:black), linewidth=4)
  @df yearmeans plot!(:year, cols(3:ncol(yearmeans)),
                      colour = ^(:red), alpha=0.3, legend=:none,
                      xlabel="year", ylabel=String(var))
end
yearPlot(:labor)
```
The above plot shows a scatter of labor vs year. The thick black line shows average labor each year. The red lines are the 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.99 quantiles conditional on year.
Grieco and McDevitt (2017) use the residuals from regressing the infection rate on patient characteristics as a measure of quality. Since the infection rate is a noisy measure of quality, they instrument with the standardized mortality ratio as a second measure of quality. Medicare collects the data we are using in part to create the “Dialysis Facility Compare” website, which is meant to allow consumers to compare quality of dialysis facilities. Browsing around the Dialysis Facility Compare or by looking at the first few pages of a sample Dialysis Facility Report, you will see that there are a number of other variables that Medicare considers indicators of quality. Pick one of these (it may or may not be included in the extract of data I provided), and argue for or against using it instead of or in addition to the septic infection rate and standardized mortality ratio.
We can construct residuals from an OLS regression as follows:
""" ols_residuals(data::AbstractDataFrame, y::Symbol, x::Array{Symbol,1}; intecept::Bool=true) This is a doc string. After executing this cell, if you type ?ols_residuals, you will see this text. Calculate residuals from an OLS regression of data[y] on data[x] Inputs: - `data` DataFrame containg y and x - `y` Symbol specifying y variable - `x` Symbol specifying x variables Output: - Vector of residuals of length = nrow(data) """ function ols_residuals(data::DataFrame, y::Symbol, x::Array{Symbol,1}; # arguments following the # are optional intercept::Bool=true ) # The :: are type specifications. They could be left out, and this # function would still work just fine. One of their purposes are to # document what inputs this function expects, and throw useful error # messages if someone tries calling the function on the wrong types. inc = completecases(data[[y, x...]]) # deal with missing Y = disallowmissing(data[y][inc]) if (intercept) X = [ones(sum(inc)) data[x][inc,:]] else X = data[x][inc,:] end X = disallowmissing(convert(Matrix, X)) # you can type Greek and some other LaTeX characters by typing their LaTeX # code followed by tab, e.g. \beta<TAB> and \in<TAB> β = X \ Y # β ∈ argmin_b || X*b - Y || ϵ = Y - X*β if (any(.!inc)) # add back in missings resid = Array{Union{Missing,eltype(ϵ)},1}(undef, nrow(data)) resid .= missing resid[inc] = ϵ return(resid) else # no missing, just return ϵ return(ϵ) end end q = -ols_residuals(dialysis, :pct_septic, [:days_since_inspection, :patient_age, :pct_female, :patient_esrd_years, :pct_fistula, :comorbidities, :hemoglobin]);
Of course, regression is common enough that there are already Julia packages for it. I included the ols_residuals
only for pedagogical purposes. Whenever there exists a well-used package, it is (usually) better to use the package than try to write your own functions. Here’s how to accomplish the same thing using FixedEffectModels.jl.
```julia
using FixedEffectModels
dialysis[:idcat] = categorical(dialysis[:provfs])
# FixedEffectModels requires clustering and fixed effect variables to
# be categorical
qreg = reg(dialysis,
           @model(pct_septic ~ days_since_inspection + patient_age +
                  pct_female + patient_esrd_years + pct_fistula +
                  comorbidities + hemoglobin,
                  vcov=cluster(idcat)),
           save=true) # saves residuals in augmentdf
dialysis[:quality] = -qreg.augmentdf[:residuals]
# Let's test that these results are the same as those from ols_residuals
println("Mean absolute difference = $(mean(skipmissing(
            abs.(q .- dialysis[:quality]))))")
```
Mean absolute difference = 4.867410822221838e-14
```julia
# Using $(expr) in a string will insert the result of expr into the
# string.
using Test
@test all(skipmissing(q .≈ dialysis[:quality])) == true
```
Test Passed
Testing is an important part of software development. The Test.jl package provides helper functions for running tests. See these notes from 628 for more information about testing.
Reproduce columns 2, 3, 5, and 6 of Table 5. The syntax for fixed effects regression is shown below:
```julia
# you may want to use patient_years_hd or patient_years_rom instead
log_infmiss = x->ifelse(!ismissing(x) && x>0, log(x), missing)
# -Inf confuses reg()
dialysis[:lpy] = log_infmiss.(dialysis[:patient_months]./12)
dialysis[:logL] = log_infmiss.(dialysis[:labor])
dialysis[:logK] = log_infmiss.(dialysis[:stations])
# you may want to restrict the sample to match the sample that can be
# used in the model estimates
reg(dialysis, @model(lpy ~ quality + logK + logL,
                     fe = idcat, vcov = cluster(idcat)))
```
```
=========================================================================
                           Fixed Effect Model
=========================================================================
Number of obs:       32013   Degrees of freedom:       3
R2:                  0.882   R2 within:            0.144
F-Statistic:       261.008   p-value:              0.000
Iterations:              1   Converged:             true
=========================================================================
            Estimate   Std.Error   t value  Pr(>|t|)  Lower 95%   Upper 95%
quality  -0.00921389 0.000589449  -15.6314     0.000 -0.0103692 -0.00805855
logK        0.171904   0.0285335   6.02462     0.000   0.115977     0.22783
logL        0.470537   0.0171156   27.4917     0.000    0.43699    0.504085
```
Be sure to add the other columns. If you’d like, you could use RegressionTables.jl to produce tables that look a lot like the ones in the paper.
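For example, a minimal sketch of RegressionTables.jl usage (the exact API may differ across package versions; check the package README):

```julia
# Hypothetical sketch: print an ascii regression table for one or more
# fitted models from FixedEffectModels.jl. Assumes `dialysis`, `lpy`,
# `quality`, `logK`, `logL`, and `idcat` from the cells above.
using RegressionTables, FixedEffectModels
r1 = reg(dialysis, @model(lpy ~ quality + logK + logL,
                          fe = idcat, vcov = cluster(idcat)))
regtable(r1; renderSettings = asciiOutput())
```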
As discussed in section 5 of Grieco and McDevitt (2017), the coefficient on quality, \(\alpha\), is estimated from \[
y_{jt} = \alpha q_{jt} + \Phi(\underbrace{h_{jt}, k_{jt}, l_{jt}, x_{jt}}_{w_{jt}}) +
\epsilon_{jt}
\] with a second noisy measure of quality, \(z_{jt}\), used to instrument for \(\alpha\). To estimate \(\alpha\), first the exogenous variables, \(w\), can be partialed out to give: \[
y_{jt} - \Er[y_{jt} | w_{jt} ] = \alpha (q_{jt} - \Er[q_{jt}|w_{jt}]) +
\epsilon_{jt}
\] where we used the assumption that \(\Er[\epsilon_{jt} | w_{jt} ] = 0\) and the fact that \(\Er[\Phi(w) | w] = \Phi(w)\). Under the assumption that \(\Er[\epsilon| z, w] = 0\), we can estimate \(\alpha\) based on the moment condition: \[
\begin{align*}
0 = & \Er[\epsilon f(z,w) ] \\
0 = & \Er\left[ \left(y_{jt} - \Er[y_{jt} | w_{jt} ] - \alpha
(q_{jt} - \Er[q_{jt}|w_{jt}])\right) f(z_{jt},w_{jt}) \right]
\end{align*}
\] If \(Var(\epsilon|z,w)\) is constant, the efficient choice of \(f(z,w)\) is \[
\Er[\frac{\partial \epsilon}{\partial \alpha} |z, w ] = \Er[q| z, w] - \Er[q|w]
\] To estimate \(\alpha\), we simply replace these conditional expectations with regression estimates, and replace the unconditional expectation with a sample average. Let \(\hat{\Er}[y|w]\) denote a nonparametric estimate of the regression of \(y\) on \(w\). Then, \[
\hat{\alpha} = \frac{\sum_{j,t} (y_{jt} -
\hat{E}[y|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])}
{\sum_{j,t} (q_{jt} - \hat{E}[q|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])}
\] The function partiallinearIV
in Dialysis.jl will estimate this model. Also included are two methods for estimating \(\hat{E}[y|w]\). polyreg
estimates a polynomial series regression; that is, it regresses \(y\) on a polynomial of degree \(d\) in \(w\). To allow the regression to approximate any function, the degree must increase with the sample size, but to control variance the degree must not increase too quickly. We will not worry too much about the choice of degree here.
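To see the formula for \(\hat{\alpha}\) in action, here is a self-contained sketch on simulated data (all names here are hypothetical, not from Dialysis.jl; in this linear simulation the conditional expectations are linear, so OLS fits stand in for the nonparametric estimates):

```julia
using Random
Random.seed!(0)
n = 10_000
w = randn(n)              # exogenous variable, playing the role of w
z = randn(n)              # instrument / second quality measure
q = z + 0.5*w + randn(n)  # noisy quality measure
α0 = -0.5                 # true coefficient
y = α0*q + w + randn(n)   # outcome, with Φ(w) = w

fit(X, v) = X * (X \ v)   # fitted values from least squares
X_w   = [ones(n) w]
X_zw  = [ones(n) z w]
ey_w  = fit(X_w, y)       # Ê[y|w]
eq_w  = fit(X_w, q)       # Ê[q|w]
eq_zw = fit(X_zw, q)      # Ê[q|z,w]
# the ratio estimator from the displayed formula above
αhat = sum((y .- ey_w).*(eq_zw .- eq_w)) /
       sum((q .- eq_w).*(eq_zw .- eq_w))
```

With n = 10,000, `αhat` should be close to the true value of -0.5.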
An alternative method (and what Grieco and McDevitt (2017) used) is local linear regression. To estimate \(\hat{E}[y|x_{jt}]\), local linear regression estimates a linear regression of \(y\) on \(x\), but weights observations by how close \(x_{it}\) is to \(x_{jt}\). That is, \[
\hat{E}[y|x_{jt}] = x_{jt} \hat{\beta}(x_{jt})
\] where \[
\hat{\beta}(x_{jt}) = \argmin_\beta \sum_{i,t} (y_{it} -
x_{it}\beta)^2 k((x_{it} - x_{jt})/h_n)
\] Here \(k()\) is some function with its maximum at 0 (and some other properties), like \(k(x) \propto e^{-x^2}\). The bandwidth, \(h_n\), determines how much weight to place on observations close to versus far from \(x_{jt}\). Similar to the degree in polynomial regression, the bandwidth must decrease toward 0 with the sample size to allow local linear regression to approximate any function, but to control variance the bandwidth must not decrease too quickly. We will not worry too much about the choice of bandwidth. The function locallinear
in Dialysis.jl estimates a local linear regression.
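To fix ideas, here is a minimal one-dimensional local linear regression written directly from the definition above. This is an illustrative sketch, not the `locallinear` implementation in Dialysis.jl; the Gaussian kernel, bandwidth, function name, and test data are all choices made here:

```julia
using LinearAlgebra, Random
# Local linear regression in one dimension: at each prediction point
# x0, fit a kernel-weighted linear regression and return its intercept.
function locallinear1d(xpred, x, y; h=0.3)
    k(u) = exp(-u^2/2)                  # Gaussian kernel, maximum at 0
    map(xpred) do x0
        wts = k.((x .- x0)./h)          # nearby observations get weight
        X = [ones(length(x)) (x .- x0)] # local linear design
        β = (X' * Diagonal(wts) * X) \ (X' * Diagonal(wts) * y)
        β[1]                            # intercept = Ê[y | x = x0]
    end
end

# try it on noisy draws from a known function
Random.seed!(1)
x = collect(range(-2, 2, length=400))
y = sin.(x) .+ 0.1 .* randn(length(x))
yhat = locallinear1d([0.0, 1.0], x, y)  # should be near sin(0), sin(1)
```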
Estimate \(\alpha\) using the following code. You may want to modify some aspects of it and/or report estimates of \(\alpha\) for different choices of instrument, nonparametric estimation method, degree or bandwidth. Compare your estimate(s) of \(\alpha\) with the ones in Tables 5 and 6 of Grieco and McDevitt (2017).
```julia
# create indicator for observations usable in estimation of α
inc1 = ((dialysis[:patient_months] .> 0) .&
        (dialysis[:labor] .> 0) .&
        (dialysis[:stations] .> 0) .&
        .!ismissing.(dialysis[:quality]) .&
        .!ismissing.(dialysis[:std_mortality]) .&
        (dialysis[:invest].==0) .&
        (dialysis[:hiring].!=0));
inc1[ismissing.(inc1)] .= false;
dialysis[:inc1] = inc1;

dialysis[:lsmr] = log.(dialysis[:std_mortality] .+ .01)
# As degree → ∞ and/or bandwidth → 0, whether we use :std_mortality or
# some transformation as the instrument should not matter. However,
# for fixed degree or bandwidth it will have some (hopefully small)
# impact.
(α, Φ, αreg, eyqz) =
    partiallinearIV(:lpy,     # y
                    :quality, # q
                    :lsmr,    # z
                    [:hiring, :logL, :logK,
                     :state_inspection_rate, :competitors], # w
                    dialysis[findall(dialysis[:inc1]),:];
                    npregress=(xp, xd, yd)->polyreg(xp, xd, yd, degree=1),
                    parts=true
                    # You may want to change the degree here
                    #
                    # You could also change `polyreg` to `locallinear`
                    # and `degree` to `bandwidthmultiplier`
                    #
                    # locallinear will likely take some time to
                    # compute (≈350 seconds on my computer)
                    )
# we will need these later in step 2
dialysis[:Φ] = similar(dialysis[:lpy])
dialysis[:Φ] .= missing
dialysis[:Φ][findall(dialysis[:inc1])] = Φ
dialysis[:ey] = similar(dialysis[:lpy])
dialysis[:ey] .= missing
dialysis[:ey][findall(dialysis[:inc1])] = eyqz[:,1]
dialysis[:eq] = similar(dialysis[:lpy])
dialysis[:eq] .= missing
dialysis[:eq][findall(dialysis[:inc1])] = eyqz[:,2]
dialysis[:ez] = similar(dialysis[:lpy])
dialysis[:ez] .= missing
dialysis[:ez][findall(dialysis[:inc1])] = eyqz[:,3]
```
The coefficients on labor and capital are estimated by GMM. The idea of GMM is as follows. We have a model that implies \[ \Er[c(y,x;\theta) | z ] = 0 \] where \(y\), \(x\), and \(z\) are observed variables. \(c(y,x;\theta)\) is some known function of the data and some parameters we want to estimate, \(\theta\). Often, \(c(y,x;\theta)\) are the residuals from some equation. For example, for linear IV, we’d have \[ c(y,x;\theta) = y - x\theta \] The conditional moment restriction above implies that \[ \Er[c(y,x;\theta)f(z) ] = 0 \] for any function \(f()\). We can then estimate \(\theta\) by replacing the population expectation with a sample average and finding \(\hat{\theta}\) such that \[ \En[c(y,x;\hat{\theta})f(z) ] \approx 0 \] The dimension of \(f(z)\) should be greater than or equal to the dimension of \(\theta\), so we have at least as many equations as unknowns. We find this \(\hat{\theta}\) by minimizing a quadratic form of these equations. That is, \[ \hat{\theta} = \argmin_\theta \En[g_i(\theta)] W_n \En[g_i(\theta)]' \] where \(g_i(\theta) = c(y_i, x_i;\theta)f(z_i)\), and \(W_n\) is some positive definite weighting matrix.
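As a concrete illustration (not code from the assignment), consider linear IV with \(f(z) = z\) and as many instruments as parameters. The sample moment conditions can then be solved exactly, so the GMM estimator is \(\hat{\theta} = (Z'X)^{-1}Z'y\) regardless of the weighting matrix. The simulated design below is made up for this sketch:

```julia
using Random, LinearAlgebra, Statistics
Random.seed!(2)

n = 1000
z = randn(n, 2)                          # instruments, independent of the error
u = randn(n)                             # structural error
x = z .+ 0.5 .* u .+ 0.5 .* randn(n, 2)  # regressors correlated with u (endogenous)
θ0 = [1.0, -0.5]
y = x * θ0 .+ u

θiv = (z' * x) \ (z' * y)                # solves Eₙ[(y - xθ) z] = 0 exactly
gbar = vec(mean(z .* (y .- x * θiv), dims=1))  # sample moments at θiv, ≈ 0
```

With more instruments than parameters, the moments can no longer be set exactly to zero, which is when the weighting matrix \(W_n\) starts to matter.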
As practice with GMM, use it to estimate a simple regression model, \[ y = x\beta + \epsilon \] assuming \(\Er[\epsilon|x] = 0\). Test your code on simulated data. The following will help you get started.
function sim_ols(n; β = ones(3))
  x = randn(n, length(β))
  ϵ = randn(n)
  y = x*β + ϵ
  return(x, y)
end

β = ones(2)
(x, y) = sim_ols(100; β=β)
βols = (x'*x) \ (x'*y)

function gmm_objective(β)
  gi = (y - x*β) .* x
  Egi = mean(gi, dims=1)
  error("This is incomplete; you must finish it")
  # It is likely that the code you write will return a 1 x 1,
  # 2 dimensional array. For compatibility with Optim, you need to
  # return a scalar. If foo is a 1x1 array, write `foo[1]` to return
  # a scalar instead of a 1x1 array.
end

# minimize gmm_objective
using Optim
using Test
# github page: https://github.com/JuliaNLSolvers/Optim.jl
# docs: http://julianlsolvers.github.io/Optim.jl/stable/
try
  res = optimize(gmm_objective,
                 zeros(size(β)), # initial value
                 BFGS(),         # algorithm, see http://julianlsolvers.github.io/Optim.jl/stable/
                 autodiff=:forward)
  βgmm = res.minimizer
catch err
  @info err
  βgmm = βols
  res = nothing
end
@test βgmm ≈ βols
Error During Test
  Test threw exception
  Expression: βgmm ≈ βols
  UndefVarError: βgmm not defined
ERROR: Test.FallbackTestSetException("There was an error during testing")
The model implies that \[ \omega_{jt} = \Phi(w_{jt}) - \beta_k k_{jt} - \beta_l l_{jt} \] {#eq:omega} and \[ y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt} = g(\omega_{jt-1}) + \eta_{jt} \] {#eq:eta} The timing and exogeneity assumptions imply that \[ \Er[\eta_{jt} | k_{jt}, l_{jt}, w_{jt-1}] = 0 \] Given a value of \(\beta\), and our above estimates of \(\Phi\) and \(\alpha\), we can compute \(\omega\) from the equation for \(\omega_{jt}\) above, and then estimate \(g()\) and \(\eta\) by a nonparametric regression of \(y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt}\) on \(\omega_{jt-1}\). \(\beta\) can then be estimated by finding the value of \(\beta\) that comes closest to satisfying the moment conditions \[ \Er[\eta(\beta)_{jt} k_{jt}] = 0 \text{ and } \Er[\eta(\beta)_{jt} l_{jt}] = 0 \] To do this, we minimize \[ Q_n(\beta) = \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right) W_n \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right)' \]
Write the body of the \(Q_n(\beta)\) function below. Use it to estimate \(\beta\). Compare your results with those of Grieco and McDevitt (2017). Optionally, explore robustness of your results to changes in the specification.
# indicator for observations usable in estimation of β
dialysis[:inclag] = panellag(:inc1, dialysis, :provfs, :year, 1);
dialysis[:inc2] = (dialysis[:inclag] .&
                   (dialysis[:stations].>0) .&
                   (dialysis[:labor].>0) .&
                   (dialysis[:patient_years].>0) .&
                   .!ismissing.(dialysis[:quality]));
dialysis[:inc2][ismissing.(dialysis[:inc2])] .= false;

(ωfunc, ηfunc) = errors_gm(:lpy, :logK, :logL, :quality, :Φ,
                           :provfs, :year, dialysis, α; degree=1)

function Qn(β)
  η = ηfunc(β)
  error("You must write the body of this function")
end

using Optim
try
  res = optimize(Qn,              # objective
                 [0.0, 0.0],      # lower bounds, should not be needed, but
                                  # may help numerical performance
                 [1.0, 1.0],      # upper bounds
                 [0.4, 0.2],      # initial value
                 Fminbox(BFGS()), # algorithm
                 autodiff=:forward)
  β = res.minimizer
  @show β
catch err
  @info err
  res = nothing
  β = nothing
end
Many, perhaps most, estimators in econometrics are extremum estimators. That is, many estimators are defined by \[ \hat{\theta} = \argmax_{\theta \in \Theta} \hat{Q}_n(\theta) \] where \(\hat{Q}_n(\theta)\) is some objective function that depends on data. Examples include maximum likelihood, \[ \hat{Q}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(z_i | \theta), \] nonlinear least squares, \[ \hat{Q}_n(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - h(x_i,\theta))^2, \] and, as we are using for this example, GMM, \[ \hat{Q}_n(\theta) = \left(\frac{1}{n} \sum_{i=1}^n g(z_i, \theta)\right)' \hat{W} \left(\frac{1}{n} \sum_{i=1}^n g(z_i, \theta)\right). \] (The least squares and GMM objectives are minimized; multiplying by \(-1\) fits them into the maximization framing above.) See Newey and McFadden (1994) for more details and examples.
We will encounter extremum estimators often in this course, so it is useful to be familiar with their statistical properties. However, since this course is not focused on econometrics, we will just state some basic “high-level” conditions for consistency and asymptotic normality, and only give brief sketches of proofs. Our main goal is to find the asymptotic distribution of \(\hat{\theta}\), so that we can report standard errors and confidence regions. If \(\hat{Q}_n\) is differentiable, then \[ 0 = \nabla \hat{Q}_n(\hat{\theta}). \] Taking a first order expansion around \(\theta_0\), \[ 0 \approx \nabla \hat{Q}_n(\theta_0) + \nabla^2 \hat{Q}_n(\theta_0)(\hat{\theta}-\theta_0). \] Rearranging and multiplying by \(\sqrt{n}\) gives \[ \sqrt{n}(\hat{\theta}-\theta_0) \approx -\left(\nabla^2 \hat{Q}_n(\theta_0)\right)^{-1} \sqrt{n} \nabla \hat{Q}_n(\theta_0). \] If a law of large numbers applies to \(\nabla^2 \hat{Q}_n(\theta_0)\) and a central limit theorem applies to \(\sqrt{n} \nabla \hat{Q}_n(\theta_0)\), then \(\sqrt{n}(\hat{\theta}-\theta_0)\) will be asymptotically normal. The following theorem states this idea somewhat more precisely.
Consistency for extremum estimators: assume
\(\hat{Q}_n(\theta)\) converges uniformly in probability to \(Q_0(\theta)\)
\(Q_0(\theta)\) is uniquely maximized at \(\theta_0\).
\(\Theta\) is compact and \(Q_0(\theta)\) is continuous.
Then \(\hat{\theta} \inprob \theta_0\).
Asymptotic normality for extremum estimators: assume
\(\hat{\theta} \inprob \theta_0\)
\(\theta_0 \in interior(\Theta)\)
\(\sqrt{n} \nabla \hat{Q}_n(\theta_0) \indist N(0,\Sigma)\)
\(\nabla^2 \hat{Q}_n(\theta)\) converges uniformly in probability to \(H(\theta)\), \(H(\theta)\) is continuous, and \(H = H(\theta_0)\) is nonsingular
Then \(\sqrt{n} (\hat{\theta} - \theta_0) \indist N\left(0,H^{-1} \Sigma H^{-1} \right)\)
For a GMM objective function of the form \[ \hat{Q}_n(\theta) = \left[\frac{1}{n} \sum_i g_i(\theta)\right]' W_n \left[\frac{1}{n} \sum_i g_i(\theta)\right], \] if we assume:
\(1/\sqrt{n} \sum_i g_i(\theta_0) \indist N(0,\Sigma)\)
\(1/n \sum_i \nabla g_i(\theta) \inprob E[\nabla g(\theta)] = D\), \(W_n \inprob W\)
\((D'WD)\) is nonsingular.
then the above theorem for asymptotic normality of extremum estimators implies that \[ \sqrt{n}(\hat{\theta} - \theta_0) \indist N(0,\Omega) \] where \[ \Omega= (D'WD)^{-1} (D' W \Sigma W D) (D'WD)^{-1}. \] If we additionally assume \(W_n \inprob \Sigma^{-1}\), e.g. observations are independent and \(W_n = \widehat{Var}(g_i(\theta))^{-1}\), then the asymptotic variance simplifies to \((D' \Sigma D)^{-1}\). This choice of \(W\) is efficient in that it leads to the smallest asymptotic variance.
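The sandwich formula is easy to verify numerically. The sketch below uses arbitrary made-up matrices \(D\), \(\Sigma\), and \(W\) (not quantities from the dialysis application) to check that the efficient choice \(W = \Sigma^{-1}\) collapses \(\Omega\) to \((D'\Sigma^{-1}D)^{-1}\), and that this is weakly smaller than the variance under identity weighting:

```julia
using LinearAlgebra, Random
Random.seed!(3)

D = randn(4, 2)                    # Jacobian of 4 moments w.r.t. 2 parameters
A = randn(4, 4)
Σ = A * A' + I                     # positive definite variance of the moments

# Ω = (D'WD)⁻¹ (D'WΣWD) (D'WD)⁻¹ for a given weighting matrix W
sandwich(W) = inv(D' * W * D) * (D' * W * Σ * W * D) * inv(D' * W * D)

W1 = Matrix{Float64}(I, 4, 4)      # arbitrary (identity) weighting
Ω_ineff = sandwich(W1)
Ω_eff   = sandwich(inv(Σ))         # efficient weighting W = Σ⁻¹
# Ω_eff equals (D'Σ⁻¹D)⁻¹, and Ω_ineff - Ω_eff is positive semidefinite
```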
The above applies to estimators that come from minimizing a single objective function. This application involves a multi-step estimator. First, we estimated \(\alpha\) and \(\Phi()\); then we estimated \(\beta\) by GMM using moments of the form \[ \hat{\beta} = \argmin_\beta \left[\frac{1}{n} \sum_i g_i(\beta, \hat{\alpha}, \hat{\Phi})\right] W_n \left[\frac{1}{n} \sum_i g_i(\beta, \hat{\alpha}, \hat{\Phi})\right]'. \] An expansion similar to the one above gives \[ \sqrt{n}(\hat{\beta} - \beta_0) \approx -(D'WD)^{-1} D' W \left(\frac{1}{\sqrt{n}} \sum_i g_i(\beta_0, \hat{\alpha}, \hat{\Phi}) \right) \] where \(D=\Er[\nabla_\beta g(\beta_0, \alpha_0, \Phi_0)]\). If we also expand \(g_i\) in terms of \(\alpha\) and \(\Phi\), we get \[ \sqrt{n}(\hat{\beta} - \beta_0) \approx -(D'WD)^{-1} D' W \left(\frac{1}{\sqrt{n}} \sum_i g_i(\beta_0, \alpha_0, \Phi_0) + \Er\left[\frac{\partial g_i}{\partial \alpha}(\beta_0,\alpha_0,\Phi_0)\right] \sqrt{n}(\hat{\alpha}-\alpha_0) + \Er\left[\frac{\partial g_i}{\partial \Phi}(\beta_0,\alpha_0,\Phi_0) \sqrt{n}(\hat{\Phi}-\Phi_0)\right] \right). \] So there will be additional variance in \(\hat{\beta}\) from the estimation of \(\hat{\alpha}\) and \(\hat{\Phi}\). Since \(\alpha\) is finite dimensional, it is not too difficult to derive the distribution of \(\sqrt{n}(\hat{\alpha}-\alpha_0)\) and estimate \(\Er[\frac{\partial g_i}{\partial \alpha}(\beta,\alpha,\Phi)]\). However, \(\hat{\Phi}\) is a function and is more difficult to deal with. Under some strong assumptions, \(\Er[\frac{\partial g_i}{\partial \Phi}(\beta,\alpha,\Phi) \sqrt{n}(\hat{\Phi}-\Phi_0)]\) is asymptotically normal, but the needed assumptions are somewhat restrictive and tedious to state. An alternative approach is to redefine \(g_i\) to ensure that \(\Er[\frac{\partial g_i}{\partial \Phi}] = 0\).
This can be done by letting \[ \begin{align*} g_{jt}(\beta, \alpha, \Phi) = & \left(y_{jt} - \alpha q_{jt} - x_{jt} \beta - h(\Phi(w_{jt-1}) - x_{jt-1}\beta) \right) \left(x_{jt} - \Er[x|\Phi(w_{jt-1}) - x_{jt-1}\beta]\right) + \\ & - \left(y_{jt-1} - \alpha q_{jt-1} - \Phi(w_{jt-1})\right) h'(\Phi(w_{jt-1}) - x_{jt-1}\beta) \left(x_{jt} - \Er[x|\Phi(w_{jt-1}) - x_{jt-1}\beta]\right) \end{align*} \] where \(x = (k, l)\), and \(h(\Phi(w_{jt-1}) - x_{jt-1}) = \Er[y_{jt} - \alpha q_{jt} - x_{jt} \beta |\Phi(w_{jt-1}) - x_{jt-1}\beta]\). It is not too difficult to verify that \[ \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)] = 0 \] and \[ 0 = D_\Phi \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)]\;\;,\;\; 0 = D_h \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)] \] In other words, these moment conditions are orthogonal in the sense of Chernozhukov et al. (2018). Estimation error in \(\Phi\) and \(h\) only has second order effects on the estimate of \(\beta\). Under appropriate assumptions, these second order effects will vanish quickly enough that they can be ignored in the asymptotic distribution of \(\hat{\beta}\).
We can similarly deal with the uncertainty in \(\hat{\alpha}\) by redefining the moment condition to be orthogonal with respect to \(\alpha\). Let \[ \tilde{g}_{jt}(\beta,\alpha,\Phi) = g_{jt}(\beta, \alpha, \Phi) - \Er[ D_\alpha g_{jt}(\beta_0, \alpha, \Phi)] \frac{(y_{jt} - \alpha q_{jt} - \Phi(w_{jt})) (\Er[q|z_{jt},w_{jt}] - \Er[q|w_{jt}])}{\Er[(q - \Er[q|w])(\Er[q|z,w]-\Er[q|w])]}. \] Then \(\Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)] = 0\) and \[ 0 = D_\Phi \Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)]\;\;,\;\; 0 = D_\alpha \Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)]. \] Hence, \[ \begin{align*} \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) = & \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \alpha_0, \Phi_0 ) + o_p(1) \end{align*} \]
Under appropriate assumptions a CLT will apply to \(\tilde{g}_{jt}\), so \[ \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\hat{\beta}, \hat{\alpha}, \hat{\Phi} ) = \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0) + o_p(1) \indist N(0,\Sigma). \] Furthermore, \(\Sigma\) can be estimated by taking the sample (clustered) covariance of \(\tilde{g}_{jt}(\beta,\hat{\alpha},\hat{\Phi})\). Denote this by \(\hat{\Sigma}(\beta)\). We then have \[ Q_n^{CUE}(\beta_0) = n \left(\frac{1}{n} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) \right)' \hat{\Sigma}(\beta_0)^{-1} \left(\frac{1}{n} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) \right) \indist \chi^2_{\dim(\tilde{g}_{jt})}. \] This can be used to test \(H_0: \beta = \beta_0\), or to form a confidence region by taking all values of \(\beta\) for which the test fails to reject. Such an inference procedure is robust to identification problems in that it does not require an assumption that \(D = D_\beta \Er[g_{jt}(\beta,\alpha_0,\Phi_0)]\) is full rank or that \(Q_n\) is uniquely minimized at \(\beta_0\). If we are willing to assume strong identification, in that \((D' \Sigma(\beta_0)^{-1} D)\) is invertible (and, loosely speaking, the estimated version of this is far from singular relative to its estimation error), then we can take an expansion to get \[ \sqrt{n}(\hat{\beta} - \beta_0) = -(D'\Sigma^{-1} D)^{-1} D'\Sigma^{-1} \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \alpha_0, \Phi_0) + o_p(1) \indist N(0, (D' \Sigma^{-1} D)^{-1}). \]
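Inverting the \(\chi^2\) test amounts to collecting all \(\beta_0\) whose objective value falls below the appropriate \(\chi^2\) critical value. A tiny sketch of the mechanics, where the objective value `q` is a made-up number standing in for \(Q_n^{CUE}(\beta_0)\):

```julia
using Distributions

df = 2                              # two moment conditions here, so χ² with 2 df
c95 = quantile(Chisq(df), 0.95)     # 5% critical value, ≈ 5.99
q = 3.2                             # hypothetical value of Qₙᶜᵁᴱ(β₀)
in_region = q <= c95                # true ⇒ β₀ is in the 95% confidence region
pval = 1 - cdf(Chisq(df), q)        # p-value for H₀: β = β₀
```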
See my notes from 628 on extremum estimation and on bootstrap for more details.
The objective_gm
function in Dialysis.jl constructs the \(Q_n^{CUE}\) objective function described above. The following code minimizes it and plots a confidence region for \(\beta\) based on inverting the \(\chi^2\) test described above.
(obj, momenti, cue, varfn) =
  objective_gm(:lpy, :logK, :logL, :quality, :Φ,
               :provfs, :year, [:logK, :logL], dialysis,
               clusterid=:idcat,
               npregress=(xp,xd,yd)->polyreg(xp,xd,yd,degree=1, deriv=true));
res = optimize(b->cue(b,α),      # objective
               [0.0, 0.0],       # lower bounds, should not be needed, but
                                 # may help numerical performance
               [1.0, 1.0],       # upper bounds
               [0.4, 0.2],       # initial value
               Fminbox(BFGS()),  # algorithm
               autodiff=:forward)
res
Results of Optimization Algorithm
 * Algorithm: Fminbox with BFGS
 * Starting Point: [0.4,0.2]
 * Minimizer: [2.4281864085583334e-20,0.14987299477154925, …]
 * Minimum: 2.340586e+00
 * Iterations: 7
 * Convergence: true
   * |x - x'| ≤ 0.0e+00: false
     |x - x'| = 2.98e-17
   * |f(x) - f(x')| ≤ 0.0e+00 |f(x)|: true
     |f(x) - f(x')| = 0.00e+00 |f(x)|
   * |g(x)| ≤ 1.0e-08: false
     |g(x)| = 4.44e+00
   * Stopped by an increasing objective: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 1271
 * Gradient Calls: 1271
# calculate standard errors based on normal approximation
βhat = res.minimizer
gi = momenti(βhat, α)
(Σ, n) = varfn(gi)
using ForwardDiff
using LinearAlgebra # for diag
D = ForwardDiff.jacobian(β->mean(momenti(β,α), dims=1), βhat)
Vβ = inv(D'*inv(Σ)*D)/n
@show βhat
βhat = [2.42819e-20, 0.149873]
@show sqrt.(diag(Vβ))
sqrt.(diag(Vβ)) = [1.46929, 0.0594817]
2-element Array{Float64,1}:
 1.469292635429146
 0.05948171117514879
using Plots, Distributions
Plots.gr()
lb = [0., 0.]
ub = [1., 1.]
ntest = 200
βtest = [rand(2).*(ub-lb) .+ lb for i in 1:ntest];
βtest = vcat(βtest'...)
fn = β->cue(β,α)
pval = Vector{Float64}(undef, ntest);
qval = similar(pval);
Threads.@threads for i in 1:size(βtest,1)
  qval[i] = fn(βtest[i,:]);
  pval[i] = cdf(Chisq(2), qval[i]);
end
crit = 0.9
fig = scatter(βtest[:,1], βtest[:,2], group=(pval.<crit),
              legend=false, markersize=4, markerstrokewidth=0.0,
              seriesalpha=1.0, palette=:isolum,
              xlabel="βk", ylabel="βl")
ngrid = 20
b = range.(lb, ub, length=ngrid)
pc = Matrix{Float64}(undef, length(b[1]), length(b[2]))
qc = similar(pc)
Threads.@threads for i in CartesianIndices(pc)
  qc[i] = fn([b[1][i[1]], b[2][i[2]]]);
  pc[i] = cdf(Chisq(2), qc[i]);
end
fig = contour!(b[1], b[2], pc', levels=[0.75, 0.9, 0.95, 0.99],
               contour_labels=true)
display(fig)
Here we plot \(Q^{CUE}_n(\beta)\).
Plots.plotly()
surface(b[1], b[2], qc', xlabel="βk", ylabel="βl")
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–C68. https://doi.org/10.1111/ectj.12097.
Grieco, Paul L. E., and Ryan C. McDevitt. 2017. “Productivity and Quality in Health Care: Evidence from the Dialysis Industry.” The Review of Economic Studies 84 (3): 1071–1105. https://doi.org/10.1093/restud/rdw042.
Newey, Whitney K., and Daniel McFadden. 1994. “Chapter 36: Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, 4:2111–2245. Elsevier. https://doi.org/10.1016/S1573-4412(05)80005-4.