Author:
Paul Schrimpf
Date: 2019-02-01
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This document was created using Weave.jl. The code is available on github. The same document generates both static webpages and associated Jupyter notebooks.
\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]
This assignment will reproduce some of the results of Grieco and McDevitt (2017).
https://vse.syzygy.ca provides a convenient browser-based interface to Julia. Open it and log in. This assignment is in a git repository at https://github.com/UBCECON567/Dialysis. Start by cloning the git repository to your syzygy directory. Open a terminal in syzygy (File -> New -> Terminal). This will open a Linux shell in your browser. To clone the git repository, enter
git clone https://github.com/UBCECON567/Dialysis
This will create a directory called Dialysis
containing all the files related to this assignment.
Clicking on the folder icon near the top left of the screen opens a file browser panel. Use it to open the Dialysis/notebooks
folder. You can complete this assignment by modifying the dialysis.ipynb
notebook. I recommend creating a copy of this notebook, and then working on the copy. You can create a copy by right clicking in the file browser panel. Now open your copy of the notebook.
Notebooks consist of a series of “cells” of either text written in markdown or Julia code. If you double click on any of the text cells, you can see the markdown that created it. To go back to the formatted text, execute the cell by either clicking the play icon on the top of the page or typing ctrl and enter together.
This assignment will try to explain aspects of Julia as needed. However, if at some point you feel lost, you may want to consult some of the following resources. Reading the first few sections of either QuantEcon or Think Julia is recommended.
Think Julia: a detailed introduction to Julia and programming more generally. Long, but recommended, especially if you're new to programming.
From the julia prompt, you can access documentation with ?functionname
. Some packages have better documentation than others.
https://julialang.org/ is the website for Julia
Documentation for core Julia can be found at https://docs.julialang.org/en/v1/. All Julia packages also have a github page. Many of these will include package specific documentation.
Notes on Julia from ECON 628: much of this is part of QuantEcon, but not all.
The Julia Express: a short book with examples of Julia usage.
Like many programming environments (R, Python, etc.), Julia relies on packages for much of its functionality. The following code will download and install all the packages required for this assignment (but the packages will still need to be loaded with using ...
). Execute this cell. It will take some time. While the cell is running, there will be a [*]
to the left of it. This will change to [1]
(or some other number) after the cell is finished running. The number indicates the order in which the cell was executed. You can execute cells out of order. This can be useful during development, but you should always make sure that your notebook works correctly when cells are executed in order before considering it complete (that is, make sure the “Run -> Restart Kernel and Run all Cells” menu option produces the output you want). Don’t worry about understanding the details of the code in this section.
```julia
using Pkg          # loads the Pkg package
Pkg.activate("..") # loads environment specified in "../Project.toml"
                   # This contains a list of packages and their versions
Pkg.instantiate()  # installs all packages in the environment
```
Updating registry at ~/.julia/registries/General
Updating git-repo https://github.com/JuliaRegistries/General.git
Some useful functions for this assignment are in ../src/Dialysis.jl
. We load those functions now.
```julia
using Revise
if (!("../src" ∈ LOAD_PATH))
  push!(LOAD_PATH, "../src") # so that `using` knows where to find Dialysis.jl
end
using Dialysis
```
The functions in Dialysis.jl are organized into a module, just like any other Julia package. Revise.jl is a package to make it easier to develop packages (like Dialysis.jl). In particular, using Revise
will make it so that as soon as you save any modifications to Dialysis.jl, those modifications will be loaded into your Julia session without you doing anything extra.
Now let’s get to work. I originally downloaded the data for this problem set from https://dialysisdata.org/content/dialysis-facility-report-data. As in Grieco and McDevitt (2017) the data comes from Dialysis Facility Reports (DFRs) created under contract to the Centers for Medicare and Medicaid Services (CMS). However, there are some differences. Most notably, this data covers 2006-2014, instead of 2004-2008 as in Grieco and McDevitt (2017) .
The R script https://bitbucket.org/paulschrimpf/econ565/src/master/assignments/production-R/dialysis/downloadDialysisData.R downloads, combines, and cleans the data. Unfortunately, dialysisdata.org has reorganized their website, and the data no longer seems to be available. Similar (likely identical) data is available from https://data.cms.gov/browse?q=dialysis. It might be useful to look at the documentation included with any of the “Dialysis Facility Report Data for FY20XX” zip files. Anyway, the result of the R script is the dialysisFacilityReports.rda
file contained in this git repository. This R data file contains most of the variables used by Grieco and McDevitt (2017).
```julia
using DataFrames      # DataFrames.jl is a package for storing and
                      # interacting with datasets
dialysis = loaddata() # loaddata() is a function I wrote that is part
                      # of Dialysis.jl. It returns a DataFrame
typeof(dialysis)
```
DataFrame
We will begin our analysis with some exploratory statistics and figures. There are at least two reasons for this. First, we want to check for any anomalies in the data, which may indicate an error in our code, our understanding of the data, or the data itself. Second, we should try to see if there are any striking patterns in the data that deserve extra attention. We can get some information about all the variables in the data as follows
```julia
describe(dialysis)
```
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Union… | DataType |
| 1 | provfs | | "012306" | | "682567" | 7015 | | String |
| 2 | year | 2010.23 | 2006.0 | 2010.0 | 2014.0 | | | Float64 |
| 3 | comorbidities | 4.70985 | 0.44444 | 4.69811 | 10.7619 | | 7982 | Float64 |
| 4 | comorbidities_p3 | 4.70985 | 0.44444 | 4.69811 | 10.7619 | | 7982 | Float64 |
| 5 | hemoglobin | 9.77373 | 3.6 | 9.7727 | 18.0 | | 1608 | Float64 |
| 6 | hemoglobin_p3 | 9.77326 | 3.6 | 9.7715 | 18.0 | | 1608 | Float64 |
| 7 | std_mortality | 1.00545 | 0.0 | 0.96694 | 4.97293 | | 2239 | Float64 |
| 8 | std_mortality_p3 | 1.02017 | 0.0 | 0.98193 | 4.76552 | | 2239 | Float64 |
| 9 | std_hosp_days | 0.976877 | 0.0 | 0.932715 | 4.94606 | | 2038 | Float64 |
| 10 | std_hosp_days_p3 | 0.973781 | 0.0 | 0.92961 | 4.94606 | | 2038 | Float64 |
| 11 | std_hosp_admit | 0.997463 | 0.0 | 0.96394 | 3.87802 | | 2071 | Float64 |
| 12 | std_hosp_admit_p3 | 0.995428 | 0.0 | 0.96269 | 4.1462 | | 2071 | Float64 |
| 13 | pct_septic | 11.3111 | 0.0 | 10.7143 | 100.0 | | 1354 | Float64 |
| 14 | pct_septic_p3 | 11.5805 | 0.0 | 11.111 | 100.0 | | 1354 | Float64 |
| 15 | n_hosp_admit | 93.215 | 0.0 | 80.0 | 814.0 | | 1261 | Float64 |
| 16 | n_hosp_admit_p3 | 93.8085 | 0.0 | 80.0 | 840.0 | | 1261 | Float64 |
| 17 | n_hosp_patients | 74.0305 | 0.0 | 65.0 | 539.0 | | 1261 | Float64 |
| 18 | n_hosp_patients_p3 | 74.3494 | 0.0 | 66.0 | 539.0 | | 1261 | Float64 |
| 19 | patient_years_hd | 49.9054 | 0.0 | 43.4 | 333.421 | | 1261 | Float64 |
| 20 | patient_years_hd_p3 | 50.1318 | 0.0 | 43.587 | 333.421 | | 1261 | Float64 |
| 21 | city | | ABBEVILLE | | ZUNI | 3199 | | String |
| 22 | name | | | | ZZ CLOSED TRINITY REGIONAL MEDICAL CENTER DIALYSIS - WEBSTER | 9276 | | String |
| 23 | state | | AK | | WY | 56 | | String |
| 24 | network | 9.64447 | 1 | 9.0 | 18 | | | Int32 |
| 25 | chain_name | | | | WEST PENN ALLEGHENY HEALTH SYST | 126 | | String |
| 26 | profit_status | | For Profit | | Unavailable | 3 | | CategoricalString{UInt8} |
| 27 | stations | 17.9418 | 0.0 | 17.0 | 108.0 | | 3227 | Float64 |
| 28 | total_staff | 14.552 | 0.0 | 13.0 | 235.0 | | 644 | Float64 |
| 29 | dieticiansFT | 0.516599 | 0.0 | 0.0 | 12.0 | | 1114 | Float64 |
| 30 | dieticiansPT | 0.604436 | 0.0 | 1.0 | 20.0 | | 1114 | Float64 |
| 31 | nurseFT | 4.93611 | 0.0 | 4.0 | 103.0 | | 1114 | Float64 |
| 32 | nursePT | 1.00594 | 0.0 | 0.0 | 72.0 | | 1114 | Float64 |
| 33 | ptcareFT | 5.66502 | 0.0 | 5.0 | 140.0 | | 1114 | Float64 |
| 34 | ptcarePT | 0.838336 | 0.0 | 0.0 | 32.0 | | 1114 | Float64 |
| 35 | social_workerFT | 0.567487 | 0.0 | 1.0 | 42.0 | | 5712 | Float64 |
| 36 | social_workerPT | 0.563829 | 0.0 | 1.0 | 19.0 | | 1114 | Float64 |
| 37 | total_staff_l1 | 14.7059 | 0.0 | 13.0 | 235.0 | | 1628 | Float64 |
| 38 | dieticiansFT_l1 | 0.506917 | 0.0 | 0.0 | 25.0 | | 1745 | Float64 |
| 39 | dieticiansPT_l1 | 0.618645 | 0.0 | 1.0 | 25.0 | | 1745 | Float64 |
| 40 | nurseFT_l1 | 4.90774 | 0.0 | 4.0 | 103.0 | | 1745 | Float64 |
| 41 | nursePT_l1 | 1.04822 | 0.0 | 0.0 | 72.0 | | 1745 | Float64 |
| 42 | ptcareFT_l1 | 5.65182 | 0.0 | 5.0 | 225.0 | | 1745 | Float64 |
| 43 | ptcarePT_l1 | 0.872588 | 0.0 | 0.0 | 32.0 | | 1745 | Float64 |
| 44 | social_workerFT_l1 | 0.562001 | 0.0 | 1.0 | 42.0 | | 6176 | Float64 |
| 45 | social_workerPT_l1 | 0.574905 | 0.0 | 1.0 | 25.0 | | 1745 | Float64 |
| 46 | patient_months | 771.833 | 0.0 | 684.0 | 4849.0 | | 151 | Float64 |
| 47 | patient_years_rom | 67.1459 | 0.0 | 57.656 | 441.339 | | | Float64 |
| 48 | pct_fistula | 55.264 | 0.0 | 55.977 | 100.0 | | 1215 | Float64 |
| 49 | pct_female | 44.7421 | 0.0 | 44.8 | 100.0 | | 131 | Float64 |
| 50 | patient_age | 61.4366 | 7.3077 | 61.6972 | 85.2857 | | 131 | Float64 |
| 51 | patient_esrd_years | 4.4062 | 0.3121 | 4.3491 | 19.9425 | | 131 | Float64 |
| 52 | treatment_type | | Hemodialysis | | Unavailable | 4 | | CategoricalString{UInt8} |
| 53 | inspect_date | | 1992-05-20 | | 2015-05-29 | 3273 | 329 | Date |
| 54 | inspect_result | | . | | Unknown | 12 | 2011 | CategoricalString{UInt8} |
| 55 | inspect_cfc_cites | 0.293753 | 0.0 | 0.0 | 10.0 | | 3918 | Float64 |
| 56 | inspect_std_cites | 5.89393 | 0.0 | 4.0 | 153.0 | | 3921 | Float64 |
| 57 | days_since_inspection | 704.886 | -177.0 | 557.0 | 7454.0 | | 329 | Float64 |
| 58 | original_chain_name | | | | WELLSPAN DIALYSIS | 189 | | String |
The meanings of these variables are as follows:
Variable | Definition |
---|---|
provfs | provider identifier |
year | year |
comorbidities | average patient comorbidities |
hemoglobin | average patient hemoglobin level |
std_mortality | standardized mortality ratio |
std_hosp_days | standardized hospitalization days |
std_hosp_admit | standardized hospitalization admittance rate |
pct_septic | percent of patients hospitalized due to septic infection |
n_hosp_admit | number of hospitalizations |
n_hosp_patients | |
patient_years_hd | patient years at risk of hospitalization |
city | city |
name | provider name |
state | state |
chain_name | name of chain if provider is part of one |
profit_status | whether for profit |
stations | number of dialysis stations |
total_staff | total staff |
dieticiansFT | full-time renal dieticians |
dieticiansPT | part-time renal dieticians |
nurseFT | full-time nurses (>32 hours/week) |
nursePT | part-time nurses (<32 hours/week) |
ptcareFT | full-time patient care technicians |
ptcarePT | part-time patient care technicians |
social_workerFT | full-time social workers |
social_workerPT | part-time social workers |
patient_months | number of patient-months treated during the year |
patient_years_rom | patient-years at risk of mortality |
pct_fistula | the percentage of patient months in which the patient received dialysis through arteriovenous (AV) fistulae |
pct_female | percent of female patients |
patient_age | average age of patients |
patient_esrd_years | average number of years patients have had end stage renal disease |
treatment_type | types of treatment provided at facility |
inspect_date | date of most recent inspection |
inspect_result | result of most recent inspection |
inspect_cfc_cites | number of condition for coverage deficiencies in most recent inspection |
inspect_std_cites | number of standard deficiencies in most recent inspection |
days_since_inspection | days since last inspection |
The raw data contain information on many variables in each of the previous 4 years. Staffing variables with no suffix are staff as of January 31 of year, as reported in year + 1. Staffing variables with the "_l1" suffix are staff as of January 31 of year - 1, as reported in year + 1. If there were no reporting errors, the "_l1" variables would equal the lag of the ones without the suffix. However, you might find that this is not the case.
As explained in downloadDialysisData.R, data collected in year Y has information on most variables in years Y-1, Y-2, Y-3, and Y-4. However, for some variables and survey years, only information in years Y-2, Y-3, and Y-4 is included. For such variables, at year Y-1, I use the value reported in survey year Y if it is available. If not, I use the value reported in survey year Y+1. The variables ending with "_p3" instead use the convention of using Y-2 values if available and the Y-1 ones if not. Again, if there were no reporting errors, the variables with and without "_p3" would be the same.
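To see what "no reporting errors" would mean concretely, here is a self-contained toy check of the "_l1" convention (the numbers are hypothetical, not from the data):

```julia
using Statistics
# Toy panel for one facility. With no reporting errors, total_staff_l1
# in year t equals total_staff in year t-1. Here the 2008 report
# disagrees (13.0 vs 12.0), mimicking a reporting error.
years          = [2006, 2007, 2008]
total_staff    = [10.0, 12.0, 11.0]    # staff as of Jan 31 of `year`
total_staff_l1 = [missing, 10.0, 13.0] # staff as of Jan 31 of year-1
lagged = [missing; total_staff[1:end-1]] # hand-built one-year lag
agree = [ismissing(a) || ismissing(b) ? missing : a == b
         for (a, b) in zip(lagged, total_staff_l1)]
println("share agreeing: ", mean(skipmissing(agree)))
```

On the real data you could run the same comparison facility by facility, for example using `panellag(:total_staff, dialysis, :provfs, :year, 1)` from Dialysis.jl in place of the hand-built lag (the lag sign here is an assumption, following the convention that -1 produced a lead when constructing `invest` below).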
There are three variables for the number of patients treated. The data documentation describes patient_months
as
“Prevalent Hemodialysis Patient Months (7a): The monthly prevalent hemodialysis patient count at a facility includes all non-transient patients (home and in-center) who receive hemodialysis as of the last day of that calendar month. Incident patients (those who received ESRD treatment for the first time ever) are included in this count. Row 7a reports the number of prevalent hemodialysis patient months reported at the facility each year. The number of patient months over a time period is the sum of patients reported for the months covered by the time period. An individual patient may contribute up to 12 patient months per year.”
patient_years_rom
is the number of patient years at risk of mortality. patient_years_hd
is the number of patient years at risk of hospitalization. Since hospitalization data is constructed from Medicare records, a patient is considered at risk of hospitalization only when one can be reasonably certain that a hospitalization would be billed to Medicare. Dialysis patients who pay for hospitalization by other means could have unobserved hospitalizations. The data guide explains,
“Ideally, this table includes only patients whose Medicare billing records include all hospitalizations for the period. To achieve this goal, we require that patients reach a certain level of Medicare-paid dialysis bills to be included in hospitalization statistics, or that patients have Medicare-paid inpatient claims during the period. For the purpose of analysis, each patient’s follow-up time is broken into periods defined by time since dialysis initiation. For each patient, months within a given period are included if that month in the period is considered ‘eligible’; a month is deemed eligible if it is within two months of a month having at least $900 of Medicare-paid dialysis claims or at least one Medicare-paid inpatient claim. In setting this criterion, our aim is to achieve completeness of information on hospitalizations for all patients included in the years at risk.”
Not all variables used by Grieco and McDevitt (2017) are included here. Some variables will need to be transformed to be comparable to what is in the paper. For example, net investment in stations in year \(t\) is the difference between the number of stations in year \(t+1\) and the number in year \(t\).
```julia
# sort data by :provfs, :year
# Function names that end with ! indicate that the function will
# modify one (or more) of its inputs. In this case, sort! modifies
# the dialysis DataFrame.
sort!(dialysis, (:provfs, :year))

# Things starting with : are Symbols. Names of variables within a
# DataFrame must be Symbols, so they all start with :
# We can access a single column of a DataFrame by writing
# dialysis[:stations]. This will be a 1-dimensional Array with length
# equal to the number of rows in the dialysis DataFrame.

# panellag is a function defined in Dialysis.jl. It creates lags and
# leads of variables in panel data, inserting missing values where
# appropriate.
# Putting dialysis[:invest] on the left will create a new column in
# the dialysis DataFrame.
dialysis[:invest] = panellag(:stations, dialysis, :provfs, :year, -1) -
    dialysis[:stations];
# The ; prevents the notebook from printing the output of the last
# command. Otherwise, notebooks display the output of the last command
# in each cell.
```
We can also create labor and hiring. Note that the choices of giving 0.5 weight to part-time workers, including social workers, and weighting all types of staff equally are all somewhat arbitrary and may not agree exactly with what Grieco and McDevitt (2017) did.
```julia
dialysis[:labor] = (dialysis[:nurseFT] + 0.5*dialysis[:nursePT] +
                    dialysis[:ptcareFT] + 0.5*dialysis[:ptcarePT] +
                    dialysis[:dieticiansFT] + 0.5*dialysis[:dieticiansPT] +
                    dialysis[:social_workerFT] + 0.5*dialysis[:social_workerPT])
dialysis[:hiring] = panellag(:labor, dialysis, :provfs, :year, -1) -
    dialysis[:labor];
```
Next, create for-profit and chain indicators.
```julia
# create a Boolean for-profit indicator
dialysis[:for_profit] = dialysis[:profit_status].=="For Profit"
# The dot in .== is an example of broadcasting. It's very common to
# want to apply the same function to all elements of an array.
# Broadcasting does exactly this. If A is an array, and f() is a
# function that operates on scalars, then f.(A) will apply f to each
# element of A and return an array of the results. The above .==
# compares each element of the array dialysis[:profit_status] to the
# scalar string "For Profit".

# similarly create indicators for the two largest chains
dialysis[:fresenius] = dialysis[:chain_name].=="FRESENIUS"
dialysis[:davita] = dialysis[:chain_name].=="DAVITA";
```
State inspection rates are a bit more complicated to create.
```julia
using Statistics # for mean, std, and so on

# first make an indicator for inspected in the last year
dialysis[:inspected_this_year] =
    ((dialysis[:days_since_inspection].>=0) .&
     (dialysis[:days_since_inspection].<365))
# then take the mean by state
stateRates = by(dialysis, [:state, :year],
                # by(data, :var, f) will apply f to each group of data
                # with a different value of :var
                df -> mean(skipmissing(df[:inspected_this_year])))
# df -> mean(skipmissing(df[:inspected_this_year])) is a shorthand way
# to define a function. It's equivalent to
#
# function f(df)
#   mean(skipmissing(df[:inspected_this_year]))
# end
#
# skipmissing skips missing values inside a DataFrame. Most arithmetic
# functions will not do what you want if missing values are included.

# rename the variable in the stateRates DataFrame
rename!(stateRates, :x1 => :state_inspection_rate)
# merge the stateRates with the dialysis data
dialysis = join(dialysis, stateRates, on = [:state, :year]);
```
Creating the number of competitors in the same city is somewhat similar. Note that Grieco and McDevitt (2017) use the number of competitors in the same HSA, which would be preferable. However, this dataset does not contain information on HSAs. If you are feeling ambitious, you could try to find data linking city and state to HSA, and use that to calculate competitors in the same HSA.
```julia
dialysis[:city] = uppercase.(dialysis[:city])
comps = by(dialysis, [:city, :year],
           df -> mapreduce((x) -> ifelse(ismissing(x), 0, 1*(x>0)),
                           +, df[:patient_months]))
rename!(comps, :x1 => :competitors)
dialysis = join(dialysis, comps, on = [:city, :year]);
```
Create a table (or multiple tables) similar to Tables 1-3 of Grieco and McDevitt (2017). Comment on any notable differences. The following code will help you get started.
```julia
using Statistics
# at the very least, you will need to change this list
vars = [:patient_years_rom, :labor, :hiring]
# You shouldn't need to change this function, but you can if you want
function summaryTable(df, vars;
                      funcs=[mean, std, x->length(collect(x))],
                      colnames=[:Variable, :Mean, :StDev, :N])
  # In case you want to search for information about the syntax used
  # here, [XXX for XXX] is called a comprehension.
  # The ... is called the splat operator.
  DataFrame([vars [[f(skipmissing(df[v])) for v in vars] for f in funcs]...],
            colnames)
end
summaryTable(dialysis, vars)
```
| | Variable | Mean | StDev | N |
|---|---|---|---|---|
| | Any | Any | Any | Any |
1 | patient_years_rom | 67.1459 | 45.6089 | 48224 |
2 | labor | 13.1744 | 8.34365 | 42512 |
3 | hiring | 0.422357 | 4.30587 | 34665 |
Create some figures to explore the data. Try to be creative. Are there any strange patterns or other obvious problems with the data?
Here are some examples to get started. You may want to look at the StatPlots.jl, Plots.jl, or VegaLite.jl github pages for more examples.
```julia
using StatPlots, Plots
```
This package has been renamed to StatsPlots. The name StatPlots has been deprecated.
Please run
]rm StatPlots
]add StatsPlots
Failure to update will mean that you do not receive new developments.
```julia
Plots.gr(fmt=:png)
dialysis[:patient_years] = dialysis[:patient_months]/12
# missings will mess up corrplot
vars = [:patient_years, :patient_years_hd, :patient_years_rom]
inc = completecases(dialysis[vars])
@df dialysis[inc,:] corrplot([:patient_years :patient_years_hd :patient_years_rom])
```
```julia
function yearPlot(var)
  data = dialysis[completecases(dialysis[[:year, var]]),:]
  scatter(data[:year], data[var], alpha=0.1, legend=:none,
          markersize=3, markerstrokewidth=0.0)
  yearmeans = by(data, :year,
                 mean = var => x->mean(skipmissing(x)),
                 q01 = var => x->quantile(skipmissing(x), 0.01),
                 q10 = var => x->quantile(skipmissing(x), 0.1),
                 q25 = var => x->quantile(skipmissing(x), 0.25),
                 q50 = var => x->quantile(skipmissing(x), 0.50),
                 q75 = var => x->quantile(skipmissing(x), 0.75),
                 q90 = var => x->quantile(skipmissing(x), 0.9),
                 q99 = var => x->quantile(skipmissing(x), 0.99))
  @df yearmeans plot!(:year, :mean, colour = ^(:black), linewidth=4)
  @df yearmeans plot!(:year, cols(3:ncol(yearmeans)),
                      colour = ^(:red), alpha=0.3, legend=:none,
                      xlabel="year", ylabel=String(var))
end
yearPlot(:labor)
```
The above plot shows a scatter of labor vs year. The thick black line shows average labor each year. The red lines are the 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.99 quantiles conditional on year.
Grieco and McDevitt (2017) use the residuals from regressing the infection rate on patient characteristics as a measure of quality. Since the infection rate is a noisy measure of quality, they instrument with the standardized mortality ratio as a second measure of quality. Medicare collects the data we are using in part to create the “Dialysis Facility Compare” website, which is meant to allow consumers to compare quality of dialysis facilities. Browsing around the Dialysis Facility Compare or by looking at the first few pages of a sample Dialysis Facility Report, you will see that there are a number of other variables that Medicare considers indicators of quality. Pick one of these (it may or may not be included in the extract of data I provided), and argue for or against using it instead of or in addition to the septic infection rate and standardized mortality ratio.
We can construct residuals from an OLS regression as follows:
""" ols_residuals(data::AbstractDataFrame, y::Symbol, x::Array{Symbol,1}; intecept::Bool=true) This is a doc string. After executing this cell, if you type ?ols_residuals, you will see this text. Calculate residuals from an OLS regression of data[y] on data[x] Inputs: - `data` DataFrame containg y and x - `y` Symbol specifying y variable - `x` Symbol specifying x variables Output: - Vector of residuals of length = nrow(data) """ function ols_residuals(data::DataFrame, y::Symbol, x::Array{Symbol,1}; # arguments following the # are optional intercept::Bool=true ) # The :: are type specifications. They could be left out, and this # function would still work just fine. One of their purposes are to # document what inputs this function expects, and throw useful error # messages if someone tries calling the function on the wrong types. inc = completecases(data[[y, x...]]) # deal with missing Y = disallowmissing(data[y][inc]) if (intercept) X = [ones(sum(inc)) data[x][inc,:]] else X = data[x][inc,:] end X = disallowmissing(convert(Matrix, X)) # you can type Greek and some other LaTeX characters by typing their LaTeX # code followed by tab, e.g. \beta<TAB> and \in<TAB> β = X \ Y # β ∈ argmin_b || X*b - Y || ϵ = Y - X*β if (any(.!inc)) # add back in missings resid = Array{Union{Missing,eltype(ϵ)},1}(undef, nrow(data)) resid .= missing resid[inc] = ϵ return(resid) else # no missing, just return ϵ return(ϵ) end end q = -ols_residuals(dialysis, :pct_septic, [:days_since_inspection, :patient_age, :pct_female, :patient_esrd_years, :pct_fistula, :comorbidities, :hemoglobin]);
Of course, regression is common enough that there are already Julia packages for it. I included the ols_residuals
only for pedagogical purposes. Whenever there exists a well-used package, it is (usually) better to use the package than try to write your own functions. Here’s how to accomplish the same thing using FixedEffectModels.jl.
```julia
using FixedEffectModels
dialysis[:idcat] = categorical(dialysis[:provfs])
# FixedEffectModels requires clustering and fixed effect variables to
# be categorical
qreg = reg(dialysis,
           @model(pct_septic ~ days_since_inspection + patient_age +
                  pct_female + patient_esrd_years + pct_fistula +
                  comorbidities + hemoglobin,
                  vcov=cluster(idcat)),
           save=true) # saves residuals in augmentdf
dialysis[:quality] = -qreg.augmentdf[:residuals]
# Let's test that these results are the same as those from ols_residuals
println("Mean absolute difference = $(mean(skipmissing(
            abs.(q .- dialysis[:quality]))))")
```
Mean absolute difference = 4.867410822221838e-14
```julia
# Using $(expr) in a string will insert the result of expr into the
# string.
using Test
@test all(skipmissing(q .≈ dialysis[:quality])) == true
```
Test Passed
Testing is an important part of software development. The Test.jl package provides helper functions for running tests. See these notes from 628 for more information about testing.
Reproduce columns 2, 3, 5, and 6 of Table 5. The syntax for fixed effects regression is shown below:
```julia
# you may want to use patient_years_hd or patient_years_rom instead
log_infmiss = x->ifelse(!ismissing(x) && x>0, log(x), missing)
# -Inf confuses reg()
dialysis[:lpy] = log_infmiss.(dialysis[:patient_months]./12)
dialysis[:logL] = log_infmiss.(dialysis[:labor])
dialysis[:logK] = log_infmiss.(dialysis[:stations])
# you may want to restrict the sample to match the sample that can be
# used in the model estimates
reg(dialysis, @model(lpy ~ quality + logK + logL,
                     fe = idcat, vcov = cluster(idcat)))
```
```
=========================================================================
                           Fixed Effect Model
=========================================================================
Number of obs:       32013   Degrees of freedom:       3
R2:                  0.882   R2 within:            0.144
F-Statistic:       261.008   p-value:              0.000
Iterations:              1   Converged:             true
=========================================================================
            Estimate   Std.Error   t value  Pr(>|t|)  Lower 95%   Upper 95%
quality  -0.00921389 0.000589449  -15.6314     0.000 -0.0103692 -0.00805855
logK        0.171904   0.0285335   6.02462     0.000   0.115977     0.22783
logL        0.470537   0.0171156   27.4917     0.000    0.43699    0.504085
```
Be sure to add the other columns. If you’d like, you could use RegressionTables.jl to produce tables that look a lot like the ones in the paper.
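For example, a minimal sketch of RegressionTables.jl usage (the exact API may differ across package versions; check the package README):

```julia
# Hypothetical sketch: print an ascii regression table for one or more
# fitted models from FixedEffectModels.jl. Assumes `dialysis`, `lpy`,
# `quality`, `logK`, `logL`, and `idcat` from the cells above.
using RegressionTables, FixedEffectModels
r1 = reg(dialysis, @model(lpy ~ quality + logK + logL,
                          fe = idcat, vcov = cluster(idcat)))
regtable(r1; renderSettings = asciiOutput())
```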
As discussed in section 5 of Grieco and McDevitt (2017), the coefficient on quality, \(\alpha\), is estimated from \[
y_{jt} = \alpha q_{jt} + \Phi(\underbrace{h_{jt}, k_{jt}, l_{jt}, x_{jt}}_{w_{jt}}) +
\epsilon_{jt}
\] with a second noisy measure of quality, \(z_{jt}\), used to instrument for \(\alpha\). To estimate \(\alpha\), first the exogenous variables, \(w\), can be partialed out to give: \[
y_{jt} - \Er[y_{jt} | w_{jt} ] = \alpha (q_{jt} - \Er[q_{jt}|w_{jt}]) +
\epsilon_{jt}
\] where we used the assumption that \(\Er[\epsilon_{jt} | w_{jt} ] = 0\) and the fact that \(\Er[\Phi(w) | w] = \Phi(w)\). Under the assumption that \(\Er[\epsilon| z, w] = 0\), we can estimate \(\alpha\) based on the moment condition: \[
\begin{align*}
0 = & \Er[\epsilon f(z,w) ] \\
0 = & \Er\left[ \left(y_{jt} - \Er[y_{jt} | w_{jt} ] - \alpha
(q_{jt} - \Er[q_{jt}|w_{jt}])\right) f(z_{jt},w_{jt}) \right]
\end{align*}
\] If \(Var(\epsilon|z,w)\) is constant, the efficient choice of \(f(z,w)\) is \[
\Er[\frac{\partial \epsilon}{\partial \alpha} |z, w ] = \Er[q| z, w] - \Er[q|w]
\] To estimate \(\alpha\), we simply replace these conditional expectations with regression estimates, and replace the unconditional expectation with a sample average. Let \(\hat{\Er}[y|w]\) denote a nonparametric estimate of the regression of \(y\) on \(w\). Then, \[
\hat{\alpha} = \frac{\sum_{j,t} (y_{jt} -
\hat{E}[y|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])}
{\sum_{j,t} (q_{jt} - \hat{E}[q|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])}
\] The function partiallinearIV
in Dialysis.jl will estimate this model. Also included are two methods for estimating \(\hat{E}[y|w]\). polyreg
estimates a polynomial series regression; that is, it regresses \(y\) on a polynomial of degree \(d\) in \(w\). To allow the regression to approximate any function, the degree must increase with the sample size, but to control variance the degree must not increase too quickly. We will not worry too much about the choice of degree here.
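To see the formula for \(\hat{\alpha}\) in action, here is a self-contained sketch on simulated data (all names here are hypothetical, not from Dialysis.jl; in this linear simulation the conditional expectations are linear, so OLS fits stand in for the nonparametric estimates):

```julia
using Random
Random.seed!(0)
n = 10_000
w = randn(n)              # exogenous variable, playing the role of w
z = randn(n)              # instrument / second quality measure
q = z + 0.5*w + randn(n)  # noisy quality measure
α0 = -0.5                 # true coefficient
y = α0*q + w + randn(n)   # outcome, with Φ(w) = w

fit(X, v) = X * (X \ v)   # fitted values from least squares
X_w   = [ones(n) w]
X_zw  = [ones(n) z w]
ey_w  = fit(X_w, y)       # Ê[y|w]
eq_w  = fit(X_w, q)       # Ê[q|w]
eq_zw = fit(X_zw, q)      # Ê[q|z,w]
# the ratio estimator from the displayed formula above
αhat = sum((y .- ey_w).*(eq_zw .- eq_w)) /
       sum((q .- eq_w).*(eq_zw .- eq_w))
```

With n = 10,000, `αhat` should be close to the true value of -0.5.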
An alternative method (and what Grieco and McDevitt (2017) used) is local linear regression. To estimate \(\hat{E}[y|x_{jt}]\), local linear regression estimates a linear regression of \(y\) on \(x\), but weights observations by how close \(x_{it}\) is to \(x_{jt}\). That is, \[
\hat{E}[y|x_{jt}] = x_{jt} \hat{\beta}(x_{jt})
\] where \[
\hat{\beta}(x_{jt}) = \argmin_\beta \sum_{i,t} (y_{it} -
x_{it}\beta)^2 k((x_{it} - x_{jt})/h_n)
\] Here \(k()\) is some function with its maximum at 0 (and some other properties), like \(k(x) \propto e^{-x^2}\). The bandwidth, \(h_n\), determines how much weight to place on observations close to versus far from \(x_{jt}\). Similar to the degree in polynomial regression, the bandwidth must decrease toward 0 with the sample size to allow local linear regression to approximate any function, but to control variance the bandwidth must not decrease too quickly. We will not worry too much about the choice of bandwidth. The function locallinear
in Dialysis.jl estimates a local linear regression.
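To fix ideas, here is a minimal one-dimensional local linear regression written directly from the definition above. This is an illustrative sketch, not the `locallinear` implementation in Dialysis.jl; the Gaussian kernel, bandwidth, function name, and test data are all choices made here:

```julia
using LinearAlgebra, Random
# Local linear regression in one dimension: at each prediction point
# x0, fit a kernel-weighted linear regression and return its intercept.
function locallinear1d(xpred, x, y; h=0.3)
    k(u) = exp(-u^2/2)                  # Gaussian kernel, maximum at 0
    map(xpred) do x0
        wts = k.((x .- x0)./h)          # nearby observations get weight
        X = [ones(length(x)) (x .- x0)] # local linear design
        β = (X' * Diagonal(wts) * X) \ (X' * Diagonal(wts) * y)
        β[1]                            # intercept = Ê[y | x = x0]
    end
end

# try it on noisy draws from a known function
Random.seed!(1)
x = collect(range(-2, 2, length=400))
y = sin.(x) .+ 0.1 .* randn(length(x))
yhat = locallinear1d([0.0, 1.0], x, y)  # should be near sin(0), sin(1)
```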
Estimate \(\alpha\) using the following code. You may want to modify some aspects of it and/or report estimates of \(\alpha\) for different choices of instrument, nonparametric estimation method, degree or bandwidth. Compare your estimate(s) of \(\alpha\) with the ones in Tables 5 and 6 of Grieco and McDevitt (2017).
```julia
# create indicator for observations usable in estimation of α
inc1 = ((dialysis[:patient_months] .> 0) .&
        (dialysis[:labor] .> 0) .&
        (dialysis[:stations] .> 0) .&
        .!ismissing.(dialysis[:quality]) .&
        .!ismissing.(dialysis[:std_mortality]) .&
        (dialysis[:invest].==0) .&
        (dialysis[:hiring].!=0));
inc1[ismissing.(inc1)] .= false;
dialysis[:inc1] = inc1;

dialysis[:lsmr] = log.(dialysis[:std_mortality] .+ .01)
# As degree → ∞ and/or bandwidth → 0, whether we use :std_mortality or
# some transformation as the instrument should not matter. However,
# for fixed degree or bandwidth it will have some (hopefully small)
# impact.
(α, Φ, αreg, eyqz) =
    partiallinearIV(:lpy,     # y
                    :quality, # q
                    :lsmr,    # z
                    [:hiring, :logL, :logK,
                     :state_inspection_rate, :competitors], # w
                    dialysis[findall(dialysis[:inc1]),:];
                    npregress=(xp, xd, yd)->polyreg(xp, xd, yd, degree=1),
                    parts=true
                    # You may want to change the degree here
                    #
                    # You could also change `polyreg` to `locallinear`
                    # and `degree` to `bandwidthmultiplier`
                    #
                    # locallinear will likely take some time to
                    # compute (≈350 seconds on my computer)
                    )
# we will need these later in step 2
dialysis[:Φ] = similar(dialysis[:lpy])
dialysis[:Φ] .= missing
dialysis[:Φ][findall(dialysis[:inc1])] = Φ
dialysis[:ey] = similar(dialysis[:lpy])
dialysis[:ey] .= missing
dialysis[:ey][findall(dialysis[:inc1])] = eyqz[:,1]
dialysis[:eq] = similar(dialysis[:lpy])
dialysis[:eq] .= missing
dialysis[:eq][findall(dialysis[:inc1])] = eyqz[:,2]
dialysis[:ez] = similar(dialysis[:lpy])
dialysis[:ez] .= missing
dialysis[:ez][findall(dialysis[:inc1])] = eyqz[:,3]
```
The coefficients on labor and capital are estimated by GMM. The idea of GMM is as follows. We have a model that implies \[ \Er[c(y,x;\theta) | z ] = 0 \] where \(y\), \(x\), and \(z\) are observed variables. \(c(y,x;\theta)\) is some known function of the data and some parameters we want to estimate, \(\theta\). Often, \(c(y,x;\theta)\) are the residuals from some equation. For example, for linear IV, we’d have \[ c(y,x;\theta) = y - x\theta \] The conditional moment restriction above implies that \[ \Er[c(y,x;\theta)f(z) ] = 0 \] for any function \(f()\). We can then estimate \(\theta\) by replacing the population expectation with a sample average and finding \(\hat{\theta}\) such that \[ \En[c(y,x;\hat{\theta})f(z) ] \approx 0 \] The dimension of \(f(z)\) should be greater than or equal to the dimension of \(\theta\), so we have at least as many equations as unknowns. We find this \(\hat{\theta}\) by minimizing a quadratic form of these equations. That is, \[ \hat{\theta} = \argmin_\theta \En[g_i(\theta)] W_n \En[g_i(\theta)]' \] where \(g_i(\theta) = c(y_i, x_i;\theta)f(z_i)\), and \(W_n\) is some positive definite weighting matrix.
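As a concrete illustration (not code from the assignment), consider linear IV with \(f(z) = z\) and as many instruments as parameters. The sample moment conditions can then be solved exactly, so the GMM estimator is \(\hat{\theta} = (Z'X)^{-1}Z'y\) regardless of the weighting matrix. The simulated design below is made up for this sketch:

```julia
using Random, LinearAlgebra, Statistics
Random.seed!(2)

n = 1000
z = randn(n, 2)                          # instruments, independent of the error
u = randn(n)                             # structural error
x = z .+ 0.5 .* u .+ 0.5 .* randn(n, 2)  # regressors correlated with u (endogenous)
θ0 = [1.0, -0.5]
y = x * θ0 .+ u

θiv = (z' * x) \ (z' * y)                # solves Eₙ[(y - xθ) z] = 0 exactly
gbar = vec(mean(z .* (y .- x * θiv), dims=1))  # sample moments at θiv, ≈ 0
```

With more instruments than parameters, the moments can no longer be set exactly to zero, which is when the weighting matrix \(W_n\) starts to matter.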
As practice with GMM, use it to estimate a simple regression model, \[ y = x\beta + \epsilon \] assuming \(\Er[\epsilon|x] = 0\). Test your code on simulated data. The following will help you get started.
function sim_ols(n; β = ones(3))
  x = randn(n, length(β))
  ϵ = randn(n)
  y = x*β + ϵ
  return(x, y)
end

β = ones(2)
(x, y) = sim_ols(100; β=β)
βols = (x'*x) \ (x'*y)

function gmm_objective(β)
  gi = (y - x*β) .* x
  Egi = mean(gi, dims=1)
  error("This is incomplete; you must finish it")
  # It is likely that the code you write will return a 1 x 1,
  # 2 dimensional array. For compatibility with Optim, you need to
  # return a scalar. If foo is a 1x1 array, write `foo[1]` to return
  # a scalar instead of a 1x1 array.
end

# minimize gmm_objective
using Optim
using Test
# github page: https://github.com/JuliaNLSolvers/Optim.jl
# docs: http://julianlsolvers.github.io/Optim.jl/stable/
try
  res = optimize(gmm_objective,
                 zeros(size(β)), # initial value
                 BFGS(),         # algorithm, see http://julianlsolvers.github.io/Optim.jl/stable/
                 autodiff=:forward)
  βgmm = res.minimizer
catch err
  @info err
  βgmm = βols
  res = nothing
end
@test βgmm ≈ βols
Error During Test
  Test threw exception
  Expression: βgmm ≈ βols
  UndefVarError: βgmm not defined
ERROR: Test.FallbackTestSetException("There was an error during testing")
The model implies that \[ \omega_{jt} = \Phi(w_{jt}) - \beta_k k_{jt} - \beta_l l_{jt} \] {#eq:omega} and \[ y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt} = g(\omega_{jt-1}) + \eta_{jt} \] {#eq:eta} The timing and exogeneity assumptions imply that \[ \Er[\eta_{jt} | k_{jt}, l_{jt}, w_{jt-1}] = 0 \] Given a value of \(\beta\), and our above estimates of \(\Phi\) and \(\alpha\), we can compute \(\omega\) from the equation for \(\omega_{jt}\) above, and then estimate \(g()\) and \(\eta\) by a nonparametric regression of \(y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt}\) on \(\omega_{jt-1}\). \(\beta\) can then be estimated by finding the value of \(\beta\) that comes closest to satisfying the moment conditions \[ \Er[\eta(\beta)_{jt} k_{jt}] = 0 \text{ and } \Er[\eta(\beta)_{jt} l_{jt}] = 0 \] To do this, we minimize \[ Q_n(\beta) = \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right) W_n \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right)' \]
Write the body of the \(Q_n(\beta)\) function below. Use it to estimate \(\beta\). Compare your results with those of Grieco and McDevitt (2017). Optionally, explore robustness of your results to changes in the specification.
# indicator for observations usable in estimation of β
dialysis[:inclag] = panellag(:inc1, dialysis, :provfs, :year, 1);
dialysis[:inc2] = (dialysis[:inclag] .&
                   (dialysis[:stations].>0) .&
                   (dialysis[:labor].>0) .&
                   (dialysis[:patient_years].>0) .&
                   .!ismissing.(dialysis[:quality]));
dialysis[:inc2][ismissing.(dialysis[:inc2])] .= false;

(ωfunc, ηfunc) = errors_gm(:lpy, :logK, :logL, :quality, :Φ,
                           :provfs, :year, dialysis, α; degree=1)

function Qn(β)
  η = ηfunc(β)
  error("You must write the body of this function")
end

using Optim
try
  res = optimize(Qn,              # objective
                 [0.0, 0.0],      # lower bounds, should not be needed, but
                                  # may help numerical performance
                 [1.0, 1.0],      # upper bounds
                 [0.4, 0.2],      # initial value
                 Fminbox(BFGS()), # algorithm
                 autodiff=:forward)
  β = res.minimizer
  @show β
catch err
  @info err
  res = nothing
  β = nothing
end
Many, perhaps most, estimators in econometrics are extremum estimators. That is, many estimators are defined by \[ \hat{\theta} = \argmax_{\theta \in \Theta} \hat{Q}_n(\theta) \] where \(\hat{Q}_n(\theta)\) is some objective function that depends on data. Examples include maximum likelihood, \[ \hat{Q}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(z_i | \theta), \] nonlinear least squares, \[ \hat{Q}_n(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - h(x_i,\theta))^2, \] and, as we are using for this example, GMM, \[ \hat{Q}_n(\theta) = \left(\frac{1}{n} \sum_{i=1}^n g(z_i, \theta)\right)' \hat{W} \left(\frac{1}{n} \sum_{i=1}^n g(z_i, \theta)\right). \] (The least squares and GMM objectives are minimized; multiplying by \(-1\) fits them into the maximization framing above.) See Newey and McFadden (1994) for more details and examples.
We will encounter extremum estimators often in this course, so it is useful to be familiar with their statistical properties. However, since this course is not focused on econometrics, we will just state some basic “high-level” conditions for consistency and asymptotic normality, and only give brief sketches of proofs. Our main goal is to find the asymptotic distribution of \(\hat{\theta}\), so that we can report standard errors and confidence regions. If \(\hat{Q}_n\) is differentiable, then \[ 0 = \nabla \hat{Q}_n(\hat{\theta}). \] Taking a first order expansion around \(\theta_0\), \[ 0 \approx \nabla \hat{Q}_n(\theta_0) + \nabla^2 \hat{Q}_n(\theta_0)(\hat{\theta}-\theta_0). \] Rearranging and multiplying by \(\sqrt{n}\) gives \[ \sqrt{n}(\hat{\theta}-\theta_0) \approx -\left(\nabla^2 \hat{Q}_n(\theta_0)\right)^{-1} \sqrt{n} \nabla \hat{Q}_n(\theta_0). \] If a law of large numbers applies to \(\nabla^2 \hat{Q}_n(\theta_0)\) and a central limit theorem applies to \(\sqrt{n} \nabla \hat{Q}_n(\theta_0)\), then \(\sqrt{n}(\hat{\theta}-\theta_0)\) will be asymptotically normal. The following theorem states this idea somewhat more precisely.
Consistency for extremum estimators: assume
\(\hat{Q}_n(\theta)\) converges uniformly in probability to \(Q_0(\theta)\)
\(Q_0(\theta)\) is uniquely maximized at \(\theta_0\).
\(\Theta\) is compact and \(Q_0(\theta)\) is continuous.
Then \(\hat{\theta} \inprob \theta_0\).
Asymptotic normality for extremum estimators: assume
\(\hat{\theta} \inprob \theta_0\)
\(\theta_0 \in interior(\Theta)\)
\(\sqrt{n} \nabla \hat{Q}_n(\theta_0) \indist N(0,\Sigma)\)
\(\nabla^2 \hat{Q}_n(\theta)\) converges uniformly in probability to \(H(\theta)\), \(H(\theta)\) is continuous, and \(H = H(\theta_0)\) is nonsingular
Then \(\sqrt{n} (\hat{\theta} - \theta_0) \indist N\left(0,H^{-1} \Sigma H^{-1} \right)\)
For a GMM objective function of the form \[ \hat{Q}_n(\theta) = \left[\frac{1}{n} \sum_i g_i(\theta)\right]' W_n \left[\frac{1}{n} \sum_i g_i(\theta)\right], \] if we assume:
\(1/\sqrt{n} \sum_i g_i(\theta_0) \indist N(0,\Sigma)\)
\(1/n \sum_i \nabla g_i(\theta) \inprob E[\nabla g(\theta)] = D\), \(W_n \inprob W\)
\((D'WD)\) is nonsingular.
then the above theorem for asymptotic normality of extremum estimators implies that \[ \sqrt{n}(\hat{\theta} - \theta_0) \indist N(0,\Omega) \] where \[ \Omega= (D'WD)^{-1} (D' W \Sigma W D) (D'WD)^{-1}. \] If we additionally assume \(W_n \inprob \Sigma^{-1}\), e.g. observations are independent and \(W_n = \widehat{Var}(g_i(\theta))^{-1}\), then the asymptotic variance simplifies to \((D' \Sigma D)^{-1}\). This choice of \(W\) is efficient in that it leads to the smallest asymptotic variance.
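The sandwich formula is easy to verify numerically. The sketch below uses arbitrary made-up matrices \(D\), \(\Sigma\), and \(W\) (not quantities from the dialysis application) to check that the efficient choice \(W = \Sigma^{-1}\) collapses \(\Omega\) to \((D'\Sigma^{-1}D)^{-1}\), and that this is weakly smaller than the variance under identity weighting:

```julia
using LinearAlgebra, Random
Random.seed!(3)

D = randn(4, 2)                    # Jacobian of 4 moments w.r.t. 2 parameters
A = randn(4, 4)
Σ = A * A' + I                     # positive definite variance of the moments

# Ω = (D'WD)⁻¹ (D'WΣWD) (D'WD)⁻¹ for a given weighting matrix W
sandwich(W) = inv(D' * W * D) * (D' * W * Σ * W * D) * inv(D' * W * D)

W1 = Matrix{Float64}(I, 4, 4)      # arbitrary (identity) weighting
Ω_ineff = sandwich(W1)
Ω_eff   = sandwich(inv(Σ))         # efficient weighting W = Σ⁻¹
# Ω_eff equals (D'Σ⁻¹D)⁻¹, and Ω_ineff - Ω_eff is positive semidefinite
```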
The above applies to estimators that come from minimizing a single objective function. This application involves a multi-step estimator. First, we estimated \(\alpha\) and \(\Phi()\); then we estimated \(\beta\) by GMM using moments of the form \[ \hat{\beta} = \argmin_\beta \left[\frac{1}{n} \sum_i g_i(\beta, \hat{\alpha}, \hat{\Phi})\right] W_n \left[\frac{1}{n} \sum_i g_i(\beta, \hat{\alpha}, \hat{\Phi})\right]'. \] An expansion similar to the one above gives \[ \sqrt{n}(\hat{\beta} - \beta_0) \approx -(D'WD)^{-1} D' W \left(\frac{1}{\sqrt{n}} \sum_i g_i(\beta_0, \hat{\alpha}, \hat{\Phi}) \right) \] where \(D=\Er[\nabla_\beta g(\beta_0, \alpha_0, \Phi_0)]\). If we also expand \(g_i\) in terms of \(\alpha\) and \(\Phi\), we get \[ \sqrt{n}(\hat{\beta} - \beta_0) \approx -(D'WD)^{-1} D' W \left(\frac{1}{\sqrt{n}} \sum_i g_i(\beta_0, \alpha_0, \Phi_0) + \Er\left[\frac{\partial g_i}{\partial \alpha}(\beta_0,\alpha_0,\Phi_0)\right] \sqrt{n}(\hat{\alpha}-\alpha_0) + \Er\left[\frac{\partial g_i}{\partial \Phi}(\beta_0,\alpha_0,\Phi_0) \sqrt{n}(\hat{\Phi}-\Phi_0)\right] \right). \] So there will be additional variance in \(\hat{\beta}\) from the estimation of \(\hat{\alpha}\) and \(\hat{\Phi}\). Since \(\alpha\) is finite dimensional, it is not too difficult to derive the distribution of \(\sqrt{n}(\hat{\alpha}-\alpha_0)\) and estimate \(\Er[\frac{\partial g_i}{\partial \alpha}(\beta,\alpha,\Phi)]\). However, \(\hat{\Phi}\) is a function and is more difficult to deal with. Under some strong assumptions, \(\Er[\frac{\partial g_i}{\partial \Phi}(\beta,\alpha,\Phi) \sqrt{n}(\hat{\Phi}-\Phi_0)]\) is asymptotically normal, but the needed assumptions are somewhat restrictive and tedious to state. An alternative approach is to redefine \(g_i\) to ensure that \(\Er[\frac{\partial g_i}{\partial \Phi}] = 0\).
This can be done by letting \[ \begin{align*} g_{jt}(\beta, \alpha, \Phi) = & \left(y_{jt} - \alpha q_{jt} - x_{jt} \beta - h(\Phi(w_{jt-1}) - x_{jt-1}\beta) \right) \left(x_{jt} - \Er[x|\Phi(w_{jt-1}) - x_{jt-1}\beta]\right) + \\ & - \left(y_{jt-1} - \alpha q_{jt-1} - \Phi(w_{jt-1})\right) h'(\Phi(w_{jt-1}) - x_{jt-1}\beta) \left(x_{jt} - \Er[x|\Phi(w_{jt-1}) - x_{jt-1}\beta]\right) \end{align*} \] where \(x = (k, l)\), and \(h(\Phi(w_{jt-1}) - x_{jt-1}) = \Er[y_{jt} - \alpha q_{jt} - x_{jt} \beta |\Phi(w_{jt-1}) - x_{jt-1}\beta]\). It is not too difficult to verify that \[ \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)] = 0 \] and \[ 0 = D_\Phi \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)]\;\;,\;\; 0 = D_h \Er[g_{jt}(\beta_0,\alpha_0,\Phi_0)] \] In other words, these moment conditions are orthogonal in the sense of Chernozhukov et al. (2018). Estimation error in \(\Phi\) and \(h\) only has second order effects on the estimate of \(\beta\). Under appropriate assumptions, these second order effects will vanish quickly enough that they can be ignored in the asymptotic distribution of \(\hat{\beta}\).
We can similarly deal with the uncertainty in \(\hat{\alpha}\) by redefining the moment condition to be orthogonal with respect to \(\alpha\). Let \[ \tilde{g}_{jt}(\beta,\alpha,\Phi) = g_{jt}(\beta, \alpha, \Phi) - \Er[ D_\alpha g_{jt}(\beta_0, \alpha, \Phi)] \frac{(y_{jt} - \alpha q_{jt} - \Phi(w_{jt})) (\Er[q|z_{jt},w_{jt}] - \Er[q|w_{jt}])}{\Er[(q - \Er[q|w])(\Er[q|z,w]-\Er[q|w])]}. \] Then \(\Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)] = 0\) and \[ 0 = D_\Phi \Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)]\;\;,\;\; 0 = D_\alpha \Er[\tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0)]. \] Hence, \[ \begin{align*} \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) = & \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \alpha_0, \Phi_0 ) + o_p(1) \end{align*} \]
Under appropriate assumptions a CLT will apply to \(\tilde{g}_{jt}\), so \[ \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\hat{\beta}, \hat{\alpha}, \hat{\Phi} ) = \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0,\alpha_0,\Phi_0) + o_p(1) \indist N(0,\Sigma). \] Furthermore, \(\Sigma\) can be estimated by taking the sample (clustered) covariance of \(\tilde{g}_{jt}(\beta,\hat{\alpha},\hat{\Phi})\). Denote this by \(\hat{\Sigma}(\beta)\). We then have \[ Q_n^{CUE}(\beta_0) = n \left(\frac{1}{n} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) \right)' \hat{\Sigma}(\beta_0)^{-1} \left(\frac{1}{n} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \hat{\alpha}, \hat{\Phi} ) \right) \indist \chi^2_{\dim(\tilde{g}_{jt})}. \] This can be used to test \(H_0: \beta = \beta_0\), or to form a confidence region by taking all values of \(\beta\) for which the test fails to reject. Such an inference procedure is robust to identification problems in that it does not require an assumption that \(D = D_\beta \Er[g_{jt}(\beta,\alpha_0,\Phi_0)]\) is full rank or that \(Q_n\) is uniquely minimized at \(\beta_0\). If we are willing to assume strong identification, in that \((D' \Sigma(\beta_0)^{-1} D)\) is invertible (and, loosely speaking, the estimated version of this is far from singular relative to its estimation error), then we can take an expansion to get \[ \sqrt{n}(\hat{\beta} - \beta_0) = -(D'\Sigma^{-1} D)^{-1} D'\Sigma^{-1} \frac{1}{\sqrt{n}} \sum_{j,t} \tilde{g}_{jt}(\beta_0, \alpha_0, \Phi_0) + o_p(1) \indist N(0, (D' \Sigma^{-1} D)^{-1}). \]
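Inverting the \(\chi^2\) test amounts to collecting all \(\beta_0\) whose objective value falls below the appropriate \(\chi^2\) critical value. A tiny sketch of the mechanics, where the objective value `q` is a made-up number standing in for \(Q_n^{CUE}(\beta_0)\):

```julia
using Distributions

df = 2                              # two moment conditions here, so χ² with 2 df
c95 = quantile(Chisq(df), 0.95)     # 5% critical value, ≈ 5.99
q = 3.2                             # hypothetical value of Qₙᶜᵁᴱ(β₀)
in_region = q <= c95                # true ⇒ β₀ is in the 95% confidence region
pval = 1 - cdf(Chisq(df), q)        # p-value for H₀: β = β₀
```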
See my notes from 628 on extremum estimation and on bootstrap for more details.
The objective_gm
function in Dialysis.jl constructs the \(Q_n^{CUE}\) objective function described above. The following code minimizes it and plots a confidence region for \(\beta\) based on inverting the \(\chi^2\) test described above.
(obj, momenti, cue, varfn) =
  objective_gm(:lpy, :logK, :logL, :quality, :Φ,
               :provfs, :year, [:logK, :logL], dialysis,
               clusterid=:idcat,
               npregress=(xp,xd,yd)->polyreg(xp,xd,yd,degree=1, deriv=true));
res = optimize(b->cue(b,α),      # objective
               [0.0, 0.0],       # lower bounds, should not be needed, but
                                 # may help numerical performance
               [1.0, 1.0],       # upper bounds
               [0.4, 0.2],       # initial value
               Fminbox(BFGS()),  # algorithm
               autodiff=:forward)
res
Results of Optimization Algorithm
 * Algorithm: Fminbox with BFGS
 * Starting Point: [0.4,0.2]
 * Minimizer: [2.4281864085583334e-20,0.14987299477154925, …]
 * Minimum: 2.340586e+00
 * Iterations: 7
 * Convergence: true
   * |x - x'| ≤ 0.0e+00: false
     |x - x'| = 2.98e-17
   * |f(x) - f(x')| ≤ 0.0e+00 |f(x)|: true
     |f(x) - f(x')| = 0.00e+00 |f(x)|
   * |g(x)| ≤ 1.0e-08: false
     |g(x)| = 4.44e+00
   * Stopped by an increasing objective: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 1271
 * Gradient Calls: 1271
# calculate standard errors based on normal approximation
βhat = res.minimizer
gi = momenti(βhat, α)
(Σ, n) = varfn(gi)
using ForwardDiff
using LinearAlgebra # for diag
D = ForwardDiff.jacobian(β->mean(momenti(β,α), dims=1), βhat)
Vβ = inv(D'*inv(Σ)*D)/n
@show βhat
βhat = [2.42819e-20, 0.149873]
@show sqrt.(diag(Vβ))
sqrt.(diag(Vβ)) = [1.46929, 0.0594817]
2-element Array{Float64,1}:
 1.469292635429146
 0.05948171117514879
using Plots, Distributions
Plots.gr()
lb = [0., 0.]
ub = [1., 1.]
ntest = 200
βtest = [rand(2).*(ub-lb) .+ lb for i in 1:ntest];
βtest = vcat(βtest'...)
fn = β->cue(β,α)
pval = Vector{Float64}(undef, ntest);
qval = similar(pval);
Threads.@threads for i in 1:size(βtest,1)
  qval[i] = fn(βtest[i,:]);
  pval[i] = cdf(Chisq(2), qval[i]);
end
crit = 0.9
fig = scatter(βtest[:,1], βtest[:,2], group=(pval.<crit),
              legend=false, markersize=4, markerstrokewidth=0.0,
              seriesalpha=1.0, palette=:isolum,
              xlabel="βk", ylabel="βl")
ngrid = 20
b = range.(lb, ub, length=ngrid)
pc = Matrix{Float64}(undef, length(b[1]), length(b[2]))
qc = similar(pc)
Threads.@threads for i in CartesianIndices(pc)
  qc[i] = fn([b[1][i[1]], b[2][i[2]]]);
  pc[i] = cdf(Chisq(2), qc[i]);
end
fig = contour!(b[1], b[2], pc', levels=[0.75, 0.9, 0.95, 0.99],
               contour_labels=true)
display(fig)
Here we plot \(Q^{CUE}_n(\beta)\).
Plots.plotly()
surface(b[1], b[2], qc', xlabel="βk", ylabel="βl")
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–C68. https://doi.org/10.1111/ectj.12097.
Grieco, Paul L. E., and Ryan C. McDevitt. 2017. “Productivity and Quality in Health Care: Evidence from the Dialysis Industry.” The Review of Economic Studies 84 (3): 1071–1105. https://doi.org/10.1093/restud/rdw042.
Newey, Whitney K., and Daniel McFadden. 1994. “Chapter 36: Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, 4:2111–2245. Elsevier. https://doi.org/10.1016/S1573-4412(05)80005-4.