Assignment: pharmacy entry

Part I-III: data preparation

Author: Paul Schrimpf
Date: 2019-03-04

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

About this document

This document was created using Weave.jl. The code is available on GitHub. The same document generates both static webpages and associated Jupyter notebooks.

\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]

Introduction

This assignment will estimate a model of pharmacy entry inspired by Bresnahan and Reiss (1991).

Getting started

https://vse.syzygy.ca provides a convenient browser-based interface to Julia. Open it and log in. This assignment is in a git repository at https://github.com/UBCECON567/PharmacyEntry. Start by cloning the git repository to your syzygy directory. Open a terminal in syzygy (File -> New -> Terminal). This will open a Linux shell in your browser. To clone the git repository, enter

git clone https://github.com/UBCECON567/PharmacyEntry

This will create a directory called PharmacyEntry containing all the files related to this assignment.

Clicking on the folder icon near the top left of the screen opens a file browser panel. Use it to open the PharmacyEntry/notebooks folder. You can complete this assignment by modifying the pharmacyentry.ipynb notebook. I recommend creating a copy of this notebook, and then working on the copy. You can create a copy by right clicking in the file browser panel. Now open your copy of the notebook.

There might be some problems with previously installed Julia packages and the upgrade to vse.syzygy.ca. If you can no longer run Julia notebooks and are seeing “No Kernel!” in the top right of the screen, you can try removing your previously installed Julia packages. Open a terminal and type “rm -rf ~/.julia”. Afterward, you will have to re-add any packages you need.

Part I: scraping pharmacy data

As in the previous assignment, we begin by loading required packages.

using Pkg 
Pkg.activate("..") 
#Pkg.resolve()
#Pkg.instantiate()
#
# If you are missing packages, you may have to uncomment one of the
# two above lines. However, they seem to cause mysterious problems with
# HTTP.get() (especially with https sites) later. If you find
# HTTP.get() is hanging, then (1) run this cell with Pkg.instantiate()
# once, (2) comment out Pkg.instantiate() (3) re-start the kernel and
# re-run everything. 
#

using Revise
if !("../src" ∈ LOAD_PATH)
  push!(LOAD_PATH, "../src")
end
using PharmacyEntry

Part of the appeal of entry models is that they can be estimated using readily available data. Bresnahan and Reiss (1991) used data on market demographic information and the number of firms within a category in each market. We can easily gather similar information. We will focus on pharmacies. We chose pharmacies because it was one of the industries included in Bresnahan and Reiss (1991), it fits with the theme of focusing on the industrial organization of health related industries, and a list of pharmacies in Canada is available. Pharmacies in Canada are provincially regulated. A list with the website of each provincial regulator can be found here.

Each provincial regulator provides a list of pharmacies with addresses. These can be downloaded and saved as DataFrames. Doing so is slightly tedious though because each provincial regulator’s website is different, so different code will be needed to parse each.

Let’s look at community pharmacies in BC, http://www.bcpharmacists.org/list-community-pharmacies. This list is already nicely formatted as a table. We could simply copy and paste its contents into a spreadsheet, save as csv, and then load into Julia. That would be an okay method, but copying and pasting is slightly more error prone (or at least prone to undocumented and unreproducible errors) than writing code to parse the website. More importantly, other provinces have websites that are not so nicely formatted and will require code to parse. We will use HTTP.jl to request webpages, Cascadia.jl to select by CSS paths, and regular expressions to find patterns in text. Although our code will be in Julia, CSS selectors and regular expressions are common and useful tools for scraping webpages with any programming language.

To scrape websites well, you need some understanding of html, CSS, and sometimes javascript. https://www.w3schools.com/ is a pretty comprehensive reference. Here is a very short explanation. CSS stands for “cascading style sheets.” CSS is used to control the formatting of webpages. html documents consist of a bunch of tagged elements (headlines, links, tables, paragraphs, divs, etc) with optional class specifications. CSS files contain information about how to display each combination of tag and class. The combination(s) of tags and classes to which a given display style (font, color, etc) applies is called a CSS selector. Each part of a webpage that looks different will have a different CSS selector. If we want to scrape some particular information off a webpage, CSS selectors are usually a good way to pick out the pieces we need.

Returning to http://www.bcpharmacists.org/list-community-pharmacies, we can see that the pharmacy names and addresses are in table rows with alternating background shades of gray. There will be CSS selectors to pick out these rows. Most web browsers include tools to create css selectors. In either Firefox or Chrome (Safari and Edge likely have similar behavior, but I have not checked), if you right click anywhere on a website, there is an “Inspect Element” option (you possibly need to enable developer tools in the web browser options first). This will open the developer toolbar. In the toolbar there will be a collapsible tree listing the nested html tags on the website. As you hover over different tags, the corresponding part of the website will be highlighted. If you right click a tag, there will be an option to copy the corresponding CSS selector. Doing this on http://www.bcpharmacists.org/list-community-pharmacies, you will see that tr.odd and tr.even will select all rows from the table that we want. Here’s some Julia code to select these rows.

using HTTP, Gumbo, Cascadia

# download website
r =  HTTP.get("http://www.bcpharmacists.org/list-community-pharmacies");  

# parse website into tree using Gumbo.jl
h = Gumbo.parsehtml(String(r.body));

# select elements by CSS with Cascadia
rows = eachmatch(Cascadia.Selector("tr.odd, tr.even"), h.root);
@show length(rows)
@show typeof(rows)
@show typeof(rows[1])
display(rows[1])
display(rows[1].children)

As you saw from the developer tools in your browser, html documents can be organized into a tree structure of nested tags. Gumbo.jl is a Julia package that stores html pages in exactly this sort of tree structure. From the output of the code, we can see that it successfully selected all 1000+ rows of the table of pharmacies. Each row is an HTMLElement as described in the documentation for Gumbo.jl. Each row has children corresponding to the 5 columns of the table. Now, we can extract the text from the 5 columns of each row and store it in a DataFrame.

using DataFrames
function parserow(row)
  fields = nodeText.(row.children)
  fields = reshape(fields, (1, length(fields)))
end
txt = vcat(parserow.(rows)...)
bc = DataFrame(txt, [:name, :address, :manager, :phone, :fax])
describe(bc)

Later we are going to assign these pharmacies into isolated local markets and match them with census data. We will do so based on addresses. For this, it will be useful to parse the addresses into a consistent format. We will divide the address field into street, city, zip, and province. The dominant tool for matching, splitting, and substituting text based on patterns is regular expressions. Regular expressions are nearly as old as computers themselves, and they remain an important and powerful tool for text processing.

Problem 1: familiarize yourself with regular expressions

This interactive tutorial is a good way to learn the basics of regular expressions. Work through it. There is nothing to turn in for this part, but there is a small task below.
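Before the address-parsing code below, here is a small, self-contained Julia example of the three operations we will rely on: matching, capture groups, and substitution. The address string is made up for illustration.

```julia
# A made-up address string for illustration
addr = "123 Main St\nVancouver, BC\nV6T 1Z4"

# match: find a BC-style postal code; \p{L} matches any letter, \d a digit
m = match(r"(\p{L}\d\p{L}) ?(\d\p{L}\d)", addr)
@show m.match     # the full matched text
@show m.captures  # the two capture groups

# replace: keep only the city, using a capture group in the substitution.
# The trailing s flag makes . match newlines too.
city = replace(addr, r".+\n(.+), BC.*\n.+"s => s"\1")
@show city
```

The `s"\1"` substitution string inserts whatever the first parenthesized group captured, which is the same pattern the address-splitting code below uses.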

Here is some Julia code using regular expressions to split addresses into their parts. See Julia’s documentation for more information.

bc[:street]   = (a->replace(a, r"(.+)\n.+, BC.+\n.+"s => s"\1")).(bc[:address])
bc[:city]     = (a->replace(a, r".+\n(.+), BC.+\n.+"s => s"\1")).(bc[:address])
bc[:zip]      = (a->replace(a, r".+(\p{L}\d\p{L}).?(\d\p{L}\d).*"s => s"\1 \2")).(bc[:address])
bc[:zipalt]   = (a->replace(a, r".+(\p{L}\d\p{L} \d\p{L}\d).*"s => s"\1")).(bc[:address])
bc[:province] = "BC"
describe(bc)

Compare :zip and :zipalt in the code above. What is the difference between the regular expressions used? Give an example of a pattern that one matches but the other does not. What is the problem with :zipalt?

Problem 2: parse an additional province

The function loadpharmacydata in PharmacyEntry/src/pharmacies.jl downloads and parses the lists of pharmacies in BC and Manitoba. Add code to download data from at least one additional province. Note that the functions loadBCdata and loadMBdata are not exported from the PharmacyEntry.jl module. To call non-exported functions from outside the module, you must preface them with the module name, e.g. bc = PharmacyEntry.loadBCdata(true). Either include your modified version of pharmacies.jl with what you turn in, or include the new function definitions in this notebook.

Part II: downloading census data

Problem 3: market definition

We will download information on area demographic and economic conditions from Statistics Canada. An important choice in entry models is deciding what geographic area should define a market. See e.g. Bresnahan and Reiss (1991) or Ellickson (2007). Statistics Canada provides data at various geographic levels. I somewhat arbitrarily chose to download data at the population centre level. Criticise or defend this choice. Optionally, modify census.jl to download census data at a different geographic level (the changes might be somewhat extensive).

Problem 4: choose census variables to extract

The census data contains a lot of information. For example, all the information for the Vancouver population centre is here. The documentation for the variables from StatCan is included in the assignment git repo as https://github.com/ECON567/PharmacyEntry/data/98-401-X2016048_English_meta.txt. The const vars defined at the start of PharmacyEntry/src/census.jl is a list of variables from the census data to extract for future use. Briefly review these and add or delete variables as you see fit. We want variables that are related to either the revenues or costs of pharmacies. It is probably better to err on the side of including too much instead of too little because you can always choose to ignore some variables later.

Part III: geocoding

We must now link pharmacies and population centres together based on location. Statistics Canada provides shapefiles giving the boundaries of population centres. The loadcensusdata function in census.jl records the latitude and longitude of the geographic center of each population centre. For pharmacies, we have their addresses. There are a number of online services that geocode addresses into latitude and longitude. I looked at a few of them and settled on using https://www.geocod.io/ for this project. The main reasons I chose this service are that its terms of service allow using and storing data generated from it in any way, and a free account allows enough lookups to complete this project. The function geocode! in geo.jl uses geocod.io to geocode the pharmacy addresses. If you want to run it (e.g. because you scraped pharmacies from additional provinces), you will need to get a free API key by signing up with https://www.geocod.io/. Save the key in a text file named geocodio.key and place it in PharmacyEntry/src/. Let me know if you have any problems or the free key is insufficient.

Geocoding is imperfect due to typos and errors in the address data, and possibly missing information in geocod.io’s map data. The function checklatlng! partially checks for errors by comparing the latitude and longitude from geocod.io with the boundaries of forward sortation areas (first 3 characters of zip codes). The plotmap function plots population centres and pharmacies. Pharmacies for which the latitude and longitude is not in the boundaries of the forward sortation area of its zip code are red. The others are green. For red pharmacies, the line goes from the pharmacy to the latitude and longitude at the center of its FSA. You can clearly see some pharmacies are misplaced, but other red ones might be okay.
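To build intuition for what checklatlng! is doing, here is a minimal, self-contained sketch of a point-in-region check. The real function uses the actual FSA boundary polygons from Statistics Canada; the rectangle and all coordinates here are made up for illustration.

```julia
# Hypothetical bounding box for an FSA (min/max longitude and latitude).
# The real check uses the full boundary polygon, not a rectangle.
struct BBox
    lngmin::Float64
    lngmax::Float64
    latmin::Float64
    latmax::Float64
end

# true if (lat, lng) falls inside the box
inbox(lat, lng, b::BBox) =
    (b.lngmin <= lng <= b.lngmax) && (b.latmin <= lat <= b.latmax)

# Made-up box roughly around Vancouver
box = BBox(-123.3, -122.9, 49.1, 49.4)
@show inbox(49.28, -123.12, box)  # a point inside the box
@show inbox(53.5, -113.5, box)    # a point far outside the box
```

A pharmacy whose geocoded point fails this sort of containment test against its own zip code's FSA is flagged (and plotted in red).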

using Statistics
pharm = loadpharmacydata()

reading pharmacy data from /home/paul/565/assignments/PharmacyEntry/data/pharmacies.csv

census = loadcensusdata()

checklatlng!(pharm, :lat, :lng, :zip)
@show mean(pharm[:zipmatch])

mean(pharm[:zipmatch]) = 0.6708369283865401

plotmap(census, pharm)
# Configuration problem with jupyter on vse.syzygy.ca makes PlotlyJS
# not display. Problem with Weave.jl also make PlotlyJS not display in
# html output ... :(
#
# Bonus problem: find a way to plot a map that works on vse.syzygy.ca

Problem 5: assign pharmacies to markets

Decide on some rules for assigning pharmacies to markets (population centres). Implement your choice and create a variable giving the number of pharmacies in each population centre. The function distance_m in geo.jl might be useful. Create tables and/or graphs summarizing the distribution of number of pharmacies similar to Table 2 and Figure 3 from Bresnahan and Reiss (1991).
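For reference, a great-circle distance like the one distance_m computes can be sketched with the haversine formula. This is a self-contained approximation (spherical earth, radius 6371 km), not necessarily identical to the implementation in geo.jl, whose argument order and units may differ.

```julia
# Haversine distance in meters between two points given as
# (longitude, latitude) in degrees, assuming a spherical earth.
function haversine_m(lng1, lat1, lng2, lat2)
    R = 6.371e6                       # earth radius in meters
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = φ2 - φ1
    Δλ = deg2rad(lng2 - lng1)
    a = sin(Δφ/2)^2 + cos(φ1)*cos(φ2)*sin(Δλ/2)^2
    2R * asin(sqrt(a))                # arc length on the sphere
end

# Vancouver to Victoria, roughly 95-100 km as the crow flies
d = haversine_m(-123.12, 49.28, -123.37, 48.43)
@show d / 1000  # distance in km
```

One natural assignment rule is to attach each pharmacy to the population centre minimizing this distance, possibly with a maximum-distance cutoff so that truly isolated pharmacies are dropped rather than attached to a far-away market.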

Problem 6: summary statistics and figures

Create additional summary statistics and figures.

This will be continued in a second notebook. Save your data frame to load it there.

import CSV

## For illustration purposes, here's a quick and dirty assignment of
## pharmacies to markets. I would not advise using this exactly.
subset = (x->x ∈ ["British Columbia",
                  "Manitoba",
                  "Manitoba/Saskatchewan",
                  "Newfoundland and Labrador",
                  "New Brunswick",
                  "New Brunswick/Quebec",
                  "Prince Edward Island"]).(census[:PROV_TERR_NAME_NOM] )
df = census[subset,:]
function closestPC(lat, lng; data=df)
  d = distance_m.(lng, lat, data[:lng], data[:lat])
  (dm, i) = findmin(d)
  data[:GEO_CODE][i]
end
lat = pharm[:lat]
lat[ismissing.(lat)] .= pharm[:ziplat][ismissing.(lat)]
lng = pharm[:lng]
lng[ismissing.(lng)] .= pharm[:ziplng][ismissing.(lng)]
pharm[:pc_geo_code] =  closestPC.(lat,lng)
df[:pharmacies] = (g->(sum(pharm[:pc_geo_code].==g))).(df[:GEO_CODE])

CSV.write("cleandata.csv",df);

Population centres assigned more than 10 pharmacies.

df[df[:pharmacies].>10,[:GEO_NAME, :pharmacies,  :PROV_TERR_NAME_NOM]]

44 rows × 3 columns

|    | GEO_NAME                  | pharmacies | PROV_TERR_NAME_NOM        |
|----|---------------------------|------------|---------------------------|
|    | String⍰                   | Int64      | String⍰                   |
| 1  | Bathurst                  | 11         | New Brunswick             |
| 2  | Bay Roberts               | 12         | Newfoundland and Labrador |
| 3  | Beausejour                | 12         | Manitoba                  |
| 4  | Brandon                   | 20         | Manitoba                  |
| 5  | Campbell River            | 14         | British Columbia          |
| 6  | Charlottetown             | 22         | Prince Edward Island      |
| 7  | Chilliwack                | 22         | British Columbia          |
| 8  | Clarenville-Shoal Harbour | 14         | Newfoundland and Labrador |
| 9  | Corner Brook              | 12         | Newfoundland and Labrador |
| 10 | Duncan                    | 12         | British Columbia          |
| 11 | Edmundston                | 25         | New Brunswick             |
| 12 | Fredericton               | 20         | New Brunswick             |
| 13 | Kamloops                  | 23         | British Columbia          |
| 14 | Kelowna                   | 38         | British Columbia          |
| 15 | Moncton                   | 35         | New Brunswick             |
| 16 | Nanaimo                   | 32         | British Columbia          |
| 17 | Parksville                | 13         | British Columbia          |
| 18 | Penticton                 | 16         | British Columbia          |
| 19 | Portage la Prairie        | 12         | Manitoba                  |
| 20 | Prince George             | 21         | British Columbia          |
| 21 | Prince Rupert             | 12         | British Columbia          |
| 22 | Saint John                | 22         | New Brunswick             |
| 23 | Selkirk                   | 11         | Manitoba                  |
| 24 | St. John’s                | 63         | Newfoundland and Labrador |
| 25 | Stony Mountain            | 42         | Manitoba                  |
| 26 | Summerside                | 13         | Prince Edward Island      |
| 27 | Vancouver                 | 192        | British Columbia          |
| 28 | Vernon                    | 22         | British Columbia          |
| 29 | Victoria                  | 98         | British Columbia          |
| 30 | White Rock                | 167        | British Columbia          |
| 31 | Winkler                   | 12         | Manitoba                  |
| 32 | Winnipeg                  | 156        | Manitoba                  |
| 33 | Abbotsford                | 44         | British Columbia          |
| 34 | Chatham - Douglastown     | 12         | New Brunswick             |
| 35 | Niverville                | 13         | Manitoba                  |
| 36 | Lions Bay                 | 29         | British Columbia          |
| 37 | Aldergrove                | 31         | British Columbia          |
| 38 | St. Adolphe               | 39         | Manitoba                  |
| 39 | McEwen                    | 17         | New Brunswick             |
| 40 | New Maryland              | 11         | New Brunswick             |
| 41 | Quispamsis - Rothesay     | 11         | New Brunswick             |
| 42 | Ile des Chênes            | 11         | Manitoba                  |
| 43 | Ladner                    | 287        | British Columbia          |
| 44 | Mission                   | 11         | British Columbia          |

Population centres with 0 pharmacies.

df[df[:pharmacies].==0,[:GEO_NAME, :pharmacies, :lat, :lng, :PROV_TERR_NAME_NOM]]

20 rows × 5 columns

|    | GEO_NAME             | pharmacies | lat     | lng      | PROV_TERR_NAME_NOM        |
|----|----------------------|------------|---------|----------|---------------------------|
|    | String⍰              | Int64      | Float64⍰ | Float64⍰ | String⍰                   |
| 1  | Cumberland           | 0          | 49.6242 | -125.03  | British Columbia          |
| 2  | Fortune              | 0          | 47.0722 | -55.8248 | Newfoundland and Labrador |
| 3  | Killarney            | 0          | 49.1806 | -99.6663 | Manitoba                  |
| 4  | Logan Lake           | 0          | 50.4904 | -120.812 | British Columbia          |
| 5  | Sicamous             | 0          | 50.8233 | -118.98  | British Columbia          |
| 6  | Oakbank              | 0          | 49.9375 | -96.8424 | Manitoba                  |
| 7  | Lorette              | 0          | 49.7411 | -96.8684 | Manitoba                  |
| 8  | Crofton              | 0          | 48.8712 | -123.651 | British Columbia          |
| 9  | Roberts Creek        | 0          | 49.4252 | -123.644 | British Columbia          |
| 10 | Harrison Hot Springs | 0          | 49.2962 | -121.781 | British Columbia          |
| 11 | Mitchell             | 0          | 49.5335 | -96.7636 | Manitoba                  |
| 12 | St. Theresa Point    | 0          | 53.8275 | -94.8511 | Manitoba                  |
| 13 | Grunthal             | 0          | 49.4065 | -96.8603 | Manitoba                  |
| 14 | Blumenort            | 0          | 49.6033 | -96.7006 | Manitoba                  |
| 15 | Landmark             | 0          | 49.6714 | -96.8177 | Manitoba                  |
| 16 | St-Pierre-Jolys      | 0          | 49.4401 | -96.9867 | Manitoba                  |
| 17 | Canoe                | 0          | 50.7519 | -119.235 | British Columbia          |
| 18 | Naramata             | 0          | 49.5929 | -119.583 | British Columbia          |
| 19 | Telkwa               | 0          | 54.6893 | -127.058 | British Columbia          |
| 20 | Cedar                | 0          | 49.1069 | -123.856 | British Columbia          |

Bresnahan, Timothy F., and Peter C. Reiss. 1991. “Entry and Competition in Concentrated Markets.” Journal of Political Economy 99 (5): pp. 977–1009. http://www.jstor.org/stable/2937655.

Ellickson, P.B. 2007. “Does Sutton Apply to Supermarkets?” The RAND Journal of Economics 38 (1): 43–59. http://onlinelibrary.wiley.com/doi/10.1111/j.1756-2171.2007.tb00043.x/abstract.