Assignment: pharmacy entry

Part I-III: data preparation

Author: Paul Schrimpf
Date: 2019-03-04

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

About this document

This document was created using Weave.jl. The code is available on GitHub. The same document generates both static webpages and associated Jupyter notebooks.

\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]

Introduction

This assignment will estimate a model of pharmacy entry inspired by Bresnahan and Reiss (1991).

Getting started

https://vse.syzygy.ca provides a convenient browser-based interface to Julia. Open it and log in. This assignment is in a git repository at https://github.com/UBCECON567/PharmacyEntry. Start by cloning the git repository to your syzygy directory. Open a terminal in syzygy (File -> New -> Terminal). This will open a Linux shell in your browser. To clone the git repository, enter

git clone https://github.com/UBCECON567/PharmacyEntry

This will create a directory called PharmacyEntry containing all the files related to this assignment.

Clicking on the folder icon near the top left of the screen opens a file browser panel. Use it to open the PharmacyEntry/notebooks folder. You can complete this assignment by modifying the pharmacyentry.ipynb notebook. I recommend creating a copy of this notebook, and then working on the copy. You can create a copy by right clicking in the file browser panel. Now open your copy of the notebook.

There might be some problems with previously installed Julia packages and the upgrade to vse.syzygy.ca. If you can no longer run Julia notebooks and are seeing “No Kernel!” in the top right of the screen, you can try removing your previously installed Julia packages. Open a terminal and type “rm -rf ~/.julia”. Afterward, you will have to re-add any packages you need.

Part I: scraping pharmacy data

As in the previous assignment, we begin by loading required packages.

using Pkg 
Pkg.activate("..") 
#Pkg.resolve()
#Pkg.instantiate()
#
# If you are missing packages, you may have to uncomment one of the
# two above lines. However, they seem to cause mysterious problems with
# HTTP.get() (especially with https sites) later. If you find
# HTTP.get() is hanging, then (1) run this cell with Pkg.instantiate()
# once, (2) comment out Pkg.instantiate() (3) re-start the kernel and
# re-run everything. 
#

using Revise
if !("../src" ∈ LOAD_PATH)
  push!(LOAD_PATH, "../src")
end
using PharmacyEntry

Part of the appeal of entry models is that they can be estimated using readily available data. Bresnahan and Reiss (1991) used data on market demographic information and the number of firms within a category in each market. We can easily gather similar information. We will focus on pharmacies. We chose pharmacies because it was one of the industries included in Bresnahan and Reiss (1991), it fits with the theme of focusing on the industrial organization of health related industries, and a list of pharmacies in Canada is available. Pharmacies in Canada are provincially regulated. A list with the website of each provincial regulator can be found here.

Each provincial regulator provides a list of pharmacies with addresses. These can be downloaded and saved as DataFrames. Doing so is slightly tedious though because each provincial regulator’s website is different, so different code will be needed to parse each.

Let’s look at community pharmacies in BC, http://www.bcpharmacists.org/list-community-pharmacies. This list is already nicely formatted as a table. We could simply copy and paste its contents into a spreadsheet, save as csv, and then load into Julia. That would be an okay method, but copying and pasting is slightly more error prone (or at least prone to undocumented and unreproducible errors) than writing code to parse the website. More importantly, other provinces have websites that are not so nicely formatted and will require code to parse. We will use HTTP.jl to request webpages, Cascadia.jl to select by CSS paths, and regular expressions to find patterns in text. Although our code will be in Julia, CSS selectors and regular expressions are common and useful tools for scraping webpages with any programming language.

To scrape websites well, you need some understanding of html, CSS, and sometimes javascript. https://www.w3schools.com/ is a pretty comprehensive reference. Here is a very short explanation. CSS stands for “cascading style sheets.” CSS is used to control the formatting of webpages. html documents consist of a bunch of tagged elements (headlines, links, tables, paragraphs, divs, etc) with optional class specifications. CSS files contain information about how to display each combination of tag and class. The combination(s) of tags and classes to which a given display style (font, color, etc) applies is called a CSS selector. Each part of a webpage that looks different will have a different CSS selector. If we want to scrape some particular information off a webpage, CSS selectors are usually a good way to pick out the pieces we need.

Returning to http://www.bcpharmacists.org/list-community-pharmacies, we can see that the pharmacy names and addresses are in table rows with alternating background shades of gray. There will be CSS selectors to pick out these rows. Most web browsers include tools to create css selectors. In either Firefox or Chrome (Safari and Edge likely have similar behavior, but I have not checked), if you right click anywhere on a website, there is an “Inspect Element” option (you possibly need to enable developer tools in the web browser options first). This will open the developer toolbar. In the toolbar there will be a collapsible tree listing the nested html tags on the website. As you hover over different tags, the corresponding part of the website will be highlighted. If you right click a tag, there will be an option to copy the corresponding CSS selector. Doing this on http://www.bcpharmacists.org/list-community-pharmacies, you will see that tr.odd and tr.even will select all rows from the table that we want. Here’s some Julia code to select these rows.

using HTTP, Gumbo, Cascadia

# download website
r =  HTTP.get("http://www.bcpharmacists.org/list-community-pharmacies");  

# parse website into tree using Gumbo.jl
h = Gumbo.parsehtml(String(r.body));

# select elements by CSS with Cascadia
rows = eachmatch(Cascadia.Selector("tr.odd, tr.even"), h.root);
@show length(rows)
@show typeof(rows)
@show typeof(rows[1])
display(rows[1])
display(rows[1].children)

As you saw from the developer tools in your browser, html documents can be organized into a tree structure of nested tags. Gumbo.jl is a Julia package that stores html pages in exactly this sort of tree structure. From the output of the code, we can see that it successfully selected all 1000+ rows of the table of pharmacies. Each row is an HTMLElement as described in the documentation for Gumbo.jl. Each row has children corresponding to the 5 columns of the table. Now, we can extract the text from the 5 columns of each row and store it in a DataFrame.

using DataFrames
function parserow(row)
  fields = nodeText.(row.children)
  fields = reshape(fields, (1, length(fields)))
end
txt = vcat(parserow.(rows)...)
bc = DataFrame(txt, [:name, :address, :manager, :phone, :fax])
describe(bc)

Later we are going to assign these pharmacies into isolated local markets and match them with census data. We will do so based on addresses. For this, it will be useful to parse the addresses into a consistent format. We will divide the address field into street, city, zip, and province. The dominant tool for matching, splitting, and substituting text based on patterns is regular expressions. Regular expressions are nearly as old as computers themselves, and they remain an important and powerful tool for text processing.

Problem 1: familiarize yourself with regular expressions

This interactive tutorial is a good way to learn the basics of regular expressions. Work through it. There is nothing to turn in for this part, but there is a small task below.
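Before the address-parsing code below, here is a small, self-contained Julia example of the three operations we will rely on: matching, capture groups, and substitution. The address string is made up for illustration.

```julia
# A made-up address string for illustration
addr = "123 Main St\nVancouver, BC\nV6T 1Z4"

# match: find a BC-style postal code; \p{L} matches any letter, \d a digit
m = match(r"(\p{L}\d\p{L}) ?(\d\p{L}\d)", addr)
@show m.match     # the full matched text
@show m.captures  # the two capture groups

# replace: keep only the city, using a capture group in the substitution.
# The trailing s flag makes . match newlines too.
city = replace(addr, r".+\n(.+), BC.*\n.+"s => s"\1")
@show city
```

The `s"\1"` substitution string inserts whatever the first parenthesized group captured, which is the same pattern the address-splitting code below uses.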

Here is some Julia code using regular expressions to split addresses into their parts. See Julia’s documentation for more information.

bc[:street]   = (a->replace(a, r"(.+)\n.+, BC.+\n.+"s => s"\1")).(bc[:address])
bc[:city]     = (a->replace(a, r".+\n(.+), BC.+\n.+"s => s"\1")).(bc[:address])
bc[:zip]      = (a->replace(a, r".+(\p{L}\d\p{L}).?(\d\p{L}\d).*"s => s"\1 \2")).(bc[:address])
bc[:zipalt]   = (a->replace(a, r".+(\p{L}\d\p{L} \d\p{L}\d).*"s => s"\1")).(bc[:address])
bc[:province] = "BC"
describe(bc)

Compare :zip and :zipalt in the code above. What is the difference between the regular expressions used? Give an example of a pattern that one matches but the other does not. What is the problem with :zipalt?

Problem 2: parse an additional province

The function loadpharmacydata in PharmacyEntry/src/pharmacies.jl downloads and parses the lists of pharmacies in BC and Manitoba. Add code to download data from at least one additional province. Note that the functions loadBCdata and loadMBdata are not exported from the PharmacyEntry.jl module. To call non-exported functions from outside the module, you must preface them with the module name, e.g. bc = PharmacyEntry.loadBCdata(true). Either include your modified version of pharmacies.jl with what you turn in, or include the new function definitions in this notebook.

Part II: downloading census data

Problem 3: market definition

We will download information on area demographic and economic conditions from Statistics Canada. An important choice in entry models is deciding what geographic area should define a market. See e.g. Bresnahan and Reiss (1991) or Ellickson (2007). Statistics Canada provides data at various geographic levels. I somewhat arbitrarily chose to download data at the population centre level. Criticise or defend this choice. Optionally, modify census.jl to download census data at a different geographic level (the changes might be somewhat extensive).

Problem 4: choose census variables to extract

The census data contains a lot of information. For example, all the information for the Vancouver population centre is here. The documentation for the variables from StatCan is included in the assignment git repo as https://github.com/ECON567/PharmacyEntry/data/98-401-X2016048_English_meta.txt. The const vars defined at the start of PharmacyEntry/src/census.jl is a list of variables from the census data to extract for future use. Briefly review these and add or delete variables as you see fit. We want variables that are related to either the revenues or costs of pharmacies. It is probably better to err on the side of including too much instead of too little because you can always choose to ignore some variables later.

Part III: geocoding

We must now link pharmacies and population centres together based on location. Statistics Canada provides shapefiles giving the boundaries of population centres. The loadcensusdata function in census.jl records the latitude and longitude of the geographic center of each population centre. For pharmacies, we have their addresses. There are a number of online services that geocode addresses into latitude and longitude. I looked at a few of them and settled on using https://www.geocod.io/ for this project. The main reasons I chose this service are that its terms of service allow using and storing data generated from it in any way, and a free account allows enough lookups to complete this project. The function geocode! in geo.jl uses geocod.io to geocode the pharmacy addresses. If you want to run it (e.g. because you scraped pharmacies from additional provinces), you will need to get a free API key by signing up with https://www.geocod.io/. Save the key in a text file named geocodio.key and place it in PharmacyEntry/src/. Let me know if you have any problems or the free key is insufficient.

Geocoding is imperfect due to typos and errors in the address data, and possibly missing information in geocod.io’s map data. The function checklatlng! partially checks for errors by comparing the latitude and longitude from geocod.io with the boundaries of forward sortation areas (first 3 characters of zip codes). The plotmap function plots population centres and pharmacies. Pharmacies for which the latitude and longitude is not in the boundaries of the forward sortation area of its zip code are red. The others are green. For red pharmacies, the line goes from the pharmacy to the latitude and longitude at the center of its FSA. You can clearly see some pharmacies are misplaced, but other red ones might be okay.
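To build intuition for what checklatlng! is doing, here is a minimal, self-contained sketch of a point-in-region check. The real function uses the actual FSA boundary polygons from Statistics Canada; the rectangle and all coordinates here are made up for illustration.

```julia
# Hypothetical bounding box for an FSA (min/max longitude and latitude).
# The real check uses the full boundary polygon, not a rectangle.
struct BBox
    lngmin::Float64
    lngmax::Float64
    latmin::Float64
    latmax::Float64
end

# true if (lat, lng) falls inside the box
inbox(lat, lng, b::BBox) =
    (b.lngmin <= lng <= b.lngmax) && (b.latmin <= lat <= b.latmax)

# Made-up box roughly around Vancouver
box = BBox(-123.3, -122.9, 49.1, 49.4)
@show inbox(49.28, -123.12, box)  # a point inside the box
@show inbox(53.5, -113.5, box)    # a point far outside the box
```

A pharmacy whose geocoded point fails this sort of containment test against its own zip code's FSA is flagged (and plotted in red).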

using Statistics
pharm = loadpharmacydata()

reading pharmacy data from /home/paul/565/assignments/PharmacyEntry/data/pharmacies.csv

census = loadcensusdata()

checklatlng!(pharm, :lat, :lng, :zip)
@show mean(pharm[:zipmatch])

mean(pharm[:zipmatch]) = 0.6708369283865401

plotmap(census, pharm)
# Configuration problem with jupyter on vse.syzygy.ca makes PlotlyJS
# not display. Problem with Weave.jl also make PlotlyJS not display in
# html output ... :(
#
# Bonus problem: find a way to plot a map that works on vse.syzygy.ca

Problem 5: assign pharmacies to markets

Decide on some rules for assigning pharmacies to markets (population centres). Implement your choice and create a variable giving the number of pharmacies in each population centre. The function distance_m in geo.jl might be useful. Create tables and/or graphs summarizing the distribution of number of pharmacies similar to Table 2 and Figure 3 from Bresnahan and Reiss (1991).
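For reference, a great-circle distance like the one distance_m computes can be sketched with the haversine formula. This is a self-contained approximation (spherical earth, radius 6371 km), not necessarily identical to the implementation in geo.jl, whose argument order and units may differ.

```julia
# Haversine distance in meters between two points given as
# (longitude, latitude) in degrees, assuming a spherical earth.
function haversine_m(lng1, lat1, lng2, lat2)
    R = 6.371e6                       # earth radius in meters
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = φ2 - φ1
    Δλ = deg2rad(lng2 - lng1)
    a = sin(Δφ/2)^2 + cos(φ1)*cos(φ2)*sin(Δλ/2)^2
    2R * asin(sqrt(a))                # arc length on the sphere
end

# Vancouver to Victoria, roughly 95-100 km as the crow flies
d = haversine_m(-123.12, 49.28, -123.37, 48.43)
@show d / 1000  # distance in km
```

One natural assignment rule is to attach each pharmacy to the population centre minimizing this distance, possibly with a maximum-distance cutoff so that truly isolated pharmacies are dropped rather than attached to a far-away market.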

Problem 6: summary statistics and figures

Create additional summary statistics and figures.

This will be continued in a second notebook. Save your data frame to load it there.

import CSV

## For illustration purposes, here's a quick and dirty assignment of
## pharmacies to markets. I would not advise using this exactly.
subset = (x->x ∈ ["British Columbia",
                  "Manitoba",
                  "Manitoba/Saskatchewan",
                  "Newfoundland and Labrador",
                  "New Brunswick",
                  "New Brunswick/Quebec",
                  "Prince Edward Island"]).(census[:PROV_TERR_NAME_NOM] )
df = census[subset,:]
function closestPC(lat, lng; data=df)
  d = distance_m.(lng, lat, data[:lng], data[:lat])
  (dm, i) = findmin(d)
  data[:GEO_CODE][i]
end
lat = pharm[:lat]
lat[ismissing.(lat)] .= pharm[:ziplat][ismissing.(lat)]
lng = pharm[:lng]
lng[ismissing.(lng)] .= pharm[:ziplng][ismissing.(lng)]
pharm[:pc_geo_code] =  closestPC.(lat,lng)
df[:pharmacies] = (g->(sum(pharm[:pc_geo_code].==g))).(df[:GEO_CODE])

CSV.write("cleandata.csv",df);

Population centres assigned more than 10 pharmacies.

df[df[:pharmacies].>10,[:GEO_NAME, :pharmacies,  :PROV_TERR_NAME_NOM]]

44 rows × 3 columns

|    | GEO_NAME                  | pharmacies | PROV_TERR_NAME_NOM        |
|----|---------------------------|------------|---------------------------|
|    | String⍰                   | Int64      | String⍰                   |
| 1  | Bathurst                  | 11         | New Brunswick             |
| 2  | Bay Roberts               | 12         | Newfoundland and Labrador |
| 3  | Beausejour                | 12         | Manitoba                  |
| 4  | Brandon                   | 20         | Manitoba                  |
| 5  | Campbell River            | 14         | British Columbia          |
| 6  | Charlottetown             | 22         | Prince Edward Island      |
| 7  | Chilliwack                | 22         | British Columbia          |
| 8  | Clarenville-Shoal Harbour | 14         | Newfoundland and Labrador |
| 9  | Corner Brook              | 12         | Newfoundland and Labrador |
| 10 | Duncan                    | 12         | British Columbia          |
| 11 | Edmundston                | 25         | New Brunswick             |
| 12 | Fredericton               | 20         | New Brunswick             |
| 13 | Kamloops                  | 23         | British Columbia          |
| 14 | Kelowna                   | 38         | British Columbia          |
| 15 | Moncton                   | 35         | New Brunswick             |
| 16 | Nanaimo                   | 32         | British Columbia          |
| 17 | Parksville                | 13         | British Columbia          |
| 18 | Penticton                 | 16         | British Columbia          |
| 19 | Portage la Prairie        | 12         | Manitoba                  |
| 20 | Prince George             | 21         | British Columbia          |
| 21 | Prince Rupert             | 12         | British Columbia          |
| 22 | Saint John                | 22         | New Brunswick             |
| 23 | Selkirk                   | 11         | Manitoba                  |
| 24 | St. John’s                | 63         | Newfoundland and Labrador |
| 25 | Stony Mountain            | 42         | Manitoba                  |
| 26 | Summerside                | 13         | Prince Edward Island      |
| 27 | Vancouver                 | 192        | British Columbia          |
| 28 | Vernon                    | 22         | British Columbia          |
| 29 | Victoria                  | 98         | British Columbia          |
| 30 | White Rock                | 167        | British Columbia          |
| 31 | Winkler                   | 12         | Manitoba                  |
| 32 | Winnipeg                  | 156        | Manitoba                  |
| 33 | Abbotsford                | 44         | British Columbia          |
| 34 | Chatham - Douglastown     | 12         | New Brunswick             |
| 35 | Niverville                | 13         | Manitoba                  |
| 36 | Lions Bay                 | 29         | British Columbia          |
| 37 | Aldergrove                | 31         | British Columbia          |
| 38 | St. Adolphe               | 39         | Manitoba                  |
| 39 | McEwen                    | 17         | New Brunswick             |
| 40 | New Maryland              | 11         | New Brunswick             |
| 41 | Quispamsis - Rothesay     | 11         | New Brunswick             |
| 42 | Ile des Chênes            | 11         | Manitoba                  |
| 43 | Ladner                    | 287        | British Columbia          |
| 44 | Mission                   | 11         | British Columbia          |

Population centres with 0 pharmacies.

df[df[:pharmacies].==0,[:GEO_NAME, :pharmacies, :lat, :lng, :PROV_TERR_NAME_NOM]]

20 rows × 5 columns

|    | GEO_NAME             | pharmacies | lat     | lng      | PROV_TERR_NAME_NOM        |
|----|----------------------|------------|---------|----------|---------------------------|
|    | String⍰              | Int64      | Float64⍰ | Float64⍰ | String⍰                   |
| 1  | Cumberland           | 0          | 49.6242 | -125.03  | British Columbia          |
| 2  | Fortune              | 0          | 47.0722 | -55.8248 | Newfoundland and Labrador |
| 3  | Killarney            | 0          | 49.1806 | -99.6663 | Manitoba                  |
| 4  | Logan Lake           | 0          | 50.4904 | -120.812 | British Columbia          |
| 5  | Sicamous             | 0          | 50.8233 | -118.98  | British Columbia          |
| 6  | Oakbank              | 0          | 49.9375 | -96.8424 | Manitoba                  |
| 7  | Lorette              | 0          | 49.7411 | -96.8684 | Manitoba                  |
| 8  | Crofton              | 0          | 48.8712 | -123.651 | British Columbia          |
| 9  | Roberts Creek        | 0          | 49.4252 | -123.644 | British Columbia          |
| 10 | Harrison Hot Springs | 0          | 49.2962 | -121.781 | British Columbia          |
| 11 | Mitchell             | 0          | 49.5335 | -96.7636 | Manitoba                  |
| 12 | St. Theresa Point    | 0          | 53.8275 | -94.8511 | Manitoba                  |
| 13 | Grunthal             | 0          | 49.4065 | -96.8603 | Manitoba                  |
| 14 | Blumenort            | 0          | 49.6033 | -96.7006 | Manitoba                  |
| 15 | Landmark             | 0          | 49.6714 | -96.8177 | Manitoba                  |
| 16 | St-Pierre-Jolys      | 0          | 49.4401 | -96.9867 | Manitoba                  |
| 17 | Canoe                | 0          | 50.7519 | -119.235 | British Columbia          |
| 18 | Naramata             | 0          | 49.5929 | -119.583 | British Columbia          |
| 19 | Telkwa               | 0          | 54.6893 | -127.058 | British Columbia          |
| 20 | Cedar                | 0          | 49.1069 | -123.856 | British Columbia          |

Bresnahan, Timothy F., and Peter C. Reiss. 1991. “Entry and Competition in Concentrated Markets.” Journal of Political Economy 99 (5): pp. 977–1009. http://www.jstor.org/stable/2937655.

Ellickson, P.B. 2007. “Does Sutton Apply to Supermarkets?” The RAND Journal of Economics 38 (1): 43–59. http://onlinelibrary.wiley.com/doi/10.1111/j.1756-2171.2007.tb00043.x/abstract.