Author:
Paul Schrimpf
Date: 2019-03-04
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This document was created using Weave.jl. The code is available on github. The same document generates both static webpages and associated jupyter notebooks.
\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]
This assignment will estimate a model of pharmacy entry inspired by Bresnahan and Reiss (1991).
https://vse.syzygy.ca provides a convenient browser based interface to Julia. Open it and log in. This assignment is in a git repository at https://github.com/UBCECON567/PharmacyEntry. Start by cloning the git repository to your syzygy directory. Open a terminal in syzygy (File -> New -> Terminal). This will open a Linux shell in your browser. To clone the git repository, enter
git clone https://github.com/UBCECON567/PharmacyEntry
This will create a directory called PharmacyEntry
containing all the files related to this assignment.
Clicking on the folder icon near the top left of the screen opens a file browser panel. Use it to open the PharmacyEntry/notebooks
folder. You can complete this assignment by modifying the pharmacyentry.ipynb
notebook. I recommend creating a copy of this notebook, and then working on the copy. You can create a copy by right clicking in the file browser panel. Now open your copy of the notebook.
There might be some problems with previously installed Julia packages after the upgrade to vse.syzygy.ca. If you can no longer run Julia notebooks, and are seeing “No Kernel!” in the top right of the screen, try removing your previously installed Julia packages: open a terminal and type “rm -rf ~/.julia”. Afterward, you will have to re-add any packages you need.
As in the previous assignment, we begin by loading required packages.
```julia
using Pkg
Pkg.activate("..")
#Pkg.resolve()
#Pkg.instantiate()
# If you are missing packages, you may have to uncomment one of the
# two lines above. However, they seem to cause mysterious problems with
# HTTP.get() (especially with https sites) later. If you find
# HTTP.get() is hanging, then (1) run this cell with Pkg.instantiate()
# once, (2) comment out Pkg.instantiate(), (3) re-start the kernel and
# re-run everything.
using Revise
if (!("../src" ∈ LOAD_PATH))
    push!(LOAD_PATH, "../src")
end
using PharmacyEntry
```
Part of the appeal of entry models is that they can be estimated using readily available data. Bresnahan and Reiss (1991) used data on market demographic information and the number of firms within a category in each market. We can easily gather similar information. We will focus on pharmacies. We chose pharmacies because it was one of the industries included in Bresnahan and Reiss (1991), it fits with the theme of focusing on the industrial organization of health related industries, and a list of pharmacies in Canada is available. Pharmacies in Canada are provincially regulated. A list with the website of each provincial regulator can be found here.
Each provincial regulator provides a list of pharmacies with addresses. These can be downloaded and saved as DataFrames. Doing so is slightly tedious though because each provincial regulator’s website is different, so different code will be needed to parse each.
Let’s look at community pharmacies in BC, http://www.bcpharmacists.org/list-community-pharmacies. This list is already nicely formatted as a table. We could simply copy and paste its contents into a spreadsheet, save as csv, and then load into Julia. That would be an okay method, but copying and pasting is slightly more error prone (or at least prone to undocumented and unreproducible errors) than writing code to parse the website. More importantly, other provinces have websites that are not so nicely formatted and will require code to parse. We will use HTTP.jl to request webpages, Cascadia.jl to select by CSS paths, and regular expressions to find patterns in text. Although our code will be in Julia, CSS selectors and regular expressions are common and useful tools for scraping webpages with any programming language.
To scrape websites well, you need some understanding of html, CSS, and sometimes javascript. https://www.w3schools.com/ is a pretty comprehensive reference. Here is a very short explanation. CSS stands for “cascading style sheets.” CSS is used to control the formatting of webpages. html documents consist of a bunch of tagged elements (headlines, links, tables, paragraphs, divs, etc.) with optional class specifications. CSS files contain information about how to display each combination of tag and class. The combination(s) of tags and classes to which a given display style (font, color, etc.) applies is called a CSS selector. Each part of a webpage that looks different will have a different CSS selector. If we want to scrape particular information off a webpage, CSS selectors are usually a good way to pick out the pieces we need.
Returning to http://www.bcpharmacists.org/list-community-pharmacies, we can see that the pharmacy names and addresses are in table rows with alternating background shades of gray. There will be CSS selectors that pick out these rows. Most web browsers include tools to create CSS selectors. In either Firefox or Chrome (Safari and Edge likely have similar behavior, but I have not checked), if you right click anywhere on a website, there is an “Inspect Element” option (you may need to enable developer tools in the browser options first). This will open the developer toolbar. In the toolbar there will be a collapsible tree listing the nested html tags on the website. As you hover over different tags, the corresponding part of the website will be highlighted. If you right click a tag, there will be an option to copy the corresponding CSS selector. Doing this on http://www.bcpharmacists.org/list-community-pharmacies, you will see that tr.odd and tr.even will select all the rows of the table that we want. Here’s some Julia code to select these rows.
```julia
using HTTP, Gumbo, Cascadia
# download website
r = HTTP.get("http://www.bcpharmacists.org/list-community-pharmacies");
# parse website into tree using Gumbo.jl
h = Gumbo.parsehtml(String(r.body));
# select elements by CSS with Cascadia
rows = eachmatch(Cascadia.Selector("tr.odd, tr.even"), h.root);
@show length(rows)
@show typeof(rows)
@show typeof(rows[1])
display(rows[1])
display(rows[1].children)
```
As you saw from the developer tools in your browser, html documents can be organized into a tree structure of nested tags. Gumbo.jl is a Julia package that stores html pages in exactly this sort of tree structure. From the output of the code, we can see that it successfully selected all 1000+ rows of the table of pharmacies. Each row is an HTMLElement as described in the documentation for Gumbo.jl. Each row has children corresponding to the 5 columns of the table. Now, we can extract the text from the 5 columns of each row and store it in a DataFrame.
```julia
using DataFrames
function parserow(row)
    fields = nodeText.(row.children)
    fields = reshape(fields, (1, length(fields)))
end
txt = vcat(parserow.(rows)...)
bc = DataFrame(txt, [:name, :address, :manager, :phone, :fax])
describe(bc)
```
Later we are going to assign these pharmacies to isolated local markets and match them with census data. We will do so based on addresses. For this, it will be useful to parse the addresses into a consistent format. We will divide the address field into street, city, zip, and province. The dominant tool for matching, splitting, and substituting text based on patterns is regular expressions. Regular expressions are nearly as old as computers themselves, and they remain an important and powerful tool for text processing.
This interactive tutorial is a good way to learn the basics of regular expressions. Work through it. There is nothing to turn in for this part, but there is a small task below.
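As a warm-up, here is a small illustration of the basic regular expression operations in Julia (the phone number and postal code below are made up):

```julia
# Find a pattern and extract the captured groups
m = match(r"(\d{3})-(\d{4})", "phone: 604-555-1234")
m.captures                     # ["555", "1234"]

# Test whether a string contains a pattern
occursin(r"^\p{L}\d\p{L}", "V6T 1Z4")   # true

# Substitute using captured groups: insert the missing space in a postal code
replace("V6T1Z4", r"(\p{L}\d\p{L})(\d\p{L}\d)" => s"\1 \2")  # "V6T 1Z4"
```

The `s"..."` substitution string refers back to the parenthesized capture groups by number; the same mechanism is used to split the addresses below.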
Here is some Julia code using regular expressions to split addresses into their parts. See Julia’s documentation for more information.
```julia
bc[:street]   = (a->replace(a, r"(.+)\n.+, BC.+\n.+"s => s"\1")).(bc[:address])
bc[:city]     = (a->replace(a, r".+\n(.+), BC.+\n.+"s => s"\1")).(bc[:address])
bc[:zip]      = (a->replace(a, r".+(\p{L}\d\p{L}).?(\d\p{L}\d).*"s => s"\1 \2")).(bc[:address])
bc[:zipalt]   = (a->replace(a, r".+(\p{L}\d\p{L} \d\p{L}\d).*"s => s"\1")).(bc[:address])
bc[:province] = "BC"
describe(bc)
```
Compare :zip and :zipalt in the code above. What is the difference between the regular expressions used? Give an example of a string that one matches, but the other does not. What is the problem with :zipalt?
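To experiment, you can apply both regular expressions to made-up address strings and compare the results (the address below is fabricated for illustration):

```julia
# A fabricated address in the same format as the BC data
addr = "123 Main St\nVancouver, BC V6T 1Z4\nCanada"

# the pattern used for :zip
zip1 = replace(addr, r".+(\p{L}\d\p{L}).?(\d\p{L}\d).*"s => s"\1 \2")
# the pattern used for :zipalt
zip2 = replace(addr, r".+(\p{L}\d\p{L} \d\p{L}\d).*"s => s"\1")

zip1, zip2   # identical for this address
```

Try varying the formatting of the postal code in `addr` to see when the two patterns disagree.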
The function loadpharmacydata
in PharmacyEntry/src/pharmacies.jl
downloads and parses the lists of pharmacies in BC and Manitoba. Add code to download data from at least one additional province. Note that the functions loadBCdata
and loadMBdata
are not exported from the PharmacyEntry.jl
module. To call non-exported functions from outside the module you must preface with the module name, e.g. bc = PharmacyEntry.loadBCdata(true)
. Either include your modified version of pharmacies.jl with what you turn in, or include the new function definitions in this notebook.
We will download information on area demographic and economic conditions from Statistics Canada. An important choice in entry models is deciding what geographic area should define a market. See e.g. Bresnahan and Reiss (1991) or Ellickson (2007). Statistics Canada provides data at various geographic levels. I somewhat arbitrarily chose to download data at the population centre level. Criticise or defend this choice. Optionally, modify census.jl
to download census data at a different geographic level (the changes might be somewhat extensive).
The census data contains a lot of information. For example, all the information for the Vancouver population centre is here. The documentation for the variables from StatCan is included in the assignment git repo as https://github.com/ECON567/PharmacyEntry/data/98-401-X2016048_English_meta.txt. The const vars
defined at the start of PharmacyEntry/src/census.jl
is a list of variables from the census data to extract for future use. Briefly review these and add or delete variables as you see fit. We want variables that are related to either the revenues or costs of pharmacies. It is probably better to err on the side of including too much rather than too little, because you can always choose to ignore some variables later.
We must now link pharmacies and population centres together based on location. Statistics Canada provides shapefiles giving the boundaries of population centres. The loadcensusdata
function in census.jl
records the latitude and longitude of the geographic center of each population centre. For pharmacies, we have their addresses. There are a number of online services that geocode addresses into latitude and longitude. I looked at a few of them and settled on using https://www.geocod.io/ for this project. The main reasons I chose this service are that its terms of service allow using and storing data generated from it in any way, and a free account allows enough lookups to complete this project. The function geocode!
in geo.jl
uses geocod.io to geocode the pharmacy addresses. If you want to run it (e.g. because you scraped pharmacies from additional provinces), you will need to get a free API key by signing up with https://www.geocod.io/. Save the key in a text file named geocodio.key
and place it in PharmacyEntry/src/
. Let me know if you have any problems or the free key is insufficient.
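To get a feel for what geocode! has to do with each lookup, here is a rough sketch of extracting coordinates from a geocod.io-style JSON response body. The helper name parse_geocodio and the abbreviated response string are made up for illustration; a real implementation should use a JSON parser such as JSON.jl, and the field names should be checked against geocod.io's API documentation.

```julia
# Hypothetical helper: pull "lat" and "lng" fields out of a geocod.io-style
# JSON response body using regular expressions.
function parse_geocodio(body::AbstractString)
    lat = parse(Float64, match(r"\"lat\":(-?[\d.]+)", body).captures[1])
    lng = parse(Float64, match(r"\"lng\":(-?[\d.]+)", body).captures[1])
    (lat = lat, lng = lng)
end

# abbreviated, made-up response with fabricated coordinates
body = """{"results":[{"location":{"lat":49.2827,"lng":-123.1207}}]}"""
parse_geocodio(body)     # (lat = 49.2827, lng = -123.1207)
```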
Geocoding is imperfect due to typos and errors in the address data, and possibly missing information in geocod.io’s map data. The function checklatlng!
partially checks for errors by comparing the latitude and longitude from geocod.io with the boundaries of forward sortation areas (first 3 characters of zip codes). The plotmap
function plots population centres and pharmacies. Pharmacies whose latitude and longitude fall outside the boundaries of the forward sortation area of their zip code are shown in red; the others are green. For each red pharmacy, a line is drawn from the pharmacy to the latitude and longitude of the center of its FSA. You can clearly see that some pharmacies are misplaced, but other red ones might be okay.
```julia
using Statistics
pharm = loadpharmacydata()
```
reading pharmacy data from /home/paul/565/assignments/PharmacyEntry/data/pharmacies.csv
```julia
census = loadcensusdata()
checklatlng!(pharm, :lat, :lng, :zip)
@show mean(pharm[:zipmatch])
```
mean(pharm[:zipmatch]) = 0.6708369283865401
0.6708369283865401
```julia
plotmap(census, pharm)
# Configuration problem with jupyter on vse.syzygy.ca makes PlotlyJS
# not display. Problems with Weave.jl also make PlotlyJS not display in
# html output ... :(
#
# Bonus problem: find a way to plot a map that works on vse.syzygy.ca
```
Decide on some rules for assigning pharmacies to markets (population centres). Implement your choice and create a variable giving the number of pharmacies in each population centre. The function distance_m
in geo.jl
might be useful. Create tables and/or graphs summarizing the distribution of number of pharmacies similar to Table 2 and Figure 3 from Bresnahan and Reiss (1991).
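For reference, distance_m presumably computes a great-circle distance between coordinates; a minimal haversine sketch looks like the following. The function name haversine_m and the argument order are my own, not necessarily those of geo.jl.

```julia
# Great-circle (haversine) distance in meters between two points given as
# (longitude, latitude) in degrees, on a sphere of radius 6,371 km.
function haversine_m(lng1, lat1, lng2, lat2; R = 6_371_000.0)
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = deg2rad(lat2 - lat1)
    Δλ = deg2rad(lng2 - lng1)
    a = sin(Δφ/2)^2 + cos(φ1)*cos(φ2)*sin(Δλ/2)^2
    2R * asin(min(1.0, sqrt(a)))
end

# Vancouver to Victoria is roughly 100 km in a straight line
haversine_m(-123.12, 49.28, -123.37, 48.43)
```

A distance cutoff like this is one simple way to decide whether a pharmacy is close enough to a population centre to be assigned to that market.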
Create additional summary statistics and figures.
This will be continued in a second notebook. Save your data frame to load it there.
```julia
import CSV
## For illustration purposes, here's a quick and dirty assignment of
## pharmacies to markets. I would not advise using this exactly.
subset = (x->x ∈ ["British Columbia", "Manitoba", "Manitoba/Saskatchewan",
                  "Newfoundland and Labrador", "New Brunswick",
                  "New Brunswick/Quebec",
                  "Prince Edward Island"]).(census[:PROV_TERR_NAME_NOM])
df = census[subset, :]
function closestPC(lat, lng; data=df)
    d = distance_m.(lng, lat, data[:lng], data[:lat])
    (dm, i) = findmin(d)
    data[:GEO_CODE][i]
end
lat = pharm[:lat]
lat[ismissing.(lat)] .= pharm[:ziplat][ismissing.(lat)]
lng = pharm[:lng]
lng[ismissing.(lng)] .= pharm[:ziplng][ismissing.(lng)]
pharm[:pc_geo_code] = closestPC.(lat, lng)
df[:pharmacies] = (g->(sum(pharm[:pc_geo_code] .== g))).(df[:GEO_CODE])
CSV.write("cleandata.csv", df);
```
```julia
df[df[:pharmacies] .> 10, [:GEO_NAME, :pharmacies, :PROV_TERR_NAME_NOM]]
```
| | GEO_NAME | pharmacies | PROV_TERR_NAME_NOM |
|---|---|---|---|
| | String⍰ | Int64 | String⍰ |
1 | Bathurst | 11 | New Brunswick |
2 | Bay Roberts | 12 | Newfoundland and Labrador |
3 | Beausejour | 12 | Manitoba |
4 | Brandon | 20 | Manitoba |
5 | Campbell River | 14 | British Columbia |
6 | Charlottetown | 22 | Prince Edward Island |
7 | Chilliwack | 22 | British Columbia |
8 | Clarenville-Shoal Harbour | 14 | Newfoundland and Labrador |
9 | Corner Brook | 12 | Newfoundland and Labrador |
10 | Duncan | 12 | British Columbia |
11 | Edmundston | 25 | New Brunswick |
12 | Fredericton | 20 | New Brunswick |
13 | Kamloops | 23 | British Columbia |
14 | Kelowna | 38 | British Columbia |
15 | Moncton | 35 | New Brunswick |
16 | Nanaimo | 32 | British Columbia |
17 | Parksville | 13 | British Columbia |
18 | Penticton | 16 | British Columbia |
19 | Portage la Prairie | 12 | Manitoba |
20 | Prince George | 21 | British Columbia |
21 | Prince Rupert | 12 | British Columbia |
22 | Saint John | 22 | New Brunswick |
23 | Selkirk | 11 | Manitoba |
24 | St. John’s | 63 | Newfoundland and Labrador |
25 | Stony Mountain | 42 | Manitoba |
26 | Summerside | 13 | Prince Edward Island |
27 | Vancouver | 192 | British Columbia |
28 | Vernon | 22 | British Columbia |
29 | Victoria | 98 | British Columbia |
30 | White Rock | 167 | British Columbia |
31 | Winkler | 12 | Manitoba |
32 | Winnipeg | 156 | Manitoba |
33 | Abbotsford | 44 | British Columbia |
34 | Chatham - Douglastown | 12 | New Brunswick |
35 | Niverville | 13 | Manitoba |
36 | Lions Bay | 29 | British Columbia |
37 | Aldergrove | 31 | British Columbia |
38 | St. Adolphe | 39 | Manitoba |
39 | McEwen | 17 | New Brunswick |
40 | New Maryland | 11 | New Brunswick |
41 | Quispamsis - Rothesay | 11 | New Brunswick |
42 | Ile des Chênes | 11 | Manitoba |
43 | Ladner | 287 | British Columbia |
44 | Mission | 11 | British Columbia |
```julia
df[df[:pharmacies] .== 0, [:GEO_NAME, :pharmacies, :lat, :lng, :PROV_TERR_NAME_NOM]]
```
| | GEO_NAME | pharmacies | lat | lng | PROV_TERR_NAME_NOM |
|---|---|---|---|---|---|
| | String⍰ | Int64 | Float64⍰ | Float64⍰ | String⍰ |
1 | Cumberland | 0 | 49.6242 | -125.03 | British Columbia |
2 | Fortune | 0 | 47.0722 | -55.8248 | Newfoundland and Labrador |
3 | Killarney | 0 | 49.1806 | -99.6663 | Manitoba |
4 | Logan Lake | 0 | 50.4904 | -120.812 | British Columbia |
5 | Sicamous | 0 | 50.8233 | -118.98 | British Columbia |
6 | Oakbank | 0 | 49.9375 | -96.8424 | Manitoba |
7 | Lorette | 0 | 49.7411 | -96.8684 | Manitoba |
8 | Crofton | 0 | 48.8712 | -123.651 | British Columbia |
9 | Roberts Creek | 0 | 49.4252 | -123.644 | British Columbia |
10 | Harrison Hot Springs | 0 | 49.2962 | -121.781 | British Columbia |
11 | Mitchell | 0 | 49.5335 | -96.7636 | Manitoba |
12 | St. Theresa Point | 0 | 53.8275 | -94.8511 | Manitoba |
13 | Grunthal | 0 | 49.4065 | -96.8603 | Manitoba |
14 | Blumenort | 0 | 49.6033 | -96.7006 | Manitoba |
15 | Landmark | 0 | 49.6714 | -96.8177 | Manitoba |
16 | St-Pierre-Jolys | 0 | 49.4401 | -96.9867 | Manitoba |
17 | Canoe | 0 | 50.7519 | -119.235 | British Columbia |
18 | Naramata | 0 | 49.5929 | -119.583 | British Columbia |
19 | Telkwa | 0 | 54.6893 | -127.058 | British Columbia |
20 | Cedar | 0 | 49.1069 | -123.856 | British Columbia |
Bresnahan, Timothy F., and Peter C. Reiss. 1991. “Entry and Competition in Concentrated Markets.” Journal of Political Economy 99 (5): pp. 977–1009. http://www.jstor.org/stable/2937655.
Ellickson, P.B. 2007. “Does Sutton Apply to Supermarkets?” The RAND Journal of Economics 38 (1): 43–59. http://onlinelibrary.wiley.com/doi/10.1111/j.1756-2171.2007.tb00043.x/abstract.