Federal elections results by poll section part 1 : tidying data
The goal of this project is to study how the voting patterns in the 42nd Canadian General Election of 2015 was influenced by socioeconomic characteristics of voters at the poll level.
The project will be split in two (lengthy) posts.
In the first post, we will clean the election results and the election shapefiles and create a map of the results. Our goal is to create a
sfdata frame that will allow us to recreate this interactive
map by CBC. The polygons and the results data come as open data respectively from the Open Canada and the Elections Canada websites.
Data will be downloaded and binded using the
httrpackages. Data tables will be created using the
DTpackage and I will make sure that all the final databases are correct using the
In the second post, we will download the Census 2016 data at the dissemination area (DA) level. Dissemination areas are the smallest geographical unit for which statistics canada releases census data to the public and they have a population of around 400 and 700 persons. The census data and polygons will be downloaded using the excellent
cancensus package. We will then convert the dissemination areas to polls using the
sfpackage to identify the intersections and distribute voters
from the dissemination areas to the polling division by assuming that the population density is uniforme across the DA.
The relationship between different socioeconomic characteristics and the election results will then be visualised using
Code snippets I will be coming back to this script for, and remaining questions
Here are some of the code snippets I see myself coming back to this notebook for:
- union-ing polygons is done using sf::aggregate. Is there a better way to do it than my 2-steps aggregate detailed below? - Finding the intersection area of polygons is done using sf::st_intersection and may require st_buffer or st_make_valid, the latter requires installing RTools. - When using read_csv with unknown variable type, read everything as character then convert using type_convert() to prevent wrong guesses and NA’s. - Use single quotes when dealing with column names that have apostrophes. - colorFactor: pass the string value (domain = df\(value), but the factor *level* (domain = levels(df\)value)) - color palette for the federal parties
Somes questions surviving this analysis:
- Why is applying count() on sf data.frame so slow? In this project, I convert is using as.data.frame before hand to speed up the process. - union-ing polygons is done using sf::aggregate. Is there a better way to do it than my 2-steps aggregate detailed below? - Are my hypothesis related to the data correct ? What does poll_name == “SVR Group 1/RÉS Groupe 1” mean?
Getting the data and deciphering it
Polling divisions shapefile
The polling divisions shapefile is downloaded from Open Canada.
The zip contains multiple files, including a shapefile (PD_A.shp) and a data dictionary (Data Dictionary.pdf). A cursory look at the dictionnary gives us the following variable definitions.
We will see that together, the FED_NUM, PD_NUM and PD_NBR_SFX variables give us the smallest area for which the election results are available. They represent respectively the Federal Electoral District number, the Polling Division (PD) number and the Polling Division suffix.
The “ADV_POLL” variable is the number of the Advance Poll where people in the given polling division could vote before election day. There is no polygon for the advance poll, but once could reconstruct it by unioning the multiple PD that are included in it. We will not use distribute the votes from the advance poll to the polls that are part of it because this would add more noise.
The PD_TYPE variable denotes the type of the polling division. Value “N” is for Normal, “S” for single building polling division and “M” for “Mobile Poll”. As far as I can tell, this doesnt have an impact.
I create a “pollid” variable by concatenating FED_NUM and EMRP_NAME, where EMRP_NAME is already available in the shapefile and is the result of concatenating PD_NUM and PD_NBR_SFX. Since I will be merging the shapefile with the election results, I want to have a single row for each pollid values. Note that while PD_NBR_SFX is never missing, it is not added to EMRP_NAME if its value is 0. For example, if PD_NUM is 10 and PD_NBR_SFX is 0, then EMRP_NAME is “10”. However, if PD_NBR_SFX was 1, then EMRP_NAME would have been “10-1”.
We already have bad news. The “pollid” variable is not unique:
## options: ENCODING=Windows-1252 ## Reading layer `PD_A' from data source `/home/simon/git/snippets/content/post/data/downloads/PD_A.shp' using driver `ESRI Shapefile' ## Simple feature collection with 69534 features and 18 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 3658201 ymin: 658873 xmax: 9019157 ymax: 6083005 ## CRS: 3347
What is going on? Let’s look at the case with the most rows.
They apparently decided to create a row for each of the polygons of this poll.
We can “union” all the polygons in the shapefile so that we only have 1 row per pollid. This can be done using the sf::aggregate function.
I had to do the aggregation process in two steps because I couldnt find a way to do it in one step without generating huge files that would break my RAM.
Instead, I created an aggregated sf data.frame after dropping all variables.
Only then would the variables for the polling divisions be added. I had to drop the variables related to the address and name of the mobile poll buildings because each building in the mobile poll had its own polygon with its own address and building name.
Below is test made using
testthat that ensures that poll_shp, our final sf data.frame, only has 1 row per pollid.
Election results at the poll level
The data source
Getting this data to be compatible with the shapefile (one row per pollid) turned out to be much more of a pain than I first thought.
The data we use is downloaded from Election Canada website rather than the Open Canada website. There are two formats, with “format 2” being tidier.
There is one csv for each riding, and then there is 1 row per poll station*candidate as per this data dictionnary:
|Electoral District Number||The electoral district number.|
|Electoral District Name_English||The English name of the electoral district.|
|Electoral District Name_French||The French name of the electoral district.|
|Polling Station Number||The number assigned to the polling station,|
|for example: 3, 45A, 48-3, 601.|
|Polling Station Name||A name that generally represents the locality of the polling division boundary.|
|Void Poll Indicator||Indicates that a poll exists but has no electors.|
|No Poll Held Indicator||Indicates that the returning officer intended to hold this poll, but unforeseen circumstances prevented it.|
|Merge With||Indicates the number of the polling station with which the results of this poll were merged.|
|Rejected Ballots for Polling Station||The number of rejected ballots at this polling station.|
|Electors for Polling Station||The number of electors on the list of electors for this polling station.|
|Candidate’s Family Name||The family name of the candidate.|
|Candidate’s Middle Name||The middle name of the candidate.|
|Candidate’s First Name||The first name of the candidate.|
|Political Affiliation Name_English||The short-form English name of the candidate’s political affiliation.|
|Political Affiliation Name_French||The short-form French name of the candidate’s political affiliation.|
|Incumbent Indicator||“Y” if candidate was the incumbent, “N” otherwise.|
|Elected Candidate Indicator||“Y” if candidate was elected, “N” otherwise.|
|Candidate Poll Votes Count||The number of valid votes the candidate obtained at this polling station.|
If you pay attention more attention than I did at first, you will notice that the “Polling Station Number” is not the same thing than the PD_NUM in the shapefile. “48-3” is actually the EMRP_NAME. Also, what is the letter doing in “45A”? This appears to be because polls were sometimes broken down in “subpools”.
The “Merge With” column is another indicator of incoming trouble. When this column is not missing the votes and rejected ballots for the pool (or subpool) are 0. My best bet is that pools (or subpools) were sometimes merged with other pools before the ballots were counted.
a painful alternative source (keep away!)
The first dataset that I found is the one on the Open Canada website. To say that it isnt tidy data is an euphemism.
Like “format 2”, there are over 300 CSV files, one per riding. However, we have 1 single row per polling station, instead of one per polling station times the number of candidates. Instead, there is one column for each candidate, such as “Lorraine E. Barnett” in riding 10001 in Newfoundland.
If you want to map the results by party, like I do, you would have to fetch the information in a second CSV file that records the candidate information. This CSV file, named table_tableau12.csv has a column “Candidate”. However, the content of this column is NOT the name of the candidate as shown in the first csv, but rather the concatenation of candidate and his affiliation. The record for Lorraine e. Barnett reads as follow : “Lorraine E. Barnett Conservative/Conservateur”. Sometimes, there would be double asterisks included between the name of the candidate and the party, such as for “Judy M. Foote ** Liberal/Libéral” in riding 10002. I initally used regular expressions to remove all the 25-ish possible affiliations, then strip the trailing space and the eventual "**" to be able to separate the candidate name from the party and merge it back to the results-by-poll table.
Then, I realised that Format 2 existed outside the Open Data portal. I was not impressed.
Downloading election results at the poll level
Since there are 328 district and one CSV file for each of them, I used the
function to generate a list of each district and download them.
I first tried to download them using download.file() but I got an http error. I believe this is because the government doesnt let robots download from its website. The solution was to use the
httr::GET function, which can fake being a web browser to download the files.
Reading and EDA of poll results
I read all the csv files and concatenated them using
purrr:map_df. I ran into
a few issues:
- a character variable (“Polling Station Number”, or “poll_merged_with”) is wrongly guessed as integer. This means that some polling station numbers, such as 142A and 142B in district 24015 would be incorrectly read as NA. This was solved by reading all variables as characters then converting them using type_convert()
- Depending on which of the 328 district is being read, the columns poll_num and poll_merged_with could be integer or character. I mutate both of them to characters before binding the rows using map_df().
- Column names with apostrophes, such as
Candidate’s Middle Name/Second prénom du candidat, need to be surrounded by single quotes when wrangling them using
dplyrto prevent errors.
- Letters are sometimes appended as suffix to the “polling station number” or the “poll merged with”. I couldnt find an explanation in the documentation, but I believe these are “subpools”
- Some polls are named “SVR Group 1/RÉS Groupe 1” and “SVR Group 2 /RÉS Groupe 2”, I am not sure how to deal with them, so I ended up filtering them out.
Trailing letters in the poll name are a common occurrence: 29091 of the 396412 observations have a trailing letter in the poll name.
- Each “subpoll” has a different number of electors.
- Some of them may subpolls may or may not have been merged to another poll. (ex1 to 3)
- Some polls have had their subpolls merged to multiple different polls. (ex4)
- A poll doesnt have to be broken into subpools to be merged to another poll. (ex5)
Another important note mentionned before: when a poll is merged to another poll, the number of votes for each candidates and the number of rejected ballots is set to 0. This leads me to believe that they simply took the box containing the ballots and mixed it with the box from another poll before counting it.
Cleaning the poll results
Our desired output is a data frame containing one row per poll id, defined as the concatenation of FED_NUM and EMRP_NAME.
For each poll, we want to know the percentage of the total vote that was allocated to each of the main partys: Libéral, Convervateur, NPD, Bloc Québécois and Green Party.
I used a 2 step process:
1 - If the poll was not entirely merged to some other poll, then I use the data from the remaining subpoll(s) to know the vote distribution of that poll.
2 - If the poll was entirely merged to other poll(s), then I fetch the results of the poll where the most electors were sent.
How good is my link between poll_results and poll_shp?
Amazingly, we have results for ALL polls defined in the shapefile!
But what are these polls for which we dont have a shape? Mostly, this is the advance poll (numbers 600-630) for which we didnt expect to have a shapefile. There are a few exceptions. For example, why is there no PD_NUM==4 in FED_NUM 35058 in the shape file?
The really important thing for our purpose is to have results covering the whole map of Canada, which is the case.
## # A tibble: 3 x 15 ## FED_NUM dist_name EMRP_NAME poll_ballot_rej… poll_votes winner pct_winner ## <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> ## 1 35058 Mississa… 4 0 0 conse… NaN ## 2 35094 Scarboro… 4 0 0 autre NaN ## 3 48020 Edmonton… 4 0 0 autre NaN ## # … with 8 more variables: autre <dbl>, bloc <dbl>, conservateur <dbl>, ## # libéral <dbl>, npd <dbl>, vert <dbl>, poll_merged_with <chr>, pollid <chr> ## # A tibble: 123 x 2 ## EMRP_NAME n ## <chr> <int> ## 1 600 338 ## 2 601 338 ## 3 602 338 ## 4 603 338 ## 5 604 338 ## 6 605 337 ## 7 606 337 ## 8 607 336 ## 9 608 333 ## 10 609 326 ## # … with 113 more rows
Our link is pretty good. The final output of this notebook is a data frame named poll_final, which is basically the the initial shapefile to which we added the voting results. I exported it as a shapefile using sf::st_write and put it here on google drive.
Our results are the same than CBC, with one improvement: where they simply allowed the merged pools to be “not applicable”, we fetched the results from the poll it was merged to. I haven’t checked because I dropped the elector variable, but I suppose that CBC didnt some up the number of electors from the “mergee” and the “merged” in its shapefile. This may lead to an exagerated participation rate, where the votes of 2+ polling divisions are divided by the number of electors from a single polling division.
Below are some leaflets generated using this final data set
## Deleting source `/home/simon/git/snippets/content/post/data/interim/poll_final.shp' using driver `ESRI Shapefile' ## Writing layer `poll_final' to data source `/home/simon/git/snippets/content/post/data/interim/poll_final.shp' using driver `ESRI Shapefile' ## Writing 66335 features with 20 fields and geometry type Unknown (any).
Here is a leaflet of the poll results that were merged and for which the results were
pull from the “mergee”.