Members of Parliament
and web scraping

I was looking for a small personal project on the Members of the House of Commons. Nothing serious. Just something to code and have fun with.

The result: an interactive timeline of the occupations of Members of Parliament throughout history!

Surprising how involved farmers were in politics prior to the 1970s, isn’t it? You may also notice how being a politician became a full-time occupation around 2000. Click on the chart below for all the details!

How did I create this interesting timeline? I coded my own program to extract data from the Parliament of Canada website!

#I Find the information

While looking for information on the Members of Parliament, I found this interesting webpage. When you choose a general election (from the first one in 1867 to the last one in 2011), several details are presented on the candidates.


For example, here is the information for Calgary-Centre and Calgary-Centre-North, for the 2011 general election. As you can see, the occupation of each candidate is indicated.

Of course, we could ask the website administrators to send us the database. But it is much more exciting to extract the data ourselves! And no need to wait on anyone!

First step: look at the source code of the webpage.

Code source history

Have you noticed the <table> tag on line 123? And the <tr> and <td> tags that follow?

These HTML tags create an HTML table on the webpage. It’s perfect for us! The structure is repetitive, so a program can easily go through it and extract whatever we want.

Now, let’s analyse the link of the webpage:

It’s quite simple. It’s a request for the server hosting the database, where all the information is stored.

Do you see the 41? This number tells the server that you wish to obtain the data for the 41st general election, in 2011.

What happens when you replace 41 with 40? The webpage that opens will give you the data for the 2008 election, which was the 40th general election!

With a link that simple, it will be a piece of cake for the program we’re going to code. It will be able to open the webpage for each election without trouble!
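As a rough sketch of that idea, the snippet below builds one address per election by swapping the election number into a template. The address is a made-up placeholder: `BASE_URL` and its `election` parameter are assumptions for illustration, not the real Parliament of Canada link, so adapt them before using this.

```python
# Hypothetical placeholder, NOT the real Parliament of Canada address.
BASE_URL = "https://example.org/elections?election={n}"


def election_url(n):
    """Return the address of the page for the n-th general election."""
    return BASE_URL.format(n=n)


# One address per general election, from the 1st (1867) to the 41st (2011).
urls = [election_url(n) for n in range(1, 42)]
print(urls[0])
print(urls[-1])
```

Only the number changes from one address to the next, which is what makes the pages so easy to visit in a loop.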

#II Gather the data

Since there were 41 elections between 1867 and 2011, the data we want is on 41 different webpages.

Therefore, we will start by writing a short script that will save all the webpages on our hard drive.

I opened my text editor and coded a simple script in Python. I used the urllib2 library to open the webpages and the time library to impose a short delay.

First, we load the libraries. Then we create a loop that will run 41 times, with a variable n that goes from 1 to 41. We can then insert n into the link of the webpage to access all of them!

Then, we choose where we want to save the webpage. Again, we can use the variable n, this time for the name of the file on our hard drive.

And now, we open the webpage, we create the HTML file on our computer, we copy the webpage data into the HTML file, and we close the file!

At the end of the loop, I added a print statement. It shows the program in action in the terminal window. I also added a one-second delay to avoid sending too many requests in quick succession to the Parliament server!
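The steps described so far could look roughly like the sketch below. It is not the original script: it uses `urllib.request`, the Python 3 counterpart of urllib2, and the address in `BASE_URL`, the `election_{n}.html` file names and the `download` parameter are all assumptions added for illustration.

```python
import time
import urllib.request
from pathlib import Path

# Hypothetical placeholder address; replace it with the real link.
BASE_URL = "https://example.org/elections?election={n}"


def fetch(url):
    """Download one webpage and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pages(folder, first=1, last=41, delay=1.0, download=fetch):
    """Save one HTML file per election and return the list of saved paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for n in range(first, last + 1):
        url = BASE_URL.format(n=n)             # n goes into the link...
        path = folder / f"election_{n}.html"   # ...and into the file name
        path.write_bytes(download(url))        # create, fill and close the file
        print(f"Saved election {n} to {path}")
        saved.append(path)
        time.sleep(delay)                      # short pause between requests
    return saved
```

Calling `save_pages("pages")` would download the 41 pages into a `pages` folder. The `download` parameter is only there so the loop can be tried without touching the network, for example with `download=lambda url: b"<html></html>"` and `delay=0`.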

Once at the end, the program goes back to the start of the loop, downloads the next page, and so on and so forth, 41 times to get the 41 general elections. Here is the script in action:


Voila! In two minutes, we managed to save 41 webpages!

#III Sort the data

We now have all the data on our computer. However, it’s all over the place and impossible to use for the moment.

We have to write a second Python script that will go through the code of the 41 HTML files. This short program will extract the data we are interested in and copy it into a clean text file, like a CSV file!

To do so, I used the BeautifulSoup library to work with the HTML code and the unicodecsv library to make the final text file.

To start, we load the libraries and we create the text file, which will be delimited with tabs. We can also write the first row, for the column headers.
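Here is a minimal sketch of that first step. It uses Python 3's built-in `csv` module instead of unicodecsv (Python 3 text files handle Unicode on their own), and the column names in `HEADERS` as well as the `create_output` helper are assumptions based on the fields mentioned in this post.

```python
import csv

# Assumed column names, based on the fields described in this post.
HEADERS = ["year", "district", "name", "party", "occupation", "votes", "percent"]


def create_output(path):
    """Create a tab-delimited text file and write the header row.

    Returns the open file handle (to close at the end) and the csv writer.
    """
    handle = open(path, "w", newline="", encoding="utf-8")
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(HEADERS)
    return handle, writer
```

A script would call `handle, writer = create_output("members.tsv")` once, then `writer.writerow([...])` for each Member of Parliament, and finally `handle.close()`.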

Then, we create a loop that will go through the 41 HTML files. Each of them has to be passed into BeautifulSoup before we can work with it. (The variable counter will be useful later, to follow what the script is doing.)

Here we go! We can extract the data now. First, we isolate everything between <tr> tags. This tag marks the rows of the HTML table. In the source code of each page, we can see that the second row is always an electoral district followed by the election date. Since the date always has the same format, we can extract the year very easily by counting the characters from the right.
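Counting characters from the right can be sketched as below. The exact date format on the page is not reproduced in this post, so the sample text and the default `end_offset` are assumptions: the sketch supposes the year is followed by six more characters, as in `2011/05/02`.

```python
def year_from_date(text, end_offset=6):
    """Slice the four-digit year out of `text`, counting from the right.

    `end_offset` is the number of characters that follow the year.
    """
    stop = len(text) - end_offset
    return text[stop - 4:stop]


# Hypothetical row text: a district followed by a date.
print(year_from_date("Calgary-Centre 2011/05/02"))
```

Computing `stop` from `len(text)` rather than writing a negative slice keeps the function correct even when nothing follows the year (`end_offset=0`).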

Have you noticed the small “✓” symbol on the rows of the winning candidates? The source code shows that it is an image coded like this: <img src='images/check.gif'>. It’s perfect for us! As shown below, it lets us work with just the elected candidates, who are the Members of Parliament! We can now isolate each cell in the table, for each MP!

Everything is ready to extract the data! On each row, the candidate’s name is in the first cell, at index “0”. Next come the political party, the occupation, the number of votes and the percentage of votes. We store each piece of information in a variable.

Sometimes, Members of Parliament are elected by acclamation. When that is the case, the check is replaced by accl. Therefore, we need to add another condition to our script to be sure we get everyone!
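The two conditions and the cell order described above could be sketched as follows. To stay free of dependencies, this sketch uses Python's built-in `html.parser` rather than BeautifulSoup, so it is not the original script. The sample table, the candidate names, the position of the check image and the exact text of the "accl" marker are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows: {"cells": [...], "has_check": bool}
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the cell being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "has_check": False}
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "img" and self._row is not None:
            src = dict(attrs).get("src") or ""
            if src.endswith("check.gif"):
                self._row["has_check"] = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row["cells"].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def elected_members(html):
    """Return one dict per row marked with the check image or with 'accl'."""
    parser = RowCollector()
    parser.feed(html)
    members = []
    for row in parser.rows:
        cells = row["cells"]
        by_acclamation = any(c.lower().startswith("accl") for c in cells)
        if cells and (row["has_check"] or by_acclamation):
            # Name at index 0, then party, occupation, votes and percentage.
            name, party, occupation, votes, percent = (cells + [""] * 5)[:5]
            members.append({"name": name, "party": party,
                            "occupation": occupation,
                            "votes": votes, "percent": percent})
    return members


# Hypothetical sample: names and figures are invented for the example.
SAMPLE = """
<table>
<tr><td>Candidate</td><td>Party</td><td>Occupation</td><td>Votes</td><td>%</td><td></td></tr>
<tr><td>Jane Doe</td><td>Party A</td><td>farmer</td><td>12,345</td><td>55.0</td><td><img src='images/check.gif'></td></tr>
<tr><td>John Roe</td><td>Party B</td><td>lawyer</td><td>10,000</td><td>45.0</td><td></td></tr>
<tr><td>Ann Poe</td><td>Party A</td><td>merchant</td><td></td><td></td><td>accl.</td></tr>
</table>
"""

for member in elected_members(SAMPLE):
    print(member["name"], "-", member["occupation"])
```

Rows without the check image or the acclamation marker, such as the header row and the losing candidate, are simply skipped.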


And now, for the grand finale! We print all of the variables on the screen to be sure the right data is stored, and we write the rows one by one into the final text file! I added a print statement with the variable counter. It shows us the script working in the terminal window.

In summary, our script will open each saved webpage, go through each row of the HTML tables, extract the data from the cells and then copy the information into a text file. Here’s what happens when we run the script:

Tada! In less than four minutes, we extracted the data from 41 HTML files and copied it into a text file of more than 10,000 rows! It’s awesome, isn’t it?

Here is the result when we open the text file with a spreadsheet application. A nice and clean file! Now, we just have to open it in Tableau to create the timeline, and it’s done!

Final file
