The web scraping golden rules, Part I

This article was published on J-Source’s website.

Do you remember when Twitter lost $8 billion in just a few hours? It was because of a web scraper, a tool used by companies and by the geekiest reporters alike!

Let’s go back in time. Last April, Twitter was supposed to announce its quarterly financial results once the stock markets closed. Because the results were a little disappointing, Twitter wanted to avoid a brutal loss of confidence among traders. Unfortunately, because of a mistake, the results were published online for 45 seconds while the stock markets were still open.

Those 45 seconds were enough for a bot programmed to scrape the web to find the results, format them and automatically publish them on… Twitter! How ironic!

Nowadays, even bots get scoops from time to time!

A web scraper is simply a computer program that reads the HTML code of webpages and analyzes it. With this kind of bot, it’s possible to extract data and information from websites.
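
To give you an idea, here is a minimal sketch in Python, using the popular requests and BeautifulSoup libraries. The URL and the selector are made up for illustration; a real scraper would target the page you actually care about.

import requests
from bs4 import BeautifulSoup

# Download the HTML code of a page (the URL is hypothetical).
response = requests.get("https://example.com/results")

# Parse the HTML so we can search through it.
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every table cell on the page.
for cell in soup.select("td"):
    print(cell.get_text(strip=True))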


Once the tweet was published, traders went crazy. It was a disaster for Twitter. Selerity, the company behind the bot, which specializes in real-time analysis, became the target of heavy criticism. The company explained the situation a few minutes later.

For a bot, 45 seconds is an eternity. According to the company, it took its bot only three seconds to publish the financial results!

Web scraping and journalism

More and more public institutions publish their data on websites. Therefore, web scraping is becoming a precious tool for reporters who know how to code.

For instance, I used a web scraper to compare the prices of 12,000 products from the SAQ in Quebec with the prices of 10,000 products from the LCBO in Ontario, for a story for the Journal Métro.

My colleague Florent Daudens, from Radio-Canada, also used a web scraper to compare rent prices on Kijiji.

Another example: when I was in Sudbury, I decided to look into restaurant food inspections. All the results are published on the Sudbury Health Unit’s website. However, it’s impossible to download them all at once; you can only check the restaurants one by one.

I asked for the entire database where the results are stored. After a first refusal, I filed a freedom of information request. In the end, the Health Unit asked for a $2,000 fee to process my request…

Instead of paying, I decided to code my own bot to extract all the results directly from the website.

Coded in Python, my bot takes control of Google Chrome with the Selenium library (thanks to Jean-Hugues Roy, who told me about it!). It clicks on each of the results for the 1,600 facilities inspected by the Health Unit, extracts the data and then sends the information to an Excel file.
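
For the curious, here is a simplified sketch of that workflow in Python with Selenium. It is not my actual bot: the URL and the CSS selectors are invented for illustration, and this version saves everything to a CSV file, which Excel opens without a problem.

import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium takes control of a real Chrome window.
driver = webdriver.Chrome()
driver.get("https://www.example-health-unit.ca/inspections")  # hypothetical URL

# First collect the link to every facility, then visit them one by one.
urls = [link.get_attribute("href")
        for link in driver.find_elements(By.CSS_SELECTOR, "a.facility-link")]  # hypothetical selector

with open("inspections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["facility", "result"])
    for url in urls:
        driver.get(url)
        # Extract the facility name and its inspection result (hypothetical selectors).
        name = driver.find_element(By.CSS_SELECTOR, "h1").text
        result = driver.find_element(By.CSS_SELECTOR, ".inspection-result").text
        writer.writerow([name, result])

driver.quit()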

Doing all of that by hand would take you weeks… But my bot needed only one night!


But while my bot was tirelessly churning through thousands of lines of code, one question kept tormenting me: what are the ethical rules of web scraping?

Do we have the right to extract any information found on the web? Where is the limit between hacking and scraping? And how can you ensure that the process is transparent for the institutions targeted and the public that will read the story?

As reporters, because of the nature of our work, we have to respect the highest ethical standards. Otherwise, how could the public trust the facts we report to them?

Unfortunately, the code of conduct of the Fédération professionnelle des journalistes du Québec, adopted in 1996 and amended in 2010, is showing its age and offers no clear answer to my questions.

The ethics guidelines of the Canadian Association of Journalists, although more recent (2011), don’t shed any light on the matter either.

As Jean-Hugues Roy, a journalism professor at UQAM, puts it: “These are new territories. These are new tools that push us to rethink what ethics is, and ethics has to evolve with them.”

So I decided to find the answers myself, by contacting several data reporters across the country!

Stay tuned. The second part of this article will be published soon!

Follow me on Twitter, Facebook or LinkedIn so you won’t miss a thing!

PS: If you want to try web scraping yourself, I published a short tutorial last February. You will see how to extract data from the Parliament of Canada website!
