This article was published on J-Source’s website.
Do you remember when Twitter lost $8 billion in market value in just a few hours? It was all because of a web scraper, a tool used by companies, and by the geekiest reporters!
Let’s go back in time. Last April, Twitter was supposed to announce its quarterly financial results once the stock markets closed. Because the results were somewhat disappointing, Twitter wanted to avoid a sudden loss of confidence among traders. Unfortunately, due to a mistake, the results were published online for 45 seconds while the stock markets were still open.
These 45 seconds were enough for a web-scraping bot to find the results, format them and automatically publish them on… Twitter! How ironic!
Nowadays, even bots have scoops from time to time!
A web scraper is simply a computer program that reads the HTML code of web pages and analyzes it. With this kind of bot, it’s possible to extract data and information from websites.
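To give a concrete idea of what such a bot does, here is a minimal sketch in Python using only the standard library. It pulls the page title out of a chunk of HTML; the HTML string is a made-up placeholder so the sketch runs without network access (in a real scraper you would fetch it with `urllib.request.urlopen(url).read()`).

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Minimal scraper: collect the text inside the page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Placeholder HTML standing in for a downloaded page.
html = "<html><head><title>Quarterly results</title></head><body>...</body></html>"
scraper = TitleScraper()
scraper.feed(html)
print(scraper.title)  # Quarterly results
```

The same principle, applied to result tables instead of titles, is what lets a bot like Selerity’s read a financial report in seconds.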
— Selerity (@Selerity) April 28, 2015
Once the tweet was published, traders went wild. It was a disaster for Twitter. Selerity, the company behind the bot, which specializes in real-time analysis, became the target of much criticism. The company explained the situation a few minutes later.
For a bot, 45 seconds is an eternity. According to the company, it took only three seconds for its bot to publish the financial results!
Web scraping and journalism
More and more public institutions publish their data on websites. Web scraping has therefore become a precious tool for reporters who know how to code.
For example, I used a web scraper to compare the prices of 12,000 SAQ products in Quebec with the prices of 10,000 LCBO products in Ontario, for a story in the Journal Métro.
Another example: when I was in Sudbury, I decided to work on food inspections in restaurants. All the results are published on the Sudbury Health Unit’s website. However, it’s impossible to download all the results at once. You can only look up the restaurants one by one.
I asked for the entire database where the results are stored. After a first refusal, I filed a freedom of information request. In the end, the Health Unit asked for a $2,000 fee to process my request…
Instead of paying, I decided to code my own bot that would extract all the results directly from the website. Here is the result:
Coded in Python, my bot takes control of Google Chrome using the Selenium library (thanks to Jean-Hugues Roy, who told me about it!). It clicks on each of the 1,600 facilities inspected by the Health Unit, extracts the data and then sends the information into an Excel file.
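The loop above can be sketched roughly as follows. The parsing helper runs on its own; the browser-driving part at the bottom requires Selenium and a ChromeDriver install, and the URL and CSS selector are hypothetical placeholders, not the Health Unit’s real page structure. A CSV file stands in for the Excel output.

```python
import csv
from html.parser import HTMLParser

class InspectionParser(HTMLParser):
    """Collect the text of every table cell on a facility's results page."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

def extract_inspection(page_html):
    """Return the table-cell values found in one facility's page HTML."""
    parser = InspectionParser()
    parser.feed(page_html)
    return parser.cells

if __name__ == "__main__":
    # Browser-driving sketch (needs Selenium + ChromeDriver installed).
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/inspections")  # placeholder URL
    rows = []
    for link in driver.find_elements(By.CSS_SELECTOR, "a.facility"):  # placeholder selector
        link.click()                                  # open one facility's results
        rows.append(extract_inspection(driver.page_source))
        driver.back()                                 # return to the listing
    with open("inspections.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
    driver.quit()
```

Separating the parsing from the clicking keeps the slow, browser-bound part as small as possible and makes the extraction logic easy to test on saved HTML.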
Doing all of that by hand would take you weeks… but my bot needs only one night!
But while my bot was tirelessly combing through thousands of lines of code, one question kept nagging at me: what are the ethical rules of web scraping?
Do we have the right to extract any information found on the web? Where is the limit between hacking and scraping? And how can you ensure that the process is transparent for the institutions targeted and the public that will read the story?
As reporters, because of the nature of our work, we have to respect the highest ethical standards. Otherwise, how could the public trust the facts we report to them?
Unfortunately, the code of conduct of the Fédération professionnelle des journalistes du Québec, adopted in 1996 and amended in 2010, is showing its age and offers no clear answers to my questions.
The Ethics Guidelines of the Canadian Association of Journalists, although more recent (2011), don’t shed any light on the matter either.
So, I decided to find the answers myself, by contacting several data reporters across the country!
Stay tuned. The second part of this article will be published soon!
PS: If you want to try web scraping, I published a short tutorial last February. You will see how to extract data from the Parliament of Canada website!