This article was published on J-Source’s website.
Web scraping is a way to extract information presented on websites. As I explained in the first part of this article, many companies use it.
It’s also a great tool for reporters who know how to code, since more and more public institutions publish their data on their websites.
With web scrapers, which are also called “bots”, it’s possible to gather data for journalistic stories. For example, I created one to compare alcohol prices in Quebec and Ontario.
My colleague Florent Daudens, who works for Radio-Canada, also used a web scraper on Kijiji ads to compare rents across several Montreal neighbourhoods.
But what are the ethical rules that reporters have to follow while web scraping?
These rules are particularly important since, to people without a technical background, web scraping can look like hacking.
Unfortunately, neither the Code of Ethics of the Fédération professionnelle des journalistes du Québec (FPJQ) nor the Ethics Guidelines of the Canadian Association of Journalists gives a clear answer to this question.
So I looked for the answers myself by putting the question to several data reporter colleagues.
#I Public data or not?
First consensus among data reporters: if an institution publishes data on its website, this data automatically becomes public.
Cédric Sam works for the South China Morning Post, in Hong Kong. He also worked for La Presse and Radio-Canada. “I do web scraping almost every day [ED: on Chinese government websites],” he says.
For him, bots have as many rights as their human creators. “Whether it’s a human who copies and pastes the data or a human who codes a computer program to do it, it’s the same. It’s like hiring 1,000 people to work for you. It’s the same result.”
However, government servers also host personal information about citizens. “Most of this data is hidden because it would otherwise violate privacy laws,” says William Wolfe-Wylie, a developer for CBC and a journalism teacher at Centennial College and the Munk School.
Here lies the crucial line between web scraping and hacking: respect for the law.
Reporters should not pry into protected data. If a regular user can’t access it, journalists shouldn’t try to get it. “It’s very important that reporters acknowledge these legal barriers, which are legitimate ones, and respect them,” says William Wolfe-Wylie.
Another important detail to check: the robots.txt file, which sits at the root of the website and states what may and may not be scraped. For example, here is the file for the Royal Bank of Canada: http://www.rbcbanqueroyale.com/robots.txt
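As a minimal sketch of how a scraper could consult robots.txt before fetching a page, Python’s standard library includes a parser for this file. The robots.txt content, the bot name “newsroom-scraper” and the example.org URLs below are all invented for illustration; a real script would read the file published by the website it targets.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a queryable object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
user_agent = "newsroom-scraper"  # hypothetical bot name

# Ask whether this bot may fetch each page, according to the rules above.
allowed_public = parser.can_fetch(user_agent, "https://example.org/data/prices.html")
allowed_private = parser.can_fetch(user_agent, "https://example.org/private/records.html")

# Some sites also state how many seconds to wait between requests.
delay = parser.crawl_delay(user_agent)

print(allowed_public, allowed_private, delay)
```

With these sample rules, the page under /data/ is allowed, the page under /private/ is not, and the requested pause is five seconds.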
#II To identify yourself or not?
When you are a reporter and you want to ask someone questions, the first thing to do is to introduce yourself and the story you are working on.
But what should you do when it’s a bot that is sending queries to a server or a database? Should the same rule apply?
For Glen McGregor, national affairs reporter for the Ottawa Citizen, yes, you should. “In the HTTP headers [ED: parameters that identify users on a website], I put my name, my phone number and a note saying: ‘I am a reporter extracting data from this webpage. If you have any problem or concern, call me.’ So, if the web administrator suddenly sees a huge amount of hits on his website, freaks out and thinks he’s under attack, he can check who’s doing it. He will see my note and my phone number. I think it’s an important ethical thing to do.”
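Here is a rough sketch of what identifying yourself in the HTTP headers might look like in Python. It is not Glen McGregor’s actual script: the function name, the “X-Scraper-Note” header, the reporter’s name, the phone number and the URL are all placeholders chosen for this example. The request is only built here, not sent.

```python
from urllib.request import Request


def build_identified_request(url: str, name: str, phone: str, note: str) -> Request:
    """Build a request whose headers say who is scraping and how to reach them."""
    headers = {
        # The User-Agent header normally names the software making the request.
        "User-Agent": f"newsroom-scraper (operated by {name}; tel. {phone})",
        # Hypothetical custom header carrying a plain-language note.
        "X-Scraper-Note": note,
    }
    return Request(url, headers=headers)


request = build_identified_request(
    "https://example.org/data/prices.html",  # placeholder URL
    name="Jane Reporter",                    # placeholder name
    phone="555-0100",                        # placeholder phone number
    note=(
        "I am a reporter extracting data from this webpage. "
        "If you have any problem or concern, call me."
    ),
)

# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```

A web administrator who inspects the server logs would then see the contact details next to each hit, assuming the server records the User-Agent header.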
“Sometimes, I use proxies,” he says. “I change my IP address and I change my headers too, to make it look like a real human instead of a bot. I try to respect the rules, but I also try to be undetectable.”
Not identifying yourself when you extract data from a website could be compared, in some ways, to conducting interviews with a hidden microphone or camera. The FPJQ Code of Ethics sets out some rules on this.
4 a) Undercover procedures
In certain cases, journalists are justified in obtaining the information they seek through undercover means: false identities, hidden microphones and cameras, imprecise information about the objectives of their news reports, spying, infiltrating…
These methods must always be the exception to the rule. Journalists use them when:
* the information sought is of definite public interest; for example, in cases where socially reprehensible actions must be exposed;
* the information cannot be obtained or verified by other means, or other means have already been used unsuccessfully;
* the public gain is greater than any inconvenience to individuals.
The public must be informed of the methods used.
Therefore, according to this article of the code, the best practice would be to identify yourself in your code, even if it’s a bot that does all the work.
However, if there’s a possibility that the targeted institution would restrict access to the data because a reporter is trying to gather it, you should be more discreet.
And those who are afraid of being blocked if they identify themselves as reporters shouldn’t worry. Changing your IP address is quite easy!
For some reporters, the best practice is also to ask for the data before scraping it. For them, web scraping should be an option only after a refusal.
This approach has an advantage: if the institution answers quickly and gives you the raw data, it will save you time!
#III To publish your code or not?
Transparency is another very important aspect of journalism. Without it, the public wouldn’t trust reporters’ work.
The vast majority of data reporters publish the data they used for their stories. This act of transparency shows that their reports are based on real facts that the public can check if it wants to. But what about their code?
An error in a web scraper script can completely skew the data it collects. So should the code be public as well?
By comparison, in open source software, revealing the code is a must. The main reason is to allow others to improve the software, but also to give confidence to users, who can check in detail what the software is doing.
However, for coder-reporters, to reveal or not to reveal is a difficult choice.
“In some ways, we are businesses,” says Cédric Sam. “I think that if you have a competitive edge and if you can continue to find stories with it, you should keep it to yourself. You can’t reveal everything all the time.”
For Roberto Rocha, the code shouldn’t be published.
“I really think that the tide lifts all boats,” says Philippe Gohier. “The more we share scripts and technology, the more it will help everybody. I’m not doing anything that someone can’t do with some effort. I am not reshaping the world.”
Jean-Hugues Roy agrees and adds that journalists should allow others to replicate their work, like scientists do by publishing their methodology.
Nonetheless, the professor specifies that there are exceptions. Jean-Hugues Roy is currently working on a bot that would extract data from SEDAR, where documents from Canadian publicly traded companies are published.
“I usually publish my code, but this one, I don’t know. It’s complicated and I put a lot of time into it.”
For his part, Glen McGregor doesn’t publish his scripts, but sends them to anyone who asks.
So, should we publish our scripts or not?
When a reporter has a source, he will do everything in his power to protect it. The reporter will do so to earn the confidence of his source, who will hopefully give him more sensitive information. But the reporter also does this to keep his source to himself!
So, in the end, isn’t a web scraper a bot version of a source?
Will reporters’ bots be patented sometime soon?
Who knows? Perhaps one day a reporter will refuse to reveal his code the same way Daniel Leblanc refused to reveal the identity of his source called “Ma Chouette”.
After all, day after day, bots are looking more and more like humans!
PS: Respecting the web infrastructure is of course another golden rule of web scraping. Since it’s more a technical detail than an ethical dilemma, I didn’t cover it in this article. However, always leave several seconds between your requests! Don’t overload servers!
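To illustrate the pause between requests, here is a minimal sketch in Python. The function name, the five-second default and the example.org URLs are assumptions made for this example. The `fetch` argument stands in for whatever function actually downloads a page, and the `sleep` argument is replaceable so the example can run without really waiting.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demonstration with stand-ins: no network access and no real waiting.
pauses: List[float] = []
pages = fetch_politely(
    ["https://example.org/page1", "https://example.org/page2", "https://example.org/page3"],
    fetch=lambda url: f"contents of {url}",  # stand-in for a real HTTP request
    delay_seconds=5.0,
    sleep=pauses.append,  # records each pause instead of sleeping
)

print(len(pages), pauses)
```

In a real scraper, `sleep` would be left at its default so the script actually waits, and `fetch` would perform the HTTP request.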