Public web data is being used for various purposes, from impactful research to creating and improving products.
One of the most common ways to acquire such data is web scraping. Because scraping involves gathering large amounts of data that businesses or individuals have published online, it raises the question: is it ethical? The short answer is yes, provided you follow certain practices, but we need to lay the groundwork before diving in.
This article aims to shine a light on the topic of ethical web scraping and explore the intensifying debate surrounding the use of public web data for developing AI. Before proceeding, please note that this is an informational article and does not constitute legal advice.
Scraping is an automated method of collecting data from the web at scale. There are around 202 million active websites on the internet today, and each contains a wealth of information valuable to people and businesses. A website can be a small travel blog, the world's leading online marketplace, or anything in between.
One thing all of these websites have in common is that most of them contain publicly available data. Gathering this data in an automated way is generally in line with the current regulatory landscape. However, certain conditions apply, and businesses need to handle public data with caution and care.
Organizations either scrape the web themselves or buy services from data providers that specialize in web scraping. The large amounts of valuable web data they collect serve a wide variety of purposes.
Without web scraping, some of the most widely used services, including the most popular online search tools, wouldn't exist. Web scraping also has many applications in research, such as environmental studies.
Earlier, I pointed out that public web data must be obtained in line with the current regulatory landscape and with principles that ensure the overall safety of the process and the data gathered. However, web scraping sometimes appears in a negative context because of cases where legal and ethical principles are dismissed or the purpose of the activity is malicious.
Let's examine the legal side of the question to help us understand what web scraping activities are in line with the current regulatory standards.
In this article, I'm discussing scraping exclusively public web data from publicly available online sources, which means that such data is available to anyone without signing up or logging into the website. Usually, data located behind login-secured areas is governed by the website's terms and conditions.
Some public web data may contain copyrighted material. When working with such data, you must follow applicable copyright laws.
Similarly to copyrighted materials, some public web data may contain data that is protected by privacy laws. Privacy regulations around the world are continuously evolving and vary by jurisdiction.
For example, the majority of the regulations in U.S. states, such as the California Consumer Privacy Act (CCPA), do not classify publicly available information as personal data. However, European regulations, like the General Data Protection Regulation (GDPR), do not exempt public data. Therefore, GDPR-related data security and privacy measures must be considered when collecting web data.
Now that you are familiar with the legal side of web scraping, I'll mention one of the most notable court cases that illustrates how these legal requirements apply to the real-life use of web scraping technologies.
In 2017, LinkedIn issued a cease-and-desist letter to hiQ Labs, a data science company that scraped publicly available LinkedIn data and used it to create tools and insights. The dispute evolved into a six-year legal battle that is now known as a landmark case in the web scraping industry.
The first court ruling favored hiQ Labs, but LinkedIn appealed, arguing that hiQ Labs was breaching the Computer Fraud and Abuse Act (CFAA). Still, at that time, the court decided that since the data hiQ Labs scraped from LinkedIn was public, the company was not violating the CFAA.
As the legal dispute continued, the center of the case shifted to hiQ Labs's use of fake profiles to scrape LinkedIn's data. The second ruling, in 2022, held that scraping web data behind the login wall using a fake profile breached the website's terms and conditions. Eventually, the companies reached a settlement in which hiQ Labs agreed to stop scraping LinkedIn.
However, it is important to highlight that the previous precedent regarding CFAA and public data scraping was not overruled by the second ruling. Instead, the judgment decided on a different legal question mostly related to the User Agreement and the usage of fake accounts when collecting data from LinkedIn.
It is important to note that by creating fake accounts, hiQ Labs had also accepted LinkedIn's User Agreement, which prohibits creating false identities, prior to accessing LinkedIn's online services.
Therefore, the United States District Court's order, dated 27 October 2022, held that LinkedIn's User Agreement prohibits scraping and unauthorized use of the scraped data, and that hiQ breached that agreement through the creation of false identities on LinkedIn's platform by hired crowdworkers ("turkers").
You should still note that the field of public web data scraping is constantly changing, and relevant case law developments involving web data companies should be followed.
I recommend consulting legal experts about any business activities related to it.
As mentioned above, besides the legal side of public web data collection, there are also ethical aspects of web scraping. Over time, these principles have become part of an unwritten code of conduct for players in this field. The key considerations include respecting a website's robots.txt rules, limiting request rates so scraping does not disrupt a site's normal operation, collecting only publicly available data, and being transparent about the scraper's identity and purpose.
Responsible businesses treat ethical web scraping as a commitment inseparable from establishing themselves as reputable players in the public web data business.
Last year, a group of leading web data aggregation companies launched the Ethical Web Data Collection Initiative, which focuses on encouraging dialogue and improving digital peace of mind for consumers and companies. They have since announced a list of ethical web data collection principles.
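One widely adopted practice mentioned above, honoring a site's robots.txt rules before fetching a page, can be sketched in a few lines of Python using only the standard library. The robots.txt content and bot name below are made up for illustration:

```python
import urllib.robotparser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(ROBOTS_TXT, "example-bot", "https://example.com/blog/post"))     # True
print(allowed_by_robots(ROBOTS_TXT, "example-bot", "https://example.com/private/data"))  # False
```

In a real scraper, you would download robots.txt from the target site, check every URL this way before requesting it, and pause between requests so the site's performance isn't affected.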
High-quality web data contributed to bringing one of the key types of AI technology, the large language model (LLM), to where it is today. Training LLMs to understand human language and generate context-aware responses requires huge amounts of data.
For example, training GPT-3, a predecessor of GPT-4, required forty-five terabytes of text. Publicly available information from the internet is one of the key pillars of the data used to train AI.
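To get a feel for that scale, here is a rough back-of-envelope conversion. The byte-per-word and words-per-page figures are my own assumptions, not from the article:

```python
# Back-of-envelope scale of a 45 TB text corpus.
corpus_bytes = 45 * 10**12   # 45 terabytes
bytes_per_word = 5           # assumed rough average for plain English text
words_per_page = 750         # assumed words on a typical printed page

words = corpus_bytes // bytes_per_word
pages = words // words_per_page

print(f"{words:.1e} words")  # 9.0e+12 -> roughly nine trillion words
print(f"{pages:.1e} pages")  # 1.2e+10 -> roughly twelve billion pages
```

Under these assumptions, the corpus is on the order of trillions of words, far more text than any human could read in a lifetime.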
The AI market is expected to grow exponentially in the next few years. Naturally, this raises questions about using web data to train AI and create AI products.
Some argue that the companies behind this booming technology breach copyright laws by scraping online data without permission to train AI and create new products. It is further argued that large language models like ChatGPT draw on all the data in their training sets to generate responses, sometimes "mimicking" the original content.
To better understand both sides of the argument, let's look at some cases that have happened since OpenAI launched ChatGPT, one of the leading LLMs, and made it available to the public.
At the end of 2023, The New York Times sued OpenAI for using its content to train AI, becoming the first major U.S. media outlet to do so. Since then, several other media outlets have also sued OpenAI, claiming that the company violates federal copyright laws by using their articles to train AI systems.
Some media companies took a different approach. For example, the Financial Times has made a content licensing deal with a generative AI company, allowing them to use Financial Times content to develop AI products.
OpenAI, on the other side of the discussion, argues that using public web data to train AI is fair use. From this point of view, many news publications are open to the public, with no logins or paywalls, and should therefore be considered public web data, similar to other content on the web such as Wikipedia articles, company websites, and social networking sites.
At the same time, AI is undoubtedly revolutionizing how we work and do business. It also provides society with tools that can be used for societal good, assisting researchers and scientists in finding solutions to environmental, medical, and other global challenges. Moreover, many for-profit AI companies offer powerful models, such as GPT-3.5 or Gemini Pro, free of charge.
But it comes with challenges.
This is no black-and-white issue. There is no longer any question that AI systems are becoming an integral part of our personal and professional lives; the expected trajectory of AI market growth proves the point.
However, it is still necessary to establish industry-specific principles that constitute acceptable use of web data for LLM training and separate the wheat from the chaff when it comes to AI training.
These principles might evolve and change over time. Still, they should aim for mutual agreement and common understanding, allowing entities on both sides of the argument to operate successfully and balance commercial interests while also creating the space for AI innovation for the public good.
While the ethics of scraping public web data and using it for training AI sparks many discussions, there are legitimate cases that prove the value and importance of these novel technologies.
Still, it is essential to follow the principles of ethical web scraping and, even more so, to continue working on maintaining a peaceful dialogue among all organizations involved as new technologies and challenges emerge.