
AI course: crawlers – to block or not to block?
Editor’s note: we are republishing one of the emails from The Fix’s new AI newsletter course by Alberto Puliafito, which offers perspective and practical advice on artificial intelligence for news leaders. You can subscribe for free to access the whole course.
As AI continues to shape the media landscape, the decision to permit or restrict AI crawlers’ access to your site has significant legal, ethical, and strategic implications.
Today, we’ll explore the legal landscape surrounding robots.txt, discuss ongoing lawsuits that could reshape how content is accessed and used by AI, and examine the copyright issues at stake. This chapter will help you make informed decisions about managing your content in the era of AI.
Understanding robots.txt
The robots.txt file is a simple text file placed on your web server that instructs search engine bots and other web crawlers on how to interact with your site. It allows you to control which parts of your site are accessible to these crawlers and which are off-limits. While originally designed for search engines like Google, the rise of AI-driven tools that scrape the web for content has made the robots.txt file a critical point of control for publishers.

If you check, for example, the New York Times’ robots.txt file, you will find a list of disallowed agents. This means that The New York Times does not allow ChatGPT, ClaudeBot, FacebookBot and others to crawl its content.
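As an illustration, a robots.txt file that turns away common AI crawlers while leaving everyone else free to index the site might look like the sketch below. The user-agent tokens are real crawler names, but the selection of rules is a hypothetical example, not the Times’ actual file:

```
# Block specific AI crawlers from the whole site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: FacebookBot
Disallow: /

# Everyone else (including ordinary search engines) may crawl
User-agent: *
Allow: /
```

Each `User-agent` block addresses one crawler by the name it announces; `Disallow: /` asks it to stay away from everything.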
As of earlier this week, more than half of the 1,158 news publishers checked by Ben Welsh on his Palewire project made the same choice.
This is the situation:
| Crawler | Opt outs | Percent |
| --- | --- | --- |
| OpenAI | 594 | 51.3% |
| Google AI | 510 | 44.0% |
| Common Crawl | 550 | 47.5% |
It’s a dynamic situation, so if you want to stay up to date, bookmark the page and check it regularly.
To block or not to block?
The robots.txt file can be used to ask AI crawlers not to access and scrape content on your site. There are two caveats, however: on the one hand, AI developers may simply not respect the request; on the other, the legal standing of robots.txt as a method of copyright protection is still under debate. While some argue that blocking AI crawlers is a legitimate way to protect content, others contend that it may not be enough to enforce copyright claims in court.
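If you want to check programmatically how a given robots.txt treats a particular crawler, Python’s standard `urllib.robotparser` module can parse the rules for you. A minimal sketch, using hypothetical rules rather than any real publisher’s file:

```python
# Check whether specific crawlers may fetch a URL under a robots.txt policy.
# The rules below are a hypothetical example, not any real publisher's file.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; any other agent falls through to "*".
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Keep in mind that robots.txt is purely advisory: this check tells you what the file requests, not what any crawler actually does.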
Let’s add a couple of considerations. First: you can’t prevent a user from simply copying and pasting your content, reading it to a multimodal LLM, or summarising it. You can’t, unless you feed the LLM with the content yourself (which, as you can probably imagine, completely misses the point: if you give content to an LLM, the LLM will use it).
Second: are you sure you don’t want your content to be used as training data by model developers? Are you sure you don’t want it to be accessible? Let’s say you block Perplexity, for example.
If you block it, you will not be listed among the sources Perplexity cites (even though you have no guarantee that its bot isn’t crawling your content archive anyway!).
On the other hand, blocking could give you leverage to get some money from these companies, since they still need your content to provide their users with good answers (which is why initiatives like the Perplexity Publishers’ Program came about).
The same can be said of legal battles such as The New York Times versus OpenAI, and of the economic agreements OpenAI has made with several publishers worldwide.
Depending on how these legal battles unfold, your decision on whether to block or allow AI crawlers could have significant implications for your digital strategy. If courts determine that robots.txt is a legally enforceable tool, it could empower publishers to protect their content more effectively.
However, if it’s deemed insufficient, you may need to explore additional legal protections. Moreover, if LLM producers strike agreements only with the biggest players, very little may be left for smaller, independent publishers.
Is scraping fair use?
AI crawlers often scrape content to train models or generate outputs, raising serious copyright concerns. According to the vast majority of voices in the journalistic field right now, when your content is used without permission, that not only violates copyright law but also devalues the original work. On the other side, some AI companies argue that scraping content to train models falls under fair use, especially if the content is transformed or used for non-commercial purposes. Many publishers disagree, leading to legal disputes. In this battle, journalistic publishers are on the same side as other players in the so-called cultural industry.
“There's nothing right about stealing an artist's lifetime of work, extracting its fundamental value, and repackaging it to compete directly with the originals.” “Where [startup] Suno sees musicians, teachers, and everyday people using a new tool to create original music, record labels see a threat to their market share.”
This, in essence, is the back-and-forth between the Recording Industry Association of America, which represents the music industry in the United States, and Suno, one of the companies producing generative music AI that has been sued by U.S. record labels. Suno has admitted to using copyrighted music to train its AI, and as expected, is appealing to the doctrine of fair use. Udio, the other company involved in the lawsuit filed by the record labels, has taken an identical defensive stance: “What Udio has done—using existing recordings as data to extract and analyze in order to identify patterns in the sounds of various musical styles, all to enable people to create their own new works—is the quintessence of fair use.” Who is right? That will be decided by U.S. judges.
But are we sure this makes sense for journalism? Yes, we don’t want big tech AI companies to make money off our work. But on the other hand, we need great journalism to spread as widely as possible. Are we sure that taking money from big tech companies once again – already knowing that sooner or later they will stop paying – is the solution?
As you can see, this is an issue that involves different considerations, different departments, and even different ways of seeing our work and our society.