Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent #7590
Enhancement: Add known AI Crawler bots to disallow list in robots.txt to prevent crawling without specific user consent
This enhancement adds known AI crawler bots as disallow entries to WordPress' virtual robots.txt file, to prevent AI bots from crawling site content without specific user consent. This is done by changing the do_robots() function in wp-includes/functions.php: the updated code loads a list of known AI bots from a JSON file (ai-bots-for-robots-txt.json) and creates a User-agent: entry for each one, disallowing its access.

Why is this needed?
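To make the mechanism concrete, here is a minimal sketch of the kind of helper the updated do_robots() could call. The JSON structure is an assumption (a flat array of user-agent strings); function name and file location are illustrative, not the exact code in this PR. wp_json_file_decode() is an existing WordPress core function (since 5.9).

```php
<?php
/**
 * Sketch only: build robots.txt disallow rules for known AI crawler bots.
 * Assumes ai-bots-for-robots-txt.json is a flat JSON array of user-agent
 * strings, e.g. ["GPTBot", "CCBot", "PerplexityBot"].
 */
function wp_get_ai_bot_disallow_rules() {
	$bots = wp_json_file_decode( ABSPATH . WPINC . '/ai-bots-for-robots-txt.json' );

	if ( ! is_array( $bots ) ) {
		return '';
	}

	$output = '';
	foreach ( $bots as $bot ) {
		// One User-agent block per bot, disallowing the whole site.
		$output .= "User-agent: {$bot}\n";
		$output .= "Disallow: /\n\n";
	}

	return $output;
}
```

do_robots() would append this output to the virtual robots.txt before it is echoed, so the entries pass through the normal `robots_txt` filter like the existing rules.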
My perspective is that blocking AI bots by default in WordPress is a strong stance against the mass scraping of people's content for AI training, without their consent, by companies like OpenAI, Perplexity, Google and Apple.
Microsoft’s AI CEO Mustafa Suleyman was quoted recently saying:
This statement seems to say the quiet part out loud: many AI companies clearly believe that content shared publicly on the web is available for AI training by default, so unless the publisher specifically opts out, they see no problem with crawling it and absorbing it into their AI models.
I am aware that plugins already exist for blocking these bots, but they only help people who are aware of the issue and choose to act. I believe consent should be requested by these companies and granted by publishers, rather than the default being that companies can presume it's OK to scrape any website that doesn't specifically say "no".
Having 43%+ of websites on the internet suddenly say "no" by default seems like a strong message to send. I realise that robots.txt blocking won't stop anonymous bots that ignore it, but at least the legitimate companies that intend to honour it will take notice.
With the news that OpenAI is switching from a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content published using WordPress. Whilst the default would be to block the AI bots, publishers could still allow access to their content using the same methods currently available for modifying robots.txt in WordPress: plugins, custom code, etc.
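As an illustration of that opt-out path, a publisher could use the existing `robots_txt` filter (a real WordPress filter applied in do_robots()) to strip a bot's entry back out. The "GPTBot" user agent and the exact block format here are assumptions for the example:

```php
<?php
// Sketch: re-allow a specific AI bot by removing its entry from the
// virtual robots.txt via the existing 'robots_txt' filter.
add_filter(
	'robots_txt',
	function ( $output, $public ) {
		// Assumes each bot entry is emitted as a "User-agent:\nDisallow: /" block.
		$output = str_replace( "User-agent: GPTBot\nDisallow: /\n\n", '', $output );
		return $output;
	},
	10,
	2
);
```

This keeps the default restrictive while leaving the decision in the publisher's hands, with no new API surface needed.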
This PR has been marked as a duplicate in Trac, so I have updated the link to the earlier ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/60805