Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent #7590
Enhancement: Add known AI Crawler bots to disallow list in robots.txt to prevent crawling without specific user consent
This enhancement adds known AI crawler bots as disallow entries to WordPress' virtual robots.txt file, to prevent AI bots from crawling site content without specific user consent. This is done by changing the do_robots() function in wp-includes/functions.php: the updated code loads a list of known AI bots from a JSON file (ai-bots-for-robots-txt.json) and creates a User-agent: entry for each one, disallowing its access.

Why is this needed?
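To make the mechanism concrete, here is a minimal sketch of the kind of helper the updated do_robots() could call. The JSON structure is an assumption (a flat array of user-agent strings); function name and file location are illustrative, not the exact code in this PR. wp_json_file_decode() is an existing WordPress core function (since 5.9).

```php
<?php
/**
 * Sketch only: build robots.txt disallow rules for known AI crawler bots.
 * Assumes ai-bots-for-robots-txt.json is a flat JSON array of user-agent
 * strings, e.g. ["GPTBot", "CCBot", "PerplexityBot"].
 */
function wp_get_ai_bot_disallow_rules() {
	$bots = wp_json_file_decode( ABSPATH . WPINC . '/ai-bots-for-robots-txt.json' );

	if ( ! is_array( $bots ) ) {
		return '';
	}

	$output = '';
	foreach ( $bots as $bot ) {
		// One User-agent block per bot, disallowing the whole site.
		$output .= "User-agent: {$bot}\n";
		$output .= "Disallow: /\n\n";
	}

	return $output;
}
```

do_robots() would append this output to the virtual robots.txt before it is echoed, so the entries pass through the normal `robots_txt` filter like the existing rules.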
My perspective is that blocking AI bots by default in WordPress is a strong stance against the mass scraping of people's content for AI training, without their consent, by companies like OpenAI, Perplexity, Google and Apple.
Microsoft’s AI CEO Mustafa Suleyman was quoted recently saying:
This statement seems to say the quiet part out loud: many AI companies clearly believe that content shared publicly on the web is available for AI training by default, so unless the publisher specifically opts out, they see no problem with crawling it and absorbing it into their AI models.
I am aware that plugins already exist for blocking these bots, but they only help people who are aware of the issue and choose to act. I believe consent should be requested by these companies and granted by publishers, rather than the default being that companies can presume it's OK to scrape any website that doesn't specifically say "no".
Having 43%+ of websites on the internet suddenly say "no" by default seems like a strong message to send. I realise that robots.txt blocking won't stop anonymous bots that ignore it, but at least the legitimate companies that intend to honour it will take notice.
With the news that OpenAI is switching from a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content published using WordPress. Whilst the default would be to block the AI bots, publishers could still allow access to their content using the same methods currently available for modifying robots.txt in WordPress: plugins, custom code, etc.
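As an illustration of that opt-out path, a publisher could use the existing `robots_txt` filter (a real WordPress filter applied in do_robots()) to strip a bot's entry back out. The "GPTBot" user agent and the exact block format here are assumptions for the example:

```php
<?php
// Sketch: re-allow a specific AI bot by removing its entry from the
// virtual robots.txt via the existing 'robots_txt' filter.
add_filter(
	'robots_txt',
	function ( $output, $public ) {
		// Assumes each bot entry is emitted as a "User-agent:\nDisallow: /" block.
		$output = str_replace( "User-agent: GPTBot\nDisallow: /\n\n", '', $output );
		return $output;
	},
	10,
	2
);
```

This keeps the default restrictive while leaving the decision in the publisher's hands, with no new API surface needed.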
This PR has been marked as a duplicate in Trac, so I have updated the link to the earlier ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/60805