ai-robots-txt / ai.robots.txt

838.0 22.0 27.0 254 KB

A list of AI agents and robots to block.

Home Page: https://github.com/ai-robots-txt/ai.robots.txt/releases.atom

License: MIT License

PHP 25.45% Python 74.55%

ai crawlers crawling privacy

ai.robots.txt's Issues

Help populating table-of-bot-metrics.md

Help populate the table-of-bot-metrics.md to clarify bot activity.

ImageSift AI or not?

Reading ImageSift's about page, this seems to be an image search site rather than AI. If so, then ImageSift is out of scope for this project.

/cc @jsheard

PS. Apologies for not raising this until now. I've been on holiday for a few days.

Add FAQ

We could prime some questions in an FAQ from the Hacker News discussion. The main one is along the lines of "Why would an AI web crawler respect robots.txt?"

If we wanted to be brave, we could enable a wiki for this repo!

Add bots blocked by Tumblr

I was looking at the robots.txt on my Tumblr-hosted blog: https://blog.corbin.io/robots.txt

There are a few bots Tumblr is blocking that aren't in this list:

# Common Crawl's crawler
User-agent: CCBot
Disallow: /

# SentiBot's crawler
User-agent: sentibot
Disallow: /

# ImageSift's AI crawler
User-agent: ImagesiftBot
Disallow: /

# TurnitinBot crawler
User-agent: TurnitinBot
Disallow: /

Not sure if all of these should be blocked or just some of them.
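One way to check which of these user agents a given robots.txt actually blocks is Python's standard `urllib.robotparser`. The following is a minimal sketch, not part of this project; the helper name `blocked_agents` is made up for illustration, and the rules are the Tumblr excerpt quoted above with the comments dropped.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the Tumblr robots.txt excerpt quoted in this issue.
TUMBLR_RULES = """\
User-agent: CCBot
Disallow: /

User-agent: sentibot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: TurnitinBot
Disallow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> list[str]:
    """Return the agents (hypothetical helper) that may not fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


candidates = ["CCBot", "sentibot", "ImagesiftBot", "TurnitinBot", "Googlebot"]
print(blocked_agents(TUMBLR_RULES, candidates))
# ['CCBot', 'sentibot', 'ImagesiftBot', 'TurnitinBot']
```

Googlebot is not listed in the excerpt and there is no `User-agent: *` group, so the parser treats it as allowed.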

It might be useful to have an official mechanism for following updates to the repository, so site owners would know when their lists need to be updated. I just subscribed to the commits feed for the main robots.txt file with my RSS app:

https://github.com/ai-robots-txt/ai.robots.txt/commits/main/robots.txt.atom

That only covers changes to that file, though, and I'm not sure whether ai.txt is important here. Creating a new release for each change might be better? I'm not sure.

At minimum, it might be worth adding that RSS feed to the Readme as a basic existing notification system. I don't think many people know you can follow commits to individual files like that.
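For anyone who would rather poll the commits feed from a script than from an RSS app, here is a rough sketch using only the standard library. The function name `latest_entries` and the sample feed contents are invented for illustration; only the Atom namespace is standard. To use it against the real feed, the XML would first need to be downloaded from the URL above (for example with `urllib.request.urlopen`).

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace; GitHub commit feeds are Atom documents.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, standing in for the downloaded XML.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>Example commit message</title>
    <updated>2024-01-01T00:00:00Z</updated>
  </entry>
</feed>
"""


def latest_entries(atom_xml: str, limit: int = 5) -> list[tuple[str, str]]:
    """Return (updated, title) pairs for the first `limit` feed entries."""
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS)[:limit]:
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS).strip()
        results.append((updated, title))
    return results


for updated, title in latest_entries(SAMPLE_FEED):
    print(updated, title)
```

Comparing the newest `updated` value against the last one seen would be enough to tell a site owner that the list has changed.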

FriendlyCrawler and ImagesiftBot removed by mistake?

I noticed those are gone now, and it looks like they disappeared in the move to the JSON build process.

I assume that was an oversight but I thought I'd ask in case it was actually intended.
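A quick way to catch this kind of regression is to compare the user agents listed in two versions of robots.txt. The sketch below is only an illustration: the helper names and the two sample files are made up, and it does not reflect how the project's build actually works.

```python
def user_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value found in a robots.txt body."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent" and value.strip():
            agents.add(value.strip())
    return agents


def removed_agents(before: str, after: str) -> list[str]:
    """Return agents present in `before` but missing from `after`."""
    return sorted(user_agents(before) - user_agents(after))


# Hypothetical before/after contents, for illustration only.
BEFORE = "User-agent: CCBot\nUser-agent: FriendlyCrawler\nUser-agent: ImagesiftBot\nDisallow: /\n"
AFTER = "User-agent: CCBot\nDisallow: /\n"

print(removed_agents(BEFORE, AFTER))
# ['FriendlyCrawler', 'ImagesiftBot']
```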

If you block AdsBot-Google, will Google Ads stop showing your ads?

Can anyone confirm whether blocking AdsBot-Google stops Google Ads from showing your ads?

If so, maybe add a warning to the docs?

Add img2dataset

img2dataset is software that spiders sites for AI training. It's not run by a specific company so hits can come from anywhere.

It claims to honor robots.txt with the "img2dataset" user agent token, and X-Robots-Tag or HTML <meta> directives "noai" and "noimageai".
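As a rough illustration of the header side of this, the sketch below checks an `X-Robots-Tag` value for the two directives named above. The function name is hypothetical, and the parsing is deliberately simple: it assumes a plain comma-separated value and ignores the user-agent-scoped form of the header, so it is not a description of how img2dataset itself parses headers.

```python
# Directives named in this issue as AI opt-out signals.
AI_OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


def ai_opt_out_directives(x_robots_tag: str) -> set[str]:
    """Return the AI opt-out directives present in an X-Robots-Tag value.

    Assumes a simple comma-separated list such as "noindex, noai".
    """
    tokens = {token.strip().lower() for token in x_robots_tag.split(",")}
    return tokens & AI_OPT_OUT_DIRECTIVES


print(ai_opt_out_directives("noindex, NoAI"))  # {'noai'}
print(ai_opt_out_directives("noindex"))        # set()
```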

Mention Spawning AI’s `ai.txt` file

tl;dr: An independent body suggests adding a new file to the site root directory that asks organizations using AI for text and data mining (TDM) NOT to use site data for training purposes.

From their FAQ:

An ai.txt file is a simple text file placed in the root directory (or .well-known/) of your website that communicates with data miners. It provides instructions on whether the text and media files hosted on your domain can be used to train commercial AI models.

An example file I created:

# Spawning AI
# Prevent datasets from using the following file types

User-Agent: *
Disallow: /
Disallow: *

PS: Thanks for this project 💖

"Resources:" refers to broken links for apache.conf.txt and nginx.conf.txt.

Would you remove them?

Or better yet, move from the GitHub wiki (which community members can't contribute to through PRs) to a docs folder in the code or to the README.md.
