ai-robots-txt / ai.robots.txt
A list of AI agents and robots to block.
Home Page: https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
License: MIT License
Help populate the table-of-bot-metrics.md to clarify bot activity.
Reading ImageSift's about page, this seems to be an image search site rather than AI. If so, then ImageSift is out of scope for this project.
/cc @jsheard
PS. Apologies for not raising this until now. I've been on holiday for a few days.
We could seed a FAQ with questions from the Hacker News discussion. The main one is along the lines of "Why would an AI web crawler respect robots.txt?"
If we wanted to be brave, we could enable a wiki for this repo!
I was looking at the robots.txt on my Tumblr-hosted blog: https://blog.corbin.io/robots.txt
There are a few bots Tumblr is blocking that aren't in this list:
# Common Crawl's crawler
User-agent: CCBot
Disallow: /
# SentiBot's crawler
User-agent: sentibot
Disallow: /
# ImageSift's AI crawler
User-agent: ImagesiftBot
Disallow: /
# TurnitinBot crawler
User-agent: TurnitinBot
Disallow: /
Not sure if all of these should be blocked or just some of them.
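One way to check which user agents a given set of rules actually blocks is Python's standard `urllib.robotparser`. This is a minimal sketch: the rules are copied from the Tumblr snippet above, while `blocked_agents`, `SomeOtherBot`, and the `https://example.com/` URL are hypothetical names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the Tumblr robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: sentibot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: TurnitinBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that the given rules forbid from fetching `url`.

    `url` is a placeholder; any page on the site under test would do.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # "SomeOtherBot" is a made-up agent with no matching rule.
    candidates = ["CCBot", "sentibot", "ImagesiftBot", "TurnitinBot", "SomeOtherBot"]
    print(blocked_agents(ROBOTS_TXT, candidates))
```

This only shows what a compliant crawler would conclude from the file; it says nothing about whether a particular bot honors it.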
It might be useful to have an official mechanism for following updates to the repository, so site owners would know when their lists need to be updated. I just subscribed to the commits feed for the main robots.txt file with my RSS app:
https://github.com/ai-robots-txt/ai.robots.txt/commits/main/robots.txt.atom
That only covers changes to that file, though, and I'm not sure whether ai.txt is important here. Creating a new release for each change might be better? I'm not sure.
At minimum, it might be worth adding that RSS feed to the Readme as a basic existing notification system. I don't think many people know you can follow commits to individual files like that.
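For anyone who wants to script this rather than use an RSS app, a commits feed is plain Atom and can be read with the standard library. The sketch below is illustrative only: `SAMPLE_FEED` is an invented stand-in for the real feed (its IDs and titles are not actual commits), and `new_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample data standing in for the downloaded feed body.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Recent Commits</title>
  <entry>
    <id>tag:example,2024:commit/aaa111</id>
    <title>Add ExampleBot</title>
    <updated>2024-05-02T10:00:00Z</updated>
  </entry>
  <entry>
    <id>tag:example,2024:commit/bbb222</id>
    <title>Fix typo</title>
    <updated>2024-04-28T09:30:00Z</updated>
  </entry>
</feed>
"""


def new_entries(feed_xml, seen_ids):
    """Return (id, title) pairs for feed entries whose id is not in `seen_ids`."""
    root = ET.fromstring(feed_xml)
    result = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        if entry_id not in seen_ids:
            result.append((entry_id, title))
    return result


if __name__ == "__main__":
    already_seen = {"tag:example,2024:commit/bbb222"}
    for entry_id, title in new_entries(SAMPLE_FEED, already_seen):
        print(f"New change: {title} ({entry_id})")
```

In practice the feed text would be downloaded from the URL above and the set of seen IDs stored between runs.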
I noticed those are gone now, and it looks like they disappeared in the move to the JSON build process.
I assume that was an oversight but I thought I'd ask in case it was actually intended.
If you block AdsBot-Google, will Google Ads stop showing your ads?
Can anyone confirm this?
If so, maybe add some warning in docs?
img2dataset is software that spiders sites for AI training. It's not run by a specific company so hits can come from anywhere.
It claims to honor robots.txt with the "img2dataset" user agent token, and X-Robots-Tag or HTML <meta> directives "noai" and "noimageai".
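To make the header side of this concrete, here is a rough sketch of how a crawler might interpret an `X-Robots-Tag` value. It is not img2dataset's actual code. It assumes directives are comma-separated and that an optional `agent:` prefix scopes the directives that follow it within the same header value; `disallows_ai` is a hypothetical name.

```python
# Directive names taken from the description above.
AI_OPT_OUT = {"noai", "noimageai"}


def disallows_ai(header_value, agent_token="img2dataset"):
    """Return True if an X-Robots-Tag value opts out of AI use for `agent_token`.

    Simplified on purpose: directives whose own value contains a colon
    (for example a date) are not handled.
    """
    current_agent = None  # None means the directive applies to every agent
    for part in header_value.split(","):
        part = part.strip().lower()
        if ":" in part:
            agent, _, directive = part.partition(":")
            current_agent = agent.strip()
            part = directive.strip()
        applies = current_agent is None or current_agent == agent_token.lower()
        if applies and part in AI_OPT_OUT:
            return True
    return False


if __name__ == "__main__":
    for value in ["noai", "img2dataset: noimageai", "otherbot: noai", "index, follow"]:
        print(repr(value), disallows_ai(value))
```

A site owner would send the header from the web server; the function above only illustrates how such a value could be read on the crawler side.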
tl;dr: independent body suggests adding new file to site root directory to ask organizations that use AI for text and data mining (TDM) to NOT use site data for training purposes.
From their FAQ:
An ai.txt file is a simple text file placed in the root directory (or .well-known/) of your website that communicates with data miners. It provides instructions on whether the text and media files hosted on your domain can be used to train commercial AI models.
An example file I created:
# Spawning AI
# Prevent datasets from using the following file types
User-Agent: *
Disallow: /
Disallow: *
PS: Thanks for this project.
Resources:
The idea is to scrape the content of Dark Visitors using a bot and generate PRs for this project. A bit like dependabot.
Should we include Apache and Nginx rewrites for the bots that don't respect robots.txt? Bytespider doesn't seem to respect it. Or maybe it only reads robots.txt infrequently; I don't know which.
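The core of any such server-side rule is a case-insensitive match on the User-Agent header. The sketch below shows that matching logic in Python rather than in Apache or Nginx syntax; `BLOCKED_TOKENS`, `build_blocklist_pattern`, and `status_for` are hypothetical names, and the token list is just an example drawn from bots mentioned in this thread.

```python
import re

# Example tokens only; a real list would come from the project's data.
BLOCKED_TOKENS = ["Bytespider", "CCBot"]


def build_blocklist_pattern(tokens):
    """Compile one case-insensitive regex that matches any of the tokens."""
    if not tokens:
        # An empty alternation would match every user agent.
        raise ValueError("tokens must not be empty")
    return re.compile("|".join(re.escape(token) for token in tokens), re.IGNORECASE)


def status_for(user_agent, pattern):
    """Return 403 for a blocked user agent, 200 otherwise."""
    return 403 if pattern.search(user_agent or "") else 200


if __name__ == "__main__":
    pattern = build_blocklist_pattern(BLOCKED_TOKENS)
    print(status_for("Mozilla/5.0 (compatible; Bytespider)", pattern))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0", pattern))
```

Unlike robots.txt, this kind of check does not depend on the bot cooperating, although a bot that changes its User-Agent string would still get through.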
This wiki page
https://github.com/ai-robots-txt/ai.robots.txt/wiki/Frequently-asked-questions
refers to broken links for apache.conf.txt and nginx.conf.txt.
Would you remove them?
Or better yet, move from the GitHub wiki (which community members can't contribute to through PRs) to a docs folder in the code or the README.md