
crawler-user-agents's Introduction

crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.

Each pattern is a regular expression. It should work out of the box with your favorite regex library.
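
For example, a minimal sketch in Python, assuming crawler-user-agents.json has been downloaded to the working directory:

import json
import re

with open("crawler-user-agents.json") as f:
    crawlers = json.load(f)

def is_crawler(user_agent):
    # Each entry's "pattern" field is a regular expression.
    return any(re.search(entry["pattern"], user_agent) for entry in crawlers)

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True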

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

JavaScript

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Python

Install with pip install crawler-user-agents

Then:

import crawleruseragents
if crawleruseragents.is_crawler("googlebot/"):
    pass  # do something

Go

In Go, use this package; it provides the global variable Crawlers (synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.

Example Go program:

package main

import (
	"fmt"

	// The package is declared as "agents", hence the explicit alias.
	agents "github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler' URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

I do welcome additions contributed as pull requests.

The pull requests should:

  • contain a single addition
  • specify a discriminating, relevant syntactic fragment (for example "totobot", not the full "Mozilla/5 totobot v20131212.alpha1")
  • contain the pattern (generic regular expression), the discovery date (year/month/day), and the official URL of the robot
  • result in a valid JSON file (don't forget the comma between items)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}

License

The list is under an MIT license. The versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots:

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:

crawler-user-agents's People

Contributors

alexwayfer, boegie, dan-blanchard, dspinellis, eggpi, fale, fallax, fekir, flopana, haumacher, iceq1337, jbyoshi, kavacky, kongdewen, marios88, mattmarcum, mirabellette, monperrus, nikwen, noctivityinc, omrilotan, pbinkley, perosb, robinki, samuelgiles, shimpeko, synhershko, ultcombo, vanushwashere, vetyy


crawler-user-agents's Issues

Add a metadata property to reflect general purpose of the UA

Hi,

Great library!

I am using this with a React app to return a static site if the UA is a scraper rather than a human viewer. The reason is that many scrapers used by sharing platforms like Facebook/Twitter/Slack do not render JavaScript, so my app content does not preview correctly.

It would be very helpful if we could sort and designate which UAs are used for which general purpose. For example, I thought I had captured all the UAs that are typically used by social sites to render preview links, but a user just sent me a screenshot of a Messenger message where the preview did not render correctly. This was because I had not configured my app to return the static site for that particular UA.

I think a simple general_purpose property would be helpful, with options such as browser, scrape:search, scrape:preview_render, etc.

What do you think?
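
For illustration, reusing the rogerbot entry shown in the README above, an entry with the proposed field might look like this (the general_purpose key and its value are hypothetical, not part of the current schema):

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "general_purpose": "scrape:search"
}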

duplicate entries

Nothing major, just an FYI :)

alex@quatermain:~$ cat crawler-user-agents.json | awk -F\" '/"pattern":/ { print $4 }' | sort | uniq -d
dotbot
Twitterbot

alex@quatermain:~$ cat crawler-user-agents.json | awk -F\" '/"pattern":/ { print $4 }' | tr A-Z a-z | sort | uniq -d
dotbot
livelapbot
twitterbot
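
For reference, a Python equivalent of the case-insensitive check (a sketch; assumes the JSON file is local):

import json
from collections import Counter

with open("crawler-user-agents.json") as f:
    patterns = [entry["pattern"] for entry in json.load(f)]

# Count patterns case-insensitively and report any that occur more than once.
dupes = [p for p, n in Counter(p.lower() for p in patterns).items() if n > 1]
print(dupes)  # e.g. ['dotbot', 'livelapbot', 'twitterbot']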

Add pattern for Facebook app prefetching

Not sure if this is something that crawler-user-agents should be covering or not.

I have found that I can identify requests coming from the Facebook app that are pre-fetches of the page (i.e. not genuine visits) using this pattern:

FBRV/[0-9][0-9]

Genuine requests from the Facebook IAB instead have

FBRV/0

I have, however, not been able to find this documented anywhere, so this may possibly not continue to be the case in future.

If this is something that would make sense to include in crawler-user-agents I can make a PR for it - otherwise mark this wontfix!
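
A quick sanity check of the proposed pattern in Python (the UA fragments below are abbreviated, illustrative stand-ins, not full Facebook user agents):

import re

# Two digits after FBRV/ indicate an app prefetch, per the report above.
PREFETCH = re.compile(r"FBRV/[0-9][0-9]")

print(bool(PREFETCH.search("... [FBAN/FB4A;FBRV/12] ...")))  # True: prefetch
print(bool(PREFETCH.search("... [FBAN/FB4A;FBRV/0] ...")))   # False: genuine in-app visit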

How to Install in Apache?

I know my question is beginner ...
But I really liked the project and I want to use it correctly ...

And my doubts are these:

1 - How to install correctly on a shared host.
2 - how to configure the folders that I do not want to be scanned.
3 - What should we replace in files and this extension is placed in a separate folder?
4 - Is there any robots for only images, like twitter and pinterest these are in the list?

Thank you for your attention!
And congratulations it's what I was looking for ...

Suggestion: test against false positives

Context

I use the package to distinguish crawlers from human users in an HTTP server. The goal is to prevent crawlers from "spoiling" one-time links shared in Discord and similar chats, which request every link sent to a chat in order to generate a preview. Because the link is one-time, the crawler's request consumes it, and it no longer opens when the human user clicks it. I solved this by blocking crawlers' access to such links. If you need more details, please see starius/pasta#8

Danger of false positives

If some legitimate browser sends a User-Agent which accidentally matches one of the patterns, the user won't be able to access the link, because the site will treat the request as originating from a crawler.

I guess other users of this package will also benefit if false positives are minimized.

Proposed solution

Let's add a test to CI which runs the most common User Agents through the patterns and fails if any of them matches.
The list of user agents can be loaded from here: https://github.com/microlinkhq/top-user-agents/tree/master/src
If somebody adds a pattern which matches any of them, it will be detected early and prevented.
Also, if some popular browser starts using a User Agent that accidentally matches one of the patterns, this will trigger a test failure too.
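
A minimal sketch of such a test in Python (the raw URL for the top-user-agents list and its JSON format are assumptions):

import json
import re
import urllib.request

# Hypothetical raw URL for the microlinkhq/top-user-agents list.
TOP_UA_URL = "https://raw.githubusercontent.com/microlinkhq/top-user-agents/master/src/index.json"

def test_no_false_positives():
    with open("crawler-user-agents.json") as f:
        patterns = [re.compile(entry["pattern"]) for entry in json.load(f)]
    with urllib.request.urlopen(TOP_UA_URL) as resp:
        browser_uas = json.load(resp)
    # No crawler pattern should match a common browser User-Agent.
    for ua in browser_uas:
        for p in patterns:
            assert not p.search(ua), f"{p.pattern!r} matches browser UA {ua!r}"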

Voyager crawler

I am using the Voight-Kampff gem to identify whether a visit is a crawler, and I recently noticed several non-bot requests detected as robots. They all match the voyager crawler, corresponding to kosmix.com (a website that no longer seems to be online), but they do so because the user agent includes the string "VOYAGER2 DG310", which happens to be a smartphone by Doogee.

I think this value should probably either be removed (as the site seems to be offline) or changed to "voyager " with a trailing space, but I'm not sure which option is the best solution, and that's why I'm opening an issue instead of a PR.

Raw list of user-agents

Hello,

I have access to a large database, and I have made a list of 71 user-agents used by bots which are not listed in this repository.
I don't have time to make a proper pull request, but if someone has time, they can make one from the list at this link.

https://privatebin.mirabellette.eu/?2caed8cd254e76a5#mDZoPge1GFGzHvEANyi3WFOkiBuADw6ZytFsa/dkG78=

or below

'Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)'
'Mozilla/5.0 (compatible; SemrushBot/1.2bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)'
'Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)'
'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
'Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, [email protected])'
'BUbiNG (+http://law.di.unimi.it/BUbiNG.html)'
'Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)'
'Wget/1.18 (linux-gnu)'
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
'Mozilla/5.0 (compatible; Cliqzbot/2.0; +http://cliqz.com/company/cliqzbot)'
'Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
'CCBot/2.0 (http://commoncrawl.org/faq/)'
'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
'Mozilla/5.0 (compatible; SemrushBot/3bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (compatible; Findxbot/1.0; +http://www.findxbot.com)'
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)'
'ZoominfoBot (zoominfobot at zoominfo dot com)'
'Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)'
'Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; [email protected])'
'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)'
'TelegramBot (like TwitterBot)'
'Mozilla/5.0 (compatible; bnf.fr_bot; +http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)'
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0'
'Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)'
'Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)'
'Googlebot/2.1 (+http://www.googlebot.com/bot.html)'
'yacybot (-global; amd64 Linux 4.4.0-116-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'MauiBot ([email protected])'
'MauiBot ([email protected])'
'Mozilla/5.0 (compatible; SemrushBot/1.0bm; +http://www.semrush.com/bot.html)'
'yacybot (-global; amd64 Linux 4.4.0-127-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; M bot 60 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.68 Mobile Safari/537.36'
'Mozilla/5.0 (compatible; AhrefsBot/5.2; News; +http://ahrefs.com/robot/)'
'yacybot (-global; amd64 Linux 4.4.0-128-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; M bot 60 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36'
'yacybot (-global; amd64 Linux 4.4.0-128-generic; java 1.8.0_171; GMT/en) http://yacy.net/bot.html'
'CCBot/2.0 (https://commoncrawl.org/faq/)'
'yacybot (-global; amd64 Linux 4.4.0-131-generic; java 1.8.0_171; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; CUBOT MAGIC Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 UCBrowser/11.5.2.1188 (UCMini) Mobile'
'Mozilla/5.0 (Linux; Android 7.0; CUBOT MAGIC Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36'
'Mozilla/5.0 (compatible; SEOkicks; +https://www.seokicks.de/robot.html)'
'Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)'
'yacybot (-global; amd64 Linux 4.4.0-131-generic; java 1.8.0_181; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 6.0; IDbot553 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.85 Mobile Safari/537.36'
'yacybot (/global; amd64 Linux 4.4.0-1031-aws; java 1.8.0_191-heroku; Etc/en) http://yacy.net/bot.html'
'yacybot (/global; amd64 Linux 4.15.18; java 1.8.0_192; Europe/de) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-141-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-142-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)'
'yacybot (-global; amd64 Linux 4.4.0-143-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36'
'Mozilla/5.0 (compatible; Snapbot/1.0; +http://www.snapchat.com)'
'MBCrawler/1.0 (https://monitorbacklinks.com/robot)'
'yacybot (-global; amd64 Linux 4.4.0-145-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; Go-http-client/1.1; [email protected])'
'Mozilla/5.0 (compatible; FemtosearchBot/1.0; http://femtosearch.com)'
'yacybot (/global; amd64 Linux 5.0.6-gnu-1; java 11.0.3; UTC/en) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-150-generic; java 1.8.0_212; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html) AppEngine-Google; (+http://code.google.com/appengine; appid: s~tutorialses-hrd)'
'Linguee Bot (http://www.linguee.com/bot; [email protected])'
'Mozilla/5.0 (compatible; DuckDuckBot-Https/1.1; https://duckduckgo.com/duckduckbot)'
'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html) AppEngine-Google; (+http://code.google.com/appengine; appid: s~tutorialses-hrd)'
'Mozilla/5.0 (compatible; bnf.fr_bot; +https://www.bnf.fr/fr/capture-de-votre-site-web-par-le-robot-de-la-bnf)'
'Keybot Translation-Search-Machine'
'Gigabot (1.1 1.2)'

Publish to npm

This list is great and I want to use it in my project, but it would be easier to use if it were published to npm. Would you mind publishing it to npm?

Problem with AdsBot-Google user agent

Hello,

I was using your awesome list to filter out some robot traffic, but recently the results got worse than before. I found out it is because of a change to the AdsBot-Google user agent pattern: it stopped matching the mobile version, AdsBot-Google-Mobile (https://support.google.com/adwords/answer/2404197).

Would it be possible to just change "AdsBot-Google(?!-)" back to "AdsBot-Google"?

I think all user agents that contain AdsBot-Google.*, now or in the future, are bots, so it is probably more robust to match it that way.

Something nice to have: I am also using the re2 library to match user agents (https://github.com/google/re2), but such a regex is not supported in re2 and it must fall back to the original re lib, which is a bit slower. So it would be wonderful to be able to stay compatible with re2.

  {
    "pattern": "AdsBot-Google(?!-)",
    "url": "https://support.google.com/webmasters/answer/1061943?hl=en"
  },
  {
    "pattern": "AdsBot-Google-Mobile-Apps",
    "addition_date": "2017/08/08",
    "url": "https://support.google.com/webmasters/answer/1061943?hl=en"
  },
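
For reference, a quick Python check of the lookahead behaviour described above (UA strings abbreviated):

import re

# The negative lookahead prevents a match on the mobile variant.
print(re.search(r"AdsBot-Google(?!-)", "AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html"))  # None
print(re.search(r"AdsBot-Google", "AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html"))       # matches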

Add links to language specific libraries and alternative databases in the README file

Other projects on GitHub provide language-specific front-ends to crawler-user-agents, which people visiting the GitHub page for this project may not know about:

I suggest the README should list both of these so that people can find them easily.

This project is also one of a number of different projects on GitHub with databases of crawlers. Other projects may be more suitable to their needs than this one.

I've listed some other projects that allow people to identify crawlers by user agent below. These are the projects I can find that have good-quality, comprehensive databases and which seem to be actively maintained. Others obviously exist, but these appear to be the good-quality ones.

I'd suggest that these would also be useful additions to the README.

Feature request virustotal

Please add the VirusTotal crawlers. They have several bots, like the following. You can check it yourself at https://www.virustotal.com/gui/home/url:

  1. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
  2. Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
  3. AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
  4. "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
  5. AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
    As well https://redirectdetective.com/ and https://wheregoes.com/

Yandex Browser

{ "pattern": "yandex" },
is too wide. There is also Yandex.Browser.

The Yandex bot's signature includes YandexBot.

JSON syntax issues

When I use json.lint, your JSON syntax is fine, but when I try to use it on WAMP (PHP 5.6) I get syntax errors. I narrowed it down to the line

"pattern": "Mediapartners \\(Googlebot\\)",

(I hope the backslashes will show in this comment)

I changed it to this and it worked fine:

"pattern": "Mediapartners [(]Googlebot[)]",

Since your JSON is technically valid you may not want to bother "fixing" what isn't broken, but I thought I would mention it here in case others run into the problem.
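
For context, the doubled backslashes are JSON string escaping rather than part of the regex itself, as a quick Python check shows:

import json
import re

# In the JSON source, \\( decodes to the regex escape \( .
pattern = json.loads('"Mediapartners \\\\(Googlebot\\\\)"')
print(pattern)                                                # Mediapartners \(Googlebot\)
print(bool(re.search(pattern, "Mediapartners (Googlebot)")))  # True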

New bot names to add

The following have appeared in my logs recently and could be added:

  • TelegramBot
  • Discordbot
  • DuckDuckGo-Favicons-Bot
  • Ocarinabot
  • Cliqzbot
  • epicbot
  • SEMrushBot
  • Primalbot
  • GnowitNewsbot
  • Leikibot
  • LinkArchiver
  • linkfluence
  • PaperLiBot
  • infegy.com
  • Digg Deeper
  • dcrawl
  • Snacktory
  • AndersPinkBot
  • Fyrebot
  • EveryoneSocialBot
  • LivelapBot

Hope this is helpful!

Ordering by most common

Most of the time, people using this code will be hoping to identify bots as quickly as possible. Putting the patterns in order of the most commonly encountered bots would speed up the process, allowing the matcher to optimize and exit early.

I did a very quick optimization using the frequency reported on this page:

https://deviceatlas.com/blog/list-of-web-crawlers-user-agents

And then I put all your patterns (concatenated with |) into two preg_match() calls:

if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT']) || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}

Providing a script to produce that might be a help...?
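
A minimal sketch of such a script in Python (the cut-off N is a placeholder, and the list would need to be sorted by observed frequency first):

import json

with open("crawler-user-agents.json") as f:
    patterns = [entry["pattern"] for entry in json.load(f)]

# Wrap each pattern in a non-capturing group so the alternation stays safe.
N = 50  # hypothetical cut-off between "most common" and "less common"
most_common = "|".join(f"(?:{p})" for p in patterns[:N])
less_common = "|".join(f"(?:{p})" for p in patterns[N:])

print(most_common)
print(less_common)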

Adding new version of existing bot entry

I've found that "Python-requests/2.26.0" is passing the check for bots.
Clearly, this version is not in the instances array for the python-requests pattern.

Is adding version 2.26.0 something that should also be PR'd if I would like to request it be added?
I know to open a PR if a NEW pattern is found that isn't in the JSON already.
I'm just unclear on adding a version to an already existing pattern in your repository.

Apologies if this seems like a silly question, just trying to do it right. 😁
Thanks!

Automatic npm continuous deployments broken

Found this error in the Travis CI logs:

$ npm version `node -e 'pacote=require("pacote");pacote.manifest("crawler-user-agents").then(pkgJson => { console.log(pkgJson.version); });'`
(node:3731) UnhandledPromiseRejectionWarning: ReferenceError: URL is not defined
    at regKeyFromURI (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/auth.js:7:18)
    at getAuth (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/auth.js:49:18)
    at regFetch (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/index.js:55:16)
    at RegistryFetcher.packument (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/registry.js:86:15)
    at RegistryFetcher.manifest (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/registry.js:117:17)
    at Object.manifest (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/index.js:16:45)
    at [eval]:1:33
    at ContextifyScript.Script.runInThisContext (vm.js:50:33)
    at Object.runInThisContext (vm.js:139:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
(node:3731) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:3731) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Instagram bots details

https://mpulp.mobi/2016/07/12/instagram-user-agent/

iPhone:
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334

Android:
Mozilla/5.0 (Linux; Android 6.0.1; SM-G935T Build/MMB29M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/51.0.2704.81 Mobile Safari/537.36 Instagram 8.4.0 Android (23/6.0.1; 560dpi; 1440x2560; samsung; SM-G935T; hero2qltetmo; qcom; en_US

I noticed a recent bot which I believe is coming from Tencent but can't find documentation for - Ottabot

Hi! 👋

Firstly, thanks for your work on this project! 🙂

Today I used patch-package to patch [email protected] for the project I'm working on.

Here is the diff that solved my problem:

diff --git a/node_modules/crawler-user-agents/crawler-user-agents.json b/node_modules/crawler-user-agents/crawler-user-agents.json
index 13c9910..70836b8 100644
--- a/node_modules/crawler-user-agents/crawler-user-agents.json
+++ b/node_modules/crawler-user-agents/crawler-user-agents.json
@@ -5140,5 +5140,13 @@
       "Mozilla/5.0 (compatible; StractBot/0.1; open source search engine; +https://trystract.com/webmasters)"
     ],
     "url": "https://trystract.com/webmasters"
+  },
+  {
+    "pattern": "Ottabot",
+    "addition_date": "2023/09/06",
+    "instances": [
+      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 (Ottabot/0.1.0)"
+    ],
+    "url": ""
   }
 ]

This issue body was partially generated by patch-package.

Add Dynatrace / RuxitSynthetic into the list

Hey, I'm a bit new to this and could not figure out how to open a pull request. I have a monitoring bot to add to the list, and have prepared the JSON below.

{
  "pattern": "RuxitSynthetic",
  "addition_date": "2023/02/16",
  "url": "https://www.dynatrace.com/support/help/platform-modules/digital-experience/synthetic-monitoring/browser-monitors/configure-browser-monitors#expand--default-user-agent",
  "instances" : ["RuxitSynthetic/1.0"]
}

One thing I am struggling with: some parts of the user agent are dynamic, according to their docs. So I only added the specific, unique part to instances.

Can you help me with the implementation? Thanks!

Addition suggestion

Could you add Cookiebot?

{
  "pattern": "Cookiebot",
  "addition_date": "2022/01/23",
  "url": "https://www.cookiebot.com/",
  "instances": [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Cookiebot/1.0; +http://cookiebot.com/) Chrome/97.0.4692.71 Safari/537.36"
  ]
}

port CI to GitHub Actions

Travis CI is hardly usable.

Anybody willing to contribute the corresponding GitHub Actions workflow?

That would be great, thanks!

Code-friendly license

I respect every developer’s right to pick their own license, but have you considered using a more code-friendly license?

Unfortunately, CC licenses are problematic for code, as they are more targeted at artwork, and their definitions of "sharing" and "mixing" are ambiguous in the context of code.

My suggestion would be to switch to either the MIT or the LGPL license. MIT is rather permissive and very common these days; LGPL is more similar to CC-BY-SA.

Some user agents have incorrect capitalisation

Some user agents do not have the correct capitalisation and are currently all lowercase.

In particular:

  • AhrefsBot
  • Baiduspider
  • Wget

This means they do not match when used with "grep -f", and I assume they would also break with other regex engines.
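
For example, in Python (the UA string is taken from the raw list above):

import re

ua = "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"

print(re.search("ahrefsbot", ua))                 # None: case-sensitive match fails
print(re.search("AhrefsBot", ua))                 # matches
print(re.search("ahrefsbot", ua, re.IGNORECASE))  # matches, as a workaround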

Escape dots

Hi,
The 'pattern' field is a regular expression, isn't it?
If so, shouldn't the dots be escaped? (\. instead of .)
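
For context, an unescaped dot matches any character, so a pattern like archive.org_bot would also match unrelated strings; a quick Python illustration:

import re

print(re.search("archive.org_bot", "archiveXorg_bot"))    # matches: false positive
print(re.search(r"archive\.org_bot", "archiveXorg_bot"))  # None
print(re.search(r"archive\.org_bot", "archive.org_bot"))  # matches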

Some missing bot UAs

Hello.
Good job on this collection.
Here are some UAs we have found in our website logs that seem to be bots and that the current version does not include:

  • Done all, see below

The ones that are now handled (removed from the above list for sake of clarity):

Not kept:
