
crawler-user-agents's Introduction

crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.

Each pattern is a regular expression. It should work out of the box with your favorite regex library.
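
For example, a minimal sketch in Python, assuming crawler-user-agents.json has been downloaded to the working directory:

import json
import re

with open("crawler-user-agents.json") as f:
    crawlers = json.load(f)

def is_crawler(user_agent):
    # Each entry's "pattern" field is a regular expression.
    return any(re.search(entry["pattern"], user_agent) for entry in crawlers)

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True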

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

JavaScript

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Python

Install with pip install crawler-user-agents

Then:

import crawleruseragents
if crawleruseragents.is_crawler("googlebot/"):
    pass  # do something

Go

In Go, use this package; it provides the global variable Crawlers (synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.

Example Go program:

package main

import (
	"fmt"

	// The package is declared as "agents", hence the explicit alias.
	agents "github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler' URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

I do welcome additions contributed as pull requests.

The pull requests should:

  • contain a single addition
  • specify a discriminating, relevant syntactic fragment (for example "totobot", not the full "Mozilla/5 totobot v20131212.alpha1")
  • contain the pattern (generic regular expression), the discovery date (year/month/day), and the official URL of the robot
  • result in a valid JSON file (don't forget the comma between items)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}

License

The list is under an MIT license. The versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots:

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:

crawler-user-agents's People

Contributors

alexwayfer, boegie, dan-blanchard, dspinellis, eggpi, fale, fallax, fekir, flopana, haumacher, iceq1337, jbyoshi, kavacky, kongdewen, marios88, mattmarcum, mirabellette, monperrus, nikwen, noctivityinc, omrilotan, pbinkley, perosb, robinki, samuelgiles, shimpeko, synhershko, ultcombo, vanushwashere, vetyy


crawler-user-agents's Issues

Add a metadata property to reflect general purpose of the UA

Hi,

Great library!

I am using this with a React app to return a static site if the UA is a scraper rather than a human viewer. The reason is that many scrapers used by sharing platforms like Facebook/Twitter/Slack do not render JavaScript, so my app content does not preview correctly.

It would be very helpful if we could sort and designate which UAs are used for which general purpose. For example, I thought I had captured all the UAs that are typically used by social sites to render preview links, but a user just sent me a screenshot of a Messenger message where the preview did not render correctly. This was because I had not configured my app to return the static site for that particular UA.

I think a simple general_purpose property would be helpful, with options such as browser, scrape:search, scrape:preview_render, etc.

What do you think?
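
For illustration, reusing the rogerbot entry shown in the README above, an entry with the proposed field might look like this (the general_purpose key and its value are hypothetical, not part of the current schema):

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "general_purpose": "scrape:search"
}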

duplicate entries

Nothing major, just an FYI :)

alex@quatermain:~$ cat crawler-user-agents.json | awk -F\" '/"pattern":/ { print $4 }' | sort | uniq -d
dotbot
Twitterbot

alex@quatermain:~$ cat crawler-user-agents.json | awk -F\" '/"pattern":/ { print $4 }' | tr A-Z a-z | sort | uniq -d
dotbot
livelapbot
twitterbot
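
For reference, a Python equivalent of the case-insensitive check (a sketch; assumes the JSON file is local):

import json
from collections import Counter

with open("crawler-user-agents.json") as f:
    patterns = [entry["pattern"] for entry in json.load(f)]

# Count patterns case-insensitively and report any that occur more than once.
dupes = [p for p, n in Counter(p.lower() for p in patterns).items() if n > 1]
print(dupes)  # e.g. ['dotbot', 'livelapbot', 'twitterbot']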

Add pattern for Facebook app prefetching

Not sure if this is something that crawler-user-agents should be covering or not.

I have found that I can identify requests coming from the Facebook app that are pre-fetches of the page (i.e. not genuine visits) using this pattern:

FBRV/[0-9][0-9]

Genuine requests from the Facebook IAB instead have

FBRV/0

I have, however, not been able to find this documented anywhere, so this may possibly not continue to be the case in future.

If this is something that would make sense to include in crawler-user-agents I can make a PR for it - otherwise mark this wontfix!
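
A quick sanity check of the proposed pattern in Python (the UA fragments below are abbreviated, illustrative stand-ins, not full Facebook user agents):

import re

# Two digits after FBRV/ indicate an app prefetch, per the report above.
PREFETCH = re.compile(r"FBRV/[0-9][0-9]")

print(bool(PREFETCH.search("... [FBAN/FB4A;FBRV/12] ...")))  # True: prefetch
print(bool(PREFETCH.search("... [FBAN/FB4A;FBRV/0] ...")))   # False: genuine in-app visit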

How to Install in Apache?

I know my question is beginner ...
But I really liked the project and I want to use it correctly ...

And my doubts are these:

1 - How to install correctly on a shared host.
2 - how to configure the folders that I do not want to be scanned.
3 - What should we replace in files and this extension is placed in a separate folder?
4 - Is there any robots for only images, like twitter and pinterest these are in the list?

Thank you for your attention!
And congratulations it's what I was looking for ...

Suggestion: test against false positives

Context

I use the package to distinguish crawlers from human users in an HTTP server. The goal is to prevent crawlers from "spoiling" one-time links shared in Discord and similar chats, which request every link sent to a chat in order to generate a preview. Because the link is one-time, the crawler's request consumes it, and it no longer opens when the human user clicks it. I solved this by blocking crawlers' access to such links. If you need more details, please see starius/pasta#8

Danger of false positives

If some legitimate browser sends a User-Agent which accidentally matches one of the patterns, the user won't be able to access the link, because the site will treat the request as originating from a crawler.

I guess other users of this package will also benefit if false positives are minimized.

Proposed solution

Let's add a test to CI which runs the most common User Agents through the patterns and fails if any of them matches.
The list of user agents can be loaded from here: https://github.com/microlinkhq/top-user-agents/tree/master/src
If somebody adds a pattern which matches any of them, it will be detected early and prevented.
Also, if some popular browser starts using a User Agent that accidentally matches one of the patterns, this will trigger a test failure too.
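
A minimal sketch of such a test in Python (the raw URL for the top-user-agents list and its JSON format are assumptions):

import json
import re
import urllib.request

# Hypothetical raw URL for the microlinkhq/top-user-agents list.
TOP_UA_URL = "https://raw.githubusercontent.com/microlinkhq/top-user-agents/master/src/index.json"

def test_no_false_positives():
    with open("crawler-user-agents.json") as f:
        patterns = [re.compile(entry["pattern"]) for entry in json.load(f)]
    with urllib.request.urlopen(TOP_UA_URL) as resp:
        browser_uas = json.load(resp)
    # No crawler pattern should match a common browser User-Agent.
    for ua in browser_uas:
        for p in patterns:
            assert not p.search(ua), f"{p.pattern!r} matches browser UA {ua!r}"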

Voyager crawler

I am using the Voight-Kampff gem to identify whether a visit is a crawler, and I recently noticed several non-bot requests detected as robots. They all match the voyager crawler, corresponding to kosmix.com (a website that no longer seems to be online), but they do so because the user agent includes the string "VOYAGER2 DG310", which happens to be a smartphone by Doogee.

I think this value should probably either be removed (as the site seems to be offline) or changed to "voyager " with a trailing space, but I'm not sure which option is the best solution, and that's why I'm opening an issue instead of a PR.

Raw list of user-agents

Hello,

I have access to a large database, and I have made a list of 71 user-agents used by bots which are not listed in this repository.
I don't have time to make a proper pull request, but if someone has time, they can make one from the list at this link.

https://privatebin.mirabellette.eu/?2caed8cd254e76a5#mDZoPge1GFGzHvEANyi3WFOkiBuADw6ZytFsa/dkG78=

or below

'Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)'
'Mozilla/5.0 (compatible; SemrushBot/1.2bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)'
'Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)'
'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
'Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, [email protected])'
'BUbiNG (+http://law.di.unimi.it/BUbiNG.html)'
'Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)'
'Wget/1.18 (linux-gnu)'
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
'Mozilla/5.0 (compatible; Cliqzbot/2.0; +http://cliqz.com/company/cliqzbot)'
'Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
'CCBot/2.0 (http://commoncrawl.org/faq/)'
'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
'Mozilla/5.0 (compatible; SemrushBot/3bl; +http://www.semrush.com/bot.html)'
'Mozilla/5.0 (compatible; Findxbot/1.0; +http://www.findxbot.com)'
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)'
'ZoominfoBot (zoominfobot at zoominfo dot com)'
'Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)'
'Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; [email protected])'
'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)'
'TelegramBot (like TwitterBot)'
'Mozilla/5.0 (compatible; bnf.fr_bot; +http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)'
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0'
'Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)'
'Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)'
'Googlebot/2.1 (+http://www.googlebot.com/bot.html)'
'yacybot (-global; amd64 Linux 4.4.0-116-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'MauiBot ([email protected])'
'MauiBot ([email protected])'
'Mozilla/5.0 (compatible; SemrushBot/1.0bm; +http://www.semrush.com/bot.html)'
'yacybot (-global; amd64 Linux 4.4.0-127-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; M bot 60 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.68 Mobile Safari/537.36'
'Mozilla/5.0 (compatible; AhrefsBot/5.2; News; +http://ahrefs.com/robot/)'
'yacybot (-global; amd64 Linux 4.4.0-128-generic; java 1.8.0_151; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; M bot 60 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36'
'yacybot (-global; amd64 Linux 4.4.0-128-generic; java 1.8.0_171; GMT/en) http://yacy.net/bot.html'
'CCBot/2.0 (https://commoncrawl.org/faq/)'
'yacybot (-global; amd64 Linux 4.4.0-131-generic; java 1.8.0_171; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 7.0; CUBOT MAGIC Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 UCBrowser/11.5.2.1188 (UCMini) Mobile'
'Mozilla/5.0 (Linux; Android 7.0; CUBOT MAGIC Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36'
'Mozilla/5.0 (compatible; SEOkicks; +https://www.seokicks.de/robot.html)'
'Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)'
'yacybot (-global; amd64 Linux 4.4.0-131-generic; java 1.8.0_181; GMT/en) http://yacy.net/bot.html'
'Mozilla/5.0 (Linux; Android 6.0; IDbot553 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.85 Mobile Safari/537.36'
'yacybot (/global; amd64 Linux 4.4.0-1031-aws; java 1.8.0_191-heroku; Etc/en) http://yacy.net/bot.html'
'yacybot (/global; amd64 Linux 4.15.18; java 1.8.0_192; Europe/de) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-141-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-142-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)'
'yacybot (-global; amd64 Linux 4.4.0-143-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36'
'Mozilla/5.0 (compatible; Snapbot/1.0; +http://www.snapchat.com)'
'MBCrawler/1.0 (https://monitorbacklinks.com/robot)'
'yacybot (-global; amd64 Linux 4.4.0-145-generic; java 1.8.0_191; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; Go-http-client/1.1; [email protected])'
'Mozilla/5.0 (compatible; FemtosearchBot/1.0; http://femtosearch.com)'
'yacybot (/global; amd64 Linux 5.0.6-gnu-1; java 11.0.3; UTC/en) http://yacy.net/bot.html'
'yacybot (-global; amd64 Linux 4.4.0-150-generic; java 1.8.0_212; Europe/en) http://yacy.net/bot.html'
'Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html) AppEngine-Google; (+http://code.google.com/appengine; appid: s~tutorialses-hrd)'
'Linguee Bot (http://www.linguee.com/bot; [email protected])'
'Mozilla/5.0 (compatible; DuckDuckBot-Https/1.1; https://duckduckgo.com/duckduckbot)'
'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html) AppEngine-Google; (+http://code.google.com/appengine; appid: s~tutorialses-hrd)'
'Mozilla/5.0 (compatible; bnf.fr_bot; +https://www.bnf.fr/fr/capture-de-votre-site-web-par-le-robot-de-la-bnf)'
'Keybot Translation-Search-Machine'
'Gigabot (1.1 1.2)'

Publish to npm

This list is great and I want to use it in my project, but it would be easier to use if it were published to npm. Would you mind publishing it to npm?

Problem with AdsBot-Google user agent

Hello,

I was using your awesome list to filter out some robot traffic, but recently the results got worse than before. I found out it is because of a change to the AdsBot-Google user agent pattern: it stopped matching the mobile version, AdsBot-Google-Mobile (https://support.google.com/adwords/answer/2404197).

Would it be possible to just change "AdsBot-Google(?!-)" back to "AdsBot-Google"?

I think all user agents that contain AdsBot-Google.*, now or in the future, are bots, so it is probably more robust to match it that way.

Something nice to have: I am also using the re2 library to match user agents (https://github.com/google/re2), but such a regex is not supported in re2 and it must fall back to the original re lib, which is a bit slower. So it would be wonderful to be able to stay compatible with re2.

  {
    "pattern": "AdsBot-Google(?!-)",
    "url": "https://support.google.com/webmasters/answer/1061943?hl=en"
  },
  {
    "pattern": "AdsBot-Google-Mobile-Apps",
    "addition_date": "2017/08/08",
    "url": "https://support.google.com/webmasters/answer/1061943?hl=en"
  },
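
For reference, a quick Python check of the lookahead behaviour described above (UA strings abbreviated):

import re

# The negative lookahead prevents a match on the mobile variant.
print(re.search(r"AdsBot-Google(?!-)", "AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html"))  # None
print(re.search(r"AdsBot-Google", "AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html"))       # matches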

Add links to language specific libraries and alternative databases in the README file

Other projects on GitHub provide language-specific front-ends to crawler-user-agents, which people visiting the GitHub page for this project may not know about:

I suggest the README should list both of these so that people can find them easily.

This project is also one of a number of different projects on GitHub with databases of crawlers. Other projects may be more suitable to their needs than this one.

I've listed some other projects that allow people to identify crawlers by user agent below. These are the projects I can find that have good-quality, comprehensive databases and which seem to be actively maintained. Others obviously exist, but these appear to be the good-quality ones.

I'd suggest that these would also be useful additions to the README.

Feature request virustotal

Please add the VirusTotal crawlers. They have several bots, like the following. You can check it yourself at https://www.virustotal.com/gui/home/url:

  1. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
  2. Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
  3. AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
  4. "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
  5. AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
    As well https://redirectdetective.com/ and https://wheregoes.com/

Yandex Browser

{ "pattern": "yandex" },
is too wide. There is also Yandex.Browser.

The Yandex bot's signature includes YandexBot.

JSON syntax issues

When I use json.lint, your JSON syntax is fine, but when I try to use it on WAMP (PHP 5.6) I get syntax errors. I narrowed it down to the line

"pattern": "Mediapartners \\(Googlebot\\)",

(I hope the backslashes will show in this comment)

I changed it to this and it worked fine:

"pattern": "Mediapartners [(]Googlebot[)]",

Since your JSON is technically valid you may not want to bother "fixing" what isn't broken, but I thought I would mention it here in case others run into the problem.
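
For context, the doubled backslashes are JSON string escaping rather than part of the regex itself, as a quick Python check shows:

import json
import re

# In the JSON source, \\( decodes to the regex escape \( .
pattern = json.loads('"Mediapartners \\\\(Googlebot\\\\)"')
print(pattern)                                                # Mediapartners \(Googlebot\)
print(bool(re.search(pattern, "Mediapartners (Googlebot)")))  # True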

New bot names to add

The following have appeared in my logs recently and could be added:

  • TelegramBot
  • Discordbot
  • DuckDuckGo-Favicons-Bot
  • Ocarinabot
  • Cliqzbot
  • epicbot
  • SEMrushBot
  • Primalbot
  • GnowitNewsbot
  • Leikibot
  • LinkArchiver
  • linkfluence
  • PaperLiBot
  • infegy.com
  • Digg Deeper
  • dcrawl
  • Snacktory
  • AndersPinkBot
  • Fyrebot
  • EveryoneSocialBot
  • LivelapBot

Hope this is helpful!

Ordering by most common

Most of the time, people using this code will be hoping to identify bots as quickly as possible. Putting the patterns in order of the most commonly encountered bots would speed up the process, allowing the matcher to optimize and exit early.

I did a very quick optimization using the frequency reported on this page:

https://deviceatlas.com/blog/list-of-web-crawlers-user-agents

And then I put all your patterns (concatenated with |) into two preg_match() calls:

if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT']) || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}

Providing a script to produce that might be a help...?
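
A minimal sketch of such a script in Python (the cut-off N is a placeholder, and the list would need to be sorted by observed frequency first):

import json

with open("crawler-user-agents.json") as f:
    patterns = [entry["pattern"] for entry in json.load(f)]

# Wrap each pattern in a non-capturing group so the alternation stays safe.
N = 50  # hypothetical cut-off between "most common" and "less common"
most_common = "|".join(f"(?:{p})" for p in patterns[:N])
less_common = "|".join(f"(?:{p})" for p in patterns[N:])

print(most_common)
print(less_common)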

Adding new version of existing bot entry

I've found that "Python-requests/2.26.0" is passing the check for bots.
Clearly, this version is not in the instances array for the python-requests pattern.

Is adding version 2.26.0 something that should also be PR'd if I would like to request it be added?
I know to open a PR if a NEW pattern is found that isn't in the JSON already.
I'm just unclear on adding a version to an already existing pattern in your repository.

Apologies if this seems like a silly question, just trying to do it right. 😁
Thanks!

Automatic npm continuous deployments broken

Found this error in the Travis CI logs:

$ npm version `node -e 'pacote=require("pacote");pacote.manifest("crawler-user-agents").then(pkgJson => { console.log(pkgJson.version); });'`
(node:3731) UnhandledPromiseRejectionWarning: ReferenceError: URL is not defined
    at regKeyFromURI (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/auth.js:7:18)
    at getAuth (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/auth.js:49:18)
    at regFetch (/home/travis/build/monperrus/crawler-user-agents/node_modules/npm-registry-fetch/index.js:55:16)
    at RegistryFetcher.packument (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/registry.js:86:15)
    at RegistryFetcher.manifest (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/registry.js:117:17)
    at Object.manifest (/home/travis/build/monperrus/crawler-user-agents/node_modules/pacote/lib/index.js:16:45)
    at [eval]:1:33
    at ContextifyScript.Script.runInThisContext (vm.js:50:33)
    at Object.runInThisContext (vm.js:139:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
(node:3731) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:3731) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Instagram bots details

https://mpulp.mobi/2016/07/12/instagram-user-agent/

iPhone:
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334

Android:
Mozilla/5.0 (Linux; Android 6.0.1; SM-G935T Build/MMB29M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/51.0.2704.81 Mobile Safari/537.36 Instagram 8.4.0 Android (23/6.0.1; 560dpi; 1440x2560; samsung; SM-G935T; hero2qltetmo; qcom; en_US

I noticed a recent bot which I believe is coming from Tencent but can't find documentation for - Ottabot

Hi! 👋

Firstly, thanks for your work on this project! 🙂

Today I used patch-package to patch [email protected] for the project I'm working on.

Here is the diff that solved my problem:

diff --git a/node_modules/crawler-user-agents/crawler-user-agents.json b/node_modules/crawler-user-agents/crawler-user-agents.json
index 13c9910..70836b8 100644
--- a/node_modules/crawler-user-agents/crawler-user-agents.json
+++ b/node_modules/crawler-user-agents/crawler-user-agents.json
@@ -5140,5 +5140,13 @@
       "Mozilla/5.0 (compatible; StractBot/0.1; open source search engine; +https://trystract.com/webmasters)"
     ],
     "url": "https://trystract.com/webmasters"
+  },
+  {
+    "pattern": "Ottabot",
+    "addition_date": "2023/09/06",
+    "instances": [
+      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 (Ottabot/0.1.0)"
+    ],
+    "url": ""
   }
 ]

This issue body was partially generated by patch-package.

Add Dynatrace / RuxitSynthetic into the list

Hey, I'm a bit new to this and could not figure out how to open a pull request. I have a monitoring bot to add to the list, and have prepared the JSON below.

{
  "pattern": "RuxitSynthetic",
  "addition_date": "2023/02/16",
  "url": "https://www.dynatrace.com/support/help/platform-modules/digital-experience/synthetic-monitoring/browser-monitors/configure-browser-monitors#expand--default-user-agent",
  "instances" : ["RuxitSynthetic/1.0"]
}

One thing I am struggling with: some parts of the user agent are dynamic, according to their docs. So I only added the specific, unique part to instances.

Can you help me with the implementation? Thanks!

Addition suggestion

Could you add Cookiebot?

{
  "pattern": "Cookiebot",
  "addition_date": "2022/01/23",
  "url": "https://www.cookiebot.com/",
  "instances": [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Cookiebot/1.0; +http://cookiebot.com/) Chrome/97.0.4692.71 Safari/537.36"
  ]
}

port CI to GitHub Actions

Travis CI is hardly usable.

Anybody willing to contribute the corresponding GitHub Actions workflow?

That would be great, thanks!

Code-friendly license

I respect every developer’s right to pick their own license, but have you considered using a more code-friendly license?

Unfortunately, CC licenses are problematic for code, as they are more targeted at artwork, and their definitions of "sharing" and "mixing" are ambiguous in the context of code.

My suggestion would be to switch to either the MIT or the LGPL license. MIT is rather permissive and very common these days; LGPL is more similar to CC-BY-SA.

Some user agents have incorrect capitalisation

Some user agents do not have the correct capitalisation and are currently all lowercase.

In particular:

  • AhrefsBot
  • Baiduspider
  • Wget

This means they do not match when used with "grep -f", and I assume they would also break with other regex engines.
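
For example, in Python (the UA string is taken from the raw list above):

import re

ua = "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"

print(re.search("ahrefsbot", ua))                 # None: case-sensitive match fails
print(re.search("AhrefsBot", ua))                 # matches
print(re.search("ahrefsbot", ua, re.IGNORECASE))  # matches, as a workaround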

Escape dots

Hi,
The 'pattern' field is a regular expression, isn't it?
If so, shouldn't the dots be escaped? (\. instead of .)
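
For context, an unescaped dot matches any character, so a pattern like archive.org_bot would also match unrelated strings; a quick Python illustration:

import re

print(re.search("archive.org_bot", "archiveXorg_bot"))    # matches: false positive
print(re.search(r"archive\.org_bot", "archiveXorg_bot"))  # None
print(re.search(r"archive\.org_bot", "archive.org_bot"))  # matches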

Some missing bot UAs

Hello.
Good job on this collection.
Here are some UAs we have found in our website logs that seem to be bots and that the current version does not include:

  • Done all, see below

The ones that are now handled (removed from the above list for sake of clarity):

Not kept:
