
robotstxt's Introduction

A ‘robots.txt’ Parser and ‘Webbot’/‘Spider’/‘Crawler’ Permissions Checker


Status

lines of R code: 1007, lines of test code: 1758

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Development version

0.7.13 - 2020-08-19 / 20:39:24

Description

Provides functions to download and parse ‘robots.txt’ files. Ultimately the package makes it easy to check whether bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.

License

MIT + file LICENSE
Peter Meissner [aut, cre], Kun Ren [aut, cph] (Author and copyright holder of list_merge.R.), Oliver Keys [ctb] (original release code review), Rich Fitz John [ctb] (original release code review)

Citation

citation("robotstxt")

BibTex for citing

toBibtex(citation("robotstxt"))

Contribution - AKA The-Think-Twice-Be-Nice-Rule

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:

As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (https://www.contributor-covenant.org/), version 1.0.0, available at https://www.contributor-covenant.org/version/1/0/0/code-of-conduct/

Installation

Installation and start - stable version

install.packages("robotstxt")
library(robotstxt)

Installation and start - development version

devtools::install_github("ropensci/robotstxt")
library(robotstxt)

Usage

Robotstxt class documentation

?robotstxt

Simple path access right checking (the functional way) …

library(robotstxt)
options(robotstxt_warn = FALSE)


paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"), 
  domain = "wikipedia.org", 
  bot    = "*"
)
##  wikipedia.org
## [1]  TRUE FALSE

paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc", 
    "https://wikipedia.org/w/"
  )
)
##  wikipedia.org                       wikipedia.org
## [1]  TRUE FALSE

… or (the object oriented way) …

library(robotstxt)
options(robotstxt_warn = FALSE)

rtxt <- 
  robotstxt(domain = "wikipedia.org")

rtxt$check(
  paths = c("/api/rest_v1/?doc", "/w/"), 
  bot   = "*"
)
## [1]  TRUE FALSE
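
The robotstxt object can also be constructed from raw robots.txt text via its text argument (see ?robotstxt), which makes it possible to check rules without any HTTP request. A minimal sketch follows; the rules used here are made up for illustration:

rtxt_offline <- 
  robotstxt(text = "User-agent: *\nDisallow: /private/\nAllow: /")

rtxt_offline$check(
  paths = c("/", "/private/data.csv"), 
  bot   = "*"
)
## expected: TRUE for "/" and FALSE for "/private/data.csv"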

Retrieval

Retrieving the robots.txt file for a domain:

# retrieval
rt <- 
  get_robotstxt("https://petermeissner.de")

# printing
rt
## [robots.txt]
## --------------------------------------
## 
## # just do it - punk

Interpretation

Checking whether or not one is supposedly allowed to access some resource on a web server is - unfortunately - not just a matter of downloading and parsing a simple robots.txt file.

First, there is no official specification for robots.txt files, so every robots.txt file written, read, and used is an interpretation. Most of the time we all have a common understanding of how things are supposed to work, but things get more complicated at the edges.

Some interpretation problems:

  • finding no robots.txt file at the server (e.g. HTTP status code 404) implies that everything is allowed
  • subdomains should have their own robots.txt file; if there is none, it is assumed that everything is allowed
  • redirects involving protocol changes - e.g. upgrading from http to https - are followed and not considered a domain or subdomain change - so whatever is found at the end of the redirect is considered to be the robots.txt file for the original domain
  • redirects from the www subdomain to the domain are not considered a domain change - so whatever is found at the end of the redirect is considered to be the robots.txt file for the subdomain originally requested

Event Handling

Because the interpretation of robots.txt rules does not just depend on the rules specified within the file, the package implements an event handler system that allows events to be interpreted and re-interpreted into rules.

Under the hood the rt_request_handler() function is called within get_robotstxt(). This function takes an {httr} request-response object and a set of event handlers. While processing the request and the handlers, it checks for various events and states around getting the file and reading in its content. If an event/state occurred, the event handlers are passed on to request_handler_handler() for problem resolution and for collecting robots.txt file transformations:

  • rule priorities decide whether a rule is applied, given the priority of the current state
  • if rules specify signals, those are emitted (e.g. error, message, warning)
  • often rules imply overwriting the raw content with a suitable interpretation, given the circumstances under which the file was (or was not) retrieved

Event handler rules can either consist of four items or be functions - the former being the usual case and the form used throughout the package itself. Functions like paths_allowed() have parameters that allow passing along handler rules or handler functions; a sketch of such a rule is given after the list of items below.

Handler rules are lists with the following items:

  • over_write_file_with: if the rule is triggered and has a higher priority than the rules applied beforehand (i.e. the new priority value is higher than the old one), then the retrieved robots.txt file will be overwritten by this character vector
  • signal: might be "message", "warning", or "error" and determines which signal function is used to signal the event/state just handled. Signaling a warning or a message can be suppressed by setting the function parameter warn = FALSE.
  • cache: whether or not the package is allowed to cache the result of the retrieval
  • priority: the priority of the rule, specified as a numeric value; rules with higher priority are allowed to overwrite robots.txt file content changed by rules with lower priority
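
For illustration, here is a sketch of a custom handler rule that follows this four-item structure: a hypothetical, stricter reaction to 404s passed along via the on_not_found parameter (the parameter name mirrors the default rules listed next; treat the exact signature as an assumption):

# hypothetical stricter 404 handling: treat a missing robots.txt as
# "disallow everything" instead of the default "allow everything"
on_not_found_strict <- list(
  over_write_file_with = "User-agent: *\nDisallow: /",
  signal               = "warning",
  cache                = FALSE,   # do not cache, the file might appear later
  priority             = 10
)

paths_allowed(
  paths        = "/some/path",
  domain       = "example.com",   # placeholder domain
  on_not_found = on_not_found_strict
)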

The package knows the following rules with the following defaults:

  • on_server_error :
  • given a server error - the server is unable to serve a file - we assume that something is terribly wrong and forbid all paths for the time being, but we do not cache the result so that we might get an updated file later on
on_server_error_default
## $over_write_file_with
## [1] "User-agent: *\nDisallow: /"
## 
## $signal
## [1] "error"
## 
## $cache
## [1] FALSE
## 
## $priority
## [1] 20
  • on_client_error :
  • client errors encompass all HTTP 4xx status codes except 404, which is handled separately
  • despite the fact that there are a lot of codes that might indicate that the client has to take action (authentication, billing, … see: https://de.wikipedia.org/wiki/HTTP-Statuscode), in the case of retrieving robots.txt with a simple GET request things should just work, so any client error is treated as if there is no file available and thus scraping is generally allowed
on_client_error_default
## $over_write_file_with
## [1] "User-agent: *\nAllow: /"
## 
## $signal
## [1] "warning"
## 
## $cache
## [1] TRUE
## 
## $priority
## [1] 19
  • on_not_found :
  • HTTP status code 404 has its own handler but is treated the same way as other client errors: it is handled as if no file is available and thus scraping is generally allowed
on_not_found_default
## $over_write_file_with
## [1] "User-agent: *\nAllow: /"
## 
## $signal
## [1] "warning"
## 
## $cache
## [1] TRUE
## 
## $priority
## [1] 1
  • on_redirect :
  • redirects are ok - often they redirect from the HTTP scheme to HTTPS - robotstxt will use whatever content it has been redirected to
on_redirect_default
## $cache
## [1] TRUE
## 
## $priority
## [1] 3
  • on_domain_change :
  • domain changes are handled as if the robots.txt file did not exist and thus scraping is generally allowed
on_domain_change_default
## $signal
## [1] "warning"
## 
## $cache
## [1] TRUE
## 
## $priority
## [1] 4
  • on_file_type_mismatch :
  • if {robotstxt} gets content with a content type other than text, it probably is not a robots.txt file; this situation is handled as if no file was provided and thus scraping is generally allowed
on_file_type_mismatch_default
## $over_write_file_with
## [1] "User-agent: *\nAllow: /"
## 
## $signal
## [1] "warning"
## 
## $cache
## [1] TRUE
## 
## $priority
## [1] 6
  • on_suspect_content :
  • if {robotstxt} cannot parse the content, it probably is not a robots.txt file; this situation is handled as if no file was provided and thus scraping is generally allowed
on_suspect_content_default
## $over_write_file_with
## [1] "User-agent: *\nAllow: /"
## 
## $signal
## [1] "warning"
## 
## $cache
## [1] TRUE
## 
## $priority
## [1] 7

Design Map for Event/State Handling

from version 0.7.x onwards

While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. While retrieval had already been implemented, there are corner cases in the retrieval stage that very well influence the interpretation of the permissions granted.

Features and Problems handled:

  • now handles corner cases of retrieving robots.txt files
  • e.g. if no robots.txt file is available this basically means “you can scrape it all”
  • but there are further corner cases (what if there is a server error, what if redirection takes place, what if redirection takes place to a different domain, what if a file is returned but it is not parsable or is of format HTML or JSON, …)

Design Decisions

  1. the whole HTTP request-response chain is checked for certain event/state types
    • server error
    • client error
    • file not found (404)
    • redirection
    • redirection to another domain
  2. the content returned by the HTTP request is checked for
    • mime type / file type specification mismatch
    • suspicious content (file content seems to be JSON, HTML, or XML instead of robots.txt)
  3. state/event handlers define how these states and events are handled
  4. a handler handler (request_handler_handler()) executes the rules defined in the individual handlers
  5. handlers can be overwritten
  6. handler defaults are defined so that they should always do the right thing
  7. handlers can …
    • overwrite the content of a robots.txt file (e.g. allow/disallow all)
    • modify how problems are signaled: error, warning, message, none
    • decide whether or not robots.txt file retrieval should be cached
  8. problems (no matter how they were handled) are attached to the robots.txt object as attributes, allowing for …
    • transparency
    • reacting post-mortem to the problems that occurred
  9. all handlers (even the actual execution of the HTTP request) can be overwritten at runtime to inject user-defined behaviour

Warnings

By default all functions retrieving robots.txt files will warn if

  • any HTTP events happened while retrieving the file (e.g. redirects) or
  • the content of the file does not seem to be a valid robots.txt file.

The warnings in the following example can be turned off in three ways:

(example)

library(robotstxt)

paths_allowed("petermeissner.de")
##  petermeissner.de
## [1] TRUE

(solution 1)

library(robotstxt)

suppressWarnings({
  paths_allowed("petermeissner.de")
})
##  petermeissner.de
## [1] TRUE

(solution 2)

library(robotstxt)

paths_allowed("petermeissner.de", warn = FALSE)
##  petermeissner.de
## [1] TRUE

(solution 3)

library(robotstxt)

options(robotstxt_warn = FALSE)

paths_allowed("petermeissner.de")
##  petermeissner.de
## [1] TRUE

Inspection and Debugging

The robots.txt files retrieved are basically just character vectors:

rt <- 
  get_robotstxt("petermeissner.de")

as.character(rt)
## [1] "# just do it - punk\n"

cat(rt)
## # just do it - punk

The last HTTP request-response is also stored in an object:

rt_last_http$request
## Response [https://petermeissner.de/robots.txt]
##   Date: 2020-09-03 19:05
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

But they also have some additional information stored as attributes.

names(attributes(rt))
## [1] "problems" "cached"   "request"  "class"

Events that might change the interpretation of the rules found in the robots.txt file:

attr(rt, "problems")
## $on_redirect
## $on_redirect[[1]]
## $on_redirect[[1]]$status
## [1] 301
## 
## $on_redirect[[1]]$location
## [1] "https://petermeissner.de/robots.txt"
## 
## 
## $on_redirect[[2]]
## $on_redirect[[2]]$status
## [1] 200
## 
## $on_redirect[[2]]$location
## NULL

The {httr} request-response object allows us to dig into what exactly went on in the client-server exchange.

attr(rt, "request")
## Response [https://petermeissner.de/robots.txt]
##   Date: 2020-09-03 19:05
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

… or lets us retrieve the original content given back by the server:

httr::content(
  x        = attr(rt, "request"), 
  as       = "text",
  encoding = "UTF-8"
)
## [1] "# just do it - punk\n"

… or have a look at the actual HTTP request issued and all response headers given back by the server:

# extract request-response object
rt_req <- 
  attr(rt, "request")

# HTTP request
rt_req$request
## <request>
## GET http://petermeissner.de/robots.txt
## Output: write_memory
## Options:
## * useragent: libcurl/7.64.1 r-curl/4.3 httr/1.4.1
## * ssl_verifypeer: 1
## * httpget: TRUE
## Headers:
## * Accept: application/json, text/xml, application/xml, */*
## * user-agent: R version 3.6.3 (2020-02-29)

# response headers
rt_req$all_headers
## [[1]]
## [[1]]$status
## [1] 301
## 
## [[1]]$version
## [1] "HTTP/1.1"
## 
## [[1]]$headers
## $server
## [1] "nginx/1.10.3 (Ubuntu)"
## 
## $date
## [1] "Thu, 03 Sep 2020 19:05:45 GMT"
## 
## $`content-type`
## [1] "text/html"
## 
## $`content-length`
## [1] "194"
## 
## $connection
## [1] "keep-alive"
## 
## $location
## [1] "https://petermeissner.de/robots.txt"
## 
## attr(,"class")
## [1] "insensitive" "list"       
## 
## 
## [[2]]
## [[2]]$status
## [1] 200
## 
## [[2]]$version
## [1] "HTTP/1.1"
## 
## [[2]]$headers
## $server
## [1] "nginx/1.10.3 (Ubuntu)"
## 
## $date
## [1] "Thu, 03 Sep 2020 19:05:45 GMT"
## 
## $`content-type`
## [1] "text/plain"
## 
## $`content-length`
## [1] "20"
## 
## $`last-modified`
## [1] "Thu, 03 Sep 2020 15:33:01 GMT"
## 
## $connection
## [1] "keep-alive"
## 
## $etag
## [1] "\"5f510cad-14\""
## 
## $`accept-ranges`
## [1] "bytes"
## 
## attr(,"class")
## [1] "insensitive" "list"

Transformation

For convenience the package also includes an as.list() method for robots.txt files.

as.list(rt)
## $content
## [1] "# just do it - punk\n"
## 
## $robotstxt
## [1] "# just do it - punk\n"
## 
## $problems
## $problems$on_redirect
## $problems$on_redirect[[1]]
## $problems$on_redirect[[1]]$status
## [1] 301
## 
## $problems$on_redirect[[1]]$location
## [1] "https://petermeissner.de/robots.txt"
## 
## 
## $problems$on_redirect[[2]]
## $problems$on_redirect[[2]]$status
## [1] 200
## 
## $problems$on_redirect[[2]]$location
## NULL
## 
## 
## 
## 
## $request
## Response [https://petermeissner.de/robots.txt]
##   Date: 2020-09-03 19:05
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

Caching

The retrieval of robots.txt files is cached on a per-R-session basis. Restarting an R session will invalidate the cache. Also, using the function parameter force = TRUE will force the package to re-retrieve the robots.txt file.

paths_allowed("petermeissner.de/I_want_to_scrape_this_now", force = TRUE, verbose = TRUE)
##  petermeissner.de                      rt_robotstxt_http_getter: force http get
## [1] TRUE
paths_allowed("petermeissner.de/I_want_to_scrape_this_now",verbose = TRUE)
##  petermeissner.de                      rt_robotstxt_http_getter: cached http get
## [1] TRUE

More information

robotstxt's People

Contributors

dmi3kno, gittaca, karthik, maelle, mine-cetinkaya-rundel, pedrobtz, petermeissner, sckott

robotstxt's Issues

Fix documentation errors

ropensci/software-review#25 by @sckott

✔  checking for missing documentation entries
W  checking for code/documentation mismatches
   Codoc mismatches from documentation object 'robotstxt':
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(domain = "mydomain.com")
     Argument names in code not in docs:
       text
     Mismatches in argument default values:
       Name: 'domain' Code: NULL Docs: "mydomain.com"
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(text = "User-agent: *\nDisallow: /")
     Argument names in code not in docs:
       domain
     Mismatches in argument names:
       Position: 1 Code: domain Docs: text
     Mismatches in argument default values:
       Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"

W  checking Rd \usage sections
   Undocumented arguments in documentation object 'get_robotstxt'
     ‘warn’

   Undocumented arguments in documentation object 'robotstxt'
     ‘domain’ ‘text’

use memoise caching approach

ropensci/software-review#25 by @richfitz

I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having to have any sort of object for the users to worry about.

cache <- new.env(parent=emptyenv())
path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    cache[[x$hostname]] <- obj <- robotstxt(x$hostname)
  }
  obj$check(x$path)
}

Test cases for new features in redirects

The new system to handle states of request and responses as well as content returned via get_robotstxt is the following.

Prior to the new system, things were handled from the perspective of some crude desired behavior ("if the server returns a 404 then ..."). Now there is a series of 'events' that can be handled. Handling can be done via the standard handler (request_handler_handler(request, handler, res, info)) or via a user-supplied function. The term 'event' is used very broadly and basically means that there is a set of things the system will notice; handler functions are then triggered to take care of manipulating the returned content of the robots.txt file, deciding whether or not the result should be cached, and deciding if and how messaging should be done: none, message, warning, or error.

So far everything seems to work fine with all tests that were already in place - meaning that most of the package should work as expected. But things related to HTTP requests and their interpretation were never really tested (because it's more complicated to mock HTTP requests back and forth and because it was not really an issue for interpreting permissions - until now).

tests and testing needed for...

  • redirect to "www."-subdomain
  • general HTTP interaction
  • server error
  • client error
  • not found
  • redirects
  • domain change
  • file type mismatch
  • suspect file content

declare encoding="UTF-8"

ropensci/software-review#25 by @richfitz

Avoid httr::content(rtxt) in package code (see ?httr::content) in favour of checking that the return type was correct. In development versions of httr I find I also have to declare encoding="UTF-8" to avoid a message on every request, too.
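
A minimal sketch of the suggested pattern (illustration only, not the package's actual code): check the returned content type first, then read the body with an explicit encoding.

response <- httr::GET("https://petermeissner.de/robots.txt")

# only treat the body as robots.txt if the server declares plain text
if ( identical(httr::http_type(response), "text/plain") ) {
  rtxt <- httr::content(response, as = "text", encoding = "UTF-8")
}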

Parsing would fail for comment in last line

robots.txt example:

User-agent: bot_1
Disallow: /path_1

# User-agent: bot_2
# Disallow: /path_2

# Sitemap: /sitemap.php

--> Error:

Error: Test failed: 'Commented-out tokens get parsed correctly'
* 'names' attribute [2] must be the same length as the vector [0]
Backtrace:
 1. testthat::expect_true(...)
 5. robotstxt::parse_robotstxt(rtxt_ct)
 6. robotstxt::rt_get_fields(txt, "allow") …robotstxtRparse_robotstxt.R:7:2
 7. base::lapply(...) …robotstxtRrt_get_fields.R:38:2
 8. robotstxt:::FUN(X[[i]], ...)

See PR: #59
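
A reproduction sketch (the example file from above pasted into a string):

rtxt <- paste(
  "User-agent: bot_1",
  "Disallow: /path_1",
  "",
  "# User-agent: bot_2",
  "# Disallow: /path_2",
  "",
  "# Sitemap: /sitemap.php",
  sep = "\n"
)

# on affected versions this errored inside rt_get_fields(); with the fix the
# commented-out lines should simply be ignored
robotstxt::parse_robotstxt(rtxt)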

10.7554/ELIFE.09944/ROBOTS.TXT

Hi there

I was checking through our recent Crossref failures report and found this: 10.7554/ELIFE.09944/ROBOTS.TXT

It had 115 entries for the eLife prefix that did not resolve. This is not a DOI :-)

I thought you'd be interested in case it is something to do with this code as it is suffixed with "ROBOTS.TXT".

Sorry if this is nothing to do with you, but I thought I'd check!

Melissa
(eLife)

Event on_redirect resulting in bad behaviour

get_robotstxt(domain = "en.wikipedia.org")
[robots.txt]
--------------------------------------

User-agent: *
Allow: /



[problems]
--------------------------------------

- on_redirect 
   https://en.wikipedia.org/robots.txt 


[attributes]
--------------------------------------

* overwrite = TRUE 
* cached    = TRUE 

versus

get_robotstxt(domain = "https://en.wikipedia.org")
[robots.txt]
--------------------------------------

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper



[...]



[attributes]
--------------------------------------

* overwrite = FALSE 
* cached    = TRUE 

special handling of redirects

In case of redirects to other domains, instead of just assuming the returned data is the requested robots.txt file, robotstxt should try to get the robots.txt file for that new domain (a rough sketch follows the example below):

Example:

  • GET github.io/robots.txt -> redirects to -> page.github.com
  • check for redirect to new domain?
  • yes: GET page.github.com/robots.txt
  • no: use result of redirect for robots.txt parsing and checking
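
A rough sketch of that proposal, using plain {httr} and {urltools} helpers rather than the package's internals (the function name get_robotstxt_follow() is made up for illustration):

library(httr)
library(urltools)

get_robotstxt_follow <- function(domain) {
  response   <- GET(paste0("http://", domain, "/robots.txt"))
  final_host <- urltools::domain(response$url)

  if ( !identical(final_host, domain) ) {
    # redirected to another domain: ask that domain for its own robots.txt
    response <- GET(paste0("http://", final_host, "/robots.txt"))
  }

  content(response, as = "text", encoding = "UTF-8")
}

# get_robotstxt_follow("github.io")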

GOV.UK Crawl-delay

Hi, thank you for this very handy package, which I learned about via https://github.com/dmi3kno/polite/.

I think there might be bug in the way Crawl-delay is parsed. Here is GOV.UK's robots.txt

User-agent: *
Disallow: /*/print$
# Don't allow indexing of user needs pages
Disallow: /info/*
Sitemap: https://www.gov.uk/sitemap.xml
# https://ahrefs.com/robot/ crawls the site frequently
User-agent: AhrefsBot
Crawl-delay: 10
# https://www.deepcrawl.com/bot/ makes lots of requests. Ideally
# we'd slow it down rather than blocking it but it doesn't mention
# whether or not it supports crawl-delay.
User-agent: deepcrawl
Disallow: /
# Complaints of 429 'Too many requests' seem to be coming from SharePoint servers
# (https://social.msdn.microsoft.com/Forums/en-US/3ea268ed-58a6-4166-ab40-d3f4fc55fef4)
# The robot doesn't recognise its User-Agent string, see the MS support article:
# https://support.microsoft.com/en-us/help/3019711/the-sharepoint-server-crawler-ignores-directives-in-robots-txt
User-agent: MS Search 6.0 Robot
Disallow: /

Which robotstxt interprets as a Crawl-delay of 10 for everyone.

robotstxt::robotstxt("https://www.gov.uk/")$crawl_delay
#>         field        useragent value
#> 1 Crawl-delay                *    10
#> 2 Crawl-delay        AhrefsBot    10
#> 3 Crawl-delay        deepcrawl    10
#> 4 Crawl-delay MSSearch6.0Robot    10

My interpretation is that no delay is specified for the * user-agent, and there is also no specific limit in GOV.UK's guidance.

You can scrape GOV.UK as long as you respect our robots.txt policy. If you make too many requests, your access will be limited until your request rate drops. This is known as rate limiting.
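
A small sketch (not a fix) for cross-checking the parsed crawl-delay table against the raw file:

rt_raw <- robotstxt::get_robotstxt("https://www.gov.uk/")

# raw Crawl-delay lines as they appear in the file - for the file shown above
# there is no such line for the "*" user agent
grep(
  "crawl-delay",
  strsplit(as.character(rt_raw), "\n")[[1]],
  value       = TRUE,
  ignore.case = TRUE
)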

improve validity check for robots.txt files

At the moment robotstxt does only very basic validity checking (can the file be parsed or not).

Standard formats like HTML, JSON, XML, ... that might be parsable but are clearly not robots.txt files to the human eye should be rejected as well.

If verbosity is turned on, robotstxt should inform the user about rejections.

Partial matching warnings

@petermeissner It works, thank you! Though I do get warnings.

library(robotstxt)
packageVersion("robotstxt")
#> [1] '0.7.2'

# works, but warning
paths_allowed("https://www.google.com")
#>  www.google.com
#> Warning in FUN(X[[i]], ...): partial argument match of 'x' to 'xp'
#> [1] TRUE

# also works, but also warning
paths_allowed("https://google.com")
#>  google.com
#> Warning in FUN(X[[i]], ...): partial argument match of 'x' to 'xp'
#> [1] TRUE

Created on 2020-05-04 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       macOS Catalina 10.15.4      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_GB.UTF-8                 
#>  ctype    en_GB.UTF-8                 
#>  tz       Europe/London               
#>  date     2020-05-04                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date       lib source                             
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                     
#>  backports      1.1.6      2020-04-05 [1] CRAN (R 4.0.0)                     
#>  callr          3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                     
#>  cli            2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                     
#>  codetools      0.2-16     2018-12-24 [1] CRAN (R 4.0.0)                     
#>  crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                     
#>  curl           4.3        2019-12-02 [1] CRAN (R 4.0.0)                     
#>  desc           1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                     
#>  devtools       2.3.0      2020-04-10 [1] CRAN (R 4.0.0)                     
#>  digest         0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                     
#>  ellipsis       0.3.0      2019-09-20 [1] CRAN (R 4.0.0)                     
#>  evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.0)                     
#>  fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                     
#>  fs             1.4.1      2020-04-04 [1] CRAN (R 4.0.0)                     
#>  future         1.17.0     2020-04-18 [1] CRAN (R 4.0.0)                     
#>  future.apply   1.5.0      2020-04-17 [1] CRAN (R 4.0.0)                     
#>  globals        0.12.5     2019-12-07 [1] CRAN (R 4.0.0)                     
#>  glue           1.4.0      2020-04-03 [1] CRAN (R 4.0.0)                     
#>  highr          0.8        2019-03-20 [1] CRAN (R 4.0.0)                     
#>  htmltools      0.4.0.9003 2020-05-01 [1] Github (rstudio/htmltools@984b39c) 
#>  httr           1.4.1      2019-08-05 [1] CRAN (R 4.0.0)                     
#>  knitr          1.28       2020-02-06 [1] CRAN (R 4.0.0)                     
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.0.0)                     
#>  magrittr       1.5        2014-11-22 [1] CRAN (R 4.0.0)                     
#>  memoise        1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                     
#>  pkgbuild       1.0.7      2020-04-25 [1] CRAN (R 4.0.0)                     
#>  pkgload        1.0.2      2018-10-29 [1] CRAN (R 4.0.0)                     
#>  prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                     
#>  processx       3.4.2      2020-02-09 [1] CRAN (R 4.0.0)                     
#>  ps             1.3.2      2020-02-13 [1] CRAN (R 4.0.0)                     
#>  R6             2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                     
#>  Rcpp           1.0.4.6    2020-04-09 [1] CRAN (R 4.0.0)                     
#>  remotes        2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                     
#>  rlang          0.4.6      2020-05-02 [1] CRAN (R 4.0.0)                     
#>  rmarkdown      2.1        2020-01-20 [1] CRAN (R 4.0.0)                     
#>  robotstxt    * 0.7.2      2020-05-04 [1] Github (ropensci/robotstxt@891f1d4)
#>  rprojroot      1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                     
#>  sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                     
#>  spiderbar      0.2.2      2019-08-19 [1] CRAN (R 4.0.0)                     
#>  stringi        1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                     
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                     
#>  testthat       2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                     
#>  triebeard      0.3.0      2016-08-04 [1] CRAN (R 4.0.0)                     
#>  urltools       1.7.3      2019-04-14 [1] CRAN (R 4.0.0)                     
#>  usethis        1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)     
#>  withr          2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                     
#>  xfun           0.13       2020-04-13 [1] CRAN (R 4.0.0)                     
#>  yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                     
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Originally posted by @mine-cetinkaya-rundel in #50 (comment)

Warning in README.rmd example code

library(robotstxt)

paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"), 
  domain = "wikipedia.org", 
  bot    = "*"
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
## 
 wikipedia.org
## [1]  TRUE FALSE

paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc", 
    "https://wikipedia.org/w/"
  )
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
## 
 wikipedia.org                      
 wikipedia.org
## [1]  TRUE FALSE

add other fields to parse_robotstxt()

at the moment parse_robotstxt() extracts:

  • useragents
  • comments
  • permissions
  • sitemap

but not

  • possible other fields

include all other fields in other.

parse_robotstxt <- function(txt){
  # return
  res <-
    list(
      useragents  = get_useragent(txt),
      comments    = get_comments(txt),
      permissions = get_permissions(txt),
      sitemap     = get_fields(txt, type="sitemap")
#      other = get_fields...??? TODO #
    )
  return(res)
}

paths_allowed gives error if www is included in URL

See reprex below:

library(robotstxt)

# doesn't work
paths_allowed("https://www.google.com")
#> www.google.com
#> Error in if (is_http) {: argument is of length zero

# works
paths_allowed("https://google.com")
#>  google.com                      No encoding supplied: defaulting to UTF-8.
#> [1] TRUE

Created on 2020-05-01 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       macOS Catalina 10.15.4      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_GB.UTF-8                 
#>  ctype    en_GB.UTF-8                 
#>  tz       Europe/London               
#>  date     2020-05-01                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date       lib source        
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports      1.1.6   2020-04-05 [1] CRAN (R 4.0.0)
#>  callr          3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
#>  cli            2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
#>  codetools      0.2-16  2018-12-24 [1] CRAN (R 4.0.0)
#>  crayon         1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
#>  curl           4.3     2019-12-02 [1] CRAN (R 4.0.0)
#>  desc           1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools       2.3.0   2020-04-10 [1] CRAN (R 4.0.0)
#>  digest         0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
#>  ellipsis       0.3.0   2019-09-20 [1] CRAN (R 4.0.0)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi          0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
#>  fs             1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
#>  future         1.17.0  2020-04-18 [1] CRAN (R 4.0.0)
#>  future.apply   1.5.0   2020-04-17 [1] CRAN (R 4.0.0)
#>  globals        0.12.5  2019-12-07 [1] CRAN (R 4.0.0)
#>  glue           1.4.0   2020-04-03 [1] CRAN (R 4.0.0)
#>  highr          0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  htmltools      0.4.0   2019-10-04 [1] CRAN (R 4.0.0)
#>  httr           1.4.1   2019-08-05 [1] CRAN (R 4.0.0)
#>  knitr          1.28    2020-02-06 [1] CRAN (R 4.0.0)
#>  listenv        0.8.0   2019-12-05 [1] CRAN (R 4.0.0)
#>  magrittr       1.5     2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise        1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
#>  pkgbuild       1.0.7   2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgload        1.0.2   2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits    1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
#>  processx       3.4.2   2020-02-09 [1] CRAN (R 4.0.0)
#>  ps             1.3.2   2020-02-13 [1] CRAN (R 4.0.0)
#>  R6             2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp           1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#>  remotes        2.1.1   2020-02-15 [1] CRAN (R 4.0.0)
#>  rlang          0.4.5   2020-03-01 [1] CRAN (R 4.0.0)
#>  rmarkdown      2.1     2020-01-20 [1] CRAN (R 4.0.0)
#>  robotstxt    * 0.6.2   2018-07-18 [1] CRAN (R 4.0.0)
#>  rprojroot      1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
#>  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  spiderbar      0.2.2   2019-08-19 [1] CRAN (R 4.0.0)
#>  stringi        1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat       2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
#>  usethis        1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
#>  withr          2.2.0   2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun           0.13    2020-04-13 [1] CRAN (R 4.0.0)
#>  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

It looks like the issue is related to httr, and it's possible this PR might fix it, but I'm not sure.

Case-sensitive robots.txt results in incorrect crawl delay

I'm trying to use the polite package for, well, polite web scraping. One problem I've run into is that it uses the robotstxt values for the crawl delays, but in this specific example it ends up with a crawl delay of 2000 (using the first line with *), which doesn't actually match the robots.txt values.

library(robotstxt)
r <- robotstxt("https://r-bloggers.com")
r$crawl_delay
#>         field useragent value
#> 1 Crawl-delay Googlebot     1
#> 2 Crawl-delay     spbot  2000
#> 3 Crawl-delay   BLEXBot  2000
#> 4 Crawl-delay         *  2000
#> 5 Crawl-delay         *    20

I think the problem is that one of the User-agents defined in the robots.txt file has a capital "A". Is this something that should definitely be fixed by the site, or would it be possible to make the argument matching case-insensitive?

https://r-bloggers.com/robots.txt

Showing only part...

User-agent: Googlebot-Mobile
Allow: /

User-agent: Googlebot
Crawl-delay: 1

User-agent: spbot
Crawl-delay: 2000

User-agent: BLEXBot
Crawl-delay: 2000

User-Agent: AhrefsBot 
Crawl-delay: 2000

User-agent: * 
Crawl-delay: 20

Thanks!
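
A sketch of the case-insensitive field matching asked for here (hypothetical helper code, not the package's actual parser):

lines <- c(
  "User-Agent: AhrefsBot",
  "Crawl-delay: 2000",
  "User-agent: *",
  "Crawl-delay: 20"
)

# match the field name regardless of capitalisation
is_ua <- grepl("^user-agent:", lines, ignore.case = TRUE)
sub("^user-agent:\\s*", "", lines[is_ua], ignore.case = TRUE)
## expected: "AhrefsBot" "*"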

Incomplete paths

Just wondering if this is expected behavior: robotstxt does not seem to be able to check an incomplete path (i.e. a path without a trailing /)?

robotstxt::paths_allowed("w", domain="https://www.wikipedia.org")
#>  https://www.wikipedia.org                      
#> [1] TRUE

robotstxt::paths_allowed("w/", domain="https://www.wikipedia.org")
#>  https://www.wikipedia.org                      
#> [1] FALSE

From the point of view of the user, wikipedia.org/w is a perfectly valid URL.

Package installation

The installation of additional packages (curl, digest and jsonlite in my case) collides with already installed packages:

devtools::install_github("petermeissner/robotstxt")
Downloading GitHub repo petermeissner/robotstxt@master
Installing robotstxt
Installing 3 packages: curl, digest, jsonlite
package ‘curl’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘curl’
package ‘digest’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘digest’

release 0.7.x

The next version of robotstxt should be released to CRAN.

Possible logic/regex error with `paths_allowed()`

(explanation follows)

library(robotstxt)

get_robotstxt("www.cdc.gov")
## # Ignore FrontPage files
## User-agent: *
## Disallow: /_borders
## Disallow: /_derived
## Disallow: /_fpclass
## Disallow: /_overlay
## Disallow: /_private
## Disallow: /_themes
## Disallow: /_vti_bin
## Disallow: /_vti_cnf
## Disallow: /_vti_log
## Disallow: /_vti_map
## Disallow: /_vti_pvt
## Disallow: /_vti_txt
## 
## # Do not index the following URLs
## Disallow: /travel/
## Disallow: /flu/espanol/
## Disallow: /migration/
## Disallow: /Features/SpinaBifidaProgram/
## Disallow: /concussion/HeadsUp/training/
## 
## # Don't spider search pages
## Disallow: /search.do
## 
## # Don't spider email-this-page pages
## Disallow: /email.do
##  
## # Don't spider printer-friendly versions of pages
## Disallow: /print.do
## 
## # Rover is a bad dog
## User-agent: Roverbot
## Disallow: /
## 
## # EmailSiphon is a hunter/gatherer which extracts email addresses for spam-mailers to use
## User-agent: EmailSiphon
## Disallow: /
## 
## # Exclude MindSpider since it appears to be ill-behaved
## User-agent: MindSpider
## Disallow: /
## 
## # Sitemap link per CR14586
## Sitemap: http://www.cdc.gov/niosh/sitemaps/sitemapsNIOSH.xml
## 
paths_allowed("/asthma/asthma_stats/default.htm", "www.cdc.gov")
## [1] FALSE

Via: https://technicalseo.com/seo-tools/robots-txt/


And:

import urllib.robotparser as robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://www.cdc.gov/robots.txt')
parser.read()
print(parser.can_fetch("*", "/asthma/asthma_stats/default.htm"))
## True

I was prepping a blog post to introduce a function that would prefix an httr::GET() request with a robots.txt path check (which is part of a larger personal project I'm working on). As you can see, it did not work properly on the CDC site which does, indeed, allow scraping.

It works fine on others in the examples I was using, such as https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.

I haven't poked at the code yet to see what may be causing the disconnect but I'll see if I can figure out what's going on.

I filed the issue first since this might just trigger an "Aha!" on your end with a quick fix :-)

dropping R6 dependency and use list implementation instead

ropensci/software-review#25 by @richfitz

Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used it seems an unnecessary weirdness for most users (and in my experience users do find them weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird (x <- robotstxt() not x <- robotstxt$new()).

robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)){
    self$text <- text
    if(!missing(domain)){
      self$domain <- domain
    }
  }else{
    if(!missing(domain)){
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    }else{
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data

  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths="/", bot="*", permission=self$permissions) {
    paths_allowed(permissions=permission, paths=paths, bot=bot)
  }

  class(self) <- "robotstxt"
  self
}

(this passes checks in the package by deleting only the $new in the test file).

Please use git tags and releases

hi 👋

We want all rOpenSci pkgs to consistently keep track of changes. You already use NEWS.md, thanks for that. Please also do the following:

  • git tag each CRAN version - you've git tagged some, but not all cran versions. To git tag commits for previous CRAN versions, you can find the commit, then do e.g., git tag -a v1.2 9fceb02 where 9fceb02 is the commit sha
  • use the releases tab in this repo to include the associated NEWS items for each tag/version, e.g., see https://github.com/ropensci/taxize/releases/tag/v0.8.8 - you've already done this, but please continue to do so.

Can you please do the two above items from now on?

Error when calling rt$check

Trying to reproduce the example code provided in ?robotstxt, the following line throws an error:

rt$check( paths = c("/", "forbidden"), bot="*")
Error in stri_replace_first_regex(string, pattern, replacement, opts_regex = attr(pattern, :
object 'tmp' not found

This happens regardless of the domain (I tried google.com and wikipedia.org).

future_lapply() in future package is deprecated

get_robotstxts("https://google.com", use_futures = T) gives a warning with version 1.8.1 of future:

The implementation of future_lapply() in the 'future' package has been deprecated. Please use the one in the 'future.apply' package instead.
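
The migration the warning asks for, sketched for illustration (the domains vector is made up):

library(future)
library(future.apply)

plan(multisession)

domains <- c("https://google.com", "https://wikipedia.org")
rtxts   <- future.apply::future_lapply(domains, robotstxt::get_robotstxt)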

Save cached/normal as attribute

As mentioned on Twitter, you are checking whether a cached version of the robots.txt exists, but you never write this property into the object.

  if ( verbose == TRUE ){
    message("rt_robotstxt_http_getter: force http get")
  }

} else if ( !is.null(rt_cache[[domain]]) ) {

  request <- rt_cache[[domain]]

  if ( verbose == TRUE ){
    message("rt_robotstxt_http_getter: cached http get")
  }

} else if ( is.null(rt_cache[[domain]]) ){

  request <-
    rt_robotstxt_http_getter(
      domain         = domain,
      user_agent     = user_agent,
      ssl_verifypeer = ssl_verifypeer[1]
    )

  rt_cache[[domain]] <- request

  if ( verbose == TRUE ){
    message("rt_robotstxt_http_getter: normal http get")
  }

Could we please save/update it in the request object? I want to be able to tell whether I have fresh copy or a copy from cache in code (without turning on the verbose mode).
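
In current versions the cached state does appear to be exposed as an attribute on the returned object (see names(attributes(rt)) in the README above), so a check without verbose mode might look like this:

rt <- robotstxt::get_robotstxt("petermeissner.de")

# whether this robots.txt came from the in-session cache (per the "cached"
# attribute shown in the README above)
attr(rt, "cached")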

transfer to ropensci

👋 @petermeissner! we no longer transfer onboarded packages to ropenscilabs but instead directly to ropensci so we should move this repo. Are you ok with my doing it? If so when would you have the time to update all links? I wouldn't transfer before you have time to do that, and it's not urgent to transfer anyway.

If possible could you please add this badge to the README of rnaturalearth when updating CI links?
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)

Thanks for your patience! 🙏

verbosity (minor redesign)

Make get_robotstxt(..., verbose = FALSE) and paths_allowed(..., verbose = FALSE) capable of verbose output, but tune verbosity out by default by setting the parameter verbose to FALSE.

Improve validity check: treat error messages as invalid

When using a try(robotstxt::get_robotstxt("domain.tld")) expression to enable skipping over curl or SSL errors, I'm running into the problem that a subsequent is_valid_robotstxt(data$get_robotstxt_output) check treats the error message that was saved as a valid robots.txt.

To reproduce:

> robotstxt::get_robotstxt("domain.tld")
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: domain.tld

> robotstxt::is_valid_robotstxt("Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: domain.tld")
[1] TRUE

Would it make sense to explicitly grepl for ([Dd]is)?allow: and User-agent: in is_valid_robotstxt.R to make the check stricter?

Related to #32.
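
A sketch of the stricter check proposed above (looks_like_robotstxt() is a hypothetical helper, not the package's is_valid_robotstxt()):

looks_like_robotstxt <- function(txt) {
  any(grepl("(^|\n)\\s*([Dd]is)?[Aa]llow:", txt)) ||
    any(grepl("(^|\n)\\s*[Uu]ser-agent:", txt))
}

looks_like_robotstxt("Error in curl::curl_fetch_memory(url, handle = handle)")
## expected: FALSE

looks_like_robotstxt("User-agent: *\nDisallow: /private/")
## expected: TRUE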

Second level domains and HTTP redirects

I wonder how robotstxt deals with redirects

# downloads html, signals a warning
rt <- robotstxt::get_robotstxt("github.io")
#> <!DOCTYPE html>
#> <html lang="en">
#> [...]
#> Warning message:
#> In robotstxt::get_robotstxt("github.io") :
#>   get_robotstxt(): github.io; Not valid robots.txt.

robotstxt::is_valid_robotstxt(rt)
#> [1] TRUE

# what's going on? 
httr::GET("github.io")
#> Response [https://pages.github.com/]
#>  Date: 2018-07-26 20:54
#>  ...

# this one is also valid
rt <- robotstxt::get_robotstxt("pages.github.com")
robotstxt::is_valid_robotstxt(rt)
#> [1] TRUE

# so what did we get?
print(rt)
#> Sitemap: https://pages.github.com/sitemap.xml

Couple of questions:

  • How do we handle redirects? A simple GET would indicate that we're looking in the wrong place. Should we warn the user and get the redirected robots.txt instead? Anyway, better than rendering an HTML header...
  • What happens if robots.txt is not delegated to the second-level domain? Clearly github.com has its very own proper robots.txt, but none of the second-level domains has its own properly populated file (checked pages.github.com, education.github.com, blog.github.com, raw.github.com, with the exception of gist.github.com). In my view, a file is not valid if there isn't a single rule in it. It is not necessarily corrupt - it just doesn't exist. Shouldn't we, again, warn the user and return robots.txt from the first-level domain instead?

In fact raw.github.com is explicitly disallowed,

robotstxt::paths_allowed(paths="raw", domain="github.com", bot="*")
#> [1] FALSE

but it is written in a slightly unusual way: there's no github.com/raw, so the rule in the main robots.txt actually refers to the second-level domain.

httr::GET("https://github.com/raw")
#> Response [https://github.com/raw]
#>   Date: 2018-07-26 21:43
#>   Status: 404
#> [...]
