diosts's Introduction

diosts

The disclose.io security.txt scraper (diosts) takes a list of domains as input, retrieves and validates each domain's security.txt if one is available, and outputs the result in the disclose.io JSON format.

Installation

Prerequisites: a working Golang installation >= 1.13

go get github.com/disclose/diosts/cmd/diosts

Usage

cat domains.txt | ~/go/bin/diosts -t <threads> 2>diosts.log >securitytxt.json

This will try to scrape the security.txt from each domain listed in domains.txt, using <threads> parallel threads (defaults to 8). Logging, with information on each input domain, goes to stderr and so ends up in diosts.log. A JSON array of the retrieved security.txt information, in disclose.io format, goes to stdout and so ends up in securitytxt.json.

For each input, the following URIs are tried, in order:

  1. https://<domain>/.well-known/security.txt
  2. https://<domain>/security.txt
  3. http://<domain>/.well-known/security.txt
  4. http://<domain>/security.txt
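The lookup order above can be expressed as two nested loops: HTTPS before HTTP, and the .well-known location before the root. A minimal sketch, where `candidateURLs` is a hypothetical helper name and not necessarily how diosts builds the list:

```go
package main

import "fmt"

// candidateURLs returns the security.txt locations for a domain in the order
// listed above: HTTPS before HTTP, .well-known before the root path.
func candidateURLs(domain string) []string {
	schemes := []string{"https", "http"}
	paths := []string{"/.well-known/security.txt", "/security.txt"}

	urls := make([]string, 0, len(schemes)*len(paths))
	for _, scheme := range schemes {
		for _, path := range paths {
			urls = append(urls, scheme+"://"+domain+path)
		}
	}
	return urls
}

func main() {
	for i, u := range candidateURLs("example.com") {
		fmt.Printf("%d. %s\n", i+1, u)
	}
}
```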

Any non-fatal violations of the security.txt specification will be logged.

Build

Note: building is not necessary if you follow the installation instructions; Go will take care of it for you.

git clone https://github.com/disclose/diosts
cd diosts
go build ./cmd/diosts

Notes

Redirects

According to the specifications, a redirect should be followed when retrieving security.txt. However:

"When retrieving the file and any resources referenced in the file, researchers should record any redirects since they can lead to a different domain or IP address controlled by an attacker. Further inspections of such redirects is recommended before using the information contained within the file."

At this point, we blindly accept redirects within the same organization (e.g., google.com to www.google.com is accepted). Any other redirect is logged as an error, to be dealt with later.

Canonical

A security.txt should contain a Canonical field with a URL pointing to the canonical version of the security.txt. We should check whether we retrieved the security.txt from the canonical URL and, if not, retrieve it from there.
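The check described here is not implemented yet according to the note, so the following is only a sketch of what it might look like. `canonicalURLs` and `canonicalTarget` are hypothetical helpers; the comparison is a plain string match, which is an assumption (a real check might normalize URLs first), and signed files are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// canonicalURLs collects the values of all Canonical fields in a security.txt
// body. Field names are matched case-insensitively; comments are skipped.
func canonicalURLs(body string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		idx := strings.Index(line, ":")
		if idx < 0 {
			continue
		}
		name := strings.TrimSpace(line[:idx])
		value := strings.TrimSpace(line[idx+1:])
		if strings.EqualFold(name, "Canonical") && value != "" {
			urls = append(urls, value)
		}
	}
	return urls
}

// canonicalTarget reports whether the file should be fetched again, and from
// where. It returns false if there is no Canonical field or if retrievedURL
// already matches one of the canonical URLs.
func canonicalTarget(retrievedURL string, canonical []string) (string, bool) {
	if len(canonical) == 0 {
		return "", false
	}
	for _, c := range canonical {
		if c == retrievedURL {
			return "", false
		}
	}
	return canonical[0], true
}

func main() {
	body := "Contact: mailto:security@example.com\n" +
		"Canonical: https://example.com/.well-known/security.txt\n"
	target, refetch := canonicalTarget("http://example.com/security.txt", canonicalURLs(body))
	fmt.Println(refetch, target)
}
```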

Program name

Currently, we use the input domain name as program name. This might or might not be correct, especially with redirects and canonical URL entries. To be discussed later.

diosts's People

Contributors

gi-el, hakluke, kenjoe41, yesnet0

diosts's Issues

panic: runtime error: index out of range

Conditions:
Running the script using the following list of domains on Ubuntu 18.04 LTS (go version go1.10.4 linux/amd64)

facebook.com
google.com
youtube.com
twitter.com
instagram.com
linkedin.com
microsoft.com
apple.com
wikipedia.org
plus.google.com

Result:

panic: runtime error: index out of range

goroutine 7 [running]:
github.com/disclose/diosts/internal/pkg/discloseio.FromSecurityTxt(0xc4201a8700, 0x7e9d16)
	/root/go/src/github.com/disclose/diosts/internal/pkg/discloseio/discloseio.go:71 +0x3a1
github.com/disclose/diosts/internal/app/run.(*Writer).Start.func1(0xc42010e300)
	/root/go/src/github.com/disclose/diosts/internal/app/run/writer.go:50 +0x195
created by github.com/disclose/diosts/internal/app/run.(*Writer).Start
	/root/go/src/github.com/disclose/diosts/internal/app/run/writer.go:34 +0x5c

Expected:
Normal completion.

Allow for comparison and automatic PR against diodb

The main function of diosts is to provide authoritative updates to diodb based on data it parses from security.txt files.

For each URL processed by diosts, it would be ideal if diosts (or another small service that consumes diosts output, and potentially the output of other similar scrapers) could:

  • Check if the insertion of a new object into diodb is appropriate,
  • Check if the updating of a key pair for an existing diodb object is appropriate, and
  • Formulate and push a PR to the diodb repo for review on a per-object basis, rather than in a batch.

Rename the "program_name" field to "security_txt_domain"

Can we rename the "program_name" field to "security_txt_domain" denoting the domain the security.txt was retrieved from?

"program_name" in diodb is tied to the company, organization, or business unit responsible for the policy and intake channel, which is a looser coupling than the data returned by diosts.

Looking at the merged data, I'm thinking that we should consider writing the diosts data to a separate data store, to allow for the differences between how security.txt renders this information and what we collect in diodb.

panic: runtime error: index out of range

Condition:
go version go1.11.6 linux/arm
cat top10milliondomains.txt | ~/go/bin/diosts -t 100 2>diosts.log >securitytxt.json
Source: https://www.domcop.com/files/top/top10milliondomains.csv.zip

Expected result:
Completion of task

Result:

goroutine 116 [running]:
github.com/disclose/diosts/pkg/securitytxt.baseDomain(0x2706707, 0x12, 0x12, 0x7)
	/home/pi/go/src/github.com/disclose/diosts/pkg/securitytxt/domain.go:209 +0x10c
github.com/disclose/diosts/pkg/securitytxt.checkRedirect(0x2dc8f80, 0x2838978, 0x1, 0x2, 0x2c436e0, 0x27)
	/home/pi/go/src/github.com/disclose/diosts/pkg/securitytxt/domain.go:190 +0x12c
net/http.(*Client).checkRedirect(0x207ab40, 0x2dc8f80, 0x2838978, 0x1, 0x2, 0x0, 0x600001)
	/usr/lib/go-1.11/src/net/http/client.go:416 +0x4c
net/http.(*Client).do(0x207ab40, 0x2dc9380, 0x0, 0x0, 0x0)
	/usr/lib/go-1.11/src/net/http/client.go:607 +0x738
net/http.(*Client).Do(0x207ab40, 0x2dc9380, 0x2126f30, 0x27, 0x0)
	/usr/lib/go-1.11/src/net/http/client.go:509 +0x24
net/http.(*Client).Get(0x207ab40, 0x2126f30, 0x27, 0x2126f30, 0x27, 0x0)
	/usr/lib/go-1.11/src/net/http/client.go:398 +0x7c
github.com/disclose/diosts/pkg/securitytxt.(*DomainClient).GetBody(0x20787d0, 0x2126f30, 0x27, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/pi/go/src/github.com/disclose/diosts/pkg/securitytxt/domain.go:143 +0x94
github.com/disclose/diosts/pkg/securitytxt.(*DomainClient).GetDomainBody(0x20787d0, 0x2304f50, 0x7, 0x1)
	/home/pi/go/src/github.com/disclose/diosts/pkg/securitytxt/domain.go:114 +0x15c
github.com/disclose/diosts/pkg/securitytxt.(*DomainClient).GetSecurityTxt(0x20787d0, 0x2304f50, 0x7, 0x0, 0x0, 0x0)
	/home/pi/go/src/github.com/disclose/diosts/pkg/securitytxt/domain.go:88 +0xcc
github.com/disclose/diosts/internal/app/run.(*WorkerPool).work(0x207ab60, 0x208e200)
	/home/pi/go/src/github.com/disclose/diosts/internal/app/run/workerpool.go:52 +0xac
created by github.com/disclose/diosts/internal/app/run.(*WorkerPool).Run
	/home/pi/go/src/github.com/disclose/diosts/internal/app/run/workerpool.go:34 +0x54

New diosts output fields

  • "source" = "diosts-$version" (This will identify the script version and where the information came from in light of the broader diodb data corpus, as well as any other bots we add in the future)
  • "last_update" = Timestamp that the script was run
  • "contact_email" = Contact email if present, and separate from "contact_url"
  • "retrieval_url" = The specific and full URL that the security.txt data was retrieved from (This will be used for is-alive garbage collection checks later)
