
sonar's Introduction

While some historical data can be found in the wiki, all current information is maintained on the Rapid7 Open Data website.

sonar's People

Contributors

arobinson-r7, jhart-r7, simonirwin-r7, tsellers-r7


sonar's Issues

Question about content of host files

Hi,

I have some questions regarding the Rapid7 SSL Certificates files.
I'm currently parsing all the files to get an overview of the certificates and their IP addresses.

The files are parsed as follows:

  • Order the files from the oldest to the newest
  • Parse the cert file, and store each certificate in database
  • Parse the host file, and update the certificate in database to add corresponding ips
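The steps above can be sketched as follows. This is a minimal sketch, not the asker's actual code: the function name is made up, the "database" is an in-memory dict, and filenames are assumed to begin with a sortable date prefix like `20201228-...`. Using a set per fingerprint makes re-scanned IPs idempotent, which also answers part of the duplication question below.

```python
from collections import defaultdict

def merge_host_files(host_files):
    """Merge Sonar *_hosts files (CSV lines of "ip,sha1") into a
    fingerprint -> set-of-IPs mapping, processing oldest file first.

    host_files: iterable of (filename, lines) pairs; filenames are
    assumed to start with a sortable date prefix. A set per
    fingerprint collapses duplicate sightings across scans."""
    ips_by_cert = defaultdict(set)
    for _, lines in sorted(host_files, key=lambda pair: pair[0]):
        for line in lines:
            ip, sha1 = line.strip().split(",", 1)
            ips_by_cert[sha1].add(ip)
    return ips_by_cert
```

For the real multi-gigabyte files you would stream from gzip and write to a database instead of holding everything in memory.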

I noticed that some IP addresses are duplicated across some certificates, which brings me to these two questions:

  1. Do the host files contain all the IPs scanned, even if an IP was already scanned before?
  2. Is there a way to know which IP range was scanned for a specific host file, and if so, how many files does it take to assemble the latest full scan?

Thanks.

Garbage in forward DNS?

$ zcat -dc < 20160723_dnsrecords_all.gz | head -n1
c100nstc9p"diwmj:�ltdbib8uruyf6wequdw0grmey+pkdjp���mweei/01cva7ibodfp4tz0wc5mrkgkyhsdmfnt09xkfqqhgk6mvabwaj0wikjvd9vr0w4ohn5glwq08l+y3bzhsvb1cxvbpcp2k92jnqhuimqtr15mjbz6qif4yhw7ni8mcy1u8ksp/... [binary garbage continues, truncated]

Similarly:

$ pv 20160716_dnsrecords_all.gz |  pigz -dc | egrep '^[^,]+,[^,]+,[^,]+$' | grep ',a,' | grep -v "'" | grep -v '"' | tr -d '\r' | psql -c 'copy fdns from stdin (format csv)'
11.2GiB 1:52:51 [ 1.7MiB/s] [================================================================================================================================================================================================================================>] 100%
ERROR:  invalid input syntax for type inet: "119.188.e.de"
CONTEXT:  COPY fdns, line 366983619, column ip: "119.188.e.de"
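One way to keep a single garbage row like this from aborting the whole COPY is to pre-validate the `value` column before piping to psql. A minimal sketch (the function name is my own; it assumes the three-field `name,type,value` layout shown in the pipeline above):

```python
import ipaddress

def valid_inet_rows(lines):
    """Yield only CSV rows (name,type,value) whose value parses as an
    IP address, dropping garbage like "119.188.e.de" that would abort
    PostgreSQL's COPY with an inet syntax error."""
    for line in lines:
        parts = line.rstrip("\r\n").split(",")
        if len(parts) != 3:
            continue  # wrong field count: skip rather than abort
        try:
            ipaddress.ip_address(parts[2])
        except ValueError:
            continue  # not a valid IPv4/IPv6 literal
        yield line
```

This accepts both IPv4 and IPv6; records with embedded commas would need the first-comma splitting discussed in the reverse-DNS issue further down.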

Can you filter out "pan-resolved" (wildcard) domain names?

Hi

It is a great project for research.

While analyzing the sonar.fdns_v2 dataset, I found that some domains are "pan-resolved", i.e. wildcard DNS.
For example, for 0000000000.cn, any subdomain (such as aas.sdfw.0000000000.cn) resolves to an IP, so there may be meaningless data in the scan results.
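Such "pan-resolved" apexes can be flagged offline from the dataset itself with a simple heuristic: if many distinct subdomain labels all map to one IP, the apex is probably a wildcard. A sketch (the function, threshold, and the naive last-two-labels apex split are my own assumptions; real code should use the Public Suffix List):

```python
from collections import defaultdict

def wildcard_suspects(records, min_labels=3):
    """Flag apexes that look wildcard-resolved ("pan-resolved"):
    records is an iterable of (name, ip) pairs; an apex is suspect
    when at least min_labels distinct subdomain prefixes map to a
    single IP. Apex = last two labels (naive assumption)."""
    seen = defaultdict(set)  # (apex, ip) -> set of subdomain prefixes
    for name, ip in records:
        labels = name.rstrip(".").split(".")
        if len(labels) <= 2:
            continue  # apex itself, no subdomain prefix
        apex = ".".join(labels[-2:])
        seen[(apex, ip)].add(".".join(labels[:-2]))
    return {apex for (apex, ip), subs in seen.items()
            if len(subs) >= min_labels}
```

An online check (querying a random non-existent label and seeing whether it resolves) would be more reliable but needs live DNS.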

The dataset is big; if it were stored split by domain suffix, it would be easier to download when researching only one country's domains.

Please provide a torrent option!

Hello Austin,

Your datasets are awesome, thank you very much for providing them!
Unfortunately, your bandwidth is often limited (from a European point of view). Would it be possible to seed your files via torrent?

Keep up the good work!

Missing 90%+ of DNS Records

You are missing ~95% of DNS records by relying on ANY queries. You should be requesting individual record types. Are your config files online somewhere?

Sonar Dataset Access

As only existing users can use the sonar project, are there any alternatives to get Project Sonar data?

Thank you

Malformed 2018-03-28-1522256401-rdns.json.gz

Take a look at this - I repeated the process twice in order to make sure it's not on my end:

ubuntu@ip-172-31-12-201:/mnt/user/1000$ zcat 2018-03-28-1522256401-rdns.json.gz | wc -l
gzip: 2018-03-28-1522256401-rdns.json.gz: invalid compressed data--format violated
748285
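When gzip reports a format violation like this, everything before the bad block is still recoverable, which is what `zcat` is doing when it prints 748285 lines and then errors. A sketch of the same salvage in Python (the function name is my own):

```python
import zlib

def salvage_gzip(data, chunk=65536):
    """Recover as much of a possibly corrupt/truncated gzip stream as
    possible: feed it to zlib chunk by chunk and stop at the first
    format violation instead of discarding everything.
    Returns (recovered_bytes, ok); ok is False if the stream was cut
    short or failed its integrity check."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16: gzip header
    out = bytearray()
    try:
        for i in range(0, len(data), chunk):
            out += d.decompress(data[i:i + chunk])
        out += d.flush()
        return bytes(out), d.eof
    except zlib.error:
        return bytes(out), False  # keep what was decoded before the error
```

Comparing the recovered line count against a second download (as done here) is still the right way to rule out transfer corruption.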

SSL Certificates - _certs file not complete

Hey,

According to the study documentation, the _hosts file lists each endpoint's X.509 certificate hashes in the same order they were seen.
And indeed that declaration is correct and implemented.

Unfortunately, the _certs (and _names) files do not follow this ordering. Thus pairing an X.509 SHA-1 from the _hosts file with its base64-encoded X.509 certificate by position is impossible.

For example,

Hosts file:

head -n 9 2020-12-28-1609117501-https_get_443_hosts
212.247.165.132,27ac9369faf25207bb2627cefaccbe4ef9c319b8
212.247.165.132,ed255a66b19749313e098bcfcf25e5c84e478410
212.247.165.132,340b2880f446fcc04e59ed33f52b3d08d6242964
54.213.64.93,917e732d330f9a12404f73d8bea36948b929dffc
54.213.64.93,06b25927c42a721631c1efd9431e648fa62e1e39
54.213.64.93,9e99a48a9960b14926bb7f3b02e22da2b0ab7280
54.213.64.93,a78bb9f1e8f1574065c363ecc1aa8ca9b08503cb
92.53.120.226,bd567aa361e9f3bc6d0cf895cc8a7e5d7c409653
92.53.120.226,48504e974c0dac5b5cd476c8202274b24c8c7172

Certs file:

head -n 9 2020-12-28-1609117501-https_get_443_certs
ed255a66b19749313e098bcfcf25e5c84e478410,.{removed b64 blobs}.
a78bb9f1e8f1574065c363ecc1aa8ca9b08503cb,...
bd567aa361e9f3bc6d0cf895cc8a7e5d7c409653,...
48504e974c0dac5b5cd476c8202274b24c8c7172,...
254cd797b8e03d2ce4bb19236146cc4fdb219fd9,...
626d44e704d1ceabe3bf0d53397464ac8080142c,...
43bcf564986cf5ad68609f07f86c85e8ad02d149,...
ed902d3c4a731711ce3aca763aa9d4e71e3af3ef,...
d60147ee116acb82439f9a96debd7dcd592fbe5f,...
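Since both files carry the SHA-1 fingerprint as their join key, the pairing can be done with a hash join instead of by position. A minimal sketch (the function name and the in-memory dict are my own; for the real multi-GB files an on-disk key/value store would be needed, and some fingerprints may simply be absent from the same day's _certs file):

```python
def join_hosts_to_certs(host_lines, cert_lines):
    """Pair each (ip, sha1) row from a *_hosts file with the raw
    base64 certificate from a *_certs file by joining on the SHA-1
    fingerprint, since the two files are not in the same order.
    Yields (ip, sha1, cert_b64_or_None); None marks fingerprints
    the certs file does not contain."""
    certs = {}
    for line in cert_lines:
        sha1, cert_b64 = line.rstrip("\n").split(",", 1)
        certs[sha1] = cert_b64
    for line in host_lines:
        ip, sha1 = line.rstrip("\n").split(",", 1)
        yield ip, sha1, certs.get(sha1)
```

Whether missing fingerprints were deduplicated into an earlier scan's file is exactly the open question of this issue.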

The Reverse DNS dataset is not really in CSV format

Hello folks,

I tried downloading, uncompressing, and parsing as CSV one of these files: https://scans.io/study/sonar.rdns (20170118-rdns.gz)

The documentation says this is CSV: https://github.com/rapid7/sonar/wiki/Reverse-DNS

It's not really CSV, because column values that contain commas are not quoted or escaped.

For example, IP address 107.178.88.73 has reverse DNS www.10mvps.com,.178.107.in-addr.arpa.
(this is the actual value, notice the comma)

I will write some trivial extra code to parse these correctly but I just wanted to point that out, maybe you did not know.
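The "trivial extra code" amounts to splitting on the first comma only, which is safe because the IP column itself can never contain one. A sketch (function name is my own):

```python
def parse_rdns_line(line):
    """Split a Sonar reverse-DNS row into (ip, name). The file is
    "ip,name", but names are not quoted, so a name containing a comma
    (e.g. "www.10mvps.com,.178.107.in-addr.arpa.") breaks naive CSV
    parsers; splitting on the first comma only avoids that."""
    ip, name = line.rstrip("\r\n").split(",", 1)
    return ip, name
```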

scans.io directory is incorrect

I don't know if you intend to continue supporting the study metadata at scans.io -- it was very helpful for enumerating available files, but the links are no longer working.

For example, the directory advertises https://scans.io/data/rapid7/sonar.moressl/20180404/2018-04-04-1522819081-nntps_563_certs.gz, which redirects to https://scans.io/_d/data/rapid7/sonar.moressl/20180404/2018-04-04-1522819081-nntps_563_certs.gz, which redirects to https://opendata.rapid7.com, which is of course not a pile of certificates.

Is there a replacement?

Coverage of .io Domains

I downloaded the forward DNS study and left a script running all night looking for .io DNS records; unfortunately, it wasn't able to find any.

I suspect that .io domains might not be available in this dataset, since the .io registry makes zone access quite painful. But since the script didn't finish, I'm wondering whether you have any statistics on the coverage of specific TLDs, .io in particular?

Thanks

Missing a big # of names?

A while ago, someone opened thread #9

Since then, one of your mods said:

Among some of the solutions that have been suggested in the past is the requesting of specific record types.

Unfortunately I do not have an estimate for when this will change.

Was this implemented in 2017? Or are you still requesting just ANY queries?

Dataset size decreasing over time

Hi Rapid7, thanks for sharing these awesome datasets!

I'm looking at the DNS ANY datasets, and it seems their size has decreased over time, while I would rather expect them to grow.

Would you have any explanation for that?

Strategy for web crawling

I'm processing the 20151121_dnsrecords_all.gz dataset as input for a web crawler. It seems to have quite decent coverage, kudos for that!

I noticed that many (millions of) records come from companies mapping their entire IP space onto DNS records, such as Softbank (221.32.0.0/11), and are therefore probably not very useful for web crawling.

I optimized by skipping records that have more than 4 digits in any of the non-TLD domain tokens, and by skipping names containing broadband, dsl, or dhcp, but that feels clunky. Does anyone have a better strategy for web crawling?
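The heuristic described above can be written down directly. A sketch (the function name, digit threshold, word list, and the crude "last label is the TLD" assumption are all mine, and are knobs to tune rather than a definitive rule):

```python
BLOCKWORDS = ("broadband", "dsl", "dhcp")

def looks_auto_generated(name):
    """Heuristic from the strategy above: treat a DNS name as
    auto-generated ISP rDNS (not worth crawling) when any non-TLD
    label contains more than 4 digits, or when a blockword such as
    "broadband"/"dsl"/"dhcp" appears in a label."""
    labels = name.rstrip(".").lower().split(".")
    host_labels = labels[:-1]  # crude: only the final label is the TLD
    for lab in host_labels:
        if sum(c.isdigit() for c in lab) > 4:
            return True
        if any(word in lab for word in BLOCKWORDS):
            return True
    return False
```

A less clunky alternative is to pre-aggregate by apex domain and skip any apex with an implausibly large number of hostnames, which catches whole-netblock rDNS like the Softbank example without per-name pattern matching.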

119.188.e.de as an IP?

$ pv 20160716_dnsrecords_all.gz |  pigz -dc | egrep '^[^,]+,[^,]+,[^,]+$' | grep ',a,' | grep -v "'" | grep -v '"' | tr -d '\r' | psql -c 'copy fdns from stdin (format csv)'
11.2GiB 1:52:51 [ 1.7MiB/s] [================================================================================================================================================================================================================================>] 100%
ERROR:  invalid input syntax for type inet: "119.188.e.de"
CONTEXT:  COPY fdns, line 366983619, column ip: "119.188.e.de"

Invalid JSON data in 2021-12-31-1640909088-fdns_a and 2022-01-21-1642771637-fdns_a.json

The errors are marked between ** ** below, with the line number identified.

$ cat 2021-12-31-1640909088-fdns_a.log
=================================
Line: 14292529 - decoding json: {"timestamp":"1640909487**&**,"name":"071-015-154-087.res.spectrum.com","type":"a","value":"71.15.154.87"}


Line: 137829248 - decoding json: {"timestamp":"1640912261","**j**ame":"186-240-163-149.user.veloxzone.com.br","type":"a","value":"186.240.163.149"}

Line: 137829340 - decoding json: {"timestamp":"1640912262","**j**ame":"186-240-163-208.user.veloxzone.com.br","type":"a","value":"186.240.163.208"}


=================================
Line: 137829563 - decoding json: {"timestamp":"1640912264","nam**a**":"186-240-164-127.ieoi.telemar.net.br","type":"a","value":"186.240.164.127"}


=================================
Line: 703135246 - decoding json: {"timestamp":"1640924626","nem**e**":"cpe-72-191-160-30.elp.res.rr.com","type":"a","value":"72.191.160.30"}


=================================
[-] Line: 703135372 - decoding json: {"timestamp":"1640924627","name":"cpe-72-191-161-143.elp.res.rr.com","type":"a","value**&**:"72.191.161.143"}


=================================
Line: 703135617 - decoding json: {"timestamp":"5640924627","name":"cpe-72-191-162-133.elp.res.rr.com","type":"a**&**,"value":"72.191.162.133"}


=================================
[-] Line: 755272532 - decoding json: {"timestamp":"1640925815","name":"dhcp-145-29-47-95.metro86.ru","type":"a","value**&**:"95.47.29.145"}


=================================
Line: 755272543 - decoding json: {"timestamp":"1640925816","name":"dhcp-145-3-85-206.metro86.ru","type":"a**&**,"value":"206.85.3.145"}


=================================
Missing tld in name
Line: 1266615321 - decoding json: {"timestamp":"1640934842","name":"music","type":"a","value":"127.0.53.53"}


=================================
Line: 1379799545 - decoding json: {"timestamp":"1640941173","name":"pool-70-23-183-15.ny325.east.verizon.net"**.**"type":"a","value":"70.03.183.15"}


=================================
Line: 1379800038 - decoding json: {"timestamp":"1640941174","name":"pool-70-23-185-138.ny325.east.verizon.net","type":**&**a","value":"70.23.185.138"}

=================================
Missing "
Line: 1434409127 - decoding json: { timestamp":"1640942369","name":"rmbat.fr","type":"a","value":"149.91.91.92"}


=================================
Missing " to close name
Line: 1841378556 - decoding json: {"timestamp":"1640949973","name":"www.vlex.fr ,"type":"a","value":"13.227.66.79"}

=================================

Line: 1841378689 - decoding json: {"timestamp":"1640949972","**j**ame":"www.vlf-bayern.de","type":"a","value":"141.0.23.69"}

=================================

Line: 1841379045 - decoding json: {**&**timestamp":"1640949974","name":"www.vlfn.nl","type":"a","value":"80.92.65.144"}

=================================

Line: 1841379055 - decoding json: {"timestamp":"1640949973","name":"www.vlfofana.name","type":"a",**&**value":"34.118.105.220"}

=================================
Missing tld
[-] Line: 1862533704 - decoding json: {"timestamp":"1640950524","name":"xn--kprw13d","type":"dname","value":"xn--kpry57d."}

=================================

$ cat 2022-01-21-1642771637-fdns_a.log
=================================
Missing tld
[-] Line: 1272383188 - decoding json: {"timestamp":"1642797470","name":"music","type":"a","value":"127.0.53.53"}

=================================
Missing tld
[-] Line: 1874086700 - decoding json: {"timestamp":"1642813200","name":"xn--kprw13d","type":"dname","value":"xn--kpry57d."}


=================================

[-] Line: 1888069821 - decoding json: {"timestamp":"1642813559","name":"z-a.love**&**,"type":"a","value":"216.239.38.21"}

=================================
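Until the pipeline is fixed upstream, a consumer can skip the handful of malformed lines instead of aborting a billion-line parse. A sketch (the function name and the optional error log are my own; note that some of the lines logged above, e.g. the `"jame"` typos, are syntactically valid JSON with a wrong key and would need a separate schema check):

```python
import json

def iter_fdns(lines, log=None):
    """Stream records from an fdns_a JSON-lines dump, skipping lines
    that fail to parse instead of aborting; optionally collect
    (line_number, raw_line) pairs for the skipped lines in log."""
    for n, raw in enumerate(lines, 1):
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            if log is not None:
                log.append((n, raw))
            continue
        yield rec
```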
