
richkit's Introduction

Richkit

Richkit is a Python 3 package that takes a domain name as input and returns additional information on that domain. The information can be an analysis of the domain itself, looked up from databases, retrieved from other services, or some combination thereof.

The purpose of richkit is to provide a reusable library of domain name-related analysis, lookup, and retrieval functions that are shared within the Network Security research group at Aalborg University, and also available to the public for reuse and modification.

Documentation can be found at https://richkit.readthedocs.io/en/latest/.

Requirements

  • Python >= 3.5

Installation

In order to install richkit, just type pip install richkit in the terminal.

Usage

The following code snippets retrieve the effective TLD and the URL category, respectively.

  • Retrieving the effective top-level domain of a given URL:

    >>> from richkit.analyse import tld
    >>> urls = ["www.aau.dk","www.github.com","www.google.com"]
    >>>
    >>> for url in urls:
    ...     print(tld(url))
    dk
    com
    com
  • Retrieving the category of a given URL:

    >>> from richkit.retrieve.symantec import fetch_from_internet
    >>> from richkit.retrieve.symantec import LocalCategoryDB
    >>>
    >>> urls = ["www.aau.dk","www.github.com","www.google.com"]
    >>>
    >>> local_db = LocalCategoryDB()
    >>> for url in urls:
    ...     url_category=local_db.get_category(url)
    ...     if url_category=='':
    ...         url_category=fetch_from_internet(url)
    ...     print(url_category)
    Education
    Technology/Internet
    Search Engines/Portals

Modules

Richkit defines a set of functions grouped into the following modules:

  • richkit.analyse: This module provides functions that can be applied to a domain name. Similarly to richkit.lookup, and in contrast to richkit.retrieve, this is done without disclosing the domain name to third parties and breaching confidentiality.

  • richkit.lookup: This module provides the ability to look up domain names in local resources, i.e. the domain name cannot be sent off to third parties. The module might fetch resources, such as lists or databases, but this must be done in a way that keeps the domain name confidential. Contrast this with richkit.retrieve.

  • richkit.retrieve: This module provides the ability to retrieve data on domain names of any sort. It comes without the "confidentiality contract" of richkit.lookup.

Run Tests on Docker

To prevent environment-related problems, we provide a Dockerfile.test file, which builds a Docker image that runs Richkit's tests.

  • The only thing to add is your MAXMIND_LICENSE_KEY in .github/local-test/run-test.sh at line 3. It is required to pass the test cases for the lookup module.

Commands to run the tests in the Docker environment:

  • docker build -t richkit-test -f Dockerfile.test . : Builds the image required to run the test cases

  • docker run -e MAXMIND_LICENSE_KEY="<licence-key>" richkit-test : Runs the run-test.sh file in the Docker image.

Contributing

Contributions are most welcome.

We use the gitflow branching strategy, so if you plan to push a branch to this repository, please follow that. Note that we test branch names with .githooks/check-branch-name.py. The git pre-commit hook can be used to check this automatically on commit. An example for Linux that can be used directly is available, and can be enabled like this (assuming python>=3.6 and bash):

ln -s $(pwd)/.githooks/pre-commit.linux.sample $(pwd)/.git/hooks/pre-commit

Credits

richkit's People

Contributors

gianmarcomennecozzi, kidmose, mrtrkmn


richkit's Issues

Document the assumed data model for domain names

At a meeting with kh, atu, gmm and egk on 19 Sept., kh described a model for how to refer to different sorts of domain names. This is currently captured in the readme with:

Todo: Describe the data model of FQDN > APEX Domain > Public Suffix > TLD

This needs to be done.
This will contribute to solving #1.
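
For reference, the model could be illustrated with a small sketch (purely illustrative; decompose and the toy suffix set below are not part of richkit):

```python
def decompose(fqdn, public_suffixes):
    """Split an FQDN into (tld, public_suffix, apex_domain).

    public_suffixes is a toy stand-in for the Public Suffix List.
    """
    labels = fqdn.split(".")
    tld = labels[-1]
    # Longest (most labels) public suffix that matches the end of the FQDN.
    suffix = max(
        (s for s in public_suffixes if fqdn.endswith("." + s)),
        key=lambda s: s.count("."),
    )
    suffix_labels = suffix.count(".") + 1
    # The apex domain is the public suffix plus one more label.
    apex = ".".join(labels[-(suffix_labels + 1):])
    return tld, suffix, apex

print(decompose("www.example.co.uk", {"uk", "co.uk"}))
# ('uk', 'co.uk', 'example.co.uk')
```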

As a reviewer I'd like adherence to pep8 to make code easier to read

A tool like flake8 can be used to weed out code that deviates from the pep8 coding style.
We currently have 506 errors/warnings:

me@machine:dat$ pip install flake8
me@machine:dat$ flake8 dat/ test/ | wc -l
506

Adhering to pep8 would make my life easier when reviewing PRs, because I have an easier time reading the code and understanding what has changed.
When I code myself I already try to adhere, so I think the overhead is negligible.

I suggest including flake8 in the CI/CD pipeline.
As a starting point, fail builds on errors and warnings.

Opinions @gianmarcomennecozzi , @mrturkmen06 and anyone?

remove URLVoid code because it is dead

richkit/retrieve/util.py contains code for fetching data from URLVoid service, and the test of it fails.
As it is not currently exposed as methods under retrieve.*, it seems to be incomplete, and should be removed from master until it has been completed and passes the relevant tests.

This seems like it will also solve #52

Fix example from README.md: retrieve.symantec.LocalCategoryDB

$ ipython
Python 3.7.0 (default, Feb  4 2020, 14:16:38) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: >>> from richkit.retrieve.symantec import fetch_from_internet 
   ...: >>> from richkit.retrieve.symantec import LocalCategoryDB 
   ...: >>> 
   ...: >>> urls = ["www.aau.dk","www.github.com","www.google.com"] 
   ...: >>> 
   ...: >>> local_db = LocalCategoryDB() 
   ...: >>> for url in urls: 
   ...: ...     url_category=local_db.get_category(url) 
   ...: ...     if url_category=='': 
   ...: ...         url_category=fetch_from_internet(url) 
   ...: ...     print(url_category) 
   ...:                                                                                                               
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-4c50769c2b82> in <module>
      4 urls = ["www.aau.dk","www.github.com","www.google.com"]
      5 
----> 6 local_db = LocalCategoryDB()
      7 for url in urls:
      8     url_category=local_db.get_category(url)

~/git-reps/richkit/richkit/retrieve/symantec.py in __init__(self)
     49     def __init__(self):
     50 
---> 51         self.url_to_category = read_categorized_file()
     52 
     53     def get_category(self, url):

~/git-reps/richkit/richkit/retrieve/symantec.py in read_categorized_file()
    149     url_to_category = dict()
    150     if not os.path.exists(categorized_urls_file):
--> 151         open(categorized_urls_file,'w').close()
    152     else:
    153         with open(categorized_urls_file, "r") as ins:

FileNotFoundError: [Errno 2] No such file or directory: 'dat/retrieve/data/categorized_urls.txt'
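
A likely direction for the fix (an assumption based on the traceback, which shows the data file path resolved relative to the current working directory): resolve the file relative to the package and create missing parent directories before opening. A minimal sketch with a hypothetical helper:

```python
import os

def ensure_file(path):
    """Create the file and any missing parent directories, then return the path.

    ensure_file is a hypothetical helper; richkit's read_categorized_file
    currently calls open(path, 'w') without creating parent directories,
    which raises FileNotFoundError when the directory is absent.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        open(path, "w").close()
    return path
```

Anchoring the path to the module, e.g. via os.path.dirname(__file__), would additionally make it independent of the working directory.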

define git hooks for better git branch names

There should be some naming conventions in order to get out of a situation where everything is messed up. This issue could be closed by creating customised git hooks, which might run on the server and client side.

  • New branches should be created from the master branch.
  • If there is no issue describing your dev intention, first create an issue and define your intention.
  • There should not be branch-to-branch (other than master) pull requests, at least for the time being.
  • Branch names should match the issues that the project has.
  • Branch names should contain the issue number at the end, otherwise they should NOT be accepted.
  • A proper branch name length is between 18 and 25 characters.
  • Example:
    - For issue #4, the ideal branch name could be add-more-sources-#4
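
The conventions above could be checked mechanically; a hedged sketch (the real check lives in .githooks/check-branch-name.py and may differ):

```python
import re

def branch_name_ok(name):
    """Check the conventions above: lowercase words joined by hyphens,
    an issue number suffix, and a total length between 18 and 25."""
    matches = bool(re.fullmatch(r"[a-z0-9-]+-#\d+", name))
    return matches and 18 <= len(name) <= 25

print(branch_name_ok("add-more-sources-#4"))  # True: 19 chars, issue no. at end
```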

proposal for a Makefile

A Makefile could be useful to:

  • create a virtualenv for Python
  • clean (remove cache files generated in various cases)
  • lint
  • build
  • run test cases

In my opinion, it would be useful to have, for easier interaction with the project. What do you think about it @kidmose and @gianmarcomennecozzi?

clean up dependencies for http library

We are currently using multiple libraries for HTTP capabilities, which is unwanted complexity and dependency.

requests is believed to cover all our needs, be the easiest to use, and therefore the way to go.

Success criteria: We have removed wget and urllib* from requirements, and rewritten existing code to use requests (https://pypi.org/project/requests/)

Inconsistent test cases

It is quite weird to see failing and passing runs for the same commit id. I tried to merge the update_docs branch into the develop branch; although it only contained docs updates, it failed. So I restored the commit which passed before, but when I restored it, GitHub Actions started to fail. There is something wrong either in GitHub Actions or in our test cases. It needs to be investigated. Here is a screenshot of what I am trying to say:

Screenshot 2020-03-15 at 21 19 17

Does anyone have any idea about it?

Only download maxmind databases if the current one is outdated

In richkit.lookup.{country|asn} the databases are downloaded from the web if missing, as intended.

However, it also seems to me that the databases are downloaded again every time richkit.lookup.util is loaded, regardless of whether the files were downloaded just recently, even if the current ones are still up to date with the ones available from MaxMind.

According to the docs, tempfile.mkdtemp() will ensure that a new, empty temp folder is used on every load of the module, thus requiring a new download:

temp_directory = tempfile.mkdtemp()
:

This causes the following code, which is intended to reuse a local file if it is new enough, to have no effect (it looks in a new, empty tmp dir every time):

# check if the database is updated
if (int(calendar.timegm(time.gmtime()))
        - int(os.path.getctime(MaxMind_CC_DB.get_db_path(self)))) > self.three_weeks:
    shutil.rmtree(self.path_db)
    os.mkdir(self.path_db)
    MaxMind_CC_DB.get_db()

Goal: avoid downloading a database if one that is up to date already has been downloaded.
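
The goal could be sketched as follows (needs_refresh is a hypothetical name, not the richkit API): persist the database in a fixed cache directory instead of a fresh mkdtemp(), and download only when the file is missing or stale:

```python
import os
import time

THREE_WEEKS = 3 * 7 * 24 * 60 * 60  # seconds; mirrors self.three_weeks above

def needs_refresh(db_path, max_age=THREE_WEEKS, now=None):
    """Return True if the database file is missing or older than max_age.

    With a stable db_path (e.g. under ~/.cache), this makes the staleness
    check above effective, because the file survives module reloads.
    """
    if not os.path.exists(db_path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getctime(db_path)) > max_age
```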

Incorporate features found in pydomain

We have an existing code repository named pydomain, including calculations of 16 features.
These are useful, but the existing code is hard to reuse.

We need to:

  1. take the code for each of these features,
  2. reimplement it in this repo,
  3. add relevant unit tests,
  4. and document the features in docstring.

The old code is found here:
https://github.com/aau-network-security/pydomain/blob/master/pydomain/pydomain.py#L698

Please state in the comments when you start working on one of them, and note that leaving TODO's for e.g. docstrings is ok when it is not obvious how to interpret the feature.

Expected fail !

From time to time, the Google DNS servers (8.8.8.8, 8.8.4.4) change their ASN, which causes failures in our tests; ideally, those tests could be skipped.
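
One way to do that with unittest (AsnTestCase and lookup_asn are hypothetical names for illustration, not richkit code):

```python
import unittest

def lookup_asn(ip):
    """Hypothetical stand-in for the richkit ASN lookup."""
    raise NotImplementedError

class AsnTestCase(unittest.TestCase):
    @unittest.skip("Google DNS ASNs drift over time; see this issue")
    def test_google_dns_asn(self):
        # Assumed expected value; the real test compares against live data.
        self.assertIn(lookup_asn("8.8.8.8"), {"AS15169"})
```

An alternative to skipping outright is comparing against a set of plausible ASNs, which keeps some coverage while tolerating drift.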

integrate logging

Removing print statements from the package and integrating a logging system would be a nice way to inform the user.
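
The standard pattern for libraries (a sketch, not richkit's current code) is a per-module logger plus a NullHandler on the package root, leaving verbosity entirely to the application:

```python
import logging

# Library side: module-level logger; the NullHandler on the package root
# means importing the library emits nothing unless the app adds handlers.
logging.getLogger("richkit").addHandler(logging.NullHandler())
logger = logging.getLogger("richkit.analyse.segment")

# Application side: opt in to the desired verbosity.
logging.basicConfig(level=logging.WARNING)
logger.info("Fetching one gram file ...")       # suppressed at WARNING level
logger.warning("something the user must know")  # shown
```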

Remove log message clutter

Currently, importing the three submodules produces multiple log messages that are not relevant (See bottom)
This is in conflict with good design principles, as Raymond states in "The Art of UNIX Programming":

Don't clutter output with extraneous information

Goal: When using richkit without having taken any steps to increase log verbosity, the only log messages printed are warnings, when there is something the user really needs to know, and errors, when a given operation fails.

user@host:~/git-reps/richkit$ ipython
Python 3.7.0 (default, Feb  4 2020, 14:16:38) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from richkit import lookup, analyse, retrieve                                                                                                         
02-05 13:03 urllib3.connectionpool DEBUG    Starting new HTTPS connection (1): publicsuffix.org:443
02-05 13:03 urllib3.connectionpool DEBUG    https://publicsuffix.org:443 "GET /list/effective_tld_names.dat HTTP/1.1" 200 None
02-05 13:03 richkit.analyse.segment INFO     Fetching one gram file from gist ...
02-05 13:03 urllib3.connectionpool DEBUG    Starting new HTTPS connection (1): gist.githubusercontent.com:443
02-05 13:03 urllib3.connectionpool DEBUG    https://gist.githubusercontent.com:443 "GET /mrturkmen06/d9d5f8bc35be8efd81c447f70ca99fbf/raw/cfa317d7bce53ba55ca8f9bf27aa3170038f99cf/one-grams.txt HTTP/1.1" 200 4956240

In [2]:  

Use unittest.TestCase.assert* in TCs

We currently have some test cases under ./richkit/test/ that use the assert statement, which is intended for debugging and not for use in the unittest framework, which we use.

These should be changed to self.assert* (Where self is a unittest.TestCase).
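
A minimal before/after sketch (TldTestCase is a hypothetical test case): a bare assert raises a plain AssertionError and is stripped under python -O, while the TestCase methods survive -O and report the expected and actual values on failure.

```python
import unittest

class TldTestCase(unittest.TestCase):
    def test_tld(self):
        result = "dk"
        # Before: assert result == "dk"
        self.assertEqual(result, "dk")  # After: reports both values on failure
```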

AttributeError: module 'dat.retrieve.symantec' has no attribute 'refetch_from_internet'

I see the following error when trying to use dat.retrieve.symantec_category:

(dat) egk@egk-ThinkPad-T450s:~/git-reps/dat$ git branch -v
  develop                    94e4a0d Ideas in comments moved to Issue #4
  feature/docstring-refactor 870f1a9 refactor dat.retrieve.symantex to docstring + expose simple function
* master                     e8365c2 Merge pull request #13 from aau-network-security/clean-up-requirements-#11
(dat) egk@egk-ThinkPad-T450s:~/git-reps/dat$ python
Python 3.6.5 (default, Sep 19 2019, 13:56:05) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dat.retrieve
>>> dat.retrieve.symantec_category('google.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/egk/git-reps/dat/dat/retrieve/__init__.py", line 12, in symantec_category
    return symantec.refetch_from_internet(domain)
AttributeError: module 'dat.retrieve.symantec' has no attribute 'refetch_from_internet'

We need 1) a test case to catch this and 2) a fix.

https://richkit.readthedocs.io/en/latest/ not available

I was hoping to access the richkit documentation at https://richkit.readthedocs.io/en/latest/ but I get an error:

    \          SORRY            /
     \                         /
      \    This page does     /
       ]   not exist yet.    [    ,'|
       ]                     [   /  |
       ]___               ___[ ,'   |
       ]  ]\             /[  [ |:   |
       ]  ] \           / [  [ |:   |
       ]  ]  ]         [  [  [ |:   |
       ]  ]  ]__     __[  [  [ |:   |
       ]  ]  ] ]\ _ /[ [  [  [ |:   |
       ]  ]  ] ] (#) [ [  [  [ :===='
       ]  ]  ]_].nHn.[_[  [  [
       ]  ]  ]  HHHHH. [  [  [
       ]  ] /   `HH("N  \ [  [
       ]__]/     HHH  "  \[__[
       ]         NNN         [
       ]         N/"         [
       ]         N H         [
      /          N            \
     /           q,            \
    /                           \

Could this be related to the need to do a release, as mentioned in #6?

Also the build seems to have failed: https://readthedocs.org/projects/richkit/builds/

Linting before commit

It is annoying to see linting errors after changes have been made; integrating linting into the existing githook would be a nice way to prevent it, I think.

There are skipped tests

Get an overview of why we have 4 tests that are skipped, create relevant issues, and address them.

richkit/test/test_analyse.py ...ss................                       [ 61%]
richkit/test/test_lookup.py ..                                           [ 67%]
richkit/test/test_retrieve.py s...s                                      [ 82%]
richkit/test/test_util.py ......                                         [100%]

Handle URLVoid changing number of blacklists

URLVoid appears to have changed the number of blacklists, leading to a failed test:

https://github.com/aau-network-security/richkit/pull/71/checks?check_run_id=452081551#step:6:23

____________________ URLVoidTestCase.test_blacklist_status _____________________

self = <richkit.test.retrieve.test_urlvoid.URLVoidTestCase testMethod=test_blacklist_status>

    def test_blacklist_status(self):
        for k, v in self.test_urls.items():
            instance = URLVoid(k)
>           assert instance.blacklist_status() == v["blacklist_status"]
E           AssertionError: assert '0/34' == '0/36'
E             - 0/34
E             ?    ^
E             + 0/36
E             ?    ^

richkit/test/retrieve/test_urlvoid.py:79: AssertionError

We need to make the code and tests robust to such changes, as we have seen them introduce noise before (e.g. #70).
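
One option is to assert on the format and invariants of the value instead of pinning the exact number of blacklists (check_blacklist_status is an illustrative helper, not existing richkit code):

```python
import re

def check_blacklist_status(status):
    """Validate an 'N/M' blacklist status without pinning the total M,
    which URLVoid changes over time."""
    m = re.fullmatch(r"(\d+)/(\d+)", status)
    assert m, f"unexpected format: {status!r}"
    detections, total = int(m.group(1)), int(m.group(2))
    assert detections <= total
    return detections, total

print(check_blacklist_status("0/34"))  # (0, 34)
```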

Get an overview of missing tests

It seems that we haven't implemented tests for all the methods found under lookup, analyse and retrieve submodules, with analyse.n_grams_alexa being one example I encountered.

Task:

  1. Identify all methods in the __init__.py files of each of the three submodules,
  2. Get an overview of which ones aren't tested
  3. Create individual issues on github for each function that is not currently tested
In [1]: from richkit import lookup, analyse, retrieve                                                                                                                                                                                         

In [2]: analyse.n_grams_alexa('example.com')                                                                                                                                                                                                  
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-8d8491ba670c> in <module>
----> 1 analyse.n_grams_alexa('example.com')

~/git-reps/richkit/richkit/analyse/__init__.py in n_grams_alexa(domain)
    225 
    226     """
--> 227     return analyse.get_grams_alexa_2ld(domain)
    228 
    229 

~/git-reps/richkit/richkit/analyse/analyse.py in get_grams_alexa_2ld(domain, analyzer, ngram_range)
    153         :return: grams of second level domain
    154 	"""
--> 155         alexa_slds = load_alexa()
    156 	alexa_vc = CountVectorizer(analyzer=analyzer,
    157                                                            ngram_range=ngram_range,

~/git-reps/richkit/richkit/analyse/util.py in load_alexa(limit)
     64     alexa_domains = set()
     65     path = "top-1m.csv"
---> 66     with open(path) as f:
     67         for line in f:
     68             line = line.strip()

FileNotFoundError: [Errno 2] No such file or directory: 'top-1m.csv'

Make whois independent of linux and external binary

As per requirements.txt we are currently using the whois module;

Python wrapper for Linux “whois” command

This is expected to fail when whois is not installed/available on $PATH, and also when running on other OSs. I think the first case is what we see here:
https://github.com/aau-network-security/richkit/runs/531202016#step:5:191

In order to remove the tie to a specific OS and avoid being dependent on an external binary I suggest we move to a python implementation of whois.

I've previously worked with python-whois and found it to work nicely and be extensible (I added parsing for .dk whois), so I suggest that, but other alternatives might exist.

Goal: Unskip richkit.test.retrieve.test_whois.WhoisTestCase and make sure it passes the current tests.

Release error

Error on releasing a new version of the Python package

Screenshot 2020-03-17 at 23 07 38

I am checking it...

analyse.tld does not scale linearly

The runtime complexity of some of the functions may be prohibitive. Consider that of richkit.analyse.tld() as an example. Is this really because TLDs are intrinsically difficult to compute (e.g., by accounting for examples such as *.co.uk), or could this be streamlined?

The output of the attached code (see below) is as follows:

1 domains processed:
split(): 0.0009176731109619141 s
Richkit: 0.017614364624023438 s

10 domains processed:
split(): 0.0008223056793212891 s
Richkit: 0.1530759334564209 s

100 domains processed:
split(): 0.0008115768432617188 s
Richkit: 1.5235605239868164 s

1000 domains processed:
split(): 0.001218557357788086 s
Richkit: 15.542202234268188 s
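
A plausible cause (an assumption, not verified against the richkit source) is that the Public Suffix List is fetched or re-parsed on every call; parsing it once at import time and caching per-domain results would restore near-constant per-domain cost, as in this toy sketch:

```python
from functools import lru_cache

# Parsed once at import time; a toy subset standing in for the full
# Public Suffix List.
SUFFIXES = {"com", "dk", "uk", "co.uk"}

@lru_cache(maxsize=None)
def tld(domain):
    """Longest-match lookup against the preloaded suffix set."""
    labels = domain.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            return candidate
    return labels[-1]

print(tld("www.example.co.uk"))  # co.uk
```

Note the longest-match scan starts from the full name, so *.co.uk-style suffixes are handled before the bare uk entry.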

HTTP CT logs features

@kdhageman has done some work in the area of HTTPS Certificate Transparency logs, and also has a script that extracts some features for a domain.

We want those reimplemented in richkit (as a first iteration) such that richkit has a function for each feature that, given a domain name, will return the value for that feature.

The script is likely based on an API/data source at Censys, which is very batch-oriented (along the lines that a batch, whether for 1 or for 1000 domains, has a fixed price). It seems likely that https://crt.sh/?q=example.com is a better candidate for richkit for now.

Not knowing the state or nature of the script, it might be necessary to analyse it to understand each feature and reimplement it from scratch here, but I'm sure Kaspar can provide some advice.

This is done when richkit has a method for each of the features, with the documentation and testing to go with it.

@kdhageman: If you don't get around to pushing the script to a repo, then perhaps you can share the current version here?

Select license

We need to select which license this is to be published under.

I see the goals of this as enabling adopters and contributors to easily:

  • Know how they can use this tool, including rights for derivative work.
  • Know how they can contribute, what rights they retain on their contribution, and what rights they must be ready to waive.

My initial idea is GNU GPL, but I'm also interested in inputs.

Streamline Maxmind licensing

When MaxMind (the company providing the resources backing some of the functions in lookup) changed their API from anonymous HTTP download to an HTTP download where a (free) license key is required, a quick fix was implemented so that the key is read from the MAXMIND_LICENSE_KEY environment variable.

This is lacking at least the following:

  • Documentation on where and how to obtain the license
  • Documentation on how to configure the license
  • Handling of a missing license, e.g. such that all other parts of richkit, in particular richkit.lookup, are still functional when the key is not configured
  • Relevant test cases for the step above

And there might be more to add.
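
The missing-key handling could look roughly like this (MissingLicenseKey and get_license_key are hypothetical names, not the current richkit code):

```python
import os

class MissingLicenseKey(Exception):
    """Raised when MAXMIND_LICENSE_KEY is not configured."""

def get_license_key():
    """Read the key from the environment, failing with a pointer to the docs
    instead of an obscure download error deep inside lookup."""
    key = os.environ.get("MAXMIND_LICENSE_KEY")
    if not key:
        raise MissingLicenseKey(
            "Set MAXMIND_LICENSE_KEY; a free key is available from MaxMind."
        )
    return key
```

Callers in richkit.lookup could then catch MissingLicenseKey and degrade gracefully, keeping the rest of the module usable.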

Clean up requirements.txt

Currently requirements.txt seems to include more entries than necessary to use the library.

Please remove all unnecessary entries from the file.
