Crawley: A tool for basic web crawling

Crawley is a command-line tool written in Python that facilitates web platform discovery. It lets users customize search queries and validate search results, with the goal of identifying websites built on specific technological platforms (e.g. Semantic Wikis, Open Data Portals, WordPress sites).

Features

  • Cross-Platform Search Engine Crawling: The tool automates the crawling of popular search engines (Google, Bing, Yandex, Yahoo, DuckDuckGo, Baidu and Naver).
  • Rule-Based HTML Parsing: Crawley parses the HTML behind search results, identifying and confirming the underlying technical platform by applying custom validation rules (see the sketch after this list).
  • Hyperlink Traversal and Validation: The tool can follow hyperlinks extracted from the initial search results and validate the platforms behind those further URLs.
  • Result Categorization: Crawley's output is classified according to the detected technology and the markers that matched.
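
To make the rule-based parsing concrete, below is a minimal sketch of how a marker check against a page's HTML could look. The MARKERS table and the detect_platform function are hypothetical illustrations, not Crawley's actual internals; only urllib and bs4 from the requirements list are used.

import urllib.request
from bs4 import BeautifulSoup

# Hypothetical marker rules: platform name -> text snippets expected in the HTML.
MARKERS = {
    "Semantic MediaWiki": ["Powered by Semantic MediaWiki"],
    "Socrata": ["Powered by Socrata"],
}

def detect_platform(url):
    """Fetch a URL and return the first platform whose marker appears in the page."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    text = BeautifulSoup(html, "html.parser").get_text()
    for platform, snippets in MARKERS.items():
        if any(snippet in text for snippet in snippets):
            return platform
    return None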

Requirements

  • Python 3.x
  • urllib
  • bs4
  • serpapi
  • pprint

Installation

Clone this repository to your machine and install the required packages from requirements.txt.
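
For example (the clone URL below is a placeholder; substitute this repository's actual URL):

git clone https://github.com/<user>/crawley.git
cd crawley
pip3 install -r requirements.txt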

Pre-requisites

To use the tool, you need SERP API key(s). The keys.txt file already contains a couple of them, but more is better. You can generate free keys by going to https://serpapi.com and registering for a free account (100 searches per month). Insert all available keys into the keys.txt file (one key per line) and the tool will automatically choose the ones that still have capacity.
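
For example, a keys.txt with two keys would look like this (the values are placeholders, not real keys):

YOUR_FIRST_SERPAPI_KEY
YOUR_SECOND_SERPAPI_KEY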

1. Querying process

The platform discovery process begins by identifying parts of text or image annotations commonly found on sites using a particular technology of interest. These are usually phrases such as:

Powered by Semantic MediaWiki
CKAN API
Socrata API

or parts of URLs commonly used by a specific platform:

.../dataset

Having identified possible common markers, you can formulate queries and process the results from common search engines (SEs) with the tool. The tool builds on the SERP API as a reliable solution for SE querying. Searches are possible with Google, Bing, Yandex, Yahoo, DuckDuckGo, Baidu and Naver. The search results are aggregated in the ./results folder.
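
For reference, a single Google query through SerpApi's Python client could look roughly like the sketch below. This assumes the GoogleSearch interface from the google-search-results flavor of the client; the exact client used by Crawley may differ, so treat this as an illustration of the querying step, not the tool's actual code.

from serpapi import GoogleSearch

# Illustrative sketch: run one Google query via SerpApi and print the result links.
params = {
    "engine": "google",
    "q": 'site:*.socrata.com "Powered by Socrata" -site:socrata.com',
    "num": 100,                     # results per page (Google's maximum)
    "start": 0,                     # offset: total results to skip
    "api_key": "YOUR_SERPAPI_KEY",  # placeholder: one of the keys from keys.txt
}
results = GoogleSearch(params).get_dict()
for entry in results.get("organic_results", []):
    print(entry.get("link"))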

Normally, a search is defined by the engine (Google, Bing, Yandex, Yahoo, DuckDuckGo, Baidu, Naver), the query itself, the count (number of results per page, where you need to consider the maximum value allowed by the given SE) and the offset (pagination, expressed either in pages - 1, 2, 3 - or in skipped results - 100, 200, 300 - depending on the SE). Each SE thus allows a different maximum number of results per page and may vary in its approach to pagination. Below are the command templates for each SE.

After each search, the tool prints the actual number of usable results returned, or an error once no more results are available. Normally it makes sense to increase the pagination until the results are exhausted. The number of results also gives a good estimate of how discoverable the platforms are with a given query: more general queries yield more results but often fewer hits, while more specific queries yield fewer results but a larger proportion of hits.

Google:

The maximum number of results per page for Google is 100 (i.e. count set to 100) and the offset is counted in total results to skip (i.e. 0, 100, 200, ... if count is set to 100).

python3 crawley.py --query "{QUERY}" --engine Google --count {COUNT}  --offset {OFFSET}

Notes: Optimal queries with Google include exact-match searches (terms in double quotes), the inurl: and site: operators (https://ahrefs.com/blog/google-advanced-search-operators/) and exclusion of clear false positives through -site:.

python3 crawley.py --query "site:*.socrata.com \"Powered by Socrata\" -site:socrata.com" --engine Google --count 100  --offset 0

The command to get the next page of results would be (offset increased by 100):

python3 crawley.py --query "site:*.socrata.com \"Powered by Socrata\" -site:socrata.com" --engine Google --count 100  --offset 100
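
If you want to script the pagination, a simple shell loop works. This is illustrative only; the loop does not detect exhaustion itself, so stop it once the tool reports that no more results are available:

for OFFSET in 0 100 200 300
do
    python3 crawley.py --query "site:*.socrata.com \"Powered by Socrata\" -site:socrata.com" --engine Google --count 100 --offset $OFFSET
done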

2. Validating the platforms

See config-example.json and save your adjusted version as config.json to define the markers used for validation. The tool requests the HTML for each search result and then tries to match it against the markers. After you have done enough queries and defined the markers, run the following command to validate the results:

python3 crawley.py --validate

The validation results can be found in ./validatedSites.json. The total number of validation hits for each platform will be printed to the console and can be found in ./validatedReport.json.
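
For illustration, a config.json in the spirit of config-example.json might map each platform name to the textual markers to search for in the HTML. The exact schema below is an assumption; follow config-example.json in the repository as the authoritative template:

{
  "Semantic MediaWiki": ["Powered by Semantic MediaWiki"],
  "CKAN": ["CKAN API", "/dataset"],
  "Socrata": ["Powered by Socrata", "Socrata API"]
}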

3. Following links

To fetch the HTML of links that appear in the HTML of the initial results and validate their platforms, use the following command.

python3 crawley.py --links

This will update the results.

Contributing

Contributions are welcome. Please open an issue or submit a pull request if you would like to contribute.

License

The tool is available under the CC-BY 4.0 license.

Authors

[ANONYMIZED]

Acknowledgements
