Intro
This crawler is built in Python on top of the amazing Scrapy framework. Its purpose is to scan all the modules hosted on Drupal.org and store them in a database.
Pre-requisites
To make this crawler work, you need the following installed on your machine:
- Python
- Scrapy
Install Scrapy
The easiest way to install Scrapy is to use pip, a tool for installing and managing Python packages:
pip install Scrapy
Download the crawler
git clone git://github.com/JulienD/modules-crawler.git
Play with the crawler
To run the spider:
scrapy crawl ModulesXml
If you want to restrict the crawl to a major version of modules, pass it as an argument:
scrapy crawl ModulesXml -a version=7
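Under the hood, Scrapy passes each `-a name=value` pair to the spider's `__init__` as a keyword argument. The sketch below shows that mechanism as a plain Python class (no Scrapy dependency); the class name and the `core=` query parameter are hypothetical and may differ from what the real ModulesXml spider does.

```python
# Minimal sketch of how a Scrapy spider receives `-a version=7`.
# Scrapy forwards `-a` arguments as keyword arguments to __init__;
# the spider can then use them, e.g. to build its start URL.
# The URL query parameter below is an assumption, not the real API.

class ModulesXmlSpider:
    name = "ModulesXml"

    def __init__(self, version=None, **kwargs):
        # `version` arrives as a string (e.g. "7"), or None if omitted.
        self.version = version

    def start_urls_for_version(self):
        base = "https://updates.drupal.org/release-history"
        if self.version is None:
            return [base]
        # Restrict the crawl to one major core branch, e.g. "7.x".
        return ["%s?core=%s.x" % (base, self.version)]

spider = ModulesXmlSpider(version="7")
print(spider.start_urls_for_version())
```

Because the argument always arrives as a string, a real spider would validate or cast it before use.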
By default, Scrapy displays all messages, including debug output. To see only error messages, use the --loglevel argument:
scrapy crawl ModulesXml --loglevel=ERROR
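If you always want quieter output, you can also set the log level once for the whole project in the Scrapy settings file instead of passing --loglevel on every run:

```python
# settings.py (project-wide Scrapy settings)
# Only log messages at ERROR level or above.
LOG_LEVEL = "ERROR"
```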