Crawl paper information from
- ScienceDirect
- MDPI

and store it into MongoDB. Check `item.py` for the details of the scraped structure.
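As a rough idea of what `item.py` defines, a scraped paper might look like the sketch below. The field names here are hypothetical; the actual structure in `item.py` may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of a scraped paper record; the real fields
# are defined in item.py and may differ.
@dataclass
class PaperItem:
    title: str
    authors: List[str] = field(default_factory=list)
    abstract: Optional[str] = None  # may be missing for some pages
    doi: Optional[str] = None
    journal: Optional[str] = None

paper = PaperItem(
    title="An example paper",
    authors=["A. Author"],
    abstract=None,
    journal="Example Journal",
)
print(paper.title)
```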
- docker
- python 3.7
```
docker pull mongo:4.2.3
docker run -p 27017:27017 -d --name mongo mongo:4.2.3
```
- run either of the methods below to install the dependencies

```
pip install -r requirement.txt
```

or

```
pipenv install --python 3.7
pipenv shell
```
- run the crawler:

```
scrapy crawl sciencedirect --loglevel=INFO
```

where `sciencedirect` is a spider name
- run `scrapy list` to display the list of crawlers
- I recommend Robo 3T for querying the data; otherwise the mongo shell works too
- connect to the server and query, e.g.:

```
db.getCollection('items').find({'abstract': {$ne: null}})
```
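In MongoDB, the `{$ne: null}` filter matches documents whose field exists and is not null; documents with a missing field are also excluded. A pure-Python illustration of that selection, using made-up sample documents rather than real scraped data:

```python
# Sample documents standing in for the 'items' collection
# (hypothetical data, not actual scraped papers).
docs = [
    {"title": "Paper A", "abstract": "Some abstract text."},
    {"title": "Paper B", "abstract": None},
    {"title": "Paper C"},  # field missing entirely
]

# {'abstract': {$ne: null}} keeps documents where the field
# exists and is not null; missing fields compare as null.
with_abstract = [d for d in docs if d.get("abstract") is not None]
print([d["title"] for d in with_abstract])  # → ['Paper A']
```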
- visit https://docs.scrapy.org/en/latest/topics/jobs.html for how pausing and resuming crawls works
- e.g.

```
scrapy crawl sciencedirect -s JOBDIR=crawl_jobs/sciencedirect
```
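The `JOBDIR` setting works by serializing scheduler state (pending requests and seen-request fingerprints) to disk, so an interrupted crawl can resume where it stopped. A toy sketch of that idea, not Scrapy's actual on-disk format:

```python
import json
import os
import tempfile

# Toy persistent queue illustrating the JOBDIR idea: pending work is
# written to disk so a restarted process can pick up where it left off.
# (Illustration only; Scrapy's real implementation differs.)
class DiskQueue:
    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.pending = json.load(f)
        else:
            self.pending = []

    def push(self, url):
        self.pending.append(url)
        self._save()

    def pop(self):
        url = self.pending.pop(0)
        self._save()
        return url

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)

jobdir = os.path.join(tempfile.mkdtemp(), "requests.json")
q = DiskQueue(jobdir)
q.push("https://example.com/page1")
q.push("https://example.com/page2")

# Simulate a crash/restart: a new queue reloads pending requests from disk.
q2 = DiskQueue(jobdir)
print(len(q2.pending))  # → 2
```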
- to persist the data, expose the files inside the docker container to the host machine
- first choose a folder on the host, for example:

```
/Users/bryan/papers_crawler/mongodb
```

- then run

```
docker run -p 27017:27017 -d -v /Users/bryan/workplace/papers_crawler/mongodb:/data/db/ --name mongo mongo:4.2.3
```