Overview:
- Install Docker and Maven
- Clone this repo
- Start the Elasticsearch instance
- Build the uber jar
- Start the scraper
- Add search tasks
- List keywords and documents
- View documents
Docker must be installed.
cd into the repo.
Start Elasticsearch (ES) with:
docker-compose up
Stop it with:
docker-compose down
Stop it and clear the persistent volumes with:
docker-compose down -v
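ES can take a little while to accept connections after docker-compose up starts it, so it may help to wait from a second terminal before starting the scraper. The following is a minimal sketch, not part of this repo: wait_for is a hypothetical helper, and the URL in the example comment assumes ES is published on its default port 9200 (check docker-compose.yml for the actual mapping).

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of this repo): reruns COMMAND until it
# succeeds, giving up after ATTEMPTS failed runs.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (assumes ES on the default port 9200):
#   wait_for 30 2 curl -fsS http://localhost:9200 && echo "ES is up"
```

The helper returns 0 as soon as the command succeeds and 1 once all attempts have failed, so it can be chained with && in front of the scraper command.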
To build the uber jar, open a terminal in the folder you cloned the repo into and run:
mvn -Dmaven.test.skip=true package
The scraper is a process that polls the ES instance for search tasks.
To start it, open a terminal in the folder you cloned the repo into and run:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --scraper --scraper-threads 5
You can quit it by pressing Ctrl+C.
In a new console, add a task using:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --add --search-name --keywords ...
For example, with search name "centralized logging" and keywords "datadog", "metrics", "logging":
java -jar target/searchscraper-1.0-SNAPSHOT.jar --add --search-name "centralized logging" --keywords datadog metrics logging
Note that if a search name or keyword contains spaces, you need to either escape the spaces or put the phrase in quotes, e.g.:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --add --search-name "centralized logging spacy" --keywords "data dog" datum\ dogs
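To see how the shell splits such phrases before the jar receives them, a small sketch can print each argument on its own line. show_args is a hypothetical helper for illustration only and is not part of this repo.

```shell
# Hypothetical helper (not part of this repo): prints each argument it
# receives in brackets, one per line, to show how the shell splits words.
show_args() {
    printf '[%s]\n' "$@"
}

# Quoted and backslash-escaped phrases each arrive as a single argument.
show_args "data dog" datum\ dogs

# Unquoted, the same words arrive as two separate arguments.
show_args data dog
```

The first call prints [data dog] and [datum dogs]; the second prints [data] and [dog], which is how an unquoted phrase would end up as two separate keywords.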
Get a list of all the scraped keywords with:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --list
Add a specific keyword to list the documents found for that keyword:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --list datadog
It should return a list of documents, each prefixed by its document ID (a random UUID).
Get the content of a document using:
java -jar target/searchscraper-1.0-SNAPSHOT.jar --read <document ID>
- Logging: would have liked to have decent logging.
- Exception handling for ES operations: it is kind of messy.
- DataStore#listDocsForKeyword only grabs the first 10 documents.
  - If a keyword has more than 10 documents associated with it, the rest are not seen.
  - Need to switch to a scrolled search, as in DataStore#listKeywords.