Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.
Use CrawlRequest and CrawlResponse objects. The CrawlResponse objects should contain the URLs to visit (these URLs are added via the two new methods below).
Implement two new methods (see the sketch after this list):
visitUrl: appends a URL to the list of URLs to visit.
visitUrls: extends the list of URLs to visit with a list of URLs.
Rename addExtractedUrls in CrawlFrontier to addCrawlResponse and update its javadoc accordingly.
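A minimal sketch of the two new methods on the crawler side, assuming the URLs are collected into a list that is later wrapped into a CrawlResponse (field and modifier choices are illustrative, not the final API):

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public abstract class BaseCrawler {

    // URLs added from callbacks, later wrapped into a CrawlResponse
    private final List<URL> urlsToVisit = new ArrayList<>();

    /** Appends a URL to the list of URLs to visit. */
    protected final void visitUrl(final URL url) {
        urlsToVisit.add(url);
    }

    /** Extends the list of URLs to visit with a list of URLs. */
    protected final void visitUrls(final List<URL> urls) {
        urlsToVisit.addAll(urls);
    }
}
```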
Hi,
I have a site where I need to do some work with the WebDriver object to log in and then refresh the site to display the data. The login happens on another site, after which cookies are created.
After I finish all my pre-work, can I pass the WebDriver to the crawler somehow and continue crawling with the same window?
I'd appreciate a quick reply.
Thanks,
Vadim
Logging should be toggleable in the configuration by setting debug mode to true, which makes the crawler print debug messages to the console. By specifying an optional argument (logFilePath), this output should also be written to the specified file.
If debug mode is set to false, no output should appear in the console.
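A possible shape for the option, with hypothetical setter names on CrawlerConfiguration (the final names may differ):

```java
CrawlerConfiguration config = new CrawlerConfiguration();

// Hypothetical setters; the final API may differ
config.setDebugMode(true);                  // print debug messages to the console
config.setLogFilePath("/tmp/crawler.log");  // optional: also write the output to this file
```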
HEAD requests should be sent only once (when the frontier returns the crawl request). If a request gets redirected to another URL, it should be treated as a new request.
Features to be implemented:
Remove CrawlResponse and replace it with CrawlRequest
CrawlRequests should contain the URL's top private domain
Rewrite tests for the modified frontier
Add visitUrl and visitUrls methods. These methods provide a way to add URLs to the frontier from the callbacks.
Add offsite request filtering (can be toggled in the configuration) - use Guava (included in the Selenium dependencies); see the first sketch after this list
Add a configuration option to disable duplicate request filtering
Callback methods do not need to be abstract (template method pattern); see the second sketch after this list.
The driver should always be closed, even if an exception occurs.
Rename onUrlOpen to onBrowserOpen
Rename onEnd to onFinish
Add new callback: onBrowserTimeout - this method is called when the request times out in the browser.
Add new public method: stop - this method is used to stop the crawler.
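For the top private domain and offsite filtering items, Guava's InternetDomainName (already on the classpath through the Selenium dependencies) can do the heavy lifting. A sketch with illustrative names:

```java
import com.google.common.net.InternetDomainName;
import java.net.URL;

public final class OffsiteRequestFilter {

    /**
     * Returns the top private domain of the URL, e.g. "example.com"
     * for "http://sub.example.com/page". Throws if the host is not
     * under a recognized public suffix.
     */
    public static String topPrivateDomainOf(final URL url) {
        return InternetDomainName.from(url.getHost())
                .topPrivateDomain()
                .toString();
    }

    /**
     * A request is offsite if its top private domain differs from the
     * seed's top private domain.
     */
    public static boolean isOffsite(final URL request, final URL seed) {
        return !topPrivateDomainOf(request).equals(topPrivateDomainOf(seed));
    }
}
```

The callback items might shake out as in this stub (concrete no-op callbacks via the template method pattern, plus the new stop method; everything beyond the names listed above is illustrative):

```java
abstract class BaseCrawler {

    private volatile boolean stopped; // checked by the crawl loop (not shown)

    /** Stops the crawler. */
    public final void stop() {
        stopped = true;
    }

    // Concrete no-op callbacks, so subclasses override only what they need
    protected void onBrowserOpen() { }
    protected void onBrowserTimeout() { }
    protected void onFinish() { }
}

class MyCrawler extends BaseCrawler {

    @Override
    protected void onBrowserTimeout() {
        // Called when the request times out in the browser
        stop();
    }
}
```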
HttpHeadResponse
This class will be used as a parameter in specific callbacks. It should contain the URL of the response and all the non-setter methods that HttpResponse provides.
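Assuming HttpResponse here refers to Apache HttpClient's org.apache.http.HttpResponse, a delegating wrapper might look like the following sketch (only a few of the non-setter methods are shown):

```java
import java.net.URL;
import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;

public final class HttpHeadResponse {

    private final URL url;
    private final HttpResponse response;

    public HttpHeadResponse(final URL url, final HttpResponse response) {
        this.url = url;
        this.response = response;
    }

    /** The URL of the response. */
    public URL getUrl() {
        return url;
    }

    // Read-only delegates to the wrapped response
    public StatusLine getStatusLine() {
        return response.getStatusLine();
    }

    public Header[] getAllHeaders() {
        return response.getAllHeaders();
    }

    public boolean containsHeader(final String name) {
        return response.containsHeader(name);
    }
}
```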
The development branch is used for development. When you start working on an issue, always create a new branch from the development branch. When you're finished, commit and push your changes to this branch. Then create a pull request for your code to be merged into the development branch.
It should provide an interface for the crawler to manage URLs. A crawl frontier is the part of a crawling system that decides what URLs should be crawled next. The frontier is initialized with a list of start URLs, that we call the seeds. Once the frontier is initialized the crawler asks it what URLs should be visited next. As the crawler starts to visit the URLs, it will inform the frontier of the extracted URLs contained within the page. These URLs are added to the frontier, if they haven't been visited yet.
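From the crawler's side, this contract could look roughly like the following sketch, using the request/response types specified later in this document (visitAndExtract stands in for the crawler's page-visiting logic; method names are illustrative):

```java
void crawl(final CrawlFrontier frontier) {
    while (frontier.hasNextRequest()) {
        final CrawlRequest request = frontier.getNextRequest();

        // Visit the page and collect the URLs found on it
        final List<URL> extractedUrls = visitAndExtract(request);

        // Report them back; the frontier queues only unseen URLs
        frontier.addCrawlResponse(
                new CrawlResponse(request.getCrawlDepth() + 1, extractedUrls));
    }
}
```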
The list of available WebDrivers can be found here: http://www.seleniumhq.org/projects/webdriver/
We would like to be able to use all of them. Please read their manuals/documentation to understand how they work and what they are capable of.
The factory will be used in BaseCrawler. Settings for the WebDrivers can be read from the CrawlerConfiguration instance which is passed as a parameter.
Example usage would be:
WebDriverFactory.getDriver(configuration)
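A possible shape for the factory, assuming a hypothetical getBrowserType() accessor on CrawlerConfiguration (the real setting names may differ):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public final class WebDriverFactory {

    /** Hypothetical browser selector; the real configuration may differ. */
    public enum BrowserType { CHROME, FIREFOX, HTML_UNIT }

    private WebDriverFactory() {
    }

    /** Creates a WebDriver based on the crawler configuration. */
    public static WebDriver getDriver(final CrawlerConfiguration configuration) {
        switch (configuration.getBrowserType()) {
            case CHROME:
                return new ChromeDriver();
            case FIREFOX:
                return new FirefoxDriver();
            default:
                return new HtmlUnitDriver(true); // headless, JavaScript enabled
        }
    }
}
```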
BaseCrawler's start method should have a boolean argument. This argument can be set to true if the user wants to run the crawler in a separate thread.
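For example (MyCrawler being any concrete subclass):

```java
MyCrawler crawler = new MyCrawler();

// Run in the caller's thread (blocks until the crawl finishes)
crawler.start(false);

// ...or run the crawler in a separate thread
crawler.start(true);
```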
CrawlRequest
These objects are constructed by the frontier. When the frontier receives a CrawlResponse (see below), it should loop through the extracted URLs in the response and, for each URL that has not already been visited by the crawler (check its fingerprint), construct a CrawlRequest object containing the crawl depth and the URL as a String and add it to the priority queue (see below).
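A minimal sketch of such a request object (illustrative, not the final API):

```java
/** A request constructed by the frontier for a not-yet-visited URL. */
public final class CrawlRequest {

    private final String url;
    private final int crawlDepth;

    public CrawlRequest(final String url, final int crawlDepth) {
        this.url = url;
        this.crawlDepth = crawlDepth;
    }

    public String getUrl() {
        return url;
    }

    public int getCrawlDepth() {
        return crawlDepth;
    }
}
```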
CrawlResponse
These objects are constructed by the crawler. When the crawler extracts a list of URLs from a page, it should construct a CrawlResponse object with the crawl depth (request's crawl depth + 1) and the list of extracted URLs (use URL type) and pass it to the frontier.
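And a matching sketch of the response object (again illustrative):

```java
import java.net.URL;
import java.util.List;

/** A response constructed by the crawler and passed to the frontier. */
public final class CrawlResponse {

    private final int crawlDepth;
    private final List<URL> extractedUrls;

    public CrawlResponse(final int crawlDepth, final List<URL> extractedUrls) {
        this.crawlDepth = crawlDepth;
        this.extractedUrls = extractedUrls;
    }

    public int getCrawlDepth() {
        return crawlDepth;
    }

    public List<URL> getExtractedUrls() {
        return extractedUrls;
    }
}
```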
The CrawlFrontier class should contain a priority queue (use PriorityQueue) of these requests, sorted by crawl depth (according to the configuration). When the crawler asks the frontier whether it has a next request, the frontier should check if the queue is empty. When the crawler asks for the next request, the frontier should remove the first element from the queue (a CrawlRequest object) and return it to the crawler (PriorityQueue's poll method is perfectly suitable for this).
CrawlFrontier should be initialized with a list of URLs (seeds). For each of these URLs, a new CrawlRequest object is constructed and added to the priority queue. A fingerprint for the URL is also created and added to the list of fingerprints.
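Putting the pieces together, the frontier internals might be sketched like this, reusing the CrawlRequest and CrawlResponse sketches above (the SHA-256 fingerprint and the comparator direction are illustrative choices; the latter would come from the configuration):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public final class CrawlFrontier {

    private final Set<String> fingerprints = new HashSet<>();

    // Lowest crawl depth first; reverse for the opposite ordering
    private final PriorityQueue<CrawlRequest> requests =
            new PriorityQueue<>(Comparator.comparingInt(CrawlRequest::getCrawlDepth));

    public CrawlFrontier(final List<String> seeds) {
        seeds.forEach(seed -> addRequest(seed, 0));
    }

    public boolean hasNextRequest() {
        return !requests.isEmpty();
    }

    public CrawlRequest getNextRequest() {
        return requests.poll();
    }

    public void addCrawlResponse(final CrawlResponse response) {
        response.getExtractedUrls().forEach(url ->
                addRequest(url.toString(), response.getCrawlDepth()));
    }

    // Adds a request only if the URL has not been seen before
    private void addRequest(final String url, final int crawlDepth) {
        if (fingerprints.add(fingerprintOf(url))) {
            requests.add(new CrawlRequest(url, crawlDepth));
        }
    }

    /** SHA-256 hex digest of the URL as a simple fingerprint. */
    private static String fingerprintOf(final String url) {
        try {
            final byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            final StringBuilder hex = new StringBuilder();
            for (final byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```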
First
Crawler.java, line 478. The filter is incorrect:

```java
HarResponse harResponse = proxyServer.getHar().getLog().getEntries().stream()
        .filter(harEntry -> candidateUrl.equals(harEntry.getRequest().getUrl()))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No HAR entry for request URL"))
        .getResponse();
```
candidateUrl always ends with a trailing slash, while harEntry.getRequest().getUrl() never does, so the equality check never matches.
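A normalization helper along these lines would make the comparison tolerant of the trailing slash (a sketch of the idea, not a drop-in patch):

```java
/** Treats two URLs as equal regardless of a single trailing slash. */
private static boolean sameUrlIgnoringTrailingSlash(final String a, final String b) {
    return stripTrailingSlash(a).equals(stripTrailingSlash(b));
}

private static String stripTrailingSlash(final String url) {
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
}
```

The filter would then call sameUrlIgnoringTrailingSlash(candidateUrl, harEntry.getRequest().getUrl()) instead of equals.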
Second
The code's violation of SOLID principles makes this issue hard to fix cleanly.