
serritor's People

Contributors

kmozsi, peterbencze


serritor's Issues

Adjust BaseCrawler to the further developed CrawlFrontier

Use CrawlRequest and CrawlResponse objects. The CrawlResponse objects should contain the URLs to visit (these URLs are added by 2 new methods, see below).

Implement 2 new methods:

  • visitUrl: it should append a URL to the list of URLs to visit.
  • visitUrls: it should extend the list of URLs to visit with a list of URLs (see the sketch below).

Rename addExtractedUrls in CrawlFrontier to addCrawlResponse and update its javadoc accordingly.
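
A minimal sketch of how the two new methods might look on BaseCrawler (the field name and the surrounding class structure are assumptions, not the project's actual implementation):

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; only the method names come from the issue text.
public abstract class BaseCrawler {

    // URLs queued while processing the current page; these end up in the CrawlResponse.
    private final List<URL> urlsToVisit = new ArrayList<>();

    /** Appends a single URL to the list of URLs to visit. */
    protected final void visitUrl(final URL url) {
        urlsToVisit.add(url);
    }

    /** Extends the list of URLs to visit with a list of URLs. */
    protected final void visitUrls(final List<URL> urls) {
        urlsToVisit.addAll(urls);
    }
}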

Implementation Question

Hi,
I have a site where I need to do some work with the WebDriver object to log in and then refresh the site to display the data. The login happens on another site, and cookies are created there. After I finish all of my pre-work, can I pass the WebDriver to the crawler somehow and continue crawling with the same window?
I'd appreciate a quick reply.
Thanks,
Vadim

Implement CrawlerLogger, a logger that controls every logging related to the crawler

Logging should be toggleable in the configuration by setting debug mode to true. This makes the crawler print debug messages to the console. By specifying an optional argument (logFilePath), this output should also be written to the specified file.
If debug mode is set to false, no output should appear in the console.

Example:
config.setDebugMode(true);
config.setDebugMode(true, "/path/to/file");
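
A hedged sketch of what such a logger might look like (the class is only requested in this issue, so the fields and method names below are assumptions):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical sketch of CrawlerLogger; the actual implementation may differ.
public final class CrawlerLogger {

    private final boolean debugMode;
    private final PrintWriter fileWriter; // null when no log file is configured

    public CrawlerLogger(final boolean debugMode, final String logFilePath) throws IOException {
        this.debugMode = debugMode;
        this.fileWriter = logFilePath != null
                ? new PrintWriter(new FileWriter(logFilePath, true))
                : null;
    }

    /** Prints the message to the console (and to the log file, if configured), but only in debug mode. */
    public void debug(final String message) {
        if (!debugMode) {
            return;
        }
        System.out.println(message);
        if (fileWriter != null) {
            fileWriter.println(message);
            fileWriter.flush();
        }
    }
}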

Implement the initial CrawlerConfiguration class

This class will be used to set and get all the necessary options for the crawler (such as the WebDriver, etc.).

In this phase, implement all the methods required by the WebDriverFactory, plus runInBackground.
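
A minimal sketch of what this initial class might contain (field and method names are assumptions based on the issue text):

import org.openqa.selenium.WebDriver;

// Hypothetical sketch of the initial CrawlerConfiguration.
public class CrawlerConfiguration {

    private WebDriver webDriver;
    private boolean runInBackground;

    public WebDriver getWebDriver() {
        return webDriver;
    }

    public void setWebDriver(final WebDriver webDriver) {
        this.webDriver = webDriver;
    }

    public boolean getRunInBackground() {
        return runInBackground;
    }

    public void setRunInBackground(final boolean runInBackground) {
        this.runInBackground = runInBackground;
    }
}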

Rewrite the crawling logic, implement specific features

HEAD requests should be sent only once (when the frontier returns the crawl request). If a request gets redirected to another URL, it should be treated as a new request.

Features to be implemented:

  • Remove CrawlResponse and replace it by CrawlRequest
  • CrawlRequests should contain the URL's top private domain
  • Rewrite tests for the modified frontier
  • Add visitUrl and visitUrls methods. These methods provide a way to add URLs to the frontier from the callbacks.
  • Add offsite request filtering (can be toggled in the configuration) - use Guava (included in the Selenium dependencies); see the sketch after this list
  • Add a configuration option to disable duplicate request filtering
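
A minimal sketch of how the offsite filter might use Guava's InternetDomainName to compare top private domains (the class and method names below are assumptions, not the project's actual code):

import java.net.URL;

import com.google.common.net.InternetDomainName;

// Hypothetical sketch; only the use of Guava's InternetDomainName reflects the issue text.
public final class OffsiteRequestFilter {

    /** Returns true if the candidate URL shares its top private domain with the request's URL. */
    public static boolean isOnSite(final URL requestUrl, final URL candidateUrl) {
        return topPrivateDomainOf(requestUrl).equals(topPrivateDomainOf(candidateUrl));
    }

    private static String topPrivateDomainOf(final URL url) {
        InternetDomainName domainName = InternetDomainName.from(url.getHost());

        // Hosts that are not under a public suffix (e.g. "localhost") fall back to the raw host name.
        return domainName.isUnderPublicSuffix()
                ? domainName.topPrivateDomain().toString()
                : url.getHost();
    }
}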

Refactor and further develop BaseCrawler, rewrite HttpHeadResponse

  • BaseCrawler

Callback methods do not need to be abstract (template method pattern).
The driver should always be closed, even if an exception occurs.
Rename onUrlOpen to onBrowserOpen.
Rename onEnd to onFinish.

Add new callback: onBrowserTimeout - this method is called when the request times out in the browser.
Add new public method: stop - this method is used to stop the crawler. (A sketch of the refactored class follows this issue.)

  • HttpHeadResponse

This class will be used as a parameter in specific callbacks. It should contain the URL of the response and all the non-setter methods that HttpResponse provides.
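
A minimal sketch of how the refactored BaseCrawler might look with non-abstract callbacks and a guaranteed driver shutdown (the crawl loop is elided; all bodies below are assumptions):

import org.openqa.selenium.WebDriver;

// Hypothetical sketch; the actual Serritor implementation may differ.
public abstract class BaseCrawler {

    private volatile boolean stopRequested;

    /** Requests the crawler to stop after the current request has been processed. */
    public final void stop() {
        stopRequested = true;
    }

    // Callbacks are concrete no-ops (template method pattern); subclasses override only what they need.
    protected void onBrowserOpen(final WebDriver driver) {
    }

    protected void onBrowserTimeout(final WebDriver driver) {
    }

    protected void onFinish() {
    }

    protected final void run(final WebDriver driver) {
        try {
            // ... crawl loop: poll the frontier and open each URL until stopRequested is set ...
            onBrowserOpen(driver);
        } finally {
            // The driver is always closed, even if an exception occurs.
            driver.quit();
        }
        onFinish();
    }
}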

Create development branch from master

The development branch will obviously be used for development. When you start working on an issue, always create a new branch from the development branch. When you're finished, commit and push your changes to that branch, then create a pull request to merge your code into the development branch.

Implement an initial CrawlFrontier

It should provide an interface for the crawler to manage URLs. A crawl frontier is the part of a crawling system that decides which URLs should be crawled next. The frontier is initialized with a list of start URLs that we call the seeds. Once the frontier is initialized, the crawler asks it which URLs should be visited next. As the crawler visits these URLs, it informs the frontier of the URLs extracted from each page. These URLs are added to the frontier if they haven't been visited yet.
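
A minimal sketch of what this initial frontier might look like (the method names are assumptions; later issues refine the interface):

import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the initial CrawlFrontier.
public class CrawlFrontier {

    private final Set<URL> seenUrls = new HashSet<>();
    private final Queue<URL> urlsToVisit = new ArrayDeque<>();

    public CrawlFrontier(final List<URL> seeds) {
        addUrls(seeds);
    }

    /** Called by the crawler with the URLs extracted from a visited page. */
    public void addUrls(final List<URL> extractedUrls) {
        for (URL url : extractedUrls) {
            // Only URLs that have not been seen yet are queued.
            if (seenUrls.add(url)) {
                urlsToVisit.add(url);
            }
        }
    }

    public boolean hasNextUrl() {
        return !urlsToVisit.isEmpty();
    }

    public URL getNextUrl() {
        return urlsToVisit.poll();
    }
}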

Design WebDriverFactory for Selenium WebDriver construction

The list of available WebDrivers can be found here: http://www.seleniumhq.org/projects/webdriver/
We would like to be able to use all of them. Please read their manuals/documentation to understand how they work and what they are capable of.

The factory will be used in BaseCrawler. Settings for the WebDrivers can be read from the CrawlerConfiguration instance which is passed as a parameter.

Example usage would be:
WebDriverFactory.getDriver(configuration)
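
A hedged sketch of what such a factory might look like (the getBrowserName accessor and the supported browser names are assumptions; only the getDriver(configuration) signature comes from the issue):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Hypothetical sketch; driver-specific settings would also be read from the configuration here.
public final class WebDriverFactory {

    public static WebDriver getDriver(final CrawlerConfiguration configuration) {
        switch (configuration.getBrowserName()) {
            case "chrome":
                return new ChromeDriver();
            case "firefox":
                return new FirefoxDriver();
            default:
                throw new IllegalArgumentException(
                        "Unsupported browser: " + configuration.getBrowserName());
        }
    }
}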

Refactor BaseCrawler, extend CrawlerConfiguration

  • Refactor BaseCrawler:

For loops should be replaced with Java 8 streams.

Add abstract onUrlOpen method (argument: the driver instance)
Add abstract onUrlOpenError (argument: the URL as String)

  • Extend CrawlerConfiguration:

Methods to be added (a sketch of these additions follows this issue):

  • addDesiredCapability: add a desired capability to the desiredCapabilities list
  • setDebugMode
  • getDebugMode
  • setCrawlingStrategy (Blocking #18): DEPTH_FIRST, BREADTH_FIRST
  • getCrawlingStrategy (Blocking #18)

Rename the following methods:
setDesiredCapabilities -> addDesiredCapabilities
setSeed -> addSeed
setSeeds -> addSeeds
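
A hedged sketch of the new configuration members (the method names follow the issue; the field types and bodies are assumptions, and unrelated getters are omitted):

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.remote.DesiredCapabilities;

// Hypothetical sketch showing only the additions to CrawlerConfiguration.
public class CrawlerConfiguration {

    public enum CrawlingStrategy { DEPTH_FIRST, BREADTH_FIRST }

    private final List<DesiredCapabilities> desiredCapabilities = new ArrayList<>();
    private final List<URL> seeds = new ArrayList<>();
    private boolean debugMode;
    private CrawlingStrategy crawlingStrategy = CrawlingStrategy.BREADTH_FIRST;

    public void addDesiredCapability(final DesiredCapabilities capability) {
        desiredCapabilities.add(capability);
    }

    public void addSeed(final URL seed) {
        seeds.add(seed);
    }

    public void addSeeds(final List<URL> seeds) {
        this.seeds.addAll(seeds);
    }

    public void setDebugMode(final boolean debugMode) {
        this.debugMode = debugMode;
    }

    public boolean getDebugMode() {
        return debugMode;
    }

    public void setCrawlingStrategy(final CrawlingStrategy crawlingStrategy) {
        this.crawlingStrategy = crawlingStrategy;
    }

    public CrawlingStrategy getCrawlingStrategy() {
        return crawlingStrategy;
    }
}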

Further develop the Crawl Frontier

Two new classes should be implemented:

  • CrawlRequest
    These objects are constructed by the frontier. When the frontier receives a CrawlResponse (see below), it should loop through the extracted URLs in the response and, for each URL that has not already been visited by the crawler (check its fingerprint), construct a CrawlRequest object containing the crawl depth and the URL as a String and add it to the priority queue (see below).
  • CrawlResponse
    These objects are constructed by the crawler. When the crawler extracts a list of URLs from a page, it should construct a CrawlResponse object with the crawl depth (request's crawl depth + 1) and the list of extracted URLs (use URL type) and pass it to the frontier.

The CrawlFrontier class should contain a priority queue (use PriorityQueue) of these requests, sorted by their crawl depth (according to the configuration). When the crawler asks the frontier if it has a next request, the frontier should check if the queue is empty or not. When the crawler asks for the next request, the frontier should get the first element from the queue (which is a CrawlRequest object) and return it to the crawler (PriorityQueue has a poll method which is perfectly suitable for this).

CrawlFrontier should be initialized with a list of URLs (seeds). For each of these URLs, a new CrawlRequest object is constructed and added to the priority queue. A fingerprint for the URL is also created and added to the list of fingerprints.
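
A hedged sketch of how these pieces might fit together (the class and method names follow the issue text; the bodies, the nested-class layout, and the choice of the URL string itself as the fingerprint are assumptions):

import java.net.URL;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch of the further developed frontier.
public class CrawlFrontier {

    /** Constructed by the frontier: one request per not-yet-visited URL. */
    public static class CrawlRequest {
        private final String url;
        private final int crawlDepth;

        public CrawlRequest(final String url, final int crawlDepth) {
            this.url = url;
            this.crawlDepth = crawlDepth;
        }

        public String getUrl() { return url; }
        public int getCrawlDepth() { return crawlDepth; }
    }

    /** Constructed by the crawler: the URLs extracted from a page, at the request's depth + 1. */
    public static class CrawlResponse {
        private final int crawlDepth;
        private final List<URL> extractedUrls;

        public CrawlResponse(final int crawlDepth, final List<URL> extractedUrls) {
            this.crawlDepth = crawlDepth;
            this.extractedUrls = extractedUrls;
        }

        public int getCrawlDepth() { return crawlDepth; }
        public List<URL> getExtractedUrls() { return extractedUrls; }
    }

    private final Set<String> urlFingerprints = new HashSet<>();

    // Breadth-first ordering: the smallest crawl depth is polled first.
    // A depth-first configuration would simply reverse this comparator.
    private final PriorityQueue<CrawlRequest> requestQueue =
            new PriorityQueue<>(Comparator.comparingInt(CrawlRequest::getCrawlDepth));

    public CrawlFrontier(final List<URL> seeds) {
        for (URL seed : seeds) {
            addRequest(new CrawlRequest(seed.toString(), 0));
        }
    }

    public void addCrawlResponse(final CrawlResponse response) {
        for (URL url : response.getExtractedUrls()) {
            addRequest(new CrawlRequest(url.toString(), response.getCrawlDepth()));
        }
    }

    public boolean hasNextRequest() {
        return !requestQueue.isEmpty();
    }

    public CrawlRequest getNextRequest() {
        return requestQueue.poll();
    }

    private void addRequest(final CrawlRequest request) {
        // The URL itself serves as its fingerprint here; a hash of a normalized URL works equally well.
        if (urlFingerprints.add(request.getUrl())) {
            requestQueue.add(request);
        }
    }
}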

Add multiple new features, refactor

  • Upgrade Selenium 3.0.0-beta2 to 3.0.0-beta3
  • Add the possibility of waiting some time between requests (delay between requests; see the sketch after this list)
  • Rename visitUrl to crawlUrl and visitUrls to crawlUrls
  • Remove thread creation from BaseCrawler (along with the configurations options)
  • Add the possibility of resuming a previous crawl
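
One way the delay between requests might be applied in the crawl loop (the configuration option providing the delay in milliseconds is an assumption):

import java.util.concurrent.TimeUnit;

// Illustrative only; where and how the delay is read from the configuration is an assumption.
public final class RequestDelayExample {

    public static void performDelay(final long delayInMillis) {
        try {
            // Wait the configured amount of time before sending the next request.
            TimeUnit.MILLISECONDS.sleep(delayInMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so a stop request is not swallowed.
            Thread.currentThread().interrupt();
        }
    }
}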

Cannot get HarResponse

Version: 2.1.1
OS: Windows
WebDriver: Chromium

First
Crawler.java, line 478: the filter is incorrect.

HarResponse harResponse = proxyServer.getHar().getLog().getEntries().stream()
        .filter(harEntry -> candidateUrl.equals(harEntry.getRequest().getUrl()))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No HAR entry for request URL"))
        .getResponse();

candidateUrl always ends with a trailing slash, while harEntry.getRequest().getUrl() never does, so the equals comparison never matches.
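
One possible fix (a sketch against the code quoted above, not the project's actual patch) would be to strip the trailing slash from both URLs before comparing them:

// Hypothetical fix sketch: normalize the trailing slash on both sides of the comparison.
private static String withoutTrailingSlash(final String url) {
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
}

HarResponse harResponse = proxyServer.getHar().getLog().getEntries().stream()
        .filter(harEntry -> withoutTrailingSlash(candidateUrl)
                .equals(withoutTrailingSlash(harEntry.getRequest().getUrl())))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No HAR entry for request URL"))
        .getResponse();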

Second
A violation of SOLID principles makes it difficult to fix this kind of issue cleanly.

