tokenmill / crawling-framework

Easily crawl news portals or blog sites using Storm Crawler.

License: Other

Java 98.26% Shell 0.43% Makefile 0.19% JavaScript 0.16% SCSS 0.96%

crawler crawling storm scraping java elasticsearch storm-crawler crawling-framework vaadin

crawling-framework's Introduction

Crawling Framework

Crawling Framework provides tools to configure and run a Storm Crawler based crawler. It is mainly aimed at easing the crawling of sites that publish article content, such as news portals and blogs. With the GUI tool that Crawling Framework provides, you can:

Specify which sites to crawl.
Configure URL inclusion and exclusion filters, thus controlling which sections of the site will be fetched.
Specify which elements of the page provide information about article publication name, its title and main body.
Define tests which validate that extraction rules are working.

Once configuration is done, Crawling Framework runs a Storm Crawler based crawl that follows the rules specified in the configuration.
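
To illustrate the URL inclusion and exclusion filters listed above, here is a minimal sketch of how such rules could be evaluated. It is not the framework's actual implementation: the class name UrlFilterSketch, the use of Java regular expressions, the "exclusion wins, empty inclusion list accepts everything" semantics and the example.com URLs are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only; not a class from Crawling Framework.
class UrlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    UrlFilterSketch(List<String> includes, List<String> excludes) {
        this.includes = compile(includes);
        this.excludes = compile(excludes);
    }

    private static List<Pattern> compile(List<String> regexes) {
        return regexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    /** Assumed semantics: any exclusion match rejects; otherwise an inclusion match is required. */
    boolean accepts(String url) {
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).find());
        if (excluded) {
            return false;
        }
        // Assumption: an empty inclusion list means "accept everything not excluded".
        return includes.isEmpty() || includes.stream().anyMatch(p -> p.matcher(url).find());
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                Arrays.asList("^https://example\\.com/news/"),
                Arrays.asList("/tag/", "\\.pdf$"));
        System.out.println(filter.accepts("https://example.com/news/2020/story"));
        System.out.println(filter.accepts("https://example.com/news/tag/politics"));
    }
}
```

Checking exclusions first keeps the rule easy to reason about: a section of the site can be switched off without touching the inclusion patterns.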

Introduction

We have recorded a video on how to set up and use Crawling Framework. It is available on YouTube.

Requirements

The framework writes its configuration and stores crawled data in ElasticSearch. Before starting a crawl project, install ElasticSearch (Crawling Framework is tested to work with ElasticSearch v7.x).

Crawling Framework is a Java library that has to be extended to run a Storm Crawler topology, so a Java toolchain (JDK 8, Maven) is required.

Using password protected ElasticSearch

Some providers put ElasticSearch behind an authentication step (which makes sense). Just set the environment variables ES_USERNAME and ES_PASSWORD accordingly; everything else can remain the same. Authentication is performed implicitly when proper credentials are present.
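
As a rough picture of what "implicit" authentication can look like, the sketch below reads the two environment variables and, only when both are set, builds an HTTP Basic Authorization header value. That the framework uses HTTP Basic authentication, and the class and method names used here, are assumptions for illustration; only the variable names ES_USERNAME and ES_PASSWORD come from this README.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical illustration only; not a class from Crawling Framework.
class EsCredentials {
    /** Returns a Basic auth header value, or empty when either credential is missing. */
    static Optional<String> basicAuthHeader(String username, String password) {
        if (username == null || username.isEmpty() || password == null || password.isEmpty()) {
            return Optional.empty();
        }
        String token = username + ":" + password;
        String encoded = Base64.getEncoder().encodeToString(token.getBytes(StandardCharsets.UTF_8));
        return Optional.of("Basic " + encoded);
    }

    public static void main(String[] args) {
        // The variable names are taken from the README; everything else is assumed.
        Optional<String> header = basicAuthHeader(
                System.getenv("ES_USERNAME"), System.getenv("ES_PASSWORD"));
        System.out.println(header.isPresent()
                ? "Credentials found: an Authorization header would be sent"
                : "No credentials set: connecting without authentication");
    }
}
```

With this shape, an unprotected local ElasticSearch keeps working with no configuration change, which matches the "everything else can remain the same" behaviour described above.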

Configuring and Running a crawl

See Crawling Framework Example project's documentation.

License

Distributed under the Apache License, Version 2.0.

crawling-framework's Issues

Bulk update updates docs with wrong url

URL status after partial data analysis should have a unique status

let's say PARTIAL_ANALYSIS

Log actual LD+JSON on parsing error

The error should also log the erroneous JSON so that we can learn how to pre-process it to avoid such errors.

WARN  l.t.c.p.u.JsonLdParser - Failed to parse ld+json
com.fasterxml.jackson.core.JsonParseException: Document contains more content after json-ld element - (possible mismatched {}?)
 at [Source: java.io.StringReader@72eeb417; line: 31, column: 10]
        at com.github.jsonldjava.utils.JsonUtils.fromJsonParser(JsonUtils.java:167) ~[crawler-standalone.jar:?]
        at com.github.jsonldjava.utils.JsonUtils.fromReader(JsonUtils.java:122) ~[crawler-standalone.jar:?]
        at com.github.jsonldjava.utils.JsonUtils.fromString(JsonUtils.java:190) ~[crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.utils.JsonLdParser.parse(JsonLdParser.java:37) [crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticleWithDetails(ArticleExtractor.java:35) [crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticle(ArticleExtractor.java:22) [crawler-standalone.jar:?]
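
A minimal sketch of the requested behaviour follows: when parsing fails, the warning carries the offending payload alongside the exception message. The Parser interface is a hypothetical stand-in for the real JSON-LD parsing call, and the class name, the Consumer-based logger and the truncation limit are likewise assumptions rather than the project's code.

```java
import java.util.function.Consumer;

// Hypothetical illustration only; not the project's JsonLdParser.
class LdJsonLogging {
    /** Stand-in for the real JSON-LD parsing call, which may throw on malformed input. */
    interface Parser {
        Object parse(String json) throws Exception;
    }

    // Assumed cap so that a huge payload does not flood the log.
    static final int MAX_LOGGED_CHARS = 2000;

    /** Parses the payload; on failure logs the message plus the raw JSON and returns null. */
    static Object parseOrLog(String json, Parser parser, Consumer<String> warn) {
        try {
            return parser.parse(json);
        } catch (Exception e) {
            String snippet = json.length() > MAX_LOGGED_CHARS
                    ? json.substring(0, MAX_LOGGED_CHARS) + "...[truncated]"
                    : json;
            warn.accept("Failed to parse ld+json: " + e.getMessage()
                    + "\nOffending payload:\n" + snippet);
            return null;
        }
    }

    public static void main(String[] args) {
        Parser alwaysFails = json -> {
            throw new IllegalArgumentException("Document contains more content after json-ld element");
        };
        parseOrLog("{\"@type\":\"Article\"} trailing", alwaysFails, System.err::println);
    }
}
```

Truncating the logged payload is a design choice, not a requirement of the issue; the limit can be raised if full documents are needed for debugging.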

Autocomplete on source field

When adding a new test from the tests view, the name field should autocomplete.

Management UI null pointer exception

java.lang.NullPointerException: index must not be null
	at java.util.Objects.requireNonNull(Objects.java:228)
	at org.elasticsearch.action.search.SearchRequest.indices(SearchRequest.java:140)
	at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:111)
	at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:101)
	at lt.tokenmill.crawling.es.EsHttpUrlOperations.calculateStats(EsHttpUrlOperations.java:141)
	at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:245)
	at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:229)
	at com.vaadin.data.util.GeneratedPropertyContainer$GeneratedProperty.getValue(GeneratedPropertyContainer.java:95)
	at com.vaadin.ui.Grid$RowDataGenerator.writeData(Grid.java:2232)
	at com.vaadin.ui.Grid$RowDataGenerator.generateData(Grid.java:2189)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.getRowData(RpcDataProviderExtension.java:404)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.pushRowData(RpcDataProviderExtension.java:392)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.beforeClientResponse(RpcDataProviderExtension.java:334)
	at com.vaadin.server.communication.UidlWriter.write(UidlWriter.java:112)
	at com.vaadin.server.communication.UidlRequestHandler.writeUidl(UidlRequestHandler.java:124)
	at com.vaadin.server.communication.UidlRequestHandler.synchronizedHandleRequest(UidlRequestHandler.java:92)
	at com.vaadin.server.SynchronizedRequestHandler.handleRequest(SynchronizedRequestHandler.java:41)
	at com.vaadin.server.VaadinService.handleRequest(VaadinService.java:1422)
	at com.vaadin.server.VaadinServlet.service(VaadinServlet.java:379)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:524)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
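
The trace shows SearchRequest rejecting a null index name that was passed down from EsHttpUrlOperations.calculateStats. The root cause is not stated in the issue; one plausible reading is that the index name was never configured. Under that assumption, a defensive guard could skip the query instead of failing the whole UI request. The names StatsGuard and countIfIndexConfigured are hypothetical, and the counter function stands in for the real ElasticSearch call.

```java
import java.util.Optional;
import java.util.function.ToLongFunction;

// Hypothetical illustration only; not a class from Crawling Framework.
class StatsGuard {
    /**
     * Runs the counting query only when an index name is available.
     * Returns empty instead of throwing when the name is null or blank.
     */
    static Optional<Long> countIfIndexConfigured(String indexName, ToLongFunction<String> counter) {
        if (indexName == null || indexName.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(counter.applyAsLong(indexName));
    }

    public static void main(String[] args) {
        // The lambda stands in for a real search call; 42 is a made-up value.
        System.out.println(countIfIndexConfigured(null, index -> 42L).isPresent());
        System.out.println(countIfIndexConfigured("urls", index -> 42L).isPresent());
    }
}
```

The caller can then render a placeholder in the grid when the result is empty, so a missing setting degrades one column rather than the whole view.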

Deleting http source should invoke dialog menu

HttpArticle should include language attribute

To make article search easier, it would be good to be able to filter articles by language.

Currently, if I want to search only articles that are in German, I first have to filter the HttpSources that have language de and collect those sources, then filter articles by those sources, and only then execute the search.

With this change, we would remove the first step.
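
The effect of the proposal can be sketched with a simplified article type that carries its own language, so that filtering no longer needs a source lookup. The Article class below is a hypothetical stand-in for HttpArticle, with invented field names and example URLs; the real class and the real ElasticSearch query are not shown in this issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; Article is a simplified stand-in for HttpArticle.
class ArticleLanguageFilter {
    static final class Article {
        final String url;
        final String source;
        final String language; // the proposed attribute, e.g. "de"

        Article(String url, String source, String language) {
            this.url = url;
            this.source = source;
            this.language = language;
        }
    }

    /** Single-step filter: no need to resolve sources by language first. */
    static List<Article> byLanguage(List<Article> articles, String language) {
        return articles.stream()
                .filter(a -> language.equals(a.language))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Article> articles = Arrays.asList(
                new Article("https://example.de/a", "example.de", "de"),
                new Article("https://example.com/b", "example.com", "en"));
        for (Article a : byLanguage(articles, "de")) {
            System.out.println(a.url);
        }
    }
}
```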

Upload CSV with sources, related (#2)
Check which ones are already configured.
Other validations: TODO.
Export a CSV with the results.

[ ] use name
[ ] analyze source keyword

Increasing topology workers stops crawling

Setting Topology.worker=4 stops the crawl.
If this fails, there is no point in using a Storm cluster.