Issues for tokenmill's crawling-framework

Article fingerprint for deduplication

Use an approach similar to Solr's deduplication: https://wiki.apache.org/solr/Deduplication
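
A minimal sketch of what an exact-match fingerprint could look like, assuming the fingerprint is computed over the article title and text. The class name `ArticleFingerprint`, the choice of fields, and the normalization rules are assumptions for illustration, not part of crawling-framework. This only catches duplicates that are identical after normalization; near-duplicates would need a fuzzy signature instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, not part of crawling-framework.
final class ArticleFingerprint {

    // Lowercase and collapse whitespace so trivial formatting
    // differences do not change the fingerprint.
    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    // SHA-256 over the normalized title and text, returned as lowercase hex.
    static String of(String title, String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(normalize(title).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so ("ab", "c") differs from ("a", "bc")
            md.update(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two articles whose fingerprints match could then be treated as duplicates at index time, for example by storing the fingerprint in a keyword field and looking it up before writing.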

Management UI paging of HTTP sources

Currently only 100 sources, sorted alphabetically, are displayed.

BulkProcessor and RestHighLevelClient treat URL-encoded URLs differently

When docs are indexed using the bulk processor, a subsequent get by URL fails.
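
The root cause is not stated in this issue. Assuming the URL itself is used as the document ID, one possible workaround is to derive an ID that contains no characters needing escaping, so both write and read paths see the same string. The sketch below uses URL-safe Base64; the class `UrlDocId` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of crawling-framework.
final class UrlDocId {

    // Encode the URL with the URL-safe Base64 alphabet (A-Z, a-z, 0-9, '-', '_'),
    // so the resulting ID never needs percent-encoding in a request path.
    static String of(String url) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(url.getBytes(StandardCharsets.UTF_8));
    }

    // Recover the original URL from an ID produced by of().
    static String toUrl(String id) {
        return new String(Base64.getUrlDecoder().decode(id), StandardCharsets.UTF_8);
    }
}
```

Base64 makes the ID longer than the URL, and Elasticsearch limits document ID length, so a fixed-length digest of the URL may be a better fit for very long URLs.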

Evaluate http sources from CSV

Upload a CSV with sources (related to #2)
Check which ones are already configured.
other validations TODO.
export CSV with results

Increasing topology workers stops crawling

Setting topology.worker=4 stops the crawl.
If this fails, there is no point in using a Storm cluster.

ES index config should have `fielddata` enabled for text fields

This would make it possible to create tag cloud visualizations on the crawled text.

Extracted document should have a name from the source for filtering

[ ] use name
[ ] analyze source keyword

Log actual LD+JSON on parsing error

The error log should also include the offending JSON so that we can learn how to pre-process it and avoid such errors.

WARN  l.t.c.p.u.JsonLdParser - Failed to parse ld+json
com.fasterxml.jackson.core.JsonParseException: Document contains more content after json-ld element - (possible mismatched {}?)
 at [Source: java.io.StringReader@72eeb417; line: 31, column: 10]
        at com.github.jsonldjava.utils.JsonUtils.fromJsonParser(JsonUtils.java:167) ~[crawler-standalone.jar:?]
        at com.github.jsonldjava.utils.JsonUtils.fromReader(JsonUtils.java:122) ~[crawler-standalone.jar:?]
        at com.github.jsonldjava.utils.JsonUtils.fromString(JsonUtils.java:190) ~[crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.utils.JsonLdParser.parse(JsonLdParser.java:37) [crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticleWithDetails(ArticleExtractor.java:35) [crawler-standalone.jar:?]
        at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticle(ArticleExtractor.java:22) [crawler-standalone.jar:?]
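
A minimal sketch of logging the raw payload together with the exception. It uses `java.util.logging` to stay self-contained, and the `RawParser` interface is a hypothetical stand-in for the real parser call seen in the trace above; the class name and the truncation limit are likewise assumptions.

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper, not the project's JsonLdParser.
final class LoggingJsonLdParser {
    private static final Logger LOG = Logger.getLogger(LoggingJsonLdParser.class.getName());
    private static final int MAX_LOGGED_CHARS = 2000; // arbitrary cap to keep logs readable

    // Stand-in for the real parsing call.
    interface RawParser {
        Object parse(String json) throws Exception;
    }

    // Returns the parsed value, or empty if parsing failed.
    // On failure the raw (possibly truncated) JSON is logged with the exception.
    static Optional<Object> parse(String json, RawParser parser) {
        try {
            return Optional.ofNullable(parser.parse(json));
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Failed to parse ld+json, payload was:\n" + truncate(json), e);
            return Optional.empty();
        }
    }

    static String truncate(String s) {
        if (s == null) {
            return "null";
        }
        return s.length() <= MAX_LOGGED_CHARS ? s : s.substring(0, MAX_LOGGED_CHARS) + "... [truncated]";
    }
}
```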

URL status after partial data analysis should have a unique status

For example, PARTIAL_ANALYSIS.

Management UI write actions should be committed immediately

Use ES {:refresh true} with synchronous writes and remove the timeout to speed up the UI.

Do not use bulk for UI.

Source config should indicate when a crawl is not happening

The Stats button shows the status of the crawl, but if nothing has been crawled it would be good to see that in the table without opening the stats popup, for example by making the Stats button reddish.

Management UI should have paging for tests

URL date should not be used as publication date hint

Start crawler with a simple `docker-compose up`

Management UI null pointer exception

java.lang.NullPointerException: index must not be null
	at java.util.Objects.requireNonNull(Objects.java:228)
	at org.elasticsearch.action.search.SearchRequest.indices(SearchRequest.java:140)
	at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:111)
	at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:101)
	at lt.tokenmill.crawling.es.EsHttpUrlOperations.calculateStats(EsHttpUrlOperations.java:141)
	at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:245)
	at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:229)
	at com.vaadin.data.util.GeneratedPropertyContainer$GeneratedProperty.getValue(GeneratedPropertyContainer.java:95)
	at com.vaadin.ui.Grid$RowDataGenerator.writeData(Grid.java:2232)
	at com.vaadin.ui.Grid$RowDataGenerator.generateData(Grid.java:2189)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.getRowData(RpcDataProviderExtension.java:404)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.pushRowData(RpcDataProviderExtension.java:392)
	at com.vaadin.server.communication.data.RpcDataProviderExtension.beforeClientResponse(RpcDataProviderExtension.java:334)
	at com.vaadin.server.communication.UidlWriter.write(UidlWriter.java:112)
	at com.vaadin.server.communication.UidlRequestHandler.writeUidl(UidlRequestHandler.java:124)
	at com.vaadin.server.communication.UidlRequestHandler.synchronizedHandleRequest(UidlRequestHandler.java:92)
	at com.vaadin.server.SynchronizedRequestHandler.handleRequest(SynchronizedRequestHandler.java:41)
	at com.vaadin.server.VaadinService.handleRequest(VaadinService.java:1422)
	at com.vaadin.server.VaadinServlet.service(VaadinServlet.java:379)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:524)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)

Bulk update updates docs with wrong url

Use canonical url for duplicate detection

A page can have multiple URLs with different params, but it will have a single canonical URL that uniquely identifies it. We need to use this for duplicate checks.
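
A minimal sketch of choosing the key for duplicate checks, assuming the canonical link has already been extracted from the page. `DedupKey` and its parameters are hypothetical names. The canonical value is resolved against the fetched URL because it may be relative, and the fetched URL is used as a fallback when no usable canonical URL is present.

```java
import java.net.URI;

// Hypothetical helper, not part of crawling-framework.
final class DedupKey {

    // fetchedUrl: the URL the page was downloaded from.
    // canonicalHref: the canonical link value from the page, or null if absent.
    static String of(String fetchedUrl, String canonicalHref) {
        if (canonicalHref == null || canonicalHref.trim().isEmpty()) {
            return fetchedUrl;
        }
        try {
            // resolve() handles both absolute and relative canonical links.
            return URI.create(fetchedUrl).resolve(canonicalHref.trim()).normalize().toString();
        } catch (IllegalArgumentException e) {
            // Malformed URL or canonical value: fall back to the fetched URL.
            return fetchedUrl;
        }
    }
}
```

With this, `https://example.com/news/a?utm_source=x` and `https://example.com/news/a?ref=1` map to the same key when both pages declare `https://example.com/news/a` as canonical.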

Auto-create ElasticSearch indexes if they don't exist

This would allow us to get rid of several setup scripts that create indexes.

Relative date should be the date of parsing

In cases where a news article provides only the time of day, e.g. "15:04", we can write the format HH:mm, but the date then defaults to 1970-01-01.

Deleting an HTTP source should invoke a dialog
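
A minimal sketch of the proposed behavior using `java.time`, assuming the time-only value is combined with the date on which the page is parsed. `TimeOnlyDates` and `resolve` are hypothetical names, and the project's actual date parsing may use a different library.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of crawling-framework.
final class TimeOnlyDates {
    private static final DateTimeFormatter HH_MM = DateTimeFormatter.ofPattern("HH:mm");

    // Combine a time-only string such as "15:04" with the date of parsing,
    // instead of letting the date default to 1970-01-01.
    static LocalDateTime resolve(String timeOfDay, LocalDate parseDate) {
        return LocalTime.parse(timeOfDay, HH_MM).atDate(parseDate);
    }
}
```

A caller would pass `LocalDate.now()` as the parse date. An article published shortly before midnight but parsed after midnight would be assigned the following day, so a real implementation may want to handle that case.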

Separate integration tests from unit tests

Prepare a demo crawler with bbc configuration

REST API to import/export crawler configuration

Currently, configuration can be managed only through the Administration UI.

Management UI search as you type

The search field should perform a search on every keystroke.

Autocomplete on source field

When adding a new test from the tests view, the name field should autocomplete.

Provide default values for ES index creation script

Search for tests

In the HTTP sources tests window, search is not working. It should search on every keystroke.

HttpArticle should include language attribute

To make article search easier, it would be good to be able to filter articles by language.

The problem now is that to search only German articles, I first have to filter HttpSources by language de and collect the sources, then filter articles by those sources, and only then execute the search.

With this change, the first step would no longer be needed.

Create example project which uses crawling-framework

An example is better than any documentation.
