tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
License: Other
Use something like this https://wiki.apache.org/solr/Deduplication
Currently only 100 sources, sorted alphabetically, are displayed.
When documents are indexed using the bulk processor, a subsequent get by URL fails.
Increasing to Topology.worker=4 stops the crawl.
If that fails, there is no use in running a Storm cluster.
This would make it possible to create tag cloud visualizations of the crawled text.
The error should also log the erroneous JSON so that we can learn how to pre-process it and avoid such errors.
WARN l.t.c.p.u.JsonLdParser - Failed to parse ld+json
com.fasterxml.jackson.core.JsonParseException: Document contains more content after json-ld element - (possible mismatched {}?)
at [Source: java.io.StringReader@72eeb417; line: 31, column: 10]
at com.github.jsonldjava.utils.JsonUtils.fromJsonParser(JsonUtils.java:167) ~[crawler-standalone.jar:?]
at com.github.jsonldjava.utils.JsonUtils.fromReader(JsonUtils.java:122) ~[crawler-standalone.jar:?]
at com.github.jsonldjava.utils.JsonUtils.fromString(JsonUtils.java:190) ~[crawler-standalone.jar:?]
at lt.tokenmill.crawling.parser.utils.JsonLdParser.parse(JsonLdParser.java:37) [crawler-standalone.jar:?]
at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticleWithDetails(ArticleExtractor.java:35) [crawler-standalone.jar:?]
at lt.tokenmill.crawling.parser.ArticleExtractor.extractArticle(ArticleExtractor.java:22) [crawler-standalone.jar:?]
let's say PARTIAL_ANALYSIS
Use ES {:refresh true} with synchronous writes and remove the timeout to speed up the UI.
Do not use bulk for the UI.
The stats button shows the status of the crawl, but if nothing has been crawled it would be good to see that in the table without opening the stats popup, for example by making the stats button reddish.
java.lang.NullPointerException: index must not be null
at java.util.Objects.requireNonNull(Objects.java:228)
at org.elasticsearch.action.search.SearchRequest.indices(SearchRequest.java:140)
at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:111)
at org.elasticsearch.action.search.SearchRequest.<init>(SearchRequest.java:101)
at lt.tokenmill.crawling.es.EsHttpUrlOperations.calculateStats(EsHttpUrlOperations.java:141)
at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:245)
at lt.tokenmill.crawling.adminui.view.HttpSourcesView$StatsButtonPropertyGenerator.getValue(HttpSourcesView.java:229)
at com.vaadin.data.util.GeneratedPropertyContainer$GeneratedProperty.getValue(GeneratedPropertyContainer.java:95)
at com.vaadin.ui.Grid$RowDataGenerator.writeData(Grid.java:2232)
at com.vaadin.ui.Grid$RowDataGenerator.generateData(Grid.java:2189)
at com.vaadin.server.communication.data.RpcDataProviderExtension.getRowData(RpcDataProviderExtension.java:404)
at com.vaadin.server.communication.data.RpcDataProviderExtension.pushRowData(RpcDataProviderExtension.java:392)
at com.vaadin.server.communication.data.RpcDataProviderExtension.beforeClientResponse(RpcDataProviderExtension.java:334)
at com.vaadin.server.communication.UidlWriter.write(UidlWriter.java:112)
at com.vaadin.server.communication.UidlRequestHandler.writeUidl(UidlRequestHandler.java:124)
at com.vaadin.server.communication.UidlRequestHandler.synchronizedHandleRequest(UidlRequestHandler.java:92)
at com.vaadin.server.SynchronizedRequestHandler.handleRequest(SynchronizedRequestHandler.java:41)
at com.vaadin.server.VaadinService.handleRequest(VaadinService.java:1422)
at com.vaadin.server.VaadinServlet.service(VaadinServlet.java:379)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:524)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
A page can have multiple URLs with different params, but it has a single canonical URL that uniquely identifies it. We need to use this for duplicate checks.
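A minimal sketch of the canonical URL idea, assuming the page declares it in a `<link rel="canonical">` tag. The class `CanonicalUrls` and the method `dedupKey` are hypothetical names, not part of the crawling framework, and the regex matching is only an illustration; a real HTML parser would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the crawling framework.
final class CanonicalUrls {
    // Matches a whole <link ...> tag; attributes are checked separately because their order varies.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL =
            Pattern.compile("\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL declared in the HTML, or the fetched URL when none is declared.
    static String dedupKey(String html, String fetchedUrl) {
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1).trim();
                }
            }
        }
        return fetchedUrl;
    }
}
```

Two fetched URLs that differ only in tracking params would then map to the same key, as long as both pages declare the same canonical link. Relative canonical hrefs are returned unresolved in this sketch.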
This would allow us to get rid of several setup scripts that create indexes.
When a news article provides only the time of day, e.g. "15:04", we can write the format HH:mm, but the date then defaults to 1970-01-01.
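One possible way to handle a time-only value is to parse it as a time of day and combine it with a known date, such as the day the page was crawled. The sketch below uses only `java.time`; the class `TimeOnlyDates` and the method `resolve` are hypothetical names, and choosing the crawl date as the fallback is an assumption, not the framework's behavior.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the crawling framework.
final class TimeOnlyDates {
    private static final DateTimeFormatter TIME_ONLY = DateTimeFormatter.ofPattern("HH:mm");

    // Parses a time-only string such as "15:04" and attaches the given fallback date
    // (for example the crawl date) instead of letting the date default to 1970-01-01.
    static LocalDateTime resolve(String timeText, LocalDate fallbackDate) {
        LocalTime time = LocalTime.parse(timeText.trim(), TIME_ONLY);
        return LocalDateTime.of(fallbackDate, time);
    }
}
```

For example, `TimeOnlyDates.resolve("15:04", LocalDate.of(2018, 3, 9))` yields a date-time on 2018-03-09 at 15:04. Time zones are left out of this sketch.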
Currently, configuration can be managed only through the Administration UI.
The search field should perform a search on every keystroke.
When adding a new test from the tests view, the name field should autocomplete.
In the HTTP sources tests window, search is not working. It should search on every keystroke.
To make article search easier, it would be good to filter articles by language.
The problem now is that to search only in German articles, I first have to filter HttpSources that have language de and get the sources, then filter articles by those sources and execute the search.
With this change, the first step would no longer be needed.
Example is better than any documentation