
sce's Introduction

Sparkler Crawl Environment

The Sparkler Crawl Environment aims to provide an efficient, scalable, consistent, and reliable software architecture of domain discovery tools that enrich a given domain by expanding the collection of artifacts that define it.

This repository, named sce, provides a command-line utility for building the Sparkler Crawl Environment as a multi-container Docker application that runs through Docker Compose on a single node. As a proof of concept, you can install the Sparkler Crawl Environment on a single node using the kickstart.sh bash script, which automatically builds and starts all the software components:

./kickstart.sh [-l /path/to/log]
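
Under the hood, kickstart.sh essentially builds and starts the Compose application. A minimal sketch of the idea (the compose file path and flag handling here are assumptions, not the script's exact contents):

LOG=/dev/stdout
while getopts "l:" opt; do
  case $opt in
    l) LOG=$OPTARG ;;   # optional log file, defaults to stdout
  esac
done
docker-compose -f compose/docker-compose.yml build 2>&1 | tee -a "$LOG"
docker-compose -f compose/docker-compose.yml up -d 2>&1 | tee -a "$LOG"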

The Sparkler Crawl Environment is built on top of Sparkler, a web crawler that takes advantage of recent advancements in distributed computing and information retrieval by combining various Apache projects, including Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler: an evolution of Apache Nutch that runs on an Apache Spark cluster.

sce's People

Contributors

giuseppetotaro, sujen1412, wmburke


Forkers

davtalab ahmadika

sce's Issues

Check for ports already in use

@giuseppetotaro
For docker-compose up to work correctly, we need to check that the following ports are available:

  1. 8983
  2. 5000
  3. 9559
  4. 4444

If these ports are not available, should the kickstart.sh script ask the user to free them? A check along the lines of the sketch below could run before docker-compose up.
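
A minimal sketch of such a check (it assumes lsof is installed; the port list is the one above):

for port in 8983 5000 9559 4444; do
  # lsof exits 0 if something is already listening on the port
  if lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Port $port is already in use; please free it before running kickstart.sh." >&2
    exit 1
  fi
done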

Documentation on how to construct a keyword list needs to be added to the wiki or somewhere accessible

Users expecting to use the SCE for a particular purpose may need to regenerate a model from scratch during an initial exploration phase. For example, when Ketil reviewed the 600 URLs, he ended up having to define detailed descriptions for himself of how to score a document. I have had to do that as well, since my initial guess at what was the "right stuff" turned out to be inadequate (not specific enough). My second pass at a rule set is:

+ Green rules:

A document that defines permafrost-related terminology and likely also contains information about the history of the term. Examples: permafrost-related review papers, comprehensive dictionaries, Wikipedia articles with historical discussions of terms, etc.

! Orange rules:

Documents that are about permafrost-related terms but which do not have historical information: newspaper articles, etc.

- Red rules:

Documents using permafrost terms but which are about businesses, games, etc. Totally irrelevant stuff.

NOTE: Anyone have a better way of colorizing markdown text?

The crawl dashboard is not coming up

I get the following when I try to bring the crawl dashboard up:

(screenshot: error page in the browser, 2019-03-05)

I get the same error when I go to http://0.0.0.0:8983/solr/#/, and ditto with http://0.0.0.0:8983/banana/.

However, when I check the docker images, they look fine (though there are extras given the testing I was doing a while ago):

(screenshot: output of docker images, 2019-03-05)

and the containers seem to be fine as well:

(screenshot: output of docker ps, 2019-03-05)

but the logs do show that there is a problem: the sce.log repeats the following error every second or two:

2019-03-06 05:36:47 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-06 05:36:49 INFO Crawler$:147 [main] - Committing crawldb..
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:49)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:617)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:148)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:48)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:310)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at shaded.org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at shaded.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at shaded.org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
... 17 more

It looks like Solr is not up and running, or at least neither I nor the SCE can connect to it.
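
One way to confirm this is to check Solr from inside the container itself. A diagnostic sketch (the grep pattern for the container name is an assumption, and it assumes curl is available inside the container):

SPARKLER=$(docker ps --format '{{.Names}}' | grep -i sparkler | head -n 1)
# ask Solr for the status of its cores; no response means Solr itself is down
docker exec "$SPARKLER" curl -s 'http://localhost:8983/solr/admin/cores?action=STATUS'
docker logs --tail 50 "$SPARKLER"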

Suggestions?

sce.sh functionality

The default functionality for sce.sh should be the following (see the sketch after this list):

  1. if -i (number of iterations) is not specified, run until stopped
  2. if -l (log) is not specified, it should automatically create a log while also outputting to screen
  3. if -id (job id) is not specified, it should automatically create one based on the seed-list name and a timestamp (or whatever makes sense)
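
A minimal sketch of those defaults in bash (the variable names and the seed-list extension are assumptions):

ITERATIONS=${ITERATIONS:--1}                                            # -1: run until stopped
SEED_LIST=${SEED_LIST:-seed.txt}
JOB_ID=${JOB_ID:-$(basename "$SEED_LIST" .txt)-$(date +%Y%m%d%H%M%S)}   # seed-list name + timestamp
LOG=${LOG:-sce-$JOB_ID.log}
exec > >(tee -a "$LOG") 2>&1                                            # log to file while also outputting to screen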

dumper.sh -l default

if -l (log) is not specified, it should automatically create a log while also outputting to screen

pagination on search results

There was a request during DD Eval 2 from the SEC domain to be able to mark more search results from a given query. Seems like a good idea.
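
On the Solr side, paging is already supported through the start and rows parameters, so this would mainly need UI support in the DD tool. An illustrative query against the crawldb core (the query itself is a placeholder):

curl 'http://localhost:8983/solr/crawldb/select?q=*:*&start=20&rows=20&wt=json'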

And now ./dumper.sh is not working either

The error is:

Ruths-iMac:sce-master ruthduerr$ ./dumper.sh
python: can't open file '/projects/sce/dumper/cdrv3_exporter.py': [Errno 2] No such file or directory
The dump of segments has been started. All the log messages will be reported also to /dev/stdout
Ruths-iMac:sce-master ruthduerr$

It looks like this is the line with the failure:

docker exec compose_domain-discovery_1 python /projects/sce/dumper/cdrv3_exporter.py

which implies that the compose_domain-discovery_1 container is having trouble finding that file. It is true that the system has been stopped and started several times using kickstart.sh.
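
A quick way to verify what the container actually sees (a diagnostic sketch; the find pattern is an assumption):

docker exec compose_domain-discovery_1 ls -l /projects/sce/dumper
docker exec compose_domain-discovery_1 find /projects -maxdepth 3 -name 'cdrv3*'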

Enable deep crawling using Sparkler

To enable deep crawling, we would need two Sparkler deployments configured to work together, which would change the way we deploy the system. Instead of running an embedded Solr within the Sparkler container, we would separate it out into its own container. We would then have five containers running: 1 Solr, 2 Sparklers, 1 DD tool, and 1 Firefox.
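
A hedged sketch of what such a Compose layout could look like (service names, image names, and versions are assumptions, not the current configuration):

cat > docker-compose-deepcrawl.yml <<'EOF'
version: '2'
services:
  solr:                        # crawldb moves out of the Sparkler container
    image: solr:6
    ports:
      - "8983:8983"
  sparkler-shallow:
    image: uscdatascience/sparkler:latest
    depends_on:
      - solr
  sparkler-deep:
    image: uscdatascience/sparkler:latest
    depends_on:
      - solr
  domain-discovery:
    image: sce/domain-discovery:latest
    ports:
      - "5000:5000"
  firefox:
    image: selenium/standalone-firefox
    ports:
      - "4444:4444"
EOF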

Solr keeps shutting down

I have managed to successfully launch and run a crawl; however, it crashes after a while (and the time between crashes has been getting shorter). When that happens, it looks like Solr is down. Is it not configured correctly? Is it getting full?

Add documentation into the wiki

At this point, I'm just copying it over from the Memex wiki so that it is readily accessible.

Also, I need these images over there:

ddtool-2017-08-31

docker_compose

kickstart.sh should report the URLs more accurately

As kickstart completes, it reports the following information:

The Solr instance is available on http://0.0.0.0:8983
The DD explorer is available on http://0.0.0.0:5000/explorer

This is only accurate on a local install.
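
A minimal sketch of a more accurate report (hostname -I is Linux-only, so a fallback is included):

HOST_IP=$(hostname -I 2>/dev/null | awk '{print $1}')   # empty on macOS
HOST_IP=${HOST_IP:-localhost}
echo "The Solr instance is available on http://${HOST_IP}:8983"
echo "The DD explorer is available on http://${HOST_IP}:5000/explorer"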

Anyone seen this error before?

It seems to happen when a key phrase is entered and the system never comes back with 12 URLs to label. Instead, after a long, long time (hours?), the system pops up the following:

(screenshot: error dialog, 2019-03-05)

and then new searches do nothing; worse yet, the front end appears to lose the ability to read the model. The only way I've found so far to fix this is to run

./kickstart.sh -stop

then

./kickstart.sh -start

Fortunately, when this is done the model information reloads OK, so I haven't lost all of my labeled URLs.

Adaptive fetch schedule

Port the adaptive fetch schedule from Nutch so that the time between crawls of deep-crawled sites is determined dynamically; see the sketch below.
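
A minimal sketch of the idea behind Nutch's AdaptiveFetchSchedule (the rates and bounds are assumptions, roughly matching Nutch's defaults):

INC_RATE=0.4                   # grow the interval when a page is unchanged
DEC_RATE=0.2                   # shrink the interval when a page has changed
MIN_INTERVAL=60                # one minute
MAX_INTERVAL=$((60*60*24*30))  # thirty days

next_interval() {
  # $1 = current interval in seconds, $2 = "changed" or "unchanged"
  local i
  if [ "$2" = changed ]; then
    i=$(awk -v i="$1" -v r="$DEC_RATE" 'BEGIN { printf "%d", i * (1 - r) }')
  else
    i=$(awk -v i="$1" -v r="$INC_RATE" 'BEGIN { printf "%d", i * (1 + r) }')
  fi
  [ "$i" -lt "$MIN_INTERVAL" ] && i=$MIN_INTERVAL
  [ "$i" -gt "$MAX_INTERVAL" ] && i=$MAX_INTERVAL
  echo "$i"
}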

Banana dashboard needs tweaking

Watching a crawl this week, I've noticed a number of things that I haven't been able to resolve. This may just reflect my novice interactions with Banana, but I wanted to document them nonetheless:

  1. It sometimes doubles numbers. For example, as I was just watching it, it would jump from 57k fetched docs to 114k fetched docs, then refresh and drop back to 57k. The correct number is 57k.
  2. It auto-refreshes very often (more often than every 30s, I think), and if I try to change the time/date in the timepicker, it quickly changes it back.

kickstart.sh -l default

if -l (log) is not specified, it should automatically create a log while also outputting to screen
