
sce's Introduction

Sparkler Crawl Environment

The Sparkler Crawl Environment aims to provide an efficient, scalable, consistent, and reliable software architecture of domain discovery tools that enrich a given domain by expanding the collection of artifacts that define it.

This repository, named sce, provides a command-line utility for building the Sparkler Crawl Environment as a multi-container Docker application that runs through Docker Compose on a single node. As a proof of concept, you can install the Sparkler Crawl Environment on a single node using the kickstart.sh bash script, which automatically builds and starts all the software components:

./kickstart.sh [-l /path/to/log]
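
Under the hood, kickstart.sh essentially builds and starts the Compose application. A minimal sketch of the idea (the compose file path and flag handling here are assumptions, not the script's exact contents):

LOG=/dev/stdout
while getopts "l:" opt; do
  case $opt in
    l) LOG=$OPTARG ;;   # optional log file, defaults to stdout
  esac
done
docker-compose -f compose/docker-compose.yml build 2>&1 | tee -a "$LOG"
docker-compose -f compose/docker-compose.yml up -d 2>&1 | tee -a "$LOG"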

The Sparkler Crawl Environment is built on top of Sparkler, a web crawler that takes advantage of recent advancements in distributed computing and information retrieval by combining various Apache projects, including Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler: an evolution of Apache Nutch that runs on an Apache Spark cluster.

sce's People

Contributors

giuseppetotaro, sujen1412, wmburke


Forkers

davtalab ahmadika

sce's Issues

Check for ports already in use

@giuseppetotaro
For docker-compose up to work correctly, we need to check that the following ports are available:

  1. 8983
  2. 5000
  3. 9559
  4. 4444

If these ports are not available, should the kickstart.sh script ask the user to free them? A check along the lines of the sketch below could run before docker-compose up.
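
A minimal sketch of such a check (it assumes lsof is installed; the port list is the one above):

for port in 8983 5000 9559 4444; do
  # lsof exits 0 if something is already listening on the port
  if lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Port $port is already in use; please free it before running kickstart.sh." >&2
    exit 1
  fi
done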

Documentation on how to construct a keyword list needs to be added to the wiki or somewhere accessible

Users expecting to use the SCE for a particular purpose may need to regenerate a model from scratch during an initial exploration phase. For example, when Ketil reviewed the 600 URLs, he ended up having to define detailed descriptions for himself of how to score a document. I have had to do that as well, since my initial guess at what was the "right stuff" turned out to be inadequate (not specific enough). My second pass at a rule set is:

+ Green rules:

A document that defines permafrost-related terminology and likely also contains information about the history of the term. Examples: permafrost-related review papers, comprehensive dictionaries, Wikipedia articles with historical discussions of terms, etc.

! Orange rules:

Documents that are about permafrost-related terms but which do not have historical information: newspaper articles, etc.

- Red rules:

Documents using permafrost terms but which are about businesses, games, etc. Totally irrelevant stuff.

NOTE: Anyone have a better way of colorizing markdown text?

The crawl dashboard is not coming up

I get the following when I try to bring the crawl dashboard up:

(screenshot: error page in the browser, 2019-03-05)

I get the same error when I go to http://0.0.0.0:8983/solr/#/, and ditto with http://0.0.0.0:8983/banana/.

However, when I check the docker images, they look fine (though there are extras given the testing I was doing a while ago):

(screenshot: output of docker images, 2019-03-05)

and the containers seem to be fine as well:

(screenshot: output of docker ps, 2019-03-05)

but the logs do show that there is a problem: the sce.log repeats the following error every second or two:

2019-03-06 05:36:47 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-06 05:36:49 INFO Crawler$:147 [main] - Committing crawldb..
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:49)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:617)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:148)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:48)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:310)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at shaded.org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at shaded.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at shaded.org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
... 17 more

It looks like Solr is not up and running, or at least neither I nor the SCE can connect to it.
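
One way to confirm this is to check Solr from inside the container itself. A diagnostic sketch (the grep pattern for the container name is an assumption, and it assumes curl is available inside the container):

SPARKLER=$(docker ps --format '{{.Names}}' | grep -i sparkler | head -n 1)
# ask Solr for the status of its cores; no response means Solr itself is down
docker exec "$SPARKLER" curl -s 'http://localhost:8983/solr/admin/cores?action=STATUS'
docker logs --tail 50 "$SPARKLER"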

Suggestions?

sce.sh functionality

The default functionality for sce.sh should be the following (see the sketch after this list):

  1. if -i (number of iterations) is not specified, run until stopped
  2. if -l (log) is not specified, it should automatically create a log while also outputting to screen
  3. if -id (job id) is not specified, it should automatically create one based on the seed-list name and a timestamp (or whatever makes sense)
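
A minimal sketch of those defaults in bash (the variable names and the seed-list extension are assumptions):

ITERATIONS=${ITERATIONS:--1}                                            # -1: run until stopped
SEED_LIST=${SEED_LIST:-seed.txt}
JOB_ID=${JOB_ID:-$(basename "$SEED_LIST" .txt)-$(date +%Y%m%d%H%M%S)}   # seed-list name + timestamp
LOG=${LOG:-sce-$JOB_ID.log}
exec > >(tee -a "$LOG") 2>&1                                            # log to file while also outputting to screen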

dumper.sh -l default

if -l (log) is not specified, it should automatically create a log while also outputting to screen

pagination on search results

There was a request during DD Eval 2 from the SEC domain to be able to mark more search results from a given query. Seems like a good idea.
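
On the Solr side, paging is already supported through the start and rows parameters, so this would mainly need UI support in the DD tool. An illustrative query against the crawldb core (the query itself is a placeholder):

curl 'http://localhost:8983/solr/crawldb/select?q=*:*&start=20&rows=20&wt=json'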

And now ./dumper.sh is not working either

The error is:

Ruths-iMac:sce-master ruthduerr$ ./dumper.sh
python: can't open file '/projects/sce/dumper/cdrv3_exporter.py': [Errno 2] No such file or directory
The dump of segments has been started. All the log messages will be reported also to /dev/stdout
Ruths-iMac:sce-master ruthduerr$

It looks like this is the line with the failure:

docker exec compose_domain-discovery_1 python /projects/sce/dumper/cdrv3_exporter.py

which implies that the compose_domain-discovery_1 container is having trouble finding that file. It is true that the system has been stopped and started several times using kickstart.sh.
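
A quick way to verify what the container actually sees (a diagnostic sketch; the find pattern is an assumption):

docker exec compose_domain-discovery_1 ls -l /projects/sce/dumper
docker exec compose_domain-discovery_1 find /projects -maxdepth 3 -name 'cdrv3*'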

Enable deep crawling using Sparkler

To enable deep crawling, we would need two Sparkler deployments configured to work together, which would change the way we deploy the system. Instead of running an embedded Solr within the Sparkler container, we would separate it out into its own container. We would then have five containers running: 1 Solr, 2 Sparklers, 1 DD tool, and 1 Firefox.
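
A hedged sketch of what such a Compose layout could look like (service names, image names, and versions are assumptions, not the current configuration):

cat > docker-compose-deepcrawl.yml <<'EOF'
version: '2'
services:
  solr:                        # crawldb moves out of the Sparkler container
    image: solr:6
    ports:
      - "8983:8983"
  sparkler-shallow:
    image: uscdatascience/sparkler:latest
    depends_on:
      - solr
  sparkler-deep:
    image: uscdatascience/sparkler:latest
    depends_on:
      - solr
  domain-discovery:
    image: sce/domain-discovery:latest
    ports:
      - "5000:5000"
  firefox:
    image: selenium/standalone-firefox
    ports:
      - "4444:4444"
EOF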

Solr keeps shutting down

I have managed to successfully launch and run a crawl; however, it crashes after a while (and the time between crashes has been getting shorter). When that happens, it looks like Solr is down. Is it not configured correctly? Is it getting full?

Add documentation into the wiki

At this point, I'm just copying it over from the Memex wiki so that it is readily accessible.

Also, I need these images over there:

ddtool-2017-08-31

docker_compose

kickstart.sh should report the URLs more accurately

As kickstart completes, it reports the following information:

The Solr instance is available on http://0.0.0.0:8983
The DD explorer is available on http://0.0.0.0:5000/explorer

This is only accurate on a local install.
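
A minimal sketch of a more accurate report (hostname -I is Linux-only, so a fallback is included):

HOST_IP=$(hostname -I 2>/dev/null | awk '{print $1}')   # empty on macOS
HOST_IP=${HOST_IP:-localhost}
echo "The Solr instance is available on http://${HOST_IP}:8983"
echo "The DD explorer is available on http://${HOST_IP}:5000/explorer"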

Anyone seen this error before?

It seems to happen when a key phrase is entered and the system never comes back with 12 URLs to label. Instead, after a long, long time (hours?), the system pops up the following:

(screenshot: error dialog, 2019-03-05)

and then new searches do nothing; worse yet, the front end appears to lose the ability to read the model. The only way I've found so far to fix this is to run

./kickstart.sh -stop

then

./kickstart.sh -start

Fortunately, when this is done the model information reloads OK, so I haven't lost all of my labeled URLs.

Adaptive fetch schedule

Port the adaptive fetch schedule from Nutch so that the time between crawls of deep-crawled sites is determined dynamically; see the sketch below.
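
A minimal sketch of the idea behind Nutch's AdaptiveFetchSchedule (the rates and bounds are assumptions, roughly matching Nutch's defaults):

INC_RATE=0.4                   # grow the interval when a page is unchanged
DEC_RATE=0.2                   # shrink the interval when a page has changed
MIN_INTERVAL=60                # one minute
MAX_INTERVAL=$((60*60*24*30))  # thirty days

next_interval() {
  # $1 = current interval in seconds, $2 = "changed" or "unchanged"
  local i
  if [ "$2" = changed ]; then
    i=$(awk -v i="$1" -v r="$DEC_RATE" 'BEGIN { printf "%d", i * (1 - r) }')
  else
    i=$(awk -v i="$1" -v r="$INC_RATE" 'BEGIN { printf "%d", i * (1 + r) }')
  fi
  [ "$i" -lt "$MIN_INTERVAL" ] && i=$MIN_INTERVAL
  [ "$i" -gt "$MAX_INTERVAL" ] && i=$MAX_INTERVAL
  echo "$i"
}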

Banana dashboard needs tweaking

Watching a crawl this week, I've noticed a number of things that I haven't been able to resolve. This may just reflect my novice interactions with Banana, but I wanted to document them nonetheless:

  1. It sometimes doubles numbers. For example, as I was just watching it, it would jump from 57k fetched docs to 114k fetched docs, then refresh and drop back to 57k. The correct number is 57k.
  2. It auto-refreshes very often (more often than every 30s, I think), and if I try to change the time/date in the timepicker, it quickly changes it back.

kickstart.sh -l default

if -l (log) is not specified, it should automatically create a log while also outputting to screen
