nasa-jpl-memex / sce
Sparkler Crawl Environment - a packaged, dockerized version of http://github.com/USCDataScience/sparkler.git
Home Page: http://irds.usc.edu/sparkler/
License: Apache License 2.0
How do we add/remove urls from the deep crawl list?
Develop recommendations to add to deep crawl list based on results of web crawl.
e.g., when a URL starts returning 404s
Port the adaptive fetch schedule from Nutch so we have dynamically determined timeframes between crawls on those being deep crawled
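The Nutch behavior being referenced could be sketched roughly as follows: shorten a page's re-crawl interval when it has changed since the last fetch, lengthen it when it has not, and clamp the result to a configured range. This is a minimal Python sketch, not Sparkler code; the constant names and default values are illustrative assumptions.

```python
# Sketch of a Nutch-style adaptive fetch schedule.
# NOTE: these constants are illustrative assumptions, not actual
# Sparkler or Nutch configuration keys.
MIN_INTERVAL = 60 * 60          # 1 hour, in seconds
MAX_INTERVAL = 30 * 24 * 3600   # 30 days
INC_RATE = 0.4                  # grow interval by 40% when page unchanged
DEC_RATE = 0.2                  # shrink interval by 20% when page changed

def next_fetch_interval(current_interval: float, page_changed: bool) -> float:
    """Return the number of seconds to wait before re-fetching this URL."""
    if page_changed:
        interval = current_interval * (1.0 - DEC_RATE)
    else:
        interval = current_interval * (1.0 + INC_RATE)
    # Clamp so fast-changing pages are not hammered and stale
    # pages are still revisited eventually.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

With per-URL intervals stored in the crawldb, this would give the dynamically determined timeframes between crawls mentioned above.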
Watching a crawl this week, I've noticed a number of things that I haven't been able to resolve. This may simply reflect my inexperience with Banana, but I wanted to document them nonetheless:
if -l (log) is not specified, it should automatically create a log while also outputting to screen
Kickstart.sh should be able to clean the system of the images.
The default functionality for sce.sh should be the following:
When relevancy of pages on a deep crawl site go below a certain threshold, mark it for review by the SME
Users expecting to use the SCE for a particular purpose may need to regenerate a model from scratch during an initial exploration phase. For example, when Ketil reviewed the 600 URLs he ended up having to define detailed descriptions for himself of how to score a document. I have had to do that as well, as my initial guess at what was the "right stuff" turned out to be inadequate (not specific enough). My second pass at a rule set is:
+ Green rules:
A document that defines permafrost-related terminology and that also likely contains information about the history of the term. Examples: permafrost-related review papers; comprehensive dictionaries; Wikipedia articles with historical discussions of terms, etc.
! Orange rules:
Documents that are about permafrost-related terms but which do not have historical information. Newspaper articles, etc.
- Red rules:
Documents using permafrost terms but which are about businesses, games, etc. Totally irrelevant stuff.
NOTE: Anyone have a better way of colorizing markdown text?
Export the response_headers in a key-value pair format with multi-valued keys being a string of values separated by a comma
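A minimal sketch of that flattening, assuming response_headers arrives as a dict whose values may be lists (the field name and shape here are assumptions about the exporter's internals, not its actual API):

```python
def flatten_headers(response_headers: dict) -> dict:
    """Collapse multi-valued header lists into comma-separated strings,
    leaving single-valued headers as plain strings."""
    flat = {}
    for key, value in response_headers.items():
        if isinstance(value, (list, tuple)):
            flat[key] = ",".join(str(v) for v in value)
        else:
            flat[key] = str(value)
    return flat
```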
Make provisions in the sce.sh script to let a crawl run until a user presses CTRL+C
There was a request during DD Eval 2 from the SEC domain to be able to mark more search results from a given query. Seems like a good idea.
Currently the output from the export script is in CDRv3.1 format. Modify the JSON to be in Elasticsearch bulk upload format, i.e., insert the appropriate action headers for Elasticsearch.
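A hedged sketch of what that transformation might look like: the bulk format pairs an action/metadata line with each document line. The index name crawldb and the _id field are hypothetical placeholders; the real values would come from the deployment and the CDRv3.1 records.

```python
import json

def to_bulk_lines(docs, index="crawldb", doc_id_field="_id"):
    """Yield newline-delimited JSON suitable for Elasticsearch's _bulk
    endpoint: an action header line followed by the document source line.
    'crawldb' and '_id' are assumed names, not the project's actual ones."""
    for doc in docs:
        action = {"index": {"_index": index}}
        if doc_id_field in doc:
            action["index"]["_id"] = doc.pop(doc_id_field)
        yield json.dumps(action)
        yield json.dumps(doc)
```

The resulting lines, joined with newlines (plus a trailing newline), could be POSTed directly to the _bulk endpoint.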
Add an option to stop and remove old images and update to new ones as bug fixes are released.
@giuseppetotaro
For the docker-compose up to work correctly, we need to check if the following ports are available
If these ports are not available, should we ask the user to free them up while running the kickstart.sh script?
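One way kickstart.sh could perform that check is to probe each port before calling docker-compose up. A minimal Python sketch; ports 8983 and 5000 are assumptions based on the Solr and DD explorer URLs mentioned elsewhere in these issues:

```python
import socket

# Assumed ports: 8983 (Solr) and 5000 (DD explorer).
REQUIRED_PORTS = [8983, 5000]

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when the port is occupied.
        return s.connect_ex((host, port)) == 0

def busy_ports(ports=REQUIRED_PORTS):
    """List the ports that would conflict with docker-compose up."""
    return [p for p in ports if port_in_use(p)]
```

If busy_ports() returns a non-empty list, the script could print which ports must be freed and exit before starting the containers.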
It seems to happen when a key phrase is entered and the system never comes back with 12 URLs to label. Instead, after a long, long time (hours?), the system pops up the following image
and then new searches do nothing and worse yet it appears like the front end loses the ability to read the model. The only way I've found so far to fix this is to
./kickstart.sh -stop
then
./kickstart.sh -start
Fortunately, when this is done, the model information reloads OK, so I haven't lost all of my labeled URLs
I have managed to successfully launch and run a crawl; however, it crashes after a while (and the time between crashes has been getting shorter). When that happens it looks like SOLR is down - is it not configured correctly? Getting full?
it would be helpful to allow domain exclusion in the search box, e.g. -site:foxnews.com
allow all operators shown: https://duck.co/help/results/syntax
At the second DD Eval, the program did not work after install due to a silly mistake during upload. We need to ensure this does not happen again.
The link currently points to /banana, which is fine unless you also need to define the port because it's a local install.
Can we get this to work in both cases?
As kickstart completes, it reports the following information:
The Solr instance is available on http://0.0.0.0:8983
The DD explorer is available on http://0.0.0.0:5000/explorer
This is only accurate on a local install.
Doing the above would be better, as the user can see the progress rather than just a black screen for 30 minutes while the install is underway.
I get the following when I try to bring the crawl dashboard up:
I get the same screen and error when I go to http://0.0.0.0:8983/solr/#/, and likewise with http://0.0.0.0:8983/banana/.
However, when I check the docker images, they look fine (though there are extras given the testing I was doing a while ago):
and the containers seem to be fine as well:
but the logs do show that there is a problem - the sce.log repeats the following error every second or two:
2019-03-06 05:36:47 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-06 05:36:49 INFO Crawler$:147 [main] - Committing crawldb..
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:49)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:617)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:148)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:48)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:310)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at shaded.org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at shaded.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at shaded.org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
... 17 more
It looks like Solr is not up and running or at least neither I nor the SCE can connect to it.
Suggestions?
For example, the ability to import and export models under the Generate a Model section. At this point the documentation only explicitly discusses using dumper.sh. @wmburke
The error is:
Ruths-iMac:sce-master ruthduerr$ ./dumper.sh
python: can't open file '/projects/sce/dumper/cdrv3_exporter.py': [Errno 2] No such file or directory
The dump of segments has been started. All the log messages will be reported also to /dev/stdout
Ruths-iMac:sce-master ruthduerr$
It looks like this is the line with the failure:
docker exec compose_domain-discovery_1 python /projects/sce/dumper/cdrv3_exporter.py
which implies that the container compose_domain-discovery_1 is having trouble finding that directory. It is true that the system has been stopped and started several times using kickstart.sh.
To enable deep crawling we would need two Sparkler deployments configured to work together.
This would mean changing the way we deploy our system.
Instead of running an embedded Solr within the sparkler container, we would separate it out into its own container.
We would then have 5 containers running:
1 Solr, 2 Sparklers, 1 DD Tool, 1 Firefox.
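Under those assumptions, the compose file might look roughly like this (service and image names are purely illustrative, not the project's actual ones):

```yaml
# Hypothetical 5-container layout; every image and service name below
# is an assumption, not the project's real configuration.
version: "3"
services:
  solr:
    image: solr:latest
    ports: ["8983:8983"]
  sparkler-shallow:
    image: uscdatascience/sparkler:latest   # assumed image name
    depends_on: [solr]
  sparkler-deep:
    image: uscdatascience/sparkler:latest   # assumed image name
    depends_on: [solr]
  domain-discovery:
    image: dd-tool:latest                   # hypothetical image name
    ports: ["5000:5000"]
    depends_on: [solr]
  firefox:
    image: selenium/node-firefox:latest     # hypothetical image name
```

Both Sparkler services would point at the shared Solr container's crawldb instead of an embedded instance.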