nasa-jpl-memex / sce
Sparkler Crawl Environment - a packaged, dockerized version of http://github.com/USCDataScience/sparkler.git
Home Page: http://irds.usc.edu/sparkler/
License: Apache License 2.0
How do we add/remove urls from the deep crawl list?
Develop recommendations to add to deep crawl list based on results of web crawl.
e.g., when a URL starts returning 404s
Port the adaptive fetch schedule from Nutch so we have dynamically determined timeframes between crawls on those being deep crawled
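The Nutch behavior being referenced could be sketched roughly as follows: shorten a page's re-crawl interval when it has changed since the last fetch, lengthen it when it has not, and clamp the result to a configured range. This is a minimal Python sketch, not Sparkler code; the constant names and default values are illustrative assumptions.

```python
# Sketch of a Nutch-style adaptive fetch schedule.
# NOTE: these constants are illustrative assumptions, not actual
# Sparkler or Nutch configuration keys.
MIN_INTERVAL = 60 * 60          # 1 hour, in seconds
MAX_INTERVAL = 30 * 24 * 3600   # 30 days
INC_RATE = 0.4                  # grow interval by 40% when page unchanged
DEC_RATE = 0.2                  # shrink interval by 20% when page changed

def next_fetch_interval(current_interval: float, page_changed: bool) -> float:
    """Return the number of seconds to wait before re-fetching this URL."""
    if page_changed:
        interval = current_interval * (1.0 - DEC_RATE)
    else:
        interval = current_interval * (1.0 + INC_RATE)
    # Clamp so fast-changing pages are not hammered and stale
    # pages are still revisited eventually.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

With per-URL intervals stored in the crawldb, this would give the dynamically determined timeframes between crawls mentioned above.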
Watching a crawl this week, I've noticed a number of things that I haven't been able to resolve. This may simply reflect my inexperience with Banana, but I wanted to document them nonetheless:
if -l (log) is not specified, it should automatically create a log while also outputting to screen
Kickstart.sh should be able to clean the system of the images.
The default functionality for sce.sh should be the following:
When relevancy of pages on a deep crawl site go below a certain threshold, mark it for review by the SME
Users expecting to use the SCE for a particular purpose may need to regenerate a model from scratch during an initial exploration phase. For example, when Ketil reviewed the 600 URLs he ended up having to define detailed descriptions for himself of how to score a document. I have had to do that as well, as my initial guess at what was the "right stuff" turned out to be inadequate (not specific enough). My second pass at a rule set is:
+ Green rules:
A document that defines permafrost-related terminology and that also likely contains information about the history of the term. Examples: permafrost-related review papers; comprehensive dictionaries; Wikipedia articles with historical discussions of terms, etc.
! Orange rules:
Documents that are about permafrost-related terms but which do not have historical information. Newspaper articles, etc.
- Red rules:
Documents using permafrost terms but which are about businesses, games, etc. Totally irrelevant stuff.
NOTE: Anyone have a better way of colorizing markdown text?
Export the response_headers in a key-value pair format with multi-valued keys being a string of values separated by a comma
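A minimal sketch of that flattening, assuming response_headers arrives as a dict whose values may be lists (the field name and shape here are assumptions about the exporter's internals, not its actual API):

```python
def flatten_headers(response_headers: dict) -> dict:
    """Collapse multi-valued header lists into comma-separated strings,
    leaving single-valued headers as plain strings."""
    flat = {}
    for key, value in response_headers.items():
        if isinstance(value, (list, tuple)):
            flat[key] = ",".join(str(v) for v in value)
        else:
            flat[key] = str(value)
    return flat
```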
Make provisions in the sce.sh script to let a crawl run until a user presses CTRL+C
There was a request during DD Eval 2 from the SEC domain to be able to mark more search results from a given query. Seems like a good idea.
Currently the output from the export script is in CDRv3.1 format. Modify the JSON to be in Elasticsearch bulk upload format, i.e., insert the appropriate action headers for Elasticsearch.
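A hedged sketch of what that transformation might look like: the bulk format pairs an action/metadata line with each document line. The index name crawldb and the _id field are hypothetical placeholders; the real values would come from the deployment and the CDRv3.1 records.

```python
import json

def to_bulk_lines(docs, index="crawldb", doc_id_field="_id"):
    """Yield newline-delimited JSON suitable for Elasticsearch's _bulk
    endpoint: an action header line followed by the document source line.
    'crawldb' and '_id' are assumed names, not the project's actual ones."""
    for doc in docs:
        action = {"index": {"_index": index}}
        if doc_id_field in doc:
            action["index"]["_id"] = doc.pop(doc_id_field)
        yield json.dumps(action)
        yield json.dumps(doc)
```

The resulting lines, joined with newlines (plus a trailing newline), could be POSTed directly to the _bulk endpoint.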
Add an option to stop and remove old images and update to new ones as bug fixes are released.
@giuseppetotaro
For the docker-compose up to work correctly, we need to check if the following ports are available
If these ports are not available, should we ask the user to free them up while running the kickstart.sh script?
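One way kickstart.sh could perform that check is to probe each port before calling docker-compose up. A minimal Python sketch; ports 8983 and 5000 are assumptions based on the Solr and DD explorer URLs mentioned elsewhere in these issues:

```python
import socket

# Assumed ports: 8983 (Solr) and 5000 (DD explorer).
REQUIRED_PORTS = [8983, 5000]

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when the port is occupied.
        return s.connect_ex((host, port)) == 0

def busy_ports(ports=REQUIRED_PORTS):
    """List the ports that would conflict with docker-compose up."""
    return [p for p in ports if port_in_use(p)]
```

If busy_ports() returns a non-empty list, the script could print which ports must be freed and exit before starting the containers.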
It seems to happen when a key phrase is entered and the system never comes back with 12 URLs to label. Instead, after a long, long time (hours?), the system pops up the following image
and then new searches do nothing and worse yet it appears like the front end loses the ability to read the model. The only way I've found so far to fix this is to
./kickstart.sh -stop
then
./kickstart.sh -start
Fortunately, when this is done, the model information reloads OK, so I haven't lost all of my labeled URLs
I have managed to successfully launch and run a crawl; however, it crashes after a while (and the time between crashes has been getting shorter). When that happens it looks like SOLR is down - is it not configured correctly? Getting full?
it would be helpful to allow domain exclusion in the search box, e.g. -site:foxnews.com
allow all operators shown: https://duck.co/help/results/syntax
At the second DD Eval, the program did not work after install due to a silly mistake during upload. We need to ensure this does not happen again.
The link currently points to /banana, which is fine unless you also need to define the port because it's a local install.
Can we get this to work in both cases?
As kickstart completes, it reports the following information:
The Solr instance is available on http://0.0.0.0:8983
The DD explorer is available on http://0.0.0.0:5000/explorer
This is only accurate on a local install.
Doing the above would be better, as the user can see the progress rather than just a black screen for 30 minutes while the install is underway.
I get the following when I try to bring the crawl dashboard up:
I get the same screen and error when I go to http://0.0.0.0:8983/solr/#/, and likewise with http://0.0.0.0:8983/banana/.
However, when I check the docker images, they look fine (though there are extras given the testing I was doing a while ago):
and the containers seem to be fine as well:
but the logs do show that there is a problem - the sce.log repeats the following error every second or two:
2019-03-06 05:36:47 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-06 05:36:49 INFO Crawler$:147 [main] - Committing crawldb..
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:49)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:617)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:148)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:48)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:310)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at shaded.org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at shaded.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at shaded.org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
... 17 more
It looks like Solr is not up and running or at least neither I nor the SCE can connect to it.
Suggestions?
For example, the ability to import and export models under the Generate a Model section. At this point the documentation only explicitly discusses using dumper.sh. @wmburke
The error is:
Ruths-iMac:sce-master ruthduerr$ ./dumper.sh
python: can't open file '/projects/sce/dumper/cdrv3_exporter.py': [Errno 2] No such file or directory
The dump of segments has been started. All the log messages will be reported also to /dev/stdout
Ruths-iMac:sce-master ruthduerr$
It looks like this is the line with the failure:
docker exec compose_domain-discovery_1 python /projects/sce/dumper/cdrv3_exporter.py
which implies that the container compose_domain-discovery_1 is having trouble finding that directory. It is true that the system has been stopped and started several times using kickstart.sh.
To enable deep crawling we would need two Sparkler deployments configured to work together.
This would mean changing the way we deploy our system.
Instead of running an embedded Solr within the sparkler container, we would separate it out into its own container.
We would then have 5 containers running:
1 Solr, 2 Sparklers, 1 DD Tool, 1 Firefox.
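Under those assumptions, the compose file might look roughly like this (service and image names are purely illustrative, not the project's actual ones):

```yaml
# Hypothetical 5-container layout; every image and service name below
# is an assumption, not the project's real configuration.
version: "3"
services:
  solr:
    image: solr:latest
    ports: ["8983:8983"]
  sparkler-shallow:
    image: uscdatascience/sparkler:latest   # assumed image name
    depends_on: [solr]
  sparkler-deep:
    image: uscdatascience/sparkler:latest   # assumed image name
    depends_on: [solr]
  domain-discovery:
    image: dd-tool:latest                   # hypothetical image name
    ports: ["5000:5000"]
    depends_on: [solr]
  firefox:
    image: selenium/node-firefox:latest     # hypothetical image name
```

Both Sparkler services would point at the shared Solr container's crawldb instead of an embedded instance.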