archivesunleashed / docker-aut
Docker image for the Archives Unleashed Toolkit
Home Page: https://archivesunleashed.org/
License: Other
On OS X 10.11.3:
ianmilligan1@Ians-MBP:~/dropbox/git/warcbase_workshop_vagrant$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'ubuntu/trusty64' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
==> default: Loading metadata for box 'ubuntu/trusty64'
default: URL: https://atlas.hashicorp.com/ubuntu/trusty64
==> default: Adding box 'ubuntu/trusty64' (v20160314.0.2) for provider: virtualbox
default: Downloading: https://atlas.hashicorp.com/ubuntu/boxes/trusty64/versions/20160314.0.2/providers/virtualbox.box
==> default: Successfully added box 'ubuntu/trusty64' (v20160314.0.2) for 'virtualbox'!
==> default: Importing base box 'ubuntu/trusty64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/trusty64' is up to date...
==> default: Setting the name of the VM: Warcbase workshop VM
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
default: Adapter 1: nat
==> default: Forwarding ports...
default: 8080 (guest) => 9000 (host) (adapter 1)
default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'poweroff' state. Please verify everything is configured
properly and try again.
If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run `vagrant up` while the
VirtualBox GUI is open.
The primary issue for this error is that the provider you're using
is not properly configured. This is very rarely a Vagrant issue.
Will look into this.
See also: #6
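A minimal debugging sketch for this kind of poweroff failure, assuming VirtualBox and Vagrant are on the PATH (the VM name comes from the log above):

# Open the VirtualBox GUI first so its error dialogs stay visible, then retry:
vagrant up
# Or inspect the VM's state directly through VirtualBox:
VBoxManage list vms
VBoxManage showvminfo "Warcbase workshop VM" --machinereadable | grep -i vmstate
# If the box is wedged, rebuild it from scratch:
vagrant destroy -f && vagrant up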
I'm working on creating a 0.11.0 version, and looking at the documentation, there are no Spark Notebook examples; it appears to be all Spark Shell. Should I remove Spark Notebook from the build process and the README instructions?
Well, guess I should do this now.
Update the Vagrantfile to support Azure provisioning, once we get up and running.
I've jumped through a lot of hoops trying to get warcbase to build as part of the Vagrant build, and it just doesn't want to happen.
You can shell in (vagrant ssh) after the Vagrant build and run cd /home/vagrant/project/warcbase && sudo mvn clean package appassembler:assemble -DskipTests, and it builds fine (the steps are spelled out below).
See: lintool/warcbase#206
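Spelled out as a step-by-step sketch, using the paths from the comment above:

# On the host, once vagrant up has finished provisioning:
vagrant ssh
# Then, inside the guest:
cd /home/vagrant/project/warcbase
sudo mvn clean package appassembler:assemble -DskipTests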
Unable to get the Docker container running. It throws the following error:
docker run --rm -it aut
...
:: problems summary ::
:::: WARNINGS
[NOT FOUND ] com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle) (0ms)
==== local-m2-cache: tried
file:/root/.m2/repository/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle)
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [download failed: com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle)]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1083)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
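One workaround worth trying, though it's an assumption on my part rather than a verified fix: the resolver above only consulted local-m2-cache, so pre-seeding the container's local Maven repository with the missing artifact may get spark-shell past resolution (this assumes Maven is installed in the image):

# Run inside the container; pulls paranamer 2.8 from Maven Central into /root/.m2:
mvn dependency:get -Dartifact=com.thoughtworks.paranamer:paranamer:2.8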
Describe the bug
On macOS, docker build -t aut . fails with java.lang.OutOfMemoryError: Java heap space.
On Linux, the build succeeds.
To Reproduce
On macOS, run docker build -t aut .
Expected behavior
Build the Docker image.
Screenshots
n/a
Desktop/Laptop (please complete the following information):
$ uname -a
Darwin C02F37HLML7H 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_X86_64 x86_64
Smartphone (please complete the following information):
n/a
Additional context
See the log.txt file attached.
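A hedged guess at mitigation: Docker Desktop on macOS runs builds inside a VM with a fixed memory cap (2 GB by default), which a Linux host doesn't have, so raising that cap under Preferences > Resources is the first thing to try. If the Dockerfile exposes a build argument for the Maven heap (an assumption; check the Dockerfile), it could be passed through like this:

# Only effective if the Dockerfile declares ARG MAVEN_OPTS (hypothetical here):
docker build --build-arg MAVEN_OPTS="-Xmx4g" -t aut .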
This also requires using a newer version of Spark Notebook, which uses a different way to load external libraries; the :cp command is no longer available.
:: problems summary ::
:::: WARNINGS
module not found: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4
==== local-m2-cache: tried
file:/root/.m2/repository/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom
-- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:
file:/root/.m2/repository/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar
==== local-ivy-cache: tried
/root/.ivy2/local/org.apache.hadoop/hadoop-core/0.20.2-cdh3u4/ivys/ivy.xml
-- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:
/root/.ivy2/local/org.apache.hadoop/hadoop-core/0.20.2-cdh3u4/jars/hadoop-core.jar
==== central: tried
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom
-- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom
-- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:
http://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1083)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
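The cdh3u4 suffix marks this as a Cloudera build of Hadoop, which Maven Central has never hosted, so one hedged workaround is to add Cloudera's repository to the resolver list (repository URL assumed; verify it before relying on it):

# --repositories is a standard spark-shell/spark-submit flag; coordinates illustrative:
spark-shell --packages io.archivesunleashed:aut:0.11.0 \
  --repositories https://repository.cloudera.com/artifactory/cloudera-repos/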
I'm working on a 0.11.0 Docker build, but ran into this. @ianmilligan1 @lintool, are you fine with me cutting a 0.11.1 release that resolves the issue?
N.B. At this point I'd prefer to build the Docker image with --packages as opposed to --jars, because it is surfacing a lot of dependency issues that I've feared have remained hidden for a long time. The two invocation styles are sketched below.
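For clarity, the two invocation styles being compared (version and path are illustrative):

# --packages resolves the coordinate and its transitive dependencies from
# repositories, so broken or missing dependencies surface immediately:
spark-shell --packages io.archivesunleashed:aut:0.11.0
# --jars just puts a pre-built fatjar on the classpath, which can hide them:
spark-shell --jars /aut/target/aut-0.11.0-fatjar.jar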
Create a CONTRIBUTING.md that lets folks know how to provide feedback, etc.
We can steal from the Islandora one: https://github.com/Islandora-CLAW/CLAW/blob/7.x-2.x/CONTRIBUTING.md
Remove mentions of master.
Working on updating everything here, and I noticed aut is failing to build on the master branch in the Docker build process.
Here is the output of the error:
2017-12-07 23:14:13,556 [main-ScalaTest-running-CountableRDDTest] INFO SparkUI - Stopped Spark web UI at http://172.17.0.2:4040
2017-12-07 23:14:13,558 [dispatcher-event-loop-2] INFO MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2017-12-07 23:14:13,562 [main-ScalaTest-running-CountableRDDTest] INFO MemoryStore - MemoryStore cleared
2017-12-07 23:14:13,562 [main-ScalaTest-running-CountableRDDTest] INFO BlockManager - BlockManager stopped
2017-12-07 23:14:13,564 [main-ScalaTest-running-CountableRDDTest] INFO BlockManagerMaster - BlockManagerMaster stopped
2017-12-07 23:14:13,571 [dispatcher-event-loop-1] INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2017-12-07 23:14:13,573 [main-ScalaTest-running-CountableRDDTest] INFO SparkContext - Successfully stopped SparkContext
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.711 sec - in io.archivesunleashed.spark.rdd.CountableRDDTest
Running io.archivesunleashed.io.ArcRecordWritableTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.231 sec - in io.archivesunleashed.io.ArcRecordWritableTest
Running io.archivesunleashed.io.GenericArchiveRecordWritableTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.382 sec - in io.archivesunleashed.io.GenericArchiveRecordWritableTest
Running io.archivesunleashed.io.WarcRecordWritableTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.344 sec - in io.archivesunleashed.io.WarcRecordWritableTest
Running io.archivesunleashed.ingest.WacArcLoaderTest
2017-12-07 23:14:14,679 [main] INFO WacArcLoaderTest - 300 records read!
2017-12-07 23:14:14,860 [main] INFO WacArcLoaderTest - 300 records read!
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.322 sec - in io.archivesunleashed.ingest.WacArcLoaderTest
Running io.archivesunleashed.ingest.WacWarcLoaderTest
2017-12-07 23:14:15,246 [main] INFO WacWarcLoaderTest - 822 records read!
2017-12-07 23:14:15,623 [main] INFO WacWarcLoaderTest - 822 records read!
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.762 sec - in io.archivesunleashed.ingest.WacWarcLoaderTest
Running io.archivesunleashed.mapreduce.WacWarcInputFormatTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.244 sec - in io.archivesunleashed.mapreduce.WacWarcInputFormatTest
Running io.archivesunleashed.mapreduce.WacArcInputFormatTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec - in io.archivesunleashed.mapreduce.WacArcInputFormatTest
Running io.archivesunleashed.mapreduce.WacGenericInputFormatTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.351 sec - in io.archivesunleashed.mapreduce.WacGenericInputFormatTest
2017-12-07 23:14:16,340 [Thread-1] INFO ShutdownHookManager - Shutdown hook called
2017-12-07 23:14:16,341 [Thread-1] INFO ShutdownHookManager - Deleting directory /tmp/spark-40f43281-67db-4a4e-843c-8cbe042ff68e
Results :
Tests in error:
ExtractPopularImagesTest.run:32->org$scalatest$BeforeAndAfter$$super$run:32->FunSuite.org$scalatest$FunSuiteLike$$super$run:1560->FunSuite.runTests:1560->runTest:32->org$scalatest$BeforeAndAfter$$super$runTest:32->FunSuite.withFixture:1560->FunSuite.newAssertionFailedException:1560 ? TestFailed
Tests run: 75, Failures: 0, Errors: 1, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:02 min
[INFO] Finished at: 2017-12-07T23:14:16+00:00
[INFO] Final Memory: 70M/554M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project aut: There are test failures.
[ERROR]
[ERROR] Please refer to /aut/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
The command '/bin/sh -c git clone https://github.com/archivesunleashed/aut.git /aut && cd /aut && mvn clean install' returned a non-zero code: 1
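Two hedged ways to make progress while ExtractPopularImagesTest is investigated (both assume you can edit the Dockerfile's RUN line or work inside the container):

# Reproduce just the failing suite:
mvn -Dtest=ExtractPopularImagesTest test
# Or unblock the image build by skipping tests entirely (diagnostic only):
mvn clean install -DskipTests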
I'm unable to run the Docker container for version 0.18.0.
docker run --rm -it archivesunleashed/docker-aut:0.18.0
results in the following error:
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.github.archivesunleashed.tika#tika-parsers;1.22: not found
:: com.github.netarchivesuite#language-detector;language-detector-0.6a: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.github.archivesunleashed.tika#tika-parsers;1.22: not found, unresolved dependency: com.github.netarchivesuite#language-detector;language-detector-0.6a: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
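The com.github.* group IDs suggest these artifacts are JitPack builds rather than Maven Central ones, so a hedged workaround is to add JitPack to the resolver list (the flag is standard; whether these exact coordinates resolve there is an assumption):

# Coordinates illustrative; jitpack.io is JitPack's public repository:
spark-shell --packages io.archivesunleashed:aut:0.18.0 \
  --repositories https://jitpack.io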
The Spark Notebook works at http://127.0.0.1:9000/# as directed in the walkthrough, but when you load the fatjar the browser hangs. The terminal displays the following errors and we can't continue.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.net.URI.<init>(URI.java:588)
at akka.actor.ActorPathExtractor$.unapply(Address.scala:154)
at akka.remote.RemoteActorRefProvider.resolveActorRefWithLocalAddress(RemoteActorRefProvider.scala:347)
at akka.remote.transport.AkkaPduProtobufCodec$.decodeMessage(AkkaPduCodec.scala:191)
at akka.remote.EndpointReader.akka$remote$EndpointReader$$tryDecodeMessageAndAck(Endpoint.scala:993)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:926)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Uncaught error from thread [Remote-akka.remote.default-remote-dispatcher-7] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.jar.Attributes.read(Attributes.java:394)
at java.util.jar.Manifest.read(Manifest.java:199)
at java.util.jar.Manifest.<init>(Manifest.java:69)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at scala.concurrent.Future$class.foreach(Future.scala:204)
at scala.concurrent.impl.Promise$DefaultPromise.foreach(Promise.scala:153)
at akka.remote.transport.netty.NettyTransport$.gracefulClose(NettyTransport.scala:222)
at akka.remote.transport.netty.TcpAssociationHandle.disassociate(TcpSupport.scala:94)
at akka.remote.transport.ProtocolStateActor$$anonfun$1.applyOrElse(AkkaProtocolTransport.scala:516)
at akka.remote.transport.ProtocolStateActor$$anonfun$1.applyOrElse(AkkaProtocolTransport.scala:480)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at akka.actor.FSM$class.terminate(FSM.scala:672)
at akka.actor.FSM$class.applyState(FSM.scala:617)
at akka.remote.transport.ProtocolStateActor.applyState(AkkaProtocolTransport.scala:269)
at akka.actor.FSM$class.processEvent(FSM.scala:609)
at akka.remote.transport.ProtocolStateActor.processEvent(AkkaProtocolTransport.scala:269)
at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
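A hedged mitigation for the GC-overhead error: give the notebook's JVM more heap before launching it. Spark Notebook's launcher is Play-based and typically honors JAVA_OPTS, though that's an assumption; check the launch script to confirm.

# Heap size illustrative; set before starting the notebook:
export JAVA_OPTS="-Xmx4g"
./bin/spark-notebook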
The README.md references /aut/target/aut-0.90.5-SNAPSHOT-fatjar.jar:
docker run --rm -it \
archivesunleashed/docker-aut \
/spark/bin/pyspark \
--py-files /aut/target/aut.zip \
--jars /aut/target/aut-0.90.5-SNAPSHOT-fatjar.jar
but the Docker image on Docker Hub, archivesunleashed/docker-aut:latest, contains aut-0.90.3-SNAPSHOT-fatjar.jar:
$ docker pull archivesunleashed/docker-aut:latest
Using default tag: latest
latest: Pulling from archivesunleashed/docker-aut
Digest: sha256:cbaabbd3bf2783ec3af1956fefb44ce20e10b6c6321cd5c837dd52e3128a2012
Status: Downloaded newer image for archivesunleashed/docker-aut:latest
docker.io/archivesunleashed/docker-aut:latest
$ docker run --rm -it archivesunleashed/docker-aut:latest ls /aut/target
aut-0.90.3-SNAPSHOT-fatjar.jar
Push the most recent build of archivesunleashed/docker-aut to Docker Hub.
@ianmilligan1 or @greebie, you want this?
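For whoever picks this up, the release steps would be roughly as follows, assuming push rights to the Docker Hub organization:

# Build from the current repo state and push the refreshed tag:
docker build -t archivesunleashed/docker-aut:latest .
docker push archivesunleashed/docker-aut:latest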
If you haven't worked with Docker before, this is very helpful.