
spark-on-lambda's Introduction

Note: "This repo contains Vulnerable Code, as such should not be used for any purpose whatsoever."

Spark on Lambda - README


AWS Lambda is a serverless Function-as-a-Service offering that scales up quickly and bills usage at 100 ms granularity. We thought it would be interesting to see whether we could get Apache Spark to run on Lambda, and to validate the idea we hacked together a prototype. We were able to make it work by changing Spark's scheduler and shuffle code. Because AWS Lambda has a 5-minute maximum run time, shuffle data has to go through external storage, so we modified the shuffle parts of the Spark code to shuffle over an external store such as S3.

This is a prototype; it is not battle-tested and may well have bugs. The changes are made against open-source Apache Spark 2.1.0. We also have a fork of Spark 2.2.0 with a few remaining bugs that will be pushed here soon. We welcome contributions from developers.

For users who want to try it out:

Bring up an EC2 machine in a VPC with AWS credentials (~/.aws/credentials) that allow it to invoke the Lambda function. Right now, loading Lambda credentials from the credentials file is the only method we support with AWSLambdaClient. The Spark driver will run on this machine, so also configure a security group for it.
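
A minimal ~/.aws/credentials file on the driver machine might look like the following (the profile name and key values are placeholders):

[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>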

Spark on Lambda package for the driver [s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz] - download this onto the EC2 instance where the driver will be launched; since the driver is generally long-running, it needs to run on an EC2 instance.
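
For example, to fetch and unpack the driver package on the EC2 machine (adjust the path if you copied the package to your own bucket):

aws s3 cp s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz .
tar -xzf spark-2.1.0-bin-spark-lambda-2.1.0.tgz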

Create a Lambda function named spark-lambda from the AWS console using (https://github.com/qubole/spark-on-lambda/bin/lambda/spark-lambda-os.py), and configure the Lambda function's VPC and subnet to be the same as those of the EC2 machine. Right now executors register with the Spark driver using private IPs, but this could be changed to use public IPs, in which case the Spark driver could even run on a Mac, a PC, or any VM. It would also be nice to have a Docker container with the package that works out of the box.
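
If you prefer the CLI over the console, the function can be created along these lines. This is only a sketch: the runtime, handler name, role ARN, memory size, and the zipping of spark-lambda-os.py are assumptions you will need to adapt to your account.

# Package the helper function and create it inside the same VPC/subnet as the EC2 machine
zip spark-lambda-os.zip spark-lambda-os.py
aws lambda create-function \
  --function-name spark-lambda \
  --runtime python2.7 \
  --handler <module>.<handler_function> \
  --role arn:aws:iam::<ACCOUNT_ID>:role/<LAMBDA_ROLE> \
  --zip-file fileb://spark-lambda-os.zip \
  --timeout 300 \
  --memory-size 1536 \
  --vpc-config SubnetIds=<SUBNET_ID>,SecurityGroupIds=<SECURITY_GROUP_ID>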

Also configure the security group of the Lambda function to be the same as that of the EC2 machine. Note: the Lambda role should have access to [s3://public-qubole/]. If you want to copy the packages to your own bucket, use:

aws s3 cp s3://public-qubole/lambda/spark-lambda-149.zip s3://YOUR_BUCKET/
aws s3 cp s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://YOUR_BUCKET/

Spark on Lambda package for the executors launched inside Lambda [s3://public-qubole/lambda/spark-lambda-149.zip] - this is used on the Lambda (executor) side. To use this package on the Lambda side, pass Spark configs like the ones below:

    1. spark.lambda.s3.bucket s3://public-qubole/
    2. spark.lambda.function.name spark-lambda
    3. spark.lambda.spark.software.version 149

Launch spark-shell

/usr/lib/spark/bin/spark-shell --conf spark.hadoop.fs.s3n.awsAccessKeyId=<ACCESS_KEY_ID> --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<SECRET_ACCESS_KEY>
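
Putting the Lambda-related configs together, a full invocation might look like the following (a sketch that assumes the default public bucket and function name; replace the credentials and shuffle bucket with your own):

/usr/lib/spark/bin/spark-shell \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId=<ACCESS_KEY_ID> \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<SECRET_ACCESS_KEY> \
  --conf spark.shuffle.s3.enabled=true \
  --conf spark.shuffle.s3.bucket=s3://<YOUR_SHUFFLE_BUCKET> \
  --conf spark.lambda.s3.bucket=s3://public-qubole/ \
  --conf spark.lambda.function.name=spark-lambda \
  --conf spark.lambda.spark.software.version=149

Alternatively, put these settings in spark-defaults.conf as shown in the next section.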

Spark on Lambda configs (spark-defaults.conf)

spark.shuffle.s3.enabled true
# Bucket to write shuffle (intermediate) data
spark.shuffle.s3.bucket s3://
spark.lambda.s3.bucket s3://public-qubole/  
spark.lambda.concurrent.requests.max 50
spark.lambda.function.name spark-lambda
spark.lambda.spark.software.version 149
spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.AbstractFileSystem.s3.impl org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3n.impl org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3a.impl org.apache.hadoop.fs.s3a.S3A
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2

For developers who want to make changes:

To compile

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests 

Because aws-java-sdk-1.7.4.jar (used by hadoop-aws.jar) and aws-java-sdk-core-1.1.0.jar have compatibility issues, we currently have to compile against the Qubole-shaded hadoop-aws-2.6.0-qds-0.4.13.jar.
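
The build produces the spark-2.1.0-bin-spark-lambda-2.1.0.tgz tarball (by default at the top of the source tree), which you can then copy to your own bucket for the driver machine, for example:

aws s3 cp spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://YOUR_BUCKET/lambda/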

To create lambda package for executors

bash -x bin/lambda/spark-lambda 149 spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://public-qubole/

Here, 149 maps to the config value of spark.lambda.spark.software.version, and s3://public-qubole/ maps to the config value of spark.lambda.s3.bucket.
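
If you upload the resulting package to your own bucket instead, point the driver at it by overriding the corresponding configs, for example:

spark.lambda.s3.bucket s3://YOUR_BUCKET/
spark.lambda.spark.software.version 149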

(spark/bin/lambda/spark-lambda-os.py) is the helper Lambda function used to bootstrap the Lambda environment with the Spark packages necessary to run executors.

The Lambda function above has to be created inside the same VPC as the EC2 instance where the driver is brought up, so that the driver and the executors (Lambda functions) can communicate.
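
One way to double-check the function's networking (a sketch using the standard AWS CLI; the --query expression just narrows the output) is:

aws lambda get-function-configuration --function-name spark-lambda --query VpcConfig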

References

  1. http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

spark-on-lambda's People

Contributors

aarondav, adrian-wang, andrewor14, ankurdave, chenghao-intel, cloud-fan, dongjoon-hyun, gatorsmile, holdenk, hyukjinkwon, jegonzal, jerryshao, jkbradley, joshrosen, kayousterhout, liancheng, marmbrus, mateiz, mengxr, pwendell, rxin, sarutak, scrapcodes, shivaram, srowen, tdas, viirya, yanboliang, yhuai, zsxwing


spark-on-lambda's Issues

Setup - Need S3 object read permission

As per the instructions, I was trying to copy the file and it errors out.

aws s3 cp s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://myBucket-xxx/spark-on-lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz --acl bucket-owner-full-control

Error:
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Could you provide appropriate access so that users can access the necessary files?

Thanks
Vali

Dockerize Spark on Lambda with jupyter notebook.

The idea here is to create a Docker image for Spark on Lambda so that the Spark driver can be spun up on either AWS ECS or Fargate. It would be even nicer to have a Jupyter notebook frontend interfacing with the Spark app.

Issue downloading S3 files

aws s3 cp s3://public-qubole/lambda/spark-lambda-149.zip s3://my-bucket
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

When I tried this command, another error occurred, as below:
aws s3 ls s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

Could you provide appropriate access so that users can access the necessary files?
Thanks
Yang

Spark does not talk to Lambda at all

I have been struggling to set this framework up on my EC2 server. I tried my best to follow the instructions of both this repo and faromero's forked repo, but I have been getting this error message each time I run sudo ./../driver/bin/spark-submit ml_kmeans.py --master lambda://test:

19/04/05 05:11:40 ERROR ShuffleBlockFetcherIterator: Error occurred while fetching local blocks
java.io.FileNotFoundException: No such file or directory: s3://mc-cse597cc/tmp/executor-driver-4775351731/30/shuffle_0_0_0.index
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1636)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:684)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:772)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getRemoteBlockData(S3ShuffleBlockResolver.scala:240)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getBlockData(S3ShuffleBlockResolver.scala:263)
	at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:318)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:258)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:292)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/04/05 05:11:40 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 8, localhost, executor driver): FetchFailed(BlockManagerId(driver, 172.31.123.183, 33995, None), shuffleId=0, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: No such file or directory: s3://mc-cse597cc/tmp/executor-driver-4775351731/30/shuffle_0_0_0.index
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:154)
	at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:85)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3://mc-cse597cc/tmp/executor-driver-4775351731/30/shuffle_0_0_0.index
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1636)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:684)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:772)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getRemoteBlockData(S3ShuffleBlockResolver.scala:240)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getBlockData(S3ShuffleBlockResolver.scala:263)
	at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:318)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:258)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:292)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
	... 9 more

)
19/04/05 05:11:40 INFO DAGScheduler: Marking ResultStage 8 (countByValue at KMeans.scala:399) as failed due to a fetch failure from ShuffleMapStage 7 (countByValue at KMeans.scala:399)
19/04/05 05:11:40 INFO DAGScheduler: ResultStage 8 (countByValue at KMeans.scala:399) failed in 0.069 s due to org.apache.spark.shuffle.FetchFailedException: No such file or directory: s3://mc-cse597cc/tmp/executor-driver-4775351731/30/shuffle_0_0_0.index
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:154)
	at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:85)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3://mc-cse597cc/tmp/executor-driver-4775351731/30/shuffle_0_0_0.index
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1636)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:684)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:772)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getRemoteBlockData(S3ShuffleBlockResolver.scala:240)
	at org.apache.spark.shuffle.S3ShuffleBlockResolver.getBlockData(S3ShuffleBlockResolver.scala:263)
	at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:318)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:258)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:292)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
	... 9 more

Here is the full log

My config file:

spark.dynamicAllocation.enabled                 true
spark.dynamicAllocation.minExecutors            2
spark.shuffle.s3.enabled                        true
spark.lambda.concurrent.requests.max            100
spark.hadoop.fs.s3n.impl                        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl                         org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.AbstractFileSystem.s3.impl      org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3n.impl     org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3a.impl     org.apache.hadoop.fs.s3a.S3A
spark.hadoop.qubole.aws.use.v4.signature        true
spark.hadoop.fs.s3a.fast.upload                 true
spark.lambda.function.name                      spark-lambda
spark.lambda.spark.software.version             149
spark.hadoop.fs.s3a.endpoint                    s3.us-east-1.amazonaws.com
spark.hadoop.fs.s3n.awsAccessKeyId             KEY
spark.hadoop.fs.s3n.awsSecretAccessKey          SECRET
spark.shuffle.s3.bucket                         s3://mc-cse597cc
spark.lambda.s3.bucket                          s3://mc-cse597cc

~/.aws/config is us-east-1. VPC subnets are configured following the forked repo's instruction.

My Lambda function has been tested and can read from and write to S3. My spark-submit command run on EC2 is able to write to S3 (it generates a tmp/ folder in the bucket), but it does not trigger the Lambda at all. CloudWatch for my Lambda has no logs. However, I am able to invoke my Lambda from EC2 using something like aws lambda invoke --function-name spark-lambda ~/test.txt. I guess I configured Spark-on-Lambda wrong, but I have been following the instructions.

I am now trying to dive into the source code. Is there any clue for this message?

s3a error

In the attached example, the following problem appears, which seems to be related to how Spark manages shuffle data on S3.
Can you confirm that this problem is reproducible, or is it a configuration problem on my side?

ShuffleExample.scala.zip

How to make commands execute on Lambda

How do I run commands from the spark-shell so that they are executed on Lambda? Right now, the commands are being executed locally on my machine, but I would like Lambda to be the backend.

I am running the following command to start the shell (which does start successfully):
bin/spark-shell --conf spark.hadoop.fs.s3n.awsAccessKeyId=<my-key> --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<my-secret-key> --conf spark.shuffle.s3.bucket=s3://<my-bucket> --conf spark.lambda.function.name=spark-lambda --conf spark.lambda.s3.bucket=s3://<my-bucket>/lambda --conf spark.lambda.spark.software.version=149

I have created the function spark-lambda to be the contents of spark-lambda-os.py and have given it S3 and EC2 permissions. In addition, the S3 bucket <my-bucket>/lambda has the package spark-lambda-149.zip which was put together by the spark-lambda script. Is there anything else I need to do to have it execute on Lambda?

Lost executor and file not found

Hello,

I have been using Spark-on-lambda for some time now and I have recently faced some issues. I have made some code changes regarding the shuffle writes but I am getting the following errors: (I am working with Spark's standalone cluster manager)

  1. The cluster manager seems to lose the first executor every time a task is submitted:
ERROR TaskSchedulerImpl: Lost executor 0 on 172.31.108.8: Unable to create executor due to assertion failed
19/04/06 18:04:07 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 172.31.108.8, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Unable to create executor due to assertion failed
19/04/06 18:04:07 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.31.108.8, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Unable to create executor due to assertion failed
19/04/06 18:04:07 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 172.31.108.8, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Unable to create executor due to assertion failed
19/04/06 18:04:07 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.31.108.8, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Unable to create executor due to assertion failed
19/04/06 18:04:07 INFO DAGScheduler: Executor lost: 0 (epoch 0)
19/04/06 18:04:07 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
19/04/06 18:04:07 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
19/04/06 18:04:07 INFO ExecutorAllocationManager: Existing executor 0 has been removed (new total is 0)
19/04/06 18:04:07 WARN ExecutorAllocationManager: Attempted to mark unknown executor 0 idle
19/04/06 18:04:07 WARN TransportChannelHandler: Exception in connection from /172.31.108.8:56608
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)

All the executors seem to register and work fine after the first one.
  2. Writing shuffle data to S3 makes applications run very slowly. Can you let me know exactly why this is?

  3. I want to use HDFS to read files from within a Spark application and to write the final output to. Can you tell me the best Hadoop/HDFS version to use with Spark-on-Lambda (Spark 2.1.0-rc5) and the appropriate dependencies (especially the Netty version)?

Thanks

Compiling

Hello,
I'm trying to install spark on lambda. When I run

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests

The Spark Project Launcher module fails and I get the following error.

[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Failure to find com.hadoop.gplcompression:hadoop-lzo:jar:0.4.19 in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

I tried to explicitly add hadoop-lzo as a dependency in the launcher pom.xml, but I still get the same error. Is there something I need to download or change to get this to work?

Thanks!

Python example file's data location does not meet Lambda's expectation

I am using the Python example python/ml/kmeans_example.py. This file has a hard-coded path 'data/mllib/sample_kmeans_data.txt'.

Now when I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py under the driver folder, Spark's log shows java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist.

I was told that the data file location string needs to be consistent between Lambda and Spark. Your Lambda code expects the data file to be somewhere under /tmp/lambda, so I looked at what was actually under /tmp/lambda and found a spark folder. My workaround was to create /tmp/lambda/spark/data/mllib/ on my EC2 instance, move my data file there, and then point spark.read at that file. Specifically, I changed line 42 to:

    import os
    data_folder = '/home/ec2-user/driver/data/mllib'
    lambda_folder = '/tmp/lambda/spark/data/mllib'  # path layout the Lambda side expects
    filename = 'sample_kmeans_data.txt'
    # Mirror the data file under /tmp/lambda so the same path exists for driver and executors
    os.system('mkdir -p ' + lambda_folder)
    os.system('cp {}/{} {}/{}'.format(data_folder, filename, lambda_folder, filename))
    dataset = spark.read.format("libsvm").load('{}/{}'.format(lambda_folder, filename))

And then it worked fine.

I suppose many of the Python example files have this problem, so it can be a barrier for Python users.
