
spark-eks's Introduction

spark-on-eks

Examples and custom spark images for working with the spark-on-k8s operator on AWS.

Allows using Spark 2 with IRSA, and Spark 3 with both IRSA and AWS Glue as a metastore.

Note: Spark 3 images also include the relevant jars for working with the S3A committers

If you're looking for the Spark 3 custom distributions, you can find them here

Note: Spark 2 images will not be updated, please see the FAQ



Prerequisites

Suggested values for the helm chart can be found in the flux example.

Note: Do not create the spark service account automatically as part of the chart installation; a sketch of the relevant values follows.
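For example, a minimal values sketch that keeps the chart from creating the spark service account (this assumes the spark-on-k8s-operator chart's serviceAccounts keys, which may differ between chart versions; the service account must instead be created manually with the IRSA annotation shown below):

# hedged sketch of operator chart values, not the full flux example
serviceAccounts:
  spark:
    # let us create the IRSA-annotated "spark" service account ourselves
    create: false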

using IAM roles for service accounts on EKS

Creating roles and service accounts

  • Create an AWS IAM role for the driver
  • Create an AWS IAM role for the executors

AWS docs on creating policies and roles

  • Add a default service account EKS role for executors in your spark job namespace (optional)
# NOTE: Only required when using a prebuilt Spark < 3.1 (i.e., not building Spark from source). From Spark 3.1, executors rely on the driver's service account definition; before 3.1 they execute with the default service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    # can also be the driver role
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/executor-role"
  • Make sure the spark service account (used by driver pods) is annotated with an EKS role as well
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/driver-role"

Building a compatible image

Submit your spark application with IRSA support

Select the right implementation for you

Below are examples for the latest versions.

If you want to use pinned versions, all images are tagged by the commit SHA.

You can find a full list of tags here

# spark2
FROM bbenzikry/spark-eks:spark2-latest
# spark3
FROM bbenzikry/spark-eks:spark3-latest
# pyspark2
FROM bbenzikry/spark-eks:pyspark2-latest
# pyspark3
FROM bbenzikry/spark-eks:pyspark3-latest

Submit your SparkApplication spec

hadoopConf:
  # IRSA configuration
  "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
driver:
  .....
  labels:
    .....
  serviceAccount: SERVICE_ACCOUNT_NAME

  # See: https://github.com/kubernetes/kubernetes/issues/82573
  # Note: securityContext has changed in recent versions of the operator to podSecurityContext
  podSecurityContext:
    fsGroup: 65534
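
Putting the pieces together, here is a minimal end-to-end sketch (not an official example from this repo: the image, bucket, and mainClass are placeholders, and the apiVersion matches the operator version used in the issues below):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-app
  namespace: SPARK_JOB_NAMESPACE
spec:
  type: Scala
  mode: cluster
  # placeholder image built FROM one of the bases above
  image: "ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my-spark:latest"
  mainClass: org.example.MyApp  # hypothetical entry point
  mainApplicationFile: "s3a://my-bucket/jars/my-app.jar"
  sparkVersion: "3.0.1"
  hadoopConf:
    # IRSA: pick up credentials from the projected web identity token
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark  # the IRSA-annotated account from above
    podSecurityContext:
      fsGroup: 65534  # lets the pod read the projected token file
  executor:
    cores: 1
    instances: 2
    memory: "1g"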

Working with AWS Glue as metastore

Glue Prerequisites

  • Make sure your driver and executor roles have the relevant Glue permissions. The statement below grants access to db1/table1 and is only an example; modify it to match what your Spark applications need. Note that Hive-enabled Spark sessions also touch the default and global_temp databases.
{
  "Effect": "Allow",
  "Action": ["glue:*Database*", "glue:*Table*", "glue:*Partition*"],
  "Resource": [
    "arn:aws:glue:us-west-2:123456789012:catalog",
    "arn:aws:glue:us-west-2:123456789012:database/db1",
    "arn:aws:glue:us-west-2:123456789012:table/db1/table1",
    "arn:aws:glue:us-west-2:123456789012:database/default",
    "arn:aws:glue:us-west-2:123456789012:database/global_temp",
    "arn:aws:glue:us-west-2:123456789012:database/parquet"
  ]
}
  • Make sure you are using the patched operator image
  • Add a config map to your spark job namespace as defined here
apiVersion: v1
data:
  hive-site.xml: |-
    <configuration>
        <property>
            <name>hive.imetastoreclient.factory.class</name>
            <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
        </property>
    </configuration>
kind: ConfigMap
metadata:
  namespace: SPARK_JOB_NAMESPACE
  name: spark-custom-config-map

Submitting your application

To submit an application with Glue support, reference the configmap in your SparkApplication spec:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "my-spark-app"
  namespace: SPARK_JOB_NAMESPACE
spec:
  sparkConfigMap: spark-custom-config-map

Working with the spark history server on S3

  • Use the appropriate spark version and deploy the helm chart

  • Flux / Helm values reference here
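
For reference, a hedged values sketch for the stable/spark-history-server chart, assembled from the helm flags quoted in the issues section below (bucket and service account names are placeholders):

rbac:
  create: false
serviceAccount:
  create: false
  name: spark  # existing IRSA-annotated service account with S3 access
image:
  repository: bbenzikry/spark-eks
  tag: spark3-latest
pvc:
  enablePVC: false
nfs:
  enableExampleNFS: false
service:
  type: ClusterIP
s3:
  enableS3: true
  # bucket/prefix where your jobs write event logs
  logDirectory: s3a://MY_BUCKET/spark-events/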

FAQ

  • Where can I find a Spark 2 build with Glue support?

    As Spark 2 becomes less and less relevant, I opted not to add Glue support. You can take a look here for a reference build script, which you can use to build a Spark 2 distribution for the Spark 2 Dockerfile.

  • Why a patched operator image?

    The patched image is a simple implementation for working properly with custom configuration files in the spark operator. It may be submitted as a PR in the future, or another implementation may take its place. For more information, see the related issue kubeflow/spark-operator#216

spark-eks's People

Contributors

bbenzikry


spark-eks's Issues

Performance Degradation while using Glue

Hi @bbenzikry,

I am using the spark image (with IRSA, Glue, etc.) and am finally able to access the Glue metastore. I set the hive-site.xml properties using the sparkConfigMap field. The problem I'm now seeing is very slow performance while running the spark job. Looking at the driver logs, the job spends 15 minutes at the "registering block manager" step before completing. Below is where it gets stuck for 15 minutes (08:16 -> 08:31):
[driver log screenshot]

Also below is the YAML code for reference.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: testpyspark
  namespace: dev
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "<my image repository>"
  imagePullPolicy: Always
  mainApplicationFile: "s3a://spark-bucket/code/TestFile.py"
  sparkVersion: "3.0.1"
  sparkConfigMap: spark-config-map
  sparkConf:
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    "spark.kubernetes.authenticate.driver.serviceAccountName": "svc-spark-iam"
  driver:
    cores: 2
    memory: 1000m
    labels:
      version: 3.0.1
  executor:
    cores: 1
    instances: 1
    memory: 500m
    labels:
      version: 3.0.1

In the PySpark job, I'm using the following code to create the Spark session and then using that session to run SQL queries on Glue tables.
spark = SparkSession.builder.appName("testpyspark").enableHiveSupport().getOrCreate()

Do you have an idea as to why this slow performance would be happening? Would appreciate any guidance :)

P.S. This only happens when I create the Spark session with the "enableHiveSupport()" option.

Update references for spark 3.1 builds

  • Deprecate spark-glue repository
  • Update docker build pipeline for 3.1.2 w/ Hadoop 3.3.0 using Earthly
  • Update flux to v2
  • Update spark operator fork to include upstream changes w/ subpath additions
  • Change spark operator image reference to new upstream fork

S3 Access Denied in Spark History Server

Hi @bbenzikry,

I am trying to set up the spark-history-server using the link you provided and values similar to what you suggested in the flux folder. I am using an S3 bucket and IRSA (an IAM role) for accessing the bucket where event logs are stored. I already have a service account set up, linked to an IAM role (IRSA) with full access to the S3 bucket, so I am telling the helm chart not to create a new service account and to use the existing one instead. Below is my helm command for reference:

helm install spark-history-server --set rbac.create=false,serviceAccount.create=false,serviceAccount.name=svc-spark-iam,image.repository=bbenzikry/spark-eks,image.tag=spark3-latest,pvc.enablePVC=false,nfs.enableExampleNFS=false,service.type=ClusterIP,s3.enableS3=true,s3.logDirectory=s3a://spark-bucket/events/ stable/spark-history-server --namespace dev

Now, once the pod is launched, I can see that the container is using the service account defined above and assuming the IAM role associated with that service account. I can see the IAM role in the "AWS_ROLE_ARN" key in the pod environment when I describe the pod. But even then, I get an Access Denied "403 Forbidden" exception from S3, as below:

Exception in thread "main" java.nio.file.AccessDeniedException: s3a://spark-bucket/events/: getFileStatus on s3a://spark-bucket/events/: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 756F67F5ED6E44E1; S3 Extended Request ID: hdyxhz/KSXnB4HY5eWSlfikJzTbNPeo+w9BTysu9iL2QwQxZIc6Pb++aAzAH2NsldNDevsc1Fkg=; Proxy: null), S3 Extended Request ID: hdyxhz/KSXnB4HY5eWSlfikJzTbNPeo+w9BTysu9iL2QwQxZIc6Pb++aAzAH2NsldNDevsc1Fkg=:403 Forbidden

Any idea why this would be happening? Would appreciate any guidance :)

several troubles

hi, several pieces of feedback after testing this:

  1. the spark2 image actually contains spark3 on Docker Hub
  2. the spark2 image does not connect to glue (empty default database) in my tests
  3. the pyspark tags are not accessible on Docker Hub
  4. the spark3 image breaks with a missing datanucleus lib
  5. the released binary does not provide hadoop as stated here https://github.com/bbenzikry/spark-glue/releases

conclusion:

unable to connect spark3 to glue

Support for Spark 3.1.1

Hi,

I tried to follow the Dockerfile for Spark 3.1.1, but S3 connectivity in EKS didn't work. The same steps work for 3.0.1, however.

How can I find the right combination of JARs? I get a timeout error in 3.1.1:

AmazonHttpClient: Unable to execute HTTP request: connect timed out
java.net.SocketTimeoutException: connect timed out
