awslabs / amazon-emr-cli
A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs
License: Apache License 2.0
If the entry point file is in a local subfolder, the CLI attempts to use that subfolder when constructing the entry point path on S3.
We should either:
Currently, all boto3 clients are initialized without specifying an AWS profile name, so the default profile is always used.
Adding a --profile flag would significantly improve the usability of the CLI, since it is very common to switch profiles during development (for example, changing the target environment between dev and prod by switching the AWS profile).
Hi,
Do you think it would make sense to get a basic picture of how much time the job spent in each stage, and when it was initially started, just by looking at the logs in the terminal?
AWS has announced the AWS EMR CLI.
I have tried it, and the CLI works great and simplifies submitting jobs.
However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?
Here is a sample of how we are submitting jobs
emr run --entry-point entrypoint.py \
  --application-id --job-role <arn> \
  --s3-code-uri s3:///emr_scripts/ \
  --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --build \
  --wait
If you can kindly get back to us on this issue, that would be great.
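For reference, the Glue Data Catalog is typically enabled by adding the Hive client factory setting to --spark-submit-opts, as a later report in this thread also does (bucket and jar paths in your own command stay as they are):

```
--spark-submit-opts "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
```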
@dacort
Currently, the amazon-emr-vscode-toolkit provides a way to write your code against an EMR environment on your local machine, leveraging the EMR on EKS container image with a devcontainer.
This is limited to IDEs that support devcontainers. It would be good to have a standalone implementation within the EMR CLI that starts a local dev environment and mounts the project as a volume into the Docker container of the EMR image.
This feature would be interesting for PyCharm, which offers a way to connect to a remote Jupyter server.
Add --job-args (the variable is never used) and --spark-submit-opts (the variable does not exist) for EMR on EC2.
Currently, --job-args and --spark-submit-opts work only for EMR Serverless:
# application_id indicates EMR Serverless job
if application_id is not None:
    # We require entry-point and job-role
    if entry_point is None or job_role is None:
        raise click.BadArgumentUsage(
            "--entry-point and --job-role are required if --application-id is used."
        )
    if job_args:
        job_args = job_args.split(",")
    emrs = EMRServerless(application_id, job_role, p)
    emrs.run_job(
        job_name, job_args, spark_submit_opts, wait, show_stdout, s3_logs_uri
    )

# cluster_id indicates EMR on EC2 job
if cluster_id is not None:
    if job_args:
        job_args = job_args.split(",")
    emr = EMREC2(cluster_id, p, job_role)
    emr.run_job(job_name, job_args, wait, show_stdout)  # add spark_submit_opts
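To make the EC2 path symmetric with Serverless, run_job would need to splice the raw --spark-submit-opts string into the step's spark-submit argument list. A minimal sketch of that assembly (build_spark_submit_args is a hypothetical helper, not part of the CLI):

```python
import shlex

def build_spark_submit_args(entry_point, job_args=None, spark_submit_opts=None):
    """Assemble a spark-submit argument list for an EMR on EC2 step.

    The raw opts string is tokenized with shlex so quoted --conf values
    survive, then inserted before the entry point; job args go after it,
    matching spark-submit's positional conventions.
    """
    args = ["spark-submit"]
    if spark_submit_opts:
        args.extend(shlex.split(spark_submit_opts))
    args.append(entry_point)
    if job_args:
        args.extend(job_args)
    return args
```

For example, `build_spark_submit_args("main.py", ["--date", "2023-01-01"], "--conf spark.serializer=x")` yields a list starting with `spark-submit --conf spark.serializer=x main.py`.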
Hello, I am using macOS and running the Docker container.
emr run \
--entry-point entrypoint.py \
--application-id <ID> \
--job-role arn:aws:iam::043916019468:role/AmazonEMR-ExecutionRole-1693489747108 \
--s3-code-uri s3://jXXXv/emr_scripts/ \
--spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
--build \
--wait
[emr-cli]: Packaging assets into dist/
[+] Building 23.2s (10/15) docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.76kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 83B 0.0s
=> [internal] load metadata for docker.io/library/amazonlinux:2 0.6s
=> [auth] library/amazonlinux:pull token for registry-1.docker.io 0.0s
=> [base 1/7] FROM docker.io/library/amazonlinux:2@sha256:e218c279c7954a94c6f4c8ab106c3ea389675d429feec903bc4c93fa66ed4fd0 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 486B 0.0s
=> CACHED [base 2/7] RUN yum install -y python3 tar gzip 0.0s
=> CACHED [base 3/7] RUN python3 -m venv /opt/venv 0.0s
=> CACHED [base 4/7] RUN python3 -m pip install --upgrade pip 0.0s
=> ERROR [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 - 22.6s
------
> [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 -:
22.53 Retrieving Poetry metadata
22.53
22.53 # Welcome to Poetry!
22.53
22.53 This will download and install the latest version of Poetry,
22.53 a dependency and package manager for Python.
22.53
22.53 It will add the `poetry` command to Poetry's bin directory, located at:
22.53
22.53 /root/.local/bin
22.53
22.53 You can uninstall at any time by executing this script with the --uninstall option,
22.53 and these changes will be reverted.
22.53
22.53 Installing Poetry (1.6.1)
22.53 Installing Poetry (1.6.1): Creating environment
22.53 Installing Poetry (1.6.1): Installing Poetry
22.53 Installing Poetry (1.6.1): An error occurred. Removing partial environment.
22.53 Poetry installation failed.
22.53 See /poetry-installer-error-ulqu13bo.log for error logs.
------
Dockerfile:32
--------------------
30 |
31 | RUN python3 -m pip install --upgrade pip
32 | >>> RUN curl -sSL https://install.python-poetry.org | python3 -
33 |
34 | ENV PATH="$PATH:/root/.local/bin"
--------------------
ERROR: failed to solve: process "/bin/sh -c curl -sSL https://install.python-poetry.org | python3 -" did not complete successfully: exit code: 1
Traceback (most recent call last):
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/bin/emr", line 8, in <module>
sys.exit(cli())
^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context().obj, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/emr_cli.py", line 238, in run
p.build()
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/packaging/python_project.py", line 56, in build
It would be useful to be able to install the CLI without having a functional Python installation, and across different operating systems.
PyInstaller can be utilized for this.
In addition, this would provide an easy installation method for packaging the EMR CLI into the VS Code extension.
Hi there,
As I understand it, there is a config_overrides = {} piece of code that always overrides all existing configs at the app level. Consequently, I assume that from the UI, for example while cloning or starting a new job, only extra "Spark properties" can be provided for the job run.
From your point of view: is this currently supported? And if not, what's the easiest way to implement it? Perhaps an additional call to the GetApplication API to fetch that information. Ideally, I want to have some common-ground dependencies and not have to worry about remembering to pass them every time.
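If fetching app-level defaults via the GetApplication API turns out to be the route, the merge itself is straightforward. A minimal sketch, assuming the application's default Spark properties can be obtained as a dict (merge_spark_properties is a hypothetical helper, not part of the CLI):

```python
def merge_spark_properties(app_defaults, job_props):
    """Merge job-run Spark properties on top of application-level defaults.

    This lets a job run supply only the extra properties it cares about,
    instead of replacing the whole app-level configuration. Job-level
    values win on key conflicts.
    """
    merged = dict(app_defaults)
    merged.update(job_props)
    return merged
```

The merged dict would then be used when building the job run's configuration overrides, rather than starting from an empty config_overrides.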
Currently, the deploy command assumes that the project has already been built or packaged.
We should either (or both):
- Add a --build flag to the deploy command, similar to what we have on the run command.

We recently added a new --show-logs flag for EMR on EC2, but this functionality is not yet supported on EMR Serverless.
In order to add this, we also need to add an --s3-log-uri flag for EMR Serverless. EMR on EC2 clusters have a common log URI defined at cluster start that we fetch at runtime.
With the recent --show-logs flag, we switch the deploy mode to client so that EMR steps can capture the driver stdout.
Unfortunately, client mode doesn't work with additional archives provided via the --archives flag or the --conf spark.archives parameter. See https://issues.apache.org/jira/browse/SPARK-36088 for a related issue.
In order to support this for cluster mode, we'd need to parse the step stderr logs to retrieve the YARN application ID, then fetch the YARN application logs from S3.
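The stderr parsing could be a simple regex over the step logs; a minimal sketch (find_yarn_app_id is a hypothetical helper, matching the standard application_&lt;clusterTimestamp&gt;_&lt;sequence&gt; ID format):

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>
YARN_APP_ID_RE = re.compile(r"application_\d+_\d+")

def find_yarn_app_id(stderr_text):
    """Return the first YARN application ID found in step stderr, or None."""
    match = YARN_APP_ID_RE.search(stderr_text)
    return match.group(0) if match else None
```

Once the ID is known, the aggregated YARN container logs can be located under the cluster's log URI on S3.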
EMR 6.7.0 and later support runtime roles, which allow you to provide a role to use with a specific step.
We should update the run command to accept a --job-role for EMR on EC2 clusters, similar to what we do for EMR Serverless.
Not everybody wants to use Docker to build artifacts, and in some cases, such as in a CI pipeline, it may be undesirable.
We should add support for some sort of --local-build flag that, if the system is compatible (Amazon Linux 2 only?), runs the proper set of commands to bootstrap the build environment and package the artifacts.
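A best-effort compatibility check could inspect /etc/os-release before attempting a local build; a minimal sketch (hypothetical helper, not part of the CLI):

```python
import pathlib

def is_amazon_linux_2(os_release_path="/etc/os-release"):
    """Best-effort check for Amazon Linux 2, where --local-build could
    skip Docker and run the packaging commands directly on the host."""
    path = pathlib.Path(os_release_path)
    if not path.exists():
        return False
    info = path.read_text()
    return 'ID="amzn"' in info and 'VERSION_ID="2"' in info
```

If the check fails, the CLI could fall back to the existing Docker-based build with a warning rather than erroring out.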
What's the way to add external jars to a Spark application?
My requirement is to add the Delta Lake jars.
--conf spark.jars=s3://DOC-EXAMPLE-BUCKET/jars/delta-core_2.12-1.1.0.jar
The default timeout for jobs in EMR Serverless is 12 hours (https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/considerations.html). For most use cases this timeout is sufficient, but not for all, especially for Spark Structured Streaming jobs, which run continuously.
The proposal is to add an --emr-serverless-timeout flag to the run command and pass the value to the start_job_run method.
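A sketch of how the flag's value could be threaded through to boto3's start_job_run (the helper name is hypothetical; executionTimeoutMinutes is the actual start_job_run parameter, and per the EMR Serverless docs a value of 0 disables the timeout, which is what a streaming job would want):

```python
def start_job_run_kwargs(application_id, job_role_arn, entry_point,
                         timeout_minutes=None):
    """Build keyword arguments for emr-serverless start_job_run.

    When timeout_minutes is None, the parameter is omitted and the
    service default (12 hours) applies; 0 disables the timeout.
    """
    kwargs = {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
    }
    if timeout_minutes is not None:
        kwargs["executionTimeoutMinutes"] = timeout_minutes
    return kwargs
```

The CLI would then call `client.start_job_run(**start_job_run_kwargs(...))` with the flag's value.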
The EMR CLI has been available for a while now, and through my own usage and others', we have a good idea of the final set of commands and subcommands that should be supported by the CLI.
Today, certain things are confusing:
- package, deploy, and run: emr package builds a local version of the assets, while emr run ... --build both packages and deploys the assets.
- build and run.
Typically, I only use run ... --build, but in CI/CD pipelines both package and deploy can be useful: package if you want to move the assets yourself, and deploy if you want the CLI to do the copy for you in one step.
It would be useful to be able to chain these commands as opposed to providing parameters. For example, emr build deploy run --entrypoint file.py ... would perform all of build, deploy, and run in that order. That said, there are some things that don't make sense, so we should protect against scenarios like deploy run or emr build run.
Today, the EMR CLI supports a config file for storing information related to your deployment environment (app ID, job role, etc.), but it's completely undocumented. We should add documentation for it, at a minimum.
A plain emr run would be much easier than emr run with 9 different flags. So we could add a new option like emr run --save-config that would save the set of arguments you used so you can reuse them the next time. This is useful when you're using the EMR CLI to iteratively develop/run a job in a remote environment.
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: PENDING
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: FAILED
[emr-cli]: EMR Serverless job failed: Job failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions: Caused by: java.io.IOException: error=2, No such file or directory
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory...
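This particular failure usually means the Spark driver could not find the packaged virtualenv at ./environment/bin/python. Per the EMR Serverless docs for custom Python environments, the archive has to be attached and both interpreter paths pointed into it, roughly like this (bucket and archive names are placeholders; the emr-cli normally sets these up when it packages dependencies):

```
--conf spark.archives=s3://<bucket>/pyspark_deps.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```

If the spark.archives conf is missing or points at an archive that failed to build (as in the Poetry install error above), the interpreter path will not exist at runtime and the job fails with exactly this "No such file or directory" error.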