awslabs / amazon-emr-cli
A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs
License: Apache License 2.0
If the entry point file is in a local subfolder, the CLI attempts to use that subfolder when constructing the entry point path on S3.
We should either:
Currently, all boto3 clients are initialized without specifying an AWS profile name, so the default profile is always used.
Adding a --profile flag would significantly improve the usability of the CLI, since it is very common to switch profiles during development (for example, changing the target environment between dev and prod by switching the AWS profile).
Hi,
Do you think it would make sense to get a basic picture of how much time the job spent in each stage, and when it was initially started, just by looking at the logs in the terminal?
AWS has announced the AWS EMR CLI.
I have tried it, and the CLI works great and simplifies submitting jobs.
However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?
Here is a sample of how we are submitting jobs
emr run --entry-point entrypoint.py \
  --application-id --job-role <arn> \
  --s3-code-uri s3:///emr_scripts/ \
  --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --build \
  --wait
If you can kindly get back to us on this issue, that would be great.
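For reference, the Glue Data Catalog is typically enabled by adding the Hive client factory setting to --spark-submit-opts, as a later report in this thread also does (bucket and jar paths in your own command stay as they are):

```
--spark-submit-opts "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
```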
@dacort
Currently, the amazon-emr-vscode-toolkit provides a way to write your code against an EMR environment on your local machine, leveraging the EMR on EKS container image with a devcontainer.
This is limited to IDEs that support devcontainers. It would be good to have a standalone implementation within the EMR CLI that starts a local dev environment and mounts the project as a volume into the Docker container of the EMR image.
This feature would be interesting for PyCharm, which offers a way to connect to a remote Jupyter server.
Add --job-args (the variable is never used) and --spark-submit-opts (the variable does not exist) for EMR on EC2.
Currently, --job-args and --spark-submit-opts work only for EMR Serverless:
# application_id indicates EMR Serverless job
if application_id is not None:
    # We require entry-point and job-role
    if entry_point is None or job_role is None:
        raise click.BadArgumentUsage(
            "--entry-point and --job-role are required if --application-id is used."
        )
    if job_args:
        job_args = job_args.split(",")
    emrs = EMRServerless(application_id, job_role, p)
    emrs.run_job(
        job_name, job_args, spark_submit_opts, wait, show_stdout, s3_logs_uri
    )

# cluster_id indicates EMR on EC2 job
if cluster_id is not None:
    if job_args:
        job_args = job_args.split(",")
    emr = EMREC2(cluster_id, p, job_role)
    emr.run_job(job_name, job_args, wait, show_stdout)  # add spark_submit_opts
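To make the EC2 path symmetric with Serverless, run_job would need to splice the raw --spark-submit-opts string into the step's spark-submit argument list. A minimal sketch of that assembly (build_spark_submit_args is a hypothetical helper, not part of the CLI):

```python
import shlex

def build_spark_submit_args(entry_point, job_args=None, spark_submit_opts=None):
    """Assemble a spark-submit argument list for an EMR on EC2 step.

    The raw opts string is tokenized with shlex so quoted --conf values
    survive, then inserted before the entry point; job args go after it,
    matching spark-submit's positional conventions.
    """
    args = ["spark-submit"]
    if spark_submit_opts:
        args.extend(shlex.split(spark_submit_opts))
    args.append(entry_point)
    if job_args:
        args.extend(job_args)
    return args
```

For example, `build_spark_submit_args("main.py", ["--date", "2023-01-01"], "--conf spark.serializer=x")` yields a list starting with `spark-submit --conf spark.serializer=x main.py`.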
Hello, I am using macOS and running the Docker container.
emr run \
--entry-point entrypoint.py \
--application-id <ID> \
--job-role arn:aws:iam::043916019468:role/AmazonEMR-ExecutionRole-1693489747108 \
--s3-code-uri s3://jXXXv/emr_scripts/ \
--spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
--build \
--wait
[emr-cli]: Packaging assets into dist/
[+] Building 23.2s (10/15) docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.76kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 83B 0.0s
=> [internal] load metadata for docker.io/library/amazonlinux:2 0.6s
=> [auth] library/amazonlinux:pull token for registry-1.docker.io 0.0s
=> [base 1/7] FROM docker.io/library/amazonlinux:2@sha256:e218c279c7954a94c6f4c8ab106c3ea389675d429feec903bc4c93fa66ed4fd0 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 486B 0.0s
=> CACHED [base 2/7] RUN yum install -y python3 tar gzip 0.0s
=> CACHED [base 3/7] RUN python3 -m venv /opt/venv 0.0s
=> CACHED [base 4/7] RUN python3 -m pip install --upgrade pip 0.0s
=> ERROR [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 - 22.6s
------
> [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 -:
22.53 Retrieving Poetry metadata
22.53
22.53 # Welcome to Poetry!
22.53
22.53 This will download and install the latest version of Poetry,
22.53 a dependency and package manager for Python.
22.53
22.53 It will add the `poetry` command to Poetry's bin directory, located at:
22.53
22.53 /root/.local/bin
22.53
22.53 You can uninstall at any time by executing this script with the --uninstall option,
22.53 and these changes will be reverted.
22.53
22.53 Installing Poetry (1.6.1)
22.53 Installing Poetry (1.6.1): Creating environment
22.53 Installing Poetry (1.6.1): Installing Poetry
22.53 Installing Poetry (1.6.1): An error occurred. Removing partial environment.
22.53 Poetry installation failed.
22.53 See /poetry-installer-error-ulqu13bo.log for error logs.
------
Dockerfile:32
--------------------
30 |
31 | RUN python3 -m pip install --upgrade pip
32 | >>> RUN curl -sSL https://install.python-poetry.org | python3 -
33 |
34 | ENV PATH="$PATH:/root/.local/bin"
--------------------
ERROR: failed to solve: process "/bin/sh -c curl -sSL https://install.python-poetry.org | python3 -" did not complete successfully: exit code: 1
Traceback (most recent call last):
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/bin/emr", line 8, in <module>
sys.exit(cli())
^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context().obj, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/emr_cli.py", line 238, in run
p.build()
File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/packaging/python_project.py", line 56, in build
It would be useful to be able to install the CLI without having a functional Python installation, and across different operating systems.
PyInstaller can be utilized for this.
In addition, this would provide an easy installation method for packaging the EMR CLI into the VS Code extension.
Hi there,
As I understand it, there is a config_overrides = {} piece of code that always overrides all existing configs at the app level. Consequently, I assume that from the UI, for example while cloning or starting a new job, only extra "Spark properties" can be provided for the job run.
From your point of view: is this currently supported? And if not, what's the easiest way to implement it? Perhaps an additional call to the GetApplication API to fetch that information. Ideally, I want to have some common-ground dependencies and not have to worry about remembering to pass them every time.
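If fetching app-level defaults via the GetApplication API turns out to be the route, the merge itself is straightforward. A minimal sketch, assuming the application's default Spark properties can be obtained as a dict (merge_spark_properties is a hypothetical helper, not part of the CLI):

```python
def merge_spark_properties(app_defaults, job_props):
    """Merge job-run Spark properties on top of application-level defaults.

    This lets a job run supply only the extra properties it cares about,
    instead of replacing the whole app-level configuration. Job-level
    values win on key conflicts.
    """
    merged = dict(app_defaults)
    merged.update(job_props)
    return merged
```

The merged dict would then be used when building the job run's configuration overrides, rather than starting from an empty config_overrides.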
Currently, the deploy command assumes that the project has already been built or packaged.
We should either (or both):
- Add a --build flag to the deploy command, similar to what we have on the run command.

We recently added a new --show-logs flag for EMR on EC2, but this functionality is not yet supported on EMR Serverless.
In order to add this, we also need to add an --s3-log-uri flag for EMR Serverless. EMR on EC2 clusters have a common log URI defined at cluster start that we fetch at runtime.
With the recent --show-logs flag, we switch the deploy mode to client so that EMR steps can capture the driver stdout.
Unfortunately, client mode doesn't work with additional archives provided via the --archives flag or the --conf spark.archives parameter. See https://issues.apache.org/jira/browse/SPARK-36088 for a related issue.
In order to support this for cluster mode, we'd need to parse the step stderr logs to retrieve the YARN application ID, then fetch the YARN application logs from S3.
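The stderr parsing could be a simple regex over the step logs; a minimal sketch (find_yarn_app_id is a hypothetical helper, matching the standard application_&lt;clusterTimestamp&gt;_&lt;sequence&gt; ID format):

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>
YARN_APP_ID_RE = re.compile(r"application_\d+_\d+")

def find_yarn_app_id(stderr_text):
    """Return the first YARN application ID found in step stderr, or None."""
    match = YARN_APP_ID_RE.search(stderr_text)
    return match.group(0) if match else None
```

Once the ID is known, the aggregated YARN container logs can be located under the cluster's log URI on S3.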
EMR 6.7.0 and later support runtime roles, which allow you to provide a role to use with a specific step.
We should update the run command to accept a --job-role for EMR on EC2 clusters, similar to what we do for EMR Serverless.
Not everybody wants to use Docker to build artifacts, and in some cases, such as in a CI pipeline, it may be undesirable.
We should add support for some sort of --local-build flag that, if the system is compatible (Amazon Linux 2 only?), runs the proper set of commands to bootstrap the build environment and package the artifacts.
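A best-effort compatibility check could inspect /etc/os-release before attempting a local build; a minimal sketch (hypothetical helper, not part of the CLI):

```python
import pathlib

def is_amazon_linux_2(os_release_path="/etc/os-release"):
    """Best-effort check for Amazon Linux 2, where --local-build could
    skip Docker and run the packaging commands directly on the host."""
    path = pathlib.Path(os_release_path)
    if not path.exists():
        return False
    info = path.read_text()
    return 'ID="amzn"' in info and 'VERSION_ID="2"' in info
```

If the check fails, the CLI could fall back to the existing Docker-based build with a warning rather than erroring out.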
What's the way to add external jars to a Spark application?
My requirement is to add the Delta Lake jars.
--conf spark.jars=s3://DOC-EXAMPLE-BUCKET/jars/delta-core_2.12-1.1.0.jar
The default timeout for jobs in EMR Serverless is 12 hours (https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/considerations.html). For most use cases this timeout is sufficient, but not for all, especially for Spark Structured Streaming jobs, which run continuously.
The proposal is to add an --emr-serverless-timeout flag to the run command and pass the value to the start_job_run method.
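A sketch of how the flag's value could be threaded through to boto3's start_job_run (the helper name is hypothetical; executionTimeoutMinutes is the actual start_job_run parameter, and per the EMR Serverless docs a value of 0 disables the timeout, which is what a streaming job would want):

```python
def start_job_run_kwargs(application_id, job_role_arn, entry_point,
                         timeout_minutes=None):
    """Build keyword arguments for emr-serverless start_job_run.

    When timeout_minutes is None, the parameter is omitted and the
    service default (12 hours) applies; 0 disables the timeout.
    """
    kwargs = {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
    }
    if timeout_minutes is not None:
        kwargs["executionTimeoutMinutes"] = timeout_minutes
    return kwargs
```

The CLI would then call `client.start_job_run(**start_job_run_kwargs(...))` with the flag's value.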
The EMR CLI has been available for a while now, and through my own usage and others', we have a good idea of the final set of commands and subcommands that should be supported by the CLI.
Today, certain things are confusing:
- package, deploy, and run: emr package builds a local version of the assets, while emr run ... --build both packages and deploys the assets.
- build and run.
Typically, I only use run ... --build, but in CI/CD pipelines both package and deploy can be useful: package if you want to move the assets yourself, and deploy if you want the CLI to do the copy for you in one step.
It would be useful to be able to chain these commands as opposed to providing parameters. For example, emr build deploy run --entrypoint file.py ... would perform all of build, deploy, and run in that order. That said, there are some things that don't make sense, so we should protect against scenarios like deploy run or emr build run.
Today, the EMR CLI supports a config file for storing information related to your deployment environment (app ID, job role, etc.), but it's completely undocumented. We should add documentation for it, at a minimum.
A plain emr run would be much easier than emr run with 9 different flags. So we could add a new option like emr run --save-config that would save the set of arguments you used so you can reuse them the next time. This is useful when you're using the EMR CLI to iteratively develop/run a job in a remote environment.
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: PENDING
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: FAILED
[emr-cli]: EMR Serverless job failed: Job failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions: Caused by: java.io.IOException: error=2, No such file or directory
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory...
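This particular failure usually means the Spark driver could not find the packaged virtualenv at ./environment/bin/python. Per the EMR Serverless docs for custom Python environments, the archive has to be attached and both interpreter paths pointed into it, roughly like this (bucket and archive names are placeholders; the emr-cli normally sets these up when it packages dependencies):

```
--conf spark.archives=s3://<bucket>/pyspark_deps.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```

If the spark.archives conf is missing or points at an archive that failed to build (as in the Poetry install error above), the interpreter path will not exist at runtime and the job fails with exactly this "No such file or directory" error.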