
datajob's People

Contributors

dependabot[bot], lorenzocevolani, petervandenabeele, vincentclaes

datajob's Issues

mention cdk bootstrap in the readme

Do you wish to deploy these changes (y/n)? y
data-pipeline-simple-dev: deploying...

 โŒ  data-pipeline-simple-dev failed: Error: This stack uses assets, so the toolkit stack must be deployed to the environment (Run "cdk bootstrap aws://077590795309/eu-west-1")
    at Object.addMetadataAssetsToManifest (/usr/local/lib/node_modules/aws-cdk/lib/assets.ts:27:11)
    at Object.deployStack (/usr/local/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:205:29)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at CdkToolkit.deploy (/usr/local/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/usr/local/lib/node_modules/aws-cdk/bin/cdk.ts:204:9)
This stack uses assets, so the toolkit stack must be deployed to the environment (Run "cdk bootstrap aws://077590795309/eu-west-1")
Traceback (most recent call last):
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/bin/datajob", line 5, in <module>
    run()
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 20, in run
    app()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 51, in deploy
    call_cdk(command="deploy", args=args, extra_args=extra_args)
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 103, in call_cdk
    subprocess.check_call(shlex.split(full_command))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cdk', 'deploy', '--app', 'python /Users/vincent/Workspace/datajob/examples/data_pipeline_simple/datajob_stack.py', '-c', 'stage=dev']' returned non-zero exit status 1.
(datajob-KxqvMF6C-py3.6) Vincents-MacBook-Pro:data_pipeline_simple vincent$ 
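
A possible README addition, lifted straight from the error message: bootstrap the CDK toolkit stack once per account/region before the first deploy (the account id and region below are placeholders).

    cdk bootstrap aws://ACCOUNT_ID/REGION   # e.g. aws://123456789012/eu-west-1
    datajob deploy --config datajob_stack.py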

make the workflow name optional

    with StepfunctionsWorkflow(
        datajob_stack=mailswitch_stack, name="workflow"
    ) as step_functions_workflow:
        join_labels >> ...

It might also be easier to execute a workflow that has the same name as the stack.
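
A hedged sketch of how the default could look; the constructor signature and the `unique_stack_name` attribute on the stack are assumptions:

    # sketch: fall back to the stack name when no workflow name is given
    def __init__(self, datajob_stack, name=None, **kwargs):
        self.name = name if name is not None else datajob_stack.unique_stack_name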

better handle the case where no aws account can be resolved

Check this in advance and raise a clear error from within datajob, instead of letting cdk fail with the message below.

Unable to resolve AWS account to use. It must be either configured when you define your CDK or through the environment
Traceback (most recent call last):
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/bin/datajob", line 8, in <module>
    sys.exit(run())
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 17, in run
    app()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 37, in deploy
    call_cdk(command="deploy", args=args, extra_args=extra_args)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 73, in call_cdk
    subprocess.check_call(shlex.split(full_command))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cdk', 'deploy', '--app', 'python /Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py', '-c', 'stage=stg']' returned non-zero exit status 1.
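
A hedged sketch of such an up-front check, failing fast with a readable message before shelling out to cdk:

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    def assert_aws_account_resolvable():
        """Raise a clear datajob error when no AWS account can be resolved."""
        try:
            return boto3.client("sts").get_caller_identity()["Account"]
        except (BotoCoreError, ClientError) as exception:
            raise Exception(
                "datajob could not resolve an AWS account; configure credentials "
                "or set the account on your CDK environment."
            ) from exception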

expand datajob to deploy ecs fargate tasks

  • can we subclass DataJobBase and implement the requirements for an ECS Fargate task/job? (a sketch follows after this list)
  • maybe name it FargateJob? ("job" is consistent within the lib, but I think "task" is the correct term for ECS/Fargate)
  • can we add an ECS Fargate job to a Step Functions workflow?
  • add a test that creates a FargateJob class
  • add an example that synthesizes a Fargate job in GitHub Actions
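
A minimal sketch of what such a subclass could look like, assuming the DataJobBase/create() interface visible in the tracebacks elsewhere on this page; the import path, constructor signature, and sizing defaults are all guesses:

    from aws_cdk import aws_ecs as ecs
    from datajob.datajob_base import DataJobBase  # assumed import path

    class FargateJob(DataJobBase):
        """Sketch: run a container image as an ECS Fargate task."""

        def __init__(self, datajob_stack, name, image, **kwargs):
            super().__init__(datajob_stack, name)
            self.image = image
            self.kwargs = kwargs

        def create(self):
            # a Fargate task definition pointing at the user's image
            task_definition = ecs.FargateTaskDefinition(
                self, f"{self.unique_name}-task-definition", cpu=256, memory_limit_mib=512
            )
            task_definition.add_container(
                f"{self.unique_name}-container",
                image=ecs.ContainerImage.from_registry(self.image),
                **self.kwargs,
            )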

upgrade to cdk 1.87.1

  • if you want this to work in a VS Code devcontainer,
  • you need 1.87.1 for both the CLI and the Python libs
  • fix the dependencies accordingly (hedged pins below)
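
A hedged example of matching pins; the exact set of aws-cdk Python packages datajob needs may differ:

    npm install -g aws-cdk@1.87.1

    # pyproject.toml -- keep the python libs on the same version as the CLI
    [tool.poetry.dependencies]
    "aws-cdk.core" = "1.87.1"
    "aws-cdk.aws-glue" = "1.87.1"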

handle aws region better

the Step Functions workflow should first inherit the region from the datajob stack before checking environment variables

Traceback (most recent call last):
  File "/Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py", line 87, in <module>
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name=stackname) as sfn:
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/jsii/_runtime.py", line 83, in __call__
    inst = super().__call__(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/stepfunctions/stepfunctions_workflow.py", line 53, in __init__
    self.region = region if region else os.environ["AWS_DEFAULT_REGION"]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'AWS_DEFAULT_REGION'

self.region = region if region else os.environ["AWS_DEFAULT_REGION"]
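
The offending line is quoted above. A hedged sketch of the proposed order (the stack attribute name is an assumption): prefer the explicit argument, then the stack, and only then the environment, without raising:

    import os

    def resolve_region(region, datajob_stack):
        """Sketch: explicit arg first, then the stack's region, then the env var."""
        return (
            region
            or getattr(datajob_stack, "region", None)
            or os.environ.get("AWS_DEFAULT_REGION")  # returns None instead of raising KeyError
        )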

implement a notification="[email protected]" argument on the StepfunctionsWorkflow

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow", notification="some-email...") as sfn:
        task1 >> task2
  • if we define a notification, we create an SNS topic and add it to the pipeline
  • it accepts an email address as a string, or a list of email addresses
  • it notifies on failure as well as on success (a sketch of the SNS part follows)
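
A hedged sketch of the SNS part with standard CDK constructs; the construct ids and the wiring into the state machine are assumptions:

    from aws_cdk import aws_sns as sns
    from aws_cdk import aws_sns_subscriptions as subscriptions

    def create_notification_topic(scope, name, notification):
        """Sketch: one topic, one email subscription per address."""
        topic = sns.Topic(scope, f"{name}-notification")
        emails = [notification] if isinstance(notification, str) else notification
        for email in emails:
            topic.add_subscription(subscriptions.EmailSubscription(email))
        return topic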

example deploy fails `Error: Invalid S3 bucket name (value: data-pipeline-simple-None-deployment-bucket)`

The stage was never set, so the literal `None` ended up in the bucket name; its capital letter makes the name invalid.

I ran:

export AWS_DEFAULT_ACCOUNT=_____________29
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=your-region # e.g. eu-west-1

<..>/datajob/examples/data_pipeline_simple$ datajob deploy --config datajob_stack.py 
cdk command: cdk deploy --app  "python <..>/datajob/examples/data_pipeline_simple/datajob_stack.py"  -c stage=None
jsii.errors.JavaScriptError: 
  Error: Invalid S3 bucket name (value: data-pipeline-simple-None-deployment-bucket)
  Bucket name must only contain lowercase characters and the symbols, period (.) and dash (-) (offset: 21)
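
A hedged sketch of a guard that would fail earlier with a clearer message; the variable names and the exact place the bucket name is composed are assumptions:

    # sketch: validate the stage before composing the deployment bucket name
    if stage is None:
        raise ValueError("no stage configured; set one explicitly instead of letting it default to None")
    bucket_name = f"{stack_name}-{stage}-deployment-bucket".lower()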

bug when running last jobs in parallel

Hi V, I noticed that the workflow fails if the last step is not a single job.
For example, task1 >> [task2, task3] fails,
but [task1, task2] >> task3 works.
(In my example the tasks are independent.)
Is this the expected behavior?

bug with credentials

(node:17808) ExperimentalWarning: The fs.promises API is experimental
python: can't open file 'deployment_glue_datajob.py': [Errno 2] No such file or directory
Subprocess exited with error 2
DVCL643@10NB03610:~/workspace/python/aws_best_practices$ cd glue
DVCL643@10NB03610:~/workspace/python/aws_best_practices/glue$ cdk deploy --app  "python deployment_glue_datajob.py"
(node:10368) ExperimentalWarning: The fs.promises API is experimental
Traceback (most recent call last):
  File "deployment_glue_datajob.py", line 60, in <module>
    python_job >> pyspark_job
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 115, in __exit__
    self._build_workflow()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 91, in _build_workflow
    self.client = boto3.client("stepfunctions")
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\boto3\__init__.py", line 93, in client
    return _get_default_session().client(*args, **kwargs)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\boto3\session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\session.py", line 826, in create_client
    credentials = self.get_credentials()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\session.py", line 431, in get_credentials
    'credential_provider').load_credentials()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1975, in load_credentials
    creds = provider.load()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1102, in load
    credentials = fetcher(require_expiry=False)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1137, in fetch_credentials
    provider=method, cred_var=mapping['secret_key'])
botocore.exceptions.PartialCredentialsError: Partial credentials found in env, missing: AWS_SECRET_ACCESS_KEY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deployment_glue_datajob.py", line 60, in <module>
    python_job >> pyspark_job
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 74, in __exit__
    self.create_resources()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 93, in create_resources
    [resource.create() for resource in self.resources]
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 93, in <listcomp>
    [resource.create() for resource in self.resources]
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 104, in create
    text_file.write(self.workflow.get_cloudformation_template())
AttributeError: 'NoneType' object has no attribute 'get_cloudformation_template'
Subprocess exited with error 1
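
The first traceback boils down to PartialCredentialsError: AWS_ACCESS_KEY_ID is set but AWS_SECRET_ACCESS_KEY is not. Both must be present (or a profile configured) before boto3 can create the Step Functions client:

    export AWS_ACCESS_KEY_ID=<access key id>
    export AWS_SECRET_ACCESS_KEY=<secret access key>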

make all objects configurable

let the user pass **kwargs to:

  • all the CDK objects, via the create functions
  • all the Step Functions objects; check the stepfunctions_workflow module (a sketch follows)
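
A hedged sketch of the pattern; GlueJob, its constructor signature, and the default props are illustrative, not datajob's actual code:

    from aws_cdk import aws_glue as glue

    class GlueJob(DataJobBase):  # base class name taken from the issues above
        def __init__(self, datajob_stack, name, **kwargs):
            super().__init__(datajob_stack, name)
            self.kwargs = kwargs  # forwarded untouched to the CDK construct

        def create(self):
            defaults = {
                "command": glue.CfnJob.JobCommandProperty(
                    name="pythonshell", script_location=self.script_location  # assumed attribute
                ),
                "role": self.role_arn,  # assumed attribute
            }
            # user-supplied kwargs win on conflict
            glue.CfnJob(self, self.unique_name, **{**defaults, **self.kwargs})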

if the stepfunctions workflow is None we should handle it

If an error occurs, the context manager's exit function is still called while the workflow may still be None, which raises a second exception when trying to create the resources.

KeyError: 'AWS_DEFAULT_REGION'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py", line 91, in <module>
    ] >> crop_raster_per_country >> dump_data_layer_to_gbq >> dump_display_names_to_gbq
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 72, in __exit__
    self.create_resources()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 91, in create_resources
    [resource.create() for resource in self.resources]
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 91, in <listcomp>
    [resource.create() for resource in self.resources]
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/stepfunctions/stepfunctions_workflow.py", line 102, in create
    text_file.write(self.workflow.get_cloudformation_template())
AttributeError: 'NoneType' object has no attribute 'get_cloudformation_template'
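
A hedged sketch of the guard, with the method shape inferred from the traceback (the file path and logger are assumptions):

    def create(self):
        # the workflow may never have been built if an earlier exception
        # interrupted the context manager
        if self.workflow is None:
            logger.warning("no workflow was built; skipping template generation")
            return
        with open(self.template_path, "w") as text_file:
            text_file.write(self.workflow.get_cloudformation_template())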

bugfix - let boto3 handle region

Right now we explicitly specify a region when defining a Step Functions workflow.
We should let boto3 handle it implicitly,
and not raise an error when no region is found.
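
A hedged sketch: pass the region through as-is and let boto3's own resolution chain (explicit argument, environment, profile config) fill in the blanks:

    import boto3

    def stepfunctions_client(region=None):
        # region may be None; boto3 then falls back to AWS_DEFAULT_REGION,
        # the active profile's region, and so on, without us raising
        return boto3.client("stepfunctions", region_name=region)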

get a default sagemaker role

  • the sagemaker processor/estimator can use a default role when none is supplied
  • maybe a static function on SagemakerBase? (sketch below)
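
A hedged sketch with standard CDK IAM constructs; SagemakerBase and the method name are guesses:

    from aws_cdk import aws_iam as iam

    class SagemakerBase:
        @staticmethod
        def default_role(scope, name):
            """Sketch: a role sagemaker can assume, with the AWS-managed policy attached."""
            return iam.Role(
                scope,
                f"{name}-default-role",
                assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
                managed_policies=[
                    iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")
                ],
            )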

include stack name in the tasks that we run.

If you have the same task name and stage across 2 different pipelines, you get a conflict:
e.g. the name "task1" with stage "stg" results in task1-stg for both.
We need to prefix this with our stack name, e.g. my-stack-task1-stg.
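
In code the fix could be a one-liner (variable names assumed):

    unique_name = f"{stack_name}-{task_name}-{stage}"  # e.g. my-stack-task1-stg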

subclass SomeMockedClass from DatajobBase


from datajob.stepfunctions import stepfunctions_workflow  # module path as seen in the tracebacks
from stepfunctions.steps import Task  # AWS Step Functions Data Science SDK

@stepfunctions_workflow.task
class SomeMockedClass(object):
    def __init__(self, unique_name):
        self.unique_name = unique_name
        # the state the workflow wires into the state machine
        self.sfn_task = Task(state_id=unique_name)

This way the mock better resembles reality.

README: add an explanation of how to run the tests

On a Linux box with conda, this could serve as an explanation of how to get pytest running.

/home/peter_v/anaconda3/bin/python -m pip install --upgrade pip  # to avoid warnings about spyder 4.1.5 versions
make
sudo apt install nodejs  # to avoid massive warnings about RuntimeError: generator didn't stop after throw()

$ poetry run pytest
========================================== test session starts ===========================================
platform linux -- Python 3.8.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/peter_v/data/github/vincentclaes/datajob
collected 16 items                                                                                       

datajob_tests/test_datajob_context.py .                                                            [  6%]
datajob_tests/test_datajob_stack.py ....                                                           [ 31%]
datajob_tests/datajob_cli_tests/test_datajob_deploy.py .......                                     [ 75%]
datajob_tests/datajob_cli_tests/test_datajob_execute.py .                                          [ 81%]
datajob_tests/glue/test_glue_job.py .                                                              [ 87%]
datajob_tests/stepfunctions/test_stepfunctions_workflow.py ..                                      [100%]

=========================================== 16 passed in 5.62s ===========================================
