agha-data-validation-pipeline's People

Contributors

pdiakumis, reisingerf, sarahcasauria, scwatts, williamputraintan


agha-data-validation-pipeline's Issues

Unlock bucket mechanism for FAILED checks

Context:

Currently, if a submission fails any check (Data Validation or Manifest Check), the folderLock bucket policy for that submission has to be deleted manually.

Actions:

  • Extend the current folderLock lambda so that it can also unlock a submission. (The current implementation only locks the submission.)
  • Trigger the unlock lambda as soon as a failed check is detected.
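
A minimal sketch of what the unlock step might look like, working on the bucket policy document as a plain dictionary. It assumes that a lock is a Deny statement whose Resource is `arn:aws:s3:::<bucket>/<submission_prefix>*`; the actual folderLock lambda may identify its statements differently, and the function name `unlock_submission` is hypothetical.

```python
import copy


def unlock_submission(policy: dict, bucket: str, submission_prefix: str) -> dict:
    """Return a copy of `policy` without the lock statements for one submission.

    Assumption: a lock is a Deny statement whose Resource is
    arn:aws:s3:::<bucket>/<submission_prefix>*.
    """
    locked_arn = f"arn:aws:s3:::{bucket}/{submission_prefix}*"

    def is_lock(statement: dict) -> bool:
        resources = statement.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        return statement.get("Effect") == "Deny" and locked_arn in resources

    # Work on a copy so the caller's policy document is left untouched.
    unlocked = copy.deepcopy(policy)
    unlocked["Statement"] = [
        s for s in unlocked.get("Statement", []) if not is_lock(s)
    ]
    return unlocked
```

The lambda would then write the returned document back to the bucket, which is also the step exposed to the concurrency issue described further down.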

Enable DynamoDB Stream

Context:

  • Data in DynamoDB may need to be exported elsewhere, and the preferred way is to sync it automatically. One use case is exporting data to elsa-data.

  • To trigger an action when DynamoDB is updated, for example sending a notification based on the DynamoDB data. See #26.

DynamoDB Streams supports batch-event processing, which allows events to be processed in batches. See stream-event-source.
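
As a rough sketch of batch processing, a Lambda function attached to the stream receives a list of records per invocation. The handler below forwards the new image of inserted and modified items to a `forward` callback, which is a hypothetical stand-in for an export or a notifier. Returning `batchItemFailures` only has an effect if `ReportBatchItemFailures` is enabled on the event source mapping.

```python
def handler(event, context, forward=print):
    """Process one batch of DynamoDB stream records.

    `forward` is a hypothetical callback, e.g. an export or a notifier.
    """
    for record in event.get("Records", []):
        try:
            # REMOVE events carry no NewImage, so only inserts/updates are forwarded.
            if record["eventName"] in ("INSERT", "MODIFY"):
                forward(record["dynamodb"]["NewImage"])
        except Exception:
            # Stop at the first failure and report it, so that Lambda can
            # retry the batch starting from this record.
            sequence_number = record["dynamodb"]["SequenceNumber"]
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    return {"batchItemFailures": []}
```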

Action:
Implement

Include Submission Prefixes on S3 Sharing

Context:
The same filename may appear multiple times within a flagship. This can cause a problem when sharing objects, because the destination key is constructed from the filename alone, so files get overwritten when shared. (Matching filenames typically belong to the same studyId.)

Todo:
Include the submission prefix in the destination key so that objects are not overwritten when shared.
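
A small sketch of the idea, assuming source keys are laid out as `<flagship>/<submission>/<filename>`. The layout, the example keys, and the function name `destination_key` are illustrative assumptions, not taken from the pipeline.

```python
import posixpath


def destination_key(source_key: str, destination_root: str = "") -> str:
    """Build the shared object's key from the full source key, not just the filename."""
    return posixpath.join(destination_root, source_key.lstrip("/"))


first = "ACG/2022-01-01/sample.vcf.gz"
second = "ACG/2022-03-01/sample.vcf.gz"

# Using the filename alone, both objects would map to the same destination key.
assert posixpath.basename(first) == posixpath.basename(second)

# Keeping the submission prefix keeps the destination keys distinct.
assert destination_key(first, "shared") != destination_key(second, "shared")
```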

Ref:
https://trello.com/c/NFSr0Gjq

Submit BatchJob `TooManyRequestsException`

Submitting a large number of Batch jobs at once failed. This happens when large submissions are processed at the same time.

ClientError: An error occurred (TooManyRequestsException) when calling the SubmitJob operation (reached max retries: 4): Too Many Requests

~ CloudWatch

According to the AWS-Blog, we may have hit one of the Service Quotas limits. The current suspect is the last item in the list of service quotas:

Maximum number of transactions per second (TPS) for each account for SubmitJob operations

Re-running the validation_manager lambda (which triggers the SubmitJob call) can work around it, but this is tracked here in case it happens again.
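
One possible mitigation, sketched here as an assumption rather than what the pipeline does, is to retry throttled submissions with exponential backoff and jitter. `submit` is any callable that submits one job, for example a thin wrapper around the boto3 Batch client's `submit_job`. The throttling check relies on the error exposing a `response` dictionary, as botocore's `ClientError` does.

```python
import random
import time


def is_throttled(error: Exception) -> bool:
    """True for errors shaped like botocore's ClientError with a throttling code."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") == "TooManyRequestsException"


def submit_with_backoff(submit, job, max_attempts=6, base_delay=0.5, sleep=time.sleep):
    """Call submit(job), retrying throttled calls with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return submit(job)
        except Exception as error:
            # Give up on non-throttling errors and on the final attempt.
            if not is_throttled(error) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter spreads retries out so that many lambdas throttled at the same moment do not all retry at the same moment as well.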

Batch Notification

Context:
No notification is sent when Batch has finished checking the data.

Solution:
Implement a notification using EventBridge for an all-success or all-fail batch result. This solution only notifies and does not take any action to move objects, which ensures the administrator keeps full control of what needs to be moved.
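
A sketch of the two pieces this would need. The event pattern matches AWS Batch jobs reaching a terminal state; the helper then reduces the statuses of all jobs in a submission to a single outcome. How the statuses are collected (for example from DynamoDB) is left open, and the outcome labels and the function name `batch_outcome` are hypothetical.

```python
# EventBridge event pattern for AWS Batch jobs that reach a terminal state.
BATCH_JOB_FINISHED_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["SUCCEEDED", "FAILED"]},
}


def batch_outcome(statuses):
    """Reduce the job statuses of one submission to a single outcome label."""
    statuses = list(statuses)
    # Any job that has not reached a terminal state means the batch is not done.
    if not statuses or any(s not in ("SUCCEEDED", "FAILED") for s in statuses):
        return "PENDING"
    if all(s == "SUCCEEDED" for s in statuses):
        return "ALL_SUCCEEDED"
    if all(s == "FAILED" for s in statuses):
        return "ALL_FAILED"
    return "MIXED"
```

A notifier would send a message only for `ALL_SUCCEEDED` or `ALL_FAILED` and leave any follow-up action to the administrator.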

S3 Bucket Policy Concurrency Issue

Context:
Uploading manifest.txt files in a batch can cause a concurrency issue when retrieving and updating the bucket policy. Two lambda events triggered at approximately the same time can retrieve the same bucket policy (instead of one waiting for the other to finish), so only one extra policy statement is added instead of two.

The same issue can occur when transferring data from one bucket to another: the data transfer manager lambda needs to unlock a specific submission by removing its policy statement, which is also a read-then-update of the bucket policy.
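
The lost update can be reproduced without AWS. The sketch below uses an in-memory stand-in for the bucket (`FakeBucket` and `add_lock` are illustrative names) and replays the problematic ordering, in which both invocations read the policy before either one writes it back.

```python
class FakeBucket:
    """In-memory stand-in for a bucket that holds one policy document."""

    def __init__(self):
        self.policy = {"Statement": []}

    def get_policy(self):
        # Return a copy: each lambda works on its own deserialised document.
        return {"Statement": list(self.policy["Statement"])}

    def put_policy(self, policy):
        self.policy = policy


def add_lock(policy, submission):
    """Append a lock statement for one submission to a policy document."""
    policy["Statement"].append({"Sid": submission})
    return policy


bucket = FakeBucket()

# Both invocations read before either one writes.
policy_a = bucket.get_policy()
policy_b = bucket.get_policy()

bucket.put_policy(add_lock(policy_a, "submissionA"))
bucket.put_policy(add_lock(policy_b, "submissionB"))  # overwrites A's write

locked = [s["Sid"] for s in bucket.policy["Statement"]]
print(locked)  # ['submissionB']
```

Only the second lock survives, which matches the behaviour described above: one extra statement instead of two.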

Update [24/02/2022]
The bucket policy now whitelists some access for modifying the staging bucket even when it is locked. Please refer to the condition statement section of the folder_lock lambda. Whitelisted access is granted to:

  • EC2 instance batch profile
  • Lambda Cleanup (to remove uncompressed and indexed files)

Temporary solution:
Leave at least 2 seconds between lambda invocations / manifest.txt submissions.

Fixed solution:
-
