
amplify's Introduction

🔊 Amplify

Amplify attaches afterburners to your data. Amplify:

  • explain: metadata extraction, classification, tagging, and reporting
  • enrich: derivative data generation like thumbnails, previews, conversions, etc.
  • enhance: batteries-included value-adds like data quality reports, image augmentation, OCR, translations, etc.

Amplify leverages the decentralized compute provided by Bacalhau to magically enrich your data. A built-in suite of pipelines decides what your data is and how to best improve upon it. You can also self-host Amplify to trigger off your offline data sources and implement your own custom pipelines.


Documentation

Project Status

We decided to bump Amplify to v1 for the Compute over Data Summit 2023 in Boston to signify the following.

We have been running Amplify in production since the beginning of the project, so we believe it is stable enough for developer use. However, development was rapid, so there are edge cases and test coverage is low.

This project was time-boxed, and we are no longer actively developing it. It has not yet been decided whether to continue development or enter maintenance mode. If you are interested in further development, please contact the Bacalhau team on Slack.

amplify's People

Contributors

aronchick, enricorotundo, philwinder


amplify's Issues

Backoff when queue is full

There's currently about a 3-hour delay in starting jobs because the queue is so backed up. We need to start backing off to allow it to catch up.
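One way to implement the backoff described above is exponential backoff with jitter on submission. This is a minimal sketch, not Amplify's actual code: `submit` and `QueueFullError` are hypothetical stand-ins for whatever the queue's submit path and "queue full" signal look like.

```python
import random
import time


class QueueFullError(Exception):
    """Hypothetical signal raised when the queue cannot accept more work."""


def submit_with_backoff(submit, item, max_retries=8, base_delay=1.0, cap=300.0):
    """Try to submit an item, sleeping with exponential backoff (full jitter)
    each time the queue rejects it as full."""
    for attempt in range(max_retries):
        try:
            return submit(item)
        except QueueFullError:
            # Delay grows as base_delay * 2^attempt, capped, with random jitter
            # so many blocked submitters don't retry in lockstep.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError("queue still full after %d retries" % max_retries)
```

The jitter matters here: with a backed-up queue and many waiting jobs, un-jittered retries all land at the same instants and keep the queue saturated.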

Amplify Metrics and Monitoring

  • Add metrics and traces to Amplify for QoS purposes (otel)
  • Add QoS dashboard + alerting
  • Add a (generic) ability to extract information from jobs and push it to the Bacalhau dashboard (e.g. types of data)
  • Add metrics to Bacalhau dashboard to link Bacalhau jobs to content types (and more)

Workload: CSV Summary Statistics + plotting

Interesting one here: Parquet doesn't have a registered MIME type yet. I wonder if Tika can parse it?

Metadata | predicate [text/csv | application/parquet] -> Load and produce summaries of data -> Merge

Ideas:


From issue #26

Dag Improvements

  • Add output information to API
  • Node ID should be job ID
  • Add status information to API
  • Change API to allow for proper dags
  • Re-add job api
  • Swap-out workflows API for nodes API

Amplify Docs

  • Document thoroughly what predicate is and what it is for
  • Add usage docs to docs.bacalhau.org

Add API to run a CID over all workflows

  • Add API
  • Alter task code so that when no workflow is named it runs that CID over all workflows
    * [ ] Cache/dedup workflows with the same jobs. Will be addressed in #27
  • Produce single derivative (optional flag)

Amplify QoS

  • Rate limit the queue
  • Rate limit the clients based upon the queue
  • Use a prioritised queue so manual submitters can bump their job up the queue
  • Record queue performance metrics
  • Report predicted delay in API
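The rate-limiting items above could be sketched with a token bucket, which caps the sustained submission rate while allowing short bursts. This is an illustrative sketch only; the class name and parameters are hypothetical, not Amplify's API.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allow at most `rate` submissions per
    second, with bursts of up to `capacity` at once."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum stored tokens (burst size)
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A per-client bucket would cover "rate limit the clients based upon the queue": shrink each client's `rate` as queue depth grows.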

In the image workflow, weird ~ on outputs.

Not sure why.

Repro:

go run . run QmbRr4kUXMxQfZPnLUSb1kSMDvFtBUcv1HVSDaAUKe4ePj
b get 0b7f6e1a-ef94-441d-9679-ef23904504d0

❯ ll job-0b7f6e1a/default
total 288
-rw-r--r--@ 1 enricorotundo  staff    11K Apr 19 13:10 image1.jpg
-rw-r--r--@ 1 enricorotundo  staff    11K Apr 19 13:10 image1.jpg~
-rw-r--r--@ 1 enricorotundo  staff   4.5K Apr 19 13:10 image2.jpg
-rw-r--r--@ 1 enricorotundo  staff   4.5K Apr 19 13:10 image2.jpg~
-rw-r--r--@ 1 enricorotundo  staff   6.2K Apr 19 13:10 image3.jpg
-rw-r--r--@ 1 enricorotundo  staff   6.2K Apr 19 13:10 image3.jpg~
-rw-r--r--@ 1 enricorotundo  staff   5.0K Apr 19 13:10 image4.jpg
-rw-r--r--@ 1 enricorotundo  staff   5.1K Apr 19 13:10 image4.jpg~
-rw-r--r--@ 1 enricorotundo  staff   9.2K Apr 19 13:10 image4sa.jpg
....

??

Lots and lots of workflows for CoDSummit

Brain dump

  • IPFS data statistics workflow
  • Format conversion, like csv to parquet, or csv to json
  • Validation
  • Descriptions

Use Cases

IPFS Data Statistics

Important to do this first, because there's no point working on ${DataType} use cases if there are no files of that ${DataType} in IPFS.

Metadata -> Push metadata info to an external database (security concern; whitelist?)

Web-Focussed Image Compression

Metadata | predicate image/* -> Image Processing -> Merge

Web-Focussed Video Compression

Metadata | predicate video/* -> Video Processing -> Merge

Transcription of Video and Audio

Metadata | predicate [video|audio]/* -> Transcription model(s) -> Merge

CSV/Parquet Summary Statistics

Interesting one here: Parquet doesn't have a registered MIME type yet. I wonder if Tika can parse it?

Metadata | predicate [text/csv | application/parquet] -> Load and produce summaries of data -> Merge
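The "load and produce summaries" stage could look roughly like the sketch below for CSV input (Parquet would need an extra library such as pyarrow, so this sticks to the stdlib). It is an illustration of the idea, not Amplify's implementation.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Compute min/max/mean for each numeric column of a CSV string;
    non-numeric columns are skipped."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for col in rows[0].keys():
        try:
            values = [float(r[col]) for r in rows]
        except ValueError:
            continue  # column contains non-numeric data
        summary[col] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        }
    return summary
```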

${DataType} Enrichment

For example, given a CSV file whose columns include a lat and a long, run a job that converts lat/long to country/city and creates an output CSV with the same row format, then merge it back together with the original data.

Metadata | predicate ${DataType} -> Parse columnar data type | predicate ${ColumnDataType} -> Data Enrichment -> Merge
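The "Data Enrichment" step in the lat/long example could be sketched as below. The `lookup` callable is a hypothetical stand-in for a real reverse-geocoding service; the key property shown is that the original columns and row order are preserved so the Merge stage can line rows back up.

```python
import csv
import io


def enrich_with_country(text, lookup):
    """Append a `country` column to CSV rows that have `lat`/`lon` columns,
    preserving the original columns and row order.

    `lookup` is a hypothetical reverse-geocoding callable:
    (lat, lon) -> country name.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["country"])
    writer.writeheader()
    for row in reader:
        row["country"] = lookup(float(row["lat"]), float(row["lon"]))
        writer.writerow(row)
    return out.getvalue()
```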

Image Dataset Analysis

https://cleanvision.readthedocs.io/

Metadata | predicate image/* -> Image analysis -> Merge

Video-Resize job produces videos of the same size

For this CID QmTEvry1uo8qoqBMCSdHobsS7RVXr2M4JZRZXTUKVpoMdp <- a blob video

This execution: bacalhau describe 0cfd3408-7f59-47f9-91e2-739be937affa

This output:

❯ ls -lah job-0cfd3408/default
total 5.2M
drwxr-xr-x 8 phil  256 Apr 20 09:32 .
drwxr-xr-x 7 phil  224 Apr 20 09:32 ..
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_1080_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_144_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_240_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_360_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_480_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_720_video.mp4

Note how they're all the same size.

Convert tabular data to CSV

I note that frictionless also does conversion. That could be cool: convert all structured data formats into other structured data formats (e.g. csv -> xls).

Question: does this command work with any structured type? It looks like you're just testing with CSV.

Yeah, I think that's worth doing. Could you test adding the https://framework.frictionlessdata.io/docs/console/convert.html command to your run script, please? It looks like you have to install specific packages for other data types.

Can you add more functionality?

Implement deployment architecture

  • Test current release scripts. They should work (copied from bac) but haven't been tested yet.
  • Ideally want to do CD
  • Figure out where to host the infra (does it run on a separate node?)
  • Write deployment scripts
  • Hook up GH CI to auto-deploy on release. Yolo, can't be doing with manual deploys.

More Amplify Triggers

  • Filecoin deal trigger
  • IPFS DHT trigger
  • IPFS PubSub trigger
  • HTTP watch trigger
  • IPFS stream trigger

Job/Workflow improvements from review

Phil

  • Clearly define job input/output interface
  • Change so that all jobs are SingleJob (and refactor)
    • Remove MapJob
    • Refactor
  • Remove/Simplify composite
    • Remove children
    • Refactor
  • Workers should operate at the job level, i.e. we want to rate limit the number of concurrent Bacalhau jobs.

(Attached diagram: amplify-Page-2.drawio)

Containerize job runner script to facilitate preserving the input directory structure at the output

Goal: Preserve the input directory structure at the output (e.g. consider /foo/bar.jpg and /hey/bar.jpg)

Implementation:

  • Python script to shell out to the job's command
    • the job's stdout/stderr and exit code should be returned/printed correctly
    • handle the case when the input file's extension is not available (some processors, e.g. ffmpeg, require one)
    • replace /input/<path> with /outputs/<path>
    • the script can take one parameter (e.g. resolution for the image resize job); it will be used to create "levels" in the output dir (e.g. with params 720p,1080p the output dir structure would be /outputs/720p/<path> and /outputs/1080p/<path>)
    • no external dependencies other than python3
  • write tests for this Python utility
  • containerize the script above
  • convert the existing jobs to make use of the container above
  • publish the container to ghcr.io in this repo
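The core of the wrapper could be sketched roughly as follows. This is a simplified illustration under assumptions not fixed by the issue: the job command arrives as argv with hypothetical `{input}`/`{output}` placeholders, inputs live under /inputs, and the parameter "levels" feature is omitted.

```python
#!/usr/bin/env python3
"""Sketch of a job-runner wrapper that mirrors input paths to the output dir."""
import os
import subprocess
import sys


def run_job(cmd_template, input_root="/inputs", output_root="/outputs"):
    """Run `cmd_template` once per input file, writing each result to the
    matching path under `output_root`. Returns the worst exit code seen."""
    exit_code = 0
    for dirpath, _, filenames in os.walk(input_root):
        for name in filenames:
            if name.startswith("."):  # skip hidden files like .DS_Store
                continue
            src = os.path.join(dirpath, name)
            if os.path.getsize(src) == 0:  # skip empty files
                continue
            # Mirror the input path under the output root, e.g.
            # /inputs/foo/bar.jpg -> /outputs/foo/bar.jpg
            rel = os.path.relpath(src, input_root)
            dst = os.path.join(output_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            cmd = [part.format(input=src, output=dst) for part in cmd_template]
            proc = subprocess.run(cmd, capture_output=True, text=True)
            # Forward the job's stdout/stderr and remember the worst exit code.
            sys.stdout.write(proc.stdout)
            sys.stderr.write(proc.stderr)
            exit_code = max(exit_code, proc.returncode)
    return exit_code
```

Using `os.path` for all path manipulation (rather than string replacement) is what makes this approach less fragile than the Bash version discussed below.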

Why not Bash:

  • Escaping hell
  • Many cases to handle: single-file CID vs directory CID, files with or without extensions
  • Unsupported files (e.g. .DS_Store) must be handled
  • Should check for empty files
  • Handling uppercase vs lowercase extensions adds more complexity
  • Not easily testable
  • Manipulating paths as strings is too fragile

Persistence

  • Add a DB to production
  • Develop code to persist queue information
  • Develop code to persist dag information
  • Refactor dag if necessary
  • Pagination in DB queries
  • Pagination in API

Feedback after review -- small enhancements

  • Add log message when skipping jobs
  • Bacalhau describe xxx log messages should include job info
    * [ ] Reinstate the amplify run job command (possibly?). Not urgent.
  • Change the default root directory name so that it's not confused with the root user
  • Add the ability to assume the "default" output when specifying inputs/outputs
  • Remove the need for specifying a path in the input. If not required, then don't mount it.
    * [ ] Amplify create graph command (or equivalent) to plot the DAG. Deferred to UI work; not urgent.
  • Show that image on the graph page in the server.
  • Print full server address so people can click on it
  • Add form to submit job

Priority queue

It's bad for demos/usability to get put at the back of the queue when submitting manually. Manual submissions should move straight to the front.
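A prioritised queue for this could be built on the stdlib heap, with manual submissions given a lower priority number so they pop first. This is a sketch with hypothetical names, not Amplify's queue implementation.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Jobs pop in priority order (lower number first); ties preserve
    insertion order via a monotonically increasing counter."""

    MANUAL, AUTOMATIC = 0, 1  # manual submitters jump the queue

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, job, priority):
        # The counter breaks ties so heapq never compares job objects.
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```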

[amplify] SIGINT doesn't release the cli

❯ go run . run QmYUteNWk2rMDPmXHnFxhvZZYNuyV2daYJ72N5sZqs9rbC
INF Running job jobID=root-job
INF Running job jobID=metadata-job
INF bacalhau describe 986e41d3-2da8-4a43-822e-9077556a369b
^CINF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
WRN Error executing job error="getting Bacalhau job info: publicapi: after posting request: Post \"http://bootstrap.production.bacalhau.org:1234/requester/list\": context canceled"
INF Worker received quit command.
INF Running job jobID=merge-job
INF Running job jobID=tree-job



^C^C


^C
^C

Conditional workflows

  • Add conditional workflow
    • Need some way of passing the previous job's stdout along to the next job so downstream jobs can use previous information
  • #31
