
amplify's Introduction

🔊 Amplify

Amplify attaches afterburners to your data. Amplify:

  • explain: metadata extraction, classification, tagging, and reporting
  • enrich: derivative data generation like thumbnails, previews, conversions, etc.
  • enhance: batteries-included value-adds like data quality reports, image augmentation, OCR, translations, etc.

Amplify leverages the decentralized compute provided by Bacalhau to magically enrich your data. A built-in suite of pipelines decides what your data is and how to best improve upon it. You can also self-host Amplify to trigger off your offline data sources and implement your own custom pipelines.


Documentation

Project Status

We decided to bump Amplify to v1 for the Compute over Data Summit 2023 in Boston to signify the following.

We have been running Amplify in production since the beginning of the project, so we believe it is stable enough for developer use. However, development was rapid, so there are edge cases and test coverage is low.

This project was time-boxed, and we are no longer actively developing it. It has not yet been decided whether to continue development or enter maintenance mode. If you are interested in further development, please contact the Bacalhau team on Slack.

amplify's People

Contributors

aronchick, enricorotundo, philwinder


amplify's Issues

Backoff when queue is full

There's currently about a 3-hour delay in starting jobs because the queue is so backed up. We need to start backing off to allow it to catch up.
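One way to implement the backoff described above is exponential backoff with jitter on submission. This is a minimal sketch, not Amplify's actual code: `submit` and `QueueFullError` are hypothetical stand-ins for whatever the queue's submit path and "queue full" signal look like.

```python
import random
import time


class QueueFullError(Exception):
    """Hypothetical signal raised when the queue cannot accept more work."""


def submit_with_backoff(submit, item, max_retries=8, base_delay=1.0, cap=300.0):
    """Try to submit an item, sleeping with exponential backoff (full jitter)
    each time the queue rejects it as full."""
    for attempt in range(max_retries):
        try:
            return submit(item)
        except QueueFullError:
            # Delay grows as base_delay * 2^attempt, capped, with random jitter
            # so many blocked submitters don't retry in lockstep.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError("queue still full after %d retries" % max_retries)
```

The jitter matters here: with a backed-up queue and many waiting jobs, un-jittered retries all land at the same instants and keep the queue saturated.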

Amplify Metrics and Monitoring

  • Add metrics and traces to Amplify for QoS purposes (otel)
  • Add QoS dashboard + alerting
  • Add a (generic) ability to extract information from jobs and push it to the Bacalhau dashboard (e.g. types of data)
  • Add metrics to Bacalhau dashboard to link Bacalhau jobs to content types (and more)

Workload: CSV Summary Statistics + plotting

Interesting one here: Parquet doesn't have a registered MIME type yet. I wonder if Tika can parse it?

Metadata | predicate [text/csv | application/parquet] -> Load and produce summaries of data -> Merge

Ideas:


From issue #26

Dag Improvements

  • Add output information to API
  • Node ID should be job ID
  • Add status information to API
  • Change API to allow for proper dags
  • Re-add job api
  • Swap-out workflows API for nodes API

Amplify Docs

  • Document thoroughly what predicate is and what it is for
  • Add usage docs to docs.bacalhau.org

Add API to run a CID over all workflows

  • Add API
  • Alter task code so that when no workflow is named it runs that CID over all workflows
    * [ ] Cache/dedup workflows with the same jobs. Will be addressed in #27
  • Produce single derivative (optional flag)

Amplify QoS

  • Rate limit the queue
  • Rate limit the clients based upon the queue
  • Use a prioritised queue so manual submitters can bump their job up the queue
  • Record queue performance metrics
  • Report predicted delay in API
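The rate-limiting items above could be sketched with a token bucket, which caps the sustained submission rate while allowing short bursts. This is an illustrative sketch only; the class name and parameters are hypothetical, not Amplify's API.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allow at most `rate` submissions per
    second, with bursts of up to `capacity` at once."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum stored tokens (burst size)
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A per-client bucket would cover "rate limit the clients based upon the queue": shrink each client's `rate` as queue depth grows.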

In the image workflow, weird ~ on outputs.

Not sure why.

Repro:

go run . run QmbRr4kUXMxQfZPnLUSb1kSMDvFtBUcv1HVSDaAUKe4ePj
b get 0b7f6e1a-ef94-441d-9679-ef23904504d0

❯ ll job-0b7f6e1a/default
total 288
-rw-r--r--@ 1 enricorotundo  staff    11K Apr 19 13:10 image1.jpg
-rw-r--r--@ 1 enricorotundo  staff    11K Apr 19 13:10 image1.jpg~
-rw-r--r--@ 1 enricorotundo  staff   4.5K Apr 19 13:10 image2.jpg
-rw-r--r--@ 1 enricorotundo  staff   4.5K Apr 19 13:10 image2.jpg~
-rw-r--r--@ 1 enricorotundo  staff   6.2K Apr 19 13:10 image3.jpg
-rw-r--r--@ 1 enricorotundo  staff   6.2K Apr 19 13:10 image3.jpg~
-rw-r--r--@ 1 enricorotundo  staff   5.0K Apr 19 13:10 image4.jpg
-rw-r--r--@ 1 enricorotundo  staff   5.1K Apr 19 13:10 image4.jpg~
-rw-r--r--@ 1 enricorotundo  staff   9.2K Apr 19 13:10 image4sa.jpg
....

??

Lots and lots of workflows for CoDSummit

Brain dump

  • IPFS data statistics workflow
  • Format conversion, like csv to parquet, or csv to json
  • Validation
  • Descriptions

Use Cases

IPFS Data Statistics

Important to do this first, because there's no point working on ${DataType} use cases if there are no files of that ${DataType} in IPFS.

Metadata -> Push metadata info to an external database (security concern; whitelist?)

Web-Focussed Image Compression

Metadata | predicate image/* -> Image Processing -> Merge

Web-Focussed Video Compression

Metadata | predicate video/* -> Video Processing -> Merge

Transcription of Video and Audio

Metadata | predicate [video|audio]/* -> Transcription model(s) -> Merge

CSV/Parquet Summary Statistics

Interesting one here: Parquet doesn't have a registered MIME type yet. I wonder if Tika can parse it?

Metadata | predicate [text/csv | application/parquet] -> Load and produce summaries of data -> Merge
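The "load and produce summaries" stage could look roughly like the sketch below for CSV input (Parquet would need an extra library such as pyarrow, so this sticks to the stdlib). It is an illustration of the idea, not Amplify's implementation.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Compute min/max/mean for each numeric column of a CSV string;
    non-numeric columns are skipped."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for col in rows[0].keys():
        try:
            values = [float(r[col]) for r in rows]
        except ValueError:
            continue  # column contains non-numeric data
        summary[col] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        }
    return summary
```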

${DataType} Enrichment

For example, given a CSV file whose columns include a lat and a long, run a job that converts lat/long to country/city and creates an output CSV with the same row format, then merge it back together with the original data.

Metadata | predicate ${DataType} -> Parse columnar data type | predicate ${ColumnDataType} -> Data Enrichment -> Merge
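The "Data Enrichment" step in the lat/long example could be sketched as below. The `lookup` callable is a hypothetical stand-in for a real reverse-geocoding service; the key property shown is that the original columns and row order are preserved so the Merge stage can line rows back up.

```python
import csv
import io


def enrich_with_country(text, lookup):
    """Append a `country` column to CSV rows that have `lat`/`lon` columns,
    preserving the original columns and row order.

    `lookup` is a hypothetical reverse-geocoding callable:
    (lat, lon) -> country name.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["country"])
    writer.writeheader()
    for row in reader:
        row["country"] = lookup(float(row["lat"]), float(row["lon"]))
        writer.writerow(row)
    return out.getvalue()
```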

Image Dataset Analysis

https://cleanvision.readthedocs.io/

Metadata | predicate image/* -> Image analysis -> Merge

Video-Resize job produces videos of the same size

For this CID QmTEvry1uo8qoqBMCSdHobsS7RVXr2M4JZRZXTUKVpoMdp <- a blob video

This execution: bacalhau describe 0cfd3408-7f59-47f9-91e2-739be937affa

This output:

❯ ls -lah job-0cfd3408/default
total 5.2M
drwxr-xr-x 8 phil  256 Apr 20 09:32 .
drwxr-xr-x 7 phil  224 Apr 20 09:32 ..
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_1080_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_144_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_240_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_360_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_480_video.mp4
-rw-r--r-- 1 phil 877K Apr 20 09:32 scaled_720_video.mp4

Note how they're all the same size.

Convert tabular data to CSV

I note that frictionless also does conversion. That could be cool: convert all structured data formats into other structured data formats (e.g. csv -> xls).

Question: does this command work with any structured type? It looks like you're just testing with CSV.

Yeah, I think that's worth doing. Could you test adding the https://framework.frictionlessdata.io/docs/console/convert.html command to your run script, please? It looks like you have to install specific packages for other data types.

Can you add more functionality?

Implement deployment architecture

  • Test current release scripts. They should work (copied from bac) but haven't been tested yet.
  • Ideally want to do CD
  • Figure out where to host the infra (does it run on a separate node?)
  • Write deployment scripts
  • Hook up GH CI to auto-deploy on release. Yolo, can't be doing with manual deploys.

More Amplify Triggers

  • Filecoin deal trigger
  • IPFS DHT trigger
  • IPFS PubSub trigger
  • HTTP watch trigger
  • IPFS stream trigger

Job/Workflow improvements from review

Phil

  • Clearly define job input/output interface
  • Change so that all jobs are SingleJob (and refactor)
    • Remove MapJob
    • Refactor
  • Remove/Simplify composite
    • Remove children
    • Refactor
  • Workers should operate at the job level, i.e. we want to rate limit the number of concurrent Bacalhau jobs.

(Attached diagram: amplify-Page-2.drawio)

Containerize job runner script to facilitate preserving the input directory structure at the output

Goal: Preserve the input directory structure at the output (e.g. consider /foo/bar.jpg and /hey/bar.jpg)

Implementation:

  • Python script to shell out to the job's command
    • the job's stdout/stderr and exit code should be returned/printed correctly
    • handle the case when the input file's extension is not available (some processors, e.g. ffmpeg, require one)
    • replace /input/<path> with /outputs/<path>
    • the script can take one parameter (e.g. resolution for the image resize job); it will be used to create "levels" in the output dir (e.g. with params 720p,1080p the output dir structure would be /outputs/720p/<path> and /outputs/1080p/<path>)
    • no external dependencies other than python3
  • write tests for this Python utility
  • containerize the script above
  • convert the existing jobs to make use of the container above
  • publish the container to ghcr.io in this repo
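The core of the wrapper could be sketched roughly as follows. This is a simplified illustration under assumptions not fixed by the issue: the job command arrives as argv with hypothetical `{input}`/`{output}` placeholders, inputs live under /inputs, and the parameter "levels" feature is omitted.

```python
#!/usr/bin/env python3
"""Sketch of a job-runner wrapper that mirrors input paths to the output dir."""
import os
import subprocess
import sys


def run_job(cmd_template, input_root="/inputs", output_root="/outputs"):
    """Run `cmd_template` once per input file, writing each result to the
    matching path under `output_root`. Returns the worst exit code seen."""
    exit_code = 0
    for dirpath, _, filenames in os.walk(input_root):
        for name in filenames:
            if name.startswith("."):  # skip hidden files like .DS_Store
                continue
            src = os.path.join(dirpath, name)
            if os.path.getsize(src) == 0:  # skip empty files
                continue
            # Mirror the input path under the output root, e.g.
            # /inputs/foo/bar.jpg -> /outputs/foo/bar.jpg
            rel = os.path.relpath(src, input_root)
            dst = os.path.join(output_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            cmd = [part.format(input=src, output=dst) for part in cmd_template]
            proc = subprocess.run(cmd, capture_output=True, text=True)
            # Forward the job's stdout/stderr and remember the worst exit code.
            sys.stdout.write(proc.stdout)
            sys.stderr.write(proc.stderr)
            exit_code = max(exit_code, proc.returncode)
    return exit_code
```

Using `os.path` for all path manipulation (rather than string replacement) is what makes this approach less fragile than the Bash version discussed below.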

Why not Bash:

  • Escaping hell
  • Many cases to handle: single-file CID vs directory CID, files with or without extensions
  • Unsupported files (e.g. .DS_Store) must be handled
  • Should check for empty files
  • Handling uppercase vs lowercase extensions adds more complexity
  • Not easily testable
  • Manipulating paths as strings is too fragile

Persistence

  • Add a DB to production
  • Develop code to persist queue information
  • Develop code to persist dag information
  • Refactor dag if necessary
  • Pagination in DB queries
  • Pagination in API

Feedback after review -- small enhancements

  • Add log message when skipping jobs
  • Bacalhau describe xxx log messages should include job info
    * [ ] Reinstate the amplify run job command (possibly?). Not urgent.
  • Change the default root directory name so that it's not confused with the root user
  • Add the ability to assume the "default" output when specifying inputs/outputs
  • Remove the need for specifying a path in the input. If not required, then don't mount it.
    * [ ] Amplify create graph command (or equivalent) to plot the DAG. Deferred to UI work; not urgent.
  • Show that image on the graph page in the server.
  • Print full server address so people can click on it
  • Add form to submit job

Priority queue

It's bad for demos/usability to get put at the back of the queue when submitting manually. Manual submissions should move straight to the front.
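A prioritised queue for this could be built on the stdlib heap, with manual submissions given a lower priority number so they pop first. This is a sketch with hypothetical names, not Amplify's queue implementation.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Jobs pop in priority order (lower number first); ties preserve
    insertion order via a monotonically increasing counter."""

    MANUAL, AUTOMATIC = 0, 1  # manual submitters jump the queue

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, job, priority):
        # The counter breaks ties so heapq never compares job objects.
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```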

[amplify] SIGINT doesn't release the cli

❯ go run . run QmYUteNWk2rMDPmXHnFxhvZZYNuyV2daYJ72N5sZqs9rbC
INF Running job jobID=root-job
INF Running job jobID=metadata-job
INF bacalhau describe 986e41d3-2da8-4a43-822e-9077556a369b
^CINF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
INF Worker received quit command.
WRN Error executing job error="getting Bacalhau job info: publicapi: after posting request: Post \"http://bootstrap.production.bacalhau.org:1234/requester/list\": context canceled"
INF Worker received quit command.
INF Running job jobID=merge-job
INF Running job jobID=tree-job



^C^C


^C
^C

Conditional workflows

  • Add conditional workflow
    • Need some way of passing the previous job's stdout along to the next job so downstream jobs can use previous information
  • #31
