derivative_rodeo's Introduction

DerivativeRodeo

“This ain’t my first rodeo.” (an American slang idiom meaning “I’m prepared for what comes next.”)

The DerivativeRodeo "moves" files from one storage location (e.g. input) to one or more storage locations (e.g. output) via a generator.

Process Life Cycle

In the case of an input storage location (e.g. input_location), we expect that the underlying file pointed at by the input storage location exists. After all, we can't move what we don't have.

In the case of an output storage location (e.g. output_location), we expect that the underlying file will exist after the generator has completed. The file at the output storage location could already exist, or we might need to generate it.

There is also the concept of the pre_processed storage location: when the pre_processed storage location exists for the given input, copy that pre_processed file to the output location and skip running the derivative generator on the input storage location. In other words, if we've already done the derivation elsewhere, use that.
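
The three cases above (output already exists, a pre_processed copy exists, or we must generate) can be sketched as follows. This is a minimal illustration and not the gem's implementation: the `ensure_output` method, its keyword arguments, and the Hash standing in for real storage are all hypothetical.

```ruby
# Hypothetical sketch of the life cycle decision. `storage` is a Hash of
# uri => content that stands in for real storage locations.
def ensure_output(storage, input_uri:, output_uri:, pre_processed_uri:)
  # Case 1: the output is already there, nothing to do.
  return :already_exists if storage.key?(output_uri)

  # Case 2: the derivation was done elsewhere, so copy it.
  if storage.key?(pre_processed_uri)
    storage[output_uri] = storage[pre_processed_uri]
    return :copied_from_pre_processed
  end

  # Case 3: run the "generator" (the block) against the input.
  storage[output_uri] = yield(storage.fetch(input_uri))
  :generated
end

storage = {
  "file:///in/page.jpg" => "image bytes",
  "s3://pre-processed/page.hocr" => "<hocr from elsewhere>"
}

result = ensure_output(
  storage,
  input_uri: "file:///in/page.jpg",
  output_uri: "file:///out/page.hocr",
  pre_processed_uri: "s3://pre-processed/page.hocr"
) { |input| "<hocr generated from #{input}>" }

puts result
```

Because the pre_processed entry exists in this example, the block (the stand-in generator) is never called.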

During the generator's process, we need to have a working copy of both the input and output file. This is done by creating a temporary file.

In the case of the input, the creation of that temporary file involves getting the file from the input storage location. In the case of the output, we create a temporary file that the output storage location then knows how to move to the resulting place.

Storage Lifecycle

The above Storage Lifecycle diagram is as follows: input location to input tmp file to generator to output tmp file to output location.

Note: We've designed and implemented the data life cycle to automatically clean up the temporary files as the generator completes. In this way we can use the smallest working space possible, a design decision that helps run DerivativeRodeo within distributed clusters (e.g. AWS Serverless).
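
As a rough illustration of that clean-up behavior, the sketch below uses Ruby's standard `Tempfile` to hold working copies and removes them when the work is done. The `with_working_copies` helper is a hypothetical name for this example; the gem's own storage locations handle this differently.

```ruby
require "tempfile"

# Hypothetical sketch: yields paths to working copies of the input and the
# output, then removes both temporary files once the block finishes, even
# if the block raises.
def with_working_copies(input_bytes)
  input_tmp = Tempfile.new("input")
  output_tmp = Tempfile.new("output")
  input_tmp.binmode
  input_tmp.write(input_bytes)
  input_tmp.flush
  yield input_tmp.path, output_tmp.path
  # Read the result before the ensure clause deletes the files.
  File.binread(output_tmp.path)
ensure
  [input_tmp, output_tmp].compact.each(&:close!)
end

# A stand-in "generator" that upcases the input.
result = with_working_copies("hello") do |input_path, output_path|
  File.write(output_path, File.read(input_path).upcase)
end

puts result
```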

Concepts

Overview

The PlantUML Text for the Overview Diagram
@startuml
!theme amiga

cloud "Source 1" as S1
cloud "Source 2" as S2
cloud "Source 3" as S3

storage "IMAGEs" as IMAGEs
storage "HOCRs" as HOCRs
storage "TXTs" as TXTs

control Preprocess as G1

S1 -down-> G1
S2 -down-> G1
S3 -down-> G1

G1 -down-> IMAGEs
G1 -down-> HOCRs
G1 -down-> TXTs

control Import as I1

IMAGEs -down-> I1
HOCRs -down-> I1
TXTs -down-> I1

package FileSet as FileSet1 {
	file Image1
	file Hocr1
	file Txt1
}
package FileSet as FileSet2 {
	file Image2
	file Hocr2
	file Txt2
}

I1 -down-> FileSet1
I1 -down-> FileSet2

@enduml

Common Storage

In this case, common storage could mean the storage where we're writing all pre-processing of files. Or it could mean the storage where we're writing for application access (e.g. Fedora Commons for a Hyrax application).

In other words, the DerivativeRodeo is part of moving files from one location to another, and ensuring that at each step we have all of the expected files we want.

Related Files

This is not strictly related to Hyrax's FileSet, that is, a set of files in which one is considered the original and all others are derivatives of the original.

However, it is helpful to think in those terms: files that have a significant relation to each other, one derived from the other. For example, an original PDF and its extracted text would be two significantly related files.

Sequence Diagram

The PlantUML Text for the Sequence Diagram
@startuml
!theme amiga

actor Instigator
database S3
control AWS
queue SQS
control SpaceStone
control DerivativeRodeo
collections From
collections To
Instigator -> S3 : "Upload bucket\nof files associated\n with FileSet"
S3 -> AWS : "AWS enqueues\nthe bucket"
AWS -> SQS : "AWS adds to SQS"
SQS -> SpaceStone : "SQS invokes\nSpaceStone method"
SpaceStone -> DerivativeRodeo : "SpaceStone calls\n DerivativeRodeo"
DerivativeRodeo --> S3 : "Request file for\ntemporary processing"
S3 --> From : "Write requested\n file to\ntemporary storage"
DerivativeRodeo <-- From
DerivativeRodeo -> To : "Generate derivative\n writing to local\n processing storage."
To --> S3 : "Write file\n to S3 Bucket"
DerivativeRodeo <-- To : "Return to DerivativeRodeo\n with generated URIs"
SpaceStone <- DerivativeRodeo : "Return generated\n URIs"
SpaceStone -> SQS : "Optionally enqueue\nfurther work"
@enduml

Given a single original file in a previous home, we are copying that original file (and derivatives) to various locations:

  • From previous home to S3.
  • From S3 to local temporary storage (for processing).
  • Create a derivative temporary file based on the existing file.
  • Copy the derivative temporary file to S3.

Installation

Add this line to your application's Gemfile:

gem 'derivative-rodeo'

For historical reasons the gem name is derivative-rodeo even though the repository is derivative_rodeo. Any of the following require statements will work:

  • require 'derivative_rodeo'
  • require 'derivative-rodeo'
  • require 'derivative/rodeo'

And then execute: $ bundle install

Be aware that you need the pdfinfo command line tool installed to run this gem's specs or to use its PDF functionality.

Usage

TODO

Technical Overview of the DerivativeRodeo

Generators

Generators are responsible for ensuring that we have the file associated with the generator. For example, the HocrGenerator is responsible for ensuring that we have the .hocr file in the expected storage location.

Interface(s)

Generators must have an initializer and a build command:

  • .new(array_of_file_urls, output_location_template, preprocessed_location_template)
  • #generated_files executes the generator's actions and returns an array of files
  • #generated_uris executes the generator's actions and returns an array of output URIs
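
To make the shape of that interface concrete, here is a toy class that mirrors it. It is only a sketch: `EchoGenerator` is a hypothetical name, it does not inherit from any DerivativeRodeo class, it treats the output template as a plain URI prefix (an assumption), and it does no real derivation.

```ruby
# Hypothetical generator that only mirrors the interface described above.
class EchoGenerator
  def initialize(input_uris, output_location_template, preprocessed_location_template = nil)
    @input_uris = input_uris
    @output_location_template = output_location_template
    @preprocessed_location_template = preprocessed_location_template
  end

  # "Runs" the generator once and returns one record per input.
  def generated_files
    @generated_files ||= @input_uris.map do |uri|
      # Assumption: the template is used as a simple prefix.
      { input_uri: uri, output_uri: "#{@output_location_template}/#{File.basename(uri)}" }
    end
  end

  # Returns only the output URIs of the generated files.
  def generated_uris
    generated_files.map { |file| file[:output_uri] }
  end
end

generator = EchoGenerator.new(["file:///tmp/in/page1.jpg"], "file:///tmp/out")
puts generator.generated_uris
```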

Supported Generators

Below is the current list of generators.

  • HocrGenerator :: generates tesseract files from images; also creates monochrome files as a pre-step
  • MonochromeGenerator :: converts images to monochrome
  • CopyGenerator :: sends a set of URIs to another location. For example from S3 to SQS or from filesystem to S3.
  • PdfSplitGenerator :: splits a PDF into one image per page
  • WordCoordinatesGenerator :: creates a JSON file representing the words and coordinates (derived from the .hocr file).

Registered Generators

TODO: We want to expose a list of registered generators

Storage Locations

Storage locations are where we put things. Each location has a specific implementation but is expected to inherit from DerivativeRodeo::StorageLocations::BaseLocation.

The DerivativeRodeo::StorageLocations::BaseLocation.locations method tracks the registered locations.

The location represents where the file should be.

Supported Storage Locations

Storage locations follow a URI pattern:

  • file:// :: “local” file system storage
  • s3:// :: AWS’s S3 storage system
  • sqs:// :: AWS’s SQS
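
One way to picture how a URI scheme selects a location class is a registry keyed by scheme. The sketch below is an assumption-laden illustration, not the gem's code: the `SketchLocations` module, its `register` and `for` methods, and the class names are all hypothetical.

```ruby
require "uri"

# Hypothetical registry that maps a URI scheme to a location class.
module SketchLocations
  REGISTRY = {}

  class Base
    attr_reader :uri

    # Each subclass registers itself under a scheme.
    def self.register(scheme)
      REGISTRY[scheme.to_s] = self
    end

    def initialize(uri)
      @uri = uri
    end
  end

  class FileLocation < Base
    register :file
  end

  class S3Location < Base
    register :s3
  end

  class SqsLocation < Base
    register :sqs
  end

  # Looks up the class for the URI's scheme and instantiates it.
  def self.for(uri)
    scheme = URI.parse(uri).scheme
    klass = REGISTRY.fetch(scheme) do
      raise ArgumentError, "no location registered for #{scheme.inspect}"
    end
    klass.new(uri)
  end
end

puts SketchLocations.for("s3://my-bucket/GUID/file.jpg").class
```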

Templates

Throughout the code you'll see reference to the following concepts:

  • input_location_template
  • output_location_template
  • preprocessed_location_template

In Process Life Cycle we discussed the input_location, output_location, and preprocessed_location. The template concept provides flexibility in mapping one location to another.

Examples of mapping one file path to another are:

  • I want to copy https://hello.com/world/GUID/file.jpg to file:///tmp/GUID/file.jpg.
  • I want to transform file:///tmp/GUID/file.jpg to file:///tmp/GUID/file.hocr; that is, run OCR on an image and write a .hocr file.
  • I want to use file:///tmp/GUID/file.hocr to generate file:///tmp/GUID/file.coordinates.json; that is, convert the HOCR file to a coordinates.json file.

See DerivativeRodeo::Service::ConvertUriViaTemplateService for more details.
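
The second and third mappings above amount to keeping the path and swapping the extension. As a rough sketch of just that idea (it does not use or describe the ConvertUriViaTemplateService; `with_extension` is a hypothetical helper):

```ruby
# Hypothetical helper: replaces the extension of a URI's final path segment.
def with_extension(uri, new_extension)
  uri.sub(/\.[^.\/]+\z/, new_extension)
end

image = "file:///tmp/GUID/file.jpg"
hocr = with_extension(image, ".hocr")
coordinates = with_extension(hocr, ".coordinates.json")

puts hocr
puts coordinates
```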

Development

  • Check out the repository: git clone https://github.com/scientist-softserv/derivative_rodeo
  • Install dependencies: cd derivative_rodeo; bundle install
  • Install git hooks: rake install_hooks
  • Install binaries:
    • pdfinfo: provided by poppler (e.g. brew install poppler)
    • GhostScript (e.g. gs): run brew install gs

Then go about writing your code and documentation.

The git hooks call rake default which will:

Logging in Test Environment

Throughout the DerivativeRodeo we log some activity. In a typical test run, those logs would be overly chatty. If you want the chattier logs, run the following: DEBUG=t rspec.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative_rodeo.

derivative_rodeo's People

Contributors

jeremyf, orangewolf, kirkkwang

derivative_rodeo's Issues

Implement Template In Generators and Storage Adapter

With “Write the template function handling for URIs” (Issue scientist-softserv/adventist-dl#1 · scientist-softserv/derivative_rodeo), we introduced the idea of encoding path information in the adapter name.

  • Change the Generators::BaseGenerator and subclasses to use the template

    • base_generator.rb
    • copy_generator.rb
    • hocr_generator.rb
    • monochrome_generator.rb
    • pdf_split_generator.rb
  • Change the StorageAdapters::BaseStorage and subclasses to leverage the template

    • base_adapter.rb
    • download_adapter.rb
    • file_adapter.rb
    • s3_adapter.rb
    • sqs_adapter.rb

One of the tests for “done” is to write a test for the CopyGenerator that copies from one file adapter to another file adapter. Another test, necessary to demonstrate the feature, is to write a test confirming the PdfSplitGenerator behavior.

Related to:

Write the template function handling for URIs

Given the following initialization:

DerivativeRodeo::Generators::BaseGenerator.new(
  input_uris: ["file:///path1/A/file.pdf", "file:///path2/B/file.pdf"],
  output_uri_template: "file:///dest1/{{path_parts[-2..-1]}}",
  preprocess_uri_template: "s3://bucket_name/{{path_parts[-2..-1]}}"
)

We want the templates to evaluate to the following:

{
  output_uris: ["file:///dest1/A/file.pdf", "file:///dest1/B/file.pdf"],
  preprocess_uris: ["s3://bucket_name/A/file.pdf", "s3://bucket_name/B/file.pdf"]
}
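
A minimal sketch of how such a template could be evaluated is below. It handles only the `{{path_parts[a..b]}}` placeholder form shown in this issue; the `expand_template` method and the `PLACEHOLDER` constant are hypothetical names and are not the gem's eventual implementation.

```ruby
require "uri"

# Matches placeholders such as {{path_parts[-2..-1]}} and captures the
# two range bounds.
PLACEHOLDER = /\{\{path_parts\[(-?\d+)\.\.(-?\d+)\]\}\}/

# Hypothetical evaluator: substitutes the requested slice of the input
# URI's path segments into the template.
def expand_template(template, input_uri)
  path_parts = URI.parse(input_uri).path.split("/").reject(&:empty?)
  template.gsub(PLACEHOLDER) do
    first = Regexp.last_match(1).to_i
    last = Regexp.last_match(2).to_i
    Array(path_parts[first..last]).join("/")
  end
end

input_uris = ["file:///path1/A/file.pdf", "file:///path2/B/file.pdf"]
output_uris = input_uris.map { |uri| expand_template("file:///dest1/{{path_parts[-2..-1]}}", uri) }
preprocess_uris = input_uris.map { |uri| expand_template("s3://bucket_name/{{path_parts[-2..-1]}}", uri) }

puts output_uris
puts preprocess_uris
```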

Update README to include Discussion about the flow of files through the Rodeo

Within the context of the DerivativeRodeo we have conceptual data structures associated with the file set.

  • Chute: What files are we bringing to the rodeo?
  • Agenda: What files is the rodeo responsible for making?
  • RoundUp: What files do we actually have when all is done; what we will associate with the corresponding FileSet.

The DerivativeRodeo is responsible for making the files required by the Agenda that are not in the Chute (i.e. Agenda - Chute). The RoundUp is the union of the Agenda and the Chute (Agenda ∪ Chute).
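
The arithmetic can be illustrated with Ruby's `Set`. The file names are made up for the example, and treating the RoundUp as the union of the Agenda and the Chute is an assumption drawn from the descriptions above.

```ruby
require "set"

# Files we bring to the rodeo, and files the rodeo must ensure exist.
chute = Set["file.pdf", "file.txt"]
agenda = Set["file.txt", "file.hocr", "file.thumbnail.jpg"]

# What the rodeo actually has to make: Agenda - Chute.
to_generate = agenda - chute

# Assumption: everything we have at the end is the union of both sets.
round_up = agenda | chute

puts to_generate.to_a.sort.inspect
puts round_up.to_a.sort.inspect
```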

Related to:

🎁 Create a generator for hocr to alto.xml

This is analogous to the DerivativeRodeo::Generators::WordCoordinatesGenerator in that it takes a .hocr file and converts that to alto.xml

Discussion

There is an optimization consideration, namely that we could parse the .hocr file once and from that one parsing we could then create the 3 output files. To leverage that optimization would require considering how we'd have a generator create multiple output URIs of different mime-types.

That answer is not immediately obvious to me.

🎁 Create a generator for hocr to plain text

This is analogous to the DerivativeRodeo::Generators::WordCoordinatesGenerator in that it takes a .hocr file and converts that to plain text.

Discussion

There is an optimization consideration, namely that we could parse the .hocr file once and from that one parsing we could then create the 3 output files. To leverage that optimization would require considering how we'd have a generator create multiple output URIs of different mime-types.

That answer is not immediately obvious to me.

Import OAI feed into SpaceStone

Summary

related:

Once SpaceStone has pre-processed all Books and Issues, begin importing the OAI feed

Acceptance Criteria

  • [ ]

Testing Instructions

TBD

Notes

Below is the CSV for the single file:

oai_set,aark_id,original,text,reader,thumbnail,other_files
adl:other,20121816,https://adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.ARCHIVAL.pdf,https://adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.RAW.txt,,https://adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.TN.jpg

🎁 Thumbnail generator needs to have configuration based on mime-type

Note: we're not actually asking for the MIME type, so we might be sniffing things based on file extension.

This configuration would mean passing the dimensions as a parameter when initializing the generator; we'd likely want an application-based configuration.

Considerations:

We could have a class variable on the generator that is a hash with keys of "file types" and values of "dimensions". This would allow for both SpaceStone and IiifPrint to configure accordingly.
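
A sketch of that consideration follows. Everything here is illustrative: the `SketchThumbnailGenerator` class, the default dimension strings, and the extension-based lookup are assumptions, and the example uses a class-level accessor rather than a literal `@@` class variable.

```ruby
# Hypothetical sketch of per-file-type thumbnail dimensions.
class SketchThumbnailGenerator
  class << self
    # Hash of file type => dimensions, configurable by the application.
    attr_accessor :dimensions_by_type
  end
  self.dimensions_by_type = { "pdf" => "200x", "default" => "200x150" }

  # Sniffs the type from the file extension, as the note above suggests.
  def self.dimensions_for(filename)
    type = File.extname(filename).delete_prefix(".").downcase
    dimensions_by_type.fetch(type) { dimensions_by_type.fetch("default") }
  end
end

# An application (SpaceStone or IiifPrint, say) could override a value:
SketchThumbnailGenerator.dimensions_by_type["jpg"] = "400x300"

puts SketchThumbnailGenerator.dimensions_for("scan.JPG")
```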

NotImplementedError regarding DerivativeRodeo::StorageLocations::HttpLocation#matching_locations_in_file_dir

This was likely introduced in scientist-softserv/iiif_print/#283 as part of trying to gracefully handle missing PDFs in SpaceStone. We need a different approach.

The challenge is that SpaceStone failed to copy the PDF, so we want to use the FileSet#import_url as the remote. However, the Rodeo then assumes that the location of that PDF (e.g. HTTPS) is also the location of candidate PDFs, and the DerivativeRodeo::StorageLocations::HttpLocation does not return matching locations in the file directory (it raises NotImplementedError).

Meaning we need to consider how the application falls back.

Exception URL: https://scientist-inc.sentry.io/issues/4610010020/?project=6745020

Related to:

Derivative Rodeo Branching Logic Around HOCR

With the following PR, we break space_stone-serverless:

In particular, the assumption that the input URIs for the generators will be HOCR files:

To reconcile, we may need to further leverage the template format.

Consider that in SpaceStone we are explicitly calling the HocrGenerator. But in IiifPrint we’re not calling the HocrGenerator explicitly; perhaps it is something we could do in IiifPrint.

Ultimately, I suspect that SpaceStone calls the HocrGenerator then enqueues for performance reasons.

Whereas the IiifPrint configuration for derivatives is coming from the angle of “What derivatives should I attach to the FileSet?”; hence the configuration here. https://github.com/scientist-softserv/iiif_print/blob/d8c2ec240663512100ed9921d1efe04537032510/app/services/iiif_print/derivative_rodeo_service.rb#L57-L67

We’ll need to disentangle whether the above #74 is acceptable for a general solution and whether there is an impact on the performance of SpaceStone.

🎁 Create `ThumbnailGenerator`

The ThumbnailGenerator will receive a binary and create the corresponding thumbnail.

The two cases for Adventist are: PDFs and Images.

☄️ Derivative Rodeo Integration Epic

The goal of this punch list is to outline the steps necessary to verify that IIIF Print picks up the changes.

With the above SpaceStone and Derivative Rodeo adjustments

  • Set IIIF Print's application's logger level to :info
  • Update IIIF Print to use above Derivative Rodeo gem version
  • Update the IIIF Print configuration to leverage SpaceStone; this will require AWS credentials for the Pre Processed Buckets.
  • Run import of 20121816 entry (likely want to get a single CSV of this file)
  • Review logs; we should not see generating derivatives but instead should see log entries regarding found location
    • See Split PDF and constituent pages as works with expected derivative files (e.g. thumbnail and JSON)
  • Run import for a single image entry that has been ingested

Derivative Rodeo Integration Tests for PDF Splitting

The following are the scenarios I’m working through for integration testing:

  • Scenario: PDF Split does not exist
Given a work with a PDF
And SpaceStone has not split the PDF
When we import the PDF
Then the application should split the PDF
And attach the resulting split pages as child works
  • Scenario: Thumbnail of PDF does not exist
Given a work with a PDF
And SpaceStone has not pre-processed the thumbnail
When we import the PDF
Then the application generates a thumbnail
And attaches the thumbnail to the work
  • Scenario: PDF Split exists
Given a work with a PDF
And SpaceStone has split the PDF
When we import the PDF
Then the application should retrieve the pre-processed split pages
And attach the resulting split pages as child works
  • Scenario: Thumbnail of PDF exists
Given a work with a PDF
And SpaceStone has pre-processed the thumbnail
When we import the PDF
Then the application retrieves the pre-processed thumbnail
And attaches the thumbnail to the work

After the integration test we will need to:

🎁 Add Logging to Generator Process

The generator should log three moments:

  • When the output destination already exists
  • When we find the derivative in the pre-processed location
  • When we need to generate the derivative
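
A sketch of what logging those three moments might look like, using Ruby's standard `Logger`. The `log_generator_moment` method, its arguments, and the message wording are hypothetical and only show where each log call would sit.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch: logs exactly one message per generator run,
# depending on which of the three moments applies.
def log_generator_moment(logger, output_uri, output_exists:, pre_processed_exists:)
  if output_exists
    logger.info("Output #{output_uri} already exists; skipping generation")
    :existing
  elsif pre_processed_exists
    logger.info("Found pre-processed derivative for #{output_uri}; copying")
    :pre_processed
  else
    logger.info("Generating derivative for #{output_uri}")
    :generated
  end
end

# Log to an in-memory buffer so the example is self-contained.
io = StringIO.new
logger = Logger.new(io)
log_generator_moment(logger, "file:///tmp/GUID/file.hocr", output_exists: false, pre_processed_exists: true)

puts io.string
```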

♻️ Revisit the file parts and template concept

The SQS adapter has pushed ahead on how we think about the URIs. The other target/location classes do not account for this.

We want to favor template based methods for URIs and remove the file parts methods, which are related to earlier implementations.
