uchicago-library / attachment-converter

Attachment Converter: tool for batch converting attachments in an email mailbox

License: GNU General Public License v2.0

Makefile 1.89% OCaml 48.13% Shell 3.62% Perl 17.67% Terra 28.69%

attachment-converter's Issues

Research file format conversion utilities

The goal in this issue is to write up a quick first-stab reference document that will enable us to use whatever file format conversions we want to use for testing.

Here's the current list of format conversions we're planning to test with during development:

Source	Target
.pdf	PDF-A 1b
.pdf	plaintext
.doc	PDF-A 1b
.doc	plaintext
.docx	PDF-A 1b
.docx	plaintext
.xls	TSV
.xlsx	TSV
.gif	TIFF
.bmp	TIFF
.jpg	TIFF

What we want, for each of the conversions in this table, is a command you can run at the UNIX shell to perform them. It doesn't need to have variables, or abstract command line options away, or anything like that. Just a concrete command you can type in using filenames like e.g. input.doc or output.pdf. We will determine how to handle command line options and different assumptions about input/output in a later issue.

Quick example. To convert a .doc to a PDF, you can use LibreOffice on the command line:

$ soffice --headless --convert-to pdf:writer_pdf_Export --outdir . input.doc

So in other words, the goal of this issue is to compile a list of example shell commands like this, together with information about what software packages must be installed to run them. We can put it in the root directory of the project for now, and make it either Org or Markdown. (Follow your bliss!) Each section of your document can just indicate:

the name of the package to install
the command to run to perform the conversion

Some pointers on utilities

LibreOffice

In addition to the above example, which converts a .doc to a PDF, I believe LibreOffice can be used similarly to convert a .doc to plaintext, and also to convert an .xls to PDF or plaintext.

`pdf2archive`

The utility pdf2archive can be used to convert a plain vanilla PDF to an archival PDF-A 1b. (In fact, that utility is just a shell script that handles the labyrinthine command line options that Ghostscript requires to produce a PDF that meets the demanding PDF-A 1b spec.)

`pandoc`

I believe pandoc (written by philosopher of language and Haskeller extraordinaire John MacFarlane) can be used to convert .docx to either PDF or plaintext. See if you can figure out how using some combination of the documentation, Stack Overflow, and anything else that could be useful.

https://pandoc.org/

If I recall correctly, pandoc can also convert .xls files to TSV format.

Image conversion utilities

imagemagick and vips can, I believe, be used to convert images in most standard formats to TIFF. Haven't looked at them in depth, but please feel free to experiment with other utilities too.

https://imagemagick.org/
https://www.libvips.org/

Please feel free to pose any questions you may have on our Slack channel! Hopefully the above is enough to get started.

Better Converted File Naming

Currently, we append a simple time stamp to the name of a converted file. This is problematic if we want to convert a file to the same format with multiple tools. We do keep track of an id for each conversion which can be added to the filename.
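A minimal sketch of one possible naming scheme, assuming the conversion id and the timestamp are both available as strings. The function name `converted_name` and the `<basename>_<id>_<timestamp>.<ext>` layout are hypothetical, not something the project has settled on.

```ocaml
(* Hypothetical helper: builds "<basename>_<id>_<timestamp>.<ext>" so that
   two tools converting the same file to the same format get distinct names. *)
let converted_name ~id ~timestamp ~target_ext original =
  let base = Filename.remove_extension (Filename.basename original) in
  Printf.sprintf "%s_%s_%s.%s" base id timestamp target_ext
```

With this layout, two conversions of `report.doc` to PDF by different tools would differ in the id segment even when their timestamps match.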

Create Minimal Executable

You'll notice that our main.ml module currently does nothing except print a message saying that it is converting some attachments.

We will do a full command line interface spec in a later issue, probably using Daniel Buenzli's cmdliner library, but for this issue, the goal is just to get the world's simplest command line executable module up and running, in main.ml. Just splitting the argv list on whitespace should do it for these purposes. (Note that Prelude defines an argv constant in the form of a list, which tends to be easier to work with than the argv array that Stdlib provides. It also provides an argv0 constant, in case you need to make reference to the name under which the utility is run.)

Beyond that, this first stab at an executable should:

assume it will be given the text of a single email via standard in
parse that text into an email AST and convert whatever attachments in it the configuration file says should be converted
print the result to standard out
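The three steps above could be sketched roughly as follows, using only the OCaml standard library. `convert_email` is a hypothetical placeholder for "parse, convert attachments, serialize"; here it just returns its input unchanged.

```ocaml
(* Read everything available on a channel into one string. *)
let read_all ic =
  let buf = Buffer.create 4096 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    let n = input ic chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  loop ();
  Buffer.contents buf

(* Hypothetical placeholder for the real parse/convert/serialize pipeline. *)
let convert_email (email : string) : string = email

(* Entry point: one email in on stdin, converted email out on stdout.
   main.ml would call this with [let () = main ()]. *)
let main () =
  set_binary_mode_in stdin true;
  set_binary_mode_out stdout true;
  read_all stdin |> convert_email |> print_string
```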

Later on, when we flesh this out into a more mature 'minimum viable product' version of the command line interface, we will probably retain this 'pass one email into stdin and convert attachments in it to stdout' functionality as a way to run the utility, if need be, with the right command line options. It will likely be quite useful to have around for testing.

Cram test the conversion shell scripts

Now that we have determined how to run basic cram tests, the next step is to write some. For this issue, we'll focus on creating cram tests for our conversion shell scripts.

For each of the shell scripts in conversion-scripts, write a test which:

runs the script
checks that the exit status is 0
runs the file command on the output of the script to determine that the output is of the correct file format

Adding `conversion-scripts` as a cram dependency

Since dune creates a .sandbox directory to run cram tests in whenever they are run, we will need to figure out how to get dune to copy conversion-scripts into the .sandbox directory as well. Configuring dune is always a funky adventure, but it seems likely that this will involve using the deps option in the stanza that defines the cram tests in cram-tests/dune.

For more information:
https://dune.readthedocs.io/en/stable/tests.html#test-options

Once conversion-scripts has been added as a dependency within dune, it may also be necessary to add that directory to the path within the sandboxed environment. One simple thing we could try is to just add that directory to PATH inside the cram test, e.g.:

$ PATH=../conversion-scripts/:$PATH

Slight adjustment to CONVERT module signature

Change `acopy` and `amap` in `CONVERT` module signature

Update CONVERT so that amap and acopy take two inputs: one for modifying the content-type mime header and one for modifying the data of an attachment. Since strings in OCaml are just sequences of bytes, the inputs to each of these functions should be functions of type string -> string.

New signature (for now, until we change it):

module type CONVERT =
  sig
    type filepath
    type parsetree
    type htransform = string -> string
    type dtransform = string -> string
    val parse : string -> parsetree
    val amap : htransform -> dtransform -> parsetree -> parsetree
    val acopy : htransform -> dtransform -> parsetree -> parsetree
    val to_string : parsetree -> string
    val convert : filepath -> string -> string
    val acopy_email : string -> (string -> string) -> string
  end
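To make the two-transform idea concrete, here is a toy sketch over a simplified tree. The `part` and `tree` types are hypothetical stand-ins for the real parsetree, and the reading of amap as "replace the attachment" versus acopy as "keep the attachment and add a converted copy" is an assumption drawn from the function names, not a statement of the actual implementation.

```ocaml
type htransform = string -> string
type dtransform = string -> string

(* Toy stand-in for the real parsetree: a part is a header plus data. *)
type part = { header : string; data : string; attachment : bool }
type tree = Leaf of part | Multipart of tree list

(* Apply the header transform and the data transform to one part. *)
let convert_part (ht : htransform) (dt : dtransform) p =
  { p with header = ht p.header; data = dt p.data }

(* amap: replace each attachment with its converted version. *)
let rec amap ht dt = function
  | Leaf p when p.attachment -> Leaf (convert_part ht dt p)
  | Leaf p -> Leaf p
  | Multipart ts -> Multipart (List.map (amap ht dt) ts)

(* acopy: keep each attachment and put a converted copy next to it. *)
let rec acopy ht dt = function
  | Leaf p when p.attachment ->
    Multipart [ Leaf p; Leaf (convert_part ht dt p) ]
  | Leaf p -> Leaf p
  | Multipart ts ->
    Multipart
      (List.concat_map
         (function
           | Leaf p when p.attachment ->
             [ Leaf p; Leaf (convert_part ht dt p) ]
           | t -> [ acopy ht dt t ])
         ts)
```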

Progress bar should print to `/dev/tty`

Per our discussion on the #email_archives Slack channel, it would make more sense for the progress bar to print to /dev/tty rather than to stderr.

Currently, the progress bar is getting printed to stderr, which makes it visible by default. However, if the user redirects stderr to a file, which they may want to do in order to carefully inspect the error messages, then the progress bar is silenced. (The reason for this is that our current progress bar-printing function only prints conditionally on stderr being a tty.)

It should probably instead implement the following logic:

if stderr is a tty, do what we do now and print to stderr
otherwise, if /dev/tty can't be opened, do nothing
if /dev/tty can be opened, open and print to it
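The three cases above could be sketched like this, assuming the unix library is linked. The names `target`, `choose_target`, `open_tty`, and `print_progress` are hypothetical; only `Unix.isatty` and `Unix.stderr` come from the OCaml distribution.

```ocaml
type target = Stderr | Tty | Silent

(* Pure decision table for the three cases listed above. *)
let choose_target ~stderr_is_tty ~tty_openable =
  if stderr_is_tty then Stderr
  else if tty_openable then Tty
  else Silent

(* Try to open the controlling terminal; None when that is not possible. *)
let open_tty () =
  try Some (open_out "/dev/tty") with Sys_error _ -> None

(* Print one progress message to whichever target applies. *)
let print_progress msg =
  let stderr_is_tty = Unix.isatty Unix.stderr in
  let tty = if stderr_is_tty then None else open_tty () in
  match choose_target ~stderr_is_tty ~tty_openable:(Option.is_some tty), tty with
  | Stderr, _ -> prerr_endline msg
  | Tty, Some oc ->
    output_string oc (msg ^ "\n");
    close_out oc
  | _ -> ()
```

Keeping the decision in a pure function makes it possible to unit test the logic without a real terminal.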

Update `acopy` and `amap` to fit config spec

For this issue, update the type signature of acopy and amap so that instead of taking header and data transform inputs, they take Formats.t as an input. The type signature should work out to something approximately like this:

val acopy : Formats.t -> parsetree -> (parsetree, Formats.error) result

The new behavior should be: whenever acopy [amap] hits on an attachment, it consults the Formats.t dictionary to see which format conversions to perform on that attachment (if any).

Error handling

This will also involve introducing some error handling into acopy and amap. For this issue, we don't need tons of error handling: just the basics corresponding to whatever error possibilities exist in the code from issue #9. As of right now, it looks like the only error possibilities we have in Formats.Error.t are a refer parse error and a config file parse error. I'll leave it to the assignee's discretion whether they want to e.g.:

introduce a Lib.Error sum type that tagged union-s Formats.Error.t-s and another error datatype defined for email parse errors, mbox parsing errors, and so forth
for now, make the simplifying assumption that the only kind of error you can have is a config file parse error, leaving email parsing error handling for a later issue
just stick one or two basic email parsing errors into Formats.Error.t

I'll mention that if we decide to make a global Lib.Error.t datatype, that feels like a great use case for a row sum type, i.e. polymorphic variants. That should allow us to "split up" the error cases between different modules, but have all the variants trace back to a single error datatype.
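As a rough illustration of the polymorphic variant idea, the sketch below splits the error cases across two modules and then unions them into one type. The module names `Formats_error` and `Email_error`, the variant names, and the `parse_config` stub are all hypothetical.

```ocaml
(* Errors owned by the config/formats code. *)
module Formats_error = struct
  type t = [ `ReferParse of string | `ConfigParse of string ]
end

(* Errors owned by the email/mbox parsing code. *)
module Email_error = struct
  type t = [ `EmailParse of string | `MboxParse of string ]
end

(* The global error type is just the union of the per-module rows. *)
type error = [ Formats_error.t | Email_error.t ]

let message : error -> string = function
  | `ReferParse s -> "refer parse error: " ^ s
  | `ConfigParse s -> "config file parse error: " ^ s
  | `EmailParse s -> "email parse error: " ^ s
  | `MboxParse s -> "mbox parse error: " ^ s

(* A function in one module only mentions its own cases, yet its errors
   can flow into any code that expects the global [error] type. *)
let parse_config (s : string) : (string, [> Formats_error.t ]) result =
  if s = "" then Error (`ConfigParse "empty config") else Ok s
```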

Anyway, I leave it to the assignee's discretion exactly how much of the error handling rabbit hole to go down for this issue.

Update Sandboxed Switch Config to work off of `.opam` files

It has come to our attention that we don't actually need to be using the opam-lock utility for our build config. Whoops! opam-lock, it turns out, does the same thing as a pip freeze: it takes a working opam project and creates an Opam file for it (with the .opam.locked extension) consisting of the original .opam file with every version of every package locked down to its exact current version, plus all transitive dependencies.

That is unnecessary for current purposes. Our dune-project configuration generates .opam files, and we can build sandboxed switches from those directly.

Two Tasks

For this issue, the goal is to:

update the Makefile for the project to include make deps and make sandbox targets
update the doc/sandboxing.md so that it mentions these new make targets and omits all the complicated stuff involving opam-lock

`make deps`

Mirroring the design of spinup, the make deps rule will install all dependencies for the project to the current opam switch. This is a useful command to have lying around when you want to do dev work on this project without having to create an entire switch (which means waiting for a fresh OCaml compiler to build).

`make sandbox`

Also mirroring the design of spinup, the make sandbox rule will create a sandboxed switch in which the DLDC Opam repository is available, and which contains the latest OCaml compiler, plus Prelude, plus all of Attachment Converter's library dependencies from the opam repositories, and nothing else.

Add percentage complete info to progress bar

Add Progress to Completion Status to Progress Bar

A nice enhancement of our current progress bar would be to have it display the number of attachments it has already converted vs. the total number of attachments it has to convert. This feature would probably only make sense when the application is run in standard mode (i.e. in mbox processing mode).

For example, instead of saying this:

Processing complete.

The progress bar could say something like this:

Attachment 14/20 has finished processing.

Etc. I leave the fine points of prose style up to the developer assigned to this feature.

Reuse code from `report.ml`?

We discussed the possibility of reusing some of the code from the Report module. Without going into detail here, the rough strategy could be:

grep the mbox for lines containing Content-Disposition to get the total number of attachments
grep the mbox for lines containing CONVERTED to get the number of attachments that were put there by our software
subtract the second from the first to get the total number of original attachments, Y
create a counter to get the number of the current attachment, X
that will allow us to display the information: processing attachment X out of Y
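The counting strategy above can be sketched with plain string functions. This is not the code from the Report module, and the helper names are hypothetical. It treats the mbox as one string and counts matching lines, exactly as the two greps would.

```ocaml
(* True when [sub] occurs somewhere in [line]. *)
let contains ~sub line =
  let n = String.length sub and m = String.length line in
  let rec go i = i + n <= m && (String.sub line i n = sub || go (i + 1)) in
  go 0

(* Number of lines in [text] that contain [sub]. *)
let count_lines ~sub text =
  String.split_on_char '\n' text
  |> List.filter (contains ~sub)
  |> List.length

(* Y: all attachments minus the ones our software added. *)
let original_attachments mbox =
  count_lines ~sub:"Content-Disposition" mbox
  - count_lines ~sub:"CONVERTED" mbox

(* The message shown once attachment X out of Y is done. *)
let progress_message ~current ~total =
  Printf.sprintf "Attachment %d/%d has finished processing." current total
```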

Progress bar shouldn't print anything if it isn't modifying an attachment.

Currently, the Attachment Converter progress bar prints a before/after statement for every email, including those with no attachments and those with attachments it doesn't modify:

> attc < input.mbox > output.mbox
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.

Etc. We would like to hush these progress bar messages for emails with no attachments and/or with unmodified attachments, so that the progress bar only displays the before/after skeletons when the after skeletons introduce new converted copies.

Allow for multiple email parsing backends

Post Issue #75, the email parsing backend will be configurable via a command line flag or configuration option. For this issue, define a combinator that can take in two email parsing backends and output a new email parsing backend that implements alternative logic---i.e. the logic of the (<|>) operator from Haskell.

The logic of A <|> B is, roughly:

try A
if it succeeds, move on
if it fails, try B

Broadly speaking, this could most likely become a module functor that would take two modules of signature PARSETREE and output a new module with the same signature as the output of Conversion.Make, which implements the above logic for two email parsing backends A and B. Another option might be to implement an alternative module functor for PARSETREE modules, i.e. one whose return signature is also PARSETREE.
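A minimal sketch of the functor version, under the simplifying assumption that both backends share one parsetree type and report errors as strings. `PARSER` is a hypothetical, cut-down stand-in for the real PARSETREE signature, and `Alt` is a hypothetical name.

```ocaml
(* Cut-down stand-in for PARSETREE: just a tree type and a parser. *)
module type PARSER = sig
  type parsetree
  val parse : string -> (parsetree, string) result
end

(* Alt (A) (B) behaves like A <|> B: try A, fall back to B on failure. *)
module Alt (A : PARSER) (B : PARSER with type parsetree = A.parsetree) :
  PARSER with type parsetree = A.parsetree = struct
  type parsetree = A.parsetree

  let parse text =
    match A.parse text with
    | Ok _ as ok -> ok
    | Error _ -> B.parse text
end
```

Because the result is again a `PARSER`, the combinator can be chained, e.g. `Alt (Alt (A) (B)) (C)`.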

ifeq error in `make home-install` target

Currently, the home-install target is throwing the following error on sequent (Matt's Arch Linux box):

> make home-install
dune build --display short
dune install --display short
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/META
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/META
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/dune-package
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/dune-package
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/opam
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/opam
Deleting /home/teichman/.opam/4.14.1/bin/attachment-converter
Installing /home/teichman/.opam/4.14.1/bin/attachment-converter
Deleting /home/teichman/.opam/4.14.1/doc/attachment-converter/LICENSE
Installing /home/teichman/.opam/4.14.1/doc/attachment-converter/LICENSE
Deleting /home/teichman/.opam/4.14.1/doc/attachment-converter/README.org
Installing /home/teichman/.opam/4.14.1/doc/attachment-converter/README.org
ifeq [pacman --version]
libreoffice pandoc ghostscript libvips
else
libreoffice pandoc ghostscript vips verapdf
endif
echo Cloning attc git repo...
cd ~
mkdir attachment-converter
cd attachment-converter
git clone https://github.com/uchicago-library/attachment-converter.git
echo Copying shell scripts...
cd ~/attachment-converter
mkdir -p ~/.config/attachment-converter/scripts
cp conversion-scripts/*.sh ~/.config/attachment-converter/scripts
echo Installing to ~/bin/attc...
cp /home/teichman/.opam/4.14.1/bin/attachment-converter ~/bin/attc
ls -lh ~/bin/attc
echo Attachment Converter has been installed to ~/bin/attc.
echo Please ensure that ~/bin is on your path.
bash: line 1: ifeq: command not found
make: *** [GNUmakefile:85: home-install] Error 127

For this issue, see if you can:

a) reproduce the error on macOS
b) amend the code for the home-install target so that it works on both macOS and Linux platforms

Move documentation into its own part of the app

The README.md file is starting to get a bit big. In this issue, we'll create a directory called attachment-converter/doc whose purpose is to house all sections of README.md that we choose to break out into separate files.

First on deck to go into the doc/ directory will be a document explaining the workflow for developing under a sandboxed Opam switch, which we'll call sandboxing.md. This will serve as a sort of 'test run' for putting docs there. We'll also move file-format-conversions.md in there.

Create (first version of) progress bar

Task: create a progress bar for Attachment Converter.

Background

So out of the box, our initial version of Attachment Converter isn't exactly the fastest performing application in the world. Some of our initial tests took about a minute to convert all the emails in an mbox with a total of 10-15 attachments.

Our major bottleneck is caused by the fact that Attachment Converter calls out to external applications to perform its conversions. We haven't done much benchmarking, but one reasonable starting assumption is that LibreOffice, which we are currently using heavily to convert to PDF-A, takes a while to do each one.

There are a couple of "low hanging fruit" tactics we can take to lessen the runtime of the application on mbox-es of a realistic size:

automatically parallelizing the conversions (this is theoretically possible because none of the conversions logically depend on any of the others)
finding faster utilities to perform specific conversions

We will continue to explore those options as we work on the project. That said, no matter how many of these approaches to speeding things up we end up adopting, it seems pretty clear that at least for large mbox-es we will need to be prepared for a full round of attachment conversions to take a while.

Given the apparent inevitability of some amount of slowness, Attachment Converter will need to send some indication to the user of what is happening.

Progress Bar

Getting the progress bar to output useful information to the user while also outputting its actual data to standard out involves a little finessing of UNIX terminals and file handles.

Before we get into that, let's outline what information should be in the progress bar.

Layout

For this initial version, what we're calling a "progress bar" will just be a printed line of information with something like the following format:

converting <ATTACHMENT-FILENAME> to <ATTACHMENT-BASENAME>.<TARGET-EXTENSION> ...

It should print that line of information just before it begins each conversion, so that if that conversion takes five seconds, the user will see that line of information on the bottom of the screen for five seconds.

How to do it

Broadly, we want to:

open the current tty as a file handle and print the progress bar text to that
print the software's actual output to standard out

That approach will display both the output data and the progress bar messages intermixed at the same time, in the terminal. Really, what we want is an either/or situation:

print the progress bar if standard out is being redirected to a file
if standard out is not being redirected, print the software's actual output instead

The following UNIX hijinks should give us that result:

do a system call which asks, of standard out, "is this a tty"?
if the answer is yes, that means the output is not being redirected, and we print standard out with no progress bar
if the answer is no, that means the output is being redirected (which means it won't be printed to the console), and we print the progress bar to the tty

The Unix module (from the unix library that ships with OCaml) provides a pretty comprehensive interface to UNIX system calls. You can use Unix.isatty to check whether a device is a tty. Since this takes a Unix file descriptor as an input (rather than a channel), use Unix.stdout rather than Stdlib.stdout (formerly Pervasives.stdout).
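Putting the layout and the tty check together, a first version might look roughly like this. It assumes the unix library is linked. `progress_line` and `show_progress` are hypothetical names, and opening `/dev/tty` by path is one way (an assumption here) to get at "the current tty".

```ocaml
(* Builds "converting <FILENAME> to <BASENAME>.<TARGET-EXTENSION> ...". *)
let progress_line ~filename ~target_ext =
  Printf.sprintf "converting %s to %s.%s ..." filename
    (Filename.remove_extension filename)
    target_ext

(* Show the progress line only when stdout is redirected, so that it never
   gets mixed into the real output on the terminal. *)
let show_progress msg =
  if not (Unix.isatty Unix.stdout) then begin
    match open_out "/dev/tty" with
    | tty ->
      output_string tty (msg ^ "\n");
      close_out tty
    | exception Sys_error _ -> () (* no terminal available: stay silent *)
  end
```

A caller would run `show_progress (progress_line ~filename ~target_ext)` just before starting each conversion.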

Headers for New Multipart Messages

An email which is only an attachment will have roughly the form

(header,
  (Leaf data))

And so may be converted to something of the form

(~header,
  (Multipart [(header, (Leaf data)),
                     (header, (Leaf data)),
                     (header, (Leaf data))]))

The question is: what should ~header be? We can't simply adopt the header from the original email since this header will have a content disposition of attachment or inline. This header should be the minimal one which indicates that the message is a multipart, and should include some custom header field which says that we constructed it.

Note that the function make_header can be used here.
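For illustration only, here is what a minimal ~header could contain, written as a plain list of field/value pairs rather than through make_header (whose signature is not shown here). The `X-Attachment-Converter` field name and its value are hypothetical.

```ocaml
(* Hypothetical minimal header for a multipart wrapper we construct:
   it says "this is a multipart" and marks the part as ours. *)
let multipart_header ~boundary =
  [ ("Content-Type",
     Printf.sprintf "multipart/mixed; boundary=\"%s\"" boundary);
    ("X-Attachment-Converter", "generated multipart wrapper") ]
```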

(NOT AN ISSUE FOR THIS REPO) Create PR to Mr. Mime for serialize.ml code

Putting this issue here so that all our TODOs are in one place. We would like to eventually move the serialize.ml code out into the actual Mr. Mime project. This will involve contributing to it.

Test Suite for Generated Emails

I'll fill this in with more detail later. The basic idea: we need a cleaner pipeline for generating fake emails and then writing quick unit tests for them. This means fleshing out an interface for building these tests.

Attachment Converter should check that its dependencies are met

Attachment Converter currently just assumes that all of its OS-level dependencies are installed, so for example, if it is run on an email containing a JPEG attachment and vips isn't installed, it will try to run vips anyway and we're in the wild west of uncaught exceptions. What it needs to do is verify that all OS-level dependencies are installed when it is run and refuse to do anything if it can't find any of them, printing an error message describing which utilities need to be installed.

We'll eventually need to take a careful look at all of Attachment Converter's OS-level dependencies, but for a first stab, we can write code that:

examines the configuration the application is run using
checks that it can find something at every binary path indicated in the configuration file
if it fails to find anything, don't run
instead, print out a message telling the user to install whichever applications it can't find
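A first-stab check along these lines can be written with the standard library alone. It does not use Prelude's helpers, and the function names are hypothetical. It assumes the binary paths have already been extracted from the configuration as a list of strings.

```ocaml
(* Paths from the configuration at which nothing can be found. *)
let missing_binaries paths =
  List.filter (fun p -> not (Sys.file_exists p)) paths

(* Ok () when everything is present; otherwise an error message naming
   the utilities the user still has to install. *)
let check_dependencies paths =
  match missing_binaries paths with
  | [] -> Ok ()
  | missing ->
    Error
      ("Please install the following utilities before running: "
       ^ String.concat ", " missing)
```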

Prelude contains some helpful library functions to check whether OS-level dependencies are installed. I will post more information about that in a follow-up on this issue.

No-op bug

Currently, if a conversion tool fails, attachment converter "converts" the file anyway by taking the original and copying it to a new name. This is obviously bad, and I think I was planning to fix it but forgot (my bad). I'm creating an issue for it just to log the error.
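One possible direction for a fix, sketched under the assumption that each conversion is a shell command whose exit status signals success: surface the failure as a result value, and only create the converted copy in the Ok case. `run_conversion` is a hypothetical name.

```ocaml
(* Run a conversion command; a non-zero exit status becomes an Error
   instead of being silently ignored. *)
let run_conversion (command : string) : (unit, int) result =
  match Sys.command command with
  | 0 -> Ok ()
  | code -> Error code
```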

Spec out configuration file for Attachment Converter

This is kind of a big task and may end up having to be split up into multiple issues. Nonetheless, we'll begin with a number of tasks in a single issue and split it up where applicable.

determine the format of the config file
determine how best to parse the config file and what datatype the config file will be parsed into
determine how the information from that parsed config file will be incorporated into the logic of the app

Let's take these in turn.

Config File Language

Some preliminary design brainstorming has led me to conclude, at least for now, that GNU Refer is a good choice for a config file format for Attachment Converter. (We can always change it later if need be.) Here are some reasons why.

the syntax is dead simple, and therefore it's easy to parse
the syntax is easy enough that a non-technical person could edit it without too much difficulty
the syntax is simple in a way that reduces the risk of the user accidentally making the config file unusable because of an accidental typo
our use case doesn't require anything with a tree structure, which is doable but requires a little bit of subtle love/care in GNU Refer
Prelude, the OCaml standard library we are using for this project, has a Refer parser; so we can parse config files in this format without incurring a dependency on Yet Another third-party library
(rule of thumb: we like to avoid third-party dependencies except where absolutely necessary)

Getting your hands on an actual formal spec for refer is kind of annoying; you have to look at the GNU refer manpage. Nonetheless, a) we already have a parser for it and b) a quick example should illustrate how the syntax works. Suppose we would like to go through an email collection and perform two conversions: we want to convert all Word doc files to plaintext, and we want to convert all Word docx files to PDF-A-1b. The config file for performing those two conversions could look like this:

%source_type application/msword
%target_type text/plain
%shell_command /bin/doc2txt

%source_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
%target_type application/pdf
%shell_command /bin/docx2pdf

(Those conversion utilities are fictional, for illustrative purposes. The real conversion utilities we'll be using will be complex invocations of a command line app with lots of options.)

So: a refer record is a key-value type of dealio: the percent sign followed by any string followed by a space gives you the field name. In between the space and the line break-followed-by-a-percent is the value. A refer database is simply a list of refer records, each one separated by two line breaks. That means we can have one record for each conversion we would like the app to perform.
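To illustrate the record structure just described, here is a hand-rolled sketch using only the standard library. It is not Prelude's Refer parser, and it deliberately ignores details such as values that span several lines. `parse_field` and `parse_database` are hypothetical names.

```ocaml
(* Parse one "%key value" line; None for anything else. *)
let parse_field line =
  let line = String.trim line in
  if String.length line > 1 && line.[0] = '%' then
    (match String.index_opt line ' ' with
     | Some i ->
       let key = String.sub line 1 (i - 1) in
       let value =
         String.trim (String.sub line (i + 1) (String.length line - i - 1))
       in
       Some (key, value)
     | None -> None)
  else None

(* Split the text into records at blank lines; each record is a list of
   (field name, value) pairs in file order. *)
let parse_database (text : string) : (string * string) list list =
  let flush current acc =
    if current = [] then acc else List.rev current :: acc
  in
  let current, acc =
    List.fold_left
      (fun (current, acc) line ->
         if String.trim line = "" then ([], flush current acc)
         else
           match parse_field line with
           | Some kv -> (kv :: current, acc)
           | None -> (current, acc))
      ([], [])
      (String.split_on_char '\n' text)
  in
  List.rev (flush current acc)
```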

We will eventually have to finesse the syntax for the shell_command value so that it can handle the distinction between:

a utility that takes a filepath as an input and prints to stdout
a utility that takes stdin as an input and prints to stdout
a utility that takes both an input and an output filepath
etc.

However, we will wait until a later issue to add that bit of fanciness. (The rough plan will be to handle it using printf-like escape syntax.) For the purposes of getting up and running with something simple, assume for now that all command line utilities for performing conversions take a filepath as input and output to stdout.

Parsing the Config File

What datatype should we parse the config file into? First things first: let's create a new file for the config information at lib/config.ml, which will hold the datatype code given further down. Adding that file requires the following change to lib/dune:

(library
 (name lib)
 (libraries prelude versioj mrmime threads netstring unix)
 (modules lib config)
 (inline_tests (backend qtest.lib)))

I.e. what we had before, but with a modules S-expression whose tail includes lib and config.

A good first stab at laying out the datatype within config.ml would be something close to this:

module Formats = struct

  type htransform = string -> string
  type dtransform = string -> string
  
  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)
    
  type t = variety Dict.t
  
end

The Formats.variety datatype is a sum type whose purpose is to enumerate the different ways an email part could change/not change:

by leaving the MIME type header alone and just changing the data in the attachment, for when we don't need to update the MIME type in light of the change to the data
by changing the data in the attachment and updating the MIME type to reflect how the data were changed
leaving it alone

We should be able to parse a refer record of the type given above into a Formats.variety in the following way:

if the source MIME type and the target MIME type are the same (for example, converting PDF to PDF-A), it's a DataOnly whose dtransform is supplied by the path to the command line utility, passed into convert from #1
if they are different (for example, converting DOCX to plaintext), it's a DataAndHeader whose dtransform is supplied by the filepath and whose htransform is a function mapping the source MIME type string to the target MIME type string indicated in the refer record
we can either make everything else a NoChange based on an exhaustive list of all MIME types, or remove NoChange from the datatype (for now I think I like the idea of leaving it in)
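The first two rules above might be sketched as follows, assuming a record has already been parsed into a list of (field, value) pairs. The `convert` argument mirrors the shape of convert from the CONVERT signature (a command path, then the data). `variety_of_record` is a hypothetical name.

```ocaml
type htransform = string -> string
type dtransform = string -> string

type variety =
  | DataOnly of dtransform
  | DataAndHeader of (htransform * dtransform)
  | NoChange

(* Turn one record into (source MIME type, variety). *)
let variety_of_record ~(convert : string -> string -> string) record =
  match
    ( List.assoc_opt "source_type" record,
      List.assoc_opt "target_type" record,
      List.assoc_opt "shell_command" record )
  with
  | Some source, Some target, Some command ->
    let dtransform = convert command in
    if source = target then Ok (source, DataOnly dtransform)
    else
      (* Map the source MIME type to the target, leave anything else alone. *)
      let htransform mime = if mime = source then target else mime in
      Ok (source, DataAndHeader (htransform, dtransform))
  | _ -> Error "record needs source_type, target_type and shell_command"
```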

We can work out the details of the parsing error messages (for badly formatted config files) etc. while implementing the config file parser. I expect we can use Prelude's parser to do the real parsing, then write a little wrapper code to do cleanup on the result of that plus whatever domain-specific error messaging we might want.

To get started with the parser, check out Prelude's documentation:
https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Refer.html

One final note on the idea behind the Formats.t dictionary datatype. The thought here is that Formats.t will be a lookup table with strings representing MIME types (such as application/pdf, text/plain, and so forth) as keys and Formats.variety-s as values. Then, when we are recursing through an email parsetree and come across a part of the email that is an attachment, what we'll do is examine the MIME header in the part we're looking at, look it up in the Formats.t dictionary, and the Formats.variety value the dictionary gives us back will tell us what to do with it: do nothing, convert just the data in the attachment, or convert both the data and the header in the attachment. See the next section for more info on how amap and acopy will have to be revised to make use of a Formats.t input in this way.
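A small sketch of that lookup step, with the header and data of a part reduced to plain strings. `variety_for` and `apply` are hypothetical names. Treating a missing key as NoChange is one of the two options discussed above, not a settled decision.

```ocaml
module Dict = Map.Make (String)

type variety =
  | DataOnly of (string -> string)
  | DataAndHeader of ((string -> string) * (string -> string))
  | NoChange

(* Look a MIME type up in the table; unknown types mean "leave it alone". *)
let variety_for (formats : variety Dict.t) mime_type =
  match Dict.find_opt mime_type formats with
  | Some v -> v
  | None -> NoChange

(* Carry out what the variety says on one attachment. *)
let apply variety ~header ~data =
  match variety with
  | NoChange -> (header, data)
  | DataOnly dt -> (header, dt data)
  | DataAndHeader (ht, dt) -> (ht header, dt data)
```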

Design implications

The above design requires some revisions to our earlier spec from issues #1 and #6.

We'll keep the name amap for now, though since amap is at this point nowhere near functorial, we will eventually want to ditch the name. But the type should be updated to something like this:

val amap : Formats.t -> parsetree -> (parsetree, Formats.error) result

Where error is a datatype we will probably have to revise heavily, but which can start off on these lines, inside the Formats module:

module Formats = struct

  type htransform = string -> string
  type dtransform = string -> string
  
  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)
    
  type t = variety Dict.t

  module Error = struct
    type t =
      | ReferParse of string
      | Unix of (string * Unix.error)
      | MimeParse of string
      | CharacterEncoding of string
  end
  type error = Error.t

end

Similar changes to acopy are in order:

val acopy : Formats.t -> parsetree -> (parsetree, Formats.error) result

When amap [acopy] encounters a new part of a mail, it looks up the Content-Type of that part in the input Formats.t dictionary. If there is no entry for it, then it won't do anything to that part. Otherwise, the entry determines what kind of conversion to perform on that part of the mail, and it converts accordingly.

One last thing. We will probably save this for a later issue, but it might be nice to make the first input to amap and acopy an optional parameter that defaults to some config we're planning to test with a lot.

Write Conversion Scripts for More Conversion Tools

The current collection of conversion scripts only covers a small collection of the tools considered in our docs. We need to write the rest of these scripts so that they can be tested, though the primary motivation for this is that pdf2pdfa is probably significantly faster than soffice.

I'll include in here that we should just include a default soffice profile in our project in the hope that that speeds this up a bit. We should also check if having the user set the profile by hand is any faster.

Hash Original Attachment for Reference in Converted Attachment

I'll flesh out this issue later, I just wanted to make sure it got added. It will be important in implementing idempotence.

Make Header Look-ups More Robust

Currently the code for lookup_param, lookup_unstructured, etc., is not that robust. In particular, they are case-sensitive. We could in theory do what mrmime does here, but it would probably be sufficient to eliminate case-sensitivity, whitespace-sensitivity, etc.
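As a starting point, here is a minimal sketch of a case- and whitespace-insensitive lookup. It models the header as a plain association list, which is an assumption; normalize and lookup_header are hypothetical names, not the existing lookup_param or lookup_unstructured.

```ocaml
(* Sketch only: headers are modelled as a plain association list, which is
   an assumption; the real code works on the backend's header type. *)
let normalize (s : string) : string =
  String.lowercase_ascii (String.trim s)

(* Case- and whitespace-insensitive lookup of a header field by name. *)
let lookup_header (name : string) (fields : (string * string) list)
    : string option =
  let wanted = normalize name in
  List.find_map
    (fun (k, v) -> if normalize k = wanted then Some (String.trim v) else None)
    fields
```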

Parsing Messages using OCamlnet

One of the benefits of using mrmime over ocamlnet is that mrmime handles messages. To handle messages with ocamlnet, we basically have to:

Write a function like is_message to check if an email of the form (header, `Body body) is a message, which means checking its content-type by looking at header.
Parse the body using of_string. There are two options here. Either put everything in the result monad or simply don't recurse if parsing fails.
Recursively call replace_attachments on the resulting parse tree.
Serialize back into a string using to_string and put back into the body of the message.

Note that this has to be done for both the `Body case and the `Parts case, as they are handled slightly differently.

Also note that there are multiple subtypes for the message mime type. The only one that is pertinent here is rfc822. Dealing with the other two subtypes should probably go on the docket, but perhaps a bit further in the future.
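A rough sketch of the first two steps, under some loud assumptions: is_rfc822_message takes the raw Content-Type value as a string (pulling it out of an OCamlnet header is not shown), and rewrite_nested takes the parser, the rewriting function, and the serializer as hypothetical parameters. It follows the "simply don't recurse if parsing fails" option.

```ocaml
(* Sketch only: [content_type] stands for the raw value of the part's
   Content-Type header; obtaining it from an OCamlnet header is not shown. *)
let is_rfc822_message (content_type : string) : bool =
  (* Drop parameters such as "; name=..." and compare case-insensitively. *)
  let media_type =
    match String.split_on_char ';' content_type with
    | [] -> ""
    | t :: _ -> String.lowercase_ascii (String.trim t)
  in
  media_type = "message/rfc822"

(* Recurse into a nested message only when it parses; otherwise keep the
   body as it was. [parse], [rewrite] and [to_string] are hypothetical
   parameters standing in for the backend's functions. *)
let rewrite_nested ~parse ~rewrite ~to_string (body : string) : string =
  match parse body with
  | Some tree -> to_string (rewrite tree)
  | None -> body
```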

Keep All Header Values in Converted Attachment

A small issue. We currently create a new header for converted attachments that only includes Content-Type, Content-Disposition and Content-Transfer-Encoding. I don't actually think I've ever seen any other headers, but no need to assume there aren't any. We can just take the header of the file we're converting and update it, carrying along all the other header values.
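One way to picture "take the header and update it" is the sketch below. It assumes the header can be treated as an association list, and update_field is a hypothetical helper: it overwrites a field if present (matching the name case-insensitively) and appends it otherwise, leaving every other field where it was.

```ocaml
(* Sketch only: headers as an association list is an assumption. *)
let update_field (name : string) (value : string)
    (fields : (string * string) list) : (string * string) list =
  let same k = String.lowercase_ascii k = String.lowercase_ascii name in
  if List.exists (fun (k, _) -> same k) fields then
    (* Overwrite in place so the original field order is preserved. *)
    List.map (fun (k, v) -> if same k then (k, value) else (k, v)) fields
  else fields @ [ (name, value) ]
```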

Report Feature: Display All Attachment Types

For this issue, implement the following two functions that 'query' an email for information:

val attachment_types : string -> string list
val mbox_attachments : string -> string list

attachment_types takes an email string as an input and outputs a list of the MIME type strings corresponding to all its attachments, e.g. ["application/pdf", "text/plain"]. This should be recursive, so that if the email has parts of parts of parts, and those have attachments, those all go into the output list. Might as well sort it in alphabetical order.

mbox_attachments does the same thing at the MBOX level. So given an MBOX string as an input, it returns a list of all the MIME types that appear anywhere in that input MBOX, in alphabetical order.
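To show the recursion, here is a sketch over a toy tree rather than a real parsetree. The part type is hypothetical, and so are the function names; the real attachment_types and mbox_attachments take strings and would parse first. The per-email version keeps duplicates, and the MBOX-level version drops them, which is one reading of the description above.

```ocaml
(* Sketch only: [part] is a hypothetical stand-in for a parsed email. *)
type part =
  | Leaf of string            (* MIME type of a single attachment *)
  | Multipart of part list    (* nested parts *)

(* Collect MIME types from parts of parts of parts. *)
let rec mime_types (p : part) : string list =
  match p with
  | Leaf t -> [ t ]
  | Multipart ps -> List.concat_map mime_types ps

(* Per-email report: alphabetical order, duplicates kept. *)
let attachment_types_of_tree (p : part) : string list =
  List.sort String.compare (mime_types p)

(* MBOX-level report: every MIME type that appears anywhere, once. *)
let mbox_types (emails : part list) : string list =
  List.sort_uniq String.compare (List.concat_map mime_types emails)
```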

They can go in their own module called Report, either in Lib or in a separate file called lib/report.ml. The motivation behind putting them in their own module is that it seems likely we will be incorporating a bunch more "report" features into Attachment Converter, and they will eventually turn into command line options, or perhaps a subcommand.

Preliminary MBOX-level conversion

Thus far, we have been developing Attachment Converter to work on a per-email basis. In this issue, we'll take our first baby steps towards making the application do what it is actually going to do, namely: provide a map from an input MBOX to an output MBOX.

The final version will most likely have certain optimizations in it. For example, it might stream the emails in from the input MBOX one at a time, so that each conversion process won't have more than one email loaded into memory at a time, and it might also perform as many conversions as it can in parallel. (Conceptually that's easy, given that no conversion will depend on the output of any other conversion, but we all know that parallelism can be harder than it looks to pull off safely. So I guess I'll say I'm cautiously optimistic!)

However, this preliminary version doesn't yet need to be optimized in either of those ways. It can load the entire input MBOX into memory, and it can perform every conversion it's going to perform sequentially. The focus for this go-round should just be on making the logic correct:

parse an input MBOX string into a list of emails
map convert (with the appropriate configuration) over the list
output the result as a string
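The three steps above can be sketched roughly as follows. This is an in-memory, sequential toy: it assumes "\n" line endings and correctly escaped From lines, and split_mbox and convert_mbox are hypothetical names. It keeps each original From line rather than generating a new one.

```ocaml
(* Sketch only: splits on lines starting with "From " and assumes "\n" line
   endings and correctly escaped bodies. *)
let is_from_line (line : string) : bool =
  String.length line >= 5 && String.sub line 0 5 = "From "

(* Split an MBOX string into (From line, message text) pairs. *)
let split_mbox (mbox : string) : (string * string) list =
  let flush cur acc =
    match cur with
    | None -> acc
    | Some (from, lines) -> (from, String.concat "\n" (List.rev lines)) :: acc
  in
  let rec go cur acc = function
    | [] -> List.rev (flush cur acc)
    | l :: rest when is_from_line l -> go (Some (l, [])) (flush cur acc) rest
    | l :: rest -> (
        match cur with
        | None -> go None acc rest (* text before the first delimiter *)
        | Some (from, lines) -> go (Some (from, l :: lines)) acc rest)
  in
  go None [] (String.split_on_char '\n' mbox)

(* Map [convert] over every message and glue the MBOX back together. *)
let convert_mbox (convert : string -> string) (mbox : string) : string =
  split_mbox mbox
  |> List.map (fun (from, msg) -> from ^ "\n" ^ convert msg)
  |> String.concat "\n"
```

With the identity function as convert, the output is the same string as the input, which makes for a cheap sanity check before plugging in the real per-email conversion.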

Once we have working code to promote the per-email functionality of the application to the MBOX level, we can then tackle the fun task of making it as efficient as we can. The code can go either in lib/lib.ml or in a separate file called lib/mbox.ml. If it goes in lib/lib.ml, it should probably be in its own module. (The assignee could call the module Mbox, or something to that effect.)

One final note: don't forget about the to_mbox function, introduced into the code as part of #11. It ought to be useful for getting this code up and running.

Create `.mli` documentation

As we get ready to release Attachment Converter, we need to start documenting the code in more detail. The first step of that is:

creating interface files for all modules in the project
writing docstrings for all values those interface files expose

One of the ways we've customarily cut corners in the DLDC is to skip defining .mli files for our projects, which means this will be kind of a new endeavor for us. That said, we can recommend the following general approach, for each of the module files in the project:

generate an initial .mli file using the ocaml-print-intf utility
comment the whole thing out
see what compiler errors result from the interface exposing nothing
uncomment the value the compiler complains isn't in scope
rinse and repeat

Once we have used this technique to identify the minimal set of values that each module needs to expose, we can set about writing docstrings for those values, which are what will go into our auto-generated odoc documentation.

For more info on the syntax of odoc's markup language, please see:
https://ocaml.github.io/odoc/odoc_for_authors.html

Config File Serialization

I'm making a dummy issue for this problem (sorry Matt if you've already started). Feel free to populate it with more info.

Basically: generate the default config file from attc by serializing the default config.

Stateless Build Config, Part 1: Sandboxed Opam Switches

Once we're getting closer to releasing Attachment Converter, we'll want to offer some options to help developers build it on their system. One traditional approach to this is to write up a detailed document explaining how to install all the relevant developer tools, os dependencies, and library dependencies, then say a bunch of hail marys and hope to heck it'll work, which it certainly won't. The reason this never works is that when you get something to work on your system, it probably only happened because of a bunch of different system environment settings that you either forgot or didn't know you had turned on. On another random person's machine, most of those assumptions won't hold anymore, and if you didn't realize your installation instructions were making them, they are pretty much guaranteed not to work.

Enter stateless build environments. The goal of a stateless build environment is to write down, explicitly, every last little thing that is required to compile some code, from OS-level dependencies to programming language libraries; the whole nine yards. Because the stateless build environment isn't drawing on any information that isn't available in the local config, you have a guarantee that if it builds on your machine, it will also build on the machine of any other developer who has the same stateless build tool installed.

Opam-Lock

In 2022, there are a number of options for getting The Stateless Build Config Of Your Dreams off the ground. In future issues we'll look at Nix and possibly also Esy--but for this issue, we'll start with what ought to be the easiest one to get working: sandboxed opam switches.

The basic idea is as follows. You create a special opam switch that is just for your project. (As opposed to a global switch; a global switch will typically have an exact OCaml compiler version for a name and contain pretty much every opam package you're planning to work with on your system.) You then set your build environment up the normal way, installing all packages you need until your project builds. Then you use a utility called opam-lock—somewhat similar in concept to pip freeze—to record information about which opam packages are installed in that switch.

Once you've created the opam lock file, another user who has opam installed and has a copy of the lock file can use opam to create a brand new switch in the image of the lock file. Assuming there are no errors installing dependencies for your project, that other user should then be able to compile your project on their system, because they will have everything they need to do so in their new switch.

Here is a guide that can hopefully get you started:
https://khady.info/opam-sandbox.html

Attachment Converter

The goal for this issue is to perform the above steps for our project, create an opam lock file, and add it to our GitHub repository. You should be able to test it on your system, but for a more ambitious test you can choose another computer; maybe even the computer of someone who doesn't use OCaml!

The steps are, roughly:

create the new sandboxed switch
ask dune to tell you what Attachment Converter's opam dependencies are
use the opam-lock utility to generate the lock file from your sandboxed switch

fix acopy bug

acopy should make a fresh copy of the entire message part instead of modifying the body in-place

Write MBOX conversion code for testing

This code can be approached as something we'll only use for casual testing while in development, rather than as part of our production code. (Though it may turn out to be easy to turn into something robust enough for production; we'll see.)

What we want for this issue is a function that will take a list of individual emails (each in the form of a string) and make it into an MBOX. This will provide a convenient way to take an email we generated, and see whether it will pass whatever we've decided is our standard of validation. (At present, our standard of validation is whether the file will open in Apple Mail, since Apple Mail is the only mail user agent (MUA) normal computer users would have heard of that can read MBOX.)

Something in the ballpark of this type signature ought to work:

val to_mbox : ?escape:bool -> string list -> string

to_mbox can go in a new module called lib/mbox.ml. As good a place as any for the time being. For more info on the optional escape parameter, please see below.

Background on MBOX

Much like the specification of email itself, the MBOX format is pretty nuts. The thing to note is that an MBOX is just a flat list. The emails appear in that list, and the delimiter the format uses is a From string that looks like this:

From foo@bar Fri Jan 21 11:48:27 2022

Most MUAs will accept any line starting with From (capital F, one space after the word) as an MBOX delimiter, but for maximum compatibility, an email address and date afterward are recommended. It doesn't matter what they are because they get ignored when the MBOX is parsed into a list of emails. Why? The From line is just a delimiter that's considered to be a part of the mailbox; it isn't part of any email.

This is in contrast to lines beginning with From: (that's 'from' with a colon): those are actual email headers, which means any line starting with From: you encounter while flipping through a file will be part of an email.

For more background on this absurd weirdness, along with different conventions for escaping the MBOX delimiter, please see:

https://en.wikipedia.org/wiki/Mbox

Desired Behavior

At the MBOX level

to_mbox should take a list of emails (i.e. email strings) and do the following:

intersperse them with a From line resembling the example above---in fact, it can literally just be the exact example above every time, unless you want to get fancy and insert the current date/time
if there is no \r\n (CRLF) at the end of a given email, add two
if there is a CRLF at the end of a given email, add one

You can test that the result works by trying to import it into Apple Mail.

Character escaping

Because the delimiter for the MBOX format is the From line, this leads to all the usual annoyances re: quoting and character escaping. For example, imagine that the following Classic Britney Lyrics were part of the body of an email:

And you didn't hear
All my joy through my tears
All my hopes through my fears
Did you know, still, I miss you somehow?
From the bottom of my broken heart
There's just a thing or two I'd like you to know
You were my first love, you were my true love
From the first kisses to the very last rose
From the bottom of my broken heart
Even though time may find me somebody new
You were my real love, I never knew love
'Til there was you
From the bottom of my broken heart
Baby, I said, please stay (stay)
Give our love a chance for one more day, oh
We could've worked things out (taking time is what it's all about)
Taking time is what love's all about (oh)

An MBOX parser would obviously not want to parse the above string into four separate emails. The traditional workaround, discussed in the Wikipedia article linked to above, is to replace all From -s with >From -s in the input string. That leads to further problems, because >From could also theoretically be intended to be part of an email body.

A more robust way to handle this situation is to MIME-encode every email body using quoted-printable (rather than base64) encoding when parsing an MBOX:

https://en.wikipedia.org/wiki/Quoted-printable

This has the advantage of allowing you to sleep better at night re: parse errors, but the disadvantage of turning every single email in the input MBOX into a MIME-encoded email. For the purpose of being able to view things in an MUA that makes no difference, but for archival purposes, we generally want to err on the side of keeping as much of the original information in the input MBOX as we can intact. (Like, maybe Indiana Jones of the future is looking at Professor Smartypants' email backup and is interested in how many of their emails were MIME-encoded.) The jargon for this among archivists is 'original order':

https://en.wikipedia.org/wiki/Original_order

Keith and I discussed these trade-offs at some length and settled on the following solution for now. All the input MBOX-es we are planning to handle were either:

created by libpst
created by GMail
the actual format the person's MUA was using

That means that we can safely assume 'the input has correctly-escaped From lines' as an invariant, which in turn means that our handling of unescaped From lines can be more minimal than it would be for the kind of recalcitrant input we are fully expecting to have to deal with. So for the purposes of to_mbox, I think we can get away with it having an optional parameter of type bool, call it escape. The behavior would be as follows:

if escape is true, have to_mbox replace all occurrences of "From " with ">From " and all occurrences of ">From " with ">>From "
if escape is false (which it will be by default), have to_mbox throw an exception when it encounters From in any of the strings in the input list

You can define the relevant exception along these lines:

# exception MBOXParseError of string;;
exception MBOXParseError of string
# raise @@ MBOXParseError "whatever info you want in here";;
Exception: MBOXParseError "whatever info you want in here".
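Putting the desired behavior together, a first stab at to_mbox might look something like the sketch below. Several things in it are assumptions rather than settled decisions: the From line is the fixed example from above and is terminated with CRLF, only lines that begin with "From " (or ">From ") are treated as needing escaping, and guard_line, starts_with and ends_with_crlf are hypothetical helpers.

```ocaml
exception MBOXParseError of string

(* Fixed delimiter, copied from the example above. *)
let from_line = "From foo@bar Fri Jan 21 11:48:27 2022"

let starts_with (prefix : string) (s : string) : bool =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

let ends_with_crlf (s : string) : bool =
  let n = String.length s in
  n >= 2 && String.sub s (n - 2) 2 = "\r\n"

(* Escape (or reject) lines that would be read as MBOX delimiters.
   Assumption: only lines *starting* with "From " count. *)
let guard_line ~(escape : bool) (line : string) : string =
  if starts_with "From " line then
    if escape then ">" ^ line
    else raise (MBOXParseError ("unescaped From line: " ^ line))
  else if escape && starts_with ">From " line then ">" ^ line
  else line

let to_mbox ?(escape = false) (emails : string list) : string =
  let one email =
    let body =
      String.split_on_char '\n' email
      |> List.map (guard_line ~escape)
      |> String.concat "\n"
    in
    (* One CRLF if the email already ends in CRLF, otherwise two. *)
    let padding = if ends_with_crlf body then "\r\n" else "\r\n\r\n" in
    from_line ^ "\r\n" ^ body ^ padding
  in
  String.concat "" (List.map one emails)
```

Escaping in a single pass (prefix one ">" to any line that starts with "From " or ">From ") avoids escaping the same line twice, which is what two sequential search-and-replace passes would do.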

Getting Started

Here is some example MBOX-parsing code from the precursor to Prelude, a standard library called Kw. You can use it as a basis for our MBOX parsing code---in fact, probably with few to no changes.

(** {1 Mbox parser ({i Xavier Leroy})}

  Snarfed from: <{{:http://cristal.inria.fr/~xleroy/software.html#spamoracle}http://cristal.inria.fr/~xleroy/software.html#spamoracle}>

  Hacked by KW 2010-05-13 <{{:http://www.lib.uchicago.edu/keith/}http://www.lib.uchicago.edu/keith/}>
    - added map and fold functionals

  @author Xavier Leroy, projet Cristal, INRIA Rocquencourt
 *)
(***********************************************************************)
(*                                                                     *)
(*                 SpamOracle -- a Bayesian spam filter                *)
(*                                                                     *)
(*            Xavier Leroy, projet Cristal, INRIA Rocquencourt         *)
(*                                                                     *)
(*  Copyright 2002 Institut National de Recherche en Informatique et   *)
(*  en Automatique.  This file is distributed under the terms of the   *)
(*  GNU Public License version 2, http://www.gnu.org/licenses/gpl.txt  *)
(*                                                                     *)
(***********************************************************************)

(* $Id: mbox.ml,v 1.4 2002/08/26 09:35:25 xleroy Exp $ *)

(* Reading of a mailbox file and splitting into individual messages *)

type t =
  { ic: in_channel;
    zipped: bool;
    mutable start: string;
    buf: Buffer.t }

let open_mbox_file filename =
  if Filename.check_suffix filename ".gz" then
    { ic = Unix.open_process_in ("gunzip -c " ^ Filename.quote filename);
      zipped = true;
      start = "";
      buf = Buffer.create 50000 }
  else
    { ic = open_in filename;
      zipped = false;
      start = "";
      buf = Buffer.create 50000 }

let open_mbox_channel ic =
    { ic = ic;
      zipped = false;
      start = "";
      buf = Buffer.create 50000 }

let read_msg t =
  Buffer.clear t.buf;
  Buffer.add_string t.buf t.start;
  let rec read () =
    let line = input_line t.ic in
    if String.length line >= 5
    && String.sub line 0 5 = "From "
    && Buffer.length t.buf > 0 then begin
      t.start <- (line ^ "\n");
      Buffer.contents t.buf
    end else begin
      Buffer.add_string t.buf line;
      Buffer.add_char t.buf '\n';
      read ()
    end in
  try
    read()
  with End_of_file ->
    if Buffer.length t.buf > 0 then begin
      t.start <- "";
      Buffer.contents t.buf
    end else
      raise End_of_file

let close_mbox t =
  if t.zipped
  then ignore(Unix.close_process_in t.ic)
  else close_in t.ic

let mbox_file_iter filename fn =
  let ic = open_mbox_file filename in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

(** [mbox_file_fold fn inchan acc]: fold the function [fn] over the messages in the mbox file open on [inchan] with [acc] as initial accumulator. *)
let mbox_file_fold fn inchan acc =		(* KW *)
  let ic = open_mbox_channel inchan in
  let rec loop acc =
    match try Some (read_msg ic) with End_of_file -> None with
      | Some msg -> loop (fn acc msg)
      | None     -> acc
  in
    loop acc

(** [mbox_file_map fn filename]: map the function [fn] over the messages in the mbox file [filename]. *)
let mbox_file_map fn filename =		(* KW *)
  let ic = open_in filename in
    try
      let result = List.rev (mbox_file_fold (fun acc msg -> fn msg::acc) ic []) in
	close_in ic;
	result
    with exn -> close_in ic; raise exn

let mbox_channel_iter inchan fn =
  let ic = open_mbox_channel inchan in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

let read_single_msg inchan =
  let res = Buffer.create 10000 in
  let buf = Bytes.create 1024 in
  let rec read () =
    let n = input inchan buf 0 (Bytes.length buf) in
    if n > 0 then begin
      Buffer.add_subbytes res buf 0 n;
      read ()
    end in
  read ();
  Buffer.contents res

Make Email Backend Configurable

Currently, to switch between Mr. Mime and Ocamlnet as backends, all you have to do is change this line in convert.ml:

module Converter = Ocamlnet_Converter

To this:

module Converter = Mrmime_Converter

For this issue, make the email backend an option in the configuration file (or in a command line flag, or both) and set the value of the Converter module according to what is in the configuration file. In the config file case, it could look something like this:

%backend mrmime

In the command line flag case, it could look approximately like this:

$ attc --backend=mrmime < input.mbox > output.mbox
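One possible way to turn a config value or flag into a module is with first-class modules, roughly as sketched here. The CONVERTER signature and the two dummy modules are stand-ins invented for this example (the real ones would inhabit the CONVERT signature), and backend_of_string is a hypothetical name.

```ocaml
(* Sketch only: the two modules below are dummies standing in for
   Ocamlnet_Converter and Mrmime_Converter. *)
module type CONVERTER = sig
  val name : string
end

module Ocamlnet_Converter : CONVERTER = struct let name = "ocamlnet" end
module Mrmime_Converter : CONVERTER = struct let name = "mrmime" end

(* Pick a backend from the value of a "%backend" line or "--backend" flag. *)
let backend_of_string (s : string) : (module CONVERTER) option =
  match String.lowercase_ascii (String.trim s) with
  | "ocamlnet" -> Some (module Ocamlnet_Converter : CONVERTER)
  | "mrmime" -> Some (module Mrmime_Converter : CONVERTER)
  | _ -> None
```

Returning an option leaves it to the caller to decide whether an unknown backend name is a hard error or falls back to a default.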

Write a MIME Type to Extension Function

We currently have an incomplete definition of the mime type to extension function in lib/configuration.ml. This should be fleshed out completely at least for the most common mime types, and then we need to make sure to log the fact that we defaulted to using no extension in the case that we see an uncommon mime type.
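For illustration, here is a small sketch of what the fleshed-out function could look like. The table is a short, illustrative subset rather than the full list, extension_of_mime is a hypothetical name, and writing to stderr stands in for whatever logging we end up using.

```ocaml
(* Sketch only: the table is a small illustrative subset. *)
let extension_of_mime (mime : string) : string =
  match String.lowercase_ascii (String.trim mime) with
  | "application/pdf" -> ".pdf"
  | "text/plain" -> ".txt"
  | "application/msword" -> ".doc"
  | "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ->
      ".docx"
  | "application/vnd.ms-excel" -> ".xls"
  | "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ->
      ".xlsx"
  | "text/tab-separated-values" -> ".tsv"
  | "image/gif" -> ".gif"
  | "image/bmp" -> ".bmp"
  | "image/jpeg" -> ".jpg"
  | "image/tiff" -> ".tiff"
  | other ->
      (* Uncommon type: record that we fell back to no extension. *)
      prerr_endline ("no known extension for MIME type: " ^ other);
      ""
```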

Refactor Conversion Code

Move conversion code to its own file, better mirroring the structure of the rest of the code.

Include ID for Conversion in the Config File

We've talked about the possibility that we might want to do the same kind of conversion that has already been done, but with a different tool. It also seems generally easier if we have an identifier for the conversions themselves, independent of the shell command. I think the most natural way to do this is to include a required id entry in the config:

%source_type application/pdf
%target_type application/pdf
%shell_command /Users/nmmull/Developer/Repositories/attachment-converter/conversion-scripts/soffice-wrapper.sh -i pdf -o pdf
%id soffice-pdf-to-pdfa

%source_type application/pdf
%target_type text/plain
%shell_command /Users/nmmull/Developer/Repositories/attachment-converter/conversion-scripts/pdftotext-wrapper.sh
%id pdftotext-pdf-to-txt

...
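To make "required id entry" concrete, here is a sketch of reading one such block into a record and rejecting it when %id is missing. The record type entry and the helpers parse_field and entry_of_lines are hypothetical; the existing config parser may be structured quite differently.

```ocaml
type entry = {
  source_type : string;
  target_type : string;
  shell_command : string;
  id : string;
}

(* Split "%key value" into (key, value); other lines give None. *)
let parse_field (line : string) : (string * string) option =
  let line = String.trim line in
  if line = "" || line.[0] <> '%' then None
  else
    match String.index_opt line ' ' with
    | None -> Some (String.sub line 1 (String.length line - 1), "")
    | Some i ->
        Some
          ( String.sub line 1 (i - 1),
            String.trim (String.sub line i (String.length line - i)) )

(* Build an entry from one blank-line-delimited block; [id] is required. *)
let entry_of_lines (lines : string list) : (entry, string) result =
  let fields = List.filter_map parse_field lines in
  let get k = List.assoc_opt k fields in
  match get "source_type", get "target_type", get "shell_command", get "id" with
  | Some source_type, Some target_type, Some shell_command, Some id ->
      Ok { source_type; target_type; shell_command; id }
  | _, _, _, None -> Error "missing required %id field"
  | _ -> Error "missing %source_type, %target_type or %shell_command"
```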

Create Test Database of Fake Emails

In this issue, we will create a database of emails for testing. 'Database' in quotes, of course; as this will involve no actual databases. Rather, the idea is simply to have a list of emails that are version-controlled, in this repository, that we can use for testing. (And which any developers who want to try the project out can use for testing.) What we want, for each source MIME type, is an email that contains an attachment of that source type.

To an extent, this follows the paradigm we follow on the UChicago Library Wagtail site, where there is a development database that looks similar to the production database, except that rather than having information about actual library staff stored in it, it has information about fictional Star Trek characters in it. It is notably smaller, because rather than having a development database that's comparable in size to the actual production database, the goal is to have one fictional staff member represent every 'staff member possibility' our application logic is meant to cover. Similarly here, our testing database need not be large, but we'd like it to have at least one example of every type of email we expect to encounter in the wild. Then we'll have something to write a lot of our unit test suite against.

Constructing the Test Database

My recommendation is to base each test email on an actual email from one of our collections, but scrub all personal information from it. That is:

all email addresses should be changed to made-up email addresses
all IP addresses should be changed to made-up IP addresses
all sender and recipient names should be changed to fake names
the body of every email should be replaced with different text---as to what, exactly, feel free to follow your bliss
every attachment in an email should be replaced with a different attachment of the same MIME type (so, for example, if you come across a .doc attachment, remove that from the email and replace it with a .doc of your creation, containing whatever dummy text you see fit to have it contain)

The emails should be in individual files, in a subdirectory of the tests/ directory in the project. Could be tests/test_emails, or whatever the assignee would like to call it. I'll leave the exact naming scheme up to the preference of the assignee as well: it could be email1, email2, etc. or something else.

Revisit Mr. Mime as an email parsing backend

When Owen Price Skelly began work on this project, the library we were using for an email parsing backend was Mr. Mime. In the fall of 2021, we switched to using OCamlnet as our email parsing backend, mainly due to challenges getting everything up and running with Mr. Mime.

The main challenge was that Mr. Mime, astonishingly, does not come with a function to serialize parsetrees back into email strings. We learned the reason for that after conversing with the author of the library, and it was interesting. The reason is that Mr. Mime mainly exists in order to help developers write mail user agents in OCaml, in a way that is compatible with the Mirage unikernel ecosystem. But a mail user agent just needs to parse emails; once it has the data it needs, it works with that directly. There's no need to serialize a parsetree back into a well-formed email. So coincidentally, our use case never really came up.

Getting back to Attachment Converter, we would ideally like it to have multiple email parsing backends. There are a few reasons for this. One is that that should give us the flexibility to try parsing each email twice. Since the email specification is so complex, it is inevitable that any two email parsing libraries will differ at least slightly in what emails they consider to be syntactically well-formed. If Attachment Converter could try parsing with one backend and then with another in the event that that fails, it would be able to convert a larger range of attachments than otherwise. Who knows what we will come across in the wild? It's probably best to be as prepared as we can.
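The "try one backend, then another" idea can be sketched as below, assuming each backend's parser is wrapped to return a result. parse_with_fallback is a hypothetical name, and using a string for the error is a simplification.

```ocaml
(* Try each parser in order and return the first success; if they all fail,
   report every error message that was collected along the way. *)
let parse_with_fallback (parsers : (string -> ('tree, string) result) list)
    (email : string) : ('tree, string) result =
  let rec go errors = function
    | [] -> Error (String.concat "; " (List.rev errors))
    | p :: rest -> (
        match p email with
        | Ok tree -> Ok tree
        | Error e -> go (e :: errors) rest)
  in
  go [] parsers
```

Note that this shape requires every parser in the list to produce the same tree type, so in practice each backend would need to finish its own conversion (or be converted to a shared representation) before the results can be combined.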

Starter Code Part 1: Practicum Code

In the original Winter 2020 practicum version, the project did not yet have the structure it currently has. However, for your reference, I have put the original Mr. Mime code into a new branch called [mrmime-starter-code](https://github.com/uchicago-library/attachment-converter/tree/mrmime-starter-code). mrmime_starter_code.ml contains the source for the email parsing code using Mr. Mime and mrmime_todos.org contains notes on how far the Mr. Mime version of the email parsing backend progressed.

Here's a quick summary of what those notes say the starter code can do:

parse an email
pull an attachment out of an email parse tree
decode it to binary file data
write those binary data to a file

The remaining core functionality would involve:

applying the external conversion utility
capturing the output of that utility in the form of a string
putting that string back in the email parsetree in the form of a new attachment
serializing the parsetree into a string

In other words, it would involve implementing a module inhabiting the CONVERT signature for Mr. Mime. Maybe call it Conversion_mrmime, to follow our prior naming scheme?

Starter Code Part 2: From The Author Himself

Romain Calascibetta was gracious enough to provide us with example code that creates a new email from scratch. In principle, we should be able to use this to build the original email back up again from the converted attachment.

Here is a link to that code:
https://github.com/mirage/mrmime/blob/master/examples/attachment.ml

Config file should be in several canonical places

Currently, Attachment Converter looks for the config file (.config) in the current working directory, i.e. the directory where it is run from. This is ok for simple testing, but in practice, the user is going to be running it from wherever they run it from---and where they run it from certainly shouldn't affect the configuration it uses!

The standard approach in these situations is for a command-line tool to look for the configuration file in a couple canonical places. As a first stab, perhaps we could have Attachment Converter check for these locations in descending order:

a path specified on the command line (for example: attachment-converter --single-email --config=~/my-config-files/.acrc)
a path given by a shell environment variable called something like $AC_CONFIG
something in the .config directory, which is where a lot of UNIX applications keep their configuration files, so:
~/.config/attachment-converter/acrc
~/.acrc
if no config file exists at any of these paths, Attachment Converter should likely default to using a hardcoded configuration, perhaps a constant defined in configuration.ml
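The search order above could be sketched like this. The names candidate_paths and find_config are hypothetical, cli_path models the value of a --config flag, and AC_CONFIG is the environment variable name floated above, not something that exists yet. Returning None is the signal to fall back to the hardcoded configuration.

```ocaml
(* Sketch only: $AC_CONFIG and the paths follow the list above; [cli_path]
   models the optional --config flag. *)
let candidate_paths ?cli_path () : string list =
  let home = Option.value (Sys.getenv_opt "HOME") ~default:"" in
  (* Keep only the candidates that are actually defined, in priority order. *)
  List.filter_map (fun x -> x)
    [ cli_path;
      Sys.getenv_opt "AC_CONFIG";
      Some (Filename.concat home ".config/attachment-converter/acrc");
      Some (Filename.concat home ".acrc") ]

(* First candidate that exists on disk, or None to signal "use the
   hardcoded default". *)
let find_config ?cli_path () : string option =
  List.find_opt Sys.file_exists (candidate_paths ?cli_path ())
```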

Preliminary Testing: Part 1

Looks like we're ready to begin testing! Not bad. We're going to select some emails at random from our collection and give them a good poke.

Getting the Emails

The email collections can be found in your file share here:

/copies

For any given mailbox in an email backup, the plaintext versions of the emails for most slash all of those collections should be located at this path:

/copies/ICU.SPCL.NAME-OF-PERSON/NUMERIC-ID/Data/Working - Copy/maildir/[email protected]/[email protected] (Primary)/Top of Information Store/NAME-OF-MAILBOX

Note that the folder is called maildir but the mailbox is not in fact in Maildir format—this is due to an old mistake that Matt made. However, that is where you'll find each email as a separate file, converted from the .pst original.

Getting the Document Ready

I guess it's finally time to be minimally less informal about documentation for the project. Go ahead and create a folder called doc in the root of the Attachment Converter project, and move file-format-conversions.md (which is currently our one and only doc) in there.

Then, create a new document called doc/preliminary_tests1.md. That will be where you write up the results of your testing.

Testing

Based on the executable code from #21, you should be able to test Attachment Converter out on a given email as follows:

$ dune exec -- ./main.exe < PATH-TO-EMAIL

For this issue, please select a sequence of emails at random, from a mix of different collections, and try running the above command on them with the default configuration. See if you can choose a range of emails that cover the different formats we're looking to convert by default. If Attachment Converter gives you reasonable shell output, use to_mbox to make that output into an MBOX and see whether it loads in Apple Mail, looks the way we want it to look, and so forth. More specifically, for each email you look at:

provide the path to the email on the network share
indicate the commit of the executable code you're running, so that once we've modified the executable code later on, we know which version had this behavior
say whether the conversions seem to have produced a valid email with converted attachments appearing where we'd like them to appear
if they have, you can move on to the next email
if not, then choose among our 'deal with a problem' options

Our 'deal with a problem' options are these:

correct the bug leading to the aberrant behavior, add any unit tests that aberrant behavior inspired, and briefly describe all of the above in doc/preliminary_tests1.md
describe the aberrant behavior in doc/preliminary_tests1.md and note that we will plan to address it in a future issue

Roughly, if the bug you've discovered seems like it's going to require some big rewriting/redesign, you can feel free to take the second option. If it seems like it's easily fixed, feel free to take the first option. We'll leave it up to you what you decide is best in each case.

You've done enough testing for this issue once either of these things happens:

you've uncovered 5 emails for which there is at least one problem
you've looked at a total of 20 emails

We might want to adjust those numbers as we go, but it seems like a good starting point for now.

Implement `convert`

Write `convert`, which will call out to UNIX

Issue #1 left convert in the module instantiating the CONVERT signature with OCamlnet as a backend undefined; this issue will supply a definition for it.

convert should be of the following type:

val convert : filepath -> string -> string

We may fancy this up later, but for now you can just assume that in the relevant module filepath is a type alias for string:

type filepath = string

convert should take a full path to a command line executable as input and return a function that calls out to that executable to map one string to another. OCaml strings are just sequences of bytes, with no character encoding information present in the value, so the output of convert is really just a 'raw data transformation' function.
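One possible shape for convert, as a minimal sketch rather than the project's implementation. It assumes the executable behaves as a filter (reads standard input, writes standard output) and takes no arguments. To stay within the standard library it goes through temporary files and Sys.command instead of pipes; the helper names write_file and read_file are hypothetical.

```ocaml
type filepath = string

(* Write [data] to [path] as raw bytes. *)
let write_file path data =
  let oc = open_out_bin path in
  Fun.protect ~finally:(fun () -> close_out_noerr oc)
    (fun () -> output_string oc data)

(* Read the whole file at [path] as raw bytes. *)
let read_file path =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* [convert prog] returns a function that feeds its argument to [prog]
   on standard input and returns whatever [prog] wrote to standard output.
   Assumption: [prog] is a filter that needs no command line arguments. *)
let convert (prog : filepath) : string -> string =
  fun input ->
    let inf = Filename.temp_file "attc_in" ".bin" in
    let outf = Filename.temp_file "attc_out" ".bin" in
    Fun.protect
      ~finally:(fun () -> Sys.remove inf; Sys.remove outf)
      (fun () ->
        write_file inf input;
        let cmd =
          Printf.sprintf "%s < %s > %s"
            (Filename.quote prog) (Filename.quote inf) (Filename.quote outf)
        in
        match Sys.command cmd with
        | 0 -> read_file outf
        | n -> failwith (Printf.sprintf "%s exited with status %d" prog n))
```

For instance, convert "/bin/cat" is the identity transformation on raw bytes, which makes it a handy smoke test before plugging in a real conversion utility.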

Re-implement owen-practicum using ocamlnet

Picking up from the winter practicum

The goal for this issue will be to re-implement the functionality of @OwenPriceSkelly's code written for the Winter 2021 Masters Program in Computer Science practicum, using ocamlnet rather than mrmime as a backend.

The Interface

I expect us to adjust the details of this spec quite a bit as the project continues, but here's something to get the ball rolling:

module type CONVERT =
  sig
    type filepath
    type parsetree
    val parse : string -> parsetree
    val amap : ('a -> 'b) -> parsetree -> parsetree
    val acopy : ('a -> 'b) -> parsetree -> parsetree
    val to_string : parsetree -> string
    val convert : filepath -> string -> string
    val acopy_email : string -> (string -> string) -> string
  end

The filepath and parsetree types can be set to something distinct every time we implement a new module with this signature. As for the functions:

parse will map the string containing an entire email into a parsetree.
amap will return a new email parsetree that is the result of applying an input function f to every attachment in the input parsetree.
acopy will return a new email parsetree that adds the result of applying an input function f to every given attachment, next to the original attachment
to_string will serialize an email parsetree back out into a syntactically well-formed string
convert will map the filepath for a command line utility fitting the specification of a UNIX pipe to a function that transforms data in one format to data in the other format (for example, Word .doc to PDF-A-1b).
acopy_email will map an email string to a new email string with all of its attachments converted, as per the function of acopy
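To make the difference between amap and acopy concrete, here is a toy model that has nothing to do with ocamlnet. The part type, its constructors, and the flat-list parsetree are all assumptions made for illustration, and the function argument is narrowed to string -> string rather than the ('a -> 'b) given in the draft signature.

```ocaml
(* Toy stand-in for an email: a flat list of parts. *)
type part = Body of string | Attachment of string
type parsetree = part list

(* Replace every attachment with the result of applying [f] to it. *)
let amap (f : string -> string) (tree : parsetree) : parsetree =
  List.map (function Attachment a -> Attachment (f a) | p -> p) tree

(* Keep every attachment and add its converted copy right after it. *)
let acopy (f : string -> string) (tree : parsetree) : parsetree =
  List.concat_map
    (function
      | Attachment a as original -> [ original; Attachment (f a) ]
      | p -> [ p ])
    tree
```

With tree = [Body "hi"; Attachment "doc"] and String.uppercase_ascii as the conversion, amap yields two parts while acopy yields three, the original attachment followed by its converted copy.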

Where to put the code

I think lib/lib.ml is a good place for this code for now. You can put it in a module inside of that file called Convert_ocamlnet, or something along those lines.

Using `ocamlnet`

Goal: implement a module with signature CONVERT using ocamlnet as a backend. Here's some example code that might be able to help you get started.

First, create a sample email:

$ cat - > /tmp/bill_clinton <<EOF
From:  Bill Clinton <[email protected]>
To:    Al (The Enforcer) Gore <[email protected]>
Subject:  Argentina Junket
MIME-Version: 1.0

Hey Big Al,

We haven't been to Argentina, have we? Can we schedule that in the
next couple of months? I'd like to find where Dan Quayle found those
little dolls ...

Bill
EOF

Then, to parse that file in the OCaml toplevel:

# let c = open_in "/tmp/bill_clinton";;
val c : in_channel = <abstr>
# let netc = new Netchannels.input_channel c;;
val netc : Netchannels.input_channel = <obj>
# let data = new Netstream.input_stream netc;;
val data : Netstream.input_stream =
  <NETSTREAM pos_in:0 window_length:327 eof=true>
# let ast = Netmime_channels.read_mime_message data;;
val ast : Netmime.complex_mime_message = (<obj>, `Body <obj>)
# close_in c;;
- : unit = ()

Unix stuff

To implement convert you'll need to do some Unix stuff. The unix package is enabled in the dune config, but rather than working with the lower-level Unix module directly, a good jumping off point is to use the higher-level functionality offered by Prelude, much of which can be found in Prelude.Proc:

https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Unix.Proc.html

UPDATE: for now, leave convert undefined: let convert = assert false. I am pulling the implementation of convert out into its own issue. (Issue #3)

Create a Collection of Dummy Documents

In line with creating a testing pipeline for generated emails, it would also be useful for these emails to have actual attachments. We should create a small collection of simple PDFs, images, etc. that can be added to the generated emails so that we get more realistic behavior from the application.

Better Timestamp Info in Header Field

We currently put a timestamp in the X-Attachment-Converter header field, but we should probably include a human-readable form as well so that users can quickly see when they last converted with a given tool. As a long-term goal, it would be great to add a feature that reads the converted mbox and produces a document listing all the conversion tools used and when.
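A minimal sketch of one human-readable form, ISO 8601 in UTC. It assumes the stored timestamp is a non-negative count of seconds since the Unix epoch, which the issue does not actually state, and the name iso8601_utc is hypothetical. In the project itself Unix.gmtime would be the usual route; plain arithmetic is used here only to keep the sketch free of dependencies.

```ocaml
(* Render non-negative epoch seconds as an ISO 8601 UTC string.
   The date part uses the standard days-to-civil-date conversion
   on the proleptic Gregorian calendar. *)
let iso8601_utc (secs : int) : string =
  if secs < 0 then invalid_arg "iso8601_utc: negative timestamp";
  let days = secs / 86_400 and rem = secs mod 86_400 in
  (* Shift the epoch to 0000-03-01 so leap days fall at year end. *)
  let z = days + 719_468 in
  let era = z / 146_097 in
  let doe = z - era * 146_097 in
  let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365 in
  let doy = doe - (365 * yoe + yoe / 4 - yoe / 100) in
  let mp = (5 * doy + 2) / 153 in
  let day = doy - (153 * mp + 2) / 5 + 1 in
  let month = if mp < 10 then mp + 3 else mp - 9 in
  let year = yoe + era * 400 + (if month <= 2 then 1 else 0) in
  Printf.sprintf "%04d-%02d-%02dT%02d:%02d:%02dZ"
    year month day (rem / 3_600) (rem mod 3_600 / 60) (rem mod 60)
```

For example, iso8601_utc 0 gives "1970-01-01T00:00:00Z".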

Research into unit testing that the destination file format is as intended

We would eventually like to be able to run a test suite that will make sure the data Attachment Converter outputs matches the MIME type specified for the relevant conversion in the config file.

For example, if the config file says that Attachment Converter should convert a Word .doc to plaintext, we would like to run a file format detection utility on the output to determine that it is indeed plaintext rather than some other format, such as .pdf.
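As a rough illustration of what such a check could look like, here is a crude magic-number test. It is only a stand-in for a real detection utility: the detected type and the detect function are hypothetical, and only the well-known PDF and TIFF signatures are recognized. Plaintext has no signature, so in this sketch it simply falls under Unknown.

```ocaml
type detected = Pdf | Tiff | Unknown

(* True when [s] begins with [prefix]. *)
let starts_with prefix s =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* Classify raw bytes by their leading signature. *)
let detect (data : string) : detected =
  if starts_with "%PDF-" data then Pdf
  else if starts_with "II*\000" data || starts_with "MM\000*" data then Tiff
  else Unknown
```

A test for the .doc to plaintext conversion could then assert that detect applied to the converted output is not Pdf.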

Utilities for identifying file formats

Matt's current top pick for a file format identification utility is:

conan

This seems promising, insofar as it exists not only as a shell utility but as an OCaml library. It is also, coincidentally, authored by the same developer who gave us Mr. Mime, which is one of our two email parsing backends.

Some more mainstream utilities for identifying file types include:

For this issue, there are two goals.

Goal 1: brainstorm a set of unit tests

First, write up some ideas for unit tests (could be 1 or 2, but possibly more) that we could run to try to catch bugs like the file format bug described above. Those ideas can go into a Markdown (or Org) file in the doc/ directory of this repository.

Goal 2: implement a simple cram test

We haven't tried running Cram tests yet, but apparently dune has the ability to do that. Before we start implementing actual Cram tests, let's try writing a trivial one, say that running attc --help prints the following help message:

> attc --help
ATTC(1)                           Attc Manual                          ATTC(1)



NAME
       attc - Converts email attachments.

SYNOPSIS
       attc [OPTION]… [ARG]

OPTIONS
       --config=PATH
           Sets the absolute path PATH to be checked for a configuration file.

       -r, --report
           Provides a list of all attachment types in a given mailbox.

       --report-params
           Prints a list of all MIME types in the input along with all header
           and field parameters that go with it.

       --single-email
           Converts email attachments assuming the input is a single plain
           text email.

COMMON OPTIONS
       --help[=FMT] (default=auto)
           Show this help in format FMT. The value FMT must be one of auto,
           pager, groff or plain. With auto, the format is pager or plain
           whenever the TERM env var is dumb or undefined.

EXIT STATUS
       attc exits with:

       0   on success.

       123 on indiscriminate errors reported on standard error.

       124 on command line parsing errors.

       125 on unexpected internal errors (bugs).



Attc                                                                   ATTC(1)

Here is a guide to writing a Cram test using dune:
https://dune.readthedocs.io/en/latest/tests.html#cram-tests

Once we have the world's simplest Cram test working, we can flesh our test suite out with a wider range of tests.

Idempotency of `acopy` and `amap`

We're at the point where we want to make sure it is possible to apply attachment-converter twice to an mbox without updating it. There are basically 4 behaviors we want to be possible.

Applying acopy twice only converts the original attachment, and only if there is a change in the conversion tool used.
Applying acopy converts everything it sees.
Applying amap only converts things that are not converted files.
Applying amap converts everything it sees.
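The two amap behaviors can be sketched on a toy model. The attachment record and its converted flag are assumptions for illustration; in the real tool the marker would presumably come from a header on the attachment rather than a boolean field.

```ocaml
(* Toy attachment: raw data plus a marker saying whether it is
   already the product of a conversion. *)
type attachment = { data : string; converted : bool }

(* Behavior "converts everything it sees". *)
let amap_all f atts =
  List.map (fun a -> { data = f a.data; converted = true }) atts

(* Behavior "only converts things that are not converted files". *)
let amap_skip_converted f atts =
  List.map
    (fun a -> if a.converted then a else { data = f a.data; converted = true })
    atts
```

Under this model amap_skip_converted is idempotent: a second pass finds every attachment already marked and leaves the list unchanged, whereas amap_all converts again on every pass.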

Create Default Configuration

Using the document from #10 and the configuration datatype (Formats.t) from #9, define a value of type Formats.t to be used as the default configuration for the project. It can go in lib/lib.ml.

val default_config : Formats.t

Maybe it'll kick in as a default if there's an issue reading the configuration file; maybe it'll kick in as a default in some other circumstances. We'll figure all that out later on while writing the executable code.

For now, the idea is to slightly revise the type of acopy and amap so that they have an optional configuration parameter and default to this value when none is provided. This ought to make testing in the REPL a good deal easier.
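The optional-parameter pattern might look roughly like this. The definition of Formats.t below is a hypothetical stand-in (the real type comes from #9), the contents of default_config are illustrative only, and targets is a made-up function used in place of acopy and amap to keep the sketch short.

```ocaml
module Formats = struct
  (* Hypothetical stand-in: source MIME type -> target MIME types. *)
  type t = (string * string list) list
end

(* Illustrative contents only. *)
let default_config : Formats.t =
  [ ("application/pdf", [ "text/plain" ]);
    ("application/msword", [ "application/pdf"; "text/plain" ]) ]

(* Look up the targets for [mime]; falls back to [default_config]
   when no ~config argument is supplied. *)
let targets ?(config = default_config) (mime : string) : string list =
  Option.value (List.assoc_opt mime config) ~default:[]
```

In the REPL, targets "application/pdf" uses the default, while targets ~config:[] "application/pdf" overrides it and returns the empty list.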

Add meta-data header to attachments

This is primarily in pursuit of idempotence of acopy and amap, but I thought I'd make a separate issue. We want to add a new header in the attachment of the form

X-Attachment-Converter: converted;
    source-type="application/pdf";
    target-type="application/pdf";
    original-file-name="name.pdf";
    timestamp="...";
    ...
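A sketch of how such a header could be rendered from a record. The record type conversion_meta and the function render_header are hypothetical, only the four parameters spelled out above are covered, and no escaping of quotes inside values is attempted.

```ocaml
type conversion_meta = {
  source_type : string;
  target_type : string;
  original_file_name : string;
  timestamp : string;
}

(* Render the header with one parameter per continuation line.
   Assumption: values contain no double quotes or line breaks. *)
let render_header (m : conversion_meta) : string =
  String.concat ";\n    "
    [ "X-Attachment-Converter: converted";
      Printf.sprintf "source-type=\"%s\"" m.source_type;
      Printf.sprintf "target-type=\"%s\"" m.target_type;
      Printf.sprintf "original-file-name=\"%s\"" m.original_file_name;
      Printf.sprintf "timestamp=\"%s\"" m.timestamp ]
```

The continuation lines start with whitespace, which is what lets a single header field span several lines in an email.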

Speedup: MBOX mode should do some automatic parallelization

The problem

Currently, Attachment Converter takes a long time to process an MBOX. For example, here is an 8.8 MB MBOX file:

> ls -lah example.mbox 
Permissions Size User   Date Modified  Name
.rw-r--r--  8.8M me     5 May 15:58    example.mbox

It contains a pretty decent number of attachments in the formats we are looking for:

> attc -r example.mbox 
Content Types:
  application/msword : 24
  application/octet-stream : 3
  application/pdf : 10
  application/rtf : 32
  image/jpeg : 7
  message/rfc822 : 29
  multipart/alternative : 1
  multipart/mixed : 315
  text/plain : 311

And it takes 74 minutes to process:

> time attc < example.mbox > example_converted.mbox 2> example_errors.mbox

________________________________________________________
Executed in   73.57 mins    fish           external
   usr time   54.65 secs    0.00 millis   54.65 secs
   sys time   11.61 secs    1.34 millis   11.61 secs

We will undoubtedly learn more after profiling the code, but a quick eyeball running that same command with progress bar output shows that each conversion to a PDF-A is taking a long time. This is likely due to the fact that we are currently using LibreOffice to do the following conversions:

PDF >> PDF-A
DOC >> PDF-A
DOCX >> PDF-A

It will probably be a good idea at some point to explore utilities that can perform these conversions faster, since LibreOffice has introduced other inconveniences as well (needing to create a profile in order to be run on the command line, requiring a running X session when run on Linux, etc.). However, we will postpone that to a future issue and focus here on parallelizing the code.

Possible approaches

This part of the issue will theoretically get fleshed out once we learn more about the joys of parallel code in the modern era. However, here are three starting points to look into. The parmap package is set up to do CPU-bound list map calculations in parallel. It is pre-multicore OCaml. The parany package, following the release of multicore OCaml, has been re-implemented using domainslib. Finally, there is domainslib itself, which is lower-level than either of the previous libraries, but which does expose a "parallel for loop" function that may be adaptable to our use case.
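For a feel of what a parallel map looks like without any of those libraries, here is a small sketch built directly on the Domain module from the OCaml 5 standard library. It assumes OCaml 5 or later; chunks and par_map are hypothetical names, and this is not how parmap, parany, or domainslib are implemented.

```ocaml
(* Split [xs] into at most [n] contiguous chunks of near-equal size. *)
let chunks n xs =
  let n = max 1 n in
  let size = max 1 ((List.length xs + n - 1) / n) in
  let rec go acc cur k = function
    | [] ->
      List.rev (match cur with [] -> acc | _ -> List.rev cur :: acc)
    | x :: rest ->
      if k = size then go (List.rev cur :: acc) [ x ] 1 rest
      else go acc (x :: cur) (k + 1) rest
  in
  go [] [] 0 xs

(* Map [f] over [xs] using one domain per chunk, preserving order. *)
let par_map ~domains f xs =
  chunks domains xs
  |> List.map (fun chunk -> Domain.spawn (fun () -> List.map f chunk))
  |> List.concat_map Domain.join
```

Applied to the list of emails in an MBOX with a per-email conversion function as f, this would run up to ~domains conversions at once; whether that actually helps depends on how well the external conversion utility tolerates concurrent invocations.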

Move to `cmdliner` for command line interface

Currently, we have a simple command line interface that splits the command line inputs on whitespace and examines them. It works great for present purposes. But as one last Spring-quarter gesture towards making this application Unix-tastic, we would like to use Daniel Buenzli's cmdliner library. Moving to cmdliner will get us the following things:

the order of the command line switches won't matter
the switches will be typeable in short (e.g. -d, -r), verbose (e.g. --delete, --recursive), and grouped-together form (e.g. -dr)
it'll automatically generate USAGE messages
it'll automatically generate a man page
it has a notion of a subcommand, and can automatically generate error messages and man pages for those

Given that our goal for the initial Email Archives: Building Capacity and Community grant period is to have a working command line application in the UNIX style, this seems like a good step in the direction of our destination app architecture.

The Interface Itself

Our initial discussions seem to have landed on creating three modes that Attachment Converter can be run in:

MBOX Mode
Single-Email Mode
Report Mode

Attachment Converter will default to the first of these three when run without any switches. All three of these modes will work in filter style, reading from standard in by default when no command line arguments are supplied, and reading from an input filepath when a filepath is supplied.

MBOX Mode

In MBOX mode, Attachment Converter will assume the input is in the form of an MBOX, and it will perform all the conversions the configuration file says to perform. Its behavior should be as follows:

by default, read from standard in and print to standard out
if it is passed a single filepath, read data in from that file and print to standard out
reject more than one filepath with an error message
reject a filepath pointing at a directory with an error message
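The four rules above amount to a small decision function over the positional arguments. This sketch is independent of cmdliner; the input type, resolve_input, and the error message wording are all hypothetical.

```ocaml
type input = Stdin | File of string

(* Decide where to read from, given the positional filepaths. *)
let resolve_input (paths : string list) : (input, string) result =
  match paths with
  | [] -> Ok Stdin
  | [ p ] when Sys.file_exists p && Sys.is_directory p ->
    Error (Printf.sprintf "%s is a directory" p)
  | [ p ] -> Ok (File p)
  | _ -> Error "expected at most one input file"
```

A nonexistent path is passed through as File here and left to fail when the file is opened; the issue does not say how that case should be reported.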

Single-Email Mode

In Single-Email mode, Attachment Converter will assume it is being given a plaintext email rather than a mailbox as an input. It will convert all attachments in the single email as per what is indicated in the configuration. This mode will activate when Attachment Converter is given the following command line switch:

--single-email

Its behavior is as follows:

by default, read from standard in and print to standard out
if it is given a single filepath, read the data from that instead and print to standard out
if passed more than one filepath, it does nothing and prints an error message
if passed a directory filepath, it does nothing and prints an error message

Report Mode

Report Mode provides a list of all the attachment types in a given mailbox. See issue #15 for details on how that feature turned out. There are two ways of running report mode: a normal version and a verbose version. The command line switch for normal reporting is:

--report

This prints a list of all the MIME types that occur in the input, alphabetically sorted and deduped.
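The sorting and deduplication step is small enough to sketch. This version also keeps a count per type, in the spirit of the attc -r output shown earlier in this document; the function name report and the choice to return pairs are assumptions, not the actual report.ml code.

```ocaml
(* Collapse a list of MIME types into alphabetically sorted
   (type, count) pairs with no duplicates. *)
let report (types : string list) : (string * int) list =
  List.sort String.compare types
  |> List.fold_left
       (fun acc t ->
         match acc with
         | (t', n) :: rest when t' = t -> (t', n + 1) :: rest
         | _ -> (t, 1) :: acc)
       []
  |> List.rev
```

Dropping the counts with List.map fst gives exactly the sorted, deduped list of types described above.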

The command line switch for verbose reporting is:

--report-params

This prints a list of all the MIME types that occur in the input MBOX, along with all the header field parameters that go along with them. (The MIME type in an email is indicated in the Content-Type header, and that header, like all headers, is allowed to have an arbitrary number of header field parameters.) The typical example of a header field parameter we've come across in this context gives the character encoding.

Whether report mode is run verbosely or non-verbosely, the app reads from standard in and writes to standard out, the same way it does in MBOX and Single-Email modes.

Where to put the code

The code for this interface can most likely go into main.ml in the root of the project. I'll leave it to the developer's discretion about whether to break it out into a separate file.
