uchicago-library / attachment-converter
Attachment Converter: tool for batch converting attachments in an email mailbox
License: GNU General Public License v2.0
The goal in this issue is to write up a quick first-stab reference document that will enable us to use whatever file format conversions we want to use for testing.
Here's the current list of format conversions we're planning to test with during development:
Source | Target |
---|---|
PDF-A 1b | |
plaintext | |
.doc | PDF-A 1b |
.doc | plaintext |
.docx | PDF-A 1b |
.docx | plaintext |
.xls | TSV |
.xlsx | TSV |
.gif | TIFF |
.bmp | TIFF |
.jpg | TIFF |
What we want, for each of the conversions in this table, is a command you can run at the UNIX shell to perform them. It doesn't need to have variables, or abstract command line options away, or anything like that. Just a concrete command you can type in using filenames like e.g. input.doc or output.pdf. We will determine how to handle command line options and different assumptions about input/output in a later issue.
Quick example. To convert a .doc to a PDF, you can use LibreOffice on the command line:
$ soffice --headless --convert-to pdf:writer_pdf_Export --outdir . input.doc
So in other words, the goal of this issue is to compile a list of example shell commands like this, together with information about what software packages must be installed to run them. We can put it in the root directory of the project for now, and make it either Org or Markdown. (Follow your bliss!) Each section of your document can just indicate:
In addition to the above example, which converts a .doc to a PDF, I believe LibreOffice can be used similarly to convert a .doc to plaintext, and also to convert an .xls to PDF or plaintext.
pdf2archive
The utility pdf2archive can be used to convert a plain vanilla PDF to an archival PDF-A 1b. (In fact, that utility is just a shell script that handles the labyrinthine command line options that Ghostscript requires to produce a PDF that meets the demanding PDF-A 1b spec.)
pandoc
I believe pandoc (written by philosopher of language and Haskeller extraordinaire John MacFarlane) can be used to convert .docx to either PDF or plaintext. See if you can figure out how using some combination of the documentation, Stack Overflow, and anything else that could be useful.
If I recall correctly, pandoc can also convert .xls files to TSV format.
imagemagick and vips can, I believe, be used to convert images in most standard formats to TIFF. I haven't looked at them in depth, but please feel free to experiment with other utilities too.
https://imagemagick.org/
https://www.libvips.org/
Please feel free to pose any questions you may have on our Slack channel! Hopefully the above is enough to get started.
Currently, we append a simple time stamp to the name of a converted file. This is problematic if we want to convert a file to the same format with multiple tools. We do keep track of an id for each conversion which can be added to the filename.
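A minimal sketch of how the conversion id could be folded into the output name, so that two tools converting the same file to the same format no longer collide. The function name, the argument names, and the underscore-separated layout are assumptions for illustration; only the idea of adding the id comes from the issue.

```ocaml
(* Hypothetical helper: build an output filename from the original
   attachment name, the id of the conversion, a timestamp string, and
   the target extension (given without its leading dot). *)
let converted_name ~id ~timestamp ~target_ext original =
  let base = Filename.remove_extension (Filename.basename original) in
  Printf.sprintf "%s_%s_%s.%s" base id timestamp target_ext
```

With this layout, converting report.doc with two different conversion ids at the same timestamp yields two distinct names.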
You'll notice that our main.ml module currently does nothing except print a message saying that it is converting some attachments.
We will do a full command line interface spec in a later issue, probably using Daniel Buenzli's cmdliner library, but for this issue, the goal is just to get the world's simplest command line executable module up and running, in main.ml. Just splitting the argv list on whitespace should do it for these purposes. (Note that Prelude defines an argv constant in the form of a list, which tends to be easier to work with than the argv array that Stdlib provides. It also provides an argv0 constant, in case you need to make reference to the name under which the utility is run.)
Beyond that, this first stab at an executable should:
Later on, when we flesh this out into a more mature 'minimum viable product' version of the command line interface, we will probably retain this 'pass one email into stdin and convert attachments in it to stdout' functionality as a way to run the utility, if need be, with the right command line options. It will likely be quite useful to have around for testing.
Now that we have determined how to run basic cram tests, the next step is to write some. For this issue, we'll focus on creating cram tests for our conversion shell scripts.
For each of the shell scripts in conversion-scripts, write a test which:
- runs the file command on the output of the script to determine that the output is of the correct file format
conversion-scripts as a cram dependency
Since dune creates a .sandbox directory to run cram tests in whenever they are run, we will need to figure out how to get dune to copy conversion-scripts into the .sandbox directory as well. Configuring dune is always a funky adventure, but it seems likely that this will involve using the deps option in the stanza that defines the cram tests in cram-tests/dune.
For more information:
https://dune.readthedocs.io/en/stable/tests.html#test-options
Once conversion-scripts has been added as a dependency within dune, it may also be necessary to add that directory to the path within the sandboxed environment. One simple thing we could try is to just add that directory inside the cram test, e.g.:
$ PATH=../conversion-scripts/:$PATH
acopy and amap in CONVERT module signature
Update CONVERT so that amap and acopy take two inputs: one for modifying the content-type mime header and one for modifying the data of an attachment. Since strings in OCaml are just sequences of bytes, the inputs to each of these functions should be functions of type string -> string.
New signature (for now, until we change it):
module type CONVERT =
sig
type filepath
type parsetree
type htransform = string -> string
type dtransform = string -> string
val parse : string -> parsetree
val amap : htransform -> dtransform -> parsetree -> parsetree
val acopy : htransform -> dtransform -> parsetree -> parsetree
val to_string : parsetree -> string
val convert : filepath -> string -> string
val acopy_email : string -> (string -> string) -> string
end
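To make the two-transform signature concrete, here is a small sketch over a toy tree. The toy_tree type is a stand-in for the real parsetree, and reading amap as "replace in place" and acopy as "keep the original and add a converted copy" is an interpretation of the names, not something the signature itself states.

```ocaml
type htransform = string -> string
type dtransform = string -> string

(* Toy stand-in for parsetree (an assumption): a leaf carries a
   Content-Type value and its data. *)
type toy_tree =
  | Leaf of string * string
  | Multipart of toy_tree list

(* amap: rewrite every leaf with the header and data transforms. *)
let rec amap (h : htransform) (d : dtransform) = function
  | Leaf (ctype, data) -> Leaf (h ctype, d data)
  | Multipart parts -> Multipart (List.map (amap h d) parts)

(* acopy: keep each original leaf and place a transformed copy next to it. *)
let rec acopy (h : htransform) (d : dtransform) tree =
  let copy_part = function
    | Leaf (ctype, data) as original -> [ original; Leaf (h ctype, d data) ]
    | Multipart _ as nested -> [ acopy h d nested ]
  in
  match tree with
  | Leaf _ -> Multipart (copy_part tree)
  | Multipart parts -> Multipart (List.concat_map copy_part parts)
```

Both transforms are plain string -> string functions, so a header transform can be as simple as a function that always returns the target MIME type.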
Per our discussion on the #email_archives Slack channel, it would make more sense for the progress bar to print to /dev/tty rather than to stderr.
Currently, the progress bar is getting printed to stderr, which makes it visible by default. However, if the user redirects stderr to a file, which they may want to do in order to carefully inspect the error messages, then the progress bar is silenced. (The reason for this is that our current progress bar-printing function only prints conditionally on stderr being a tty.)
It should probably instead implement the following logic:
- if stderr is a tty, do what we do now and print to stderr
- otherwise, if /dev/tty can't be opened, do nothing
- otherwise, if /dev/tty can be opened, open and print to it
acopy and amap to fit config spec
For this issue, update the type signature of acopy and amap so that instead of taking header and data transform inputs, they take Formats.t as an input. The type signature should work out to something approximately like this:
val acopy : Formats.t -> parsetree -> (parsetree, Formats.error) result
The new behavior should be: whenever acopy [amap] encounters an attachment, it consults the Formats.t dictionary to see which format conversions to perform on that attachment (if any).
This will also involve introducing some error handling into acopy and amap. For this issue, we don't need tons of error handling: just the basics corresponding to whatever error possibilities exist in the code from issue #9. As of right now, it looks like the only error possibilities we have in Formats.Error.t are a refer parse error and a config file parse error. I'll leave it to the assignee's discretion whether they want to e.g.:
- make a Lib.Error sum type that tagged-unions Formats.Error.t-s and another error datatype defined for email parse errors, mbox parsing errors, and so forth
- use Formats.Error.t directly
I'll mention that if we decide to make a global Lib.Error.t datatype, that feels like a great use case for a row sum type, i.e. polymorphic variants. That should allow us to "split up" the error cases between different modules, but have all the variants trace back to a single error datatype.
Anyway, I leave it to the assignee's discretion exactly how much of the error handling rabbit hole to go down for this issue.
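As a rough illustration of the polymorphic-variant idea, the sketch below splits error cases across two modules and unions them into one type. The module names Formats_error and Email_error, the tag names, and parse_config are all hypothetical; the real Formats.Error.t may look different.

```ocaml
(* Each module owns its own error cases (names are assumptions). *)
module Formats_error = struct
  type t = [ `ReferParse of string | `ConfigParse of string ]
end

module Email_error = struct
  type t = [ `EmailParse of string | `MboxParse of string ]
end

(* The global error type is just the union of the per-module rows. *)
type error = [ Formats_error.t | Email_error.t ]

let to_message : error -> string = function
  | `ReferParse msg -> "refer parse error: " ^ msg
  | `ConfigParse msg -> "config file parse error: " ^ msg
  | `EmailParse msg -> "email parse error: " ^ msg
  | `MboxParse msg -> "mbox parse error: " ^ msg

(* A function in one module can return its own cases with an open row,
   so callers can mix them with errors from other modules. *)
let parse_config (text : string) : (string, [> Formats_error.t ]) result =
  if String.trim text = "" then Error (`ConfigParse "empty config file")
  else Ok text
```

The open row in parse_config's return type is what lets its errors flow into to_message without an explicit wrapper constructor.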
.opam files
It has come to our attention that we don't actually need to be using the opam-lock utility for our build config. Whoops! opam-lock, it turns out, does the same thing as a pip freeze: it takes a working opam project and creates an Opam file for it (with the .opam.locked extension) consisting of the original .opam file with every version of every package locked down to its exact current version, plus all transitive dependencies.
That is unnecessary for current purposes. Our dune-project configuration generates .opam files, and we can build sandboxed switches from those directly.
For this issue, the goal is to:
- update the Makefile for the project to include make deps and make sandbox targets
- update doc/sandboxing.md so that it mentions these new make targets and omits all the complicated stuff involving opam-lock
make deps
Mirroring the design of spinup, the make deps rule will install all dependencies for the project to the current opam switch. This is a useful command to have lying around when you want to do dev work on this project without having to create an entire switch (which means waiting for a fresh OCaml compiler to build).
make sandbox
Also mirroring the design of spinup, the make sandbox rule will create a sandboxed switch in which the DLDC Opam repository is available, and which contains the latest OCaml compiler, plus Prelude, plus all of Attachment Converter's library dependencies from the opam repositories, and nothing else.
A nice enhancement of our current progress bar would be to have it display the number of attachments it has already converted vs. the total number of attachments it has to convert. This feature would probably only make sense when the application is run in standard mode (i.e. in mbox processing mode).
For example, instead of saying this:
Processing complete.
The progress bar could say something like this:
Attachment 14/20 has finished processing.
Etc. I leave the fine points of prose style up to the developer assigned to this feature.
report.ml?
We discussed the possibility of reusing some of the code from the Report module. Without going into detail here, the rough strategy could be:
- grep the mbox for lines containing Content-Disposition to get the total number of attachments
- grep the mbox for lines containing CONVERTED to get the number of attachments that were put there by our software
- subtract the second from the first to get the total number of original attachments, Y
- keep a count of the attachments processed so far, X
- print: processing attachment X out of Y
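The counting strategy above can be sketched with nothing but Stdlib string functions. This is an illustration under simple assumptions (one marker per line, the whole mbox already in memory), not the Report module's actual code; all function names here are hypothetical.

```ocaml
(* True when [sub] occurs somewhere in [s]. *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(* Number of lines in [text] that contain [sub]. *)
let count_lines ~sub text =
  String.split_on_char '\n' text
  |> List.filter (contains ~sub)
  |> List.length

(* Y: all attachments minus the ones our software added earlier. *)
let original_attachments mbox =
  count_lines ~sub:"Content-Disposition" mbox
  - count_lines ~sub:"CONVERTED" mbox

(* The message format follows the example given in this issue. *)
let progress_line ~finished ~total =
  Printf.sprintf "Attachment %d/%d has finished processing." finished total
```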
Currently, the Attachment Converter progress bar prints a before/after statement for every email, including those with no attachments and those with attachments it doesn't modify:
> attc < input.mbox > output.mbox
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.
Parsing email...
Processing email with structure...
=================================
Multipart
|-- Body
=================================
Email now has structure...
=================================
Multipart
|-- Body
=================================
Processing complete.
Etc. We would like to hush these progress bar messages for emails with no attachments and/or with unmodified attachments, so that the progress bar only displays the before/after skeletons when the after skeletons introduce new converted copies.
Post Issue #75, the email parsing backend will be configurable via a command line flag or configuration option. For this issue, define a combinator that can take in two email parsing backends and output a new email parsing backend that implements alternative logic, i.e. the logic of the (<|>) operator from Haskell.
The logic of A <|> B is, roughly:
- run A on the input; if it succeeds, use its result
- if A fails, run B on the same input and use its result
Broadly speaking, this could most likely become a module functor that would take two modules of signature PARSETREE and output a new module with the same signature as the output of Conversion.Make, which implements the above logic for two email parsing backends A and B. Another option might be to implement an alternative module functor for PARSETREE modules, i.e. one whose return signature is also PARSETREE.
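A sketch of the second option, a functor whose output has the same signature as its inputs. The BACKEND signature below is a deliberately tiny stand-in for PARSETREE (whose real contents are not shown here), with a string list playing the role of the parse tree; Strict and Lenient are made-up backends.

```ocaml
(* Stand-in for PARSETREE (an assumption): one parse function that can fail. *)
module type BACKEND = sig
  val name : string
  val parse : string -> (string list, string) result
end

(* Alt (A) (B) tries A first and falls back to B only when A fails. *)
module Alt (A : BACKEND) (B : BACKEND) : BACKEND = struct
  let name = A.name ^ " <|> " ^ B.name

  let parse input =
    match A.parse input with
    | Ok _ as ok -> ok
    | Error _ -> B.parse input
end

(* Two toy backends to show the combinator in use. *)
module Strict = struct
  let name = "strict"
  let parse s = if s = "" then Error "empty input" else Ok [ s ]
end

module Lenient = struct
  let name = "lenient"
  let parse s = Ok [ "fallback:" ^ s ]
end

module Combined = Alt (Strict) (Lenient)
```

Because Alt returns a BACKEND, combinations nest: Alt (Combined) (Other) would try three backends in order.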
Currently, the home-install target is throwing the following error on sequent (Matt's Arch Linux box).
> make home-install
dune build --display short
dune install --display short
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/META
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/META
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/dune-package
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/dune-package
Deleting /home/teichman/.opam/4.14.1/lib/attachment-converter/opam
Installing /home/teichman/.opam/4.14.1/lib/attachment-converter/opam
Deleting /home/teichman/.opam/4.14.1/bin/attachment-converter
Installing /home/teichman/.opam/4.14.1/bin/attachment-converter
Deleting /home/teichman/.opam/4.14.1/doc/attachment-converter/LICENSE
Installing /home/teichman/.opam/4.14.1/doc/attachment-converter/LICENSE
Deleting /home/teichman/.opam/4.14.1/doc/attachment-converter/README.org
Installing /home/teichman/.opam/4.14.1/doc/attachment-converter/README.org
ifeq [pacman --version]
libreoffice pandoc ghostscript libvips
else
libreoffice pandoc ghostscript vips verapdf
endif
echo Cloning attc git repo...
cd ~
mkdir attachment-converter
cd attachment-converter
git clone https://github.com/uchicago-library/attachment-converter.git
echo Copying shell scripts...
cd ~/attachment-converter
mkdir -p ~/.config/attachment-converter/scripts
cp conversion-scripts/*.sh ~/.config/attachment-converter/scripts
echo Installing to ~/bin/attc...
cp /home/teichman/.opam/4.14.1/bin/attachment-converter ~/bin/attc
ls -lh ~/bin/attc
echo Attachment Converter has been installed to ~/bin/attc.
echo Please ensure that ~/bin is on your path.
bash: line 1: ifeq: command not found
make: *** [GNUmakefile:85: home-install] Error 127
For this issue, see if you can:
a) reproduce the error on macOS
b) amend the code for the home-install target so that it works on both macOS and Linux platforms
The README.md file is starting to get a bit big. In this issue, we'll create a directory called attachment-converter/doc whose purpose is to house all sections of README.md that we choose to break out into separate files.
First on deck to go into the doc/ directory will be a document explaining the workflow for developing under a sandboxed Opam switch, which we'll call sandboxing.md. Sort of as a 'test run' for putting docs there. We'll also move file-format-conversions.md in there.
Task: create a progress bar for Attachment Converter.
So out of the box, our initial version of Attachment Converter isn't exactly the fastest performing application in the world. Some of our initial tests took about a minute to convert all the emails in an mbox with a total of 10-15 attachments.
Our major bottleneck is caused by the fact that Attachment Converter calls out to external applications to perform its conversions. We haven't done much benchmarking, but one reasonable starting assumption is that LibreOffice, which we are currently using heavily to convert to PDF-A, takes a while to do each one.
There are a couple "low hanging fruit" tactics we can take to lessen the runtime of the application on mbox-es of a realistic size:
We will continue to explore those options as we work on the project. That said, no matter how many of these approaches to speeding things up we end up adopting, it seems pretty clear that at least for large mbox-es we will need to be prepared for a full round of attachment conversions to take a while.
Given the apparent inevitability of some amount of slowness, Attachment Converter will need to send some indication to the user of what is happening.
Getting the progress bar to output useful information to the user while also outputting its actual data to standard out involves a little finessing of UNIX terminals and file handles.
Before we get into that, let's outline what information should be in the progress bar.
For this initial version, what we're calling a "progress bar" will just be a printed line of information with something like the following format:
converting <ATTACHMENT-FILENAME> to <ATTACHMENT-BASENAME>.<TARGET-EXTENSION> ...
It should print that line of information just before it begins each conversion, so that if that conversion takes five seconds, the user will see that line of information on the bottom of the screen for five seconds.
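The line format above is simple enough to pin down in a few lines. A sketch, where the function name and its labelled arguments are assumptions:

```ocaml
(* Build the "converting X to Y ..." line for one attachment.
   [target_ext] is given without its leading dot. *)
let progress_line ~filename ~target_ext =
  let base = Filename.remove_extension filename in
  Printf.sprintf "converting %s to %s.%s ..." filename base target_ext
```

For an attachment called minutes.docx and a target extension of pdf, this produces "converting minutes.docx to minutes.pdf ...".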
Broadly, we want to:
- open the tty as a file handle and print the progress bar text to that
That approach will display both the output data and the progress bar messages intermixed at the same time, in the terminal. Really, what we want is an either/or situation:
The following UNIX hijinks should give us that result:
tty
"?tty
The Unix module (which ships with the OCaml distribution as its own library, separate from Stdlib) provides a pretty comprehensive interface to UNIX system calls. You can use Unix.isatty to check whether a device is a tty. Since this takes a Unix file descriptor as an input (rather than a channel), use Unix.stdout rather than Stdlib.stdout (Pervasives.stdout in older OCaml versions).
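One way to keep the tty check separate from the choice of where to print is sketched below. It follows the fallback order discussed for the progress bar (stderr if it is a tty, otherwise /dev/tty if it can be opened, otherwise nothing). To stay free of library dependencies, the tty test is passed in as a flag; in the real code it would presumably come from Unix.isatty Unix.stderr. The type and function names are assumptions.

```ocaml
(* Where progress messages should go. *)
type target =
  | Stderr
  | Tty of out_channel
  | Silent

(* Try to open the controlling terminal; None if there isn't one. *)
let open_dev_tty () =
  try Some (open_out "/dev/tty") with Sys_error _ -> None

(* [stderr_is_tty] would be [Unix.isatty Unix.stderr] in real code;
   [open_tty] is a parameter so the decision can be tested in isolation. *)
let progress_target ~stderr_is_tty ~open_tty =
  if stderr_is_tty then Stderr
  else
    match open_tty () with
    | Some oc -> Tty oc
    | None -> Silent

(* Print one progress message to the chosen target. *)
let report target msg =
  match target with
  | Stderr -> prerr_endline msg
  | Tty oc ->
    output_string oc (msg ^ "\n");
    flush oc
  | Silent -> ()
```

A caller would compute the target once at startup, e.g. with ~open_tty:open_dev_tty, and reuse it for every message.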
An email which is only an attachment will have roughly the form
(header,
(Leaf data))
And so may be converted to something of the form
(~header,
(Multipart [(header, (Leaf data)),
(header, (Leaf data)),
(header, (Leaf data))]))
The question is: what should ~header be? We can't simply adopt the header from the original email since this header will have a content disposition of attachment or inline. This header should be the minimal one which indicates that the message is a multipart, and should include some custom header field which says that we constructed it.
Note that the function make_header can be used here.
Putting this issue here so that all our TODOs are in one place. We would like to eventually move the serialize.ml code out into the actual Mr. Mime project. This will involve contributing to it.
I'll fill this in with more detail later. The basic idea: we need a cleaner pipeline for generating fake emails and then writing quick unit tests for them. This means fleshing out an interface for building these tests.
Attachment Converter currently just assumes that all of its OS-level dependencies are installed, so for example, if it is run on an email containing a JPEG attachment and vips isn't installed, it will try to run vips anyway and we're in the wild west of uncaught exceptions. What it needs to do is verify that all OS-level dependencies are installed when it is run and refuse to do anything if it can't find any of them, printing an error message describing which utilities need to be installed.
We'll eventually need to take a careful look at all of Attachment Converter's OS-level dependencies, but for a first stab, we can write code that:
Prelude contains some helpful library functions to check whether OS-level dependencies are installed. I will post more information about that in a follow-up on this issue.
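Until the Prelude helpers are pinned down, the check can be sketched with Stdlib alone by searching PATH. This assumes a Unix-like system (colon-separated PATH) and is not Prelude's API; every name below is hypothetical.

```ocaml
(* True when an entry named [exe] exists in some directory on PATH.
   Assumes a colon-separated PATH; does not check the executable bit. *)
let on_path exe =
  let dirs =
    match Sys.getenv_opt "PATH" with
    | Some path -> String.split_on_char ':' path
    | None -> []
  in
  List.exists (fun dir -> Sys.file_exists (Filename.concat dir exe)) dirs

(* The tools from [tools] that could not be found. *)
let missing ?(is_installed = on_path) tools =
  List.filter (fun tool -> not (is_installed tool)) tools

(* Refuse to continue, with a message naming what to install. *)
let check ?is_installed tools =
  match missing ?is_installed tools with
  | [] -> Ok ()
  | absent -> Error ("please install: " ^ String.concat ", " absent)
```

Running check once at startup, before any email is parsed, gives a single readable error instead of an uncaught exception in the middle of a conversion.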
Currently, if a conversion tool fails, attachment converter "converts" the file anyway by taking the original and copying it to a new name. This is obviously bad, and I think I was planning to fix it but forgot (my bad). I'm creating an issue for it just to log the error.
This is kind of a big task and may end up having to be split up into multiple issues. Nonetheless, we'll begin with a number of tasks in a single issue and split it up where applicable.
Let's take these in turn.
Some preliminary design brainstorming has led me to conclude, at least for now, that GNU Refer is a good choice for a config file format for Attachment Converter. (We can always change it later if need be.) Here are some reasons why.
- Prelude, the OCaml standard library we are using for this project, has a Refer parser; so we can parse config files in this format without incurring a dependency on Yet Another third-party library
Getting your hands on an actual formal spec for refer is kind of annoying; you have to look at the GNU refer manpage. Nonetheless, a) we already have a parser for it and b) a quick example should illustrate how the syntax works. Suppose we would like to go through an email collection and perform two conversions: we want to convert all Word doc files to plaintext, and we want to convert all Word docx files to PDF-A-1b. The config file for performing those two conversions could look like this:
%source_type application/msword
%target_type text/plain
%shell_command /bin/doc2txt

%source_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
%target_type application/pdf
%shell_command /bin/docx2pdf
(Those conversion utilities are fictional, for illustrative purposes. The real conversion utilities we'll be using will be complex invocations of a command line app with lots of options.)
So: a refer record is a key-value type of dealio: the percent sign followed by any string followed by a space gives you the field name. In between the space and the line break-followed-by-a-percent is the value. A refer database is simply a list of refer records, each one separated by two line breaks. That means we can have one record for each conversion we would like the app to perform.
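To make the record syntax concrete, here is a deliberately simplified reader. It is not Prelude's Refer parser and it only handles single-line values; lines that don't start with a percent sign are skipped. The function names are made up for this sketch.

```ocaml
(* Split "%key value" into Some (key, value); None for anything else. *)
let parse_field line =
  if String.length line > 1 && line.[0] = '%' then
    match String.index_opt line ' ' with
    | Some i ->
      let key = String.sub line 1 (i - 1) in
      let value = String.sub line (i + 1) (String.length line - i - 1) in
      Some (key, value)
    | None -> None
  else None

(* Group fields into records; a blank line ends the current record. *)
let parse_records text =
  let flush current acc =
    if current = [] then acc else List.rev current :: acc
  in
  let rec go current acc = function
    | [] -> List.rev (flush current acc)
    | line :: rest ->
      if String.trim line = "" then go [] (flush current acc) rest
      else (
        match parse_field line with
        | Some field -> go (field :: current) acc rest
        | None -> go current acc rest)
  in
  go [] [] (String.split_on_char '\n' text)
```

Each record comes back as an association list, so List.assoc_opt "source_type" record retrieves a field.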
We will eventually have to finesse the syntax for the shell_command value so that it can handle the distinction between:
However, we will wait until a later issue to add that bit of fanciness. (The rough plan will be to handle it using printf-like escape syntax.) For the purposes of getting up and running with something simple, assume for now that all command line utilities for performing conversions take a filepath as input and output to stdout.
What should we parse the datatype into? First things first: let's create a new file for the config information at lib/config.ml and put the following code there. This requires the following change to lib/dune:
(library
(name lib)
(libraries prelude versioj mrmime threads netstring unix)
(modules lib config)
(inline_tests (backend qtest.lib)))
I.e. what we had before, but with a modules S-expression whose tail includes lib and config.
A good first stab at laying out the datatype within config.ml would be something close to this:
module Formats = struct
type htransform = string -> string
type dtransform = string -> string
type variety =
| DataOnly of dtransform
| DataAndHeader of (htransform * dtransform)
| NoChange
module Dict = Map.Make (String)
type t = variety Dict.t
end
The Formats.variety datatype is a sum type whose purpose is to enumerate the different ways an email part could change/not change:
We should be able to parse a refer record of the type given above into a Formats.variety in the following way:
- DataOnly whose dtransform is supplied by the path to the command line utility, passed into convert from #1
- DataAndHeader whose dtransform is supplied by the filepath and whose htransform is a function mapping the source MIME type string to the target MIME type string indicated in the refer record
- NoChange based on an exhaustive list of all MIME types, or remove NoChange from the datatype (for now I think I like the idea of leaving it in)
We can work out the details of the parsing error messages (for badly formatted config files) etc. while implementing the config file parser. I expect we can use Prelude's parser to do the real parsing, then write a little wrapper code to do cleanup on the result of that plus whatever domain-specific error messaging we might want.
To get started with the parser, check out Prelude's documentation:
https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Refer.html
One final note on the idea behind the Formats.t dictionary datatype. The thought here is that Formats.t will be a lookup table with strings representing MIME types (such as application/pdf, text/plain, and so forth) as keys and Formats.variety-s as values. Then, when we are recursing through an email parsetree and come across a part of the email that is an attachment, what we'll do is examine the MIME header in the part we're looking at, look it up in the Formats.t dictionary, and the Formats.variety value the dictionary gives us back will tell us what to do with it: do nothing, convert just the data in the attachment, or convert both the data and the header in the attachment. See the next section for more info on how amap and acopy will have to be revised to make use of a Formats.t input in this way.
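The lookup described here can be sketched as follows. The Formats module is copied from the layout given earlier; variety_of_record and describe are hypothetical helpers, and the rule "same source and target type means DataOnly" is an assumption made for the sake of the example.

```ocaml
module Formats = struct
  type htransform = string -> string
  type dtransform = string -> string

  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)

  type t = variety Dict.t
end

(* Hypothetical: turn one config record into a variety. [run] stands in
   for "call the shell command on this data". *)
let variety_of_record ~source_type ~target_type ~(run : Formats.dtransform) =
  if source_type = target_type then Formats.DataOnly run
  else Formats.DataAndHeader ((fun _ -> target_type), run)

(* What to do with an attachment of the given MIME type. *)
let describe (formats : Formats.t) mime =
  match Formats.Dict.find_opt mime formats with
  | None | Some Formats.NoChange -> "leave as-is"
  | Some (Formats.DataOnly _) -> "convert data"
  | Some (Formats.DataAndHeader _) -> "convert data and header"

let example : Formats.t =
  Formats.Dict.empty
  |> Formats.Dict.add "application/msword"
       (variety_of_record ~source_type:"application/msword"
          ~target_type:"text/plain" ~run:String.uppercase_ascii)
```

A MIME type that is absent from the dictionary is treated the same as NoChange, which matches the "if it can't find it, do nothing" behavior described for amap and acopy.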
The above design requires some revisions to our earlier spec from issues #1 and #6.
We'll keep the name amap for now, though since amap is at this point nowhere near functorial, we will eventually want to ditch the name. But the type should be updated to something like this:
val amap : Formats.t -> parsetree -> (parsetree, Formats.error) result
Where error is a datatype we will probably have to revise heavily, but which can start off on these lines, inside the Formats module:
module Formats = struct
type htransform = string -> string
type dtransform = string -> string
type variety =
| DataOnly of dtransform
| DataAndHeader of (htransform * dtransform)
| NoChange
module Dict = Map.Make (String)
type t = variety Dict.t
module Error = struct
type t =
| ReferParse of string
| Unix of (string * Unix.error)
| MimeParse of string
| CharacterEncoding of string
end
type error = Error.t
end
Similar changes to acopy are in order:
val acopy : Formats.t -> parsetree -> (parsetree, Formats.error) result
When amap [acopy] encounters a new part of a mail, it will try to find the Content-Type header of that part in the input Formats.t dictionary. If it can't, then it won't do anything to that part. Otherwise, it looks that MIME header up in the input Formats.t dictionary to determine what kind of conversion to perform on that part of the mail, then converts accordingly.
One last thing. We will probably save this for a later issue, but it might be nice to make the first input to amap and acopy an optional parameter that defaults to some config we're planning to test with a lot.
The current collection of conversion scripts only covers a small collection of the tools considered in our docs. We need to write the rest of these scripts so that they can be tested, though the primary motivation for this is that pdf2pdfa is probably significantly faster than soffice.
I'll include in here that we should just include a default soffice profile in our project in the hope that that speeds this up a bit. We should also check if having the user set the profile by hand is any faster.
I'll flesh out this issue later, I just wanted to make sure it got added. It will be important in implementing idempotence.
Currently the code for lookup_param, lookup_unstructured, etc., is not that robust. In particular, they are case-sensitive. We could in theory do what mrmime does here, but it would probably be sufficient to eliminate case-sensitivity, whitespace-sensitivity, etc.
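A sketch of the "probably sufficient" option: normalize case and surrounding whitespace before comparing field names. The function lookup_field and the association-list representation are stand-ins; the real lookup_param and lookup_unstructured signatures are not shown here.

```ocaml
(* Compare header field names without regard to case or padding. *)
let normalize s = String.lowercase_ascii (String.trim s)

(* Hypothetical lookup over a list of (field name, value) pairs. *)
let lookup_field name fields =
  List.find_map
    (fun (key, value) ->
      if normalize key = normalize name then Some (String.trim value)
      else None)
    fields
```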
One of the benefits of using mrmime over ocamlnet is that mrmime handles messages. To handle messages with ocamlnet, we basically have to:
- use is_message to check if an email of the form (header, `Body body) is a message, which means checking its content-type by looking at header.
- parse body using of_string. There are two options here. Either put everything in the result monad or simply don't recurse if parsing fails.
- call replace_attachments on the resulting parse tree.
- convert the result back with to_string and put it back into the body of the message.
Note that this has to be done for both the `Body case and the `Parts case, as they are handled slightly differently.
Also note that there are multiple subtypes for the message mime type. The only one that is pertinent here is rfc822. Dealing with the other two subtypes should probably go on the docket, but perhaps a bit further in the future.
A small issue. We currently create a new header for converted attachments that only includes Content-Type, Content-Disposition and Content-Transfer-Encoding. I don't actually think I've ever seen any other headers, but no need to assume there aren't any. We can just take the header of the file we're converting and update it, carrying along all the other header values.
For this issue, implement the following two functions that 'query' an email for information:
val attachment_types : string -> string list
val mbox_attachments : string -> string list
attachment_types takes an email string as an input and outputs a list of the MIME type strings corresponding to all its attachments, e.g. ["application/pdf", "text/plain"]. This should be recursive, so that if the email has parts of parts of parts, and those have attachments, those all go into the output list. Might as well sort it in alphabetical order.
mbox_attachments does the same thing at the MBOX level. So given an MBOX string as an input, it returns a list of all the MIME types that appear anywhere in that input MBOX, in alphabetical order.
They can go in their own module called Report, either in Lib or in a separate file called lib/report.ml. The motivation behind putting them in their own module is that it seems likely we will be incorporating a bunch more "report" features into Attachment Converter, and they will eventually turn into command line options, or perhaps a subcommand.
Thus far, we have been developing Attachment Converter to work on a per-email basis. In this issue, we'll take our first baby steps towards making the application do what it is actually going to do, namely: provide a map from an input MBOX to an output MBOX.
The final version will most likely have certain optimizations in it. For example, it might stream the emails in from the input MBOX one at a time, so that each conversion process won't have more than one email loaded into memory at a time, and it might also perform as many conversions as it can in parallel. (Conceptually that's easy, given that no conversion will depend on the output of any other conversion, but we all know that parallelism can be harder than it looks to pull off safely. So I guess I'll say I'm cautiously optimistic!)
However, this preliminary version doesn't yet need to be optimized in either of those ways. It can load the entire input MBOX into memory, and it can perform every conversion it's going to perform sequentially. The focus for this go-round should just be on making the logic correct:
- map convert (with the appropriate configuration) over the list
Once we have working code to promote the per-email functionality of the application to the MBOX level, we can then tackle the fun task of making it as efficient as we can. The code can go either in lib/lib.ml or in a separate file called lib/mbox.ml. If it goes in lib/lib.ml, it should probably go in its own module. (The assignee could call the module Mbox, or something to that effect.)
One final note: don't forget about the to_mbox function, introduced into the code as part of #11. It ought to be useful for getting this code up and running.
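A sketch of the unoptimized, load-everything version. It assumes the usual mbox convention that each message starts at a line beginning with "From ", and it joins the converted messages back together by hand rather than calling to_mbox, whose signature is not shown here. split_mbox and map_mbox are hypothetical names.

```ocaml
(* Split an mbox string into messages; each message keeps its "From " line. *)
let split_mbox text =
  let is_from line = String.length line >= 5 && String.sub line 0 5 = "From " in
  let flush current acc =
    match current with
    | [] -> acc
    | _ -> String.concat "\n" (List.rev current) :: acc
  in
  let rec go current acc = function
    | [] -> List.rev (flush current acc)
    | line :: rest when is_from line -> go [ line ] (flush current acc) rest
    | line :: rest -> go (line :: current) acc rest
  in
  go [] [] (String.split_on_char '\n' text)

(* Apply [convert] to every message, sequentially, and rebuild the mbox. *)
let map_mbox convert text =
  split_mbox text |> List.map convert |> String.concat "\n"
```

Passing the identity function as convert returns the input unchanged, which makes a handy first sanity test before plugging in the real per-email conversion.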
As we get ready to release Attachment Converter, we need to start documenting the code in more detail. The first step of that is:
One of the ways we've customarily cut corners in the DLDC is to skip defining .mli files for our projects, which means this will be kind of a new endeavor for us. That said, we can recommend the following general approach, for each of the module files in the project:
- generate an .mli file using the ocaml-print-intf utility
Once we have used this technique to identify the minimal set of values that each module needs to expose, we can set about writing docstrings for those values, which are what will go into our auto-generated odoc documentation.
For more info on the syntax of odoc's markup language, please see:
https://ocaml.github.io/odoc/odoc_for_authors.html
I'm making a dummy issue for this problem (sorry Matt if you've already started). Feel free to populate it with more info.
Basically: generate the default config file from attc by serializing the default config.
Once we're getting closer to releasing Attachment Converter, we'll want to offer some options to help developers build it on their system. One traditional approach to this is to write up a detailed document explaining how to install all the relevant developer tools, os dependencies, and library dependencies, then say a bunch of hail marys and hope to heck it'll work, which it certainly won't. The reason this never works is that when you get something to work on your system, it probably only happened because of a bunch of different system environment settings that you either forgot or didn't know you had turned on. On another random person's machine, most of those assumptions won't hold anymore, and if you didn't realize your installation instructions were making them, they are pretty much guaranteed not to work.
In ride stateless build environments. The goal of a stateless build environment is to write down, explicitly, every last little thing that is required to compile some code, from OS-level dependencies to programming language libraries; the whole nine yards. Because the stateless build environment isn't drawing on any information that isn't available in the local config, you have a guarantee that if it builds on your machine, it will also build on the machine of any other developer who has the same stateless build tool installed.
In 2022, there are a number of options for getting The Stateless Build Config Of Your Dreams off the ground. In future issues we'll look at Nix and possibly also Esy--but for this issue, we'll start with what ought to be the easiest one to get working: sandboxed `opam` switches.
The basic idea is as follows. You create a special `opam` switch that is just for your project. (As opposed to a global switch; a global switch will typically have an exact OCaml compiler version for a name and contain pretty much every `opam` package you're planning to work with on your system.) You then set your build environment up the normal way, installing all packages you need until your project builds. Then you use a utility called `opam-lock` (somewhat similar in concept to `pip freeze`) to record information about which `opam` packages are installed in that switch.
Once you've created the opam lock file, another user who is in possession of it and has `opam` installed can use `opam` to create a brand new switch in the image of the lock file. Assuming there are no errors installing dependencies for your project, that other user should then be able to compile your project on their system, because they will have everything they need to do so in their new switch.
Here is a guide that can hopefully get you started:
https://khady.info/opam-sandbox.html
The goal for this issue is to perform the above steps for our project, create an opam lock file, and add it to our GitHub repository. You should be able to test it on your system, but for a more ambitious test you can choose another computer; maybe even the computer of someone who doesn't use OCaml!
The steps are, roughly:
- … `dune` to tell you what Attachment Converter's `opam` dependencies are
- … `opam-lock` utility to generate the lock file from your sandboxed switch
`acopy` should make a fresh copy of the entire message part instead of modifying the body in-place
This code can be approached as something we'll only use for casual testing while in development, rather than as part of our production code. (Though it may turn out to be easy to turn into something robust enough for production; we'll see.)
What we want for this issue is a function that will take a list of individual emails (each in the form of a string) and make it into an MBOX. This will provide a convenient way to take an email we generated, and see whether it will pass whatever we've decided is our standard of validation. (At present, our standard of validation is whether the file will open in Apple Mail, since Apple Mail is the only mail user agent (MUA) normal computer users would have heard of that can read MBOX.)
Something in the ballpark of this type signature ought to work:
```ocaml
val to_mbox : ?escape:bool -> string list -> string
```
`to_mbox` can go in a new module called `lib/mbox.ml`. As good a place as any for the time being. For more info on the optional `escape` parameter, please see below.
Much like the specification of email itself, the MBOX format is pretty nuts. The thing to note is that an MBOX is just a flat list. The emails appear in that list, and the delimiter the format uses is a `From ` string that looks like this:
```
From foo@bar Fri Jan 21 11:48:27 2022
```
Most MUAs will accept any line starting with `From ` (capital F, one space after the word) as an MBOX delimiter, but for maximum compatibility, an email address and date afterward are recommended. It doesn't matter what they are because they get ignored when the MBOX is parsed into a list of emails. Why? The `From ` line is just a delimiter that's considered to be a part of the mailbox; it isn't part of any email.
This is in contrast to lines beginning with `From:` (that's 'from' with a colon): those are actual email headers, which means any line starting with `From:` you encounter while flipping through a file will be part of an email.
For more background on this absurd weirdness, along with different conventions for escaping the MBOX delimiter, please see:
https://en.wikipedia.org/wiki/Mbox
`to_mbox` should take a list of emails (i.e. email strings) and do the following:
- … `From ` line resembling the example above---in fact, it can literally just be the exact example above every time, unless you want to get fancy and insert the current date/time
- … `\r\n` (CRLF) at the end of a given email, add two …
You can test that the result works by trying to import it into Apple Mail.
Because the delimiter for the MBOX format is the `From ` line, this leads to all the usual annoyances re: quoting and character escaping. For example, imagine that the following Classic Britney Lyrics were part of the body of an email:
And you didn't hear
All my joy through my tears
All my hopes through my fears
Did you know, still, I miss you somehow?
From the bottom of my broken heart
There's just a thing or two I'd like you to know
You were my first love, you were my true love
From the first kisses to the very last rose
From the bottom of my broken heart
Even though time may find me somebody new
You were my real love, I never knew love
'Til there was you
From the bottom of my broken heart
Baby, I said, please stay (stay)
Give our love a chance for one more day, oh
We could've worked things out (taking time is what it's all about)
Taking time is what love's all about (oh)
An MBOX parser would obviously not want to parse the above string into four separate emails. The traditional workaround, discussed in the Wikipedia article linked to above, is to replace all `From `-s with `>From `-s in the input string. That leads to further problems, because `>From ` could also theoretically be intended to be part of an email body.
A more robust way to handle this situation is to MIME-encode every email body using quoted-printable (rather than base64) encoding when parsing an MBOX:
https://en.wikipedia.org/wiki/Quoted-printable
This has the advantage of allowing you to sleep better at night re: parse errors, but the disadvantage of turning every single email in the input MBOX into a MIME-encoded email. For the purpose of being able to view things in an MUA that makes no difference, but for archival purposes, we generally want to err on the side of keeping as much of the original information in the input MBOX as we can intact. (Like, maybe Indiana Jones of the future is looking at Professor Smartypants' email backup and is interested in how many of their emails were MIME-encoded.) The jargon for this among archivists is 'original order':
https://en.wikipedia.org/wiki/Original_order
Keith and I discussed these trade-offs at some length and settled on the following solution for now. All the input MBOX-es we are planning to handle were either:
- … `libpst`
That means that we can safely assume 'the input has correctly-escaped `From ` lines' as an invariant, which in turn means that our handling of unescaped `From ` lines can be more minimal than it would be for the kind of recalcitrant input we are fully expecting to have to deal with. So for the purposes of `to_mbox`, I think we can get away with it having an optional parameter of type `bool`, call it `escape`. The behavior would be as follows:
- if `escape` is `true`, have `to_mbox` replace all occurrences of "`From `" with "`>From `" and all occurrences of "`>From `" with "`>>From `"
- if `escape` is `false` (which it will be by default), have `to_mbox` throw an exception when it encounters `From ` in any of the strings in the input list
You can define the relevant exception along these lines:
```ocaml
# exception MBOXParseError of string;;
exception MBOXParseError of string
# raise @@ MBOXParseError "whatever info you want in here";;
Exception: MBOXParseError "whatever info you want in here".
```
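Putting the pieces of this spec together, a first stab at `to_mbox` might look roughly like the sketch below. It makes a few assumptions that the spec does not settle: the escaping rules are read as applying to lines that begin with `From ` or `>From ` (rather than to any occurrence anywhere in a line), the same fixed delimiter is used every time, and a single `\n` is appended to any email that does not already end in one. The helper names are hypothetical.

```ocaml
exception MBOXParseError of string

(* The fixed delimiter from the example above; no date logic here. *)
let delimiter = "From foo@bar Fri Jan 21 11:48:27 2022\n"

let has_prefix prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* escape = true: "From " becomes ">From ", ">From " becomes ">>From ".
   Assumption: only line-initial occurrences count. *)
let escape_line line =
  if has_prefix "From " line || has_prefix ">From " line then ">" ^ line
  else line

(* escape = false: refuse input containing an unescaped From line. *)
let check_line line =
  if has_prefix "From " line then
    raise (MBOXParseError ("unescaped From line: " ^ line))
  else line

(* Assumption: make sure each email ends with a newline so the next
   delimiter starts on its own line. *)
let ensure_newline s =
  if s <> "" && s.[String.length s - 1] = '\n' then s else s ^ "\n"

let to_mbox ?(escape = false) (emails : string list) : string =
  let fix = if escape then escape_line else check_line in
  let prepare email =
    String.split_on_char '\n' email
    |> List.map fix
    |> String.concat "\n"
    |> ensure_newline
  in
  String.concat "" (List.map (fun email -> delimiter ^ prepare email) emails)
```

With this sketch, `to_mbox ~escape:true ["From x\n"]` quotes the offending line, while `to_mbox ["From x\n"]` raises `MBOXParseError`.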
Here is some example MBOX-parsing code from the precursor to `Prelude`, a standard library called `Kw`. You can use it as a basis for our MBOX parsing code---in fact, probably with few to no changes.
```ocaml
(** {1 Mbox parser ({i Xavier Leroy})}

    Snarfed from: <{{:http://cristal.inria.fr/~xleroy/software.html#spamoracle}http://cristal.inria.fr/~xleroy/software.html#spamoracle}>

    Hacked by KW 2010-05-13 <{{:http://www.lib.uchicago.edu/keith/}http://www.lib.uchicago.edu/keith/}>
    - added map and fold functionals

    @author Xavier Leroy, projet Cristal, INRIA Rocquencourt
*)

(***********************************************************************)
(* *)
(* SpamOracle -- a Bayesian spam filter *)
(* *)
(* Xavier Leroy, projet Cristal, INRIA Rocquencourt *)
(* *)
(* Copyright 2002 Institut National de Recherche en Informatique et *)
(* en Automatique. This file is distributed under the terms of the *)
(* GNU Public License version 2, http://www.gnu.org/licenses/gpl.txt *)
(* *)
(***********************************************************************)

(* $Id: mbox.ml,v 1.4 2002/08/26 09:35:25 xleroy Exp $ *)

(* Reading of a mailbox file and splitting into individual messages *)

type t =
  { ic: in_channel;
    zipped: bool;
    mutable start: string;
    buf: Buffer.t }

let open_mbox_file filename =
  if Filename.check_suffix filename ".gz" then
    { ic = Unix.open_process_in ("gunzip -c " ^ filename);
      zipped = true;
      start = "";
      buf = Buffer.create 50000 }
  else
    { ic = open_in filename;
      zipped = false;
      start = "";
      buf = Buffer.create 50000 }

let open_mbox_channel ic =
  { ic = ic;
    zipped = false;
    start = "";
    buf = Buffer.create 50000 }

let read_msg t =
  Buffer.clear t.buf;
  Buffer.add_string t.buf t.start;
  let rec read () =
    let line = input_line t.ic in
    if String.length line >= 5
    && String.sub line 0 5 = "From "
    && Buffer.length t.buf > 0 then begin
      t.start <- (line ^ "\n");
      Buffer.contents t.buf
    end else begin
      Buffer.add_string t.buf line;
      Buffer.add_char t.buf '\n';
      read ()
    end in
  try
    read()
  with End_of_file ->
    if Buffer.length t.buf > 0 then begin
      t.start <- "";
      Buffer.contents t.buf
    end else
      raise End_of_file

let close_mbox t =
  if t.zipped
  then ignore(Unix.close_process_in t.ic)
  else close_in t.ic

let mbox_file_iter filename fn =
  let ic = open_mbox_file filename in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

(** [mbox_file_fold fn inchan acc]: fold the function [fn] over the messages in the mbox file open on [inchan] with [acc] as initial accumulator. *)
let mbox_file_fold fn inchan acc = (* KW *)
  let ic = open_mbox_channel inchan in
  let rec loop acc =
    match try Some (read_msg ic) with End_of_file -> None with
    | Some msg -> loop (fn acc msg)
    | None -> acc
  in
  loop acc

(** [mbox_file_map fn filename]: map the function [fn] over the messages in the mbox file [filename]. *)
let mbox_file_map fn filename = (* KW *)
  let ic = open_in filename in
  try
    let result = List.rev (mbox_file_fold (fun acc msg -> fn msg::acc) ic []) in
    close_in ic;
    result
  with exn -> close_in ic; raise exn

let mbox_channel_iter inchan fn =
  let ic = open_mbox_channel inchan in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

let read_single_msg inchan =
  let res = Buffer.create 10000 in
  let buf = Bytes.create 1024 in
  let rec read () =
    let n = input inchan buf 0 (Bytes.length buf) in
    if n > 0 then begin
      Buffer.add_subbytes res buf 0 n;
      read ()
    end in
  read ();
  Buffer.contents res
```
Currently, to switch between Mr. Mime and Ocamlnet as backends, all you have to do is change this line in `convert.ml`:
```ocaml
module Converter = Ocamlnet_Converter
```
To this:
```ocaml
module Converter = Mrmime_Converter
```
For this issue, make the email backend an option in the configuration file (or in a command line flag, or both) and set the value of the `Converter` module according to what is in the configuration file. In the config file case, it could look something like this:
```
%backend mrmime
```
In the command line flag case, it could look approximately like this:
```shell
$ attc --backend=mrmime < input.mbox > output.mbox
```
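One way to pick the backend at runtime rather than at compile time is OCaml's first-class modules. The sketch below is only an illustration: the `CONVERTER` signature and the two stub modules are stand-ins for the real `Ocamlnet_Converter` and `Mrmime_Converter` (whose actual signature is not reproduced here), and `backend_of_string`, `backend_of_config_line`, and `backend_name` are hypothetical names.

```ocaml
(* Stand-in signature; the real backends expose much more than a name. *)
module type CONVERTER = sig
  val name : string
end

(* Stubs standing in for the project's two backends. *)
module Ocamlnet_Converter : CONVERTER = struct let name = "ocamlnet" end
module Mrmime_Converter : CONVERTER = struct let name = "mrmime" end

(* Map a backend name to a first-class module, if we recognize it. *)
let backend_of_string name =
  match String.lowercase_ascii name with
  | "ocamlnet" -> Some (module Ocamlnet_Converter : CONVERTER)
  | "mrmime" -> Some (module Mrmime_Converter : CONVERTER)
  | _ -> None

(* Parse a config line of the form "%backend NAME".
   Assumes exactly one space between the key and the value. *)
let backend_of_config_line line =
  match String.split_on_char ' ' (String.trim line) with
  | [ "%backend"; name ] -> backend_of_string name
  | _ -> None

(* Unpack the chosen module wherever the backend is needed. *)
let backend_name backend =
  let module C = (val backend : CONVERTER) in
  C.name
```

The same `backend_of_string` function could serve both the config file and a `--backend` flag, so the two stay in agreement about which names are valid.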
We currently have an incomplete definition of the mime type to extension function in `lib/configuration.ml`. This should be fleshed out completely at least for the most common mime types, and then we need to make sure to log the fact that we defaulted to using no extension in the case that we see an uncommon mime type.
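For illustration, a fleshed-out version of such a function could look something like the sketch below. The name `extension_of_mime_type` is hypothetical (the actual function in `lib/configuration.ml` is not shown here), the list of types is only a starting point, and logging is done with a plain `prerr_endline` rather than whatever logging the project settles on.

```ocaml
(* Sketch only: maps a MIME type to a file extension, and logs to
   standard error when it falls back to no extension at all. *)
let extension_of_mime_type (mime : string) : string =
  match String.lowercase_ascii mime with
  | "application/pdf" -> ".pdf"
  | "text/plain" -> ".txt"
  | "application/msword" -> ".doc"
  | "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ->
    ".docx"
  | "application/vnd.ms-excel" -> ".xls"
  | "application/rtf" -> ".rtf"
  | "image/gif" -> ".gif"
  | "image/bmp" -> ".bmp"
  | "image/jpeg" -> ".jpg"
  | "image/tiff" -> ".tif"
  | other ->
    prerr_endline ("no known extension for MIME type " ^ other
                   ^ "; defaulting to no extension");
    ""
```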
Move conversion code to its own file, better mirroring the structure of the rest of the code.
We've talked about the possibility that we might want to do the same kind of conversion that has already been done, but with a different tool. It also seems generally easier if we have an identifier for the conversions themselves, independent of the shell command. I think the most natural way to do this is to include a required `id` entry in the config:
```
%source_type application/pdf
%target_type application/pdf
%shell_command /Users/nmmull/Developer/Repositories/attachment-converter/conversion-scripts/soffice-wrapper.sh -i pdf -o pdf
%id soffice-pdf-to-pdfa
%source_type application/pdf
%target_type text/plain
%shell_command /Users/nmmull/Developer/Repositories/attachment-converter/conversion-scripts/pdftotext-wrapper.sh
%id pdftotext-pdf-to-txt
...
```
In this issue, we will create a database of emails for testing. 'Database' in quotes, of course; as this will involve no actual databases. Rather, the idea is simply to have a list of emails that are version-controlled, in this repository, that we can use for testing. (And which any developers who want to try the project out can use for testing.) What we want, for each source MIME type, is an email that contains an attachment of that source type.
To an extent, this follows the paradigm we follow on the UChicago Library Wagtail site, where there is a development database that looks similar to the production database, except that rather than having information about actual library staff stored in it, it has information about fictional Star Trek characters in it. It is notably smaller, because rather than having a development database that's comparable in size to the actual production database, the goal is to have one fictional staff member represent every 'staff member possibility' our application logic is meant to cover. Similarly here, our testing database need not be large, but we'd like it to have at least one example of every type of email we expect to encounter in the wild. Then we'll have something to write a lot of our unit test suite against.
My recommendation is to base each test email on an actual email from one of our collections, but scrub all personal information from it. That is:
- … `.doc` attachment, remove that from the email and replace it with a `.doc` of your creation, containing whatever dummy text you see fit to have it contain)
The emails should be in individual files, in a subdirectory of the `tests/` directory in the project. Could be `tests/test_emails`, or whatever the assignee would like to call it. I'll leave the exact naming scheme up to the preference of the assignee as well: it could be `email1`, `email2`, etc. or something else.
When Owen Price Skelly began work on this project, the library we were using for an email parsing backend was Mr. Mime. In the fall of 2021, we switched to using OCamlnet as our email parsing backend, mainly due to challenges getting everything up and running with Mr. Mime.
The main challenge was that Mr. Mime, astonishingly, does not come with a function to serialize parsetrees back into email strings. We learned the reason for that after conversing with the author of the library, and it was interesting. The reason is that Mr. Mime mainly exists in order to help developers write mail user agents in OCaml, in a way that is compatible with the Mirage unikernel ecosystem. But a mail user agent just needs to parse emails; once it has the data it needs, it works with that directly. There's no need to serialize a parsetree back into a well-formed email. So coincidentally, our use case never really came up.
Getting back to Attachment Converter, we would ideally like it to have multiple email parsing backends. There are a few reasons for this. One is that that should give us the flexibility to try parsing each email twice. Since the email specification is so complex, it is inevitable that any two email parsing libraries will differ at least slightly in what emails they consider to be syntactically well-formed. If Attachment Converter could try parsing with one backend and then with another in the event that that fails, it would be able to convert a larger range of attachments than otherwise. Who knows what we will come across in the wild? It's probably best to be as prepared as we can.
In the original Winter 2020 practicum version, the project did not yet have the structure it currently has. However, for your reference, I have put the original Mr. Mime code into a new branch called [mrmime-starter-code](https://github.com/uchicago-library/attachment-converter/tree/mrmime-starter-code). `mrmime_starter_code.ml` contains the source for the email parsing code using Mr. Mime and `mrmime_todos.org` contains notes on how far the Mr. Mime version of the email parsing backend progressed.
Here's a quick summary of what those notes say the starter code can do:
The remaining core functionality would involve:
In other words, it would involve implementing a module inhabiting the `CONVERT` signature for Mr. Mime. Maybe call it `Conversion_mrmime`, to follow our prior naming scheme?
Romain Calascibetta was gracious enough to provide us with example code that creates a new email from scratch. In principle, we should be able to use this to build the original email back up again from the converted attachment.
Here is a link to that code:
https://github.com/mirage/mrmime/blob/master/examples/attachment.ml
Currently, Attachment Converter looks for the config file (`.config`) in the current working directory, i.e. the directory where it is run from. This is ok for simple testing, but in practice, the user is going to be running it from wherever they run it from---and where they run it from certainly shouldn't affect the configuration it uses!
The standard approach in these situations is for a command-line tool to look for the configuration file in a couple canonical places. As a first stab, perhaps we could have Attachment Converter check for these locations in descending order:
- … (`attachment-converter --single-email --config=~/my-config-files/.acrc`)
- … `$AC_CONFIG`
- … `.config` directory, which is where a lot of UNIX applications keep their configuration files, so: `~/.config/attachment-converter/acrc`
- … `~/.acrc`
- … `configuration.ml`
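The lookup order above can be sketched with nothing but the standard library. This is a sketch under assumptions, not the project's code: `candidate_paths` and `find_config` are hypothetical names, the environment lookup and the file-existence check are passed in as parameters so the logic can be tried out without touching the real filesystem, and a result of `None` is taken to mean "fall back to the default in `configuration.ml`".

```ocaml
(* Candidate config locations, highest priority first:
   1. a path given on the command line,
   2. the AC_CONFIG environment variable,
   3. ~/.config/attachment-converter/acrc,
   4. ~/.acrc. *)
let candidate_paths ?cli_path ~getenv () =
  let home = getenv "HOME" in
  let under_home rel = Option.map (fun h -> Filename.concat h rel) home in
  List.filter_map Fun.id
    [ cli_path;
      getenv "AC_CONFIG";
      under_home ".config/attachment-converter/acrc";
      under_home ".acrc" ]

(* Return the first candidate that exists, or None if there is none,
   in which case the caller would use the built-in default config. *)
let find_config ?cli_path ?(getenv = Sys.getenv_opt)
    ?(exists = Sys.file_exists) () =
  List.find_opt exists (candidate_paths ?cli_path ~getenv ())
```

In normal use this is just `find_config ~cli_path ()` or `find_config ()`; the `getenv` and `exists` parameters exist only to make the function easy to test.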
Looks like we're ready to begin testing! Not bad. We're going to select some emails at random from our collection and give them a good poke.
The email collections can be found in your file share here:
/copies
For any given mailbox in an email backup, the plaintext versions of the emails for most slash all of those collections should be located at this path:
/copies/ICU.SPCL.NAME-OF-PERSON/NUMERIC-ID/Data/Working - Copy/maildir/[email protected]/[email protected] (Primary)/Top of Information Store/NAME-OF-MAILBOX
Note that the folder is called `maildir` but the mailbox is not in fact in Maildir format; this is due to an old mistake that Matt made. However, that is where you'll find each email as a separate file, converted from the `.pst` original.
I guess it's finally time to be minimally less informal about documentation for the project. Go ahead and create a folder called `doc` in the root of the Attachment Converter project, and move `file-format-conversions.md` (which is currently our one and only doc) in there.
Then, create a new document called `doc/preliminary_tests1.md`. That will be where you write up the results of your testing.
Based on the executable code from #21, you should be able to test Attachment Converter out on a given email as follows:
$ dune exec -- ./main.exe < PATH-TO-EMAIL
For this issue, please select a sequence of emails at random, from a mix of different collections, and try running the above command on them with the default configuration. See if you can choose a range of emails that cover the different formats we're looking to convert by default. If Attachment Converter gives you reasonable shell output, use `to_mbox` to make that output into an MBOX and see whether it loads in Apple Mail, looks the way we want it to look, and so forth. More specifically, for each email you look at:
Our 'deal with a problem' options are these:
- … `doc/preliminary_tests1.md`
- … `doc/preliminary_tests1.md` and note that we will plan to address it in a future issue
Roughly, if the bug you've discovered seems like it's going to require some big rewriting/redesign, you can feel free to take the second option. If it seems like it's easily fixed, feel free to take the first option. We'll leave it up to you what you decide is best in each case.
You've done enough testing for this issue once either of these things happens:
We might want to adjust those numbers as we go, but it seems like a good starting point for now.
… `convert`, which will call out to UNIX
Issue #1 left `convert` in the module instantiating the `CONVERT` signature with OCamlnet as a backend undefined; this issue will supply a definition for it.
`convert` should be of the following type:
```ocaml
val convert : filepath -> string -> string
```
We may fancy this up later, but for now you can just assume that in the relevant module `filepath` is a type alias for `string`:
```ocaml
type filepath = string
```
`convert` should take a full path to a command line executable as an input and output a function that calls out to that command line executable to map one `string` to another. OCaml strings are just sequences of bytes (no character encoding information is present in the value), so really, the output of `convert` is just a 'raw data transformation' function.
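As a hedged illustration of the shape `convert` could take, here is a sketch that uses only the standard library: it writes the input to a temporary file, runs the executable through the shell with standard input and output redirected, and reads the result back. This is a swapped-in technique chosen to keep the example self-contained; it is not the `Prelude`-based approach recommended elsewhere, and the helper names `read_file` and `write_file` are hypothetical.

```ocaml
type filepath = string

(* Read a whole file as raw bytes. *)
let read_file path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Write raw bytes to a file. *)
let write_file path data =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> output_string oc data)

(* Run [program] as a filter: [input] goes to its standard input and
   whatever it prints on standard output is returned. Assumes a POSIX
   shell is available to Sys.command. *)
let convert (program : filepath) (input : string) : string =
  let src = Filename.temp_file "attc" ".in" in
  let dst = Filename.temp_file "attc" ".out" in
  Fun.protect
    ~finally:(fun () -> Sys.remove src; Sys.remove dst)
    (fun () ->
      write_file src input;
      let command =
        Printf.sprintf "%s < %s > %s"
          (Filename.quote program) (Filename.quote src) (Filename.quote dst)
      in
      match Sys.command command with
      | 0 -> read_file dst
      | code ->
        failwith (Printf.sprintf "%s exited with status %d" program code))
```

For example, `convert "cat"` is the identity transformation on strings. Going through temporary files avoids the pipe-buffer deadlocks that a naive two-pipe implementation can run into, at the cost of some disk traffic.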
The goal for this issue will be to re-implement the functionality of @OwenPriceSkelly's code written for the Winter 2021 Masters Program in Computer Science practicum, using `ocamlnet` rather than `mrmime` as a backend.
I expect us to adjust the details of this spec quite a bit as the project continues, but here's something to get the ball rolling:
```ocaml
module type CONVERT =
sig
  type filepath
  type parsetree
  val parse : string -> parsetree
  val amap : ('a -> 'b) -> parsetree -> parsetree
  val acopy : ('a -> 'b) -> parsetree -> parsetree
  val to_string : parsetree -> string
  val convert : filepath -> string -> string
  val acopy_email : string -> (string -> string) -> string
end
```
The `filepath` and `parsetree` types can be set to something distinct every time we implement a new module with this signature. As for the functions:
- `parse` will map the string containing an entire email into a `parsetree`.
- `amap` will return a new email `parsetree` that is the result of applying an input function `f` to every attachment in the input parsetree.
- `acopy` will return a new email `parsetree` that adds the result of applying an input function `f` to every given attachment, next to the original attachment
- `to_string` will serialize an email `parsetree` back out into a syntactically well-formed string
- `convert` will map the filepath for a command line utility fitting the specification of a UNIX pipe to a function that transforms data in one format to data in the other format (for example, Word `.doc` to PDF-A-1b).
- `acopy_email` will map an email string to a new email string with all of its attachments converted, as per the function of `acopy`
I think `lib/lib.ml` is a good place for this code for now. You can put it in a module inside of that file called `Convert_ocamlnet`, or something on those lines.
… `ocamlnet`
Goal: implement a module with signature `CONVERT` using `ocamlnet` as a backend. Here's some example code that might be able to help you get started.
First, create a sample email:
```shell
$ cat - > /tmp/bill_clinton <<EOF
From: Bill Clinton <[email protected]>
To: Al (The Enforcer) Gore <[email protected]>
Subject: Argentina Junket
MIME-Version: 1.0

Hey Big Al,
We haven't been to Argentina, have we? Can we schedule that in the
next couple of months? I'd like to find where Dan Quayle found those
little dolls ...
Bill
EOF
```
Then, to parse that file in the OCaml toplevel:
```ocaml
# let c = open_in "/tmp/bill_clinton";;
val c : in_channel = <abstr>
# let netc = new Netchannels.input_channel c;;
val netc : Netchannels.input_channel = <obj>
# let data = new Netstream.input_stream netc;;
val data : Netstream.input_stream =
  <NETSTREAM pos_in:0 window_length:327 eof=true>
# let ast = Netmime_channels.read_mime_message data;;
val ast : Netmime.complex_mime_message = (<obj>, `Body <obj>)
# close_in c;;
- : unit = ()
```
To implement `convert` you'll need to do some Unix stuff. The `unix` package is enabled in the dune config, but rather than working with the lower-level `Unix` module directly, a good jumping off point is to use the higher-level functionality offered by `Prelude`, much of which can be found in `Prelude.Proc`:
https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Unix.Proc.html
UPDATE: for now, leave `convert` undefined: `let convert = assert false`. I am pulling the implementation of `convert` out into its own issue. (Issue #3)
In line with creating a testing pipeline for generated emails, it would also be useful for these emails to have actual attachments. We should create a small collection of simple PDFs, images, etc. that can be added to these generated emails, so we can get more realistic behavior from the application.
We currently put a timestamp in the `X-Attachment-Converter` header field, but we should probably include a human readable form as well so that users can quickly see when they last converted with a given tool. Long term goal, it would be great to add a feature which reads the converted mbox and creates a doc of all the conversion tools used and when.
We would eventually like to be able to run a test suite that will make sure the data Attachment Converter outputs matches the MIME type specified for the relevant conversion in the config file.
For example, if the config file says that Attachment Converter should convert a Word `.doc` to plaintext, we would like to run a file format detection utility on the output to determine that it is indeed plaintext rather than some other format, such as `.pdf`.
Matt's current top pick for a file format identification utility is:
This seems promising, insofar as it exists not only as a shell utility but as an OCaml library. It is also, coincidentally, authored by the same developer who gave us Mr. Mime, which is one of our two email parsing backends.
Some more mainstream utilities for identifying file types include:
For this issue, there are two goals.
First, write up some ideas for unit tests---could be 1 or 2, but possibly more---that we could run to try to catch bugs like the file format bug described above. Those ideas can go into a markdown (or org) file in the `doc/` directory of this repository.
We haven't tried running Cram tests yet, but apparently `dune` has the ability to do that. Before we start implementing actual Cram tests, let's try writing a trivial one, say that running `attc --help` prints the following help message:
> attc --help
ATTC(1) Attc Manual ATTC(1)
NAME
attc - Converts email attachments.
SYNOPSIS
attc [OPTION]… [ARG]
OPTIONS
--config=PATH
Sets the absolute path PATH to be checked for a configuration file.
-r, --report
Provides a list of all attachment types in a given mailbox.
--report-params
Prints a list of all MIME types in the input along with all header
and field parameters that go with it.
--single-email
Converts email attachments assuming the input is a single plain
text email.
COMMON OPTIONS
--help[=FMT] (default=auto)
Show this help in format FMT. The value FMT must be one of auto,
pager, groff or plain. With auto, the format is pager or plain
whenever the TERM env var is dumb or undefined.
EXIT STATUS
attc exits with:
0 on success.
123 on indiscriminate errors reported on standard error.
124 on command line parsing errors.
125 on unexpected internal errors (bugs).
Attc ATTC(1)
Here is a guide to writing a Cram test using `dune`:
https://dune.readthedocs.io/en/latest/tests.html#cram-tests
Once we have the world's simplest Cram test working, we can flesh our test suite out with a wider range of tests.
We're at the point that we want to make sure it is possible to apply `attachment-converter` twice to an mbox without updating it. There are basically 4 behaviors we want to be possible.
- … `acopy` twice only converts the original attachment, and only if there is a change in the conversion tool used.
- … `acopy` converts everything it sees.
- … `amap` only converts things that are not converted files.
- … `amap` converts everything it sees.
Using the document from #10 and the configuration datatype (`Formats.t`) from #9, define a value of type `Formats.t` to be used as the default configuration for the project. It can go in `lib/lib.ml`.
val default_config : Formats.t
Maybe it'll kick in as a default if there's an issue reading the configuration file; maybe it'll kick in as a default in some other circumstances. We'll figure all that out later on while writing the executable code.
For now, the idea is to slightly revise the type of `acopy` and `amap` so that they have an optional configuration parameter and default to this value when none is provided. This ought to make testing in the REPL a good deal easier.
This is primarily in pursuit of idempotence of `acopy` and `amap`, but I thought I'd make a separate issue. We want to add a new header in the attachment of the form
```
X-Attachment-Converter: converted;
  source-type="application/pdf";
  target-type="application/pdf";
  original-file-name="name.pdf";
  timestamp="...";
  ...
```
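A small helper for building that header value might look like the sketch below. The function name `conversion_header` is hypothetical, only the four parameters spelled out above are included (whatever the trailing `...` stands for is left out), and values are not escaped, so this assumes they contain no double quotes or newlines.

```ocaml
(* Build a folded X-Attachment-Converter header: one parameter per
   continuation line, each continuation line indented by two spaces. *)
let conversion_header ~source_type ~target_type ~original_file_name ~timestamp =
  let param (key, value) = Printf.sprintf "%s=\"%s\"" key value in
  let params =
    [ ("source-type", source_type);
      ("target-type", target_type);
      ("original-file-name", original_file_name);
      ("timestamp", timestamp) ]
  in
  String.concat ";\n  "
    ("X-Attachment-Converter: converted" :: List.map param params)
```

Keeping the header construction in one place should also make the idempotence checks for `acopy` and `amap` easier, since the same parameter names can be used when reading the header back.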
Currently, Attachment Converter takes a long time to process an MBOX. For example, here is an 8 MB MBOX file:
> ls -lah example.mbox
Permissions Size User Date Modified Name
.rw-r--r-- 8.8M me 5 May 15:58 example.mbox
It contains a pretty decent number of attachments in the formats we are looking for:
> attc -r example.mbox
Content Types:
application/msword : 24
application/octet-stream : 3
application/pdf : 10
application/rtf : 32
image/jpeg : 7
message/rfc822 : 29
multipart/alternative : 1
multipart/mixed : 315
text/plain : 311
And it takes 74 minutes to process:
> time attc < example.mbox > example_converted.mbox 2> example_errors.mbox
________________________________________________________
Executed in 73.57 mins fish external
usr time 54.65 secs 0.00 millis 54.65 secs
sys time 11.61 secs 1.34 millis 11.61 secs
We will undoubtedly learn more after profiling the code, but a quick eyeball running that same command with progress bar output shows that each conversion to a PDF-A is taking a long time. This is likely due to the fact that we are currently using LibreOffice to do the following conversions:
It will probably be a good idea at some point to explore utilities that can perform these conversions faster, since LibreOffice has introduced other inconveniences as well (needing to create a profile in order to be run on the command line, requiring a running X session when run on Linux, etc.). However, we will postpone that to a future issue and focus here on parallelizing the code.
This part of the issue will theoretically get fleshed out once we learn more about the joys of parallel code in the modern era. However, here are three starting points to look into. The parmap package is set up to do CPU-bound list map calculations in parallel. It is pre-multicore OCaml. The parany package, following the release of multicore OCaml, has been re-implemented using domainslib. Finally, there is domainslib itself, which is lower-level than either of the previous libraries, but which does expose a "parallel for loop" function that may be adaptable to our use case.
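To make the shape of a parallel list map concrete, here is a rough sketch that uses only the standard library's Domain module, so it assumes OCaml 5. It does not use parmap, parany, or domainslib, and the names chunk and parallel_map are hypothetical. Because the conversions shell out to external programs, how much this would actually help is something profiling would have to show.

```ocaml
(* Split a list into at most n order-preserving chunks of near-equal size. *)
let chunk (n : int) (xs : 'a list) : 'a list list =
  let n = max 1 n in
  let size = max 1 ((List.length xs + n - 1) / n) in
  let flush current acc =
    match current with [] -> acc | _ -> List.rev current :: acc
  in
  let rec go acc current count = function
    | [] -> List.rev (flush current acc)
    | x :: rest ->
      if count = size then go (flush current acc) [ x ] 1 rest
      else go acc (x :: current) (count + 1) rest
  in
  go [] [] 0 xs

(* Map f over xs, running each chunk in its own domain (OCaml 5 stdlib).
   Results come back in the original order. *)
let parallel_map ?(domains = 4) (f : 'a -> 'b) (xs : 'a list) : 'b list =
  chunk domains xs
  |> List.map (fun c -> Domain.spawn (fun () -> List.map f c))
  |> List.map Domain.join
  |> List.concat
```

The chunking keeps the number of spawned domains bounded by the domains argument rather than by the number of attachments.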
cmdliner for command line interface
Currently, we have a simple command line interface that splits the command line inputs on whitespace and examines them. It works great for present purposes. But as one last Spring-quarter gesture towards making this application Unix-tastic, we would like to use Daniel Buenzli's cmdliner library. Moving to cmdliner will get us the following things:
-d, -r), verbose (e.g. --delete, --recursive), and grouped-together form (e.g. -dr)
Given that our goal for the initial Email Archives: Building Capacity and Community grant period is to have a working command line application in the UNIX style, this seems like a good step in the direction of our destination app architecture.
Our initial discussions seem to have landed on creating three modes that Attachment Converter can be run in:
Attachment Converter will default to the first of these three when run without any switches. All three of these modes will work in filter style, reading from standard in by default when no command line arguments are supplied, and reading from an input filepath when a filepath is supplied.
In MBOX mode, Attachment Converter will assume the input is in the form of an MBOX, and it will perform all the conversions the configuration file says to perform. Its behavior should be as follows:
In Single-Email mode, Attachment Converter will assume it is being given a plaintext email rather than a mailbox as an input. It will convert all attachments in the single email as per what is indicated in the configuration. This mode will activate when Attachment Converter is given the following command line switch:
--single-email
Its behavior is as follows:
Report Mode provides a list of all the attachment types in a given mailbox. See issue #15 for details on how that feature turned out. There are two ways of running report mode: a normal version and a verbose version. The command line switch for normal reporting is:
--report
This prints a list of all the MIME types that occur in the input, alphabetically sorted and deduped.
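As a small illustration of "alphabetically sorted and deduped", assuming the MIME types have already been collected into a list of strings (the function name report is hypothetical):

```ocaml
(* Sort alphabetically and drop duplicates in one pass. *)
let report (mime_types : string list) : string list =
  List.sort_uniq String.compare mime_types

(* Print one MIME type per line, as a report might. *)
let () =
  report [ "text/plain"; "application/pdf"; "text/plain"; "image/jpeg" ]
  |> List.iter print_endline
```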
The command line switch for verbose reporting is:
--report-params
This prints a list of all the MIME types that occur in the input MBOX, along with all the header field parameters that go along with each one. (The MIME type in an email is indicated in the Content-Type header, and that header, like all headers, is allowed to have an arbitrary number of header field parameters.) The typical example of a header field parameter we've come across in this context gives the character encoding.
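For a sense of what separating the MIME type from its header field parameters involves, here is a simplified sketch. It assumes parameter values never contain a quoted semicolon, which real headers may, and the names split_param and parse_content_type are hypothetical.

```ocaml
(* Split "name=value" at the first '='; a bare token gets an empty value. *)
let split_param (s : string) : string * string =
  match String.index_opt s '=' with
  | None -> (String.trim s, "")
  | Some i ->
    ( String.trim (String.sub s 0 i),
      String.trim (String.sub s (i + 1) (String.length s - i - 1)) )

(* Simplified parse of a Content-Type value into (mime_type, parameters).
   Assumption: no quoted values containing ';'. *)
let parse_content_type (value : string) : string * (string * string) list =
  match String.split_on_char ';' value |> List.map String.trim with
  | [] -> ("", [])
  | mime :: params ->
    (mime, params |> List.filter (fun p -> p <> "") |> List.map split_param)
```

For example, parse_content_type "text/plain; charset=utf-8" separates the type text/plain from the charset parameter that gives the character encoding.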
Whether report mode is run verbosely or non-verbosely, the app reads from standard in and writes to standard out, the same way as it does in MBOX and Single-Email modes.
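The mode selection described above can be sketched in plain OCaml, independent of cmdliner. The type and function names are hypothetical, and the precedence applied when several switches are given at once is an assumption, since the issue does not specify it.

```ocaml
(* The three modes, with report mode split into normal and verbose. *)
type mode = Mbox | Single_email | Report | Report_params

(* Pick a mode from raw switches. MBOX is the default, per the issue text.
   Assumption: if several switches are given, report switches win. *)
let mode_of_switches (switches : string list) : mode =
  if List.mem "--report-params" switches then Report_params
  else if List.mem "--report" switches then Report
  else if List.mem "--single-email" switches then Single_email
  else Mbox

(* Filter style: read standard in unless an input filepath is supplied. *)
let input_channel (path : string option) : in_channel =
  match path with
  | None -> stdin
  | Some p -> open_in p
```

A cmdliner term would most likely compute the same two values (the mode and the optional filepath) and hand them to the rest of the program.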
The code for this interface can most likely go into main.ml in the root of the project. I'll leave it to the developer's discretion about whether to break it out into a separate file.