post is a program for processing structured data files in bulk.
It was originally intended as an automation tool for generating LaTeX graphs from functionObject data generated by OpenFOAM® simulations, but has since evolved such that it can be used as a general structured data processor with optional graph generation support.
Its primary use is processing and formatting data spread over multiple files and/or archives. The main benefit is that the entire process is defined through one or more YAML-formatted run files, hence automating data processing pipelines is fairly simple and no programming is necessary.
If Go is installed locally, the following command will compile and install the latest version of post:
$ go install github.com/Milover/post@latest
Precompiled binaries for Linux, Windows and macOS (Apple silicon) are also available under releases.
Finally, post can also be built from source, assuming Go is available locally, by running the following commands:
$ git clone https://github.com/Milover/post
$ cd post
$ go install
Usage:
post [run file] [flags]
post [command]
Available Commands:
completion Generate the autocompletion script for the specified shell
graphfile Generate graph file stub(s)
help Help about any command
runfile Generate a run file stub
Flags:
--dry-run check runfile syntax and exit
-h, --help help for post
--log-mem log memory usage at the end of each pipeline
--no-graph don't write or generate graphs
--no-graph-generate don't generate graphs
--no-graph-write don't write graph files
--no-output don't output data
--no-process don't process data
--only-graphs only write and generate graphs, skip input, processing and output
--skip strings a list of pipeline IDs to be skipped during processing
-v, --verbose verbose log output
post is controlled by a run file in YAML format, supplied as a CLI parameter.
The run file usually consists of a list of pipelines, each defining 4 sections: input, process, output and graph. The input section defines input files and formats from which data is read; the process section defines operations which are applied to the data; the output section defines how the processed data will be output/stored; and the graph section defines how the data will be graphed.
Note: All file paths within the run file are evaluated using the run file's parent directory as the current working directory.
All sections are optional and can be omitted, defined by themselves, or as part of a pipeline. A special case is the template section, which cannot be defined as part of a pipeline. See Templates for a breakdown of their use.
A single pipeline has the following fields:
- id:
  input:
    type:
    fields:
    type_spec:
  process:
    - type:
      type_spec:
  output:
    - type:
      type_spec:
  graph:
    type:
    graphs:
- id: the pipeline tag, used to reference the pipeline on the CLI; optional
- input: the input section
  - type: input type; see Input for type descriptions
  - fields: field (column) names of the input data; optional
  - type_spec: input type specific configuration
- process: the process section
  - type: process type; see Processing for type descriptions
  - type_spec: process type specific configuration
- output: the output section
  - type: output type; see Output for type descriptions
  - type_spec: output type specific configuration
- graph: the graph section
  - type: graph type; see Graphing for type descriptions
  - graphs: a list of graph type specific graph configurations
A simple run file example is shown below.
- input:
    type: dat
    fields: [x, y]
    type_spec:
      file: 'xy.dat'
  process:
    - type: expression
      type_spec:
        expression: '100*y'
        result: 'result'
  output:
    - type: csv
      type_spec:
        file: 'output/data.csv'
  graph:
    type: tex
    graphs:
      - name: xy.tex
        directory: output
        table_file: 'output/data.csv'
        axes:
          - x:
              min: 0
              max: 1
              label: '$x$'
            y:
              min: 0
              max: 100
              label: '$100 y$'
        tables:
          - x_field: x
            y_field: result
            legend_entry: 'result'
The example run file instructs post to do the following:
- read data from a DAT formatted file xy.dat and rename the fields (columns) to x and y
- evaluate the expression 100*y and store the result in a field named result
- output the data, now containing the fields x, y and result, to a CSV formatted file output/data.csv; if the directory output does not exist, it will be created
- generate a graph using TeX in the output directory, using output/data.csv as the table (data) file, with x as the abscissa and result as the ordinate
For more examples see the examples/ directory.
A generic run file stub, which can be a useful starting point, can be created by running:
$ post runfile
The following is a list of available input types and their descriptions along with their run file configuration stubs:
archive reads input from an archive. The archive format is inferred from the file name extension. The following archive formats are supported: TAR, TAR-GZ, TAR-BZIP2, TAR-XZ, ZIP. Note that archive input wraps one or more input types, i.e., the archive configuration only specifies how to read some data from an archive, while the wrapped input type reads the actual data.
Another important note is that the contents of the archive are stored in memory the first time the archive is read, so if the same archive is used multiple times as an input source, it will be read from disk only once; each subsequent read will read directly from RAM. Hence it is beneficial to use the archive input type when the data consists of a large number of input files, e.g., a large time-series.
Warning: on most machines it is currently faster to read directly from the filesystem than to use the archive input, due to a poorly optimized implementation of archive, so use with caution.
The clear_after_read flag can be used to clear all archive memory after reading the data.
type: archive
type_spec:
  file: # file path of the archive
  clear_after_read: # clear memory after reading; 'false' by default
  format_spec: # input type configuration, e.g., a CSV input type
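For illustration, a configuration which reads a CSV file from a hypothetical archive results.tar.gz might look like the following (the archive name and the file path within it are made up):
type: archive
type_spec:
  file: 'results.tar.gz'       # hypothetical archive
  clear_after_read: true       # free the memory once the data has been read
  format_spec:                 # the wrapped input type reads the actual data
    type: csv
    type_spec:
      file: 'run_0/data.csv'   # assumed to be a path within the archive
      header: true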
csv reads from a CSV formatted file. If the file contains a header line the header field should be set to true and the header column names will be used as the field names for the data. If no header line is present the header field must be set to false.
type: csv
type_spec:
  file: # file path of the CSV file
  header: # determines if the CSV file has a header; default 'true'
  comment: # character to denote comments; default '#'
  delimiter: # character to use as the field delimiter; default ','
dat reads from a white-space-separated-value file. The type and amount of white space between columns is irrelevant, as is leading and trailing white space, as long as the number of columns (non-white-space fields) is consistent in each row.
type: dat
type_spec:
  file: # file path of the DAT file
multiple is a wrapper for multiple input types. Data is read from each input type specified and, once all inputs have been read, the data from each input is merged into a single data instance containing all fields (columns) from all inputs. The number and type of input types specified is arbitrary, but each input must yield data with the same number of rows.
type: multiple
type_spec:
  format_specs: # a list of input type configurations
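As a sketch, a multiple input merging a DAT file and a CSV file (both file names are hypothetical, and both files would have to contain the same number of rows) could be configured as:
type: multiple
type_spec:
  format_specs:
    - type: dat
      type_spec:
        file: 'forces.dat'       # hypothetical file
    - type: csv
      type_spec:
        file: 'pressures.csv'    # hypothetical file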
ram reads data from an in-memory store. For the data to be read it must have been stored previously, e.g., by a previous pipeline's output section defining a ram output.
The clear_after_read flag can be used to clear all ram memory after reading the data.
type: ram
type_spec:
  name: # key under which the data is stored
  clear_after_read: # clear memory after reading; 'false' by default
time-series reads data from a time-series of structured data files in the following format:
.
├── 0.0
│   ├── data_0.csv
│   ├── data_1.dat
│   └── ...
├── 0.1
│   ├── data_0.csv
│   ├── data_1.dat
│   └── ...
└── ...
where each data_*.* file contains the data in some format at the moment in time specified by the directory name. Each series dataset must be output into a different file, i.e., the data_0.csv files contain one dataset, data_1.dat another one, and so on.
type: time-series
type_spec:
  file: # file name (base only) of the time-series data files
  directory: # path to the root directory of the time-series
  time_name: # the time field name; default is 'time'
  format_spec: # input type configuration, e.g., a CSV input type
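For example, reading the data_0.csv series from the layout shown above might be configured as follows (the root directory name is hypothetical, and the wrapped CSV input is assumed to take its file path from the time-series configuration):
type: time-series
type_spec:
  file: 'data_0.csv'            # base name of the series files
  directory: 'postProcessing'   # hypothetical root directory of the series
  time_name: 'time'
  format_spec:
    type: csv
    type_spec:
      header: true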
The following is a list of available processor types and their descriptions along with their run file configuration stubs:
assert-equal asserts that all fields are equal element-wise, up to precision. All field types must be the same. If all fields are equal then no error is returned, otherwise a non-nil error is returned, i.e., the program will stop execution. The data remains unchanged in either case.
type: assert-equal
type_spec:
  fields: # field names for which to assert equality
  precision: # optional; machine precision by default
average-cycle mutates the data by computing the ensemble average of a cycle for all numeric fields. The ensemble average is computed as:
Φ(ωt) = 1/N Σ ϕ[ω(t + jT)], j = 0...N-1
where ϕ is the slice of values to be averaged, ω the angular velocity, t the time and T the period.
The resulting data will contain the cycle average of all numeric fields and a time field (named time), containing times for each row of cycle average data, in the range (0, T]. The time field will be the last field (column), while the order of the other fields is preserved.
Time matching, as well as the match precision, can optionally be specified by setting time_field and time_precision respectively in the configuration. This checks whether the time (step) is uniform and whether there is a mismatch between the read time and the expected time of the averaged value, as per the number of cycles defined in the configuration and the supplied data. The read time is the one read from the field named time_field. Note that in this case the output time field will be named after time_field, i.e., the time field name will remain unchanged.
Warning: It is assumed that data is sorted chronologically, i.e., by ascending time, even if time_field is not specified or does not exist.
type: average-cycle
type_spec:
  n_cycles: # number of cycles to average over
  time_field: # time field name; optional
  time_precision: # time-matching precision; optional
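For example, averaging over 10 cycles with time matching against a field named time (all values here are hypothetical) might look like:
type: average-cycle
type_spec:
  n_cycles: 10              # hypothetical number of cycles contained in the data
  time_field: 'time'        # enables time matching against the 'time' field
  time_precision: 1.0e-06   # hypothetical matching tolerance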
bin mutates the data by dividing all numeric fields into n_bins bins and setting the field values to the bin mean values.
Warning: Each bin must contain the same number of field values, i.e., len(field) % n_bins == 0. This might change in the future.
type: bin
type_spec:
  n_bins: # number of bins into which the data is divided
expression evaluates an arithmetic expression and appends the resulting field (column) to the data. The expression operands can be scalar values or fields (columns) present in the data, which are referenced by their names. Note that at least one of the operands must be a field present in the data. Each operation involving a field is applied element-wise. The following arithmetic operations are supported: +, -, *, /, **
type: expression
type_spec:
  expression: # an arithmetic expression
  result: # field name of the resulting field
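As another sketch, an expression combining two hypothetical fields u and v element-wise might look like:
type: expression
type_spec:
  expression: 'u*u + v*v'    # 'u' and 'v' are hypothetical field names
  result: 'magnitude_squared'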
filter mutates the data by applying a set of row filters as defined in the configuration. The filter behaviour is described by providing the field name field to which the filter is applied, the comparison operator op and a comparison value value. Rows satisfying the comparison are kept, while others are discarded. The following comparison operators are supported: ==, !=, >, >=, <, <=
All defined filters are applied at the same time. The way in which they are aggregated is controlled by setting the aggregation field in the configuration; the and and or aggregation modes are available. The or mode is the default if the aggregation field is unset.
type: filter
type_spec:
  aggregation: # aggregation mode; defaults to 'or'
  filters:
    - field: # field name to which the filter is applied
      op: # filtering operation
      value: # comparison value
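For example, keeping only rows where a hypothetical time field is at least 0.1 and a hypothetical value field is below 50 could be configured with and aggregation:
type: filter
type_spec:
  aggregation: 'and'   # both filters must be satisfied for a row to be kept
  filters:
    - field: 'time'    # hypothetical field name
      op: '>='
      value: 0.1
    - field: 'value'   # hypothetical field name
      op: '<'
      value: 50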
regexp-rename
mutates the data by replacing field names which
match the regular expression src with repl.
See https://golang.org/s/re2syntax for the
regexp syntax description.
type: regexp-rename
type_spec:
  src: # regular expression to use in matching
  repl: # replacement string
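As a sketch, stripping a hypothetical p_ prefix from field names (e.g., renaming p_inlet to inlet) might look like the following, assuming the replacement behaves like Go's regexp ReplaceAllString, i.e., only the matched part of each name is replaced:
type: regexp-rename
type_spec:
  src: '^p_'   # match a hypothetical 'p_' prefix
  repl: ''     # replace the matched prefix with nothing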
rename
mutates the data by renaming fields (columns).
type: rename
type_spec:
  fields: # map of old-to-new name key-value pairs
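For example, with hypothetical old and new field names:
type: rename
type_spec:
  fields:
    Ux: velocity_x   # hypothetical old-to-new name pairs
    Uy: velocity_y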
resample mutates the data by linearly interpolating all numeric fields, such that the resulting fields have n_points values at uniformly distributed values of the field x_field. If x_field is not set, a uniform resampling is performed, i.e., as if the values of each field were given at a uniformly distributed x, where x ∈ [0,1]. The first and last values of a field are preserved in the resampled field.
type: resample
type_spec:
  n_points: # number of resampling points
  x_field: # field name of the independent variable; optional
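For example, resampling to 100 points along a hypothetical time field:
type: resample
type_spec:
  n_points: 100
  x_field: 'time'   # hypothetical independent variable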
select mutates the data by keeping or removing fields (columns). If remove is true, the fields are removed, otherwise only the selected fields are kept in the order specified.
type: select
type_spec:
  fields: # a list of field names
  remove: # remove/keep selected fields; 'false' by default
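For example, keeping only two hypothetical fields, in the given order:
type: select
type_spec:
  fields: [time, result]   # hypothetical field names
  remove: false            # keep only the listed fields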
sort sorts the data by field, in ascending order by default, or in descending order if descending == true. The processor takes a list of fields and orderings and applies them in sequence. The order in which the fields are listed defines the sorting precedence, hence it is possible for some constraints to not be satisfied.
type: sort
type_spec:
  - field: # field by which to sort
    descending: # sort in descending order; 'false' by default
  - field:
    descending:
The following is a list of available output types and their descriptions along with their run file configuration stubs:
csv writes CSV formatted data to a file. If header is set to true the file will contain a header line with the field names as the column names. Note that, if necessary, directories will be created so as to ensure that file specifies a valid path.
type: csv
type_spec:
  file: # file path of the CSV file
  header: # determines if the CSV file has a header; default 'true'
  comment: # character to denote comments; default '#'
  delimiter: # character to use as the field delimiter; default ','
ram stores data in an in-memory store. Once data is stored, any subsequent ram input type can access the data.
type: ram
type_spec:
  name: # key under which the data is stored
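As a sketch of how the ram output and input types work together, the following two pipelines (file and store names are hypothetical) pass data through memory: the first pipeline stores the data under the key intermediate, and the second reads it back and writes it to a CSV file:
- input:
    type: csv
    type_spec:
      file: 'raw.csv'               # hypothetical input file
  output:
    - type: ram
      type_spec:
        name: 'intermediate'        # key under which the data is stored
- input:
    type: ram
    type_spec:
      name: 'intermediate'
      clear_after_read: true        # free the memory once the data has been read
  output:
    - type: csv
      type_spec:
        file: 'output/final.csv'    # hypothetical output file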
Only TeX graphing, via tikz and pgfplots, is currently supported. Hence, for graph generation to work, TeX needs to be installed along with any dependent packages.
Graphing consists of two steps: generating TeX graph files from templates, and generating the graphs from TeX files. To see the default template files run:
$ post graphfile --outdir=templates
The templates can be user supplied by setting template_directory and template_main (if necessary) in the run file configuration. The templates use Go template syntax; see the package documentation for more information.
A tex graph configuration stub is given below; note that several fields expect raw TeX as input.
type: tex
graphs:
  - name: # used as a basename for all graph related files
    directory: # optional; output directory name, created if not present
    table_file: # optional; needed if 'tables.table_file' is undefined
    template_directory: # optional; template directory
    template_main: # optional; root template file name
    template_delims: # optional; go template delimiters; ['__{','}__'] by default
    tex_command: # optional; 'pdflatex' by default
    axes:
      - x:
          min:
          max:
          label: # raw TeX
        y:
          min:
          max:
          label: # raw TeX
        width: # optional; raw TeX, axis width option
        height: # optional; raw TeX, axis height option
        legend_style: # optional; raw TeX, axis legend style option
        raw_options: # optional; raw TeX, if defined all other options are ignored
    tables:
      - x_field:
        y_field:
        legend_entry: # raw TeX
        col_sep: # optional; 'comma' by default
        table_file: # optional; needed if 'graphs.table_file' is undefined
Templates reduce boilerplate when it is necessary to process different sources of data using the same processing pipeline.
For example, consider the case when we would like to extract data at specific times from some time series. The run file would look something like this:
- input: # extract data at t = 0.1
    type: dat
    fields: [time, value]
    type_spec:
      file: 'data.dat'
  process:
    - type: filter
      type_spec:
        filters:
          - field: 'time'
            op: '=='
            value: 0.1
  output:
    - type: csv
      type_spec:
        file: 'output/data_0.1.csv'
- input: # extract data at t = 0.2
  ...
- input: # extract data at t = 0.3
  ...
A new pipeline has to be defined for each time we would like to extract, since the filter uses a different time value and the extracted data is written to a different file each time. This is both cumbersome and error prone. So we use a template to simplify this:
- template:
    params:
      t: [0.1, 0.2, 0.3]
    src: |
      - input:
          type: dat
          fields: [time, value]
          type_spec:
            file: 'data.dat'
        process:
          - type: filter
            type_spec:
              filters:
                - field: 'time'
                  op: '=='
                  value: {{ .t }}
        output:
          - type: csv
            type_spec:
              file: 'output/data_{{ .t }}.csv'
Now we only have to define the pipeline once and, in this case, parametrize it by time.
A template consists of the following fields:
- template:
    params: # a map of parameters used in the template
    src: # YAML formatted string for the pipeline to template
For a template definition to be valid, the following must be true:
- the template must be defined as part of a sequence (!!seq)
- the definition can contain only one mapping, which must have the tag template, i.e., it cannot be defined as part of a pipeline
The params field is a map of parameters and their values. The values can be of any type, including a mapping, but must be given as a list, even if only one value is given. Here are some examples:
- template:
    params:
      o: [0] # a single integer, but must be a list
      p: [0, 1, 2] # a list of integers
      q: ['ab', 'cd'] # a list of strings
      r: # a list of maps
        - tag: a
          val: 0
        - tag: b
          val: 1
The src field is a string containing the pipeline template, using Go template syntax, i.e., the string within src is expanded directly into the run file using the parameter values defined in params.
If multiple parameters are defined, the template is executed for all combinations of parameters. For example, the following template:
- template:
    params:
      typ: [dat, csv]
      ind: [0, 1]
    src: |
      input:
        type: ram
        type_spec:
          name: 'data_{{ .ind }}_{{ .typ }}'
      output:
        - type: {{ .typ }}
          type_spec:
            file: 'data_{{ .ind }}.{{ .typ }}'
will generate 4 files: data_0.dat, data_1.dat, data_0.csv and data_1.csv, although not necessarily in that order, since the execution order of multi-parameter templates is undefined and so shouldn't be relied upon.
Warning: YAML aliases currently cannot be used within the src field. This might change in the future.
See the examples/ directory for more usage examples.