yeslogic / doodle Goto Github PK

Rust 99.50% JavaScript 0.29% HTML 0.03% CSS 0.12% Haskell 0.06%

doodle's Introduction

Binary format compilation experiment

An experiment into compiling binary format descriptions to decoders with a fixed amount of lookahead. This is related work on our data description language, Fathom.

Usage

Decoding files using the CLI:

cargo run file test2.jpg

Viewing decoded data on the web frontend (requires Python):

cargo run format --output=json > frontend/format.json
cargo run file --output=json test2.jpg > frontend/test.json
python3 -m http.server --directory frontend

doodle's People

Contributors

Stargazers

Watchers

Forkers

archaephyrryx

doodle's Issues

Pretty-print all String-like formats as strings, rather than sequences of characters

Currently, the ASCII string types we want to display as ascii are:

Any compound Format whose leaf-values are all base.ascii-char or base.ascii-char.* (glob, not regex)
- (With the following exceptions: ???)
"base.asciiz-string" :=
record([("string", repeat(not_byte(0x00))), ("null", is_byte(0x00))])
"tar.ascii-string" :=
record([("string", repeat(not_byte(0x00))), ("padding", repeat1(is_byte(0x00)))])
"tar.ascii-string.opt0" :=
record([("string", repeat(not_byte(0x00))), ("padding", repeat(is_byte(0x00)))])
"tar.ascii-string.opt0.nonempty" :=
record([("string", repeat1(not_byte(0x00))), ("padding", repeat(is_byte(0x00)))])

Possible classifications

By direct properties of leaf elements
By name of interior elements (e.g. Seq, Repeat, Tuple, or Record whose non-implicit elements are all base.ascii-char(.*)?)
By name pattern-match (e.g. /ascii/ && /string/ or /ascii-string/ || /asciiz-string/)
By structural pattern match (e.g. simply-nested or top-level record with a field named "string" satisfying certain ascii-like properties)

Possible workarounds:
- Move format-specific string types to Base and enforce similar prefixing logic to base.ascii-char and derivatives. This may require renaming base.asciiz-string to base.ascii-string.cstr or similar

EDIT:

This should extend to Unicode strings as well, as utf8.string = repeat utf8.char is currently handled in sequences rather than string-form. This may be good enough for now, but if text formats are ever unified (i.e. ascii as a subset of UTF-8), this may be more relevant than it is now.

Implement proper support non-UStar tar archives

At some point, it might be beneficial to look at other versions of tar, e.g. pax, and gauge how difficult it would be to support such versions as first-class alternations over ustar, rather than successfully parsing them on an incidental basis, when they happen to be compatible with the UStar-based format definition.

[lang-model]: First-Class Option-type at all layers

First-Class Option-model (`simple`, `narrow`, `cons`)

We currently have several layers that implicate the std::option::Option type for the ValueType of Formats, specifically around the Format::Maybe type. However, it is not currently possible to work with this
as a first-class type in the model.

Example: deflate.rs

In the above example, we are forced to construct a monomorphic microcosm of the Option type instead of being able to access it natively. Along with the inconvenience of this workaround, we also lose the convenient API of the
Option type in generated code.

We also lack Value::Option (and ParsedValue::Option), Pattern::Some(Box<Pattern>) and Pattern::None, as well as other, possibly obscure locations where a first-class Option model would be useful/convenient.

[lang-model]: Format primitive for working with buffer-offsets (`simple`, `narrow`, `expr`)

In formats like ELF that use absolute (or rather, start-of-buffer relative) offsets to designate where certain blocks of data are to appear within a buffer, there is a need to provide some way of either parsing something at an offset relative to some other location than 'where we are right now/. The easiest way of doing so would be to allow for a Format primitive that parses zero bytes of input but returns an appropriately-wide unsigned integer type indicating how far into the buffer we are at the site where it was trivially parsed.

Prospectively, we could call this Format::Pos, and use it as follows, with the accompanying expected results of parse-evaluation:

decode_buffer_as_format(tuple(vec![Format::Pos, ...]), buf); // Pos should yield  0
decode_buffer_as_format(tuple(vec![Format::Byte(...), Format::Pos, ...]), buf); // Pos should yield 1

As Format::Pos is specifically designed to consume zero bytes, it is idempotent, and repeatedly calling it without having read any data in between calls should always yield a consistent answer. Calling it with no preceding formats at the start of the buffer should yield 0, and otherwise it should yield the number of bytes subsumed by all previous format-tokens up until that point.

Alternatively, we could approach the design with a model of

Format::BindOffset(Label, Box<Format>)

where we expect the following behavior:

Decode(BindOffset(varname, inner))
==
Decode(
    map(
        record(
            vec![(varname, Format::Pos), ("evaluated", inner)]
        ),
        lambda("x", tuple_proj(var("x"), "evaluated"))
    )
)

And eschew the need for an explicit Pos token by either fusing its binding and usage, or creating a persistent binding by applying it in the following fashion:

Format::BindOffset("x", Compute(var("x")))

which would emulate the behavior of the Format::Pos defined in the previous half of this proposal.

[lang-model]: Numerically-predicated repeat-intercalation (`moderate`, `niche`, `spec`)

Our current definition in doodle_formats::format:;jpeg gives the following specification for thge initial field "scan-data" in the sub-definition of "jpeg.scan-data"

            (
                "scan-data",
                repeat(alts([
                    // FIXME: Extract into separate ECS repetition
                    ("mcu", mcu.call()), // TODO: repeat(mcu),
                    // FIXME: Restart markers should cycle in order from rst0-rst7
                    ("rst0", rst0.call()),
                    ("rst1", rst1.call()),
                    ("rst2", rst2.call()),
                    ("rst3", rst3.call()),
                    ("rst4", rst4.call()),
                    ("rst5", rst5.call()),
                    ("rst6", rst6.call()),
                    ("rst7", rst7.call()),
                ])),
            ),

What we ideally want is some way of specifying the following format:

scan-data = scan-data-segments ;

scan-data-segments = segment (rst0 segment (rst1 segment (rst2 segment (rst3 segment (rst4 segment (rst5 segment (rst6 segment (rst7 scan-data-segments)? )? )? )? )? )? )? )? ;

segment = mcu+;

We currently have no way of calling a self-referential format, nor do we have any way of specifying an intercalated repeat that injects particular format-elements in between repetitions of another format.

A proposed language extension that would solve this problem would be IntercalateRepeatIndex(varname, X, F)

with the following semantics:

decode(IntercalateRepeatIndex(varname, X, F)):
    sequence = []
    index = 0
    loop:
         if let elem = decode(X):
              push elem to end of sequence
              set scope-local variable "<varname>" to the current value of index
              if let some marker = decode(F):
                  index = index + 1
              else:
                  break
         else:
             break
    return sequence

Example using such an abstraction:

Format::IntercalateRepeatIndex("reset-counter", repeat1(mcu.call()), Format::Match(
     Expr::Arith(Arith::Mod, Box::new(var("reset-counter")), Box::new(Expr::U32(8))),
     vec![
         (Pattern::U32(0), rst0.call()),
         /* ... */
         (Pattern::U32(7), rst1.call()),
     ]
))

[lang-model]: ForEach format over sequence-valued expr using fused lambda (`moderate`, `narrow`, `expr`)

Proposal for Format primitive ForEach that takes

ForEach(Expr, Label, Box<Format>)

that parses the boxed inner Format once for each element in the sequence-valued Expr, binding the sequence-element to the consistent variable-name given by the Label parameter for each iteration.

[lang-model]: `const-enum` Format (`simple`, `common`, `sem`)

While achievable through a modest amount of boilerplating with our current model, it may be useful going forward to have a first-class Format-variant for const enum values: an ad-hoc type consisting of nullary (C-Style enum) variants that are each associated with a specific non-negative integer constant.

This proposal is for something of the form:

pub enum Format {
    /* ... */
    ConstEnum(Box<Format>, BTreeMap<usize, Label>)
}

where the first argument specifies the 'raw' parsing of the integer value, and
the BTreeMap encapsulates the associations between each valid integer value
and the name of the variant to be constructed in-place the Value-level stand-in for the raw magic number.

To provide an easier API, we propose the following helper-function:

pub fn const_enum(raw_format: Format, variants: impl IntoIter<Item = (usize, Label)>) {
    Format::ConstEnum(Box::new(raw_format), variants.into_iter().collect())
}

Usage Example (from WIP ELF):

const_enum(base.u8(), [(0, "ClassNone"), (1, "Class32"), (2, "Class64")])

The intended interpretation of this primitive is that it should adhere to the rough operational semantics illustrated below:

Format::ConstEnum(F, Map) :~
   map(
        where_lambda(F, "x", <key "x" is in Map>),
        lambda("y", Expr::Variant(<associated value with key "y")
    )

Tracking Issue: Core Language Extension Checklist

Core Language Extension Checklist

This document is designed to act as an ongoing tracker for any specification-level problems
that our Format/Expr/Pattern model in doodle is not well-enough-equipped to handle at present.

Certain issues may be labeled with the binary format that most notably benefits from their
resolution, but may apply more broadly than one format.

Furthermore, each problem will be given values across several metrics, indicating various aspects of solving it:

Complexity: How difficult we estimate it will be to implement an appropriate solution
- trivial, simple, moderate, complex
Commonness: How many formats, whether currently-implemented and not-yet-implemented, the issue may appear in
- niche, narrow, common
Category: The metric on which our definitions will 'improve' as a result of having an on-hand solution
- spec - Specification compliance
- sem - Semantic transparency
- cons - Model-internal consistency
- expr - Expressivity of novel constructions

Proposals

No-Issue-Assigned

Repetition-Scope abstractions (Format::LocalScope / Format::UniqueInScope

Designated fields in a repetition that are supposed to be unique within that repetition

E.g.

Format::LocalScope("#unique-id", repeat(record(vec![("unique-id", Format::UniqueInScope(base.u8(), "#unique-id")), ...other fields...])))

Format::ValueInSet

Values that belong to a set of values accessed by a relative projective lens

E.g.

Format::record(
   vec![
      ("original", Format::LocalScope("#orig-values",
            repeat(Format::UniqueInScope(base.u8(), "#orig-values")))).
      ("permutation", Format:LocalScope("#permutation", 
            repeat(Format::UniqueInScope(Format::ValueInSet(Lens::Var("original").set_from_seq()), "#permutation"))))
   ]
)

[lang-model] Decode an `Expr` of type `Seq(U8)` using a given Format

Display both original value and mapped value when `collapse_mapped_values` is unset

The current behavior, when collapse_mapped_values in tree.rs in unset, is to show only the original value. In many cases, however, the mapped value is more semantically interesting to the end-user, as it may often be a bitwise interpretation of otherwise opaque subordinate values. We should show either the mapped value if the flag is set, or both otherwise.

(moved out of merged pr #109 for visibility)

Deprecate ASCII-strict text and use UTF-8 based text instead

[lang-model]: Dynamic-order parsing based on absolute offset linkage (`complex`, `narrow`, `expr`)

Certain binary files, notably ELF, may propose a nominal order for certain mid-level blocks of data (in the case of elf, 'Program Headers' | 'Section headers' | 'Sections') without enforcing a requirement that they appear in any given order necessarily; along with this, in the case of ELF, there may be absolute-offset fields that indicate where a given block might be found within the entire stream.

In the current expression language and parsing model, there is no native support for indicating an absolute offset, nor is there a way of recording the fact that a given range of bytes was already parsed out-of-order and should be skipped once they would be encountered during standard in-order parsing that reaches the already-processed offset range.

There are a number of potential model adjustments that could be made to address this gap, a few of which are suggested below, by no means exhaustively:

Introduce a Format primitive AtAbsoluteOffset that behaves as a variation of WithRelativeOffset, albeit computed based on the start-of-buffer rather than the current buffer-index, and which does not exhibit Peek semantics (i.e. bytes consumed using AtAbsoluteOffset` will be 'remembered' as having been read and will be silently skipped over during any other mode of subsequent parsing). To this end, we might store either a buffer-length bitmask, or more simply, a container of sorted intervals that are automatically merged when possible when they are contiguous (or overlapping, which we expect far less often and might even complain about)
- Pros: Keeps extra state-information requirements to a minimum
- Cons: forces the data at any given absolute-file-offset to be logically grouped with its first forward-reference, and complicates the parsing model by jumping back and forth in unexpected ways
Allow 'earmarking' of given offsets that indicate what we will parse when the offset naturally reaches any one of a given set of positions; keeps a natural processing order for the buffer without jumping back and forth. For example, having seen in processing and ELF header that the start of the section header table is at offset 0x1234, we might record this fact as an association (0x1234, <format level for elf_shdr>) at runtime. This suggests a construct like a LinkedRecord(...) Format primitive that acts as a record whose fields aren't implicitly in sequential linear order, but rather are processed by just-in-time ascription of offset-length pairs to each field prior to it being parsed. This would also require a first-class primitive for signaling the intensional semantics of any given field meant to carry absolute file offset information for a given block, such as Format::StoreBlockOffsetLength(Expr, Expr, Label, Label) with the sematics StoreBlockOffsetLength(AbsoluteOffset, LengthInBytes, ScopeNameOfLinkedRecord, FieldNameInLinkedRecord)
- Pros: keeps a strictly linear traversal order for the processing model, does not require associated data to be bundled with its first reference
- Cons: Requires additional, indefinitely many state-entries to be tracked, much harder to check/infer in-model Type-Information, unclear what to do when offsets are back-references, or out-of-bounds, or seen more than once. Also breaks down if the inter-section links cannot be followed forward-only (e.g. if we have *X -> *Y -> *Z but the order in the buffer is X Z Y).