onekey-sec / unblob
Extract files from any kind of container format
Home Page: https://unblob.org
License: Other
The ARC handler fails when run against randomly generated content:
2021-12-02 15:25.12 [info ] Calculating chunk for YARA match identifier=$arc_magic real_offset=0x554816 start_offset=0x554816
2021-12-02 15:25.12 [debug ] Header parsed header=
00000000 1a 03 1b da e2 92 33 be 29 e5 72 b0 7b 71 fe c8 ......3.).r.{q..
00000010 80 11 57 34 68 96 3e 58 2a 9a 0a cc 26 ..W4h.>X*...&
struct heads:
- archive_marker: 0x1a
- header_type: 0x3
- name: b'\x1b\xda\xe2\x923\xbe)\xe5r\xb0{q\xfe'
- size: 0x571180c8
- date: 0x6834
- time: 0x3e96
- crc: 0x2a58
- length: 0x26cc0a9a
2021-12-02 15:25.12 [error ] Unhandled Exception during chunk calculation
Traceback (most recent call last):
File "/home/quentin/unblob/unblob/strategies.py", line 50, in search_chunks_by_priority
chunk = handler.calculate_chunk(limited_reader, real_offset)
File "/home/quentin/unblob/unblob/handlers/archive/arc.py", line 67, in calculate_chunk
header = self.parse_header(file)
File "/home/quentin/unblob/unblob/models.py", line 119, in parse_header
header = self._struct_parser.parse(self.HEADER_STRUCT, file, endian)
File "/home/quentin/unblob/unblob/file_utils.py", line 143, in parse
return struct_parser(file)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 16, in __call__
return self.read(*args, **kwargs)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 62, in read
return self._read(obj)
File "<compiled heads>", line 14, in _read
EOFError
We should capture the EOFError early in the handler and simply not return a chunk.
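A minimal sketch of what that could look like in the ARC handler (the method names follow the traceback above; the class name, imports, and exact calculate_chunk signature are assumptions):

from typing import Optional

from unblob.models import Handler, ValidChunk  # assumption: both live in unblob.models

class ARCHandler(Handler):  # hypothetical name for the ARC handler class
    def calculate_chunk(self, file, start_offset: int) -> Optional[ValidChunk]:
        try:
            header = self.parse_header(file)
        except EOFError:
            # The file ended in the middle of the header: this is a false
            # positive YARA match, so don't return a chunk at all.
            return None
        # ... continue with the normal end offset calculation using `header`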
We need to measure how fast unblob as a whole can operate and what strategy can speed up extraction significantly.
Example question we want to answer: which is faster, matching on all YARA patterns at once, or iterating over the file multiple times with fewer patterns?
Measure different scenarios:
Documentation available in the wiki.
TODO: speed extraction up.
E.g. the merge base should be the top of main, and the PR shouldn't contain merge commits.
If you download the latest Ubuntu LTS image and run unblob on it, it will crash with a segmentation fault.
When we find a corrupt chunk in a file, we should be able to extract up to that point, return the corrupted chunk as UnknownChunk, and issue a warning log message.
exit_code_var, so global state will not be stored.
When run in tmpfs on my machine:
tests/test_extractor.py::TestCarveUnknownChunks::test_multiple_chunks FAILED [ 1%]
>>>>>>>>>> captured stdout >>>>>>>>>>
2021-12-08 21:16.50 [warning ] Found unknown Chunks chunks=[0x0-0x4, 0x4-0x9]
2021-12-08 21:16.50 [info ] Extracting unknown chunk chunk=0x0-0x4 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown
2021-12-08 21:16.50 [info ] Extracting unknown chunk chunk=0x4-0x9 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown
>>>>>>>>>> traceback >>>>>>>>>>
self = <test_extractor.TestCarveUnknownChunks object at 0x7f5be8134040>, tmp_path = PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0')
def test_multiple_chunks(self, tmp_path: Path):
content = b"test file"
test_file = io.BytesIO(content)
chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
carve_unknown_chunks(tmp_path, test_file, chunks)
written_path1 = tmp_path / "0-4.unknown"
written_path2 = tmp_path / "4-9.unknown"
> assert list(tmp_path.iterdir()) == [written_path1, written_path2]
E AssertionError: assert [PosixPath('/...0-4.unknown')] == [PosixPath('/...4-9.unknown')]
E At index 0 diff: PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown') != PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown')
E Full diff:
E [
E + PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown'),
E - PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E ]
tests/test_extractor.py:30: AssertionError
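Path.iterdir() does not guarantee any ordering, which is what trips the assertion on tmpfs. A minimal fix, keeping the test from the traceback and only making the comparison order independent:

def test_multiple_chunks(self, tmp_path: Path):
    content = b"test file"
    test_file = io.BytesIO(content)
    chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
    carve_unknown_chunks(tmp_path, test_file, chunks)
    written_path1 = tmp_path / "0-4.unknown"
    written_path2 = tmp_path / "4-9.unknown"
    # iterdir() order is filesystem dependent, so compare sorted lists instead.
    assert sorted(tmp_path.iterdir()) == sorted([written_path1, written_path2])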
Store the metadata file in extract_root, in one JSON file.
We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, and JSON is easy to look at.
For example:
from typing import Optional

import attr  # assuming attrs here, as it is already used for Chunk below

@attr.define
class Metadata:
    filename: Optional[str]
    size: Optional[int]
    perms: Optional[int]
    endianness: Optional[str]
    uid: Optional[int]
    username: Optional[str]
    gid: Optional[int]
    groupname: Optional[str]
    inode: Optional[int]
    vnode: Optional[int]
@attr.define
class Chunk:
"""Chunk of a Blob, have start and end offset, but still can be invalid."""
start_offset: int
# This is the last byte included
end_offset: int
handler: "Handler" = attr.ib(init=False, eq=False)
metadata: Optional[Metadata]
Currently we are dynamically compiling the YARA rules, and not caching the results. The compilation is also printed multiple times, which is very noisy in the verbose output.
Show metadata about found Chunks.
When merging the formats, we have to solve merge conflicts all the time. A decorator-based approach would be less problematic.
Something like:
@register(priority=0)
class SomeHandler:
...
Or search for subclasses of Handler.
Don't forget we have to import them, so it might be necessary to use something like venusian?
Also we need to think about how this will fit into the strategy implementation.
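A minimal sketch of the decorator-based registration, assuming a simple module-level registry (the HANDLER_REGISTRY name and its shape are hypothetical):

from typing import Dict, List

# Hypothetical module-level registry: priority -> list of handler classes.
HANDLER_REGISTRY: Dict[int, List[type]] = {}

def register(priority: int = 0):
    """Class decorator that records a handler class under the given priority."""
    def decorator(cls):
        HANDLER_REGISTRY.setdefault(priority, []).append(cls)
        return cls
    return decorator

@register(priority=0)
class SomeHandler:
    ...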
There are some file formats which would be too big to commit into the Git repo. We need to generate these files or use Git LFS for them: https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage
Or just use Git LFS for ALL the integration test files.
We are currently reading the whole file into memory when carving out chunks, which is obviously terrible. There is a function for that in the standard library, shutil.copyfileobj; we should use that instead.
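A hedged sketch of streaming carving; shutil.copyfileobj copies until EOF, so carving a bounded chunk needs a small size-capped loop around the same idea (the function names are placeholders):

import shutil
from pathlib import Path
from typing import BinaryIO

BUFFER_SIZE = 1024 * 1024  # stream in 1 MiB pieces instead of reading the whole chunk

def carve_to_eof(file: BinaryIO, start_offset: int, out_path: Path) -> None:
    # When the chunk runs until the end of the file, copyfileobj does all the work.
    file.seek(start_offset)
    with out_path.open("wb") as out:
        shutil.copyfileobj(file, out, BUFFER_SIZE)

def carve_range(file: BinaryIO, start_offset: int, length: int, out_path: Path) -> None:
    # For a bounded chunk, cap the copy at `length` bytes.
    file.seek(start_offset)
    remaining = length
    with out_path.open("wb") as out:
        while remaining > 0:
            data = file.read(min(BUFFER_SIZE, remaining))
            if not data:
                break
            out.write(data)
            remaining -= len(data)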
Right now we return an UnknownChunk for any YARA match where its handler failed to compute a ValidChunk. This can be due to YARA matching on something that is not an archive format, the archive header being corrupted, etc.
To me, an unknown chunk should represent chunks in-between valid chunks for which we did not match. We should not emit structures for something that our handlers failed to process.
This can be problematic when running unblob on a 2.5 GB ISO image of Ubuntu. We will have a valid ISO9660 chunk, followed by hundreds of UnknownChunks being printed out on the console. Mostly "fake" ZIP files in my case.
My solution would be to simply return None when a format handler cannot properly identify a chunk.
A simple Dockerfile which has all required external dependencies installed.
RomFS filesystem format is not handled by 7zip and there is no standalone extractor on the market. That means we need to mount the filesystem somewhere, copy its content to our output directory, and then unmount the filesystem.
This is what most automated tools do (e.g. uClinux extract-romfs).
This is a problem because mount requires root privileges.
This is what a naive implementation would do:
mkdir -p "${outdir}"
mkdir -p "${outdir}_tmp"
sudo mount -o loop -t romfs "${infile}" "${outdir}_tmp"
cp -r "${outdir}_tmp"/* "${outdir}/"
sudo umount "${outdir}_tmp"
We need to either:
Establish a clear way of doing releases from this project.
Some ideas (without any kind of order / importance whatsoever):
Either on errors or with debug output, add a flag which prints the specified chunks with cstruct.dumpstruct to make it easier to debug file formats.
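A minimal sketch of what such debug output could use, assuming dissect.cstruct's dumpstruct helper, a hypothetical struct definition, and a placeholder input file:

from dissect.cstruct import cstruct, dumpstruct

cparser = cstruct()
cparser.load("""
struct example_header {
    char   magic[4];
    uint32 size;
};
""")

with open("/tmp/firmware.img", "rb") as f:  # placeholder input
    header = cparser.example_header(f)
    # Prints the parsed fields alongside a hexdump of the raw header bytes.
    dumpstruct(header)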
My ideas:
Usage: unblob [OPTIONS] [FILES]...
Options:
-e, --extract-dir DIRECTORY Extract the files to this directory. Will be
created if doesn't exist.
-d, --depth INTEGER Recursion depth. How deep should we extract
containers.
-v, --verbose Verbose mode, enable debug logs.
--help Show this message and exit.
Traceback (most recent call last):
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/bin/unblob", line 5, in <module>
main()
File "/home/walkman/Projects/unblob/unblob/cli.py", line 46, in main
ctx = cli.make_context("unblob", sys.argv[1:])
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 914, in make_context
self.parse_args(ctx, args)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1370, in parse_args
value, args = param.handle_parse_result(ctx, opts, args)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2347, in handle_parse_result
value = self.process_value(ctx, value)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2309, in process_value
value = self.callback(ctx, self, value)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1271, in show_help
ctx.exit()
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 681, in exit
raise Exit(code)
click.exceptions.Exit: 0
We should add the unar dependency to our GitHub workflow definition in a separate branch and then rebase on it in the branches that require it.
Currently done in the RAR branch (see https://github.com/IoT-Inspector/unblob/pull/33/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957f).
But it will be required by PR #33 and #51.
Doing it separately will make the history cleaner, I think.
unblob fails when receiving a path outside its own root:
poetry run unblob -v /tmp/lzo
2021-12-03 10:35.22 [info ] Start processing files count=0x1
2021-12-03 10:35.22 [info ] Start processing file path=/tmp/lzo
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/quentin/unblob/unblob/cli.py", line 54, in main
cli.invoke(ctx)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/quentin/unblob/unblob/cli.py", line 41, in cli
process_file(path.parent, path, extract_root, max_depth=depth)
File "/home/quentin/unblob/unblob/processing.py", line 28, in process_file
log.info("Found directory")
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_log_levels.py", line 118, in meth
return self._proxy_to_logger(name, event, **kw)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 202, in _proxy_to_logger
args, kw = self._process_event(method_name, event, event_kw)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 159, in _process_event
event_dict = proc(self._logger, method_name, event_dict)
File "/home/quentin/unblob/unblob/logging.py", line 31, in convert_type
rel_path = value.relative_to(extract_root)
File "/usr/lib/python3.8/pathlib.py", line 908, in relative_to
raise ValueError("{!r} does not start with {!r}"
ValueError: '/tmp/lzo' does not start with '/home/quentin/unblob'
We should fix it here https://github.com/IoT-Inspector/unblob/blob/main/unblob/logging.py#L31
Assigning it to @kissgyorgy given that he worked on f565ed2, which looks similar.
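A hedged sketch of a possible fix in unblob/logging.py, falling back to the absolute path when the file is outside extract_root (the helper name is hypothetical; the relative_to call is the one from the traceback):

from pathlib import Path

def format_path(value: Path, extract_root: Path) -> str:
    try:
        # Show paths relative to extract_root when possible...
        return str(value.relative_to(extract_root))
    except ValueError:
        # ...but fall back to the absolute path for files outside of it,
        # instead of crashing with "does not start with".
        return str(value)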
We want to run the processing in a multiprocessing.pool.Pool.
In order to simplify the implementation, and for measuring things, we would do this in 2 steps; eventually, use yara.match callbacks and start the handling in the pool right away (for the current code, this would need more complex changes).
The offset computation we came up with in the TarHandler at _get_tar_end_offset is inexact for specific files.
I surmise it has to do with block padding and block alignment. Without a fix, unblob finds the end offset past the end of the file and discards the chunk.
I can provide a sample internally if someone wants to work on this:
sha1sum /tmp/firmware.img
fc910cc004254fda5eef27c88f943296d3df967e /tmp/firmware.img
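Since tar stores data in fixed 512-byte blocks and pads archives out to a block boundary, the fix probably involves rounding the computed offset up to the next block; a hedged sketch (how this plugs into _get_tar_end_offset is an assumption):

BLOCK_SIZE = 512  # tar writes everything in fixed 512-byte blocks

def round_up_to_block(offset: int) -> int:
    """Round a raw offset up to the next 512-byte block boundary."""
    remainder = offset % BLOCK_SIZE
    if remainder == 0:
        return offset
    return offset + BLOCK_SIZE - remainder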
Save the compiled YARA rules during build and just load it at runtime, no need to compile it every single time.
https://yara.readthedocs.io/en/stable/yarapython.html
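A minimal sketch with yara-python (the rules and compiled output file names are placeholders):

import yara

# At build time: compile once and serialize the result.
rules = yara.compile(filepath="unblob_rules.yar")
rules.save("unblob_rules.compiled")

# At runtime: load the precompiled rules instead of recompiling on every run.
rules = yara.load("unblob_rules.compiled")
matches = rules.match(filepath="/tmp/firmware.img")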
We want to make it possible to run unblob in many environments easily.
PyOxidizer is a great tool for making fully standalone ELF binaries: https://github.com/indygreg/PyOxidizer
As a security researcher, I need a quick way to add support for additional extractors, so that I can analyze currently unsupported firmware images.
An extension is just a Python file stored in a well-known and/or configurable location that should be picked up without any hassle.
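A minimal sketch of such a loader, assuming extensions are plain .py files in a configurable directory (the function name is hypothetical):

import importlib.util
from pathlib import Path

def load_extensions(plugin_dir: Path) -> None:
    """Import every .py file in plugin_dir so the handlers it defines get registered."""
    for path in sorted(plugin_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Executing the module runs any registration side effects it contains.
        spec.loader.exec_module(module)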
Whenever a chunk matching the whole file is found, we should just return it and stop processing others.
We need to have GitHub Actions for running:
When we have extracted a file, it makes no sense to issue a warning about it or try to carve it out.
Handlers with later priority might pick up those files.
We wanted the UnknownChunk to represent gaps in known files anyway.
Have a scripts dir or a Justfile in the repo, which we can use to reproduce all the test files in case we need to change them a bit, or just to make them reproducible.
We must ensure they can seek and read from the same file independently, so they can be run in parallel.
At the moment we only emit a warning about unknown chunks, but we should save them to files for further investigation.
All format handlers calculate the expected size of a chunk. We should implement a generic check that verifies whether the calculated end_offset points past the actual end of the file.
Something along those lines:
import io
import os

def overflow(file: io.BufferedReader, end_offset: int) -> bool:
    # Compare the calculated end offset against the real file size.
    file.seek(0, os.SEEK_END)
    size = file.tell()
    return end_offset > size
This is an easy win in terms of heuristics, to get rid of invalid and corrupted headers.
So I'm starting to fuzz each format at the moment.
A really dumb technique to fuzz the AR format:
#!/bin/bash
while true; do
printf '\x21\x3C\x61\x72\x63\x68\x3E\x0A' > /tmp/file.bin;
dd if=/dev/random count=58 bs=1 >> /tmp/file.bin;
printf '\x60\x0A' >> /tmp/file.bin
dd if=/dev/random count=1 bs=1M >> /tmp/file.bin;
poetry run unblob /tmp/file.bin;
rm -rf file.bin*
done
The idea is to craft a file that will trigger a YARA match, with random bytes wherever we can put them.
The errors are well handled and do not lead to crashes; however, we noticed a problem in the exception handling. On error, the handler seeks back by HEADER_LENGTH, which is set to 60, but the exact size of the header is actually 68.
except arpy.ArchiveFormatError as exc:
logger.debug(
"Hit an ArchiveFormatError, we've probably hit some other kind of data",
exc_info=exc,
)
# Since arpy has tried to read another file header, we need to wind the cursor back the
# length of the header, so it points to the end of the AR chunk.
ar.file.seek(-HEADER_LENGTH, os.SEEK_CUR)
This is obvious when the unknown chunk is logged:
2021-12-15 15:57.37 [warning ] Found unknown Chunks chunks=[0x8-0x100044]
We should fix this by changing the HEADER_LENGTH value.
In process_file the depth comparison is wrong; it only works properly when the depth is the same as the default.
Counting from 0 to depth might be more intuitive.
Our integration tests trigger warnings for unknown chunks when parsing CPIO and TAR formats. This should not happen.
It's probably an end offset miscalculation due to padding, or test files not respecting the standard.
The AR archive handler emits empty valid chunks with start offset 0 and end offset 0 when an exception is triggered by a malformed archive.
This is due to the current implementation that emits valid chunks regardless of the exception state:
try:
ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
# debug log
...
offset = ar.file.tell()
return ValidChunk(
start_offset=start_offset,
end_offset=start_offset + offset,
)
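A hedged sketch of a possible fix, returning no chunk when not even one header could be read (using offset == 0 as the "nothing was parsed" signal is an assumption):

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...

offset = ar.file.tell()

# If arpy could not read a single header, this was a false positive match,
# so don't emit an empty ValidChunk for it.
if offset == 0:
    return None

return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)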
We are skipping some errors that happened during the run, but we should keep track of them and exit at the end with a non-zero code.
For example, when we couldn't extract an archive, the unblob process itself should exit with an error code.
When the firmware is in a different folder, e.g. /tmp/firmware.bin, we get an Exception.
We need a function which can list all the required third-party dependencies needed to extract all the file types we support.
--show-external-dependencies should list all required scripts, one per line (maybe suggest Ubuntu package names to install?).
When an extraction fails, store the error messages (maybe the whole standard error) of the extractor, because we want to show that to the users.
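A hedged sketch of how that listing could work, assuming each handler exposes a hypothetical EXTERNAL_DEPENDENCIES attribute:

import shutil
from typing import Iterable

def show_external_dependencies(handlers: Iterable[type]) -> None:
    """Print every external tool the given handlers rely on, one per line."""
    tools = set()
    for handler in handlers:
        # EXTERNAL_DEPENDENCIES is a hypothetical per-handler attribute here.
        tools.update(getattr(handler, "EXTERNAL_DEPENDENCIES", ()))
    for tool in sorted(tools):
        status = "ok" if shutil.which(tool) else "MISSING"
        print(f"{tool}\t{status}")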
The AR archive handler calculates the end offset as if the AR archive always starts at the beginning of the file being analyzed. This is wrong.
We don't return UnknownChunks from search_chunks_by_priority and we don't use the base Chunk model anymore, so we can get rid of the abstraction.
Only after this: #71. Then start the handling in the pool right away. (For the current code, this would need more complex changes.)
When we have a handler which doesn't have integration tests in the test folder, the test suite should fail.
Currently we are only collecting existing tests.
Click is catching EOFError and quitting with an Aborted! message.
We are quite often getting this exception during development, because of the nature of the project, but we need the traceback very much to make our life easier.
https://click.palletsprojects.com/en/7.x/exceptions/?highlight=standalone_mode
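A minimal sketch of the approach from the linked Click docs, assuming the Click group is exposed as cli in unblob.cli; with standalone_mode=False, Click propagates exceptions instead of converting them into an Aborted! message:

import sys

from unblob.cli import cli  # assumption: the Click group is exposed as `cli`

def main() -> int:
    # standalone_mode=False makes Click re-raise exceptions (including EOFError)
    # so we get a real traceback during development.
    return cli.main(args=sys.argv[1:], standalone_mode=False) or 0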