
unblob's Introduction

unblob

unblob is an accurate, fast, and easy-to-use extraction suite. It parses unknown binary blobs for more than 30 different archive, compression, and file-system formats, extracts their content recursively, and carves out unknown chunks that have not been accounted for.

unblob is free to use and licensed under the MIT license. It has a command-line interface and can also be used as a Python library.
This makes unblob the perfect companion for extracting, analyzing, and reverse engineering firmware images.

See more at https://unblob.org.

Demo

unblob's People

Contributors

0rshemesh, andrewfasano, david-filipidisz, dependabot[bot], dorpvom, e3krisztian, ffontaine, flukavsky, github-actions[bot], kissgyorgy, kukovecz, kxynos, ljrk0, martonilles, martonivan, mucoze, nyuware, qkaiser, tests00, vlaci


unblob's Issues

Inexact padding calculation on specific TAR files

The offset computation we came up with in the TarHandler at _get_tar_end_offset is inexact for specific files.

I surmise it has to do with block padding and block alignment. Without a fix, unblob finds an end offset past the end of the file and discards the chunk.

I can provide a sample internally if someone wants to work on this:

sha1sum /tmp/firmware.img 
fc910cc004254fda5eef27c88f943296d3df967e  /tmp/firmware.img
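
If the block-padding suspicion is correct, the fix is probably to round the computed end offset up to a tar block boundary before comparing it with the file size. A minimal sketch of the idea (illustrative names, not the actual _get_tar_end_offset code):

BLOCK_SIZE = 512  # tar members are padded to 512-byte blocks
END_OF_ARCHIVE_SIZE = 2 * BLOCK_SIZE  # an archive ends with two zero blocks

def round_up(offset: int, alignment: int) -> int:
    return ((offset + alignment - 1) // alignment) * alignment

def block_aligned_end_offset(last_member_end: int) -> int:
    # End of the last member, padded to a block boundary, plus the
    # end-of-archive marker.
    return round_up(last_member_end, BLOCK_SIZE) + END_OF_ARCHIVE_SIZE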

Make proper release plan

Establish a clear way of doing releases from this project.

Some ideas (without any kind of order / importance whatsoever):

  • Push docker image to public registry with proper tagging / labeling? (#36 follow-up)
  • Same with standalone binary? (#102 follow-up)
  • Publish (pypi) package? (#101 follow-up)
  • etc?!

Improve extraction output

My ideas:

  • Use a 0x0000-0xff00 format for chunk ranges instead of just 0-ff00 (less ambiguous)
  • Create fewer directories. We put extracted chunks into a directory named after the chunk offset even when that chunk covers the whole file; that should not be necessary.

Implement a size overflow check in a base handler class.

All format handlers calculate the expected size of a chunk. We should implement a generic check that verifies if the calculated end_offset points after the actual end of a file.

Something along these lines:

import io
import os

def overflow(file: io.BufferedReader, end_offset: int) -> bool:
    # Seek to the end of the file to learn its size.
    file.seek(0, os.SEEK_END)
    size = file.tell()
    return end_offset > size

This is an easy win in terms of heuristics, to get rid of invalid and corrupted headers.

We should stop using UnknownChunk for matched structures with invalid headers

Right now we return an UnknownChunk for any YARA match whose handler failed to compute a ValidChunk. This can happen because YARA matched on something that is not an archive format, because the archive header is corrupted, etc.

To me, an unknown chunk should represent data in between valid chunks for which nothing matched. We should not emit structures for something our handlers failed to process.

This can be problematic when running unblob on a 2.5 GB Ubuntu ISO image. We will have a valid ISO9660 chunk, followed by hundreds of UnknownChunks printed to the console, mostly "fake" ZIP files in my case.

My solution would be to simply return None when a format handler cannot properly identify a chunk.
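
A sketch of the dispatch side of that idea (the helper name and the shape of matches are hypothetical; calculate_chunk is the handler method referenced elsewhere in these issues):

from typing import List, Optional

def chunks_for_matches(matches, file) -> List["ValidChunk"]:
    # Keep only the chunks the handlers could validate; a failed match
    # yields None and is dropped instead of becoming an UnknownChunk.
    chunks = []
    for handler, real_offset in matches:
        chunk = handler.calculate_chunk(file, real_offset)
        if chunk is not None:
            chunks.append(chunk)
    return chunks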

Inexact seek back on error when parsing malformed AR archives

So I'm starting to fuzz each format at the moment.

A really dumb technique to fuzz the AR format:

#!/bin/bash 
while true; do
    printf '\x21\x3C\x61\x72\x63\x68\x3E\x0A' > /tmp/file.bin;
    dd if=/dev/random count=58 bs=1 >> /tmp/file.bin;
    printf '\x60\x0A' >> /tmp/file.bin
    dd if=/dev/random count=1 bs=1M >> /tmp/file.bin;
    poetry run unblob /tmp/file.bin;
    rm -rf /tmp/file.bin*
done

The idea is to craft a file that triggers a YARA match, with random bytes wherever we can put them.

The errors are handled well and do not lead to crashes; however, we noticed a mistake in the exception handling. On error, the handler seeks back by HEADER_LENGTH, which is set to 60, but the exact size of the header is actually 68.

except arpy.ArchiveFormatError as exc:
    logger.debug(
        "Hit an ArchiveFormatError, we've probably hit some other kind of data",
        exc_info=exc,
    )
    # Since arpy has tried to read another file header, we need to wind the cursor back the
    # length of the header, so it points to the end of the AR chunk.
    ar.file.seek(-HEADER_LENGTH, os.SEEK_CUR)

This is obvious when the unknown chunk is logged:

2021-12-15 15:57.37 [warning  ] Found unknown Chunks           chunks=[0x8-0x100044]

We should fix this by changing the HEADER_LENGTH value.

RomFS extractor

The RomFS filesystem format is not handled by 7zip, and there is no standalone extractor on the market. That means we need to mount the filesystem somewhere, copy its content to our output directory, and then unmount it.

This is what most automated tools do (e.g. uClinux's extract-romfs).

This is a problem because mount requires root privileges.

This is what a naive implementation would do:

mkdir -p ${outdir}
mkdir -p "${outdir}_tmp"
sudo mount -o loop -t romfs ${infile} "${outdir}_tmp"
cp "${outdir}_tmp"/* ${outdir}/
sudo umount "${outdir}_tmp"

We need to either:

  • write our own standalone extractor
  • find a way to mount cleanly without elevated privileges (udisksctl and the like; see the sketch after this list)
  • not support RomFS
  • sandbox it
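
For the udisksctl route, a rough sketch could look like the following (assuming udisks2 is installed and the user is allowed to set up loop devices; the output parsing is illustrative and not robust):

import shutil
import subprocess
from pathlib import Path

def extract_romfs_via_udisks(image: Path, outdir: Path) -> None:
    # Attach the image to a loop device without sudo; udisks handles privileges.
    setup = subprocess.run(
        ["udisksctl", "loop-setup", "--no-user-interaction", "--file", str(image)],
        check=True, capture_output=True, text=True,
    )
    # Output looks like: "Mapped file <image> as /dev/loopN."
    loop_dev = setup.stdout.strip().rstrip(".").split()[-1]
    try:
        mount = subprocess.run(
            ["udisksctl", "mount", "--no-user-interaction", "--block-device", loop_dev],
            check=True, capture_output=True, text=True,
        )
        # Output looks like: "Mounted /dev/loopN at <mount point>"
        mount_point = mount.stdout.strip().rstrip(".").split()[-1]
        shutil.copytree(mount_point, outdir, dirs_exist_ok=True)
        subprocess.run(["udisksctl", "unmount", "--block-device", loop_dev], check=True)
    finally:
        subprocess.run(["udisksctl", "loop-delete", "--block-device", loop_dev], check=True)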

Add possibility to use cstruct.dumpstruct

Either on errors or with debug output, add a flag which prints the specified chunks with cstruct.dumpstruct to make it easier to debug file formats.
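
A rough sketch of what the flag could do, using dissect.cstruct directly (the flag name and the example struct are made up):

from dissect import cstruct
from dissect.cstruct import dumpstruct

cparser = cstruct.cstruct()
cparser.load("""
struct example_header {
    char   magic[4];
    uint32 size;
};
""")

header = cparser.example_header(b"EXMP\x10\x00\x00\x00")
# With a hypothetical --dump-structs flag enabled, every parsed header
# could be printed as an annotated hexdump for format debugging:
dumpstruct(header)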

Improve documentation

Documentation is available in the wiki.

TODO:

  • adapt examples to the new Hyperscan based API
  • rewrite the introduction
  • clean up the README
  • squash wiki history

Set up performance testing

We need to measure how fast unblob as a whole can operate and which strategies speed up extraction significantly.
An example question we want to answer: which is faster, matching on all YARA patterns at once, or iterating over the file multiple times with fewer patterns? (A timing harness sketch follows the list below.)

Measure different scenarios:

  • One big file with few smaller files inside
  • Lots of small files concatenated and inside
  • Multiple big files concatenated and inside
  • Refactor the priority handling by concatenating all YARA rules and handling the match results by priority instead of scanning a file multiple times. Measure the difference on various files.
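
A trivial harness is probably enough to answer the wall-clock part of these questions; a sketch (the -e flag is taken from the --help output shown later on this page, everything else is illustrative):

import statistics
import subprocess
import time
from pathlib import Path

def benchmark(sample: Path, runs: int = 5) -> float:
    """Run unblob on a sample several times and return the median wall time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["unblob", "-e", "/tmp/unblob-bench", str(sample)], check=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)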

Exception when running unblob --help

Usage: unblob [OPTIONS] [FILES]...

Options:
  -e, --extract-dir DIRECTORY  Extract the files to this directory. Will be
                               created if doesn't exist.
  -d, --depth INTEGER          Recursion depth. How deep should we extract
                               containers.
  -v, --verbose                Verbose mode, enable debug logs.
  --help                       Show this message and exit.
Traceback (most recent call last):
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/bin/unblob", line 5, in <module>
    main()
  File "/home/walkman/Projects/unblob/unblob/cli.py", line 46, in main
    ctx = cli.make_context("unblob", sys.argv[1:])
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 914, in make_context
    self.parse_args(ctx, args)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1370, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2347, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2309, in process_value
    value = self.callback(ctx, self, value)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1271, in show_help
    ctx.exit()
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 681, in exit
    raise Exit(code)
click.exceptions.Exit: 0
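
The traceback shows that calling cli.make_context() directly lets Click's Exit exception (raised by eager options like --help) escape. One possible shape of a fix in main(), as a sketch only (cli is the Click command from unblob/cli.py referenced in the traceback):

import sys
import click

def main():
    try:
        ctx = cli.make_context("unblob", sys.argv[1:])
        with ctx:
            cli.invoke(ctx)
    except click.exceptions.Exit as exc:
        # --help and other eager options raise Exit(0); turn it into a
        # normal process exit instead of an unhandled traceback.
        sys.exit(exc.exit_code)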

Register handlers by using decorators

When merging new format handlers, we have to resolve merge conflicts all the time. A decorator-based approach would be less problematic.
Something like:

@register(priority=0)
class SomeHandler:
    ...

Or search for subclasses of Handler.

Don't forget we have to import them, so it might be necessary to use something like venusian?

Also we need to think about how this will fit into the strategy implementation.
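
One way such a registry could look (a sketch; the real solution still has to address importing and strategy priorities):

from typing import Dict, List

HANDLER_REGISTRY: Dict[int, List[type]] = {}

def register(priority: int = 0):
    """Class decorator that collects handler classes grouped by priority."""
    def decorator(cls: type) -> type:
        HANDLER_REGISTRY.setdefault(priority, []).append(cls)
        return cls
    return decorator

@register(priority=0)
class SomeHandler:
    ...

# The strategy code could then iterate priorities in order:
# for priority in sorted(HANDLER_REGISTRY): ...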

Make test files reproducible

Have a scripts dir or a Justfile in the repo which we can use to regenerate all the test files, in case we need to change them a bit, and to keep them reproducible.

Relative path logging bug

unblob fails when receiving a path outside its own root:

poetry run unblob -v /tmp/lzo     
2021-12-03 10:35.22 [info     ] Start processing files         count=0x1
2021-12-03 10:35.22 [info     ] Start processing file          path=/tmp/lzo
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/quentin/unblob/unblob/cli.py", line 54, in main
    cli.invoke(ctx)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/quentin/unblob/unblob/cli.py", line 41, in cli
    process_file(path.parent, path, extract_root, max_depth=depth)
  File "/home/quentin/unblob/unblob/processing.py", line 28, in process_file
    log.info("Found directory")
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_log_levels.py", line 118, in meth
    return self._proxy_to_logger(name, event, **kw)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 202, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 159, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "/home/quentin/unblob/unblob/logging.py", line 31, in convert_type
    rel_path = value.relative_to(extract_root)
  File "/usr/lib/python3.8/pathlib.py", line 908, in relative_to
    raise ValueError("{!r} does not start with {!r}"
ValueError: '/tmp/lzo' does not start with '/home/quentin/unblob'

We should fix it here https://github.com/IoT-Inspector/unblob/blob/main/unblob/logging.py#L31

Assigning it to @kissgyorgy given that he worked on f565ed2, which looks similar.
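
A defensive version of the conversion in logging.py could simply fall back to the absolute path (a sketch, not the final patch):

from pathlib import Path

def format_path(value: Path, extract_root: Path) -> str:
    # Log paths relative to extract_root when possible, otherwise as-is,
    # so inputs outside the extraction root no longer raise ValueError.
    try:
        return str(value.relative_to(extract_root))
    except ValueError:
        return str(value)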

Calculate entropy for UnknownChunks

  • calculate shannon_entropy (see the sketch below)
  • log record
  • Pretty graph with entropy calculated for chunks
  • calculate entropy for input files if we couldn't find anything in them
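
For the first item, a minimal Shannon entropy sketch (bits per byte, standalone Python):

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    length = len(data)
    return -sum(
        (count / length) * math.log2(count / length)
        for count in Counter(data).values()
    )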

Fix depth calculation

In process_file the depth comparison is wrong; it only works properly when the depth is the same as the default.
Counting from 0 up to depth might be more intuitive.

Carve out unknown chunks

At the moment we only emit a warning about unknown chunks, but we should save them to files for further investigation.
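
A carving sketch (hypothetical helper; the <start>-<end>.unknown naming matches the test output quoted further below):

from pathlib import Path
from typing import BinaryIO

def carve_unknown_chunk(extract_dir: Path, file: BinaryIO, start: int, end: int) -> Path:
    # Save the raw bytes of an unknown chunk for later investigation.
    file.seek(start)
    out_path = extract_dir / f"{start}-{end}.unknown"
    out_path.write_bytes(file.read(end - start))
    return out_path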

Support out-of-tree extensions for handling additional file formats

As a security researcher, I need a quick way to add support for additional extractors, so that I can analyze currently unsupported firmware images.

An extension is just a Python file stored in a well-known and/or configurable location that should be picked up without any hassle.
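
A sketch of how such a location could be picked up (illustrative; no specific plugin framework assumed):

import importlib.util
from pathlib import Path

def load_extensions(extension_dir: Path) -> list:
    # Import every .py file from a well-known directory so that any handler
    # classes defined there can register themselves on import.
    modules = []
    for path in sorted(extension_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules.append(module)
    return modules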

Unit tests depend on filesystem enumeration order

When run on tmpfs on my machine:

tests/test_extractor.py::TestCarveUnknownChunks::test_multiple_chunks FAILED                                          [  1%]
>>>>>>>>>>>>>>>>>>>> captured stdout >>>>>>>>>>>>>>>>>>>>
2021-12-08 21:16.50 [warning  ] Found unknown Chunks           chunks=[0x0-0x4, 0x4-0x9]
2021-12-08 21:16.50 [info     ] Extracting unknown chunk       chunk=0x0-0x4 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown
2021-12-08 21:16.50 [info     ] Extracting unknown chunk       chunk=0x4-0x9 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown
>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>
self = <test_extractor.TestCarveUnknownChunks object at 0x7f5be8134040>, tmp_path = PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0')

    def test_multiple_chunks(self, tmp_path: Path):
        content = b"test file"
        test_file = io.BytesIO(content)
        chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
        carve_unknown_chunks(tmp_path, test_file, chunks)
        written_path1 = tmp_path / "0-4.unknown"
        written_path2 = tmp_path / "4-9.unknown"
>       assert list(tmp_path.iterdir()) == [written_path1, written_path2]
E       AssertionError: assert [PosixPath('/...0-4.unknown')] == [PosixPath('/...4-9.unknown')]
E         At index 0 diff: PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown') != PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown')
E         Full diff:
E           [
E         +  PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E            PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown'),
E         -  PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E           ]

tests/test_extractor.py:30: AssertionError
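
A minimal fix sketch is to stop relying on enumeration order in the assertion:

# tmpfs does not guarantee any directory ordering, so compare a sorted listing:
assert sorted(tmp_path.iterdir()) == [written_path1, written_path2]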

Don't emit UnknownChunks for whole files

When an unknown chunk covers the whole file we are processing, it makes no sense to issue a warning about it or to try to carve it out.
Handlers in later priority might pick up those files.
We wanted to have the UnknownChunk represent gaps in known files anyway.

Show external dependencies

We need a function which can list all the required third-party dependencies that are needed to extract all the file types we support.

  • --show-external-dependencies lists all required commands on separate lines (maybe suggest Ubuntu package names to install?)
    Put a checkmark or a cross next to every dependency; exit with exit code 1 if something is missing, 0 if we have every dependency.
    (eager option with a separate function in Click; a sketch follows this list)
  • Put a comma-separated list at the end of the help text: "You also need these commands to be able to extract the supported file types"
  • Run this command in the Docker build GitHub Action
  • Put the command name in the error message when the command is not found 😄
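
A sketch of the eager Click option mentioned in the first item (the command list and flag wiring are illustrative):

import shutil
import click

EXTERNAL_COMMANDS = ["7z", "unsquashfs"]  # illustrative subset

def show_external_dependencies(ctx, param, value):
    if not value or ctx.resilient_parsing:
        return
    missing = False
    for command in EXTERNAL_COMMANDS:
        found = shutil.which(command) is not None
        missing = missing or not found
        click.echo(f"{'✓' if found else '✗'} {command}")
    # Exit code 1 if anything is missing, 0 otherwise.
    ctx.exit(1 if missing else 0)

@click.command()
@click.option(
    "--show-external-dependencies",
    is_flag=True,
    callback=show_external_dependencies,
    expose_value=False,
    is_eager=True,
)
def cli():
    ...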

Notice truncated or corrupted files

When we find a corrupt chunk in a file, we should be able to extract up to that point, return the corrupted region as an UnknownChunk, and issue a warning log message.

Improve ARC handling

The ARC handler fails when run against randomly generated content:

2021-12-02 15:25.12 [info     ] Calculating chunk for YARA match identifier=$arc_magic real_offset=0x554816 start_offset=0x554816
2021-12-02 15:25.12 [debug    ] Header parsed                  header=
00000000  1a 03 1b da e2 92 33 be  29 e5 72 b0 7b 71 fe c8   ......3.).r.{q..
00000010  80 11 57 34 68 96 3e 58  2a 9a 0a cc 26            ..W4h.>X*...&

struct heads:
- archive_marker: 0x1a
- header_type: 0x3
- name: b'\x1b\xda\xe2\x923\xbe)\xe5r\xb0{q\xfe'
- size: 0x571180c8
- date: 0x6834
- time: 0x3e96
- crc: 0x2a58
- length: 0x26cc0a9a
2021-12-02 15:25.12 [error    ] Unhandled Exception during chunk calculation 
Traceback (most recent call last):
  File "/home/quentin/unblob/unblob/strategies.py", line 50, in search_chunks_by_priority
    chunk = handler.calculate_chunk(limited_reader, real_offset)
  File "/home/quentin/unblob/unblob/handlers/archive/arc.py", line 67, in calculate_chunk
    header = self.parse_header(file)
  File "/home/quentin/unblob/unblob/models.py", line 119, in parse_header
    header = self._struct_parser.parse(self.HEADER_STRUCT, file, endian)
  File "/home/quentin/unblob/unblob/file_utils.py", line 143, in parse
    return struct_parser(file)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 16, in __call__
    return self.read(*args, **kwargs)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 62, in read
    return self._read(obj)
  File "<compiled heads>", line 14, in _read
EOFError

We should capture the EOFError early in the handler and simply not return a chunk.
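
A sketch of that guard (parse_header and ValidChunk are the pieces visible in the traceback above; this is not the final patch):

from typing import Optional

class ARCHandler:  # only the guard is shown
    def calculate_chunk(self, file, start_offset: int) -> Optional["ValidChunk"]:
        try:
            header = self.parse_header(file)
        except EOFError:
            # Random data matched the ARC magic but a full header could not
            # be read; report no chunk instead of raising.
            return None
        ...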

Parallelize file processing

We want to run the processing in a multiprocessing.pool.Pool.

In order to simplify the implementation, and for measuring things, we would do this in 2 steps:

  1. We can run every new top-level process_file call in the pool (see the sketch after this list).
  2. We can look into yara.match callbacks and start the handling in the pool right away. (For the current code, this would need more complex changes.)
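
A sketch of step 1 (process_file is called exactly as in the tracebacks above; the wrapper names are made up):

import multiprocessing
from functools import partial
from pathlib import Path

def process_one(path: Path, extract_root: Path, depth: int) -> None:
    process_file(path.parent, path, extract_root, max_depth=depth)

def process_all(paths, extract_root: Path, depth: int) -> None:
    # Fan the top-level process_file calls out to a pool of worker processes.
    worker = partial(process_one, extract_root=extract_root, depth=depth)
    with multiprocessing.Pool() as pool:
        pool.map(worker, paths)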

Fail on missing integration tests

When we have a handler which doesn't have integration tests in its folder, the test suite should fail.
Currently we are only collecting the existing tests.
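
One way to enforce this is a guard test parametrized over the registered handlers instead of over the existing test folders (handler names and paths here are illustrative):

from pathlib import Path

import pytest

HANDLER_NAMES = ["tar", "zip", "arc"]  # would be derived from the handler registry
INTEGRATION_ROOT = Path(__file__).parent / "integration"

@pytest.mark.parametrize("handler_name", HANDLER_NAMES)
def test_handler_has_integration_files(handler_name):
    handler_dir = INTEGRATION_ROOT / handler_name
    assert handler_dir.is_dir() and any(handler_dir.iterdir()), (
        f"no integration test files for handler {handler_name}"
    )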

Metadata file

Store the metadata in extract_root as one JSON file.

We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, and JSON is easy to look at.

For example:


from typing import Optional

import attr


@attr.define
class Metadata:
    filename: Optional[str]
    size: Optional[int]
    perms: Optional[int]
    endianness: Optional[str]
    uid: Optional[int]
    username: Optional[str]
    gid: Optional[int]
    groupname: Optional[str]
    inode: Optional[int]
    vnode: Optional[int]


@attr.define
class Chunk:
    """Chunk of a Blob; has start and end offsets, but can still be invalid."""

    start_offset: int
    # This is the last byte included
    end_offset: int
    handler: "Handler" = attr.ib(init=False, eq=False)
    metadata: Optional[Metadata]

Compile YARA rules only once

Currently we are dynamically compiling the YARA rules, and not caching the results. The compilation is also printed multiple times, which is very noisy in the verbose output.
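
A sketch of the caching side (ALL_RULES_SOURCE stands in for however the combined rule text is produced):

from functools import lru_cache

import yara

@lru_cache(maxsize=1)
def get_compiled_rules() -> "yara.Rules":
    # Compile the combined rule source once and reuse the result, so the
    # compilation (and its log output) happens a single time per run.
    return yara.compile(source=ALL_RULES_SOURCE)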

Write Dockerfile

A simple Dockerfile which has all required external dependencies installed.

Refactor Chunk related functions

We don't return UnknownChunks from search_chunks_by_priority and we don't use the base Chunk model anymore, so we can get rid of the abstraction.

AR archive handler emits empty valid chunks on malformed AR archives

The AR archive handler emits empty valid chunks with start offset 0 and end offset 0 when an exception is triggered by a malformed archive.

This is due to the current implementation that emits valid chunks regardless of the exception state:

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...
offset = ar.file.tell()
return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)
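
A fix sketch based on the excerpt above: track whether arpy actually read anything and bail out otherwise.

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...
offset = ar.file.tell()
if offset == 0:
    # arpy could not read a single valid header, so this is not an AR
    # archive; emit no chunk instead of a 0-0 ValidChunk.
    return None
return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)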
