
unblob's Introduction

unblob

unblob is an accurate, fast, and easy-to-use extraction suite. It parses unknown binary blobs for more than 30 different archive, compression, and file-system formats, extracts their content recursively, and carves out unknown chunks that have not been accounted for.

unblob is free to use and licensed under the MIT license. It has a command-line interface and can also be used as a Python library.
This makes unblob the perfect companion for extracting, analyzing, and reverse engineering firmware images.

See more at https://unblob.org.

Demo

unblob's People

Contributors

0rshemesh, andrewfasano, david-filipidisz, dependabot[bot], dorpvom, e3krisztian, ffontaine, flukavsky, github-actions[bot], kissgyorgy, kukovecz, kxynos, ljrk0, martonilles, martonivan, mucoze, nyuware, qkaiser, tests00, vlaci


unblob's Issues

Inexact padding calculation on specific TAR files

The offset computation we came up with in the TarHandler at _get_tar_end_offset is inexact for specific files.

I surmise it has to do with block padding and block alignment. Without a fix, unblob finds an end offset past the end of the file and discards the chunk.

I can provide a sample internally if someone wants to work on this:

sha1sum /tmp/firmware.img 
fc910cc004254fda5eef27c88f943296d3df967e  /tmp/firmware.img
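
If the block-padding suspicion is correct, the fix is probably to round the computed end offset up to a tar block boundary before comparing it with the file size. A minimal sketch of the idea (illustrative names, not the actual _get_tar_end_offset code):

BLOCK_SIZE = 512  # tar members are padded to 512-byte blocks
END_OF_ARCHIVE_SIZE = 2 * BLOCK_SIZE  # an archive ends with two zero blocks

def round_up(offset: int, alignment: int) -> int:
    return ((offset + alignment - 1) // alignment) * alignment

def block_aligned_end_offset(last_member_end: int) -> int:
    # End of the last member, padded to a block boundary, plus the
    # end-of-archive marker.
    return round_up(last_member_end, BLOCK_SIZE) + END_OF_ARCHIVE_SIZE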

Make proper release plan

Establish a clear way of doing releases from this project.

Some ideas (without any kind of order / importance whatsoever):

  • Push docker image to public registry with proper tagging / labeling? (#36 follow-up)
  • Same with standalone binary? (#102 follow-up)
  • Publish (pypi) package? (#101 follow-up)
  • etc?!

Improve extraction output

My ideas:

  • Use a 0x0000-0xff00 format for chunk ranges instead of just 0-ff00 (less ambiguous)
  • Create fewer directories. We put extracted chunks into a directory named after the chunk offset even when that chunk covers the whole file; that should not be necessary.

Implement a size overflow check in a base handler class.

All format handlers calculate the expected size of a chunk. We should implement a generic check that verifies if the calculated end_offset points after the actual end of a file.

Something along these lines:

import io
import os

def overflow(file: io.BufferedReader, end_offset: int) -> bool:
    # Seek to the end of the file to learn its size.
    file.seek(0, os.SEEK_END)
    size = file.tell()
    return end_offset > size

This is an easy win in terms of heuristics, to get rid of invalid and corrupted headers.

We should stop using UnknownChunk for matched structures with invalid headers

Right now we return an UnknownChunk for any YARA match whose handler failed to compute a ValidChunk. This can happen because YARA matched on something that is not an archive format, because the archive header is corrupted, etc.

To me, an unknown chunk should represent data in between valid chunks for which nothing matched. We should not emit structures for something our handlers failed to process.

This can be problematic when running unblob on a 2.5 GB Ubuntu ISO image. We will have a valid ISO9660 chunk, followed by hundreds of UnknownChunks printed to the console, mostly "fake" ZIP files in my case.

My solution would be to simply return None when a format handler cannot properly identify a chunk.
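
A sketch of the dispatch side of that idea (the helper name and the shape of matches are hypothetical; calculate_chunk is the handler method referenced elsewhere in these issues):

from typing import List, Optional

def chunks_for_matches(matches, file) -> List["ValidChunk"]:
    # Keep only the chunks the handlers could validate; a failed match
    # yields None and is dropped instead of becoming an UnknownChunk.
    chunks = []
    for handler, real_offset in matches:
        chunk = handler.calculate_chunk(file, real_offset)
        if chunk is not None:
            chunks.append(chunk)
    return chunks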

Inexact seek back on error when parsing malformed AR archives

So I'm starting to fuzz each format at the moment.

A really dumb technique to fuzz the AR format:

#!/bin/bash 
while true; do
    printf '\x21\x3C\x61\x72\x63\x68\x3E\x0A' > /tmp/file.bin;
    dd if=/dev/random count=58 bs=1 >> /tmp/file.bin;
    printf '\x60\x0A' >> /tmp/file.bin
    dd if=/dev/random count=1 bs=1M >> /tmp/file.bin;
    poetry run unblob /tmp/file.bin;
    rm -rf /tmp/file.bin*
done

The idea is to craft a file that triggers a YARA match, with random bytes wherever we can put them.

The errors are handled well and do not lead to crashes; however, we noticed a mistake in the exception handling. On error, the handler seeks back by HEADER_LENGTH, which is set to 60, but the exact size of the header is actually 68.

except arpy.ArchiveFormatError as exc:
    logger.debug(
        "Hit an ArchiveFormatError, we've probably hit some other kind of data",
        exc_info=exc,
    )
    # Since arpy has tried to read another file header, we need to wind the cursor back the
    # length of the header, so it points to the end of the AR chunk.
    ar.file.seek(-HEADER_LENGTH, os.SEEK_CUR)

This is obvious when the unknown chunk is logged:

2021-12-15 15:57.37 [warning  ] Found unknown Chunks           chunks=[0x8-0x100044]

We should fix this by changing the HEADER_LENGTH value.

RomFS extractor

The RomFS filesystem format is not handled by 7zip, and there is no standalone extractor on the market. That means we need to mount the filesystem somewhere, copy its content to our output directory, and then unmount it.

This is what most automated tools do (e.g. uClinux's extract-romfs).

This is a problem because mount requires root privileges.

This is what a naive implementation would do:

mkdir -p ${outdir}
mkdir -p "${outdir}_tmp"
sudo mount -o loop -t romfs ${infile} "${outdir}_tmp"
cp "${outdir}_tmp"/* ${outdir}/
sudo umount "${outdir}_tmp"

We need to either:

  • write our own standalone extractor
  • find a way to mount cleanly without elevated privileges (udisksctl and the like; see the sketch after this list)
  • not support RomFS
  • sandbox it
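
For the udisksctl route, a rough sketch could look like the following (assuming udisks2 is installed and the user is allowed to set up loop devices; the output parsing is illustrative and not robust):

import shutil
import subprocess
from pathlib import Path

def extract_romfs_via_udisks(image: Path, outdir: Path) -> None:
    # Attach the image to a loop device without sudo; udisks handles privileges.
    setup = subprocess.run(
        ["udisksctl", "loop-setup", "--no-user-interaction", "--file", str(image)],
        check=True, capture_output=True, text=True,
    )
    # Output looks like: "Mapped file <image> as /dev/loopN."
    loop_dev = setup.stdout.strip().rstrip(".").split()[-1]
    try:
        mount = subprocess.run(
            ["udisksctl", "mount", "--no-user-interaction", "--block-device", loop_dev],
            check=True, capture_output=True, text=True,
        )
        # Output looks like: "Mounted /dev/loopN at <mount point>"
        mount_point = mount.stdout.strip().rstrip(".").split()[-1]
        shutil.copytree(mount_point, outdir, dirs_exist_ok=True)
        subprocess.run(["udisksctl", "unmount", "--block-device", loop_dev], check=True)
    finally:
        subprocess.run(["udisksctl", "loop-delete", "--block-device", loop_dev], check=True)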

Add possibility to use cstruct.dumpstruct

Either on errors or with debug output, add a flag which prints the specified chunks with cstruct.dumpstruct to make it easier to debug file formats.
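
A rough sketch of what the flag could do, using dissect.cstruct directly (the flag name and the example struct are made up):

from dissect import cstruct
from dissect.cstruct import dumpstruct

cparser = cstruct.cstruct()
cparser.load("""
struct example_header {
    char   magic[4];
    uint32 size;
};
""")

header = cparser.example_header(b"EXMP\x10\x00\x00\x00")
# With a hypothetical --dump-structs flag enabled, every parsed header
# could be printed as an annotated hexdump for format debugging:
dumpstruct(header)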

Improve documentation

Documentation is available in the wiki.

TODO:

  • adapt examples to the new Hyperscan based API
  • rewrite the introduction
  • clean up the README
  • squash wiki history

Set up performance testing

We need to measure how fast unblob as a whole can operate and which strategies speed up extraction significantly.
An example question we want to answer: which is faster, matching on all YARA patterns at once, or iterating over the file multiple times with fewer patterns? (A timing harness sketch follows the list below.)

Measure different scenarios:

  • One big file with few smaller files inside
  • Lots of small files concatenated and inside
  • Multiple big files concatenated and inside
  • Refactor the priority handling by concatenating all YARA rules and handling the match results by priority instead of scanning a file multiple times. Measure the difference on various files.
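
A trivial harness is probably enough to answer the wall-clock part of these questions; a sketch (the -e flag is taken from the --help output shown later on this page, everything else is illustrative):

import statistics
import subprocess
import time
from pathlib import Path

def benchmark(sample: Path, runs: int = 5) -> float:
    """Run unblob on a sample several times and return the median wall time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["unblob", "-e", "/tmp/unblob-bench", str(sample)], check=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)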

Exception when running unblob --help

Usage: unblob [OPTIONS] [FILES]...

Options:
  -e, --extract-dir DIRECTORY  Extract the files to this directory. Will be
                               created if doesn't exist.
  -d, --depth INTEGER          Recursion depth. How deep should we extract
                               containers.
  -v, --verbose                Verbose mode, enable debug logs.
  --help                       Show this message and exit.
Traceback (most recent call last):
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/bin/unblob", line 5, in <module>
    main()
  File "/home/walkman/Projects/unblob/unblob/cli.py", line 46, in main
    ctx = cli.make_context("unblob", sys.argv[1:])
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 914, in make_context
    self.parse_args(ctx, args)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1370, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2347, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2309, in process_value
    value = self.callback(ctx, self, value)
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1271, in show_help
    ctx.exit()
  File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 681, in exit
    raise Exit(code)
click.exceptions.Exit: 0
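
The traceback shows that calling cli.make_context() directly lets Click's Exit exception (raised by eager options like --help) escape. One possible shape of a fix in main(), as a sketch only (cli is the Click command from unblob/cli.py referenced in the traceback):

import sys
import click

def main():
    try:
        ctx = cli.make_context("unblob", sys.argv[1:])
        with ctx:
            cli.invoke(ctx)
    except click.exceptions.Exit as exc:
        # --help and other eager options raise Exit(0); turn it into a
        # normal process exit instead of an unhandled traceback.
        sys.exit(exc.exit_code)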

Register handlers by using decorators

When merging new format handlers, we have to resolve merge conflicts all the time. A decorator-based approach would be less problematic.
Something like:

@register(priority=0)
class SomeHandler:
    ...

Or search for subclasses of Handler.

Don't forget we have to import them, so it might be necessary to use something like venusian?

Also we need to think about how this will fit into the strategy implementation.
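
One way such a registry could look (a sketch; the real solution still has to address importing and strategy priorities):

from typing import Dict, List

HANDLER_REGISTRY: Dict[int, List[type]] = {}

def register(priority: int = 0):
    """Class decorator that collects handler classes grouped by priority."""
    def decorator(cls: type) -> type:
        HANDLER_REGISTRY.setdefault(priority, []).append(cls)
        return cls
    return decorator

@register(priority=0)
class SomeHandler:
    ...

# The strategy code could then iterate priorities in order:
# for priority in sorted(HANDLER_REGISTRY): ...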

Make test files reproducible

Have a scripts dir or a Justfile in the repo which we can use to regenerate all the test files, in case we need to change them a bit, and to keep them reproducible.

Relative path logging bug

unblob fails when receiving a path outside its own root:

poetry run unblob -v /tmp/lzo     
2021-12-03 10:35.22 [info     ] Start processing files         count=0x1
2021-12-03 10:35.22 [info     ] Start processing file          path=/tmp/lzo
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/quentin/unblob/unblob/cli.py", line 54, in main
    cli.invoke(ctx)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/quentin/unblob/unblob/cli.py", line 41, in cli
    process_file(path.parent, path, extract_root, max_depth=depth)
  File "/home/quentin/unblob/unblob/processing.py", line 28, in process_file
    log.info("Found directory")
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_log_levels.py", line 118, in meth
    return self._proxy_to_logger(name, event, **kw)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 202, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 159, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "/home/quentin/unblob/unblob/logging.py", line 31, in convert_type
    rel_path = value.relative_to(extract_root)
  File "/usr/lib/python3.8/pathlib.py", line 908, in relative_to
    raise ValueError("{!r} does not start with {!r}"
ValueError: '/tmp/lzo' does not start with '/home/quentin/unblob'

We should fix it here https://github.com/IoT-Inspector/unblob/blob/main/unblob/logging.py#L31

Assigning it to @kissgyorgy given that he worked on f565ed2, which looks similar.
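
A defensive version of the conversion in logging.py could simply fall back to the absolute path (a sketch, not the final patch):

from pathlib import Path

def format_path(value: Path, extract_root: Path) -> str:
    # Log paths relative to extract_root when possible, otherwise as-is,
    # so inputs outside the extraction root no longer raise ValueError.
    try:
        return str(value.relative_to(extract_root))
    except ValueError:
        return str(value)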

Calculate entropy for UnknownChunks

  • calculate shannon_entropy (see the sketch below)
  • log record
  • Pretty graph with entropy calculated for chunks
  • calculate entropy for input files if we couldn't find anything in them
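
For the first item, a minimal Shannon entropy sketch (bits per byte, standalone Python):

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    length = len(data)
    return -sum(
        (count / length) * math.log2(count / length)
        for count in Counter(data).values()
    )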

Fix depth calculation

In process_file the depth comparison is wrong; it only works properly when the depth is the same as the default.
Counting from 0 up to depth might be more intuitive.

Carve out unknown chunks

At the moment we only emit a warning about unknown chunks, but we should save them to files for further investigation.
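
A carving sketch (hypothetical helper; the <start>-<end>.unknown naming matches the test output quoted further below):

from pathlib import Path
from typing import BinaryIO

def carve_unknown_chunk(extract_dir: Path, file: BinaryIO, start: int, end: int) -> Path:
    # Save the raw bytes of an unknown chunk for later investigation.
    file.seek(start)
    out_path = extract_dir / f"{start}-{end}.unknown"
    out_path.write_bytes(file.read(end - start))
    return out_path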

Support out-of-tree extensions for handling additional file formats

As a security researcher, I need a quick way to add support for additional extractors, so that I can analyze currently unsupported firmware images.

An extension is just a Python file stored in a well-known and/or configurable location that should be picked up without any hassle.
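
A sketch of how such a location could be picked up (illustrative; no specific plugin framework assumed):

import importlib.util
from pathlib import Path

def load_extensions(extension_dir: Path) -> list:
    # Import every .py file from a well-known directory so that any handler
    # classes defined there can register themselves on import.
    modules = []
    for path in sorted(extension_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules.append(module)
    return modules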

Unit tests depend on filesystem enumeration order

When run on tmpfs on my machine:

tests/test_extractor.py::TestCarveUnknownChunks::test_multiple_chunks FAILED                                          [  1%]
>>>>>>>>>>>>>>>>>>>> captured stdout >>>>>>>>>>>>>>>>>>>>
2021-12-08 21:16.50 [warning  ] Found unknown Chunks           chunks=[0x0-0x4, 0x4-0x9]
2021-12-08 21:16.50 [info     ] Extracting unknown chunk       chunk=0x0-0x4 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown
2021-12-08 21:16.50 [info     ] Extracting unknown chunk       chunk=0x4-0x9 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown
>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>
self = <test_extractor.TestCarveUnknownChunks object at 0x7f5be8134040>, tmp_path = PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0')

    def test_multiple_chunks(self, tmp_path: Path):
        content = b"test file"
        test_file = io.BytesIO(content)
        chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
        carve_unknown_chunks(tmp_path, test_file, chunks)
        written_path1 = tmp_path / "0-4.unknown"
        written_path2 = tmp_path / "4-9.unknown"
>       assert list(tmp_path.iterdir()) == [written_path1, written_path2]
E       AssertionError: assert [PosixPath('/...0-4.unknown')] == [PosixPath('/...4-9.unknown')]
E         At index 0 diff: PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown') != PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown')
E         Full diff:
E           [
E         +  PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E            PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown'),
E         -  PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E           ]

tests/test_extractor.py:30: AssertionError
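
A minimal fix sketch is to stop relying on enumeration order in the assertion:

# tmpfs does not guarantee any directory ordering, so compare a sorted listing:
assert sorted(tmp_path.iterdir()) == [written_path1, written_path2]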

Don't emit UnknownChunks for whole files

When an unknown chunk covers the whole file we are processing, it makes no sense to issue a warning about it or to try to carve it out.
Handlers in later priority might pick up those files.
We wanted to have the UnknownChunk represent gaps in known files anyway.

Show external dependencies

We need a function which can list all the required third-party dependencies that are needed to extract all the file types we support.

  • --show-external-dependencies lists all required commands on separate lines (maybe suggest Ubuntu package names to install?)
    Put a checkmark or a cross next to every dependency; exit with exit code 1 if something is missing, 0 if we have every dependency.
    (eager option with a separate function in Click; a sketch follows this list)
  • Put a comma-separated list at the end of the help text: "You also need these commands to be able to extract the supported file types"
  • Run this command in the Docker build GitHub Action
  • Put the command name in the error message when the command is not found 😄
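
A sketch of the eager Click option mentioned in the first item (the command list and flag wiring are illustrative):

import shutil
import click

EXTERNAL_COMMANDS = ["7z", "unsquashfs"]  # illustrative subset

def show_external_dependencies(ctx, param, value):
    if not value or ctx.resilient_parsing:
        return
    missing = False
    for command in EXTERNAL_COMMANDS:
        found = shutil.which(command) is not None
        missing = missing or not found
        click.echo(f"{'✓' if found else '✗'} {command}")
    # Exit code 1 if anything is missing, 0 otherwise.
    ctx.exit(1 if missing else 0)

@click.command()
@click.option(
    "--show-external-dependencies",
    is_flag=True,
    callback=show_external_dependencies,
    expose_value=False,
    is_eager=True,
)
def cli():
    ...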

Notice truncated or corrupted files

When we find a corrupt chunk in a file, we should be able to extract up to that point, return the corrupted region as an UnknownChunk, and issue a warning log message.

Improve ARC handling

The ARC handler fails when run against randomly generated content:

2021-12-02 15:25.12 [info     ] Calculating chunk for YARA match identifier=$arc_magic real_offset=0x554816 start_offset=0x554816
2021-12-02 15:25.12 [debug    ] Header parsed                  header=
00000000  1a 03 1b da e2 92 33 be  29 e5 72 b0 7b 71 fe c8   ......3.).r.{q..
00000010  80 11 57 34 68 96 3e 58  2a 9a 0a cc 26            ..W4h.>X*...&

struct heads:
- archive_marker: 0x1a
- header_type: 0x3
- name: b'\x1b\xda\xe2\x923\xbe)\xe5r\xb0{q\xfe'
- size: 0x571180c8
- date: 0x6834
- time: 0x3e96
- crc: 0x2a58
- length: 0x26cc0a9a
2021-12-02 15:25.12 [error    ] Unhandled Exception during chunk calculation 
Traceback (most recent call last):
  File "/home/quentin/unblob/unblob/strategies.py", line 50, in search_chunks_by_priority
    chunk = handler.calculate_chunk(limited_reader, real_offset)
  File "/home/quentin/unblob/unblob/handlers/archive/arc.py", line 67, in calculate_chunk
    header = self.parse_header(file)
  File "/home/quentin/unblob/unblob/models.py", line 119, in parse_header
    header = self._struct_parser.parse(self.HEADER_STRUCT, file, endian)
  File "/home/quentin/unblob/unblob/file_utils.py", line 143, in parse
    return struct_parser(file)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 16, in __call__
    return self.read(*args, **kwargs)
  File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 62, in read
    return self._read(obj)
  File "<compiled heads>", line 14, in _read
EOFError

We should capture the EOFError early in the handler and simply not return a chunk.
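
A sketch of that guard (parse_header and ValidChunk are the pieces visible in the traceback above; this is not the final patch):

from typing import Optional

class ARCHandler:  # only the guard is shown
    def calculate_chunk(self, file, start_offset: int) -> Optional["ValidChunk"]:
        try:
            header = self.parse_header(file)
        except EOFError:
            # Random data matched the ARC magic but a full header could not
            # be read; report no chunk instead of raising.
            return None
        ...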

Parallelize file processing

We want to run the processing in a multiprocessing.pool.Pool.

In order to simplify the implementation, and for measuring things, we would do this in 2 steps:

  1. We can run every new top-level process_file call in the pool (see the sketch after this list).
  2. We can look into yara.match callbacks and start the handling in the pool right away. (For the current code, this would need more complex changes.)
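
A sketch of step 1 (process_file is called exactly as in the tracebacks above; the wrapper names are made up):

import multiprocessing
from functools import partial
from pathlib import Path

def process_one(path: Path, extract_root: Path, depth: int) -> None:
    process_file(path.parent, path, extract_root, max_depth=depth)

def process_all(paths, extract_root: Path, depth: int) -> None:
    # Fan the top-level process_file calls out to a pool of worker processes.
    worker = partial(process_one, extract_root=extract_root, depth=depth)
    with multiprocessing.Pool() as pool:
        pool.map(worker, paths)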

Fail on missing integration tests

When we have a handler which doesn't have integration tests in its folder, the test suite should fail.
Currently we are only collecting the existing tests.
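
One way to enforce this is a guard test parametrized over the registered handlers instead of over the existing test folders (handler names and paths here are illustrative):

from pathlib import Path

import pytest

HANDLER_NAMES = ["tar", "zip", "arc"]  # would be derived from the handler registry
INTEGRATION_ROOT = Path(__file__).parent / "integration"

@pytest.mark.parametrize("handler_name", HANDLER_NAMES)
def test_handler_has_integration_files(handler_name):
    handler_dir = INTEGRATION_ROOT / handler_name
    assert handler_dir.is_dir() and any(handler_dir.iterdir()), (
        f"no integration test files for handler {handler_name}"
    )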

Metadata file

Store the metadata in extract_root as one JSON file.

We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, and JSON is easy to look at.

For example:


from typing import Optional

import attr


@attr.define
class Metadata:
    filename: Optional[str]
    size: Optional[int]
    perms: Optional[int]
    endianness: Optional[str]
    uid: Optional[int]
    username: Optional[str]
    gid: Optional[int]
    groupname: Optional[str]
    inode: Optional[int]
    vnode: Optional[int]


@attr.define
class Chunk:
    """Chunk of a Blob; has start and end offsets, but can still be invalid."""

    start_offset: int
    # This is the last byte included
    end_offset: int
    handler: "Handler" = attr.ib(init=False, eq=False)
    metadata: Optional[Metadata]

Compile YARA rules only once

Currently we are dynamically compiling the YARA rules, and not caching the results. The compilation is also printed multiple times, which is very noisy in the verbose output.
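
A sketch of the caching side (ALL_RULES_SOURCE stands in for however the combined rule text is produced):

from functools import lru_cache

import yara

@lru_cache(maxsize=1)
def get_compiled_rules() -> "yara.Rules":
    # Compile the combined rule source once and reuse the result, so the
    # compilation (and its log output) happens a single time per run.
    return yara.compile(source=ALL_RULES_SOURCE)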

Write Dockerfile

A simple Dockerfile which has all required external dependencies installed.

Refactor Chunk related functions

We don't return UnknownChunks from search_chunks_by_priority and we don't use the base Chunk model anymore, so we can get rid of the abstraction.

AR archive handler emits empty valid chunks on malformed AR archives

The AR archive handler emits empty valid chunks with start offset 0 and end offset 0 when an exception is triggered by a malformed archive.

This is due to the current implementation that emits valid chunks regardless of the exception state:

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...
offset = ar.file.tell()
return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)
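
A fix sketch based on the excerpt above: track whether arpy actually read anything and bail out otherwise.

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...
offset = ar.file.tell()
if offset == 0:
    # arpy could not read a single valid header, so this is not an AR
    # archive; emit no chunk instead of a 0-0 ValidChunk.
    return None
return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)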
