onekey-sec / unblob
Extract files from any kind of container format
Home Page: https://unblob.org
License: Other
The ARC handler fails when run against randomly generated content:
2021-12-02 15:25.12 [info ] Calculating chunk for YARA match identifier=$arc_magic real_offset=0x554816 start_offset=0x554816
2021-12-02 15:25.12 [debug ] Header parsed header=
00000000 1a 03 1b da e2 92 33 be 29 e5 72 b0 7b 71 fe c8 ......3.).r.{q..
00000010 80 11 57 34 68 96 3e 58 2a 9a 0a cc 26 ..W4h.>X*...&
struct heads:
- archive_marker: 0x1a
- header_type: 0x3
- name: b'\x1b\xda\xe2\x923\xbe)\xe5r\xb0{q\xfe'
- size: 0x571180c8
- date: 0x6834
- time: 0x3e96
- crc: 0x2a58
- length: 0x26cc0a9a
2021-12-02 15:25.12 [error ] Unhandled Exception during chunk calculation
Traceback (most recent call last):
File "/home/quentin/unblob/unblob/strategies.py", line 50, in search_chunks_by_priority
chunk = handler.calculate_chunk(limited_reader, real_offset)
File "/home/quentin/unblob/unblob/handlers/archive/arc.py", line 67, in calculate_chunk
header = self.parse_header(file)
File "/home/quentin/unblob/unblob/models.py", line 119, in parse_header
header = self._struct_parser.parse(self.HEADER_STRUCT, file, endian)
File "/home/quentin/unblob/unblob/file_utils.py", line 143, in parse
return struct_parser(file)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 16, in __call__
return self.read(*args, **kwargs)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/dissect/cstruct/types/base.py", line 62, in read
return self._read(obj)
File "<compiled heads>", line 14, in _read
EOFError
We should capture the EOFError early in the handler and simply not return a chunk.
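A minimal sketch of what that could look like in the ARC handler (the method names follow the traceback above; the class name, imports, and exact calculate_chunk signature are assumptions):

from typing import Optional

from unblob.models import Handler, ValidChunk  # assumption: both live in unblob.models

class ARCHandler(Handler):  # hypothetical name for the ARC handler class
    def calculate_chunk(self, file, start_offset: int) -> Optional[ValidChunk]:
        try:
            header = self.parse_header(file)
        except EOFError:
            # The file ended in the middle of the header: this is a false
            # positive YARA match, so don't return a chunk at all.
            return None
        # ... continue with the normal end offset calculation using `header`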
We need to measure how fast unblob as a whole can operate and what strategy can speed up extraction significantly.
Example question we want to answer: which is faster, matching on all YARA patterns at once, or iterating over the file multiple times with fewer patterns?
Measure different scenarios:
Documentation available in the wiki.
TODO: speed extraction up.
E.g. the merge base should be the top of main, and the PR shouldn't contain merge commits.
If you download the latest Ubuntu LTS image and run unblob on it, it will crash with a segmentation fault.
When we find a corrupt chunk in a file, we should be able to extract up to that point, return the corrupted chunk as UnknownChunk, and issue a warning log message.
exit_code_var, so global state will not be stored.
When run in tmpfs on my machine:
tests/test_extractor.py::TestCarveUnknownChunks::test_multiple_chunks FAILED [ 1%]
>>>>>>>>>> captured stdout >>>>>>>>>>
2021-12-08 21:16.50 [warning ] Found unknown Chunks chunks=[0x0-0x4, 0x4-0x9]
2021-12-08 21:16.50 [info ] Extracting unknown chunk chunk=0x0-0x4 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown
2021-12-08 21:16.50 [info ] Extracting unknown chunk chunk=0x4-0x9 path=/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown
>>>>>>>>>> traceback >>>>>>>>>>
self = <test_extractor.TestCarveUnknownChunks object at 0x7f5be8134040>, tmp_path = PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0')
def test_multiple_chunks(self, tmp_path: Path):
content = b"test file"
test_file = io.BytesIO(content)
chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
carve_unknown_chunks(tmp_path, test_file, chunks)
written_path1 = tmp_path / "0-4.unknown"
written_path2 = tmp_path / "4-9.unknown"
> assert list(tmp_path.iterdir()) == [written_path1, written_path2]
E AssertionError: assert [PosixPath('/...0-4.unknown')] == [PosixPath('/...4-9.unknown')]
E At index 0 diff: PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown') != PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown')
E Full diff:
E [
E + PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/0-4.unknown'),
E - PosixPath('/run/user/1000/pytest-of-vlaci/pytest-2/test_multiple_chunks0/4-9.unknown'),
E ]
tests/test_extractor.py:30: AssertionError
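Path.iterdir() does not guarantee any ordering, which is what trips the assertion on tmpfs. A minimal fix, keeping the test from the traceback and only making the comparison order independent:

def test_multiple_chunks(self, tmp_path: Path):
    content = b"test file"
    test_file = io.BytesIO(content)
    chunks = [UnknownChunk(0, 4), UnknownChunk(4, 9)]
    carve_unknown_chunks(tmp_path, test_file, chunks)
    written_path1 = tmp_path / "0-4.unknown"
    written_path2 = tmp_path / "4-9.unknown"
    # iterdir() order is filesystem dependent, so compare sorted lists instead.
    assert sorted(tmp_path.iterdir()) == sorted([written_path1, written_path2])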
Store the metadata file in extract_root, in one JSON file.
We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, and JSON is easy to look at.
For example:
from typing import Optional

import attr  # assuming attrs here, as it is already used for Chunk below

@attr.define
class Metadata:
    filename: Optional[str]
    size: Optional[int]
    perms: Optional[int]
    endianness: Optional[str]
    uid: Optional[int]
    username: Optional[str]
    gid: Optional[int]
    groupname: Optional[str]
    inode: Optional[int]
    vnode: Optional[int]
@attr.define
class Chunk:
"""Chunk of a Blob, have start and end offset, but still can be invalid."""
start_offset: int
# This is the last byte included
end_offset: int
handler: "Handler" = attr.ib(init=False, eq=False)
metadata: Optional[Metadata]
Currently we are dynamically compiling the YARA rules, and not caching the results. The compilation is also printed multiple times, which is very noisy in the verbose output.
Show metadata about found Chunks.
When merging the formats, we have to solve merge conflicts all the time. A decorator-based approach would be less problematic.
Something like:
@register(priority=0)
class SomeHandler:
...
Or search for subclasses of Handler.
Don't forget we have to import them, so it might be necessary to use something like venusian?
Also we need to think about how this will fit into the strategy implementation.
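A minimal sketch of the decorator-based registration, assuming a simple module-level registry (the HANDLER_REGISTRY name and its shape are hypothetical):

from typing import Dict, List

# Hypothetical module-level registry: priority -> list of handler classes.
HANDLER_REGISTRY: Dict[int, List[type]] = {}

def register(priority: int = 0):
    """Class decorator that records a handler class under the given priority."""
    def decorator(cls):
        HANDLER_REGISTRY.setdefault(priority, []).append(cls)
        return cls
    return decorator

@register(priority=0)
class SomeHandler:
    ...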
There are some file formats which would be too big to commit into the Git repo. We need to generate these files or use Git LFS for them: https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage
Or just use Git LFS for ALL the integration test files.
We are currently reading the whole file into memory when carving out chunks, which is obviously terrible. There is a function for that in the standard library, shutil.copyfileobj; we should use that instead.
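A hedged sketch of streaming carving; shutil.copyfileobj copies until EOF, so carving a bounded chunk needs a small size-capped loop around the same idea (the function names are placeholders):

import shutil
from pathlib import Path
from typing import BinaryIO

BUFFER_SIZE = 1024 * 1024  # stream in 1 MiB pieces instead of reading the whole chunk

def carve_to_eof(file: BinaryIO, start_offset: int, out_path: Path) -> None:
    # When the chunk runs until the end of the file, copyfileobj does all the work.
    file.seek(start_offset)
    with out_path.open("wb") as out:
        shutil.copyfileobj(file, out, BUFFER_SIZE)

def carve_range(file: BinaryIO, start_offset: int, length: int, out_path: Path) -> None:
    # For a bounded chunk, cap the copy at `length` bytes.
    file.seek(start_offset)
    remaining = length
    with out_path.open("wb") as out:
        while remaining > 0:
            data = file.read(min(BUFFER_SIZE, remaining))
            if not data:
                break
            out.write(data)
            remaining -= len(data)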
Right now we return an UnknownChunk for any YARA match where its handler failed to compute a ValidChunk. This can be due to YARA matching on something that is not an archive format, the archive header being corrupted, etc.
To me, an unknown chunk should represent chunks in-between valid chunks for which we did not match. We should not emit structures for something that our handlers failed to process.
This can be problematic when running unblob on a 2.5 GB ISO image of Ubuntu. We will have a valid ISO9660 chunk, followed by hundreds of UnknownChunks being printed out on the console. Mostly "fake" ZIP files in my case.
My solution would be to simply return None when a format handler cannot properly identify a chunk.
A simple Dockerfile which has all required external dependencies installed.
RomFS filesystem format is not handled by 7zip and there is no standalone extractor on the market. That means we need to mount the filesystem somewhere, copy its content to our output directory, and then unmount the filesystem.
This is what most automated tools do (e.g. uClinux extract-romfs).
This is a problem because mount requires root privileges.
This is what a naive implementation would do:
mkdir -p "${outdir}"
mkdir -p "${outdir}_tmp"
sudo mount -o loop -t romfs "${infile}" "${outdir}_tmp"
cp -r "${outdir}_tmp"/* "${outdir}/"
sudo umount "${outdir}_tmp"
We need to either:
Establish a clear way of doing releases from this project.
Some ideas (without any kind of order / importance whatsoever):
Either on errors or with debug output, add a flag which prints the specified chunks with cstruct.dumpstruct to make it easier to debug file formats.
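A minimal sketch of what such debug output could use, assuming dissect.cstruct's dumpstruct helper, a hypothetical struct definition, and a placeholder input file:

from dissect.cstruct import cstruct, dumpstruct

cparser = cstruct()
cparser.load("""
struct example_header {
    char   magic[4];
    uint32 size;
};
""")

with open("/tmp/firmware.img", "rb") as f:  # placeholder input
    header = cparser.example_header(f)
    # Prints the parsed fields alongside a hexdump of the raw header bytes.
    dumpstruct(header)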
My ideas:
Usage: unblob [OPTIONS] [FILES]...
Options:
-e, --extract-dir DIRECTORY Extract the files to this directory. Will be
created if doesn't exist.
-d, --depth INTEGER Recursion depth. How deep should we extract
containers.
-v, --verbose Verbose mode, enable debug logs.
--help Show this message and exit.
Traceback (most recent call last):
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/bin/unblob", line 5, in <module>
main()
File "/home/walkman/Projects/unblob/unblob/cli.py", line 46, in main
ctx = cli.make_context("unblob", sys.argv[1:])
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 914, in make_context
self.parse_args(ctx, args)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1370, in parse_args
value, args = param.handle_parse_result(ctx, opts, args)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2347, in handle_parse_result
value = self.process_value(ctx, value)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 2309, in process_value
value = self.callback(ctx, self, value)
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 1271, in show_help
ctx.exit()
File "/home/walkman/Projects/unblob/.direnv/python-3.8.12/lib/python3.8/site-packages/click/core.py", line 681, in exit
raise Exit(code)
click.exceptions.Exit: 0
We should add the unar dependency to our GitHub workflow definition in a separate branch and then rebase on it in the branches that require it.
Currently done in the RAR branch (see https://github.com/IoT-Inspector/unblob/pull/33/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957f).
But it will be required by PR #33 and #51.
Doing it separately will make the history cleaner, I think.
unblob fails when receiving a path outside its own root:
poetry run unblob -v /tmp/lzo
2021-12-03 10:35.22 [info ] Start processing files count=0x1
2021-12-03 10:35.22 [info ] Start processing file path=/tmp/lzo
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/quentin/unblob/unblob/cli.py", line 54, in main
cli.invoke(ctx)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/quentin/unblob/unblob/cli.py", line 41, in cli
process_file(path.parent, path, extract_root, max_depth=depth)
File "/home/quentin/unblob/unblob/processing.py", line 28, in process_file
log.info("Found directory")
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_log_levels.py", line 118, in meth
return self._proxy_to_logger(name, event, **kw)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 202, in _proxy_to_logger
args, kw = self._process_event(method_name, event, event_kw)
File "/home/quentin/.cache/pypoetry/virtualenvs/unblob-U71Ryf-L-py3.8/lib/python3.8/site-packages/structlog/_base.py", line 159, in _process_event
event_dict = proc(self._logger, method_name, event_dict)
File "/home/quentin/unblob/unblob/logging.py", line 31, in convert_type
rel_path = value.relative_to(extract_root)
File "/usr/lib/python3.8/pathlib.py", line 908, in relative_to
raise ValueError("{!r} does not start with {!r}"
ValueError: '/tmp/lzo' does not start with '/home/quentin/unblob'
We should fix it here https://github.com/IoT-Inspector/unblob/blob/main/unblob/logging.py#L31
Assigning it to @kissgyorgy given that he worked on f565ed2, which looks similar.
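A hedged sketch of a possible fix in unblob/logging.py, falling back to the absolute path when the file is outside extract_root (the helper name is hypothetical; the relative_to call is the one from the traceback):

from pathlib import Path

def format_path(value: Path, extract_root: Path) -> str:
    try:
        # Show paths relative to extract_root when possible...
        return str(value.relative_to(extract_root))
    except ValueError:
        # ...but fall back to the absolute path for files outside of it,
        # instead of crashing with "does not start with".
        return str(value)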
We want to run the processing in a multiprocessing.pool.Pool.
In order to simplify the implementation, and for measuring things, we would do this in 2 steps; eventually, use yara.match callbacks and start the handling in the pool right away (for the current code, this would need more complex changes).
The offset computation we came up with in the TarHandler at _get_tar_end_offset is inexact for specific files.
I surmise it has to do with block padding and block alignment. Without a fix, unblob finds the end offset past the end of the file and discards the chunk.
I can provide a sample internally if someone wants to work on this:
sha1sum /tmp/firmware.img
fc910cc004254fda5eef27c88f943296d3df967e /tmp/firmware.img
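Since tar stores data in fixed 512-byte blocks and pads archives out to a block boundary, the fix probably involves rounding the computed offset up to the next block; a hedged sketch (how this plugs into _get_tar_end_offset is an assumption):

BLOCK_SIZE = 512  # tar writes everything in fixed 512-byte blocks

def round_up_to_block(offset: int) -> int:
    """Round a raw offset up to the next 512-byte block boundary."""
    remainder = offset % BLOCK_SIZE
    if remainder == 0:
        return offset
    return offset + BLOCK_SIZE - remainder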
Save the compiled YARA rules during build and just load it at runtime, no need to compile it every single time.
https://yara.readthedocs.io/en/stable/yarapython.html
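A minimal sketch with yara-python (the rules and compiled output file names are placeholders):

import yara

# At build time: compile once and serialize the result.
rules = yara.compile(filepath="unblob_rules.yar")
rules.save("unblob_rules.compiled")

# At runtime: load the precompiled rules instead of recompiling on every run.
rules = yara.load("unblob_rules.compiled")
matches = rules.match(filepath="/tmp/firmware.img")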
We want to make it possible to run unblob in many environments easily.
PyOxidizer is a great tool for making fully standalone ELF binaries: https://github.com/indygreg/PyOxidizer
As a security researcher, I need a quick way to add support for additional extractors, so that I can analyze currently unsupported firmware images.
An extension is just a Python file stored in a well-known and/or configurable location that should be picked up without any hassle.
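A minimal sketch of such a loader, assuming extensions are plain .py files in a configurable directory (the function name is hypothetical):

import importlib.util
from pathlib import Path

def load_extensions(plugin_dir: Path) -> None:
    """Import every .py file in plugin_dir so the handlers it defines get registered."""
    for path in sorted(plugin_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Executing the module runs any registration side effects it contains.
        spec.loader.exec_module(module)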
Whenever a chunk matching the whole file is found, we should just return it and stop processing others.
We need to have GitHub Actions for running:
When we have extracted a file, it makes no sense to issue a warning about it or try to carve it out.
Handlers with later priority might pick up those files.
We wanted the UnknownChunk to represent gaps in known files anyway.
Have a scripts dir or a Justfile in the repo, which we can use to reproduce all the test files in case we need to change them a bit, or just to make them reproducible.
We must ensure they can seek and read from the same file independently, so they can be run in parallel.
At the moment we only emit a warning about unknown chunks, but we should save them to files for further investigation.
All format handlers calculate the expected size of a chunk. We should implement a generic check that verifies whether the calculated end_offset points past the actual end of the file.
Something along those lines:
import io
import os

def overflow(file: io.BufferedReader, end_offset: int) -> bool:
    # Compare the calculated end offset against the real file size.
    file.seek(0, os.SEEK_END)
    size = file.tell()
    return end_offset > size
This is an easy win in terms of heuristics, to get rid of invalid and corrupted headers.
So I'm starting to fuzz each format at the moment.
A really dumb technique to fuzz the AR format:
#!/bin/bash
while true; do
printf '\x21\x3C\x61\x72\x63\x68\x3E\x0A' > /tmp/file.bin;
dd if=/dev/random count=58 bs=1 >> /tmp/file.bin;
printf '\x60\x0A' >> /tmp/file.bin
dd if=/dev/random count=1 bs=1M >> /tmp/file.bin;
poetry run unblob /tmp/file.bin;
rm -rf file.bin*
done
The idea is to craft a file that will trigger a YARA match, with random bytes wherever we can put them.
The errors are well handled and do not lead to crashes; however, we noticed a problem in the exception handling. On error, the handler seeks back by HEADER_LENGTH, which is set to 60, but the exact size of the header is actually 68.
except arpy.ArchiveFormatError as exc:
logger.debug(
"Hit an ArchiveFormatError, we've probably hit some other kind of data",
exc_info=exc,
)
# Since arpy has tried to read another file header, we need to wind the cursor back the
# length of the header, so it points to the end of the AR chunk.
ar.file.seek(-HEADER_LENGTH, os.SEEK_CUR)
This is obvious when the unknown chunk is logged:
2021-12-15 15:57.37 [warning ] Found unknown Chunks chunks=[0x8-0x100044]
We should fix this by changing the HEADER_LENGTH value.
In process_file the depth comparison is wrong; it only works properly when the depth is the same as the default.
Counting from 0 to depth might be more intuitive.
Our integration tests trigger warnings for unknown chunks when parsing CPIO and TAR formats. This should not happen.
It's probably an end offset miscalculation due to padding, or test files not respecting the standard.
The AR archive handler emits empty valid chunks with start offset 0 and end offset 0 when an exception is triggered by a malformed archive.
This is due to the current implementation that emits valid chunks regardless of the exception state:
try:
ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
# debug log
...
offset = ar.file.tell()
return ValidChunk(
start_offset=start_offset,
end_offset=start_offset + offset,
)
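A hedged sketch of a possible fix, returning no chunk when not even one header could be read (using offset == 0 as the "nothing was parsed" signal is an assumption):

try:
    ar.read_all_headers()
except arpy.ArchiveFormatError as exc:
    # debug log
    ...

offset = ar.file.tell()

# If arpy could not read a single header, this was a false positive match,
# so don't emit an empty ValidChunk for it.
if offset == 0:
    return None

return ValidChunk(
    start_offset=start_offset,
    end_offset=start_offset + offset,
)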
We are skipping some errors that happened during the run, but we should keep track of them and exit at the end with a non-zero code.
For example, when we couldn't extract an archive, the unblob process itself should exit with an error code.
When the firmware is in a different folder, e.g. /tmp/firmware.bin, we get an Exception.
We need a function which can list all the required third-party dependencies needed to extract all the file types we support.
--show-external-dependencies should list all required scripts, one per line (maybe suggest Ubuntu package names to install?).
When an extraction fails, store the error messages (maybe the whole standard error) of the extractor, because we want to show that to the users.
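A hedged sketch of how that listing could work, assuming each handler exposes a hypothetical EXTERNAL_DEPENDENCIES attribute:

import shutil
from typing import Iterable

def show_external_dependencies(handlers: Iterable[type]) -> None:
    """Print every external tool the given handlers rely on, one per line."""
    tools = set()
    for handler in handlers:
        # EXTERNAL_DEPENDENCIES is a hypothetical per-handler attribute here.
        tools.update(getattr(handler, "EXTERNAL_DEPENDENCIES", ()))
    for tool in sorted(tools):
        status = "ok" if shutil.which(tool) else "MISSING"
        print(f"{tool}\t{status}")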
The AR archive handler calculates the end offset as if the AR archive always starts at the beginning of the file being analyzed. This is wrong.
We don't return UnknownChunks from search_chunks_by_priority and we don't use the base Chunk model anymore, so we can get rid of the abstraction.
Only after this: #71. Then start the handling in the pool right away. (For the current code, this would need more complex changes.)
When we have a handler which doesn't have integration tests in the test folder, the test suite should fail.
Currently we are only collecting existing tests.
Click is catching EOFError and quitting with an Aborted! message.
We are quite often getting this exception during development, because of the nature of the project, but we need the traceback very much to make our life easier.
https://click.palletsprojects.com/en/7.x/exceptions/?highlight=standalone_mode
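A minimal sketch of the approach from the linked Click docs, assuming the Click group is exposed as cli in unblob.cli; with standalone_mode=False, Click propagates exceptions instead of converting them into an Aborted! message:

import sys

from unblob.cli import cli  # assumption: the Click group is exposed as `cli`

def main() -> int:
    # standalone_mode=False makes Click re-raise exceptions (including EOFError)
    # so we get a real traceback during development.
    return cli.main(args=sys.argv[1:], standalone_mode=False) or 0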