cmustrudel / dirty
DIRTY: Augmenting Decompiler Output with Learned Variable Names and Types
License: MIT License
Hello. Thanks for your kindness in sharing such a good project.
Could you please explain why we need the encode_memory function:
@staticmethod
def encode_memory(mems):
    """Encode memory to ids
    <pad>: 0
    <SEP>: 1
    <unk>: 2
    mem_id: mem_offset + 3
    """
    ret = []
    for mem in mems[: VocabEntry.MAX_MEM_LENGTH]:
        if mem == "<SEP>":
            ret.append(1)
        elif mem > VocabEntry.MAX_STACK_SIZE:
            ret.append(2)
        else:
            ret.append(3 + mem)
    return ret
and why it attempts to compare integers with '<SEP>'?
I can't make sense of appending an int token to an array of int tokens instead of just using the int token.
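To make the behaviour concrete, here is a minimal standalone sketch of the same logic. The two constants are assumptions chosen for illustration and may differ from the values on `VocabEntry`. In this reading, `mems` is a list that mixes integer stack offsets with the string marker `"<SEP>"`, so the `==` check only detects the marker, and each branch appends one integer id to the result list.

```python
# Assumed values for illustration; the real ones live on VocabEntry.
MAX_MEM_LENGTH = 512
MAX_STACK_SIZE = 1024


def encode_memory(mems):
    """Map a memory layout to ids: <pad>=0, <SEP>=1, <unk>=2, offset -> offset + 3."""
    ret = []
    for mem in mems[:MAX_MEM_LENGTH]:
        if mem == "<SEP>":
            # Separator marker between groups of offsets.
            ret.append(1)
        elif mem > MAX_STACK_SIZE:
            # Offsets beyond the supported range collapse to <unk>.
            ret.append(2)
        else:
            # Shift by 3 so real offsets never collide with the special ids.
            ret.append(3 + mem)
    return ret


encoded = encode_memory([0, 8, "<SEP>", 16, 5000])
print(encoded)  # [3, 11, 1, 19, 2]
```

The `+ 3` shift is what keeps ids 0, 1 and 2 free for `<pad>`, `<SEP>` and `<unk>`.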
My second question is about this one:
def var_loc_in_func(loc):
    print(" TODO: fix the magic number for computing vocabulary idx")
    if isinstance(loc, Register):
        return 1030 + self.vocab.regs[loc.name]
    else:
        from utils.vocab import VocabEntry
        return (
            3 + stack_start_pos - loc.offset
            if stack_start_pos - loc.offset < VocabEntry.MAX_STACK_SIZE
            else 2
        )
What does the constant 1030 do, and why is it needed?
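One possible reading, offered as a sketch rather than the authors' explanation: if `MAX_STACK_SIZE` were 1024 (an assumption here), stack locations would occupy ids up to 3 + 1023 = 1026, and an offset of 1030 would place register ids above that range so the two kinds of location never share an id. The `Register` and `Stack` classes, the `reg_ids` table and the function name below are hypothetical stand-ins for the repository's own types.

```python
from dataclasses import dataclass

MAX_STACK_SIZE = 1024  # assumed value, not confirmed from the repository
REGISTER_BASE = 1030   # the magic number from the snippet in question


@dataclass
class Register:  # hypothetical stand-in for the repository's Register
    name: str


@dataclass
class Stack:  # hypothetical stand-in for a stack location
    offset: int


def var_loc_id(loc, stack_start_pos, reg_ids):
    """Return a single vocabulary id for a register or stack location."""
    if isinstance(loc, Register):
        # Registers are shifted past the whole stack-offset id range.
        return REGISTER_BASE + reg_ids[loc.name]
    rel = stack_start_pos - loc.offset
    # In-range stack offsets are shifted by 3; anything else becomes <unk> (2).
    return 3 + rel if rel < MAX_STACK_SIZE else 2


reg_ids = {"rax": 0, "rdi": 1}  # hypothetical register vocabulary
stack_id = var_loc_id(Stack(offset=8), stack_start_pos=24, reg_ids=reg_ids)
reg_id = var_loc_id(Register("rdi"), stack_start_pos=24, reg_ids=reg_ids)
print(stack_id, reg_id)  # 19 1031
```

Under these assumptions the largest stack id is 1026, which stays below the first register id of 1030.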
And in general, why do we define the tokens like this:
self.word2id["<pad>"] = PAD_ID
self.word2id["<s>"] = 1
self.word2id["</s>"] = 2
self.word2id["<unk>"] = 3
self.word2id[SAME_VARIABLE_TOKEN] = 4
but use them like this:
<pad>: 0
<SEP>: 1
<unk>: 2
mem_id: mem_offset + 3
Sorry if my questions are too much; I specialize in systems programming, and math and ML are a hobby.
Thanks in advance =)
Hello.
Could somebody please explain the expected behaviour of the following steps?
The decompiler code passes, in the debug argument, only (loc: var) pairs for the symbolized version of the function, not the pseudocode.
The preprocessor then filters on the pseudocode output of both the stripped and debug versions, which discards a critically large share of the data and makes the renaming model unpredictable.
So, should both pseudocode versions be involved or not?
If not, would it be correct to filter by comparing keys()?
Thanks.
Hi,
I have recently been following DIRTY. The work is great, and the code repo is impressive. Thanks to the developers for their efforts!
Just want to mention that there might be a typo in dirty/model/model.py:496: the variable in task_targets should be targets instead of preds.
Line 496 in 7a5514a
I received an email confused about the Void type in data that looked like a Union. It turns out there is a subtle bug in the serialization code here:
Lines 918 to 924 in f1f24f4
This caused Unions to be serialized with the Void meta-tag. During deserialization, these are treated as Void types:
Lines 1041 to 1065 in f1f24f4
The serialization bug is simple enough to fix, but this means that the current dataset has this specific bug. I will fix the current dataset, but if you've already downloaded the current one and/or don't want to wait, you'll have to modify the read_metadata method to condition on d having other fields, for example by replacing line 1065 in dire_types.py with this (untested) code:
if d["T"] == 8:
    if "m" in d:
        return Union._from_json(d)
    return Void._from_json(d)
return classes[d["T"]]._from_json(d)
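The workaround can be exercised in isolation with stub classes. The sketch below is only an illustration: `Void`, `Union`, `classes` and `read_metadata` here are minimal stand-ins rather than the definitions in dire_types.py, and treating the `"m"` field as a member list is an assumption.

```python
class Void:  # stand-in for the real Void type
    @classmethod
    def _from_json(cls, d):
        return cls()


class Union:  # stand-in for the real Union type
    def __init__(self, members):
        self.members = members

    @classmethod
    def _from_json(cls, d):
        # Assumption: "m" carries the union's members.
        return cls(d["m"])


VOID_TAG = 8                 # meta-tag value used in the snippet above
classes = {VOID_TAG: Void}   # stand-in for the real tag-to-class table


def read_metadata(d):
    """Dispatch on the meta-tag, recovering Unions mis-tagged as Void."""
    if d["T"] == VOID_TAG:
        if "m" in d:
            # A "Void" record with extra fields is really a Union.
            return Union._from_json(d)
        return Void._from_json(d)
    return classes[d["T"]]._from_json(d)


plain_void = read_metadata({"T": 8})
recovered_union = read_metadata({"T": 8, "m": []})
print(type(plain_void).__name__, type(recovered_union).__name__)  # Void Union
```

A record tagged 8 with no extra fields still deserializes as Void, so correctly serialized data is unaffected.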
Your research seems great to me!
How can I use this research on my IDA .idb file to enhance the decompiled output?
I tried to follow the README but could not work out how to apply it.
I would appreciate a little help. Thanks
I am having an issue running the tox command. I have downloaded the requirements and their specific versions. I downloaded the model as well as the dataset from the links provided. Please let me know how to resolve it. I am obtaining the following error:
`
summary:
mypy: commands succeeded
pipcheck: commands succeeded
ERROR: safety: could not install deps [setuptools ~= 50.3.2, pip ~= 21.1.0, -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt]; v = InvocationError("/home/abhishek/Desktop/dirty/DIRTY-master/dirty/.tox/safety/bin/python -m pip install 'setuptools ~= 50.3.2' 'pip ~= 21.1.0' -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt", -9)
ERROR: pytest: could not install deps [setuptools ~= 50.3.2, pip ~= 21.1.0, -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt]; v = InvocationError("/home/abhishek/Desktop/dirty/DIRTY-master/dirty/.tox/pytest/bin/python -m pip install 'setuptools ~= 50.3.2' 'pip ~= 21.1.0' -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt", -9)
`
The exp.py train executor interfaces with wandb for saving and loading models, and takes checkpoints as optional parameters. It doesn't look like the model is saved anywhere on disk, and the code shows no naming convention for checkpoints. How do we separate the saving and loading functionality from wandb so that it can be executed completely locally?
I installed via the directions and get:
dirty-exp train --cuda --expname=eval_dirty_mt multitask.xfmr.jsonnet --eval-ckpt exp_runs/dirty_mt.ckpt
WARNING:root:unable to load [idaapi], stub loaded instead
WARNING:root:unable to load [ida_auto], stub loaded instead
WARNING:root:unable to load [ida_funcs], stub loaded instead
WARNING:root:unable to load [ida_hexrays], stub loaded instead
WARNING:root:unable to load [ida_kernwin], stub loaded instead
WARNING:root:unable to load [ida_lines], stub loaded instead
WARNING:root:unable to load [ida_pro], stub loaded instead
WARNING:root:unable to load [ida_typeinf], stub loaded instead
WARNING:root:unable to load [idautils], stub loaded instead
Main process id 29909
use random seed 0
Traceback (most recent call last):
  File "/home/ed/Documents/DIRTY/env/bin/dirty-exp", line 33, in <module>
    sys.exit(load_entry_point('cmu-dirty', 'console_scripts', 'dirty-exp')())
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/exp.py", line 131, in main
    train(cmd_args)
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/exp.py", line 51, in train
    percent=float(args["--percent"]),
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/utils/dataset.py", line 171, in __init__
    super().__init__(urls)
  File "/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/webdataset/fluid.py", line 41, in __init__
    raise Exception("Dataset is deprecated; use webdataset.WebDataset instead")
Exception: Dataset is deprecated; use webdataset.WebDataset instead
Here are the versions of packages installed in my venv:
Package Version Editable project location
----------------------- --------- -----------------------------------------
absl-py 1.0.0
aiohttp 3.8.1
aiosignal 1.2.0
async-timeout 4.0.2
asynctest 0.13.0
attrs 21.4.0
braceexpand 0.1.7
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.12
click 8.0.4
cmu-dirty 0.0.0 /home/ed/Documents/DIRTY/dirty/src
configparser 5.2.0
csvnpm-utils 0.0.0 /home/ed/Documents/DIRTY/csvnpm-utils/src
dataclasses 0.8
docker-pycreds 0.4.0
docopt 0.6.2
frozenlist 1.2.0
fsspec 2022.1.0
future 0.18.2
gitdb 4.0.9
GitPython 3.1.18
google-auth 2.6.5
google-auth-oauthlib 0.4.6
grpcio 1.44.0
idna 3.3
idna-ssl 1.1.0
importlib-metadata 4.8.3
joblib 1.1.0
jsonlines 2.0.0
jsonnet 0.17.0
Markdown 3.3.6
multidict 5.2.0
numpy 1.19.5
oauthlib 3.2.0
packaging 21.3
pathtools 0.1.2
pip 21.3.1
pkg_resources 0.0.0
promise 2.3
protobuf 3.19.4
psutil 5.9.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
Pygments 2.9.0
pyparsing 3.0.8
python-dateutil 2.8.2
pytorch-lightning 1.2.10
PyYAML 6.0
requests 2.27.1
requests-oauthlib 1.3.1
rsa 4.8
scikit-learn 0.24.2
scipy 1.5.4
sentencepiece 0.1.96
sentry-sdk 1.5.10
setuptools 59.6.0
shortuuid 1.0.8
six 1.16.0
smmap 5.0.0
subprocess32 3.5.4
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
threadpoolctl 3.1.0
torch 1.8.1
torchmetrics 0.2.0
tqdm 4.60.0
typing_extensions 4.1.1
ujson 4.0.2
urllib3 1.26.9
wandb 0.10.33
webdataset 0.1.103
Werkzeug 2.0.3
wheel 0.37.1
yarl 1.7.2
zipp 3.6.0
Line 61 in 20f779b
I'm interested in obtaining the dataset.
When I attempt to process the test set, the script just hangs. Here is the output:
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
warnings.warn(*args, **kwargs)
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-processing data loading (e.g. batch size > 1), this can lead to unintended side effects since the samples will be duplicated.
warnings.warn(*args, **kwargs)
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/webdataset/dataset.py:403: UserWarning: num_workers 8 > num_shards 1
warnings.warn(f"num_workers {num_workers} > num_shards {len(urls)}")
Testing: 0it [00:00, ?it/s]
Hello, I have read your paper and noticed that you compared the DIRTY_light model with OSPREY in the paper, and achieved very good results in structural recovery. However, I did not see DIRTY_light in the source code repository. Could you please let me know how to obtain DIRTY_light?
I have personally attempted to use DIRTY in many configurations with varying levels of success.
I have run into the following issues:
I would love to see the documentation of prerequisite setup expanded upon, as these have been my biggest headaches.
I would also appreciate some clarification on the following:
That said, I'm really excited to finally try this out and would appreciate any help.
I am trying to figure out how to apply DIRTY to my own binary files, rather than the provided DIRE dataset. I have attempted to build my own binary file into a new test.jar and modify the "test_file" path in multitask.xfmr.jsonnet, but I am struggling to understand the meaning of the content in the .jsonl files within the provided test.jar (e.g., some key names are t, n, u, etc.). I checked the paper but could not find an explanation of these keys. I'm not sure if I am on the right track or if I am missing something crucial. Any help on this matter would be greatly appreciated.
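As a starting point for exploring such files, a small script can at least list which keys occur and how often. This is a generic JSON Lines sketch, not part of DIRTY; the helper name `summarize_keys` and the sample records are made up for illustration, and it does not explain what the keys mean.

```python
import json
from collections import Counter


def summarize_keys(lines):
    """Count how often each top-level key appears across JSON Lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


# Made-up records; with a real file, pass an open file handle instead.
sample = ['{"t": 1, "n": "a"}', "", '{"t": 2, "u": false}']
key_counts = summarize_keys(sample)
print(dict(key_counts))  # {'t': 2, 'n': 1, 'u': 1}
```

For an extracted file, `with open(path) as f: summarize_keys(f)` gives the same summary, where `path` points at whichever .jsonl file is being inspected.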
Thanks so much for the great work!
I would like to know what the "disappear" types are.
This sounds very interesting, as the only reference to these types I found is "Assign type "Disappear" to variables not existing in the ground truth". I have also found many "disappeared" types in both the true and predicted labels, so many that they dominate the "types".
Does this mean that these types are lost because IDA doesn't use the DWARF information properly?
Thanks in advance.
Ruturaj