dirty's Issues

Help Needed/Documentation Inquiries

I have personally attempted to use DIRTY in many configurations with varying levels of success.

I have run into the following issues:

  • Package incompatibilities and/or failures to install
  • DIRTY cannot locate IDA's python libraries, "Could not import ida_typeinf. Cannot parse IDA types."
  • Relative imports in the util folder cause syntax errors (invalid syntax on "../../") and have to be modified to get past them
  • Using the --cuda switch causes the script to hang and never complete model testing.

I would love to see the documentation of prerequisite setup expanded upon, as these have been my biggest headaches.
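
When reporting setup problems like the ones listed above, a short environment summary may make them easier to reproduce. The following is a minimal sketch and not part of DIRTY: the helper name `describe_environment` is hypothetical, and the default module names are simply the ones mentioned in these issues (`ida_typeinf`) plus `torch`. It uses only the standard library.

```python
import importlib.util
import platform
import sys


def describe_environment(modules=("ida_typeinf", "torch")):
    """Return the interpreter version, platform, and importability of each module."""
    report = {
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be found on sys.path
        report[name] = importlib.util.find_spec(name) is not None
    return report
```

Calling `describe_environment()` inside the same virtual environment that runs DIRTY, and pasting the resulting dict into the issue, shows at a glance whether the IDA Python modules are visible to that interpreter.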

I would also appreciate some clarification on the following:

  • Should a specific python patch version be used, for example, 3.7.7?
  • What specific combination of package versions allow this tool to work properly?
  • Should I be adding the DIRTY repository to my python path?
  • Was this developed specifically for use in either a Linux or Windows environment? (I have tried both, but I don't have IDA for Linux)
  • Are there any specific steps for integrating IDA/IDAPython which are not listed on the homepage?
  • Python 3.6 and 3.7 have reached end of life and Python 3.8 will soon follow. Will this be updated or maintained in the future?

That said, I'm really excited to finally try this out and would appreciate any help.

DIRTY_light access request

Hello, I have read your paper and noticed that you compared the DIRTY_light model with OSPREY and achieved very good results in structural recovery. However, I could not find DIRTY_light in the source code repository. Could you please let me know how to obtain it?

Hang when evaluating test set

When I attempt to process the test set, the script just hangs. Here is the output:

GPU available: True, used: False
TPU available: False, using: 0 TPU cores
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/pytorch_lightning/utilities/ UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
  warnings.warn(*args, **kwargs)
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/pytorch_lightning/utilities/ UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-processing data loading (e.g. batch size > 1), this can lead to unintended side effects since the samples will be duplicated.
  warnings.warn(*args, **kwargs)
/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/webdataset/ UserWarning: num_workers 8 > num_shards 1
  warnings.warn(f"num_workers {num_workers} > num_shards {len(urls)}")
Testing: 0it [00:00, ?it/s]
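
The last warning in the output says that 8 workers were requested for a single shard. Whether this is related to the hang is not established here, but it is cheap to rule out by capping the worker count at the shard count. A minimal sketch; `cap_num_workers` is a hypothetical helper, not a DIRTY or webdataset function, and the shard name is made up:

```python
def cap_num_workers(requested_workers, shard_urls):
    """Limit the number of data-loading workers to the number of shards."""
    return max(0, min(requested_workers, len(shard_urls)))


# One shard, eight requested workers: only one worker can be given a shard.
print(cap_num_workers(8, ["test-shard.tar"]))
```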

What are the "disappear" types?

Thanks so much for the great work!

I would like to know: what are the "disappear" types?

This sounds very interesting, as the only reference to these types I found is "Assign type "Disappear" to variables not existing in the ground truth". I have also found a lot of "disappear" types in both the true and the predicted labels, so many that they dominate the types.

Does this mean that these types are lost because IDA doesn't use the DWARF information properly?

Thanks in advance.

Question on applying Dirty to custom binary files

I am trying to figure out how to apply DIRTY to my own binary files rather than the provided DIRE dataset. I have attempted to build my own binary file into a new test.jar and to modify the "test_file" path in multitask.xfmr.jsonnet, but I am struggling to understand the content of the .jsonl files within the provided test.jar (e.g., what key names such as t, n, and u mean). I checked the paper but did not find the answer. I'm not sure whether I am on the right track or missing something crucial. Any help would be greatly appreciated.

Need for local saving and loading of model.

The train executor interfaces with wandb for saving and loading models, and takes checkpoints as optional parameters. The model does not appear to be saved anywhere on disk, and the code shows no naming convention for checkpoints. How do we separate the saving and loading functionality from wandb so that it can be run completely locally?

Dataset is deprecated

I installed via the directions and get:

dirty-exp train --cuda --expname=eval_dirty_mt multitask.xfmr.jsonnet --eval-ckpt exp_runs/dirty_mt.ckpt 
WARNING:root:unable to load [idaapi], stub loaded instead
WARNING:root:unable to load [ida_auto], stub loaded instead
WARNING:root:unable to load [ida_funcs], stub loaded instead
WARNING:root:unable to load [ida_hexrays], stub loaded instead
WARNING:root:unable to load [ida_kernwin], stub loaded instead
WARNING:root:unable to load [ida_lines], stub loaded instead
WARNING:root:unable to load [ida_pro], stub loaded instead
WARNING:root:unable to load [ida_typeinf], stub loaded instead
WARNING:root:unable to load [idautils], stub loaded instead
Main process id 29909
use random seed 0
Traceback (most recent call last):
  File "/home/ed/Documents/DIRTY/env/bin/dirty-exp", line 33, in <module>
    sys.exit(load_entry_point('cmu-dirty', 'console_scripts', 'dirty-exp')())
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/", line 131, in main
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/", line 51, in train
  File "/home/ed/Documents/DIRTY/dirty/src/dirty/utils/", line 171, in __init__
  File "/home/ed/Documents/DIRTY/env/lib/python3.6/site-packages/webdataset/", line 41, in __init__
    raise Exception("Dataset is deprecated; use webdataset.WebDataset instead")
Exception: Dataset is deprecated; use webdataset.WebDataset instead

Here are the versions of packages installed in my venv:

Package                 Version   Editable project location
----------------------- --------- -----------------------------------------
absl-py                 1.0.0
aiohttp                 3.8.1
aiosignal               1.2.0
async-timeout           4.0.2
asynctest               0.13.0
attrs                   21.4.0
braceexpand             0.1.7
cachetools              4.2.4
certifi                 2021.10.8
charset-normalizer      2.0.12
click                   8.0.4
cmu-dirty               0.0.0     /home/ed/Documents/DIRTY/dirty/src
configparser            5.2.0
csvnpm-utils            0.0.0     /home/ed/Documents/DIRTY/csvnpm-utils/src
dataclasses             0.8
docker-pycreds          0.4.0
docopt                  0.6.2
frozenlist              1.2.0
fsspec                  2022.1.0
future                  0.18.2
gitdb                   4.0.9
GitPython               3.1.18
google-auth             2.6.5
google-auth-oauthlib    0.4.6
grpcio                  1.44.0
idna                    3.3
idna-ssl                1.1.0
importlib-metadata      4.8.3
joblib                  1.1.0
jsonlines               2.0.0
jsonnet                 0.17.0
Markdown                3.3.6
multidict               5.2.0
numpy                   1.19.5
oauthlib                3.2.0
packaging               21.3
pathtools               0.1.2
pip                     21.3.1
pkg_resources           0.0.0
promise                 2.3
protobuf                3.19.4
psutil                  5.9.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
Pygments                2.9.0
pyparsing               3.0.8
python-dateutil         2.8.2
pytorch-lightning       1.2.10
PyYAML                  6.0
requests                2.27.1
requests-oauthlib       1.3.1
rsa                     4.8
scikit-learn            0.24.2
scipy                   1.5.4
sentencepiece           0.1.96
sentry-sdk              1.5.10
setuptools              59.6.0
shortuuid               1.0.8
six                     1.16.0
smmap                   5.0.0
subprocess32            3.5.4
tensorboard             2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
threadpoolctl           3.1.0
torch                   1.8.1
torchmetrics            0.2.0
tqdm                    4.60.0
typing_extensions       4.1.1
ujson                   4.0.2
urllib3                 1.26.9
wandb                   0.10.33
webdataset              0.1.103
Werkzeug                2.0.3
wheel                   0.37.1
yarl                    1.7.2
zipp                    3.6.0
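
The exception text itself points at `webdataset.WebDataset` as the replacement for `Dataset`. One possible workaround, sketched below, is to pick whichever class the installed version exposes. This is an untested assumption-laden sketch: `pick_dataset_class` is a hypothetical helper, the stand-in modules exist only so the example runs without webdataset, and the two classes may not accept the same constructor arguments.

```python
from types import SimpleNamespace


def pick_dataset_class(wds_module):
    """Prefer `WebDataset`, falling back to the older `Dataset` name."""
    for name in ("WebDataset", "Dataset"):
        cls = getattr(wds_module, name, None)
        if cls is not None:
            return cls
    raise AttributeError("module exposes neither WebDataset nor Dataset")


# Stand-in modules so the sketch runs without webdataset installed.
new_style = SimpleNamespace(WebDataset="new", Dataset="deprecated")
old_style = SimpleNamespace(Dataset="old")
print(pick_dataset_class(new_style), pick_dataset_class(old_style))
```

With the real library this would be called as `pick_dataset_class(webdataset)` at the point where the dataset is constructed.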

A typo in


I have recently been following DIRTY. The work is great, and the code repo is impressive. Thanks to the developers for their efforts!

Just want to mention that there might be a typo in the following file: (dirty/model/, the variable in task_targets should be targets, instead of preds.

f"{task}_targets": preds,

Serialization of Unions uses wrong type tag

I received an email confused about the Void type in data that looked like a Union. It turns out there is a subtle bug in the serialization code here:


Lines 918 to 924 in f1f24f4

def _to_json(self) -> t.Dict[str, t.Any]:
    return {
        "T": 8,
        "m": [m._to_json() for m in self.members],
        "p": self.padding,
        # (remaining lines of the excerpt are missing here)
    }

This caused Unions to be serialized with the Void meta-tag. During deserialization, these are treated as Void types:


Lines 1041 to 1065 in f1f24f4

def read_metadata(d: t.Dict[str, t.Any]) -> "TypeLibCodec.CodecTypes":
    classes: t.Dict[
        t.Union[int, str],
        # (value type missing from the excerpt)
    ] = {
        "E": TypeLib.EntryList,
        0: TypeLib,
        1: TypeInfo,
        2: Array,
        3: Pointer,
        4: UDT.Field,
        5: UDT.Padding,
        6: Struct,
        7: Union,
        8: Void,
        9: FunctionPointer,
        10: Disappear,
    }
    return classes[d["T"]]._from_json(d)

The serialization bug is simple enough to fix, but this means that the current dataset has this specific bug. I will fix the current dataset, but if you've already downloaded the current one and/or don't want to wait, you'll have to modify the read_metadata method to condition on d having other fields, for example by replacing line 1065 in with this (untested) code:

if d["T"] == 8:
    if "m" in d:
        return Union._from_json(d)
    return Void._from_json(d)
return classes[d["T"]]._from_json(d)
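
The effect of that workaround can be checked in isolation. The sketch below uses stand-in classes (`FakeVoid` and `FakeUnion` are hypothetical and much simpler than the real types) and assumes, as the snippet above does, that only Union records carry an "m" field.

```python
class FakeVoid:
    """Stand-in for the real Void type; carries no fields."""

    @classmethod
    def _from_json(cls, d):
        return cls()


class FakeUnion:
    """Stand-in for the real Union type; keeps only the raw member list."""

    def __init__(self, members):
        self.members = members

    @classmethod
    def _from_json(cls, d):
        return cls(d["m"])


def read_tag_8(d):
    """Disambiguate records serialized with tag 8: a members field implies a Union."""
    if d["T"] != 8:
        raise ValueError("expected a record with tag 8")
    if "m" in d:
        return FakeUnion._from_json(d)
    return FakeVoid._from_json(d)


print(type(read_tag_8({"T": 8, "m": [], "p": None})).__name__)
print(type(read_tag_8({"T": 8})).__name__)
```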

Disagreement between preprocessor and decompiler/pretokenizer


Can somebody please explain the expected behaviour of the following steps?

The decompiler code passes, in the debug argument, only (loc: var) pairs for the symbolized version of the function, not the pseudocode.

The preprocessor then filters by the pseudocode output of both the stripped and the debug versions, resulting in a critically large value loss and an unpredictable renaming model.

So, should both pseudocode versions be involved or not?
If not, would it be correct to filter by comparing keys() instead?
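
To make the proposal concrete, filtering by keys() comparison could look like the sketch below. It assumes both versions are available as plain dicts mapping a location to a variable name; the helper name and the location strings are made up for illustration and do not reflect DIRTY's actual data structures.

```python
def filter_by_common_locations(stripped_vars, debug_vars):
    """Keep only locations present in both mappings, pairing their names."""
    common = stripped_vars.keys() & debug_vars.keys()
    return {loc: (stripped_vars[loc], debug_vars[loc]) for loc in sorted(common)}


stripped = {"rax": "v1", "rsp+8": "v2", "rsp+16": "v3"}
debug = {"rax": "count", "rsp+8": "buf"}
print(filter_by_common_locations(stripped, debug))
```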


@huzecong @pcyin @clegoues @bvasiles @sophieball @qibinc

How can I apply `DIRTY` on .idb file?

Your research looks great!

How can I use it on my IDA .idb file to improve the decompiled output?
I tried to follow the README but could not work out how to apply it.

I would appreciate a little help. Thanks

[???] encode_memory() dirtiness

Hello. Thank you for your kindness in sharing such a good project.

Could you please explain why we need

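    def encode_memory(mems):
        """Encode memory to ids

        <pad>: 0
        <SEP>: 1
        <unk>: 2
        mem_id: mem_offset + 3
        """
        ret = []
        for mem in mems[: VocabEntry.MAX_MEM_LENGTH]:
            if mem == "<SEP>":
                ...  # branch body missing from the excerpt
            elif mem > VocabEntry.MAX_STACK_SIZE:
                ...  # branch body missing from the excerpt
            else:  # "else" line missing from the excerpt, implied by the docstring
                ret.append(3 + mem)
        return ret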

this function, and why does it attempt to compare integers with "<SEP>"?
I can't make sense of appending an int token to an array of int tokens instead of an int token.

My second question is about this one:
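
For reference, the mapping described in the docstring above can be sketched as a standalone function. This is a reconstruction from the docstring only, not the repository's code: the two limits are placeholder values, and whether the original appends plain ints or one-element lists cannot be recovered from the excerpt.

```python
MAX_MEM_LENGTH = 8   # placeholder value, assumed for illustration
MAX_STACK_SIZE = 16  # placeholder value, assumed for illustration

SEP_ID = 1
UNK_ID = 2
OFFSET_BASE = 3


def encode_memory(mems):
    """Map a mixed list of int offsets and "<SEP>" markers to ids, per the docstring."""
    ids = []
    for mem in mems[:MAX_MEM_LENGTH]:
        if mem == "<SEP>":
            # Checked first, so the string marker never reaches the ">" comparison.
            ids.append(SEP_ID)
        elif mem > MAX_STACK_SIZE:
            ids.append(UNK_ID)
        else:
            ids.append(OFFSET_BASE + mem)
    return ids


print(encode_memory([0, 4, "<SEP>", 100]))  # [3, 7, 1, 2]
```

In this reading the input list mixes integer offsets with the string marker, and `mem == "<SEP>"` is simply `False` for integers, so the comparison is a type dispatch rather than a numeric test.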

            def var_loc_in_func(loc):
                print(" TODO: fix the magic number for computing vocabulary idx")
                if isinstance(loc, Register):
                    return 1030 + self.vocab.regs[...]  # index expression missing from the excerpt
                else:
                    from utils.vocab import VocabEntry

                    return (
                        3 + stack_start_pos - loc.offset
                        if stack_start_pos - loc.offset < VocabEntry.MAX_STACK_SIZE
                        else 2
                    )

What does the constant 1030 do, and why is it needed?
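
One plausible reading, which the excerpt does not confirm, is that 1030 is a base offset that places register ids past the range used for stack slots (3 + distance for distances below `VocabEntry.MAX_STACK_SIZE`). The sketch below illustrates that reading with `MAX_STACK_SIZE` assumed to be 1024; the real value should be checked in the repository, and the function names are hypothetical.

```python
MAX_STACK_SIZE = 1024  # assumption; check VocabEntry in the repository
REGISTER_BASE = 1030   # the constant from the snippet
UNK_ID = 2


def stack_id(distance):
    """Id for a stack slot at `distance` from the stack start, or <unk> if too far."""
    return 3 + distance if distance < MAX_STACK_SIZE else UNK_ID


def register_id(register_index):
    """Id for a register, offset so that it lands past the stack-slot range."""
    return REGISTER_BASE + register_index


largest_stack_id = stack_id(MAX_STACK_SIZE - 1)
print(largest_stack_id, register_id(0))  # 1026 1030
```

Under that assumption the two id ranges cannot collide, which would explain why the code comment calls the number "magic" rather than deriving it from the stack size.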

And in general, why do we define the tokens like this:

            self.word2id["<pad>"] = PAD_ID
            self.word2id["<s>"] = 1
            self.word2id["</s>"] = 2
            self.word2id["<unk>"] = 3
            self.word2id[SAME_VARIABLE_TOKEN] = 4

but use them like this:

        <pad>: 0
        <SEP>: 1
        <unk>: 2
        mem_id: mem_offset + 3

Sorry if this is too many questions. I specialize in systems programming, and math with ML is a hobby.
Thanks in advance =)


Running tox command

I am having an issue running the tox command. I have installed the requirements at their specified versions, and I downloaded the model and the dataset from the links provided. Please let me know how to resolve it. I get the following error:

mypy: commands succeeded
pipcheck: commands succeeded
ERROR: safety: could not install deps [setuptools ~= 50.3.2, pip ~= 21.1.0, -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt]; v = InvocationError("/home/abhishek/Desktop/dirty/DIRTY-master/dirty/.tox/safety/bin/python -m pip install 'setuptools ~= 50.3.2' 'pip ~= 21.1.0' -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt", -9)
ERROR: pytest: could not install deps [setuptools ~= 50.3.2, pip ~= 21.1.0, -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt]; v = InvocationError("/home/abhishek/Desktop/dirty/DIRTY-master/dirty/.tox/pytest/bin/python -m pip install 'setuptools ~= 50.3.2' 'pip ~= 21.1.0' -r/home/abhishek/Desktop/dirty/DIRTY-master/dirty/local_requirements.txt", -9)
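
The trailing -9 in both InvocationError messages is most likely the return code of the pip subprocess. On POSIX systems, Python's `subprocess` module reports a negative return code -N when the child process was terminated by signal N, so -9 would correspond to SIGKILL. That signal is commonly, though not only, sent by the kernel's out-of-memory killer; whether that is what happened here is not established. A minimal sketch for decoding such codes (`describe_exit_code` is a hypothetical helper):

```python
import signal


def describe_exit_code(code):
    """Describe a subprocess return code; negative values mean 'killed by signal'."""
    if code < 0:
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


print(describe_exit_code(-9))  # killed by SIGKILL (on POSIX systems)
```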


