
Comments (5)

qibinc commented on August 22, 2024

Hi @kotee4ko , thanks for your interest in this project! I apologize for the hacky implementation of the modeling part. Hope the following answers can help:

Could you please explain why we need encode_memory?

The encode_memory function is responsible for mapping a variable's size (red) and offsets (yellow) into vocab ids. Together with var_loc_in_func (green), it implements the encoding part of Figure 4 in the paper.

[image: encoding diagram, Figure 4 from the paper]

This function is required because Transformers have a fixed vocab size. Instead of passing mems to the model directly, we set an upper limit MAX_STACK_SIZE and convert any value exceeding it to the special token <unk> and its special id (which is 3 in this case).
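
A minimal sketch of that mapping, assuming MAX_STACK_SIZE = 1024 and three reserved leading ids (the names here are illustrative, not the actual dirty code):

MAX_STACK_SIZE = 1024
NUM_SPECIAL = 3  # ids 0-2 are reserved; <unk> is id 3
UNK_ID = 3

def encode_memory(mems):
    # Shift each raw size/offset past the reserved ids; anything at or
    # above MAX_STACK_SIZE falls back to the <unk> id.
    return [m + NUM_SPECIAL if m < MAX_STACK_SIZE else UNK_ID for m in mems]

With these constants, encode_memory((8, 0)) gives [11, 3], which matches the trace further down in this thread.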

this function, and why does it attempt to compare integers with "<SEP>"?

Since mems is an int array, the condition if mem == "<SEP>" is never met and can be safely deleted.

I can't make sense of appending an int token to an array of int tokens instead of a single int token.

Could you elaborate more on this question?

What does the constant 1030 do, and why?

A variable can live in a register or on the stack. In order to distinguish register x from stack position x, we assign them different ids. In this case, vocab id range [0, 1027] (1027 = 3 + MAX_STACK_SIZE) represents stack positions, and [1030, 1030 + <vocab size of registers>] represents register positions.
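
A sketch of that layout (an assumption based on the description above, not the literal implementation):

NUM_SPECIAL = 3
MAX_STACK_SIZE = 1024
REG_BASE = NUM_SPECIAL + MAX_STACK_SIZE + NUM_SPECIAL  # = 1030

def encode_location(kind, value):
    # Stack: value is a byte offset, mapped into [3, 1027].
    # Register: value is a register-vocab id, mapped into
    # [1030, 1030 + len(reg_vocab)).
    if kind == "reg":
        return REG_BASE + value
    return NUM_SPECIAL + value

For example, encode_location("reg", 5) = 1035, which is exactly the vloc logged for register 56 (reg_id 5) in the trace below.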

Why do we define tokens as x but use them as y?

We have two Transformer encoders: XfmrSequentialEncoder, responsible for code tokens, and XfmrMemEncoder, responsible for mem tokens (location, size, offsets). They have separate embeddings and vocabs. The first part, on self.word2id, is for the code vocab, while the second part, with mem_id, is for the mem vocab.
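
Schematically, the split looks like this (a sketch only; the real classes carry full Transformer stacks, and the vocab sizes here are placeholders):

import torch.nn as nn

class XfmrSequentialEncoder(nn.Module):
    """Encodes code tokens; ids come from the code vocab (word2id)."""
    def __init__(self, code_vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(code_vocab_size, d_model)
        # ... Transformer layers elided ...

class XfmrMemEncoder(nn.Module):
    """Encodes mem tokens (location, size, offsets); separate vocab."""
    def __init__(self, mem_vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(mem_vocab_size, d_model)
        # ... Transformer layers elided ...

Each encoder has its own nn.Embedding table, so a given integer id means different things to the two encoders.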

Feel free to follow up if you have more questions!


kotee4ko commented on August 22, 2024

@qibinc, Sir, I can't get a few things.

  1. Do we always need to adjust each token in the token list by 3 (the count of special tokens)?

  2. For the second "block" of tokens (registers), do we need to adjust it TWICE?

  3. Why can't we just let the registers (a constant quantity) come first, in [3, 3+len(vocab.regs)],
    and the memory (stack) come second, in [len(vocab.regs) + (3 or 3+3?), len(vocab.regs) + (3 or 3+3?) + MAX_STACK_SIZE]?

  4. Given reg_name:56 and reg_id:5, which one do we adjust by 3? Does it depend on the encoder (56 for the code vocab and 5+3 for the mem vocab)?

  5. Is this what is expected?

============
New loc:Reg 56; src_var.typ.size:8 src_var.typ.start_offsets():(0,)
calculating variable location, loc__:Reg 56
variable type is register , will adjust position by MAX_STACK_SIZE+3=1027
reg_name:56, reg_id:5
mems:(8, 0)
ret:[11, 3]
vloc:1035 
tmem:[11, 3] 
var_sequence:[1035, 11, 3]

New loc:Stk 0xa0; src_var.typ.size:144 src_var.typ.start_offsets():(0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
calculating variable location, loc__:Stk 0xa0
variable type is stack, will adjust position by 0+3=3
VocabEntry.MAX_STACK_SIZE:1024, stack_start_pos:216, offset:160
mems:(144, 0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
ret:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
vloc:59 
tmem:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123] 
var_sequence:[59, 147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
  6. Why can't we just redefine the special tokens as negative values, to avoid incrementing each value in the list?
  7. Here in the comments it says relative position for mem/stack and absolute location for regs. Generally, this value is just the position/offset adjusted by a container-type constant, right?
    And why don't we check the bounds of the stack? Here it may hold, but a variable's total size (in the case of an array or structure) could overflow this limit, and in that case we would have offsets that get interpreted as registers?
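
(For reference, the numbers in the trace above are internally consistent; a quick sanity check, assuming MAX_STACK_SIZE = 1024 and 3 special tokens:)

MAX_STACK_SIZE, NUM_SPECIAL = 1024, 3

# Register case: reg_id 5 is shifted by the 3 specials and then by
# MAX_STACK_SIZE + 3 = 1027, giving vloc 1035 as logged.
assert 5 + NUM_SPECIAL + (MAX_STACK_SIZE + NUM_SPECIAL) == 1035

# Stack case: stack_start_pos 216 minus offset 160 gives position 56,
# shifted past the specials to vloc 59.
assert (216 - 160) + NUM_SPECIAL == 59

# Sizes/offsets each get +3: size 144 -> 147, offsets 0 -> 3, 8 -> 11,
# matching ret:[147, 3, 11, ...].
assert [m + NUM_SPECIAL for m in (144, 0, 8)] == [147, 3, 11]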

Where is the code that processes these encoded tensors?

Wow, this is real hardcore, Sir.


kotee4ko commented on August 22, 2024

@qibinc

Sir, I need to refactor the code to be able to launch it on a very specific AMD GPU.
Can you tell me if this code would be logically correct? The difference is in the accuracy() method, which behaves a bit differently than the old one.

Thanks.

import torch
from torchmetrics.functional import accuracy


def tmaxu(t1, t2):
    # Number of classes to pass to `accuracy`: one more than the largest
    # class id appearing in either tensor, and at least 2.  (Counting
    # unique values instead can undercount when class ids are sparse,
    # and torchmetrics rejects ids >= num_classes.)
    return max(int(t1.max()), int(t2.max()), 1) + 1


# Inside the LightningModule:
    def _shared_epoch_end(self, outputs, prefix):
        final_ret = {}
        if self.retype:
            ret = self._shared_epoch_end_task(outputs, prefix, "retype")
            final_ret = {**final_ret, **ret}
        if self.rename:
            ret = self._shared_epoch_end_task(outputs, prefix, "rename")
            final_ret = {**final_ret, **ret}
        if self.retype and self.rename:
            # Evaluate rename accuracy on correctly retyped samples only
            retype_preds = torch.cat([x["retype_preds"] for x in outputs])
            retype_targets = torch.cat([x["retype_targets"] for x in outputs])
            rename_preds = torch.cat([x["rename_preds"] for x in outputs])
            rename_targets = torch.cat([x["rename_targets"] for x in outputs])
            binary_mask = retype_preds == retype_targets
            if binary_mask.any():
                p_t = rename_preds[binary_mask]
                t_t = rename_targets[binary_mask]
                self.log(
                    f"{prefix}_rename_on_correct_retype_acc",
                    accuracy(
                        p_t,
                        t_t,
                        task="multiclass",
                        num_classes=tmaxu(p_t, t_t),
                    ),
                )

        return final_ret


kotee4ko commented on August 22, 2024

[screenshot of training output]
Woohoo!
Seems like it is working?

@qibinc Thanks

One more question: how should I average the name and type predictions?
Just (name_loss + type_loss) / 2, or something else?
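
(A plain unweighted average would look like the sketch below; whether dirty weights the two terms differently is an assumption to verify against the actual training code.)

def combined_loss(name_loss, type_loss, lam=0.5):
    # lam = 0.5 gives the plain average (name_loss + type_loss) / 2;
    # note the parentheses, unlike name_loss + type_loss / 2.
    return lam * name_loss + (1.0 - lam) * type_loss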


kotee4ko commented on August 22, 2024

Wow, what a dirty trick!

(Pdb) model.vocab.regs.id2word
{0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<IDENTITY>', 5: '56', 6: '64',
 7: '16', 8: '8', 9: '24', 10: '72', 11: '80', 12: '32', 13: '496', 14: '104',
 15: '48', 16: '120', 17: '112', 18: '512', 19: '128', 20: '528', 21: '544',
 22: '560', 23: '88', 24: '576', 25: '592', 26: '608', 27: '40', 28: '440',
 29: '432', 30: '96', 31: '1', 32: '1280', 33: '424', 34: '472', 35: '448',
 36: '464', 37: '480', 38: '456', 39: '5', 40: '0', 41: '1288', 42: '624',
 43: '1296', 44: '640', 45: '656', 46: '672', 47: '688', 48: '2', 49: '1284',
 50: '144', 51: '704', 52: '192', 53: '720', 54: '736', 55: '1312', 56: '400',
 57: '1344', 58: '1328', 59: '1360', 60: '1376', 61: '1408', 62: '1392',
 63: '1424', 64: '3', 65: '1440', 66: '1456', 67: '1472', 68: '1488',
 69: '1504', 70: '1520', 71: '1536', 72: '1552', 73: '1568', 74: '1584',
 75: '1600', 76: '1616', 77: '1632', 78: '1648', 79: '1664', 80: '1696',
 81: '176', 82: '368', 83: '384', 84: '1680', 85: '1712', 86: '1728',
 87: '1760', 88: '1792', 89: '1824', 90: '500'}


{"name":"openWrite",
"code_tokens":["__int64","__fastcall","openWrite","(","const","char","*","@@a1@@",",","int","@@a2@@",")","{","int","@@v3@@",";","int","@@oflag@@",";","if","(","@@a2@@",")","@@oflag@@","=","Number",";","else","@@oflag@@","=","Number",";","@@v3@@","=","open","(","@@a1@@",",","@@oflag@@",",","Number","L",")",";","if","(","@@v3@@","<","Number",")","errnoAbort","(","String",",","@@a1@@",")",";","return","(","unsigned","int",")","@@v3@@",";","}"],
"source":
    {
        "s8":{"t":{"T":1,"n":"int","s":4},"n":"v3","u":false},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"oflag","u":true},
        "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},
        "target":{"s8":{"t":{"T":1,"n":"int","s":4},"n":"fd","u":true},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"flags","u":true},
        "r56":{"t":{"T":3,"t":"char"},"n":"fname","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"append","u":false}},
        "test_meta":{"function_name_in_train":false,"function_body_in_train":false}
}

dict_keys([
    'index',
    'src_code_tokens',
    'variable_mention_to_variable_id',
    'variable_mention_mask',
    'variable_mention_num',
    'variable_encoding_mask',
    'target_type_src_mems',
    'src_type_id',
    'target_mask',
    'target_submask',
    'target_type_sizes'
])

(Pdb) model.vocab.names.id2word[5] = ''        # first non-spec elem in names vocab starts on offset +5
(Pdb) model.vocab.types.id2word[7] = '__int64' # first non-spec elem in types vocab starts on offset +7


(Pdb) input_dict['index'] = [
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a1@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a2@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@v3@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@oflag@@']]

(Pdb) input_dict['src_code_tokens'][0].numel() = 87
input_dict['src_code_tokens'][0] = 
tensor(
    [
        1, 2069, 2008, 2012, 2010, 2063, 3251, 3877, 1995, 2088, 2046, 2001,
        9917, 1226, 2007, 2021, 9917,  402, 1996, 2019, 2021, 9917, 1263, 1997,
        2021, 9917, 1316, 1997, 2029, 1995, 9917,  402, 1996, 9917, 1316, 2009,
        2004, 1997, 2082, 9917, 1316, 2009, 2004, 1997, 9917, 1263, 2009, 3251,
        1995, 9917, 1226, 2007, 9917, 1316, 2007, 2004, 2027, 1996, 1997, 2029,
        1995, 9917, 1263, 2038, 2004, 1996, 9917, 2506, 9981, 3733, 1995, 2065,
        2007, 9917, 1226, 1996, 1997, 2049, 1995, 2036, 2021, 1996, 9917, 1263,
        1997, 2020,    2
    ]
)

(Pdb) input_dict['variable_mention_to_variable_id'][0].numel() = 87
(Pdb) input_dict['variable_mention_to_variable_id'][0] =
tensor(
    [
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
        0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 2, 0, 0,
        0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0
    ]
)

(Pdb) input_dict['variable_mention_mask'][0].numel() = 87
(Pdb) input_dict['variable_mention_mask'][0] = 
tensor(
    [
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.
    ]
)

(Pdb) input_dict['variable_mention_num'] = tensor([   [3., 2., 4., 4.]   ])

(Pdb) input_dict['variable_encoding_mask'] = tensor([    [1., 1., 1., 1.]    ])

(Pdb) input_dict['target_type_src_mems'] = 
tensor(
    [ #3d (func)
        [ #2d (var)
            [1035,   11,    3], #1d (var_mem_repr)  offset if < 1024+3 else reg_num+3+3, size+3, field_offset+3  
            [1036,    7,    3], 
            [   3,    7,    3],
            [   7,    7,    3]
        ]
    ]
)
(Pdb) model.vocab.regs.word2id['56'] = 5
(Pdb) model.vocab.regs.word2id['64'] = 6

# so, 1035-1024 = 11; 11 - 3 - 3 = 5; 5 == 5;
# and 1036-1024 = 12; 12 - 3 - 3 = 6; 6 == 6;


(Pdb) input_dict['src_type_id'] = tensor(
    [
        [9, 5, 5, 5] # vars type ids
    ]
)
(Pdb) model.vocab.types.word2id['const char *'] = 9 # "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
(Pdb) model.vocab.types.word2id['int'] = 5          # "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},

# src_type_id --> model.vocab.types
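
Putting the pieces together, a sketch of inverting one target_type_src_mems row back to a readable location (constants inferred from the thread above; this helper is not part of the project's API):

NUM_SPECIAL, MAX_STACK_SIZE = 3, 1024
REG_BASE = NUM_SPECIAL + MAX_STACK_SIZE + NUM_SPECIAL  # 1030

def decode_var_mem(row, reg_id2word):
    # Invert the encoding of one [loc, size, *offsets] row.
    loc, size, *offs = row
    if loc >= REG_BASE:
        where = "reg " + reg_id2word[loc - REG_BASE]
    else:
        where = "stack %d" % (loc - NUM_SPECIAL)
    return where, size - NUM_SPECIAL, tuple(o - NUM_SPECIAL for o in offs)

# decode_var_mem([1035, 11, 3], model.vocab.regs.id2word)
# -> ('reg 56', 8, (0,)), matching "New loc:Reg 56; ...size:8 ...(0,)"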

