
Comments (5)

qibinc commented on August 22, 2024

Hi @kotee4ko , thanks for your interest in this project! I apologize for the hacky implementation of the modeling part. Hope the following answers can help:

Could you please explain why we need encode_memory?

The encode_memory function is responsible for mapping a variable's size (red) and offsets (yellow) into vocab ids. Together with var_loc_in_func (green), it implements the encoding part of Figure 4 in the paper.

[image: encoding diagram, Figure 4 from the paper]

This function is required because Transformers have a fixed vocab size. Instead of passing mems to the model directly, we set an upper limit MAX_STACK_SIZE and convert any value exceeding it to the special token <unk> and its special id (which is 3 in this case).
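
A minimal sketch of that mapping, assuming MAX_STACK_SIZE = 1024 and three reserved leading ids (the names here are illustrative, not the actual dirty code):

MAX_STACK_SIZE = 1024
NUM_SPECIAL = 3  # ids 0-2 are reserved; <unk> is id 3
UNK_ID = 3

def encode_memory(mems):
    # Shift each raw size/offset past the reserved ids; anything at or
    # above MAX_STACK_SIZE falls back to the <unk> id.
    return [m + NUM_SPECIAL if m < MAX_STACK_SIZE else UNK_ID for m in mems]

With these constants, encode_memory((8, 0)) gives [11, 3], which matches the trace further down in this thread.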

this function, and why does it attempt to compare integers with "<SEP>"?

Since mems is an int array, the condition if mem == "<SEP>" is never met and can be safely deleted.

I can't make sense of appending an int token to an array of int tokens instead of a single int token.

Could you elaborate more on this question?

What does the constant 1030 do, and why?

A variable can live in a register or on the stack. In order to distinguish register x from stack position x, we assign them different ids. In this case, vocab id range [0, 1027] (1027 = 3 + MAX_STACK_SIZE) represents stack positions, and [1030, 1030 + <vocab size of registers>] represents register positions.
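
A sketch of that layout (an assumption based on the description above, not the literal implementation):

NUM_SPECIAL = 3
MAX_STACK_SIZE = 1024
REG_BASE = NUM_SPECIAL + MAX_STACK_SIZE + NUM_SPECIAL  # = 1030

def encode_location(kind, value):
    # Stack: value is a byte offset, mapped into [3, 1027].
    # Register: value is a register-vocab id, mapped into
    # [1030, 1030 + len(reg_vocab)).
    if kind == "reg":
        return REG_BASE + value
    return NUM_SPECIAL + value

For example, encode_location("reg", 5) = 1035, which is exactly the vloc logged for register 56 (reg_id 5) in the trace below.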

Why do we define tokens as x but use them as y?

We have two Transformer encoders: XfmrSequentialEncoder, responsible for code tokens, and XfmrMemEncoder, responsible for mem tokens (location, size, offsets). They have separate embeddings and vocabs. The first part, on self.word2id, is for the code vocab, while the second part, with mem_id, is for the mem vocab.
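
Schematically, the split looks like this (a sketch only; the real classes carry full Transformer stacks, and the vocab sizes here are placeholders):

import torch.nn as nn

class XfmrSequentialEncoder(nn.Module):
    """Encodes code tokens; ids come from the code vocab (word2id)."""
    def __init__(self, code_vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(code_vocab_size, d_model)
        # ... Transformer layers elided ...

class XfmrMemEncoder(nn.Module):
    """Encodes mem tokens (location, size, offsets); separate vocab."""
    def __init__(self, mem_vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(mem_vocab_size, d_model)
        # ... Transformer layers elided ...

Each encoder has its own nn.Embedding table, so a given integer id means different things to the two encoders.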

Feel free to follow up if you have more questions!


kotee4ko commented on August 22, 2024

@qibinc, Sir, I can't get a few things.

  1. Do we always need to adjust each token in the token list by 3 (the count of special tokens)?

  2. For the second "block" of tokens (registers), do we need to adjust it TWICE?

  3. Why can't we just let the registers (a constant quantity) come first, in [3, 3+len(vocab.regs)],
    and the memory (stack) come second, in [len(vocab.regs) + (3 or 3+3?), len(vocab.regs) + (3 or 3+3?) + MAX_STACK_SIZE]?

  4. Given reg_name:56 and reg_id:5, which one do we adjust by 3? Does it depend on the encoder (56 for the code vocab and 5+3 for the mem vocab)?

  5. Is this what is expected?

============
New loc:Reg 56; src_var.typ.size:8 src_var.typ.start_offsets():(0,)
calculating variable location, loc__:Reg 56
variable type is register , will adjust position by MAX_STACK_SIZE+3=1027
reg_name:56, reg_id:5
mems:(8, 0)
ret:[11, 3]
vloc:1035 
tmem:[11, 3] 
var_sequence:[1035, 11, 3]

New loc:Stk 0xa0; src_var.typ.size:144 src_var.typ.start_offsets():(0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
calculating variable location, loc__:Stk 0xa0
variable type is stack, will adjust position by 0+3=3
VocabEntry.MAX_STACK_SIZE:1024, stack_start_pos:216, offset:160
mems:(144, 0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
ret:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
vloc:59 
tmem:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123] 
var_sequence:[59, 147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
  6. Why can't we just redefine the special tokens as negative values, to avoid incrementing each value in the list?
  7. Here in the comments it says relative position for mem/stack and absolute location for regs. Generally, this value is just the position/offset adjusted by a container-type constant, right?
    And why don't we check the bounds of the stack? Here it may hold, but a variable's total size (in the case of an array or structure) could overflow this limit, and in that case we would have offsets that get interpreted as registers?
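
(For reference, the numbers in the trace above are internally consistent; a quick sanity check, assuming MAX_STACK_SIZE = 1024 and 3 special tokens:)

MAX_STACK_SIZE, NUM_SPECIAL = 1024, 3

# Register case: reg_id 5 is shifted by the 3 specials and then by
# MAX_STACK_SIZE + 3 = 1027, giving vloc 1035 as logged.
assert 5 + NUM_SPECIAL + (MAX_STACK_SIZE + NUM_SPECIAL) == 1035

# Stack case: stack_start_pos 216 minus offset 160 gives position 56,
# shifted past the specials to vloc 59.
assert (216 - 160) + NUM_SPECIAL == 59

# Sizes/offsets each get +3: size 144 -> 147, offsets 0 -> 3, 8 -> 11,
# matching ret:[147, 3, 11, ...].
assert [m + NUM_SPECIAL for m in (144, 0, 8)] == [147, 3, 11]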

Where is the code that processes these encoded tensors?

Wow, this is real hardcore, Sir.


kotee4ko commented on August 22, 2024

@qibinc

Sir, I need to refactor the code to be able to launch it on a very specific AMD GPU.
Can you tell me if this code would be logically correct? The difference is in the accuracy() method, which behaves a bit differently than the old one.

Thanks.

import torch
from torchmetrics.functional import accuracy


def tmaxu(t1, t2):
    # Number of classes to pass to `accuracy`: one more than the largest
    # class id appearing in either tensor, and at least 2.  (Counting
    # unique values instead can undercount when class ids are sparse,
    # and torchmetrics rejects ids >= num_classes.)
    return max(int(t1.max()), int(t2.max()), 1) + 1


# Inside the LightningModule:
    def _shared_epoch_end(self, outputs, prefix):
        final_ret = {}
        if self.retype:
            ret = self._shared_epoch_end_task(outputs, prefix, "retype")
            final_ret = {**final_ret, **ret}
        if self.rename:
            ret = self._shared_epoch_end_task(outputs, prefix, "rename")
            final_ret = {**final_ret, **ret}
        if self.retype and self.rename:
            # Evaluate rename accuracy on correctly retyped samples only
            retype_preds = torch.cat([x["retype_preds"] for x in outputs])
            retype_targets = torch.cat([x["retype_targets"] for x in outputs])
            rename_preds = torch.cat([x["rename_preds"] for x in outputs])
            rename_targets = torch.cat([x["rename_targets"] for x in outputs])
            binary_mask = retype_preds == retype_targets
            if binary_mask.any():
                p_t = rename_preds[binary_mask]
                t_t = rename_targets[binary_mask]
                self.log(
                    f"{prefix}_rename_on_correct_retype_acc",
                    accuracy(
                        p_t,
                        t_t,
                        task="multiclass",
                        num_classes=tmaxu(p_t, t_t),
                    ),
                )

        return final_ret


kotee4ko commented on August 22, 2024

[screenshot of training output]
Woohoo!
Seems like it is working?

@qibinc Thanks

One more question: how should I average the name and type predictions?
Just (name_loss + type_loss) / 2, or something else?
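
(A plain unweighted average would look like the sketch below; whether dirty weights the two terms differently is an assumption to verify against the actual training code.)

def combined_loss(name_loss, type_loss, lam=0.5):
    # lam = 0.5 gives the plain average (name_loss + type_loss) / 2;
    # note the parentheses, unlike name_loss + type_loss / 2.
    return lam * name_loss + (1.0 - lam) * type_loss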


kotee4ko commented on August 22, 2024

Wow, what a dirty trick!

(Pdb) model.vocab.regs.id2word
{0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<IDENTITY>', 5: '56', 6: '64',
 7: '16', 8: '8', 9: '24', 10: '72', 11: '80', 12: '32', 13: '496', 14: '104',
 15: '48', 16: '120', 17: '112', 18: '512', 19: '128', 20: '528', 21: '544',
 22: '560', 23: '88', 24: '576', 25: '592', 26: '608', 27: '40', 28: '440',
 29: '432', 30: '96', 31: '1', 32: '1280', 33: '424', 34: '472', 35: '448',
 36: '464', 37: '480', 38: '456', 39: '5', 40: '0', 41: '1288', 42: '624',
 43: '1296', 44: '640', 45: '656', 46: '672', 47: '688', 48: '2', 49: '1284',
 50: '144', 51: '704', 52: '192', 53: '720', 54: '736', 55: '1312', 56: '400',
 57: '1344', 58: '1328', 59: '1360', 60: '1376', 61: '1408', 62: '1392',
 63: '1424', 64: '3', 65: '1440', 66: '1456', 67: '1472', 68: '1488',
 69: '1504', 70: '1520', 71: '1536', 72: '1552', 73: '1568', 74: '1584',
 75: '1600', 76: '1616', 77: '1632', 78: '1648', 79: '1664', 80: '1696',
 81: '176', 82: '368', 83: '384', 84: '1680', 85: '1712', 86: '1728',
 87: '1760', 88: '1792', 89: '1824', 90: '500'}


{"name":"openWrite",
"code_tokens":["__int64","__fastcall","openWrite","(","const","char","*","@@a1@@",",","int","@@a2@@",")","{","int","@@v3@@",";","int","@@oflag@@",";","if","(","@@a2@@",")","@@oflag@@","=","Number",";","else","@@oflag@@","=","Number",";","@@v3@@","=","open","(","@@a1@@",",","@@oflag@@",",","Number","L",")",";","if","(","@@v3@@","<","Number",")","errnoAbort","(","String",",","@@a1@@",")",";","return","(","unsigned","int",")","@@v3@@",";","}"],
"source":
    {
        "s8":{"t":{"T":1,"n":"int","s":4},"n":"v3","u":false},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"oflag","u":true},
        "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},
        "target":{"s8":{"t":{"T":1,"n":"int","s":4},"n":"fd","u":true},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"flags","u":true},
        "r56":{"t":{"T":3,"t":"char"},"n":"fname","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"append","u":false}},
        "test_meta":{"function_name_in_train":false,"function_body_in_train":false}
}

dict_keys([
    'index',
    'src_code_tokens',
    'variable_mention_to_variable_id',
    'variable_mention_mask',
    'variable_mention_num',
    'variable_encoding_mask',
    'target_type_src_mems',
    'src_type_id',
    'target_mask',
    'target_submask',
    'target_type_sizes'
])

(Pdb) model.vocab.names.id2word[5] = ''        # first non-spec elem in names vocab starts on offset +5
(Pdb) model.vocab.types.id2word[7] = '__int64' # first non-spec elem in types vocab starts on offset +7


(Pdb) input_dict['index'] = [
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a1@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a2@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@v3@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@oflag@@']]

(Pdb) input_dict['src_code_tokens'][0].numel() = 87
input_dict['src_code_tokens'][0] = 
tensor(
    [
        1, 2069, 2008, 2012, 2010, 2063, 3251, 3877, 1995, 2088, 2046, 2001,
        9917, 1226, 2007, 2021, 9917,  402, 1996, 2019, 2021, 9917, 1263, 1997,
        2021, 9917, 1316, 1997, 2029, 1995, 9917,  402, 1996, 9917, 1316, 2009,
        2004, 1997, 2082, 9917, 1316, 2009, 2004, 1997, 9917, 1263, 2009, 3251,
        1995, 9917, 1226, 2007, 9917, 1316, 2007, 2004, 2027, 1996, 1997, 2029,
        1995, 9917, 1263, 2038, 2004, 1996, 9917, 2506, 9981, 3733, 1995, 2065,
        2007, 9917, 1226, 1996, 1997, 2049, 1995, 2036, 2021, 1996, 9917, 1263,
        1997, 2020,    2
    ]
)

(Pdb) input_dict['variable_mention_to_variable_id'][0].numel() = 87
(Pdb) input_dict['variable_mention_to_variable_id'][0] =
tensor(
    [
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
        0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 2, 0, 0,
        0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0
    ]
)

(Pdb) input_dict['variable_mention_mask'][0].numel() = 87
(Pdb) input_dict['variable_mention_mask'][0] = 
tensor(
    [
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.
    ]
)

(Pdb) input_dict['variable_mention_num'] = tensor([   [3., 2., 4., 4.]   ])

(Pdb) input_dict['variable_encoding_mask'] = tensor([    [1., 1., 1., 1.]    ])

(Pdb) input_dict['target_type_src_mems'] = 
tensor(
    [ #3d (func)
        [ #2d (var)
            [1035,   11,    3], #1d (var_mem_repr)  offset if < 1024+3 else reg_num+3+3, size+3, field_offset+3  
            [1036,    7,    3], 
            [   3,    7,    3],
            [   7,    7,    3]
        ]
    ]
)
(Pdb) model.vocab.regs.word2id['56'] = 5
(Pdb) model.vocab.regs.word2id['64'] = 6

# so, 1035-1024 = 11; 11 - 3 - 3 = 5; 5 == 5;
# and 1036-1024 = 12; 12 - 3 - 3 = 6; 6 == 6;


(Pdb) input_dict['src_type_id'] = tensor(
    [
        [9, 5, 5, 5] # vars type ids
    ]
)
(Pdb) model.vocab.types.word2id['const char *'] = 9 # "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
(Pdb) model.vocab.types.word2id['int'] = 5          # "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},

# src_type_id --> model.vocab.types
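
Putting the pieces together, a sketch of inverting one target_type_src_mems row back to a readable location (constants inferred from the thread above; this helper is not part of the project's API):

NUM_SPECIAL, MAX_STACK_SIZE = 3, 1024
REG_BASE = NUM_SPECIAL + MAX_STACK_SIZE + NUM_SPECIAL  # 1030

def decode_var_mem(row, reg_id2word):
    # Invert the encoding of one [loc, size, *offsets] row.
    loc, size, *offs = row
    if loc >= REG_BASE:
        where = "reg " + reg_id2word[loc - REG_BASE]
    else:
        where = "stack %d" % (loc - NUM_SPECIAL)
    return where, size - NUM_SPECIAL, tuple(o - NUM_SPECIAL for o in offs)

# decode_var_mem([1035, 11, 3], model.vocab.regs.id2word)
# -> ('reg 56', 8, (0,)), matching "New loc:Reg 56; ...size:8 ...(0,)"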

