awf / functional-transformer

168.0 168.0 6.0 78 KB

A pure-functional implementation of a machine learning transformer model in Python/JAX

License: MIT License

Python 100.00%

functional-transformer's People

jamesthesnake arthurperret ml-lab alexm-gc anas-zafar kaushalya

functional-transformer's Issues

Make value heads nonsquare and add back head concatenation

As noted in #6, the model does not match the original code, or indeed the original transformer paper. I therefore consider this a "transformer variant", but of course it would be sensible to make it match and check whether that improves or degrades performance.

Initialization of positional encodings?

Hi there. Great work!

I shared this with a colleague and they were concerned that your example does not seem to initialize the positional encodings beyond zeros.

Should there be a comment or an implementation of setting up the positional encoding?
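For illustration, one common alternative to an all-zeros start is a small random-normal initialization of the learned position table. The sketch below is not taken from the repository: the function name `init_position_encodings` and the `scale=0.02` default are assumptions chosen for the example.

```python
import jax


def init_position_encodings(rng_key, max_len, d_model, scale=0.02):
    """Return a max_len x d_model table of small random position encodings.

    Hypothetical helper: the name and the default scale are illustrative
    assumptions, not the repository's API.
    """
    return scale * jax.random.normal(rng_key, (max_len, d_model))


# Example: a table for sequences of up to 64 tokens with model width 32
pos = init_position_encodings(jax.random.PRNGKey(0), 64, 32)
print(pos.shape)
```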

Implementation of Attention is wrong.

Mistake 1. The attention heads are all summed together; they should be concatenated (see code and image of attention equation).

Mistake 2. After the missing concatenation there should have been another linear layer (see code and image of attention equation).

Note. Using a for-loop over heads does not seem to be compiled efficiently by JAX.

Note. The weight initialization differs from minGPT's.

Code: https://github.com/awf/functional-transformer/blob/e44f4606efd663b0c6454d81010f536966dbd990/transformer.py#L161C9-L180C69

# Multi-head self-attention
for head in layer.heads:

    # Project into this head's query/key space
    query = linear(head.query, t1)                  # L x Dk
    key = linear(head.key, t1)                      # L x Dk

    # Compute L x L attention matrix
    score = query @ key.T + mask                    # L x L
    attn = jax.nn.softmax(cfg.tau * score, axis=1)  # L x L

    value = linear(head.value, t1)                  # L x Dm
    self_attn = attn @ value                        # L x Dm

    # Add this head's contribution into embeddings
    embeddings += self_attn                         # L x Dm     <---- sum instead of concatenate
    
# <-- after concatenating all attention heads there should be another linear layer here.
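For comparison, here is a minimal sketch of the variant this issue asks for: per-head outputs of width Dk are concatenated and passed through one more linear layer. It is an illustration under stated assumptions, not the repository's code: the parameter layout (`wq`, `wk`, `wv`, `wo`), the function name, and the omission of biases are all hypothetical, and `tau` stands in for `cfg.tau`. The heads are batched with `einsum` rather than iterated in a Python loop.

```python
import jax
import jax.numpy as jnp


def multi_head_attention(params, x, mask, tau=1.0):
    """Self-attention with head concatenation and an output projection.

    x:      L x Dm input embeddings
    mask:   L x L additive mask (0 where allowed, large negative elsewhere)
    params: hypothetical dict with wq, wk, wv of shape H x Dm x Dk
            and wo of shape (H*Dk) x Dm (biases omitted for brevity)
    """
    # Project into every head's query/key/value space in one call
    query = jnp.einsum("ld,hdk->hlk", x, params["wq"])       # H x L x Dk
    key = jnp.einsum("ld,hdk->hlk", x, params["wk"])         # H x L x Dk
    value = jnp.einsum("ld,hdk->hlk", x, params["wv"])       # H x L x Dk

    # Per-head L x L attention matrices; the mask broadcasts over heads
    score = jnp.einsum("hlk,hmk->hlm", query, key) + mask    # H x L x L
    attn = jax.nn.softmax(tau * score, axis=-1)              # H x L x L

    heads = jnp.einsum("hlm,hmk->hlk", attn, value)          # H x L x Dk

    # Concatenate the heads along the feature axis instead of summing them
    concat = jnp.transpose(heads, (1, 0, 2)).reshape(x.shape[0], -1)  # L x (H*Dk)

    # Final linear layer applied after concatenation
    return concat @ params["wo"]                             # L x Dm


# Usage with random weights and a causal mask
L, Dm, H, Dk = 5, 8, 2, 4
k = jax.random.split(jax.random.PRNGKey(0), 5)
params = {
    "wq": jax.random.normal(k[0], (H, Dm, Dk)),
    "wk": jax.random.normal(k[1], (H, Dm, Dk)),
    "wv": jax.random.normal(k[2], (H, Dm, Dk)),
    "wo": jax.random.normal(k[3], (H * Dk, Dm)),
}
x = jax.random.normal(k[4], (L, Dm))
mask = jnp.where(jnp.tril(jnp.ones((L, L))) > 0, 0.0, -1e9)
out = multi_head_attention(params, x, mask)
print(out.shape)
```

The caller adds the returned L x Dm array to the embeddings once per layer, whereas the quoted loop adds one contribution per head.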

Bug? Loss curves show a distinct correlation with batch id.

WandB loss curves show a sawtooth pattern that is correlated with batch ID.

Batches are randomized, and the pattern occurs even with 1-bit gradients, so it's not Adam...
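One way to quantify such a correlation, sketched here under assumptions, is to average the logged loss per batch ID across epochs and compare the averages. The `(batch_id, loss)` record format and the function name are hypothetical; they are not taken from the repository or from WandB.

```python
from collections import defaultdict


def mean_loss_by_batch_id(records):
    """Average loss for each batch ID.

    records: iterable of (batch_id, loss) pairs, an assumed logging format.
    Returns a dict mapping batch_id to mean loss, ordered by batch_id.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for batch_id, loss in records:
        sums[batch_id] += loss
        counts[batch_id] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Two epochs of two batches each (made-up numbers for illustration)
records = [(0, 1.0), (1, 2.0), (0, 3.0), (1, 4.0)]
print(mean_loss_by_batch_id(records))
```

If the per-ID means differ by much more than the within-ID spread, the loss depends on which batch is being shown rather than on the training step alone.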

Add non-learned position encodings

From #4 (comment)

See how we might include the sin(t) terms, rather than just 'learned' encodings.
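As a starting point, the fixed sin/cos encodings from the original transformer paper can be sketched in a few lines of JAX. The function name and signature below are hypothetical, and `d_model` is assumed to be even; how the result would be wired into this repository's parameters is left open.

```python
import jax.numpy as jnp


def sinusoidal_position_encodings(seq_len, d_model, base=10000.0):
    """Fixed (non-learned) encodings of shape seq_len x d_model.

    Even columns hold sin(pos * freq), odd columns hold cos(pos * freq),
    with freq = base ** (-2i / d_model) for column pair i.
    Assumes d_model is even.
    """
    positions = jnp.arange(seq_len)[:, None]                               # L x 1
    freqs = jnp.exp(-jnp.log(base) * jnp.arange(0, d_model, 2) / d_model)  # Dm/2
    angles = positions * freqs[None, :]                                    # L x Dm/2

    enc = jnp.zeros((seq_len, d_model))
    enc = enc.at[:, 0::2].set(jnp.sin(angles))
    enc = enc.at[:, 1::2].set(jnp.cos(angles))
    return enc


enc = sinusoidal_position_encodings(16, 8)
print(enc.shape)
```

Because the table is a pure function of the sequence length and the model width, it can be computed once and added to the token embeddings without introducing trainable parameters.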
