
Comments (7)

mrnikwaws commented on August 24, 2024

Thanks @dacorvo, we'll keep the ticket open in the interim, but will assume you have the next step.

from transformers-neuronx.

mrnikwaws commented on August 24, 2024

Hi @dacorvo,

Could you also share the parameters you pass when calling GPT2ForSampling.sample() and generate()?

The default behavior of sample() and generate() is significantly different:

  • .sample() uses top-k sampling (k=50)

  • to use the same top-k in generate(), it would expect generate(top_k=50)

It would be really helpful if you could share a minimal script including these calls, and the instance type you used, so we can reproduce your profiling results.
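To make the defaults concrete, here is a minimal, hypothetical illustration of what top_k=50 sampling does (plain PyTorch, not the transformers-neuronx implementation):

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Mask every logit below the k-th largest (per row) to -inf."""
    kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

# Keep only the 50 highest logits, renormalize, and draw the next token.
logits = torch.randn(1, 50257)          # GPT-2 vocabulary size
probs = torch.softmax(top_k_filter(logits), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```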


dacorvo commented on August 24, 2024

Hi @mrnikwaws, thank you for your answer.

I think top_k is by default set to 50 in transformers. I set it explicitly anyway, but I got the same results.

Here is the link to my script: https://github.com/dacorvo/transformers-neuronx/blob/inference_tests/test_gpt2.py


dacorvo commented on August 24, 2024

I did a few more tests on a 24xlarge instance, which has 6 devices / 12 cores.

Using a batch size of 16 and generating 1000 tokens, the test lasts long enough that I can get meaningful numbers from neuron-top in parallel.

What I see is that when using the GPT2ForSampling.sample() method directly, the 12 cores are used at up to 60%, while when using the wrapper generate() method they are only used at up to 20% (virtual CPU usage looks equivalent).

This is purely speculative, because I have no idea what actually happens behind the scenes, but in a previous life developing software for hardware accelerators, this was often the sign of too much latency in the host process feeding the hardware.
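That kind of host-side stall can be checked with a small stand-in harness; everything here is hypothetical (device_call and host_step are placeholders for the accelerator invocation and the host-side sampling code, not real Neuron APIs):

```python
import time

def split_step_time(device_call, host_step, first_token, n_steps=20):
    """Attribute per-step wall time to the device call vs. the host-side
    work that feeds it; returns (device_seconds, host_seconds)."""
    dev = host = 0.0
    token = first_token
    for _ in range(n_steps):
        t0 = time.perf_counter()
        logits = device_call(token)   # stand-in for the accelerator forward pass
        t1 = time.perf_counter()
        token = host_step(logits)     # stand-in for top-k sampling on the host
        t2 = time.perf_counter()
        dev += t1 - t0
        host += t2 - t1
    return dev, host
```

If the host seconds dominate the device seconds, the accelerator is sitting idle between invocations, which would match low neuron core utilization with near-constant CPU usage.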


dacorvo commented on August 24, 2024

I investigated a bit further, and the top-k sampling code is indeed much slower in transformers, which creates latency between hardware invocations. I think I might have identified why sample_loop() is faster; I will see if I can align the transformers top-k implementation with it, and confirm that this solves the issue.
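As a hypothetical sketch of why a dedicated loop can be faster (sample_top_k is a made-up name, not the actual sample_loop() code): sample directly over the k retained logits instead of masking and renormalizing the full ~50k-entry vocabulary on the host at every step.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Draw the next token using only the k largest logits.

    Softmax and multinomial run over k values instead of the full
    vocabulary, keeping the host-side work per step small.
    """
    top = torch.topk(logits, k, dim=-1)
    choice = torch.multinomial(torch.softmax(top.values, dim=-1), 1)
    # Map the position within the top-k back to a vocabulary index.
    return torch.gather(top.indices, dim=-1, index=choice)

# Usage: logits of shape (batch, vocab) -> token ids of shape (batch, 1)
```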


dacorvo commented on August 24, 2024

I have modified the transformers sampling loop in optimum-neuron:
huggingface/optimum-neuron#130
The performance of generation through the transformers generate() method is now equivalent to that of the transformers-neuronx custom sampling loop.
FYI, I submitted the issue to the transformers team, but at this stage it is not going to be integrated in the core library.
Said differently, you will need to go through the optimum-neuron package to get the optimized generation code: the change will not benefit the transformers-neuronx HuggingFaceGenerationModelAdapter class.


mrnikwaws commented on August 24, 2024

Thanks @dacorvo, we're taking a look at what we can do to alert customers to this issue in our docs / through warning messages.

