
Comments (7)

mrnikwaws commented on August 24, 2024

Thanks @dacorvo, we'll keep the ticket open in the interim, but will assume you have the next step.

from transformers-neuronx.

mrnikwaws commented on August 24, 2024

Hi @dacorvo,

Could you also share the parameters you pass when calling GPT2ForSampling.sample() and generate()?

The default behavior of sample() and generate() is significantly different:

  • .sample() uses top-k sampling (k=50)

  • to use the same top-k in generate(), it would expect generate(top_k=50)

It would be really helpful if you could share a minimal script including these calls, and the instance type you used, so we can reproduce your profiling results.
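To make the defaults concrete, here is a minimal, hypothetical illustration of what top_k=50 sampling does (plain PyTorch, not the transformers-neuronx implementation):

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Mask every logit below the k-th largest (per row) to -inf."""
    kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

# Keep only the 50 highest logits, renormalize, and draw the next token.
logits = torch.randn(1, 50257)          # GPT-2 vocabulary size
probs = torch.softmax(top_k_filter(logits), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```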


dacorvo commented on August 24, 2024

Hi @mrnikwaws, thank you for your answer.

I think top_k is by default set to 50 in transformers. I set it explicitly anyway, but I got the same results.

Here is the link to my script: https://github.com/dacorvo/transformers-neuronx/blob/inference_tests/test_gpt2.py


dacorvo commented on August 24, 2024

I did a few more tests on a 24xlarge instance, which has 6 devices / 12 cores.

Using a batch size of 16 and generating 1000 tokens, the test lasts long enough that I can get meaningful numbers from neuron-top in parallel.

What I see is that when using the GPT2ForSampling.sample() method directly, the 12 cores are used at up to 60%, while when using the wrapper generate() method they are only used at up to 20% (virtual CPU usage looks equivalent).

This is purely speculative, because I have no idea what actually happens behind the scenes, but in a previous life developing software for hardware accelerators, this was often the sign of too much latency in the host process feeding the hardware.
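That kind of host-side stall can be checked with a small stand-in harness; everything here is hypothetical (device_call and host_step are placeholders for the accelerator invocation and the host-side sampling code, not real Neuron APIs):

```python
import time

def split_step_time(device_call, host_step, first_token, n_steps=20):
    """Attribute per-step wall time to the device call vs. the host-side
    work that feeds it; returns (device_seconds, host_seconds)."""
    dev = host = 0.0
    token = first_token
    for _ in range(n_steps):
        t0 = time.perf_counter()
        logits = device_call(token)   # stand-in for the accelerator forward pass
        t1 = time.perf_counter()
        token = host_step(logits)     # stand-in for top-k sampling on the host
        t2 = time.perf_counter()
        dev += t1 - t0
        host += t2 - t1
    return dev, host
```

If the host seconds dominate the device seconds, the accelerator is sitting idle between invocations, which would match low neuron core utilization with near-constant CPU usage.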


dacorvo commented on August 24, 2024

I investigated a bit further, and the top-k sampling code is indeed much slower in transformers, which creates latency between hardware invocations. I think I might have identified why sample_loop() is faster; I will see if I can align the transformers top-k implementation with it, and confirm that this solves the issue.
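As a hypothetical sketch of why a dedicated loop can be faster (sample_top_k is a made-up name, not the actual sample_loop() code): sample directly over the k retained logits instead of masking and renormalizing the full ~50k-entry vocabulary on the host at every step.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Draw the next token using only the k largest logits.

    Softmax and multinomial run over k values instead of the full
    vocabulary, keeping the host-side work per step small.
    """
    top = torch.topk(logits, k, dim=-1)
    choice = torch.multinomial(torch.softmax(top.values, dim=-1), 1)
    # Map the position within the top-k back to a vocabulary index.
    return torch.gather(top.indices, dim=-1, index=choice)

# Usage: logits of shape (batch, vocab) -> token ids of shape (batch, 1)
```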


dacorvo commented on August 24, 2024

I have modified the transformers sampling loop in optimum-neuron:
huggingface/optimum-neuron#130
The performance of generation through the transformers generate() method is now equivalent to that of the transformers-neuronx custom sampling loop.
FYI, I submitted the issue to the transformers team, but at this stage it is not going to be integrated in the core library.
Said differently, you will need to go through the optimum-neuron package to get the optimized generation code: the change will not benefit the transformers-neuronx HuggingFaceGenerationModelAdapter class.


mrnikwaws commented on August 24, 2024

Thanks @dacorvo, we're taking a look at what we can do to alert customers to this issue in our docs / through warning messages.

