Comments (7)

aws-donkrets commented on July 22, 2024

dacorvo, sorry you are having an issue with your batch size > 1 test runs. While we look into the problem, could you try compiling with batch sizes of 4 and 8 and post the results to this ticket?

from transformers-neuronx.

dacorvo commented on July 22, 2024

Here is the failure with batch size 4:

$ gptj_demo run --batch_size 4 --n_positions 20 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...2023-06-27T07:21:05Z ERROR 2286 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
 0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 1# 0x00007F52C6A91E96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 6# 0x00007F527859C3FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 8# 0x00007F52C6AD3130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
 9# 0x00007F527C894820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007F527C89F35E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007F527C8929EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007F527C8B471E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007F5311A50C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007F5311A652D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007F5311A50C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007F5311A60AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007F527C8A8BE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007F5311A7DC6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007F5311A80082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007F5311A50C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007F5311A652D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007F5311A50C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007F5311A60AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007F5311504ECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007F531153CBA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007F531150ACD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007F5311B5179C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007F5311B5D9AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007F5311B53CED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007F5311B53EC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007F5311B66DA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x000000000067DBF1 in /usr/bin/python3
67# 0x000000000067DC6F in /usr/bin/python3
68# 0x000000000067DD11 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-27T07:21:05Z ERROR 2286 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-27T07:21:05Z ERROR 2286 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB 
Instruction: I-25457
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,4],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
...
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmp6stffmrw/Scribable.3484.1.pb', '--output=/tmp/tmp6stffmrw/Scribable.3484.1.pb.neff', '--verbose=35']' returned non-zero exit status 1.

dacorvo commented on July 22, 2024

And for batch size 8:

$ gptj_demo run --batch_size 8 --n_positions 20 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...2023-06-27T07:29:21Z ERROR 2586 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
 0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 1# 0x00007F5892B11E96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
 4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 6# 0x00007F58465233FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
 8# 0x00007F5892B53130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
 9# 0x00007F584A99B820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007F584A9A635E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007F584A9999EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007F584A9BB71E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007F58DFB59C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007F58DFB6E2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007F58DFB59C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007F58DFB69AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007F584A9AFBE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007F58DFB86C6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007F58DFB89082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007F58DFB59C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007F58DFB6E2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007F58DFB59C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007F58DFB69AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007F58DF60DECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007F58DF645BA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007F58DF613CD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007F58DFC5A79C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007F58DFC669AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007F58DFC5CCED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007F58DFC5CEC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007F58DFC6FDA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x000000000067DBF1 in /usr/bin/python3
67# 0x000000000067DC6F in /usr/bin/python3
68# 0x000000000067DD11 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-27T07:29:21Z ERROR 2586 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-27T07:29:21Z ERROR 2586 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB 
Instruction: I-25625
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,8],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
...
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmp4w_o2yf2/Scribable.3484.1.pb', '--output=/tmp/tmp4w_o2yf2/Scribable.3484.1.pb.neff', '--verbose=35']' returned non-zero exit status 1.

aws-donkrets commented on July 22, 2024

dacorvo, thanks for posting. The error appears to be consistent regardless of batch size. We are still investigating why it occurs.
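
For anyone comparing the two logs above, here is a minimal sketch that pulls the labelled fields out of the "BIR verification failed" block and lists which ones differ between runs. The helper names `parse_bir_failure` and `differing_fields` are hypothetical and not part of neuronx-cc or transformers-neuronx. The field labels are copied from the logs in this thread, and the sketch assumes other failures use the same `Label: value` layout.

```python
# Field labels as they appear in the failure blocks posted in this thread.
FIELDS = (
    "Reason",
    "Instruction",
    "Opcode",
    "Input index",
    "Access Pattern",
    "Memory Location",
)


def parse_bir_failure(log_text):
    """Return a dict of the labelled fields found in a failure log (hypothetical helper)."""
    found = {}
    for line in log_text.splitlines():
        for name in FIELDS:
            prefix = name + ":"
            # Only match lines that begin with the label, so the timestamped
            # "... Failure Reason: === BIR verification failed ===" line is skipped.
            if line.startswith(prefix):
                found[name] = line[len(prefix):].strip()
    return found


def differing_fields(a, b):
    """Names of fields whose values differ between two parsed failures."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Failure blocks copied from the batch size 4 and batch size 8 logs above.
batch4 = """Reason: Expect memory location to be of type SB
Instruction: I-25457
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,4],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
"""

batch8 = """Reason: Expect memory location to be of type SB
Instruction: I-25625
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,8],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
"""

if __name__ == "__main__":
    print(differing_fields(parse_bir_failure(batch4), parse_bir_failure(batch8)))
    # prints ['Access Pattern', 'Instruction']
```

On these two logs, only the instruction ID and the access pattern differ. The reason, opcode, input index, and memory location are identical, and the second value in the access pattern's first pair (4 versus 8) matches the batch size of each run.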

dacorvo commented on July 22, 2024

Can you confirm whether this is fixed in the latest release?

aws-donkrets commented on July 22, 2024

dacorvo, we have not made an explicit fix for this in the latest release. However, you are welcome to try it and see whether other compiler changes have had an impact.

dacorvo commented on July 22, 2024

It seems to be fixed in 0.5.58.
