
Comments (9)

ThisCakeIsALie commented on May 25, 2024

It works! I set my project up on Colab and it works with the changes you mentioned (with TensorFlow 1.15.2)! Now I just have to figure out what's wrong on my machine (maybe Anaconda messed up some installs?).

But thank you so much for the help. Much appreciated!


Sonic714 commented on May 25, 2024

Thank you so much for your help! I'll try training a CPU model on my machine. I cannot use the GPU because TF 1.15 does not support CUDA 11 or RTX 30 Series GPUs.


PPPI commented on May 25, 2024

Let me know if there are other issues or you get stuck and I'll try to help! Good luck!


PPPI commented on May 25, 2024

Could you double-check whether you are running tensorflow or tensorflow-gpu?
If you are running plain tensorflow, or tensorflow-gpu in CPU mode, the row-major/column-major format changes.
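
If it helps, here is a quick way to check which build is actually active (just a sketch, assuming a standard TF 1.x pip install; this is not part of the POSIT code):

import tensorflow as tf

print(tf.__version__)
print(tf.test.is_built_with_cuda())  # True for tensorflow-gpu builds, False for the plain CPU package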

If this is the reason for the issue (I suspect so, but I'm not sure), please find line 91 in src/tagger/config.py; it should say

use_cpu = False

Just flip this line over to True.

Sadly, I didn't know of a reliable way to detect if I was running in CPU mode, so it's on the user to flip the flag in the config!
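
If you do want to try auto-detecting it, something along these lines might work with TF 1.x, though as said above I can't vouch for its reliability, and it is not in the shipped config.py:

import tensorflow as tf

# Hypothetical heuristic, not part of POSIT's config.py:
use_cpu = not tf.test.is_gpu_available()  # True when TF cannot see a usable GPU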

Hope this helps!
-Profir


ThisCakeIsALie commented on May 25, 2024

Thanks for replying so quickly! I was indeed using tensorflow (1.14 to be exact).

Sadly, the config change doesn't quite do the trick. The original error vanishes, but there still seems to be some kind of dimension mismatch.

input> This is a test a nd a call to test()
Traceback (most recent call last):
  File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [10,1], In[1]: [8,4]
         [[{{node feature/bidirectional_rnn/bw/bw/while/lstm_cell_3/BiasAdd_1}}]]

During handling of the above exception, another exception occurred:

...

I will try setting things up on Colab (maybe this is a problem with my machine, who knows) and report back if anything different happens there.


PPPI commented on May 25, 2024

Reading up on this error, it seems the input does not match the bias term in the LSTM cell. I'm not particularly sure what the root cause is.

For the saved model the config should be correct; however, I did change the config on the branch (and the code there has changed significantly).

Could you sanity-check that you are using the code from master? I cannot validate the old model against the branch code before mid/late next week.

If you want to check data integrity:

results.zip: (SHA-256) 597764810B66E13F099216470E0BF4877A6EABBA2A24EDC41B5A6D9ABA771942

Also, I'm not sure if my GH LFS quota covers enough data downloads, so just in case, here are the corpora: https://1drv.ms/u/s!AnfFX0y_EVFM1IwNFJDRK5EBV3UK4g?e=jQZie6 (OneDrive is the easiest option on my end to provide).

For Colab, the caveat about CPU applies: make sure the flag is set the right way around depending on whether you use a GPU or not.

I also sanity checked on my end (maybe something broke :) ) with the example line:

$ python src/evaluate.py ./results/test/SO_Freq_Id/7dec5e7f-9c9b-4e7b-a52a-cdb6183f83de_with_crf_with_chars_with_features_epochs_30_dropout_0.500_batch_16_opt_rmsprop_lr_0.0100_lrdecay_0.9500/model.weights

(On Windows 10 x64 B.2004 with Python 3.7 and TF-gpu 1.14)
as well as training. Both work on the master branch. I cannot give guarantees on the branch, as the config format there has changed significantly.

I'll keep an eye here if there are issues on the colab.


PPPI commented on May 25, 2024

Quick update after investigating some things on my end. The model was trained before I modelled/allowed UNK in chars, so even once you sort out the other errors you will hit an index[m,n,l] = -1 error. I am pushing an update to revert this on master (I still want it on my branch, though).

You can also manually fix it on your end by editing src/tagger/data_utils.py:L166 to:

char_ids = vocab_chars.doc2idx(chars_, unknown_word_index=0)

The change is from UNK being mapped to -1 to UNK being mapped to 0 (i.e. effectively ignored).
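
For anyone curious what that argument changes, here is a tiny standalone sketch. It assumes vocab_chars is a gensim corpora.Dictionary (which is where doc2idx and unknown_word_index come from); the toy vocabulary is made up for the example:

from gensim.corpora import Dictionary

vocab_chars = Dictionary([list("abc")])  # toy char vocabulary: a, b, c
print(vocab_chars.doc2idx(list("abz")))                        # default: unknown char 'z' -> -1
print(vocab_chars.doc2idx(list("abz"), unknown_word_index=0))  # with the fix: unknown char 'z' -> 0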

Hope this helps, and I will keep an eye on the thread if you need other help.

-Profir


Sonic714 commented on May 25, 2024

Hi. I ran evaluate.py with the sample command and got a similar issue to the one mentioned in #3. I tried changing unknown_word_index, but it didn't work. I wonder if there is anything I did wrong. The error is as follows.
I am using the saved model and the code on the master branch with some modifications: changing the CPU flag to true in config.py as well as in config.json in the saved model, and inserting sys.path in evaluate.py to avoid a "no module" error. This is on Windows 10 with Python 3.7 and TF 1.15.2 (versions 1.14.0 and 1.15.0 were also tried).
Thank you in advance.
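
(For reference, the sys.path tweak I mean is roughly the following, added near the top of evaluate.py; the exact relative path here is my own guess:)

import os
import sys

# Hypothetical fix: make the repository root importable so modules under src/ resolve
# when running src/evaluate.py directly.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))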

Traceback (most recent call last):
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
         [[{{node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/evaluate.py", line 80, in <module>
    main()
  File "src/evaluate.py", line 75, in main
    model.evaluate(test)
  File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\base_model.py", line 151, in evaluate
    metrics = self.run_evaluate(test)
  File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 474, in run_evaluate
    labels_pred, labels_pred_l, sequence_lengths = self.predict_batch(words)
  File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 391, in predict_batch
    feed_dict=fd)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
         [[node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3 (defined at D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]

Original stack trace for 'feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3':
  File "src/evaluate.py", line 80, in <module>
    main()
  File "src/evaluate.py", line 68, in main
    model = restore_model(config)
  File "src/evaluate.py", line 59, in restore_model
    model.build()
  File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 367, in build
    self.add_word_embeddings_op()
  File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 229, in add_word_embeddings_op
    sequence_length=feature_sizes, dtype=tf.float32)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 464, in bidirectional_dynamic_rnn
    scope=fw_scope)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 707, in dynamic_rnn
    dtype=dtype)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 916, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop
    return_same_structure)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 882, in _time_step
    skip_conditionals=True)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 283, in _rnn_step
    new_output, new_state = call_cell()
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 870, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\layers\recurrent.py", line 2241, in call
    x_o = K.bias_add(x_o, b_o)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\backend.py", line 5442, in bias_add
    x = nn.bias_add(x, bias)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2718, in bias_add
    return gen_nn_ops.bias_add(value, bias, data_format=data_format, name=name)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 760, in bias_add
    "BiasAdd", value=value, bias=bias, data_format=data_format, name=name)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


PPPI commented on May 25, 2024

Hey,

Sadly, a GPU-trained model will not work on the CPU, and vice versa. So this:
"changing the CPU flag to true in config.py as well as config.json in the saved model,"
changes the code path taken, while the actual parameters are still GPU-shaped.

The error is exactly what I expect to see in such a scenario.

I assume you need to run on the CPU and don't have time to train a model. I am uncertain how soon I can train a model on the master branch, but if you need a CPU model, I could get to that by the end of March (I have other engagements that are saturating my compute and time).

If you can run the GPU model on the GPU, I would suggest going that route instead as that might be quicker than waiting on me to train a fresh CPU model.

Sorry if this answer is a bit disappointing.

For a slightly more technical explanation of what is happening: within the biLSTM library code, some tensors/matrices have a different encoding (row-first/column-first) depending on CPU/GPU execution. This is due to what each device expects to see, so a model trained on one device has data that looks scrambled to the other. Technically, you could try to convert all parameters from row- to column-first and rerun the model, but I have not worked at that low a level in tensorflow and cannot advise how to do that, nor do I think it's a good use of your time ;)
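
As a purely generic illustration of the row-first/column-first point (plain numpy, not POSIT or TensorFlow internals):

import numpy as np

w = np.arange(6, dtype=np.float32).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(w.ravel(order="C"))  # [0. 1. 2. 3. 4. 5.] -- how a row-first consumer reads the buffer
print(w.ravel(order="F"))  # [0. 3. 1. 4. 2. 5.] -- the same values read column-first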

Regards,
Profir
