AutoML: CI trips with `CellTimeoutError` / `ValueError: Input contains NaN.` about cratedb-examples HOT 18 CLOSED

amotl commented on May 28, 2024

AutoML: CI trips with `CellTimeoutError` / `ValueError: Input contains NaN.`

from cratedb-examples.

Comments (18)

andnig commented on May 28, 2024 1

Hours and hours of debugging into the dependencies of pycaret, googling the term "transitive dependencies", just to find that the test was still running on Python 3.10. Life of a developer is fun 😄
https://github.com/crate/cratedb-examples/actions/runs/7036786822/job/19150177672?pr=171

Can you confirm that the test is green?

To be honest, I'm not sure if this issue is really resolved yet, as the pycaret timeseries notebook test was always green, but the script version of it failed. Smells like a flaky test or environment. Let's monitor the situation, but as the PR test is green for now, I will not invest more time at this point. Does that work for you?

from cratedb-examples.

andnig commented on May 28, 2024 1

Hi Andreas, if you look at the logs it's not a timeout error, it's the nan input error. As mentioned above I'd suggest to keep these two issues separated. The timeout issue is most probably related to the jupyter test runner. This input nan error however is not related to jupyter.

If I look at the failed run, I see that the esm model has an incredibly high MASE and RMSSE. This mostly indicates that the model is not very well suited for the data. I suggested it because it is very lightweight, but well, too lightweight, as it seems 😓

To go forward, you could:

Use a different model for the test run, one which has a lower MASE. Run the whole pycaret model suite locally and, for your test run, select one of the top 5 models instead of the exp_smooth one.
If this does not help, can you provide some local reproduction steps? If you can reproduce it locally, I will be better able to help.
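
For context on why a very high MASE points at a poorly suited model: MASE divides the forecast's mean absolute error by the in-sample error of a naive one-step forecast, so values well above 1 mean the model does worse than simply repeating the previous observation. A minimal sketch of that definition in plain Python (non-seasonal variant; the function name and the sample numbers are illustrative and not taken from the failed run):

```python
from typing import Sequence


def mase(train: Sequence[float], actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute scaled error, non-seasonal variant."""
    if len(train) < 2:
        raise ValueError("need at least two training observations")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")

    # Error of the model on the forecast horizon.
    model_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # In-sample error of the naive forecast "next value = previous value".
    naive_mae = sum(abs(train[i] - train[i - 1]) for i in range(1, len(train))) / (len(train) - 1)
    if naive_mae == 0:
        raise ValueError("training series is constant; MASE is undefined")

    return model_mae / naive_mae


# Illustrative numbers only.
print(mase(train=[1, 2, 3, 4], actual=[5, 6], forecast=[4, 4]))  # 1.5
```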

from cratedb-examples.

amotl commented on May 28, 2024

It is likely the error is related to changed input data.

Thinking about it once more, it is more likely that some dependency library of PyCaret was not pinned correctly, and that something changed in this area.

from cratedb-examples.

andnig commented on May 28, 2024

@amotl
All dependencies are pinned except the crate sqlalchemy one.
We can assume it's not related to pycaret itself.
Pycaret automatically interpolates nan values, except if there are ONLY nan values (or no values at all). The error might therefore indicate an issue with the testing infrastructure, connection or database.
Before I dig out my debug-rod: are there any changes in either the test runner or crate sqlalchemy that come to your mind and that might prevent reading data via pandas?
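
One way to tell a data-availability problem apart from a modelling problem is to check the target column right after loading, and to fail with a clear message. A minimal sketch in plain Python, assuming the values have already been read into a list; `check_target` and `is_missing` are hypothetical helpers, not part of PyCaret or pandas:

```python
import math
from typing import Optional, Sequence


def is_missing(value: Optional[float]) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_target(values: Sequence[Optional[float]], name: str = "target") -> int:
    """Return the number of usable values, or raise if there is nothing to train on."""
    if len(values) == 0:
        raise ValueError(f"{name}: no rows were read; check the connection and the query")
    usable = sum(1 for v in values if not is_missing(v))
    if usable == 0:
        raise ValueError(f"{name}: all {len(values)} values are missing")
    return usable
```

With a pandas DataFrame, the equivalent checks would be `df.empty` and `df[column].isna().all()`.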

from cratedb-examples.

amotl commented on May 28, 2024

All dependencies are pinned except the crate sqlalchemy one.
We can assume it's not related to pycaret itself.

That's true, but I am talking about transitive dependencies of PyCaret. I think that is the most likely reason, but sure, it could also be something else.

from cratedb-examples.

amotl commented on May 28, 2024

Thank you very much for your efforts. Sure, let's merge the PR, close this issue, and monitor the situation into the future for similar events.

from cratedb-examples.

amotl commented on May 28, 2024

I am just re-running the most recently failed run, https://github.com/crate/cratedb-examples/actions/runs/7027445018, in order to rule out that it is related to the time of day when the test is executed.

If it fails again, it is likely that the upgrade to Python 3.11 resolved the situation in one way or another, and that your debugging efforts had a positive outcome.

from cratedb-examples.

amotl commented on May 28, 2024

Aha, it is green again, so it was actually just a fluke. However, it is an interesting one which can also easily hit production applications, depending on what the actual root cause was.

from cratedb-examples.

andnig commented on May 28, 2024

This is related to how the tests in our repo here are designed. The model training pipeline itself is not of concern - see some of the reasons for why this error happens above. I know this error quite well from my projects - it happens if the data are not available as expected.

from cratedb-examples.

amotl commented on May 28, 2024

Hi again.

I think the root cause for this is actually the venerable CellTimeoutError, i.e. the notebook just causes too much system load, see, for example, ¹:

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

^^ Do you see any chance to make this spot more efficient on CI, @andnig?

With kind regards,
Andreas.

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870
-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2872

https://github.com/crate/cratedb-examples/pull/174#issuecomment-1837265321 ↩
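
To see which step eats the 300 seconds, the slow call can be wrapped in a timer, and the durations compared between a local run and CI. A minimal sketch using only the standard library; `timed` is a hypothetical helper, and the commented-out `setup()` call is the one from the traceback above:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

durations: Dict[str, float] = {}


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[label] = time.perf_counter() - start
        print(f"{label}: {durations[label]:.1f}s")


# In the notebook cell:
# with timed("setup"):
#     s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
with timed("example"):
    time.sleep(0.1)
```

Because the duration is recorded in a `finally` clause, it is also printed when the wrapped call raises.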

from cratedb-examples.

amotl commented on May 28, 2024

Another occurrence of the venerable CellTimeoutError. It also happens on a setup() call, but this time, on a different one.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           from pycaret.classification import setup, compare_models, tune_model, ensemble_model, blend_models, automl, \
E               evaluate_model, finalize_model, save_model, predict_model
E           
E           s = setup(
E               data,
E               target="Churn",
E               ignore_features=["customerID"],
E               log_experiment=True,
E               fix_imbalance=True,
E           )
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2746

from cratedb-examples.

amotl commented on May 28, 2024

We found that this was mainly due to a misconfiguration of the MLFLOW_TRACKING_URL. It has been fixed with GH-174, until further notice. Thanks for your support, @andnig!
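
A misconfigured tracking URL can show up as a stall rather than an error, when logging calls wait on a host that does not answer. One option is to probe the endpoint once, with a short timeout, before a `setup(..., log_experiment=True)` call runs. A minimal sketch using only the standard library; the helper names are hypothetical, and the variable name is used as written in this thread (MLflow's own documentation calls it `MLFLOW_TRACKING_URI`):

```python
import os
import socket
from typing import Optional, Tuple
from urllib.parse import urlsplit


def tracking_endpoint(url: str) -> Tuple[str, int]:
    """Extract host and port from an http(s) tracking URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {url!r}")
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Only probe when the variable is set; the name is an assumption, see above.
url: Optional[str] = os.environ.get("MLFLOW_TRACKING_URL")
if url:
    host, port = tracking_endpoint(url)
    if not is_reachable(host, port):
        raise RuntimeError(f"tracking server {host}:{port} is not reachable")
```

This only covers HTTP(S) tracking servers; a local file or database location would need a different check.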

from cratedb-examples.

amotl commented on May 28, 2024

Hi again. This issue is still present, and is constantly haunting us, which is unfortunate.

The most recent occurrence, just about two hours ago, happened after we re-scheduled the corresponding job to run during the daytime, as we figured it would work better. Turns out, it doesn't help.

Now, looking a bit closer at the error output, I am just now also spotting this warning:

  /opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
  STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
  
  Increase the number of iterations (max_iter) or scale the data as shown in:
      https://scikit-learn.org/stable/modules/preprocessing.html
  Please also refer to the documentation for alternative solver options:
      https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

-- https://github.com/crate/cratedb-examples/actions/runs/7854158803/job/21434611151#step:6:1155

Could that actually be related to the job occasionally (50/50) stalling/freezing/timing out?

from cratedb-examples.

andnig commented on May 28, 2024

Hey Andreas, happy to chime in. 👋

Please think about separating these two topics for more clarity. CellTimeout and Input NaN are mostly two separate issues, if both still occur. CellTimeout is more often than not related to the jupyter test runner (or, well, simple timeouts), while input nan can mean multiple things; the more common ones are that the data is not there, that your infrastructure is about to get killed, or that training iterators are running amok.
As your test infrastructure is quite limited in terms of CPU power, we added a check for the PYTEST_CURRENT_TEST env variable, so that test runs only train 3 models, which are also rather fast to train. If I remember correctly, we used two ets model variants and a naive one.
From the logs you shared, however, it seems that all the models are trained (the non-convergence warning also relates to a model which we excluded for test runs).

I would suggest utilizing the PYTEST_CURRENT_TEST environment variable for both the ipynb and the py tests, to reduce training time and potentially solve both issues related to how you test these notebooks. Please just make sure that the env vars are "visible" for the jupyter notebooks as well. The exact config depends on which jupyter test runner you use.

I hope this helps so far, let me know, how it goes.

PS: You mentioned that the tests fail 50/50, but at a quick glance I was only able to find 2 failed tasks. Would you mind checking if the input nan failures are always on notebook tests, or also on .py file tests?

from cratedb-examples.

amotl commented on May 28, 2024

Hi Andreas, thanks for your quick reply.

From the logs you shared it seems however that all the models(!) are trained, [while we intended to only run a few of them]. [I can] also [spot] a non-converge error, which is related to a model which we excluded for test runs.
[Most probably, PYTEST_CURRENT_TEST is not getting evaluated properly.] Please just make sure that the env vars are "visible" for the jupyter notebooks as well.

That's to the point. I also had the suspicion that the measures we took last time, to bring down the required compute resources, did not work well, or had flaws, but I did not analyze the log output about this topic yet. So, if you think this is the issue still tripping us, I now have something to hold on to and investigate. Thank you so much!

With kind regards,
Andreas.

from cratedb-examples.

amotl commented on May 28, 2024

Hi again. We've explored the situation, and the outcome is that we can confirm that the call to compare_models works well, including its guard using a corresponding if "PYTEST_CURRENT_TEST" in os.environ clause.

The guard works when invoking pytest in the local directory, and it works when invoking ngr test from the repository root directory.
The guard also works in both files equally well, the pure .py file, and the .ipynb file, so it is apparently not obstructed by pytest / nbtest runners.

I wouldn't know why it should be different on GHA. So, maybe the selected algorithms ["arima", "ets", "exp_smooth"] / ["ets", "et_cds_dt", "naive"] are still too heavy on CPU and/or memory?
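
For reference, the guard described above can be sketched as follows. pytest sets `PYTEST_CURRENT_TEST` while a test is running; the model list is the first one quoted above, and `models_to_compare` is a hypothetical helper name rather than the code in the repository:

```python
import os
from typing import List, Mapping, Optional

FAST_MODELS: List[str] = ["arima", "ets", "exp_smooth"]


def models_to_compare(environ: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Return a reduced model list under pytest, or None to compare all models."""
    if "PYTEST_CURRENT_TEST" in environ:
        return list(FAST_MODELS)
    return None


include = models_to_compare()
# Printing the decision makes it visible in the CI log which branch was taken.
print("test mode" if include is not None else "full run", include)

# In the notebook or script, the result would be passed on, e.g.:
# best = compare_models(include=include)
```

Logging the selected branch on GHA would show directly whether the reduced list is in effect there, independent of how the runner invokes the notebook.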

from cratedb-examples.

amotl commented on May 28, 2024

Thanks, and sorry that I mixed up those two different errors again. I've diverted those into separate issues now, so this one can be closed after carrying over the relevant information.

from cratedb-examples.

amotl commented on May 28, 2024

After splitting the issue up into different tickets, but without applying any other fixes, we are currently not facing any problems on nightly runs of the corresponding CI jobs.

Therefore, I am closing the issue now, for the time being. Thanks again, @andnig!

from cratedb-examples.
