Comments (12)

jkhenning avatar jkhenning commented on June 24, 2024

It's using Elastic

from clearml-server.

kzelias avatar kzelias commented on June 24, 2024

docker.io/allegroai/clearml:1.14.1-448

similar problems
#89
#178

jkhenning avatar jkhenning commented on June 24, 2024

Hi @kzelias, what is your code doing, exactly?

kzelias avatar kzelias commented on June 24, 2024

It's just a task over hydra.

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager

from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"

@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):

    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()

    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")

    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter

kzelias avatar kzelias commented on June 24, 2024

UPD: At the beginning of training the scalars work; after 5-10 thousand steps, this error appears.

jkhenning avatar jkhenning commented on June 24, 2024

This might be an issue with Elastic. Can you check the Elastic Docker container logs?

kzelias avatar kzelias commented on June 24, 2024

The error persisted for a week, then disappeared today.
The only thing that happened during this time was a restart of the apiserver a few hours ago.
This is strange. Is the apiserver related to Elastic?

kzelias avatar kzelias commented on June 24, 2024

The situation repeated itself. This time, the apiserver restarted quickly.
The Elastic log is not detailed:
clearml-elastic-master.log

apiserver:

[2024-03-06 07:44:28,228] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.007s]
[2024-03-06 07:44:28,232] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.003s]
[2024-03-06 07:44:28,235] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]
[2024-03-06 07:44:28,238] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]

clearml-apiserver.log
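
A minimal sketch for tallying these failures from a saved apiserver log, assuming the lines keep the exact shape shown above. The function name `count_failed_requests` and the regular expression are illustrative, not part of ClearML or the Elasticsearch client.

```python
import re
from collections import Counter

# Matches the elasticsearch client warning lines in the format pasted above.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[\d+\] \[WARNING\] \[elasticsearch\] "
    r"(?P<method>[A-Z]+) (?P<url>\S+) "
    r"\[status:(?P<status>\d+) request:(?P<secs>[\d.]+)s\]"
)


def count_failed_requests(lines):
    """Count non-2xx Elasticsearch requests per (index, status)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an elasticsearch warning line
        status = int(match.group("status"))
        if 200 <= status < 300:
            continue
        # Assumed URL shape: http://host:port/<index>/_search
        index = match.group("url").split("/")[3]
        counts[(index, status)] += 1
    return counts


# Two of the lines from the log excerpt above.
sample = [
    "[2024-03-06 07:44:28,228] [9] [WARNING] [elasticsearch] POST "
    "http://clearml-elastic-master:9200/events-training_stats_scalar-"
    "d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.007s]",
    "[2024-03-06 07:44:28,232] [9] [WARNING] [elasticsearch] POST "
    "http://clearml-elastic-master:9200/events-training_stats_scalar-"
    "d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.003s]",
]

for (index, status), n in count_failed_requests(sample).items():
    print(f"{index}: {n} request(s) failed with status {status}")
```

If every failure points at the same index, as it does in the excerpt, the problem is more likely tied to that index than to the cluster as a whole.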

jkhenning avatar jkhenning commented on June 24, 2024

Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is

kzelias avatar kzelias commented on June 24, 2024

My code is here
#233 (comment)

The server is deployed with Helm:
https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml
I only changed the repository for Elasticsearch; I haven't changed the version.

- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3

kzelias avatar kzelias commented on June 24, 2024

Some more logs from the apiserver:

[2024-05-03 07:48:39,500] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
[2024-05-03 07:48:39,846] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_single_value_metrics in 23ms
[2024-05-03 07:48:39,887] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.037s]
[2024-05-03 07:48:39,889] [9] [ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 60ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2024-05-03 07:48:39,921] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 113ms
[2024-05-03 07:48:39,994] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
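
The 500 above says that `metric` is mapped as a `text` field in that index, which Elasticsearch will not aggregate on by default. One way to confirm this is to fetch the index mapping (Elasticsearch's `GET /<index>/_mapping` API) and look at the field types. The sketch below only inspects an already-fetched mapping response. The helper `find_text_fields` and the sample response are hypothetical and for illustration only; the real mapping on this server may look different.

```python
def find_text_fields(mapping_response):
    """Return {index_name: [field, ...]} for top-level fields mapped as `text`.

    `mapping_response` is the parsed JSON body of GET /<index>/_mapping.
    Nested object properties are not descended into.
    """
    result = {}
    for index_name, body in mapping_response.items():
        props = body.get("mappings", {}).get("properties", {})
        fields = sorted(
            name for name, spec in props.items() if spec.get("type") == "text"
        )
        if fields:
            result[index_name] = fields
    return result


# Hypothetical response, shaped like what Elasticsearch's default dynamic
# mapping produces for string values (text plus a keyword sub-field).
sample_mapping = {
    "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b": {
        "mappings": {
            "properties": {
                "metric": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "iter": {"type": "long"},
                "value": {"type": "float"},
            }
        }
    }
}

print(find_text_fields(sample_mapping))
```

If `metric` does show up as `text` here, one possible explanation (an assumption, not confirmed in this thread) is that the index was created by dynamic mapping rather than from the server's index template.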

jkhenning avatar jkhenning commented on June 24, 2024

@kzelias the latest server version has some fixes related to this issue. Can you try with v1.15.0?
