Comments (12)
It's using Elastic
from clearml-server.
docker.io/allegroai/clearml:1.14.1-448
from clearml-server.
Hi @kzelias, what is your code doing, exactly?
from clearml-server.
It's just a task over hydra.
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"


@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):
    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()
    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)
    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")
    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
from clearml-server.
UPD: At the beginning of training the scalars work; after 5-10 thousand steps, this error appears.
from clearml-server.
This might be an issue with Elastic. Can you check the Elastic Docker container logs?
from clearml-server.
The error persisted for one week. It disappeared today.
The only thing that happened during this time was a restart of the apiserver a few hours ago.
That's strange. Is the apiserver related to Elastic?
from clearml-server.
The situation repeated itself. This time, the apiserver rebooted quickly.
The Elastic log is not detailed:
clearml-elastic-master.log
apiserver:
[2024-03-06 07:44:28,228] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.007s]
[2024-03-06 07:44:28,232] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.003s]
[2024-03-06 07:44:28,235] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]
[2024-03-06 07:44:28,238] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]
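To see whether the 400s are limited to one index, the apiserver log can be summarised with a few lines of Python. This is only a sketch: the regular expression is written against the log lines quoted above, and the names `ES_SEARCH_LINE` and `count_failed_searches` are made up for illustration, not part of ClearML.

```python
import re
from collections import Counter

# Matches lines such as:
# [...] [WARNING] [elasticsearch] POST http://host:9200/<index>/_search [status:400 request:0.007s]
ES_SEARCH_LINE = re.compile(
    r"\[elasticsearch\] [A-Z]+ \S+?/(?P<index>[^/\s]+)/_search "
    r"\[status:(?P<status>\d+) request:[\d.]+s\]"
)


def count_failed_searches(lines):
    """Count _search requests with HTTP status >= 400, keyed by (index, status)."""
    failures = Counter()
    for line in lines:
        match = ES_SEARCH_LINE.search(line)
        if match and int(match["status"]) >= 400:
            failures[(match["index"], int(match["status"]))] += 1
    return failures


# Usage (the log path is an assumption):
# with open("apiserver.log") as fh:
#     for (index, status), n in count_failed_searches(fh).items():
#         print(index, status, n)
```

If every failure maps to the same `events-training_stats_scalar-...` index, the problem is most likely specific to that index rather than to Elastic as a whole.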
from clearml-server.
Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is.
from clearml-server.
My code is here
#233 (comment)
The server is deployed with Helm:
https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml
I only changed the repository for Elastic. I haven't changed the version:
- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3
from clearml-server.
Some more logs from the apiserver:
[2024-05-03 07:48:39,500] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
[2024-05-03 07:48:39,846] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_single_value_metrics in 23ms
[2024-05-03 07:48:39,887] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.037s]
[2024-05-03 07:48:39,889] [9] [ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 60ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2024-05-03 07:48:39,921] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 113ms
[2024-05-03 07:48:39,994] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
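The 500 above says that `metric` is mapped as a `text` field in that index, so Elasticsearch refuses to aggregate on it. One way to confirm this is to ask Elasticsearch for the field mapping directly. Below is a minimal sketch, assuming the Elasticsearch 7.x get-field-mapping endpoint (`GET /<index>/_mapping/field/<field>`) and its usual response layout; the helper names `fetch_field_mapping` and `field_types` are hypothetical, and the host and index are copied from the log lines.

```python
import json
from urllib.request import urlopen


def fetch_field_mapping(base_url, index, field):
    """Call the get-field-mapping endpoint and return the parsed JSON body."""
    url = f"{base_url}/{index}/_mapping/field/{field}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def field_types(response, field):
    """Map each index in the response to the mapped type of `field` (None if absent)."""
    leaf = field.split(".")[-1]  # the inner "mapping" object is keyed by the leaf name
    types = {}
    for index_name, body in response.items():
        entry = body.get("mappings", {}).get(field, {})
        types[index_name] = entry.get("mapping", {}).get(leaf, {}).get("type")
    return types


# Usage (needs network access to the Elastic service, so it is left commented out):
# ES_URL = "http://clearml-elastic-master:9200"
# INDEX = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"
# print(field_types(fetch_field_mapping(ES_URL, INDEX, "metric"), "metric"))
```

If this reports `text` for the failing index, that matches the error message, which asks for a `keyword` field instead. How the index ended up with that mapping is not established in this thread.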
from clearml-server.
@kzelias the latest server version has some fixes related to this issue. Can you try with v1.15.0?
from clearml-server.