
Comments (15)

rwynn commented on July 18, 2024

@miklosbarabas,
Did a quick test with

./monstache --cluster-name c1 --verbose

Ran 2 processes like that and it looked like when I killed one the other took over and was still able to connect and send bulk requests.

There is currently a hard-coded 10s timeout on the HTTP client used for Elasticsearch:

https://github.com/rwynn/monstache/blob/master/monstache.go#L1101
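
For reference, a hard-coded client timeout in Go looks roughly like this (a simplified sketch, not the exact monstache code):

package main

import (
    "net/http"
    "time"
)

// newHTTPClient returns a client whose requests are aborted once they
// exceed the given timeout. With a 10s limit, a large bulk request that
// takes longer than that fails with a timeout error.
func newHTTPClient(timeout time.Duration) *http.Client {
    return &http.Client{Timeout: timeout}
}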

I'm not sure if that's the problem or not. Do you also see this issue on a simple local install of elasticsearch?

rwynn commented on July 18, 2024

@miklosbarabas,

Just published a new version with a configurable timeout. It now defaults to 60 seconds.

miklosbarabas commented on July 18, 2024

@rwynn

Sorry for the late response, and thank you for your fast reaction and for the new release!

Some additions for the scenario:
Using ES 2.4.4 with 3-5 nodes in the cluster, and running monstache on the node boxes (3 workers per node), the above-mentioned error occurred regardless of resources (though it started to happen earlier on smaller AWS T2 instances).
The following configurations were used:

...
elasticsearch-max-docs = 1000
elasticsearch-max-seconds = 1
elasticsearch-max-bytes = 32768 # I tried with the default as well..
elasticsearch-retry = true
gtm-channel-size = 200
direct-read-namespaces = ["mydb.mycoll"] # here we use multiple db and collection
#some mappings
...

Also, the error was happening during the direct-read-namespaces stage (new oplog entries weren't synced at this point), and it somehow left the ES cluster stuck: no errors in the logs, the REST endpoint seemingly up, though apps could no longer connect on the transport protocol (9300). The metrics showed normal resource consumption and the cluster was in a green state, but it was not really fully usable...

Finally, after rereading the monstache documentation (:+1:), I found the thread pool link.
Adding the following properties to elasticsearch.yml made the issue go away:

threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    bulk:
        type: fixed
        size: 30
        queue_size: 1000

So far, the above-mentioned scenario is working properly. I could even roll out your new version without any issue, with the other hosts' workers picking up the job!

miklosbarabas commented on July 18, 2024

@rwynn I have some other questions, hope you don't mind if I put them down here as well:

  • Any idea why the direct-read-namespaces phase sometimes slows down so much that it is indexing only 1-2 documents/s? (at these times, the stats show that only 1-2 index requests are queued)
  • Checking the monstache.monstache collection in Mongo, the worker's ts field is sometimes updated in a strange way: hopping back and forth in time (e.g. 2017-Feb-04 12:00, then 2015-May-12 14:00, then 2017-Jan-03 11:00, with an inc of 0). Why is this happening?
  • Also, could you please say a few words about how the elasticsearch-retry exponential backoff works?

Many thanks!

rwynn commented on July 18, 2024

@miklosbarabas,
can you explain a little more about the slowdown you are seeing? I tried direct reads on 3 mongo collections with 100K documents each, and monstache indexed them in 30 seconds. You can try timing this with a command like:

time ./monstache --direct-read-namespace test.a --direct-read-namespace test.b --direct-read-namespace test.c --exit-after-direct-reads

I pushed a new release which should fix the ts field issue in the monstache collection. It was caused by monstache recording timestamps from direct reads, which it should not. Now the ts should only progress forward and come from the oplog data only.

I also added some documentation about retry. Hopefully, this will explain it.
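
The gist of exponential backoff is that each failed bulk request is retried after a progressively longer delay. A simplified sketch of the pattern (illustrative only, not the exact monstache retry code):

package main

import "time"

// retryWithBackoff retries op with exponentially growing delays
// (1s, 2s, 4s, 8s, ...) up to maxRetries attempts, returning the
// last error if every attempt fails.
func retryWithBackoff(op func() error, maxRetries int) error {
    delay := time.Second
    var err error
    for i := 0; i < maxRetries; i++ {
        if err = op(); err == nil {
            return nil
        }
        time.Sleep(delay)
        delay *= 2
    }
    return err
}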

Have you tried running only the direct reads via a cron job, rather than in the process with clustering and workers? The direct reads are good for catching up periodically. The long-running, clustered processes can tail the oplog and not do direct reads. Another single, non-clustered process could be run periodically to do the direct reads and exit.
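
For example, a crontab entry along these lines (the path, schedule, and namespace are just placeholders) would catch up once a night and then exit:

# illustrative: one-shot direct read sync every night at 2am
0 2 * * * /usr/local/bin/monstache --direct-read-namespace mydb.mycoll --exit-after-direct-reads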

miklosbarabas commented on July 18, 2024

@rwynn
Thx for the new release!

It is happening after around the same number of documents has been indexed (~3 million), and only for this collection, which has ~6 million documents.
MongoDB metrics are fine as well as the ES metrics. No errors in the logs.

I tried running just one monstache process, in non-worker and non-HA mode, only direct-reading the data and set to exit when done.
I experienced the same slowdown.

Even tried to increase gtm values like this:

[gtm-settings]
channel-size = 512
buffer-size = 256
buffer-duration = "4000ms"

along with elasticsearch-max-seconds = 5.

Nothing changed. The slowdown happens at around the same point of the direct read (after ~3 million docs).

Following the trace file, it looks like it really is indexing mostly 1-3 documents per request, although occasionally one or two larger (1000-document) bulk requests appear.

miklosbarabas commented on July 18, 2024

Also, after enabling index-oplog-time, I see this in the trace for some of the documents:
"_oplog_date":"1970/01/01 00:00:00","_oplog_ts":0

The application stops after a while, so the direct read finished its job, but comparing the indexed document count in ES with the count of the collection in MongoDB, it still seems to be missing ~1 million docs.
Based on the stats file, only ~1500 documents failed before the direct read finished.

rwynn commented on July 18, 2024

The oplog-time is not always available for direct reads. If the _id of the document is an ObjectID then it can be determined, because of the nature of ObjectIDs. If not, the timestamp will just be 0.

https://github.com/rwynn/gtm/blob/master/gtm.go#L465
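
In other words, an approximate time can only be recovered when the _id is an ObjectID, since an ObjectID embeds its creation time in its first 4 bytes. A minimal sketch with the mgo bson package (assuming gopkg.in/mgo.v2/bson):

package main

import (
    "fmt"

    "gopkg.in/mgo.v2/bson"
)

func main() {
    // The first 4 bytes of an ObjectId encode its creation time in
    // seconds, so a timestamp can be derived from the _id itself.
    id := bson.NewObjectId()
    fmt.Println(id.Time())

    // Any other _id type (string, int, UUID, ...) carries no time
    // information, so the oplog time is reported as 0 (the 1970 epoch).
}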

rwynn commented on July 18, 2024

My current guess is that the slowdown may be occurring because of the nature of the direct-read query. It uses a $natural sort, which might not be optimal. I can try changing this to an _id sort to see if it then uses the index to speed things up.

https://github.com/rwynn/gtm/blob/master/gtm.go#L448
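
For context, the two query shapes look roughly like this in mgo, which gtm is built on (a sketch, not the exact gtm code):

package main

import mgo "gopkg.in/mgo.v2"

// naturalOrderIter walks the collection in on-disk ($natural) order,
// which cannot take advantage of an index.
func naturalOrderIter(s *mgo.Session) *mgo.Iter {
    return s.DB("mydb").C("mycoll").Find(nil).Sort("$natural").Iter()
}

// idOrderIter walks the collection in _id order, which can use the
// default _id index and tends to behave better on very large collections.
func idOrderIter(s *mgo.Session) *mgo.Iter {
    return s.DB("mydb").C("mycoll").Find(nil).Sort("_id").Iter()
}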

miklosbarabas commented on July 18, 2024

I could give it a try straight away if you release one based on the _id sort!

rwynn commented on July 18, 2024

@miklosbarabas, just pushed a new release. Give that one a try and let me know if the problem persists. Thanks for reporting this issue.

rwynn commented on July 18, 2024

The new release includes an updated version of the github.com/rwynn/gtm library. That is where the changes related to the slowness issue were made.

rwynn commented on July 18, 2024

@miklosbarabas, any luck with the latest release?

miklosbarabas commented on July 18, 2024

rwynn commented on July 18, 2024

Glad to hear it helped. You might be able to squeeze out a little more performance with the workers option and multiple processes. That will increase the overall query load on MongoDB by a factor of N workers, but the Elasticsearch indexing load of each process drops to 1/N of the total. Depending on how well Mongo can handle the queries, it might give you an overall boost.
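
Roughly, it works like this (see the workers section of the docs for the exact syntax; this is just a sketch): list the worker names in the shared TOML config and start one process per name.

# in the shared TOML config
workers = ["a", "b", "c"]

# then start one process per worker name
./monstache -f config.toml -worker a
./monstache -f config.toml -worker b
./monstache -f config.toml -worker c

Each worker only indexes the documents that hash to its name, so the Elasticsearch work is split N ways while each process still reads all the data from MongoDB.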

