I don't think this is possible yet - I'd like to be able to specify multiple hosts to

Connect to multiple hosts about corb2 HOT 31 CLOSED

marklogic-community commented on August 18, 2024 1

Connect to multiple hosts

from corb2.

Comments (31)

rjrudin commented on August 18, 2024 2

When I first heard of DMSDK, I was hoping corb3 would be a bit of a rewrite with DMSDK at the core and all the corb command-line options still present. But as noted here - marklogic/java-client-api#779 - I think there's the hangup with XCC generally being faster than the REST API, which DMSDK depends on.

from corb2.

bbandlamudi commented on August 18, 2024

This is quite interesting and could be useful. It is basically a poor man's load balancer for Corb. I thought about it earlier, but didn't think it would be worth the effort, as everyone uses some kind of load balancer in the production environment. We should consider implementing it.

from corb2.

hansenmc commented on August 18, 2024

It doesn't look like it would be that difficult to implement.
We would need to:

change contentSource to be a List
adjust prepareContentSource() to populate the List
ensure that a contentSource is obtained through the getContentSource() method, rather than by referencing the field directly (by making the contentSource List private).
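
A minimal sketch of those three steps, assuming hosts are passed as a comma-separated string. `HostPool` is a hypothetical class, plain `String` entries stand in for XCC content sources, and the round-robin selection is an assumption rather than corb's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder; String entries stand in for real content sources. */
final class HostPool {
    // Private, so callers cannot bypass getContentSource().
    private final List<String> contentSources = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    /** Stand-in for prepareContentSource(): fills the list from a comma-separated host string. */
    void prepareContentSource(String hostList) {
        contentSources.clear();
        for (String host : hostList.split(",")) {
            String trimmed = host.trim();
            if (!trimmed.isEmpty()) {
                contentSources.add(trimmed);
            }
        }
        if (contentSources.isEmpty()) {
            throw new IllegalArgumentException("no hosts given");
        }
    }

    /** Returns the next entry in round-robin order (assumed selection policy). */
    String getContentSource() {
        if (contentSources.isEmpty()) {
            throw new IllegalStateException("prepareContentSource() has not been called");
        }
        int index = Math.floorMod(next.getAndIncrement(), contentSources.size());
        return contentSources.get(index);
    }
}
```

Because the list is private, every caller goes through `getContentSource()`, which keeps the selection policy in one place.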

from corb2.

jmakeig commented on August 18, 2024

This is a good idea, especially if you’re restricted to certain hosts only through a load balancer. Please coordinate with @sammefford and @mattsunsjf on a design/gotchas. They’ve both worked on this problem for DMSDK and mlcp, respectively.

from corb2.

bbandlamudi commented on August 18, 2024

We should also keep an eye on the occasional long-running threads. If a particular host is less responsive, corb shouldn't keep bombarding it with more requests. We occasionally see hosts becoming hotter in our project (especially under other load). So, instead of simple round-robin, the code should check the active connections to each host while balancing the load. Also, corb should detect unresponsive hosts and remove them from the list. We should definitely coordinate with @sammefford and @mattsunsjf as they probably have more insight into these issues.

from corb2.

jmakeig commented on August 18, 2024

> We should definitely coordinate with @sammefford and @mattsunsjf as they probably have more insight into these issues.

Yes. Please.

from corb2.

sammefford commented on August 18, 2024

We haven't attempted to detect "hotter" hosts in DMSDK. For query we talk to all forests and get just the results from each forest. For write we do round-robin. I'm curious what measures you have in mind to identify "hotter" hosts. To detect unresponsive hosts and remove them from the list, you could take a look at HostAvailabilityListener. And, of course, I'm glad to answer any questions via phone or IM.

from corb2.

bbandlamudi commented on August 18, 2024

@sammefford - For 'hotter' hosts, I am thinking of something relatively simple to implement. For example, we can keep track of the active sessions/threads for each host and/or the average time of the currently active sessions/threads. Then, during host allocation, if we detect that a host is not responding fast enough, corb can skip that host and send the request to the next one. If the hot host has freed up by the next pass, corb will resume sending requests to it.
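
A rough sketch of the active-session bookkeeping described above. `LeastBusyHosts` and its methods are hypothetical names, and it tracks only the count of active requests per host (not average response time), which is a simplifying assumption:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical tracker: hands out the host with the fewest in-flight requests. */
final class LeastBusyHosts {
    // Insertion-ordered so ties are broken by configuration order.
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastBusyHosts(String... hosts) {
        for (String host : hosts) {
            active.put(host, 0);
        }
    }

    /** Picks the least-loaded host and counts one more active request against it. */
    synchronized String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> entry : active.entrySet()) {
            if (best == null || entry.getValue() < active.get(best)) {
                best = entry.getKey();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no hosts configured");
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    /** Call when a request finishes, whether it succeeded or failed. */
    synchronized void release(String host) {
        active.computeIfPresent(host, (h, n) -> Math.max(0, n - 1));
    }
}
```

A slow host holds on to its active requests longer, so it naturally receives fewer new ones until it frees up.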

from corb2.

mattsunsjf commented on August 18, 2024

Starting in 9.0-ea4, mlcp accepts a list of host names and restricts its connections to the MarkLogic instances on those hosts. At startup, mlcp makes a pass over all provided hosts to filter out invalid (non-resolvable) ones. While the job is running (ingesting, importing, copying, etc.), mlcp handles connection failures and disk failover. Depending on the failure type, mlcp may retry, blacklist the host, try backups (a replica's host, for example), or abort the job.
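
As an illustration only (this is not mlcp's code), the up-front pass that drops non-resolvable hosts could look roughly like this. `HostFilter`, `resolves`, and `keepValid` are hypothetical names; the check is injected as a `Predicate` so it can be swapped out in tests:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper that filters a host list before a job starts. */
final class HostFilter {
    /** True if the host name resolves via DNS (or is a valid IP literal). */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Keeps only the hosts accepted by the given check, preserving order. */
    static List<String> keepValid(List<String> hosts, Predicate<String> isValid) {
        List<String> valid = new ArrayList<>();
        for (String host : hosts) {
            if (isValid.test(host)) {
                valid.add(host);
            }
        }
        return valid;
    }
}
```

In a real job the call would be something like `HostFilter.keepValid(hosts, HostFilter::resolves)`.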

from corb2.

rjkennedy98 commented on August 18, 2024

Similar to accepting multiple hosts, it would be useful to be able to run a job where CORB runs the uris query on each host and then runs the transform for those uris only against that same host. I know Mark Plontick wrote scripts to essentially do this manually, but it may be a useful feature for corb if you really need to scale a job. Your uris query would then only search against the forests on that host.

from corb2.

bbandlamudi commented on August 18, 2024

@rjkennedy98 - I am a little confused about the scenario you described. How is it different from running separate corb jobs against each host? Also, how does it work with an e-node/d-node config? That is, how do we know that the same uris query run against different hosts (in the same environment) does not return duplicate uris, or does the uris query ensure that it returns uris only from the forests on its own host?

FYI - I initially thought this was related to in-forest eval, which we can do in the transform (similar to how https://github.com/bradmann/marklogic-spawnlib does it) and which we did for a couple of jobs in our projects.

from corb2.

rjkennedy98 commented on August 18, 2024

@bbandlamudi Yes, it is very similar to spawnlib. With spawnlib you can specify a forest for the uris and a forest to eval with the transform as a batch performance improvement. MLCP obviously has fast-load which I believe does the same thing - inserting docs into forests on a particular host.

Functionally it's the same as running separate corb jobs against each host, where each host processes only the uris on that host. I'm not sure it's possible to do this in a split e/d node environment, but I haven't thought about it for long.

Perhaps the actual use case is too narrow, but I thought I'd bring it up since it's related to running against multiple hosts.

from corb2.

bbandlamudi commented on August 18, 2024

@rjkennedy98 - Thanks! Corb does not have much control over the xqueries. I think it is harder to implement host affinity for uris+transform while keeping the rest of the features and configuration intact, given that we also want to run the uris queries in parallel. It may be easier to do in-forest evaluation in the transform itself.

from corb2.

sammefford commented on August 18, 2024

I'm no Corb expert, but perhaps there's a way to add an option to get the list of forest ids, then pass each forest id to its own get-uris xquery call. With that option on, Corb could enable the xquery to run forest-specific queries, thus getting only the uris in that forest. I don't think it requires in-forest evaluation, since cts:search, search:search, and cts:uris all accept a forestId parameter or option. Then, if subsequent transforms were sent to the same host, there might be some efficiency gain.

I'm not saying this is what Corb should do, like I say, I don't know it well enough. But this is what DMSDK does, and I can share any lessons learned if it's helpful. One big difference is that DMSDK doesn't use xquery to get the uris, it uses structured queries (generated in java, and eventually sent to search:parse).

from corb2.

bbandlamudi commented on August 18, 2024

FYI - I am going to start working on this task soon.

from corb2.

rjkennedy98 commented on August 18, 2024

@bbandlamudi @sammefford When you get into corb jobs that have to run on 1 billion+ documents, it becomes necessary to split up the jobs across forests. We have a concept of a "super corb", which essentially divvies up corb jobs by host and provides the forest id to the uris.xqy and the transform.xqy. We are currently using a shell script to do this, but it would be ideal if this were a configuration option.

from corb2.

damonfeldman commented on August 18, 2024

Possible dup of #61. DMSDK does some balancing OOTB/supported, so if that gets done this should be handled.

from corb2.

bbandlamudi commented on August 18, 2024

Changes have been committed to https://github.com/marklogic-community/corb2/tree/load-balance. Connection handling is now delegated to a different class. This is a more involved change than initially thought, especially due to fail-over and the number of test cases that need to be adjusted. There are a few issues around fail-over that are pending additional testing.

from corb2.

bbandlamudi commented on August 18, 2024

Damon, we still don't know if/when we are going to integrate Corb with DMSDK. As discussed earlier, we can consider not adding new features to corb once we finish the tasks that are currently in development and close to completion.

from corb2.

jmakeig commented on August 18, 2024

> This is a more involved change than initially thought especially due to fail-over and number of test cases that need to be adjusted.

That’s why it’s generally preferred that the lower-level tools take care of this. I’d like to understand if DMSDK would meet your requirements—even if we decide that it’s not worth the effort to ever port CoRB to use DMSDK. CoRB is the type of tool that DMSDK was designed for.

from corb2.

bbandlamudi commented on August 18, 2024

Sure. We will take a look at DMSDK. The actual code to handle this isn't difficult. :) It's mostly about figuring out what to do. Most of my time was spent resolving test failures, as this change affected a lot of test cases and I wanted to write it in a 'cleaner' way.

from corb2.

bbandlamudi commented on August 18, 2024

As Mads suggested, it is probably a good idea to do the dmsdk integration as a new repo corb-dmsdk. We can deprecate corb2 if corb2-dmsdk is received well.

from corb2.

bbandlamudi commented on August 18, 2024

I believe the primary selling points of corb2 are that it is very lightweight, easy to use and, of course, fast. We are generally very strict about this and do not want to risk adoption even as we act on feature requests from developers. At least on our project, clients generally prefer not to use another batch tool if corb2 can do the same job with much better performance and less development overhead. I think DMSDK integration under a new repo may be the safest approach. We would like to wrap up ongoing development tasks before we start looking into DMSDK integration.

from corb2.

mpheckel commented on August 18, 2024

Hi Rob, interesting write up. Wonder if what will come out of this is that DMSDK would be refactored to use XCC? Good work-

from corb2.

sammefford commented on August 18, 2024

Interesting side note: the ML performance team did a comparison of Java Client API and XCC and found that Java Client API was faster for ingestion. Ingestion is the use case Rob wrote up. So we have two different tests with conflicting outcomes. I'm looking forward to diving in to tease out the differences and make some improvement recommendations one way or the other.

from corb2.

jkerr5 commented on August 18, 2024

Did this get abandoned? I have a user that has a system that could use the built-in load balancing capability.

from corb2.

bbandlamudi commented on August 18, 2024

Hello @jkerr5 - This is not abandoned, but delayed due to other commitments. The development is mostly done pending testing and merge conflicts, which is taking a bit more time than expected. Hopefully, we can pick up on this soon. Sorry for the delay!

from corb2.

jmakeig commented on August 18, 2024

@jkerr5, if you can move to MarkLogic 9, the Data Movement SDK does this with aplomb, including failover. You’ll have to write some plumbing code to get a command-line interface, but that’s pretty straightforward.

from corb2.

jkerr5 commented on August 18, 2024

Thanks for the update. @jmakeig unfortunately ML9 is not an option.

from corb2.

bbandlamudi commented on August 18, 2024

Preliminary implementation is available under feature branch https://github.com/marklogic-community/corb2/tree/load-balance-2. We need to work on additional test cases as well as full performance testing. This may take a bit of time.

The default implementation offers three load-balancing options: Round-Robin (the default), Random, and Load (picks the host with the fewest active connections). If a connection to a specific host fails, the request is automatically retried with the next available host. A failed host is retried 3 times by default (configurable) at 60-second intervals (configurable) before being removed from the pool. Please note that the retry feature already existed in corb; it has been expanded to support multiple hosts.
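
To make the retry-then-remove behaviour concrete, here is a simplified sketch. It is not the code on the feature branch: `RetryingHostPool` and its methods are hypothetical, it interprets "tried 3 times" as "removed on the third reported failure", it only covers round-robin selection, and the clock is passed in as a parameter so the behaviour is easy to test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pool: round-robin with a retry interval and removal after repeated failures. */
final class RetryingHostPool {
    private final List<String> hosts;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> retryAfterMillis = new HashMap<>();
    private final int maxFailures;          // e.g. 3
    private final long retryIntervalMillis; // e.g. 60_000
    private int cursor;

    RetryingHostPool(List<String> hosts, int maxFailures, long retryIntervalMillis) {
        this.hosts = new ArrayList<>(hosts); // defensive, mutable copy
        this.maxFailures = maxFailures;
        this.retryIntervalMillis = retryIntervalMillis;
    }

    /** Next host in round-robin order, skipping hosts still waiting out their retry interval. */
    synchronized String next(long nowMillis) {
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get(cursor % hosts.size());
            cursor = (cursor + 1) % hosts.size();
            Long retryAt = retryAfterMillis.get(host);
            if (retryAt == null || nowMillis >= retryAt) {
                return host;
            }
        }
        throw new IllegalStateException("no host available");
    }

    /** Records a failure; the host is removed once it reaches maxFailures. */
    synchronized void reportFailure(String host, long nowMillis) {
        int count = failures.merge(host, 1, Integer::sum);
        if (count >= maxFailures) {
            hosts.remove(host);
            failures.remove(host);
            retryAfterMillis.remove(host);
        } else {
            retryAfterMillis.put(host, nowMillis + retryIntervalMillis);
        }
    }

    /** A successful request clears the host's failure history. */
    synchronized void reportSuccess(String host) {
        failures.remove(host);
        retryAfterMillis.remove(host);
    }
}
```

A caller would typically pass `System.currentTimeMillis()` as the timestamp, call `next(...)` again after reporting a failure to retry the request on another host, and call `reportSuccess(...)` once a request completes.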

from corb2.

bbandlamudi commented on August 18, 2024

The development is complete and staged for the next release, along with other changes going in at the same time. The code is available at https://github.com/marklogic-community/corb2/tree/development. Since this change affected quite a few files, we might do further tests, outside the normal JUnit coverage, to confirm that nothing is broken.

from corb2.
