
Comments (31)

rjrudin avatar rjrudin commented on August 18, 2024 2

When I first heard of DMSDK, I was hoping corb3 would be a bit of a rewrite with DMSDK at the core and all the corb command-line options still present. But as noted here - marklogic/java-client-api#779 - I think there's the hangup with XCC generally being faster than the REST API, which DMSDK depends on.

from corb2.

bbandlamudi avatar bbandlamudi commented on August 18, 2024

This is quite interesting and can be useful. It is basically a poor man's load balancer for Corb. I have thought about it earlier, but didn't think it would be worth the effort as everyone uses some kind of load balancer in the production environment. We should consider implementing it.


hansenmc avatar hansenmc commented on August 18, 2024

It doesn't look like it would be that difficult to implement.
We would need to:


jmakeig avatar jmakeig commented on August 18, 2024

This is a good idea, especially if you’re restricted to certain hosts only through a load balancer. Please coordinate with @sammefford and @mattsunsjf on a design/gotchas. They’ve both worked on this problem for DMSDK and mlcp, respectively.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

We should also keep an eye on the occasional long-running threads. If a particular host is less responsive, corb shouldn't keep bombarding it with more requests - we occasionally see hosts becoming hotter in our project (esp. with other load). So, the code should try to check active connections to hosts while balancing the load instead of using simple round robin. Also, corb should detect unresponsive hosts and remove them from the list. We should definitely coordinate with @sammefford and @mattsunsjf as they probably have more insight into these issues.


jmakeig avatar jmakeig commented on August 18, 2024

We should definitely coordinate with @sammefford and @mattsunsjf as they probably have more insight into these issues.

Yes. Please.


sammefford avatar sammefford commented on August 18, 2024

We haven't attempted to detect "hotter" hosts in DMSDK. For query we talk to all forests and get just the results from each forest. For write we do round-robin. I'm curious what measures you have in mind to identify "hotter" hosts. To detect unresponsive hosts and remove them from the list, you could take a look at HostAvailabilityListener. And, of course, I'm glad to answer any questions via phone or IM.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

@sammefford - For 'hotter' hosts, I am thinking something relatively simple to implement. For example, we can keep track of active sessions/threads to each host and/or average time of currently active sessions/threads. So, during the host allocation, if we detect that a host is not responding fast enough, then corb can skip the host and send the request to the next host. If the hot host gets freed up in the next pass, corb will resume sending the requests.
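
A minimal Java sketch of that idea, assuming a per-host cap on active requests is a good enough signal for "hot". The class `HostLoadTracker`, its methods, and the `maxActivePerHost` threshold are hypothetical names for illustration and are not part of CoRB; average response time could be tracked in the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not CoRB code): counts in-flight requests per host
// and skips hosts that are already at the configured limit.
class HostLoadTracker {
    private final List<String> hosts;
    private final Map<String, Integer> active = new LinkedHashMap<>();
    private final int maxActivePerHost; // assumed threshold for a "hot" host
    private int next = 0;               // round-robin cursor

    HostLoadTracker(List<String> hosts, int maxActivePerHost) {
        this.hosts = new ArrayList<>(hosts);
        this.maxActivePerHost = maxActivePerHost;
        for (String h : this.hosts) {
            active.put(h, 0);
        }
    }

    // Round-robin allocation that passes over hosts at the limit.
    synchronized String acquire() {
        for (int i = 0; i < hosts.size(); i++) {
            String candidate = hosts.get((next + i) % hosts.size());
            if (active.get(candidate) < maxActivePerHost) {
                next = (next + i + 1) % hosts.size();
                active.merge(candidate, 1, Integer::sum);
                return candidate;
            }
        }
        // Every host is busy: fall back to plain round-robin.
        String fallback = hosts.get(next);
        next = (next + 1) % hosts.size();
        active.merge(fallback, 1, Integer::sum);
        return fallback;
    }

    // Call when a request finishes, so the host can be picked again.
    synchronized void release(String host) {
        active.computeIfPresent(host, (h, n) -> n - 1);
    }

    synchronized int activeCount(String host) {
        return active.get(host);
    }
}
```

Each worker thread would call `acquire()` before sending a request and `release(host)` in a `finally` block afterwards, so a host that stops returning responses naturally drops out of the rotation until its requests complete.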


mattsunsjf avatar mattsunsjf commented on August 18, 2024

Starting in 9.0-ea4, mlcp accepts a list of host names and restricts its connections with ML to instances on those hosts. mlcp takes a pass over all provided hosts at the beginning to filter out invalid (non-resolvable) hosts. While it is doing the job (ingesting, importing, copying, etc.), mlcp handles scenarios where a connection fails or disk failover happens. Depending on the failure type, mlcp may retry, blacklist the host, try backups (a replica's host, for example) or abort the job.
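
The up-front filtering pass could look roughly like the following Java sketch. This is not mlcp's code: `HostFilter`, `resolves`, and `keepValid` are hypothetical names, and the only assumption about validity is that a host name must resolve through `InetAddress.getByName`.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper (not mlcp code): drops host names that fail a validity check.
class HostFilter {
    // Default check: the host name resolves to an address.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Keeps only non-blank hosts accepted by the supplied check, preserving order.
    static List<String> keepValid(List<String> hosts, Predicate<String> isValid) {
        List<String> valid = new ArrayList<>();
        for (String h : hosts) {
            String trimmed = h.trim();
            if (!trimmed.isEmpty() && isValid.test(trimmed)) {
                valid.add(trimmed);
            }
        }
        return valid;
    }
}
```

A caller would typically run `HostFilter.keepValid(hosts, HostFilter::resolves)` once at startup; the check is passed in as a predicate so it can be swapped for something stricter, such as a test connection.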


rjkennedy98 avatar rjkennedy98 commented on August 18, 2024

Similar to accepting multiple hosts, it would be useful to be able to run a job where CORB would run the uris query for each host and then run the transform only against that same host. I know Mark Plontick wrote scripts to essentially do this manually, but it may be a useful feature for corb if you really need to scale a job. Your uris query would then only search against the forests on that host.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

@rjkennedy98 - I am a little confused about the scenario you listed. How is it different from running separate corb jobs against each host? Also, how does it work in the case of an e-node/d-node config, i.e. how do we know that the same uris query against a different host (in the same environment) does not return duplicate uris, or does the uris query ensure that it returns the uris only from the forests it hosts?

FYI - I initially thought this was related to in-forest eval, which we can do in the transform (similar to how https://github.com/bradmann/marklogic-spawnlib does it) and which we did for a couple of jobs in our projects.


rjkennedy98 avatar rjkennedy98 commented on August 18, 2024

@bbandlamudi Yes, it is very similar to spawnlib. With spawnlib you can specify a forest for the uris and a forest to eval with the transform as a batch performance improvement. MLCP obviously has fast-load which I believe does the same thing - inserting docs into forests on a particular host.

Functionally it's the same as running separate corb jobs against each host where each host processes only uris on that host. I'm not sure it's possible to do it in a split e/d node environment, but I haven't thought about it too long.

Perhaps the actual use case is too narrow, but I thought I'd bring it up seeing as it's related to running against multiple hosts.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

@rjkennedy98 - Thanks! Corb does not have much control over the xqueries. I think it is harder to implement host affinity for uris+transform while keeping the rest of the features and configuration intact, and we want to run the uris queries in parallel. It may be easier to do in-forest evaluation in the transform itself.


sammefford avatar sammefford commented on August 18, 2024

I'm no Corb expert, but perhaps there's a way to add an option to get the list of forest ids, then pass each forest id to one get-uris xquery call. That way, if that option is on, Corb could enable the xquery to run forest-specific queries, thus getting only the uris in that forest. I don't think it requires in-forest evaluation since cts:search, search:search, and cts:uris all accept a forestId parameter or option. Then if subsequent transforms were sent to the same host, there might be some efficiency.

I'm not saying this is what Corb should do, like I say, I don't know it well enough. But this is what DMSDK does, and I can share any lessons learned if it's helpful. One big difference is that DMSDK doesn't use xquery to get the uris, it uses structured queries (generated in java, and eventually sent to search:parse).


bbandlamudi avatar bbandlamudi commented on August 18, 2024

FYI - I am going to start working on this task soon.


rjkennedy98 avatar rjkennedy98 commented on August 18, 2024

@bbandlamudi @sammefford When you get into corb jobs that have to run on 1 billion+ documents, it becomes necessary to split up the jobs across forests. We have a concept of a "super corb" which essentially divvies up corb jobs by host and provides the forest id to the uris.xqy and the transform.xqy. We are currently using a shell script to do this, but it would be ideal if this were a configuration option.


damonfeldman avatar damonfeldman commented on August 18, 2024

Possible dup of #61. DMSDK does some balancing OOTB/supported, so if that gets done this should be handled.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

Changes have been committed to https://github.com/marklogic-community/corb2/tree/load-balance. Connection handling is now delegated to a different class. This is a more involved change than initially thought, especially due to fail-over and the number of test cases that need to be adjusted. There are a few issues around fail-over pending some additional testing.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

Damon, we still don't know if/when we are going to integrate Corb with DMSDK. As discussed earlier, we can consider not adding new features into corb after we flush out tasks currently in development and close to finishing.


jmakeig avatar jmakeig commented on August 18, 2024

This is a more involved change than initially thought especially due to fail-over and number of test cases that need to be adjusted.

That’s why it’s generally preferred that the lower-level tools take care of this. I’d like to understand if DMSDK would meet your requirements—even if we decide that it’s not worth the effort to ever port CoRB to use DMSDK. CoRB is the type of tool that DMSDK was designed for.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

Sure. We will take a look at DMSDK. The actual code to handle this isn't difficult. :) It's mostly about figuring out what to do, and most of my time was spent resolving test failures, as this change affected a lot of test cases and I wanted to write it in a 'cleaner' way.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

As Mads suggested, it is probably a good idea to do the dmsdk integration as a new repo corb-dmsdk. We can deprecate corb2 if corb2-dmsdk is received well.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

I believe the primary selling points for corb2 are that it is very lightweight, easy to use and, of course, fast. We are generally very strict about this and do not want to risk adoption even as we act on feature requests from developers. At least on our project, clients generally prefer not to use another batch tool if corb2 can do the same with much better performance and less development overhead. I think DMSDK integration under a new repo may be the safest approach. We would like to wrap up ongoing development tasks before we start looking into DMSDK integration.


mpheckel avatar mpheckel commented on August 18, 2024


sammefford avatar sammefford commented on August 18, 2024

Interesting side-note: the ML performance team did a comparison of Java Client API and XCC and found that Java Client API was faster for ingestion. Ingestion is the use case Rob wrote up. So we have two different tests with conflicting outcomes. I'm looking forward to diving in to tease out the differences and make some improvement recommendations one way or the other.


jkerr5 avatar jkerr5 commented on August 18, 2024

Did this get abandoned? I have a user that has a system that could use the built-in load balancing capability.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

Hello @jkerr5 - This is not abandoned, but delayed due to other commitments. The development is mostly done pending testing and merge conflicts, which is taking a bit more time than expected. Hopefully, we can pick up on this soon. Sorry for the delay!


jmakeig avatar jmakeig commented on August 18, 2024

@jkerr5, if you can move to MarkLogic 9, the Data Movement SDK does this with aplomb, including failover. You’ll have to write some plumbing code to get a command-line interface, but that’s pretty straightforward.


jkerr5 avatar jkerr5 commented on August 18, 2024

Thanks for the update. @jmakeig unfortunately ML9 is not an option.


bbandlamudi avatar bbandlamudi commented on August 18, 2024

A preliminary implementation is available under the feature branch https://github.com/marklogic-community/corb2/tree/load-balance-2. We need to work on additional test cases as well as full performance testing. This may take a bit of time.

The default implementation allows three options to balance load, i.e. Round-Robin (default), Random and Load (picks the host with the least number of active connections). If a connection to a specific host fails, the request is automatically retried with the next available host. A failed host is tried 3 times by default (configurable) at 60 sec (configurable) intervals before being removed from the pool. Please note that the retry feature already exists in corb, but has been expanded to support multiple hosts.
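
For illustration only, the three policies and the retry-then-remove behaviour described above can be sketched in Java as below. This is not the code from the feature branch: `BalancePolicy`, `HostPool`, and every method name are hypothetical, the current time is passed in explicitly to keep the example deterministic, and benching a failed host for one interval per failure is just one reading of "tried 3 times at 60 sec intervals".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

enum BalancePolicy { ROUND_ROBIN, RANDOM, LOAD }

// Hypothetical sketch (not the corb2 implementation) of a host pool with
// pluggable balancing and removal of hosts that keep failing.
class HostPool {
    private final List<String> hosts;
    private final Map<String, Integer> activeConnections = new HashMap<>();
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> retryAfter = new HashMap<>();
    private final BalancePolicy policy;
    private final int maxRetries;           // e.g. 3
    private final long retryIntervalMillis; // e.g. 60_000
    private final Random random = new Random();
    private int cursor = 0;

    HostPool(List<String> hosts, BalancePolicy policy, int maxRetries, long retryIntervalMillis) {
        this.hosts = new ArrayList<>(hosts);
        this.policy = policy;
        this.maxRetries = maxRetries;
        this.retryIntervalMillis = retryIntervalMillis;
        for (String h : this.hosts) {
            activeConnections.put(h, 0);
        }
    }

    // Chooses a host among those not currently waiting out a retry interval.
    synchronized String pick(long nowMillis) {
        List<String> usable = new ArrayList<>();
        for (String h : hosts) {
            if (retryAfter.getOrDefault(h, 0L) <= nowMillis) {
                usable.add(h);
            }
        }
        if (usable.isEmpty()) {
            throw new IllegalStateException("no host available");
        }
        String chosen;
        switch (policy) {
            case RANDOM:
                chosen = usable.get(random.nextInt(usable.size()));
                break;
            case LOAD: // fewest active connections wins; ties go to the first host
                chosen = usable.get(0);
                for (String h : usable) {
                    if (activeConnections.get(h) < activeConnections.get(chosen)) {
                        chosen = h;
                    }
                }
                break;
            default: // ROUND_ROBIN
                chosen = usable.get(cursor % usable.size());
                cursor = (cursor + 1) % usable.size();
        }
        activeConnections.merge(chosen, 1, Integer::sum);
        return chosen;
    }

    synchronized void release(String host) {
        activeConnections.computeIfPresent(host, (h, n) -> n - 1);
    }

    // Benches the host for one interval; drops it after maxRetries failures.
    synchronized void reportFailure(String host, long nowMillis) {
        int count = failures.merge(host, 1, Integer::sum);
        if (count >= maxRetries) {
            hosts.remove(host);
            activeConnections.remove(host);
            retryAfter.remove(host);
        } else {
            retryAfter.put(host, nowMillis + retryIntervalMillis);
        }
    }

    // A successful request clears the host's failure history.
    synchronized void reportSuccess(String host) {
        failures.remove(host);
        retryAfter.remove(host);
    }

    synchronized List<String> hosts() {
        return new ArrayList<>(hosts);
    }
}
```

A caller that gets a connection error would call `reportFailure(host, now)` and then `pick(now)` again, which is how the request ends up retried on the next available host.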


bbandlamudi avatar bbandlamudi commented on August 18, 2024

The development is complete and staged for the next release along with other changes going in at the same time. The code is available at https://github.com/marklogic-community/corb2/tree/development. Since this change affected quite a few files, we might do further tests, i.e. outside the normal JUnit coverage, to confirm nothing is broken.

