Code Monkey home page Code Monkey logo

Comments (17)

moderakh avatar moderakh commented on July 25, 2024

@chetanmeh seems there is a bug in readFeed for multi partition collection. For now please use the query api as you suggested yourself which should work fine. We will fix readFeed.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh Thanks for confirming the alternative approach. Would use the SQL approach.

Any guidance on how to speed this up. If I understand the flow a single query would result in "N" number of parallel calls for all "N" partitions where each such call would fetch result in batch of 100. Is that understanding correct? Somehow I see only downloads ~ 200 docs/sec

from azure-cosmosdb-java.

moderakh avatar moderakh commented on July 25, 2024

@chetanmeh do you mean the execution of query is slow?

If so, can you please provide the sample code for how you are doing query? what are the options, etc?

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

The code can be found here. It uses the default settings for AsyncDocumentClient. The FeedOptions have enableCrossPartitionQuery enabled. Further it uses Akka Stream to treat query result as a reactive source and then processes and emits the json to a file

As this is a rarely used operation I can tweak settings even if they lead to higher resource consumption.

from azure-cosmosdb-java.

moderakh avatar moderakh commented on July 25, 2024

Hi @chetanmeh can you please try setting maxDegreeOfParallelism to a positive number and see if that helps.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh thanks for the pointer. Would give that a try. So far based on this answer it appears that sdk would use that many parallel request. However looking at code usage of FeedOptions#getMaxDegreeOfParallelism I just see x-ms-documentdb-query-parallelizecrosspartitionquery header being set to true but the actual value is not being used. Probably missing some flow here by just looking at code ...

Would try now by changing the code to see if it gives benefit.

from azure-cosmosdb-java.

moderakh avatar moderakh commented on July 25, 2024

@chetanmeh just setting x-ms-documentdb-query-parallelizecrosspartitionquery will give you the query execution boost.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh After digging into the flow further it appears that currently ParallelDocumentQueryExecutionContext is using concat operator to merge multiple observables which per this doc perform sequentially

It simply concatenates two sequences. Once the first sequence completes, the second sequence is subscribed to and its values are passed on through to the result sequence

public Observable<FeedResponse<T>> drainAsync(int maxPageSize) {
List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream()
.map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList());
return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
}

Replacing this with merge doubled throughput for my code from ~160 doc/s to ~300 doc/s and bottleneck was allocated RSU.

Index: sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java	(revision cc314c1cfda23cd3135bf12ae9682b0905b9c715)
+++ sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java	(date 1534175443000)
@@ -155,7 +155,7 @@
     public Observable<FeedResponse<T>> drainAsync(int maxPageSize) {
         List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream()
                 .map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList());
-        return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
+        return Observable.merge(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
     }
 
     @Override

I think to make full use of parallel query the code should switch to merge operator instead of concat.

Would such a change makes sense?

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh ping

from azure-cosmosdb-java.

moderakh avatar moderakh commented on July 25, 2024

@chetanmeh sorry for the delay.

There is one cavity in using Observable.merge(.) for this use case: it doesn't respect order.
When no explicit "ORDERBY" is specified in the query, we would like to return the results in the same order as partitions (we want the results to be returned in a deterministic order).

I think we need something like this:
http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int-

or this one which has support for backpressure:
http://reactivex.io/RxJava/javadoc/io/reactivex/Flowable.html#concat-org.reactivestreams.Publisher-int-

These are in rxjava2, If you can find a way to make this work for rxjava1, your PR is welcomed.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

we want the results to be returned in a deterministic order

@moderakh Is this a documented behaviour which needs to be supported for compatibility? As in general when query does not have any ORDERBY specified then order of document is not deterministic. Some system return in some "document order" which probably maps to partition order but thats mostly non deterministic and application should not rely on that.

Trying to enforce this order would always require some trade off. May be we have this as a FeedOption setting for new behaviour to be used. As the throughput gain is pretty good with proposed change and for some cases having no inherent order is fine.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh And thoughts on previous comment?

from azure-cosmosdb-java.

moderakh avatar moderakh commented on July 25, 2024

my apologies for delayed response @chetanmeh
All our other sdks go with this predefined order. So when there is no explicit orderby, all sdks returns the results in the order of partitions. a predefined order also would help on resuming work.

If we were on RxJava2 this would be fairly easy. for rxjava1 we should build it. A PR for doing similar to what http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int- is doing is welcomed 👍

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh Based on linked docs it appears that it would still not perform a concurrent read.

Concat waits to subscribe to each additional Observable that you pass to it until the previous Observable completes.

From what I understand enforcing any partition ordering would require that observables from each partition should be consumed completely. For example if we have 5 partitions P1-P5 and upon query we create 5 observables from those 5 partitions O1-O5. Then to ensure that documents are emitted in partition order first O1 needs to be consumed fully and then only reads from O2 can be performed.

In case of explicit "ORDER BY" reads can be done in parallel from O1-O5 in batches of page size (100 per default) and then results would be merged based on sort criteria.

So to get maximum throughput for non "ORDER BY" case we would need to have an option to remove this partition based ordering. Some apps may be fine to not have deterministic order and would thus gain from higher throughput.

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh Any thoughts on previous comment?

from azure-cosmosdb-java.

chetanmeh avatar chetanmeh commented on July 25, 2024

@moderakh Ping ^^

from azure-cosmosdb-java.

christopheranderson avatar christopheranderson commented on July 25, 2024

This is fixed since 2.4.0. https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-async-java#240

from azure-cosmosdb-java.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.