For some cases we need to read all documents from a collection. For that I tried using

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

The code can be found <a href="https://github.com/apache/incubator-openwhisk/pull/3814

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

How to read all documents from a collection about azure-cosmosdb-java HOT 17 CLOSED

azure commented on July 25, 2024

How to read all documents from a collection

from azure-cosmosdb-java.

Comments (17)

moderakh commented on July 25, 2024

@chetanmeh seems there is a bug in readFeed for multi partition collection. For now please use the query api as you suggested yourself which should work fine. We will fix readFeed.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh Thanks for confirming the alternative approach. Would use the SQL approach.

Any guidance on how to speed this up. If I understand the flow a single query would result in "N" number of parallel calls for all "N" partitions where each such call would fetch result in batch of 100. Is that understanding correct? Somehow I see only downloads ~ 200 docs/sec

from azure-cosmosdb-java.

moderakh commented on July 25, 2024

@chetanmeh do you mean the execution of query is slow?

If so, can you please provide the sample code for how you are doing query? what are the options, etc?

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

The code can be found here. It uses the default settings for AsyncDocumentClient. The FeedOptions have enableCrossPartitionQuery enabled. Further it uses Akka Stream to treat query result as a reactive source and then processes and emits the json to a file

As this is a rarely used operation I can tweak settings even if they lead to higher resource consumption.

from azure-cosmosdb-java.

moderakh commented on July 25, 2024

Hi @chetanmeh can you please try setting maxDegreeOfParallelism to a positive number and see if that helps.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh thanks for the pointer. Would give that a try. So far based on this answer it appears that sdk would use that many parallel request. However looking at code usage of FeedOptions#getMaxDegreeOfParallelism I just see x-ms-documentdb-query-parallelizecrosspartitionquery header being set to true but the actual value is not being used. Probably missing some flow here by just looking at code ...

Would try now by changing the code to see if it gives benefit.

from azure-cosmosdb-java.

moderakh commented on July 25, 2024

@chetanmeh just setting x-ms-documentdb-query-parallelizecrosspartitionquery will give you the query execution boost.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh After digging into the flow further it appears that currently ParallelDocumentQueryExecutionContext is using concat operator to merge multiple observables which per this doc perform sequentially

It simply concatenates two sequences. Once the first sequence completes, the second sequence is subscribed to and its values are passed on through to the result sequence

azure-cosmosdb-java/sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java

Lines 155 to 159 in cc314c1

    
           public Observable<FeedResponse<T>> drainAsync(int maxPageSize) { 
        
               List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream() 
        
                       .map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList()); 
        
               return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize)); 
        
           }

Replacing this with merge doubled throughput for my code from ~160 doc/s to ~300 doc/s and bottleneck was allocated RSU.

Index: sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java	(revision cc314c1cfda23cd3135bf12ae9682b0905b9c715)
+++ sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java	(date 1534175443000)
@@ -155,7 +155,7 @@
     public Observable<FeedResponse<T>> drainAsync(int maxPageSize) {
         List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream()
                 .map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList());
-        return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
+        return Observable.merge(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
     }
 
     @Override

I think to make full use of parallel query the code should switch to merge operator instead of concat.

Would such a change makes sense?

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh ping

from azure-cosmosdb-java.

moderakh commented on July 25, 2024

@chetanmeh sorry for the delay.

There is one cavity in using Observable.merge(.) for this use case: it doesn't respect order.
When no explicit "ORDERBY" is specified in the query, we would like to return the results in the same order as partitions (we want the results to be returned in a deterministic order).

I think we need something like this:
http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int-

or this one which has support for backpressure:
http://reactivex.io/RxJava/javadoc/io/reactivex/Flowable.html#concat-org.reactivestreams.Publisher-int-

These are in rxjava2, If you can find a way to make this work for rxjava1, your PR is welcomed.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

we want the results to be returned in a deterministic order

@moderakh Is this a documented behaviour which needs to be supported for compatibility? As in general when query does not have any ORDERBY specified then order of document is not deterministic. Some system return in some "document order" which probably maps to partition order but thats mostly non deterministic and application should not rely on that.

Trying to enforce this order would always require some trade off. May be we have this as a FeedOption setting for new behaviour to be used. As the throughput gain is pretty good with proposed change and for some cases having no inherent order is fine.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh And thoughts on previous comment?

from azure-cosmosdb-java.

moderakh commented on July 25, 2024

my apologies for delayed response @chetanmeh
All our other sdks go with this predefined order. So when there is no explicit orderby, all sdks returns the results in the order of partitions. a predefined order also would help on resuming work.

If we were on RxJava2 this would be fairly easy. for rxjava1 we should build it. A PR for doing similar to what http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int- is doing is welcomed 👍

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh Based on linked docs it appears that it would still not perform a concurrent read.

Concat waits to subscribe to each additional Observable that you pass to it until the previous Observable completes.

From what I understand enforcing any partition ordering would require that observables from each partition should be consumed completely. For example if we have 5 partitions P1-P5 and upon query we create 5 observables from those 5 partitions O1-O5. Then to ensure that documents are emitted in partition order first O1 needs to be consumed fully and then only reads from O2 can be performed.

In case of explicit "ORDER BY" reads can be done in parallel from O1-O5 in batches of page size (100 per default) and then results would be merged based on sort criteria.

So to get maximum throughput for non "ORDER BY" case we would need to have an option to remove this partition based ordering. Some apps may be fine to not have deterministic order and would thus gain from higher throughput.

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh Any thoughts on previous comment?

from azure-cosmosdb-java.

chetanmeh commented on July 25, 2024

@moderakh Ping ^^

from azure-cosmosdb-java.

christopheranderson commented on July 25, 2024

This is fixed since 2.4.0. https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-async-java#240

from azure-cosmosdb-java.

How to read all documents from a collection about azure-cosmosdb-java HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	public Observable<FeedResponse<T>> drainAsync(int maxPageSize) {
	List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream()
	.map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList());
	return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
	}