Comments (17)
@chetanmeh seems there is a bug in readFeed for multi partition collection. For now please use the query api as you suggested yourself which should work fine. We will fix readFeed.
from azure-cosmosdb-java.
@moderakh Thanks for confirming the alternative approach. Would use the SQL approach.
Any guidance on how to speed this up. If I understand the flow a single query would result in "N" number of parallel calls for all "N" partitions where each such call would fetch result in batch of 100. Is that understanding correct? Somehow I see only downloads ~ 200 docs/sec
from azure-cosmosdb-java.
@chetanmeh do you mean the execution of query is slow?
If so, can you please provide the sample code for how you are doing query? what are the options, etc?
from azure-cosmosdb-java.
The code can be found here. It uses the default settings for AsyncDocumentClient
. The FeedOptions have enableCrossPartitionQuery
enabled. Further it uses Akka Stream to treat query result as a reactive source and then processes and emits the json to a file
As this is a rarely used operation I can tweak settings even if they lead to higher resource consumption.
from azure-cosmosdb-java.
Hi @chetanmeh can you please try setting maxDegreeOfParallelism to a positive number and see if that helps.
from azure-cosmosdb-java.
@moderakh thanks for the pointer. Would give that a try. So far based on this answer it appears that sdk would use that many parallel request. However looking at code usage of FeedOptions#getMaxDegreeOfParallelism
I just see x-ms-documentdb-query-parallelizecrosspartitionquery
header being set to true but the actual value is not being used. Probably missing some flow here by just looking at code ...
Would try now by changing the code to see if it gives benefit.
from azure-cosmosdb-java.
@chetanmeh just setting x-ms-documentdb-query-parallelizecrosspartitionquery will give you the query execution boost.
from azure-cosmosdb-java.
@moderakh After digging into the flow further it appears that currently ParallelDocumentQueryExecutionContext
is using concat operator to merge multiple observables which per this doc perform sequentially
It simply concatenates two sequences. Once the first sequence completes, the second sequence is subscribed to and its values are passed on through to the result sequence
Replacing this with merge doubled throughput for my code from ~160 doc/s to ~300 doc/s and bottleneck was allocated RSU.
Index: sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java (revision cc314c1cfda23cd3135bf12ae9682b0905b9c715)
+++ sdk/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java (date 1534175443000)
@@ -155,7 +155,7 @@
public Observable<FeedResponse<T>> drainAsync(int maxPageSize) {
List<Observable<FeedResponse<T>>> obs = this.documentProducers.stream()
.map(dp -> dp.produceAsync().map(dpp -> dpp.pageResult)).collect(Collectors.toList());
- return Observable.concat(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
+ return Observable.merge(obs).compose(new EmptyPagesFilterTransformer<>(new RequestChargeTracker(), maxPageSize));
}
@Override
I think to make full use of parallel query the code should switch to merge
operator instead of concat
.
Would such a change makes sense?
from azure-cosmosdb-java.
@moderakh ping
from azure-cosmosdb-java.
@chetanmeh sorry for the delay.
There is one cavity in using Observable.merge(.) for this use case: it doesn't respect order.
When no explicit "ORDERBY" is specified in the query, we would like to return the results in the same order as partitions (we want the results to be returned in a deterministic order).
I think we need something like this:
http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int-
or this one which has support for backpressure:
http://reactivex.io/RxJava/javadoc/io/reactivex/Flowable.html#concat-org.reactivestreams.Publisher-int-
These are in rxjava2, If you can find a way to make this work for rxjava1, your PR is welcomed.
from azure-cosmosdb-java.
we want the results to be returned in a deterministic order
@moderakh Is this a documented behaviour which needs to be supported for compatibility? As in general when query does not have any ORDERBY specified then order of document is not deterministic. Some system return in some "document order" which probably maps to partition order but thats mostly non deterministic and application should not rely on that.
Trying to enforce this order would always require some trade off. May be we have this as a FeedOption
setting for new behaviour to be used. As the throughput gain is pretty good with proposed change and for some cases having no inherent order is fine.
from azure-cosmosdb-java.
@moderakh And thoughts on previous comment?
from azure-cosmosdb-java.
my apologies for delayed response @chetanmeh
All our other sdks go with this predefined order. So when there is no explicit orderby, all sdks returns the results in the order of partitions. a predefined order also would help on resuming work.
If we were on RxJava2 this would be fairly easy. for rxjava1 we should build it. A PR for doing similar to what http://reactivex.io/RxJava/javadoc/io/reactivex/Observable.html#concat-io.reactivex.ObservableSource-int- is doing is welcomed 👍
from azure-cosmosdb-java.
@moderakh Based on linked docs it appears that it would still not perform a concurrent read.
Concat waits to subscribe to each additional Observable that you pass to it until the previous Observable completes.
From what I understand enforcing any partition ordering would require that observables from each partition should be consumed completely. For example if we have 5 partitions P1-P5 and upon query we create 5 observables from those 5 partitions O1-O5. Then to ensure that documents are emitted in partition order first O1 needs to be consumed fully and then only reads from O2 can be performed.
In case of explicit "ORDER BY" reads can be done in parallel from O1-O5 in batches of page size (100 per default) and then results would be merged based on sort criteria.
So to get maximum throughput for non "ORDER BY" case we would need to have an option to remove this partition based ordering. Some apps may be fine to not have deterministic order and would thus gain from higher throughput.
from azure-cosmosdb-java.
@moderakh Any thoughts on previous comment?
from azure-cosmosdb-java.
@moderakh Ping ^^
from azure-cosmosdb-java.
This is fixed since 2.4.0. https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-async-java#240
from azure-cosmosdb-java.
Related Issues (20)
- In flight http request during failover not moving to other available region
- storePhysicalAddress is null - V2 Java Async SDK HOT 2
- Reliable connection configuration for Azure functions consumption plan HOT 1
- Bump to >2.4.0 yields DocumentClientException/ Request url is invalid. HOT 1
- QueryPlan should be a read request
- Add support of getDuration for Queries in Cosmos Diagnostics - V4 Java Cosmos SDK HOT 1
- FR Add an option to SQL API @Container to set auto scaling = true
- Duplicate results when DISTINCT querying
- "sortedRanges" exception when querying Unique Key IN list HOT 1
- I'm not able to see which branch is for latest version of V3 version (3.7.5) HOT 6
- Connection leak with Cosmos DB client SDK HOT 2
- Getting 404 even though cosmos has document
- Netty issue while using azure cosmos in android
- ETag Check in Cosmos SDK V4 batch
- Including "uri" in the Diagnostic logs
- 同学,您这个项目引入了119个开源组件,存在41个漏洞,辛苦升级一下
- Fail to reach global gateway HOT 1
- This repo is missing important files
- Direct Connectivity Issue from VNET SpringApps to CosmosDB HOT 2
- Java SDK not returning cumulative RU for operations with retries HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from azure-cosmosdb-java.