Comments (11)
Please include a minimal but complete console application that reproduces the problem; that will help us be certain that we are testing exactly the same scenario as you.
A few general things to consider though:
- How many cores does your CPU have? You are performing I/O operations, so in general you will probably benefit from more threads than cores, but that doesn't mean that the more threads, the better. How did you arrive at the 20-threads-per-user number? Have you tested with fewer threads? More threads?
- In general, the more threads you have, the more CPU usage you should expect; you are increasing the number of threads precisely so that the CPU is not idle while performing I/O, right?
- The libraries in this repo support async versions of all service operations. Have you considered rewriting your application so that it starts Tasks instead of Threads? The .NET runtime will then manage threads and schedule tasks very efficiently.
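To illustrate that last point, here is a minimal sketch of the Tasks-instead-of-Threads approach. The I/O is simulated with Task.Delay in place of actual Drive calls, so the example is self-contained; no thread is blocked while a delay (or a real HTTP request) is in flight.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class TaskDemo
{
    static async Task Main()
    {
        int completed = 0;

        // Start 100 tasks that each "perform I/O" (simulated with Task.Delay).
        // The runtime schedules these on the thread pool; while an await is
        // pending, the thread is returned to the pool instead of blocking.
        var tasks = Enumerable.Range(0, 100).Select(async i =>
        {
            await Task.Delay(50); // stand-in for an async Drive API call
            Interlocked.Increment(ref completed);
        });

        await Task.WhenAll(tasks);
        Console.WriteLine($"Completed {completed} simulated downloads.");
    }
}
```

With real calls, each task body would await the library's async method (e.g. `DownloadAsync`) instead of the delay.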
from google-api-dotnet-client.
GDriveDownloadTest.zip
Above is the sample code.
Please do the following, before executing the application.
- Service account credentials are expected in C:\gcreds.json
- Create a folder GDriveDownloadTest under C:\
- Replace user1 through user5 in Program.cs with actual SMTP addresses. In my case, each had 20k to 60k files.
My CPU has 4 cores.
I tested with 10 threads, where CPU usage averaged about 10% (though with occasional spikes to 25%).
We are experimenting with different numbers of threads. Since requests are not getting throttled with 20 threads, we decided to go with that number.
For this sample program, I have used Parallel.ForEach, but my actual code uses a .NET channel with calls to the async versions of the API. Even there, we see CPU spikes, so I wrote a sample program to narrow it down.
Since requests are not getting throttled with 20 threads, we decided to go with that number.
What exactly do you mean by "not getting throttled"?
For this sample program, I have used Parallel.ForEach, but my actual code uses a .NET channel with calls to the async versions of the API. Even there, we see CPU spikes, so I wrote a sample program to narrow it down.
Honestly, those two different approaches to writing your code will possibly give you entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A using approach B.
And what CPU usage threshold would you consider "normal"? I'm not that surprised to see 10%-25% usage with 100 threads making HTTP requests and downloading content.
I will take a look at your code and the library code later today and see if anything seems out of the ordinary. I'll report back with my findings. If you could answer the questions above, that'd be helpful.
One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.
One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.
I tried disabling gzip, but it did not affect CPU usage much; average usage reduced by only 2%.
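For reference, a sketch of that change, assuming the GZipEnabled property on BaseClientService.Initializer is the relevant switch (it is not shown in the sample code above):

```csharp
// Service constructed with response compression turned off, so media
// responses are not gzip-decompressed on the client.
var service = new DriveService(new BaseClientService.Initializer
{
    HttpClientInitializer = credential,
    GZipEnabled = false,
});
```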
What exactly do you mean by "not getting throttled"?
I meant that I am not getting 403 (rateLimitExceeded) errors at any time. That implies that requests are within the quota limit, doesn't it?
Honestly, those two different approaches to writing your code will possibly give you entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A using approach B.
Yes, I agree. But I wanted to keep my sample program simple, with only the download logic, just to check the usage. Since the sample program itself takes more CPU, my more complicated production logic will obviously take more, as I also get metadata for all the files and then download them in parallel.
And what CPU usage threshold would you consider "normal"? I'm not that surprised to see 10%-25% usage with 100 threads making HTTP requests and downloading content.
In our production environment, multiple such processes may be running at any point in time. If one process takes 25%, then multiple processes in production can take 100% of the CPU.
This whole exercise is to fine-tune the numbers (number of users per process, number of threads per user, number of parallel processes at any time, etc.). So, do you have any suggestions based on your testing?
Thanks
This whole exercise is to fine-tune the numbers (number of users per process, number of threads per user, number of parallel processes at any time, etc.). So, do you have any suggestions based on your testing?
I haven't had time to look yet, I'll report back here when I know more.
OK, so I've looked at it some. First, the application you sent is not really a minimal reproduction. There's too much going on there, and definitely some of your code may be having an impact on performance. For instance:
- You are using the sync versions of the methods, which means that threads block waiting for I/O. Parallelizing on top of that has no effect.
- You are filtering the list client side, where you could probably use Files.List instead of Changes.List, as it doesn't seem you are doing anything with the change information itself, just using it to download the file. Files.List has a Q field, and one of the fields you can include there is modifiedTime.
- But what's more, you call ToList on the result of the query, and then iterate over the whole thing again.
- You are calling Task.Result in a couple of places. That blocks threads that could be doing something else.
I didn't run your code; it wouldn't have been useful for determining whether there was an issue with the Google.Apis.Drive.v3 library.
What I did was run the following code with batchSize set to 1, 20, 50, and 100. I used Visual Studio's Performance Profiler, and here are the results:
Downloading files in parallel batches of 1.
Downloaded 1000 out of 1000 in 19.403887711666666 minutes.
Peak of 8%, avg of 3%.
Downloading files in parallel batches of 20.
Downloaded 1000 out of 1000 in 2.0180067916666666 minutes.
Peak of 11%, avg of 5%.
Downloading files in parallel batches of 50.
Downloaded 1000 out of 1000 in 1.9617455133333332 minutes.
Peak of 12%, avg of 6%.
Downloading files in parallel batches of 100.
Downloaded 1000 out of 1000 in 1.6262795116666666 minutes.
Peak of 14%, avg of 6%.
None of this is formal benchmarking, but the results seem very reasonable to me.
Some notes about my code:
- I didn't try with 4 different users, as that won't make a difference to CPU performance. I uploaded 1000 files of random sizes between 1MB and 3MB to my own drive. The commented-out code at the end does that.
- I'm not using Files.Export; instead I'm using Files.Get, just because it was more convenient and it wouldn't affect performance. But do note that you don't need to use the export URL directly (that's meant to be used by a browser). The code for exporting a file is simpler, and very similar to the Files.Get code (full example here):
var exportRequest = service.Files.Export(file.Id, "<content-type>");
exportRequest.MediaDownloader.ChunkSize = 80 * 1024;
var progress = await exportRequest.DownloadAsync(fileStream);
This is my code:
using Google.Apis.Auth.OAuth2;
using Google.Apis.Download;
using Google.Apis.Drive.v3;
using Google.Apis.Services;
using System.Diagnostics;
using DriveFile = Google.Apis.Drive.v3.Data.File;
var clientSecretsPath = Environment.GetEnvironmentVariable("TEST_CLIENT_SECRET_FILENAME");
var clientSecrets = await GoogleClientSecrets.FromFileAsync(clientSecretsPath);
var folderId = "<the-folder-id-in-drive-to-store-files>";
string contentType = "application/octet-stream";
UserCredential credential = await GoogleWebAuthorizationBroker.AuthorizeAsync(
clientSecrets.Secrets,
new[] { DriveService.ScopeConstants.Drive },
"user-drive-download",
CancellationToken.None
);
var service = new DriveService(new BaseClientService.Initializer()
{
HttpClientInitializer = credential,
});
string? nextPageToken = null;
var listRequest = service.Files.List();
listRequest.IncludeItemsFromAllDrives = false;
listRequest.Q = $"trashed = false and '{folderId}' in parents";
listRequest.Fields = "nextPageToken, files(id, name)";
listRequest.PageSize = 100;
int batchSize = 100;
int downloaded = 0;
Console.WriteLine($"Downloading files in parallel batches of {batchSize}.");
Stopwatch stopwatch = Stopwatch.StartNew();
do
{
listRequest.PageToken = nextPageToken;
var listResponse = await listRequest.ExecuteAsync();
nextPageToken = listResponse.NextPageToken;
var fileBatches = listResponse.Files.Chunk(batchSize);
foreach (var batch in fileBatches)
{
var downloaders = batch.Select(async file =>
{
using var fileStream = File.Create(@$"downloaded\{file.Name}");
var getRequest = service.Files.Get(file.Id);
getRequest.MediaDownloader.ChunkSize = 80 * 1024;
var progress = await getRequest.DownloadAsync(fileStream);
if (progress.Status == DownloadStatus.Failed)
{
Console.WriteLine($"Failed downloading {file.Id} with message {progress.Exception?.Message}.");
}
else
{
System.Threading.Interlocked.Increment(ref downloaded); // atomic; a bare ++ would race across the parallel download tasks
}
});
await Task.WhenAll(downloaders);
}
}
while (nextPageToken is not null);
stopwatch.Stop();
Console.WriteLine($"Downloaded {downloaded} out of 1000 in {stopwatch.Elapsed.TotalMinutes} minutes.");
//var random = new Random();
//var oneMbInBytes = 1024 * 1024;
//var threeMbInBytes = 3 * oneMbInBytes + 1; // plus one because the top boundary of the range is not inclusive.
//var folderIds = new List<string> { folderId };
//Console.WriteLine("Uploading 1000 files of 1MB to 3MB");
//for (int i = 0; i < 100; i++)
//{
// IEnumerable<Task> uploaders = Enumerable.Range(0, 10).Select(j =>
// {
// var stream = GenerateData();
// var mediaUploader = service.Files.Create(
// new DriveFile
// {
// Name = $"test_file_{10 * i + j}",
// Parents = folderIds
// },
// stream,
// contentType);
// return mediaUploader.UploadAsync();
// });
// await Task.WhenAll(uploaders);
//}
//MemoryStream GenerateData()
//{
// int size = random.Next(oneMbInBytes, threeMbInBytes);
// byte[] data = new byte[size];
// random.NextBytes(data);
// return new MemoryStream(data);
//}
Bottom line, I don't think there's anything wrong with Google.Apis.Drive.v3; instead, there are a few aspects of your code that are possibly impacting performance.
My advice is that you benchmark your whole code: start by using the Performance Profiler to find hot paths, remove those, etc. Then move on to more formal benchmarking so you can tweak parameters (batchSize, in my code) to achieve the best balance between throughput and CPU usage. I would strongly advise you to use the async versions of the library methods, instead of trying to control threads or anything else through parallelization. Then, as is done in my code, you only need to decide how many active tasks you want at any given time.
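As a sketch of that last point, capping the number of active tasks can be done with a SemaphoreSlim gate instead of batching. This is a self-contained illustration with simulated downloads in place of the Drive calls; maxConcurrency is a made-up tuning knob analogous to batchSize above:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    static async Task Main()
    {
        const int maxConcurrency = 20;          // the tuning knob
        using var gate = new SemaphoreSlim(maxConcurrency);
        int active = 0, violations = 0, done = 0;

        var tasks = Enumerable.Range(0, 200).Select(async i =>
        {
            await gate.WaitAsync();             // waits without blocking a thread
            try
            {
                // If the cap ever failed, active would exceed maxConcurrency here.
                if (Interlocked.Increment(ref active) > maxConcurrency)
                    Interlocked.Increment(ref violations);

                await Task.Delay(10);           // stand-in for an async download
                Interlocked.Increment(ref done);
            }
            finally
            {
                Interlocked.Decrement(ref active);
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
        Console.WriteLine($"Done {done}, cap violations {violations}.");
    }
}
```

Unlike the batch approach above, the gate starts a new download as soon as any one finishes, so there is no pause at batch boundaries.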
I'll leave this issue open for a few days, waiting for your acknowledgement, but unless/until you find hard evidence that there's a significant performance issue with the libraries, we won't be looking into this further.
Thanks, Amanda, for your inputs.
Though my sample code uses Task.Result, which blocks the thread, my actual production code uses async throughout, since it is using a .NET Channel.
One difference I do note between my approach and yours is that you start 100 parallel downloads and wait for all of them to complete before you get the next page and start the next set of parallel downloads.
Whereas my production code works on already-scanned items, so there is no wait time between these parallel downloads.
Also, what is the configuration of your machine (CPU and memory) on which you profiled the test code?
Anyway, since your test confirms there is no performance issue with the Drive API, I will take it up and check more on my side.
Again, thank you so much for confirming this.
One difference I do note between my approach and yours is that you start 100 parallel downloads and wait for all of them to complete before you get the next page and start the next set of parallel downloads.
Whereas my production code works on already-scanned items, so there is no wait time between these parallel downloads.
This means that you almost certainly end up with more than 100 parallel downloads, right? So, with your pages being of size 1000 (at least in the code you shared), you potentially have 1000 parallel downloads?
The machine I tried this on had, same as yours, a 4-core CPU and 16GB of memory. It was idle at the time of running these tests.
Actually, I have a max of 20 threads only. The way it works is:
- A single producer thread does the scan and keeps adding items to a bounded channel (size = 100).
- 20 consumer threads read items from the channel, download them, and continue the loop.
(The moment a download thread takes an item, the scan adds more, so the queue stays full until the end.)
So, at any time, I have only a max of 20 threads doing the download. But with the above design, there is no pause in downloads between scan pages; it keeps on downloading till the end.
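A rough sketch of that design, using System.Threading.Channels with simulated downloads in place of the real Drive calls (the counts here are illustrative, not from the production code):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelPipelineDemo
{
    static async Task Main()
    {
        // Bounded channel: the producer waits (asynchronously) when 100 items
        // are queued, so the scan never runs far ahead of the downloads.
        var channel = Channel.CreateBounded<int>(100);
        int downloaded = 0;

        // Single producer: stands in for the scan enumerating file ids.
        var producer = Task.Run(async () =>
        {
            for (int i = 0; i < 500; i++)
                await channel.Writer.WriteAsync(i);
            channel.Writer.Complete();
        });

        // 20 consumers: each reads ids and "downloads" until the channel closes.
        var consumers = Enumerable.Range(0, 20).Select(_ => Task.Run(async () =>
        {
            await foreach (var id in channel.Reader.ReadAllAsync())
            {
                await Task.Delay(1); // stand-in for an async download
                Interlocked.Increment(ref downloaded);
            }
        })).ToList();

        await producer;
        await Task.WhenAll(consumers);
        Console.WriteLine($"Downloaded {downloaded} items.");
    }
}
```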
Also, I have another question regarding your suggestion to use the Export method. It has a size limitation of 10MB, doesn't it? I think that is the reason ExportLinks was used for downloading. Do you see any issue with that?
The moment a download thread takes an item, the scan adds more, so the queue stays full until the end
Yes, I see what you mean: whereas I have at most 20 parallel downloads per batch, you always have 20 threads downloading. I still wouldn't think that's the reason for the difference in performance, though. I would still look first at some of the aspects I mentioned in my previous comment. In particular, I think that shifting from managing your own threads (with the sync or async versions of the methods) to relying on the scheduler to execute tasks will make a difference.
Also, I have another question regarding your suggestion to use the Export method. It has a size limitation of 10MB, doesn't it? I think that is the reason ExportLinks was used for downloading. Do you see any issue with that?
This is a question better suited for the Drive API team through their support channels. I don't know if there's a problem with using the export link URL directly. What I can say is that the export link URL is different from the URL that calling Files.Export(...).Download(...) would use. See, for instance, for the same document:
The Export operation uses
https://www.googleapis.com/drive/v3/files/<redacted_file_id>/export?mimeType=application%2Fvnd.openxmlformats-officedocument.wordprocessingml.document
whereas the export link for that MIME type is:
https://docs.google.com/feeds/download/documents/export/Export?id=<redacted_file_id>&resourcekey=<redacted_resource_key>&exportFormat=docx