Comments (23)

hector-sab commented on September 15, 2024

Another workaround is to shell out with the sh package and do something like:

import sh
sh.gsutil("-m cp data gs://mybucket/data".split())
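
If pulling in the sh dependency isn't desirable, the same shell-out can be done with the standard library; a minimal sketch, assuming gsutil is on the PATH:

import subprocess

# check=True raises CalledProcessError if gsutil exits with a non-zero status.
subprocess.run(["gsutil", "-m", "cp", "data", "gs://mybucket/data"], check=True)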


1fish2 commented on September 15, 2024

@hector-sab Shelling out to gsutil has significant limitations:

  • Uploads don't create the empty "placeholder" directory-like entries that gcsfuse needs in order to run at a reasonable speed (roughly a 10x difference; see the sketch after this list).
  • Parallel downloads put all the files into the same directory, except when using -r to recursively download a directory, so I can't give it a list of files from multiple directories to place into multiple directories.
  • Error handling and recovery: can one reliably parse its stdout to tell which transfers succeeded?
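
As an aside on the first point, those placeholder entries can be created with the client library itself. A minimal sketch, assuming the zero-byte, trailing-slash convention that gcsfuse recognizes (bucket and prefix names are illustrative):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mybucket")

# A zero-byte object whose name ends in "/" acts as a directory placeholder.
bucket.blob("data/subdir/").upload_from_string(b"")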


tswast commented on September 15, 2024

Can we reopen this as a feature request? It would be useful to have helpers for recursive copies similar to gsutil.


tseaver commented on September 15, 2024

@tswast, @frankyn Perhaps we could resolve this by adding a more robust sample to the docs, based on the one @dhermes outlined above. E.g.:

"""Sample: show parallel uploads to a bucket.

"""
from concurrent import futures
import os
import sys

from google.api_core import exceptions
from google.cloud.storage import Client

def ensure_bucket(client, bucket_name):
    try:
        return client.get_bucket(bucket_name)
    except exceptions.NotFound:
        return client.create_bucket(bucket_name)

def get_files(path):
    for directory, _, filenames in os.walk(path):
        for filename in filenames:
            yield os.path.join(directory, filename)

def upload_serial(bucket, filename):
    blob = bucket.blob(filename)
    blob.upload_from_filename(filename)

def main(argv):
    bucket_name, dirnames = argv[0], argv[1:]
    client = Client()
    bucket = ensure_bucket(client, bucket_name)

    pool = futures.ThreadPoolExecutor(10)
    uploads = []

    for dirname in dirnames:
        for filename in get_files(dirname):
            upload = pool.submit(upload_serial, bucket, filename)
            uploads.append(upload)

    # Block until every upload future has finished.
    futures.wait(uploads)

if __name__ == '__main__':
    main(sys.argv[1:])
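
Assuming the sample is saved as, say, upload_sample.py (the filename is illustrative), it would be invoked with a bucket name followed by one or more directories:

python upload_sample.py my-bucket ./templates ./static

Note that futures.wait only blocks until the futures finish; to surface an exception raised during an upload, you still need to call result() on each future (or iterate futures.as_completed).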


benbro commented on September 15, 2024

I think the following also guarantees that all tasks complete.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(upload, files)
# Iterating the results re-raises any exception from a failed upload.
for res in results:
    pass


wyardley commented on September 15, 2024

Using a workaround similar to the above, we're getting a bunch of these warnings:
urllib3.connectionpool(WARNING): Connection pool is full, discarding connection: storage.googleapis.com
even with fairly modest concurrency:

pool = futures.ThreadPoolExecutor(max_workers=5)

(With the default, which is 5x the number of cores, we were getting even more of them.)

Have other folks seen that? Since the calls to urllib3 are being made in the library, is there any hook to increase the connection pool size or otherwise resolve this issue?

Given that gsutil has an option for this, I would echo that it would be very convenient to have this implemented within the library itself.
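
For what it's worth, one workaround I've seen for the pool warning is to mount a larger urllib3 pool on the client's underlying requests session. A rough sketch; note that it relies on the private _http attribute, so it may break between releases:

import requests.adapters
from google.cloud import storage

client = storage.Client()

# Size the connection pool to match (or exceed) the number of worker threads.
adapter = requests.adapters.HTTPAdapter(pool_connections=32, pool_maxsize=32)
client._http.mount("https://", adapter)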


theacodes commented on September 15, 2024

> assuming that either each blob's upload uses a separate HTTP transport object, or that if they share one, it's thread-safe.

We use requests so it shares a thread-safe connection pool. :)


wyardley commented on September 15, 2024

Hi --

Code like the following (based pretty closely on the example above, only for downloading files) was resulting in missing files a good amount of the time (with a single thread it works reliably, and it fails more often as the number of threads increases):

import os
import warnings
from concurrent import futures

from google.cloud import storage

def get_templates(download_folder):

    # filter warnings about using default credentials
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        storage_client = storage.Client()

        # The "folder" where the files you want to download are
        bucket = storage_client.get_bucket('somebucket')
        blobs = bucket.list_blobs()

        pool = futures.ThreadPoolExecutor(max_workers=5)
        downloads = []

        # Iterate over the blobs one by one (each page fetch is an API call)
        for blob in blobs:
            download = pool.submit(_download_files, download_folder, bucket, blob)
            downloads.append(download)

        futures.wait(downloads)


def _download_files(download_folder, bucket, blob):
    # Have tried pre-creating these as well
    name = blob.name
    if not name.endswith("/"):
        subdir = os.path.dirname(name)

        subdir_path = '{}/{}'.format(download_folder, subdir)
        if not os.path.exists(subdir_path):
            os.makedirs(subdir_path)

        # The name includes the sub-directory path
        file_path = '{}/{}'.format(download_folder, name)
        blob.download_to_filename(file_path)

Even after following some suggestions from a helpful person on the GCP community Slack, and using futures.as_completed(), it still takes me about 11 seconds to copy a directory structure with 41 relatively small Jinja templates from GCS.

The same thing with, e.g., gsutil -m cp:

- [41/41 files][ 24.5 KiB/ 24.5 KiB] 100% Done                                  
Operation completed over 41 objects/24.5 KiB.                                    
gsutil -m cp -r gs://xxx /tmp  2.55s user 1.06s system 131% cpu 2.736 total


wyardley commented on September 15, 2024

Yes, in the end the urllib3 warnings were a red herring (though I'm pretty sure I was able to get them with fewer than 10 threads). A solutions engineer on the GCP Slack gave me an example that fixes the missing-files issue.

Either way, I do think this is functionality that should be built into the library.


dhermes commented on September 15, 2024

@bencaine1 You can do this as follows:

import os
import threading

def get_files(path):
    for directory, _, filenames in os.walk(path):
        for filename in filenames:
            yield os.path.join(directory, filename)


def upload_serial(bucket, filename):
    blob = bucket.blob(filename)
    blob.upload_from_filename(filename)

def upload_parallel(bucket, path):
    threads = []
    for filename in get_files(path):
        # You'll probably want something a bit more deliberate here,
        # e.g. a thread pool
        thread = threading.Thread(target=upload_serial, args=(bucket, filename))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
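
For completeness, a hypothetical invocation of the helpers above (the bucket name is illustrative):

from google.cloud import storage

client = storage.Client()
upload_parallel(client.bucket("my-bucket"), "data")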

I am going to close this since the feature is "supported" in this way. I'm happy to continue discussing / reconsider if you'd like.


houglum commented on September 15, 2024

Thanks, @dhermes.

Extra note, for context: the -m flag in gsutil will spawn multiple processes, and multiple threads per process, to perform the individual copies in parallel. The solution @dhermes mentioned above is roughly equivalent, assuming that either each blob's upload uses a separate HTTP transport object or, if they share one, that it's thread-safe. (Gsutil uses httplib2 for its transport, which isn't thread-safe, so we create an individual http object for each thread to use.)


tseaver commented on September 15, 2024

@frankyn I'll let you prioritize this one among other GCS feature requests.


frankyn commented on September 15, 2024

Acking


ayserarmiti commented on September 15, 2024

@tswast, is there any update on this feature request? Thanks.


tswast commented on September 15, 2024

@ayserarmiti none yet. I have a meeting with some of the other Python client maintainers next week to determine priority / design of useful features such as this.


frankyn commented on September 15, 2024

Hi folks,

PR googleapis/google-cloud-python#8483 attempted to resolve this feature request, but it is not within the scope of what we can work on at this time. Our priority is fixing reliability bugs in our clients before taking on feature requests.

Additionally, this feature in particular requires a design review that has not yet been prioritized.
Thank you for your patience. We will revisit this feature request in the future.


benbro commented on September 15, 2024

Related:
https://issuetracker.google.com/issues/111828021


benbro commented on September 15, 2024

@wyardley I only start getting the urllib3 warning when the thread pool size is above 10 workers.
Even then, I think it's only a warning and the same request will be retried: https://stackoverflow.com/a/53765527


benbro commented on September 15, 2024

I agree that it should be part of the library.
What fixed the missing files issue?


wyardley commented on September 15, 2024

It seemed to have to do with using futures.wait(). I adapted a variation that @domZippilli kindly threw together and that’s worked relatively well, though it’s not quite as simple as the examples above. Maybe he’d be willing to gist it here.


domZippilli commented on September 15, 2024

Of course! Here's the gist. Usual disclaimer: this is offered as-is, with no guarantee that it will work or even that it won't hurt you; use at your own risk.

This was about 30 days ago, so my memory is a bit foggy, but as I recall the key is to collect the futures (in this case the downloads list) and then check them with the as_completed function. That ensures the program waits for every Future to return, since it must block on as_completed until they are all done, and then call future.result() on each. My intuition about the first implementation is that some of the futures were not completing before the program exited, causing the missing files.

Then again, I cannot tell you why that would happen given the pydoc for futures.wait(), which is basically an abstraction over what I implemented here. So there's probably something I'm missing, but I've always found this approach works better.
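
Since the gist link itself wasn't captured here, a minimal sketch of that pattern, reusing the _download_files helper from earlier in the thread (illustrative, not the actual gist):

from concurrent import futures

def download_all(bucket, download_folder, max_workers=5):
    with futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        downloads = [
            pool.submit(_download_files, download_folder, bucket, blob)
            for blob in bucket.list_blobs()
        ]
        for future in futures.as_completed(downloads):
            # result() re-raises any worker exception, so failed transfers
            # surface here instead of being silently dropped.
            future.result()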


wyardley commented on September 15, 2024

I had also tried some simpler suggestions for fixing the above (including, if memory serves, returning something from _download_files()), and they didn't seem to work either.


domZippilli commented on September 15, 2024

It should! Executor's map function rules. And I believe the iterator yields the result of each future, so that is a succinct way of doing the same.
