chunk's Introduction

Chunk

Chunk is a download tool for slow and unstable servers.

Usage

CLI

Install it with go install github.com/cuducos/chunk/cmd/chunk@latest, then:

$ chunk <URLs>

Use --help for detailed instructions.

API

The Download method returns a channel that emits DownloadStatus values. This channel is closed once all downloads are finished, but the user is in charge of handling any errors it reports.

Simplest use case

d := chunk.DefaultDownloader()
ch := d.Download(urls)

Customizing some options

d := chunk.DefaultDownloader()
d.MaxRetries = 42
ch := d.Download(urls)

Customizing everything

d := chunk.Downloader{...}
ch := d.Download(urls)
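
Consuming the channel

Whichever option is used, results and errors arrive through the returned channel; a minimal sketch of reading it (the DownloadStatus field names Error and URL are assumptions, not the confirmed API):

for status := range ch {
    if status.Error != nil { // assumed field name for a failed download
        log.Printf("error downloading %s: %v", status.URL, status.Error)
    }
}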

How?

It uses HTTP range requests, retries per HTTP request (not per file), avoids re-downloading the same content range, and supports wait times to give servers room to recover.

Download using HTTP range requests

In order to complete downloads from slow and unstable servers, the download is done in “chunks” using HTTP range requests. This avoids relying on long-standing HTTP connections and makes it predictable how long is too long to wait for a response.
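
For reference, a range request is a regular GET with a Range header; a minimal sketch in Go (not chunk's actual implementation):

req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
    return err
}
req.Header.Set("Range", "bytes=0-1023") // first 1024 bytes; both ends are inclusive
resp, err := http.DefaultClient.Do(req)
if err != nil {
    return err
}
defer resp.Body.Close()
// a server that honors the range replies with 206 Partial Content
if resp.StatusCode != http.StatusPartialContent {
    return fmt.Errorf("server ignored the range request: %s", resp.Status)
}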

Retries by chunk, not by file

In order to be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (content range), not the whole file.

Control of which chunks are already downloaded

In order to avoid re-starting from the beginning in case of non-handled errors, chunk knows which ranges from each file were already downloaded; so, when restarted, it only downloads what is really needed to complete the downloads.

Detect server failures and give it a break

In order to avoid unnecessary stress on the server, chunk relies not only on HTTP responses but also on other signs that the connection is stale; it can recover from that, giving the server some time to recover from stress.
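
Combined, the retry-per-chunk and wait-between-retries behaviors look roughly like this sketch built on the retry-go package the project uses (downloadChunk is a hypothetical helper standing in for the real per-chunk download):

err := retry.Do(
    func() error {
        return downloadChunk(ctx, u, start, end) // hypothetical: fetch and persist one content range
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries), // gives the server room to recover
)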

Why?

The idea for the project emerged as it was difficult for Minha Receita to handle the download of 37 files that add up to only approximately 5 GB. Most of the download solutions out there (e.g. got) seem to be designed for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.

chunk's People

Contributors

cuducos, danielfireman, devils2ndself, makon, vmesel

chunk's Issues

Create the parallel download semaphore/workers limit

Implement a maximum number of parallel downloads per domain/subdomain, as sketched after this list:

  1. Identify each distinct subdomain across all the URLs
  2. Create a fixed number of channels, one per subdomain, e.g. map[string]chan
  3. Spin up a worker reading from each of these channels
  4. Process the downloads in these workers
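
A minimal sketch of that structure (subdomain and download are hypothetical helpers, used only to illustrate the fan-out):

queues := make(map[string]chan string) // one queue of URLs per subdomain
var wg sync.WaitGroup
for _, u := range urls {
    host := subdomain(u) // hypothetical helper extracting the subdomain
    if _, ok := queues[host]; !ok {
        q := make(chan string, len(urls)) // buffered so enqueueing never blocks
        queues[host] = q
        wg.Add(1)
        go func() { // one worker per subdomain caps parallelism per server
            defer wg.Done()
            for u := range q {
                download(u) // hypothetical per-URL download
            }
        }()
    }
    queues[host] <- u
}
for _, q := range queues {
    close(q) // lets each worker drain its queue and exit
}
wg.Wait()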

Two chunk instances downloading the same file

Do we care about the same file being downloaded simultaneously by two chunk instances? Asking because I believe this can be a pre-release feature.

Example:

$ ./chunk HTTP://a.b/c &
$ ./chunk HTTP://a.b/c &

The result of this sequence of operations is unknown.

We could deal with it after #8 by augmenting the infrastructure already in place.

cc/ @cuducos

Use retry when getting file sizes

Here (chunk/main.go, line 222 in a27be85):

t, err := d.getDownloadSize(ctx, u) // TODO: retry

We could use retry.Do(…) in a similar way to how we use it for the download (chunk/main.go, lines 162 to 173 in a27be85):

err := retry.Do(
    func() error {
        b, err := d.downloadFileWithTimeout(ctx, u)
        if err != nil {
            return err
        }
        ch <- b
        return nil
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries),
)
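
Applied to the size request, that could look like this sketch (same retry options; variables as in the snippets above, assuming getDownloadSize returns an int64):

var t int64
err := retry.Do(
    func() error {
        size, err := d.getDownloadSize(ctx, u)
        if err != nil {
            return err
        }
        t = size
        return nil
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries),
)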

Create a function to calculate all content ranges for any given file size

This function can be agnostic of context and scope, something as simple as func (d *Downloader) (t int64) []chunk, where t is the total size of the file in bytes and chunk can be:

type chunk struct {
    start uint64
    end uint64
}

These should be the values used in the HTTP request header, e.g. Range: bytes=0-42 (these indexes are inclusive on both ends).

Optionally, chunk might have helper methods to return its size and its value formatted for the Range header.
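
A minimal sketch of such a function using the chunk struct above, assuming the Downloader exposes an int64 ChunkSize field (the method name chunks is illustrative):

// rangeHeader formats the chunk for the Range request header,
// e.g. "bytes=0-42" (both ends inclusive).
func (c chunk) rangeHeader() string {
    return fmt.Sprintf("bytes=%d-%d", c.start, c.end)
}

// chunks splits a file of t bytes into ranges of at most d.ChunkSize bytes.
func (d *Downloader) chunks(t int64) []chunk {
    var cs []chunk
    for start := int64(0); start < t; start += d.ChunkSize {
        end := start + d.ChunkSize - 1
        if end >= t {
            end = t - 1 // the last chunk may be shorter
        }
        cs = append(cs, chunk{uint64(start), uint64(end)})
    }
    return cs
}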

Fix downloading of binary (?) files

It looks like our simple tests of downloading string-based contents from an HTTP server are OK, but downloading a binary such as a ZIP archive seems to result in a corrupted downloaded file.

See #28 for a failing test case.

Error on `go install` (following the README)

$ docker run --rm -it golang:1.19-bullseye /bin/bash
root@6ebcdedd1767:/go# go install github.com/cuducos/chunk@latest
go: downloading github.com/cuducos/chunk v1.0.0
go: downloading github.com/avast/retry-go v3.0.0+incompatible
package github.com/cuducos/chunk is not a main package

Fix test and/or `getDownloadSize`

Ideally, I think the idea was that getDownloadSize should not return 0, nil because getting the file size is not an end in itself; it is a means to determine how many chunks we need to download using content-range HTTP requests (see the sketch at the end of this issue).

But if we do that, there's a test failing, as mentioned in #19 (comment) (thanks @devils2ndself).

We need to:

  1. Understand why this test is failing
  2. Address it properly, meaning: if the test is wrong, fix the test itself; if the logic of getDownloadSize is wrong, fix the function itself
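
For context, a common way to get the size without downloading the body is a HEAD request; a sketch of what getDownloadSize could do (not necessarily the current implementation):

func (d *Downloader) getDownloadSize(ctx context.Context, u string) (int64, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodHead, u, nil)
    if err != nil {
        return 0, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    if resp.ContentLength <= 0 { // unknown (-1) or empty: we cannot plan the chunks
        return 0, fmt.Errorf("could not determine the size of %s", u)
    }
    return resp.ContentLength, nil
}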

Create an enhanced CLI through which we can customize the default options

Something like this, probably using Cobra:

$ chunk --directory data \
        --timeoutPerChunk 1m \
        --maxParallelDownloadsPerServer 4 \
        --maxRetriesPerChunk 13 \
        --chunkSize 1024 \
        --waitBetweenRetries 30s [URLs]

Or:

$ chunk -d data -t 1m -p 4 -r 13 -s 1024 -w 30s [URLs]

The --directory option is not in the Downloader struct yet, check #13.
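
A minimal sketch of wiring one of those flags with Cobra (TimeoutPerChunk as a Downloader field is an assumption mirroring the flag name):

var timeout time.Duration
cmd := &cobra.Command{
    Use:   "chunk [URLs]",
    Short: "Download tool for slow and unstable servers",
    RunE: func(cmd *cobra.Command, args []string) error {
        d := chunk.DefaultDownloader()
        d.TimeoutPerChunk = timeout // assumed field name
        // ... set the remaining options, then consume d.Download(args)
        return nil
    },
}
cmd.Flags().DurationVarP(&timeout, "timeoutPerChunk", "t", time.Minute, "timeout for each chunk")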

Make Chunk directory configurable

#38 added ~/.chunk as the default directory for saving progress files. There's a TODO to make it configurable:

chunk/progress.go, lines 16 to 28 in 02999cd:

// get the chunk directory under user's home directory
// TODO: make it configurable (maybe an envvar?)
func getChunkDirectory() (string, error) {
    u, err := user.Current()
    if err != nil {
        return "", fmt.Errorf("could not get current user: %w", err)
    }
    d := filepath.Join(u.HomeDir, defaultChunkDir)
    if err := os.MkdirAll(d, 0755); err != nil {
        return "", fmt.Errorf("could not create chunk's directory %s: %w", d, err)
    }
    return d, nil
}

Maybe reading it from an environment variable like CHUNK_DIR?
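
That could be as small as checking the variable before falling back to the home directory; a sketch keeping the rest of getChunkDirectory unchanged:

func getChunkDirectory() (string, error) {
    d := os.Getenv("CHUNK_DIR")
    if d == "" { // fall back to ~/.chunk as in the current implementation
        u, err := user.Current()
        if err != nil {
            return "", fmt.Errorf("could not get current user: %w", err)
        }
        d = filepath.Join(u.HomeDir, defaultChunkDir)
    }
    if err := os.MkdirAll(d, 0755); err != nil {
        return "", fmt.Errorf("could not create chunk's directory %s: %w", d, err)
    }
    return d, nil
}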

Cannot continue stopped download on Windows

As described in #44:

Tried to restart download, and the following error was reported:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
(base) PS C:\Users\mauricio\chunk_teste>

Error unzipping large file downloaded with `chunk` in Windows

I'm using Windows 10, with PowerShell (with the base conda environment automatically activated).

Tried to download the biggest file (Estabelecimentos0.zip). Had the following error:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip --force-restart
Downloading 622.4MB of 878.1MB  70.88%  1.4MB/s2022/12/26 18:51:31 error downloadinf chunk #90073: error downloading https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip: All attempts fail:
#1: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#2: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#3: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#4: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#5: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
(base) PS C:\Users\mauricio\chunk_teste>

Tried to restart download, and the following error was reported:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
(base) PS C:\Users\mauricio\chunk_teste>

With the --force-restart flag the download worked, however it started from the beginning of the file. Once again, after over 500 MB downloaded, the prior timeout error occurred, and it can't be restarted without the --force-restart flag.

The ZIP file, however, is downloaded, and when I try to unzip it (using 7-Zip) it reports a data error but saves the content (a CSV file). This file cannot be loaded in pandas or even in spreadsheet software. In a text editor (Notepad++) it shows coherent data for the first lines (about 4,000,000), but after that it is clearly garbled.

With a smaller file (Empresas1.zip), it worked correctly. The file was downloaded, unzipped, and opened in pandas (4,494,859 lines).

Structure to allow stop/restart downloads

Given that a user has started a download and it was interrupted, the user can restart the download as long as Downloader.ChunkSize is unchanged.

  1. Each downloaded file can have a hidden file version to control its download status, e.g. my-file.zip would have a .my-file.zip
  2. This file holds two pieces of information: the chunk size and an array of sequential booleans, one for each chunk needed to download the file
  3. There should be a function to check whether a specific chunk (by its index in the array) was downloaded successfully or not
  4. There should be a function to check whether a specific file was downloaded successfully or not
  5. There should be a function to write the status to a file (probably using gob; see the sketch at the end of this issue)
  6. There should be a function to load an existing status file

A reference can be the prototype for that in Minha Receita.

(This functionality can be integrated into the download in a follow-up PR.)
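
A minimal sketch of that status file, following the structure described above (type and function names are illustrative):

// progress mirrors the status file: the chunk size used and one
// boolean per chunk, true once that chunk is fully downloaded.
type progress struct {
    ChunkSize int64
    Done      []bool
}

// save writes the status to the hidden file using gob.
func (p *progress) save(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(p)
}

// loadProgress reads an existing status file back.
func loadProgress(path string) (*progress, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    var p progress
    if err := gob.NewDecoder(f).Decode(&p); err != nil {
        return nil, err
    }
    return &p, nil
}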

Add target directory

Once we start a download, we must be able to tell the Downloader where to save the files. This should be passed via the CLI too (e.g. in the structure described in #12) and can have a meaningful default set to the current working directory.
