chunk's Introduction

Chunk

Chunk is a download tool for slow and unstable servers.

Usage

CLI

Install it with go install github.com/cuducos/chunk/cmd/chunk@latest, then:

$ chunk <URLs>

Use --help for detailed instructions.

API

The Download method returns a channel that emits DownloadStatus values. This channel is closed once all downloads are finished, but the user is in charge of handling any errors it reports.

Simplest use case

d := chunk.DefaultDownloader()
ch := d.Download(urls)

Customizing some options

d := chunk.DefaultDownloader()
d.MaxRetries = 42
ch := d.Download(urls)

Customizing everything

d := chunk.Downloader{...}
ch := d.Download(urls)
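
Consuming the channel

Whichever option is used, results and errors arrive through the returned channel; a minimal sketch of reading it (the DownloadStatus field names Error and URL are assumptions, not the confirmed API):

for status := range ch {
    if status.Error != nil { // assumed field name for a failed download
        log.Printf("error downloading %s: %v", status.URL, status.Error)
    }
}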

How?

It uses HTTP range requests, retries per HTTP request (not per file), avoids re-downloading the same content range, and supports wait times to give servers room to recover.

Download using HTTP range requests

In order to complete downloads from slow and unstable servers, the download is done in “chunks” using HTTP range requests. This avoids relying on long-standing HTTP connections and makes it predictable how long is too long to wait for a response.
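
For reference, a range request is a regular GET with a Range header; a minimal sketch in Go (not chunk's actual implementation):

req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
    return err
}
req.Header.Set("Range", "bytes=0-1023") // first 1024 bytes; both ends are inclusive
resp, err := http.DefaultClient.Do(req)
if err != nil {
    return err
}
defer resp.Body.Close()
// a server that honors the range replies with 206 Partial Content
if resp.StatusCode != http.StatusPartialContent {
    return fmt.Errorf("server ignored the range request: %s", resp.Status)
}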

Retries by chunk, not by file

In order to be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (content range), not the whole file.

Control of which chunks are already downloaded

In order to avoid re-starting from the beginning in case of non-handled errors, chunk knows which ranges from each file were already downloaded; so, when restarted, it only downloads what is really needed to complete the downloads.

Detect server failures and give it a break

In order to avoid unnecessary stress on the server, chunk relies not only on HTTP responses but also on other signs that the connection is stale; it can recover from that, giving the server some time to recover from stress.
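
Combined, the retry-per-chunk and wait-between-retries behaviors look roughly like this sketch built on the retry-go package the project uses (downloadChunk is a hypothetical helper standing in for the real per-chunk download):

err := retry.Do(
    func() error {
        return downloadChunk(ctx, u, start, end) // hypothetical: fetch and persist one content range
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries), // gives the server room to recover
)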

Why?

The idea for the project emerged as it was difficult for Minha Receita to handle the download of 37 files that add up to only approximately 5 GB. Most of the download solutions out there (e.g. got) seem to be designed for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.

chunk's People

Contributors

cuducos, danielfireman, devils2ndself, makon, vmesel

chunk's Issues

Create the parallel download semaphore/workers limit

Implement a maximum number of parallel downloads per domain/subdomain, as sketched after this list:

  1. Identify each distinct subdomain across all the URLs
  2. Create a fixed number of channels, one per subdomain, e.g. map[string]chan
  3. Spin up a worker reading from each of these channels
  4. Process the downloads in these workers
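
A minimal sketch of that structure (subdomain and download are hypothetical helpers, used only to illustrate the fan-out):

queues := make(map[string]chan string) // one queue of URLs per subdomain
var wg sync.WaitGroup
for _, u := range urls {
    host := subdomain(u) // hypothetical helper extracting the subdomain
    if _, ok := queues[host]; !ok {
        q := make(chan string, len(urls)) // buffered so enqueueing never blocks
        queues[host] = q
        wg.Add(1)
        go func() { // one worker per subdomain caps parallelism per server
            defer wg.Done()
            for u := range q {
                download(u) // hypothetical per-URL download
            }
        }()
    }
    queues[host] <- u
}
for _, q := range queues {
    close(q) // lets each worker drain its queue and exit
}
wg.Wait()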

Two chunk instances downloading the same file

Do we care about the same file being downloaded simultaneously by two chunk instances? Asking because I believe this can be a pre-release feature.

Example:

$ ./chunk HTTP://a.b/c &
$ ./chunk HTTP://a.b/c &

The result of this sequence of operations is unknown.

We could deal with it after #8 by augmenting the infrastructure already in place.

cc/ @cuducos

Use retry when getting file sizes

Here (chunk/main.go, line 222 in a27be85):

t, err := d.getDownloadSize(ctx, u) // TODO: retry

We could use retry.Do(…) in a similar way to how we use it for the download (chunk/main.go, lines 162 to 173 in a27be85):

err := retry.Do(
    func() error {
        b, err := d.downloadFileWithTimeout(ctx, u)
        if err != nil {
            return err
        }
        ch <- b
        return nil
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries),
)
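
Applied to the size request, that could look like this sketch (same retry options; variables as in the snippets above, assuming getDownloadSize returns an int64):

var t int64
err := retry.Do(
    func() error {
        size, err := d.getDownloadSize(ctx, u)
        if err != nil {
            return err
        }
        t = size
        return nil
    },
    retry.Attempts(d.MaxRetriesPerChunk),
    retry.MaxDelay(d.WaitBetweenRetries),
)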

Create a function to calculate all content ranges for any given file size

This function can be agnostic of context and scope, something as simple as func (d *Downloader) (t int64) []chunk, where t is the total size of the file in bytes and chunk can be:

type chunk struct {
    start uint64
    end uint64
}

These should be the values used in the HTTP request header, e.g. Range: bytes=0-42 (these indexes are inclusive on both ends).

Optionally, chunk might have helper methods to return its size and its value formatted for the Range header.
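
A minimal sketch of such a function using the chunk struct above, assuming the Downloader exposes an int64 ChunkSize field (the method name chunks is illustrative):

// rangeHeader formats the chunk for the Range request header,
// e.g. "bytes=0-42" (both ends inclusive).
func (c chunk) rangeHeader() string {
    return fmt.Sprintf("bytes=%d-%d", c.start, c.end)
}

// chunks splits a file of t bytes into ranges of at most d.ChunkSize bytes.
func (d *Downloader) chunks(t int64) []chunk {
    var cs []chunk
    for start := int64(0); start < t; start += d.ChunkSize {
        end := start + d.ChunkSize - 1
        if end >= t {
            end = t - 1 // the last chunk may be shorter
        }
        cs = append(cs, chunk{uint64(start), uint64(end)})
    }
    return cs
}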

Fix downloading of binary (?) files

It looks like our simple tests of downloading string-based contents from an HTTP server are OK, but downloading a binary such as a ZIP archive seems to result in a corrupted downloaded file.

See #28 for a failing test case.

Error on `go install` (following the README)

$ docker run --rm -it golang:1.19-bullseye /bin/bash
root@6ebcdedd1767:/go# go install github.com/cuducos/chunk@latest
go: downloading github.com/cuducos/chunk v1.0.0
go: downloading github.com/avast/retry-go v3.0.0+incompatible
package github.com/cuducos/chunk is not a main package

Fix test and/or `getDownloadSize`

Ideally, I think the idea was that getDownloadSize should not return 0, nil because getting the file size is not an end in itself; it is a means to determine how many chunks we need to download using content-range HTTP requests (see the sketch at the end of this issue).

But if we do that, there's a test failing, as mentioned in #19 (comment) (thanks @devils2ndself).

We need to:

  1. Understand why this test is failing
  2. Address it properly, meaning: if the test is wrong, fix the test itself; if the logic of getDownloadSize is wrong, fix the function itself
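
For context, a common way to get the size without downloading the body is a HEAD request; a sketch of what getDownloadSize could do (not necessarily the current implementation):

func (d *Downloader) getDownloadSize(ctx context.Context, u string) (int64, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodHead, u, nil)
    if err != nil {
        return 0, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    if resp.ContentLength <= 0 { // unknown (-1) or empty: we cannot plan the chunks
        return 0, fmt.Errorf("could not determine the size of %s", u)
    }
    return resp.ContentLength, nil
}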

Create an enhanced CLI through which we can customize the default options

Something like this, probably using Cobra:

$ chunk --directory data \
        --timeoutPerChunk 1m \
        --maxParallelDownloadsPerServer 4 \
        --maxRetriesPerChunk 13 \
        --chunkSize 1024 \
        --waitBetweenRetries 30s [URLs]

Or:

$ chunk -d data -t 1m -p 4 -r 13 -s 1024 -w 30s [URLs]

The --directory option is not in the Downloader struct yet, check #13.
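
A minimal sketch of wiring one of those flags with Cobra (TimeoutPerChunk as a Downloader field is an assumption mirroring the flag name):

var timeout time.Duration
cmd := &cobra.Command{
    Use:   "chunk [URLs]",
    Short: "Download tool for slow and unstable servers",
    RunE: func(cmd *cobra.Command, args []string) error {
        d := chunk.DefaultDownloader()
        d.TimeoutPerChunk = timeout // assumed field name
        // ... set the remaining options, then consume d.Download(args)
        return nil
    },
}
cmd.Flags().DurationVarP(&timeout, "timeoutPerChunk", "t", time.Minute, "timeout for each chunk")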

Make Chunk directory configurable

#38 added ~/.chunk as the default directory for saving progress files. There's a TODO to make it configurable:

chunk/progress.go, lines 16 to 28 in 02999cd:

// get the chunk directory under user's home directory
// TODO: make it configurable (maybe an envvar?)
func getChunkDirectory() (string, error) {
    u, err := user.Current()
    if err != nil {
        return "", fmt.Errorf("could not get current user: %w", err)
    }
    d := filepath.Join(u.HomeDir, defaultChunkDir)
    if err := os.MkdirAll(d, 0755); err != nil {
        return "", fmt.Errorf("could not create chunk's directory %s: %w", d, err)
    }
    return d, nil
}

Maybe reading it from an environment variable like CHUNK_DIR?
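
That could be as small as checking the variable before falling back to the home directory; a sketch keeping the rest of getChunkDirectory unchanged:

func getChunkDirectory() (string, error) {
    d := os.Getenv("CHUNK_DIR")
    if d == "" { // fall back to ~/.chunk as in the current implementation
        u, err := user.Current()
        if err != nil {
            return "", fmt.Errorf("could not get current user: %w", err)
        }
        d = filepath.Join(u.HomeDir, defaultChunkDir)
    }
    if err := os.MkdirAll(d, 0755); err != nil {
        return "", fmt.Errorf("could not create chunk's directory %s: %w", d, err)
    }
    return d, nil
}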

Cannot continue stopped download on Windows

As described in #44:

Tried to restart download, and the following error was reported:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
(base) PS C:\Users\mauricio\chunk_teste>

Error unzipping large file downloaded with `chunk` in Windows

I'm using Windows 10, with PowerShell (with the base conda environment automatically activated).

Tried to download the biggest file (Estabelecimentos0.zip). Had the following error:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip --force-restart
Downloading 622.4MB of 878.1MB  70.88%  1.4MB/s2022/12/26 18:51:31 error downloadinf chunk #90073: error downloading https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip: All attempts fail:
#1: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#2: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#3: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#4: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
#5: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
(base) PS C:\Users\mauricio\chunk_teste>

Tried to restart download, and the following error was reported:

(base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
(base) PS C:\Users\mauricio\chunk_teste>

With the --force-restart flag the download worked, however it started from the beginning of the file. Once again, after over 500 MB downloaded, the prior timeout error occurred, and it can't be restarted without the --force-restart flag.

The ZIP file, however, is downloaded, and when I try to unzip it (using 7-Zip) it reports a data error but saves the content (a CSV file). This file cannot be loaded in pandas or even in spreadsheet software. In a text editor (Notepad++) it shows coherent data for the first lines (about 4,000,000), but after that it is clearly garbled.

With a smaller file (Empresas1.zip), it worked correctly. The file was downloaded, unzipped, and opened in pandas (4,494,859 lines).

Structure to allow stop/restart downloads

Given that a user has started a download and it was interrupted, the user can restart the download as long as Downloader.ChunkSize is unchanged.

  1. Each downloaded file can have a hidden file version to control its download status, e.g. my-file.zip would have a .my-file.zip
  2. This file holds two pieces of information: the chunk size and an array of sequential booleans, one for each chunk needed to download the file
  3. There should be a function to check whether a specific chunk (by its index in the array) was downloaded successfully or not
  4. There should be a function to check whether a specific file was downloaded successfully or not
  5. There should be a function to write the status to a file (probably using gob; see the sketch at the end of this issue)
  6. There should be a function to load an existing status file

A reference can be the prototype for that in Minha Receita.

(This functionality can be integrated into the download in a follow-up PR.)
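
A minimal sketch of that status file, following the structure described above (type and function names are illustrative):

// progress mirrors the status file: the chunk size used and one
// boolean per chunk, true once that chunk is fully downloaded.
type progress struct {
    ChunkSize int64
    Done      []bool
}

// save writes the status to the hidden file using gob.
func (p *progress) save(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(p)
}

// loadProgress reads an existing status file back.
func loadProgress(path string) (*progress, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    var p progress
    if err := gob.NewDecoder(f).Decode(&p); err != nil {
        return nil, err
    }
    return &p, nil
}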

Add target directory

Once we start a download, we must be able to tell the Downloader where to save the files. This should be passed via the CLI too (e.g. in the structure described in #12) and can have a meaningful default set to the current working directory.
