# Chunk
Chunk is a download tool for slow and unstable servers.
The idea of the project emerged because it was difficult for Minha Receita to handle the download of 37 files that add up to just approx. 5 GB. Most of the download solutions out there (e.g. got) seem to be prepared for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.
## Main features
### Download using HTTP range requests
In order to complete downloads from slow and unstable servers, the download is done in “chunks” using HTTP range requests. This avoids relying on long-standing HTTP connections and makes it predictable how long is too long to wait for a response.
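As a sketch of what such a range request looks like in Go (`fetchChunk` and `rangeHeader` are illustrative names, not chunk's API):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// rangeHeader formats an inclusive byte range for the HTTP Range header.
func rangeHeader(start, end int64) string {
	return fmt.Sprintf("bytes=%d-%d", start, end)
}

// fetchChunk downloads only the bytes [start, end] of url using an
// HTTP range request.
func fetchChunk(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// 206 Partial Content confirms the server honored the range.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	fmt.Println(rangeHeader(0, 1<<20-1)) // header for the first 1 MiB: bytes=0-1048575
}
```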
### Retries by chunk, not by file
In order to be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (that byte range), not the whole file.
### Control of which chunks are already downloaded
In order to avoid restarting from the beginning in case of unhandled errors, chunk keeps track of which ranges of each file were already downloaded; when restarted, it only downloads what is actually needed to complete the downloads.
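One way to sketch that control, assuming an illustrative `chunkRange` type and an in-memory set of completed ranges (the real tool would persist this between runs):

```go
package main

import "fmt"

type chunkRange struct{ start, end int64 } // inclusive byte range

// pendingChunks splits a file of the given size into fixed-size chunks
// and drops the ones already marked as downloaded, so a restarted run
// only fetches what is missing.
func pendingChunks(size, chunkSize int64, done map[chunkRange]bool) []chunkRange {
	var pending []chunkRange
	for start := int64(0); start < size; start += chunkSize {
		end := start + chunkSize - 1
		if end >= size { // the last chunk may be shorter
			end = size - 1
		}
		if r := (chunkRange{start, end}); !done[r] {
			pending = append(pending, r)
		}
	}
	return pending
}

func main() {
	done := map[chunkRange]bool{{0, 9}: true} // first chunk persisted earlier
	fmt.Println(pendingChunks(25, 10, done))  // [{10 19} {20 24}]
}
```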
### Detect server failures and give it a break
In order to avoid unnecessary stress on the server, chunk relies not only on HTTP responses but also on other signs that the connection is stale and can:
- recover from that and
- give the server some time to recover from stress.
## Tech design
### Input
- List of URLs
- Directory where to save the files
- Configuration options (each can have a default and be optional; customizing them can be a stretch goal):
  - Chunk download attempt timeout
  - Maximum parallel connections to each server
  - Max retries per chunk (must have an option for unlimited)
  - Maximum range size (chunk size)
  - Time to wait on server failure
### Prepare downloads
For each URL of the list (this can be done in parallel):
- Make sure the server accepts HTTP range requests (stretch goal)
  - Can fail if it doesn't
  - Or can fall back to a regular HTTP request to download the whole file
- Find out the total file size
- Determine all the chunks to be downloaded (each one's start and end bytes)
- Read or create a temporary control of downloaded and pending chunks
- Enqueue all the pending chunks
With all this information, show a progress bar with the total work remaining.
### Download
- Set a timeout
- Start the HTTP range request
- In case of failure or timeout, re-queue this chunk
- In case of success, send the chunk contents to a `results` channel
### Writing files
- Read the bytes from the `results` channel
- Write them to the file on disk
- Update a progress bar to give the user an idea about the status of the downloads
## Prototype
The prototype is a CLI that wraps an HTTP GET request in a 45s timeout, independent of the HTTP client's timeout, with 3 retries by default.
```console
$ go run main.go <URL>  # e.g. go run main.go https://github.com/cuducos/chunk/archive/refs/heads/main.zip
```
The API should work like this:
```go
// simple use case
d := NewDownloader()
ch := d.Download(urls)

// partial customization
d := NewDownloader()
d.MaxRetriesPerChunk = 42
ch := d.Download(urls)

// full control
d := chunk.Downloader{...}
ch := d.Download(urls)
```
The resulting channel will transmit data about each download:
```go
type DownloadStatus struct {
	URL                 string
	DownloadedFilePath  string
	FileSizeBytes       uint64
	DownloadedFileBytes uint64
	Error               error
}
```
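Consuming that channel might look like this; `progress` is an illustrative helper, not part of the proposed API:

```go
package main

import "fmt"

type DownloadStatus struct {
	URL                 string
	DownloadedFilePath  string
	FileSizeBytes       uint64
	DownloadedFileBytes uint64
	Error               error
}

// progress renders a one-line summary for a status update.
func progress(s DownloadStatus) string {
	if s.Error != nil {
		return fmt.Sprintf("%s: error: %v", s.URL, s.Error)
	}
	pct := float64(s.DownloadedFileBytes) / float64(s.FileSizeBytes) * 100
	return fmt.Sprintf("%s: %.0f%%", s.URL, pct)
}

func main() {
	s := DownloadStatus{URL: "https://example.com/f.zip", FileSizeBytes: 200, DownloadedFileBytes: 50}
	fmt.Println(progress(s)) // https://example.com/f.zip: 25%
}
```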