iadownloader

Summary

iadownloader is a tool that automatically downloads files from the Internet Archive. Given the download URL of an Internet Archive upload, it fetches all of the upload’s files, either individually or as a single compressed archive, into a configurable download location (the current working directory by default). It can also download complete collections and similar sets by parsing JSON or CSV files generated by Internet Archive’s advanced search tool.

Usage

iadownloader.py [-h] [-c] [-o OUTPUT_DIR] [-t THREADS] [-T] url

positional arguments:
  url                   URL or path to json/csv file

optional arguments:
  -h, --help            show this help message and exit
  -c, --compressed      Get the compressed archive download instead of the individual files
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to output directory
  -t THREADS, --threads THREADS
                        Number of simultaneous downloads (maximum of 10)
  -T, --torrent         Only download the torrent file if available

For basic usage, invoke iadownloader with a download URL:

python iadownloader.py https://archive.org/download/<url>

This downloads every file in the upload to the directory the script was invoked from.

Optionally specify the download location:

python iadownloader.py -o /download/path https://archive.org/download/<url>

To download the compressed archive of the upload just add the ‘-c’ flag:

python iadownloader.py -c -o /download/path https://archive.org/download/<url>

You can also specify the number of threads (up to 10):

python iadownloader.py -t 8 -o /download/path https://archive.org/download/<url>

It defaults to 4 threads if not specified.
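The options described above can be sketched with argparse. This is a minimal illustration built only from the help text in this README, not the tool's actual source. The names `build_parser`, `parse_args`, `MAX_THREADS` and `DEFAULT_THREADS` are hypothetical, and clamping an out-of-range thread count (rather than rejecting it) is an assumption.

```python
import argparse

MAX_THREADS = 10      # upper bound stated in the help text
DEFAULT_THREADS = 4   # default stated in this README


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(prog="iadownloader.py")
    parser.add_argument("url", help="URL or path to json/csv file")
    parser.add_argument("-c", "--compressed", action="store_true",
                        help="Get the compressed archive download instead "
                             "of the individual files")
    parser.add_argument("-o", "--output_dir", default=".",
                        help="Path to output directory")
    parser.add_argument("-t", "--threads", type=int, default=DEFAULT_THREADS,
                        help="Number of simultaneous downloads (maximum of 10)")
    parser.add_argument("-T", "--torrent", action="store_true",
                        help="Only download the torrent file if available")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and keep the thread count within 1..MAX_THREADS."""
    args = build_parser().parse_args(argv)
    # Assumption: out-of-range values are clamped instead of causing an error.
    args.threads = max(1, min(args.threads, MAX_THREADS))
    return args
```

With this sketch, `parse_args(["-t", "25", "<url>"])` would yield a thread count of 10, and omitting `-t` would yield 4.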

Don’t confuse a “download URL” with the URL of an individual file. Individual files are trivially downloaded through your web browser. This tool simplifies downloading all the files included in an upload on Internet Archive. Even that can be done quite easily using the web UI. Where iadownloader shines is its ability to download full collections automatically.

To download a whole collection, all files from a certain author, etc., go to Internet Archive’s advanced search tool and follow these steps:

  1. Scroll down to “Advanced Search returning JSON, XML, and more”. In the “Query” field enter collection:<name of collection> for collections, creator:<name of creator> for creators, etc. In “Field to return” select “identifier” if not already selected. Select an appropriate “Number of results” depending on the collection.
  2. Choose either JSON format or CSV format. CSV format is a bit more convenient since it prompts you to download the file immediately, while the JSON format opens a JavaScript page with embedded JSON data. Save the .csv file to a location of your choice. If you choose JSON, save the page and make sure to give it the .json extension rather than the suggested .js one.
  3. Run iadownloader.py like this:
    python iadownloader.py -o /download/path /path/to/csv-or-json-file

iadownloader then goes through every item in the collection and downloads it into the download path.
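To make the file-based workflow more concrete, here is a rough sketch of how identifiers could be read from a saved search export and turned into download URLs. It is not iadownloader's own code. The function names are hypothetical, and the file layouts are assumptions: a CSV with an `identifier` header column, and JSON (possibly wrapped in a JavaScript callback) with results under `response` and `docs`.

```python
import csv
import io
import json
from pathlib import Path

# Download URL prefix used in the examples in this README.
DOWNLOAD_BASE = "https://archive.org/download/"


def read_identifiers(path):
    """Return item identifiers from a search export saved as .csv or .json."""
    path = Path(path)
    # utf-8-sig also reads plain UTF-8 and drops a byte-order mark if present.
    text = path.read_text(encoding="utf-8-sig")
    if path.suffix.lower() == ".csv":
        # Assumed layout: header row with an "identifier" column.
        rows = csv.DictReader(io.StringIO(text))
        return [row["identifier"] for row in rows if row.get("identifier")]
    # Assumed layout: a JSON object, optionally wrapped as callback({...}).
    # Slicing from the first "{" to the last "}" strips such a wrapper.
    start, end = text.find("{"), text.rfind("}")
    data = json.loads(text[start:end + 1])
    return [doc["identifier"] for doc in data["response"]["docs"]]


def download_urls(identifiers):
    """Map each identifier to its download URL."""
    return [DOWNLOAD_BASE + identifier for identifier in identifiers]
```

Each resulting URL has the same form as the download URL passed on the command line in the earlier examples.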

Requirements

iadownloader uses requests, lxml, and tqdm to do its magic. To make sure you have them, use the included requirements.txt:

pip install -r requirements.txt

Of course, you need python and pip as well.
