dr-boulder

Script(s) for the Data Rescue Boulder event

Assumptions

Disclaimer

  • These scripts have only been tested on a Mac. Using virtualenv and pip should give the same behavior on Windows and Linux, but you might need to tweak the code. Suggestions and pull requests are very welcome!

  • On Linux distributions where Python 2 and Python 3 coexist, use the Python 2 version.

Set things up

  • In your terminal/iTerm (Mac/Unix) or Command Prompt/Git Shell (Windows):

Clone the repo and create a python virtualenv:

git clone https://github.com/rchakra3/dr-boulder.git
cd dr-boulder
virtualenv env

Activate your virtualenv:

  • Mac/Unix:
source env/bin/activate
  • Windows:
.\env\Scripts\activate

You should now see (env) at the start of your terminal/command prompt, before the folder path.

  • Download all the requirements:
pip install -r requirements.txt

That should have everything set up in your virtualenv.

Scripts

There are currently 3 important scripts in this repository:

  1. Generate a list of all the files available on an FTP server:

    1. To run:

      python -W ignore ftp_utils/get_all_files_from_ftp_server.py --server=<server domain name or IP> --output_file=<output file name>
      
    2. This will generate a list of all the files that are available for download at a particular domain

    3. The name/IP of the server is required. If the output file is not specified, it will write to ftp_files.txt

  2. Download all the URLs listed in a file [NON FTP]:

    1. This helps download a huge list of URLs (PDFs, JSON, XML, etc.)

    2. Put the list of URLs in a file - 1 URL per line

    3. Help:

      python download_data.py -h
      
    4. To run:

      python -W ignore download_data.py --filename=<name of file specified in 2> --max_space=<max disk space to use (defaults to 5GB)> --downloads_folder=<name of folder where you want to store the data>
      
  3. Download a list of files at FTP endpoints:

    1. Same as the previous script, but for FTP files
    2. You can use the file generated by the ftp_utils/get_all_files_from_ftp_server.py script as the input file for this script or create a new file with one ftp file per line
    3. FTP downloads seem to be much slower in general, so it is best to run the script over a small number of files at a time
    4. Help:
      python ftp_utils/download_ftp_files.py -h
      
    5. Run:
      python -W ignore ftp_utils/download_ftp_files.py --filename=<name of file specified in 2> --downloads_folder=<folder where you want to save the files>
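
The FTP listing step (script 1 above), whose output can feed script 3, can be sketched roughly as follows. This is a minimal illustration and not the code in ftp_utils/get_all_files_from_ftp_server.py. It assumes Python 3's standard ftplib (the repository scripts themselves target Python 2), a server that supports the MLSD command, anonymous login, and an output format of one ftp:// URL per line. The function names list_ftp_files and write_file_list are hypothetical.

```python
from ftplib import FTP


def list_ftp_files(ftp, path="/"):
    """Recursively collect file paths under `path` using MLSD listings."""
    files = []
    for name, facts in ftp.mlsd(path):
        if name in (".", ".."):
            continue  # skip the current/parent directory entries
        full_path = path.rstrip("/") + "/" + name
        kind = facts.get("type")
        if kind == "dir":
            files.extend(list_ftp_files(ftp, full_path))
        elif kind == "file":
            files.append(full_path)
    return files


def write_file_list(server, output_file="ftp_files.txt"):
    """Log in anonymously and write one ftp:// URL per line (assumed format)."""
    with FTP(server) as ftp:
        ftp.login()  # anonymous login; an assumption for this sketch
        paths = list_ftp_files(ftp)
    with open(output_file, "w") as handle:
        for file_path in paths:
            handle.write("ftp://{}{}\n".format(server, file_path))
    return len(paths)
```

Since FTP downloads tend to be slow, a list produced this way could be cut into smaller files before being passed to the download script.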
      

Domain-specific scripts:

So far there's only one, for edg.epa.gov/data/public.

Generate the list of files:
cd edg_epa_data_public
python -W ignore find_data_edg_epa.py
  • This script will generate 3 files:
    1. edg_epa_file_list.txt: The list of all the files that aren't ftp://
    2. edg_epa_ftp_file_list.txt: The list of all the files that are ftp://
    3. edg_epa_skipped_file_list.txt: The list of files that weren't downloaded for various reasons, including running out of disk space, exceeding the specified space limit, or 404s
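
The split between the ftp:// list and the non-FTP list can be illustrated with a short sketch. This is not the logic of find_data_edg_epa.py, only one plausible way to separate URLs by scheme. It assumes Python 3's urllib.parse, and the function name split_by_scheme is hypothetical.

```python
from urllib.parse import urlparse


def split_by_scheme(urls):
    """Return (non_ftp, ftp) lists of URLs, preserving input order."""
    non_ftp, ftp = [], []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines
        if urlparse(url).scheme.lower() == "ftp":
            ftp.append(url)
        else:
            non_ftp.append(url)
    return non_ftp, ftp
```

Each returned list could then be written out one URL per line, which is the input format the download scripts expect.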
Downloading the files:

Use the scripts described above to download the URLs in the edg_epa_file_list.txt and edg_epa_ftp_file_list.txt lists
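
For readers curious how a space-limited download such as the --max_space option of download_data.py might work, here is a rough sketch. It is an assumption-laden illustration, not the repository's implementation. It uses Python 3's urllib (the repository targets Python 2), treats the limit as a byte count, and names the saved files by list position plus the URL's last path segment. The names read_urls, fetch_limited and download_within_budget are hypothetical.

```python
import os
from urllib.request import urlopen

CHUNK = 64 * 1024  # read size per iteration, in bytes


def read_urls(list_path):
    """Read one URL per line, skipping blank lines."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_limited(url, target, remaining, opener):
    """Stream `url` into `target`; return bytes written, or None if the
    download would exceed `remaining` bytes."""
    written = 0
    with opener(url) as response, open(target, "wb") as out:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                return written
            written += len(chunk)
            if written > remaining:
                return None
            out.write(chunk)


def download_within_budget(urls, folder, max_bytes, opener=urlopen):
    """Download URLs into `folder` without exceeding `max_bytes` in total.

    Returns (saved, skipped) lists of URLs.
    """
    os.makedirs(folder, exist_ok=True)
    used, saved, skipped = 0, [], []
    for index, url in enumerate(urls):
        name = os.path.basename(url) or "download"
        target = os.path.join(folder, "{:05d}_{}".format(index, name))
        try:
            written = fetch_limited(url, target, max_bytes - used, opener)
        except OSError:
            # URLError/HTTPError (e.g. a 404) and disk errors are OSError subclasses
            written = None
        if written is None:
            if os.path.exists(target):
                os.remove(target)  # drop the partial file
            skipped.append(url)
            continue
        used += written
        saved.append(url)
    return saved, skipped
```

With these helpers, calling download_within_budget(read_urls("edg_epa_file_list.txt"), "downloads", 5 * 1024**3) would mirror a 5GB limit; the skipped list plays the same role as a skipped-file list.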
