
Harvesting Tools

A collection of code snippets mostly designed to be dropped into the data harvesting process directly after generating the zip starter kit. We welcome tools written in any language, especially if they cover use cases we haven't described. To add a new tool to this repository, please review our Contributing Guidelines.

Usage

  • Familiarize yourself with the harvesting instructions in the Data Rescue workflow repo. Within the pipeline app, click Download Zip Starter from the page related to a URL that you have checked out.
  • Unzip the zipfile
  • Choose a tool that seems likely to be helpful in capturing this particular resource, and copy the contents of its directory in this repo to the tools directory, e.g. with:
    cp -r harvesting-tools/TOOLNAME/* RESOURCEUUID/tools/
    
  • Adjust the base URL for the resource along with any other relevant variables, and tweak the content of the tool as necessary
  • After the resource has been harvested, proceed with the further steps in the workflow.

Matching Tools and Datasets

Each tool in this repo has a fairly specific use case. Your choice will depend on the shape and size of the data you're dealing with. Some datasets will require more creativity and more elaborate tools. If you write a new tool, please add it to the repo.

wget-loop for largely static resources

If you encounter a page that links to lots of data (for example a "downloads" page), this approach may well work. It's important to only use this approach when you encounter static data, for example PDFs, .zip archives, .csv datasets, etc.

The tricky part of this approach is generating a list of URLs to download from the page.

  • If you're comfortable using scripts in combination with HTML parsers (for example Python's wonderful beautiful-soup package), go for it.
  • If the URLs you're trying to access are dynamically generated by JavaScript in the browser environment, we've also included the jquery-url-extraction guide, which will walk you through the process of extracting URLs directly from the browser console.
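The URL-extraction step above can be sketched with nothing but the standard library (beautiful-soup makes it even shorter). This is a minimal illustration, not one of the repo's tools; the extension list and the example.gov URLs are placeholders to adjust for your resource:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# File types worth grabbing -- tweak per resource.
DATA_EXTENSIONS = (".pdf", ".zip", ".csv", ".xls", ".xlsx")

class DataLinkExtractor(HTMLParser):
    """Collect hrefs that point at downloadable data files."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(DATA_EXTENSIONS):
            # Resolve relative links against the page's base URL.
            self.links.append(urljoin(self.base_url, href))

def extract_data_links(html, base_url):
    parser = DataLinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Write the resulting list to a file, one URL per line, and `wget -i urls.txt` will loop over it for you.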

download_ftp_tree.py for FTP datasets

Government datasets are often stored on FTP; this script will capture FTP directories and subdirectories.

PLEASE NOTE that the Internet Archive has captured over 100 TB of government FTP resources since December 2016. Be sure to check the URL using check-ia/url-check, the Wayback Machine Extension, or your own tool that uses the Wayback Machine's API (example 1, example 2 w/ wildcard). If the FTP directory you're looking at has not been saved to the Internet Archive, be sure that it has also been nominated as a web crawl seed.
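If you'd rather roll your own check, the Wayback Machine exposes a simple availability endpoint at archive.org/wayback/available. A minimal sketch (the example.gov URL in the docstring is a placeholder):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"

def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot in an API response, or None."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def check_wayback(url):
    """Ask the availability API whether a URL (e.g. http://example.gov/data) is archived."""
    with urlopen(AVAILABILITY_API + "?" + urlencode({"url": url})) as resp:
        return closest_snapshot(json.load(resp))
```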

Whether it has been saved or not, you may decide to download it for chain-of-custody preservation reasons. If so, this script should do what you need.
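For the curious, the core of such a script fits in a few lines of ftplib. This is an independent sketch of the same idea, not download_ftp_tree.py itself (see that script for the real interface); the host and paths in the usage comment are hypothetical:

```python
import os
from ftplib import FTP, error_perm

def local_path_for(remote_path, local_root):
    """Map an FTP path like /pub/data/file.csv onto local_root/pub/data/file.csv."""
    return os.path.join(local_root, *remote_path.strip("/").split("/"))

def mirror_ftp_tree(ftp, remote_dir, local_root):
    """Recursively download remote_dir, detecting directories by attempting cwd()."""
    os.makedirs(local_path_for(remote_dir, local_root), exist_ok=True)
    for entry in ftp.nlst(remote_dir):  # note: some servers return bare names, not full paths
        try:
            ftp.cwd(entry)              # succeeds only for directories
            mirror_ftp_tree(ftp, entry, local_root)
        except error_perm:              # a plain file: fetch it
            with open(local_path_for(entry, local_root), "wb") as fh:
                ftp.retrbinary("RETR " + entry, fh.write)

# Usage (hypothetical host):
#   ftp = FTP("ftp.example.gov"); ftp.login()
#   mirror_ftp_tree(ftp, "/pub/data", "RESOURCEUUID/data")
```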

Ruby/Watir for full browser automation

Driving a full web browser should be the last resort for harvesting. It is slower than other approaches such as wget, curl, or a headless browser, and this implementation is prone to saving the resulting page before it has finished loading. There is a Ruby example in tools/example-hacks/watir.rb.

Identify Data Links & acquire them via WARCFactory

For search results from large document sets, you may need to do more sophisticated "scraping" and "crawling" -- check out tools built at previous events such as the EIS WARC archiver or the EPA Search Utils for ideas on how to proceed.

The utils directory is for scripts that have been useful in the past but may not have very general application. You still might find something you like!

API scrape / Custom Solution

If you encounter an API, chances are you'll have to build some sort of custom solution, or investigate a social angle, for example asking someone with greater access for a database dump. Be sure to include your code in the tools directory of your zipfile, and if there is any likelihood of general application, please add it to this repo.
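Most custom API harvesters reduce to the same loop: page through results until a short page signals the end. A generic sketch of that pattern, in which fetch_page is a hypothetical stand-in for whatever request the real API needs:

```python
def harvest_paginated(fetch_page, page_size=100):
    """Pull every record from a page-numbered API.

    fetch_page(page) must return a list of records for that page;
    a batch shorter than page_size marks the final page. The API
    shape is assumed for illustration -- adapt to the real endpoint.
    """
    records, page = [], 1
    while True:
        batch = fetch_page(page)
        records.extend(batch)
        if len(batch) < page_size:
            return records
        page += 1
```

Wrap the real HTTP call (with its auth headers, rate limiting, and retries) inside fetch_page and the loop stays unchanged.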

Contributors

aniketaranake, cabhishek, dcwalk, dgkf, diafygi, elliot42, titaniumbones
