DataHog

DataHog is a web application for analyzing how your storage space is being used. The app builds a database of files stored in iRODS collections (such as the CyVerse data store), Amazon S3 buckets, or directories on your device, and allows you to search, sort, and compare them. It provides information about file sizes, types, and duplicated files.

Running DataHog

Option 1: Discovery Environment

DataHog is available as an app on the CyVerse Discovery Environment. Simply click "Launch Analysis" to start up a new instance.

Option 2: Docker

The latest DataHog image is hosted on Docker Hub.

Alternatively, you can build it yourself by running docker build . -t datahog in the root directory.

The app runs on port 8000 of the container, so you'll want to publish the port using something like this:

docker run -it -p 8000:8000 <name:tag>

Option 3: Local Build

If you want to set up DataHog locally for development, follow these steps:

  1. Install SQLite 3
  2. Install Python 3.6.6
  3. Install RabbitMQ
  4. Install the pip packages in django/requirements.txt
  5. Run python manage.py migrate inside the django directory to initialize your database.
  6. Run python manage.py runserver to start the server.
  7. In another terminal, run celery -A celery_app worker to start a task worker process.
  8. Install Node.js 8.12.0
  9. Install the npm packages using npm install inside the react directory.
  10. Run npm run js to build the JS files (the build will auto-refresh if you keep it running).

Usage Guide

The launch page offers five options for importing file data into DataHog:

  • iRODS: Use the iRODS API to import data from a specific collection. The options for importing files from the CyVerse data store are prefilled.
  • .datahog File: Upload a .datahog file containing file data. These can be generated by a Python script which you can download and run on any machine (see: Crawler Script).
  • CyVerse: Use the CyVerse file search API to import any data stored in the data store. This method currently does not support exact duplicate matching, and may be slower than iRODS in some cases.
  • S3 Bucket: Use your AWS access keys to import an S3 bucket, or a specific directory from one.
  • Restore Database: If you previously backed up a DataHog database, you can upload it to restore your data.

Depending on how many files are being scanned, the import process can take a few minutes to complete. Extremely large directories (millions of files) may take much longer; feel free to close the tab and check on the import later.

Once the import process for your first file source is complete, you will have access to four tabs:

  • Summary: View a summary of each of your file sources, including various file rankings and visualizations.
  • Browse Files: Explore the folder structure of each of your file sources, or search your files by name, regular expression, or date and size filters. Each column header can be clicked to sort the table by that value.
  • Duplicated Files: View a list of files with identical contents. By default, this page uses checksums to compare files, but file sizes or names can also be used. Each column header can be clicked to sort the table by that value.
  • Manage File Sources: Import a new file source, remove an existing one, or download a backup of the current file database.
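
Whichever comparison key is used, duplicate matching boils down to grouping file records by a shared value and keeping only the groups with more than one member. A minimal Python sketch of that grouping logic (the record layout and names here are illustrative, not DataHog's actual schema):

```python
from collections import defaultdict

# Hypothetical file records: (path, size_in_bytes, md5_checksum).
records = [
    ("/data/a.txt", 120, "9e107d9d372bb6826bd81d3542a419d6"),
    ("/data/copy_of_a.txt", 120, "9e107d9d372bb6826bd81d3542a419d6"),
    ("/data/b.txt", 300, "e4d909c290d0fb1ca068ffaddf22cbd0"),
]

def group_duplicates(records, key):
    """Group records by a key function; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[0])
    return {k: paths for k, paths in groups.items() if len(paths) > 1}

# Compare by checksum (the default); size (r[1]) or name would work the same way.
by_checksum = group_duplicates(records, key=lambda r: r[2])
```

Swapping the key function is why the Duplicated Files tab can offer checksum, size, or name matching without changing the underlying logic.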

Crawler Script

The DataHog Crawler Script is a small Python 3 program used to scan a directory and generate a .datahog file, which can be imported directly into DataHog.

You can run it like so:

python3 datahog_crawler.py <root path> [<options>]

The script calculates an MD5 checksum for each file it scans in order to detect duplicated files. This can be slow for large directories, so you can use the -n or --no-checksums option to skip checksum calculation.

By default, the script creates a file called <directory name>.datahog, but this can be overridden with the -o or --output option.
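
Conceptually, the crawler walks the directory tree, recording a size for every file and, unless checksums are disabled, an MD5 digest of its contents. A minimal sketch of that core loop, using hypothetical names (the real script and the .datahog format may differ; the JSON output here is only a stand-in):

```python
import hashlib
import json
import os

def crawl(root, checksums=True):
    """Walk `root` and return a record (path, size, optional md5) per file."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            record = {"path": path, "size": os.path.getsize(path)}
            if checksums:
                md5 = hashlib.md5()
                with open(path, "rb") as f:
                    # Read in chunks so large files don't exhaust memory.
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        md5.update(chunk)
                record["md5"] = md5.hexdigest()
            files.append(record)
    return files

def write_output(records, output_path):
    """Serialize records to JSON (a stand-in for the real .datahog format)."""
    with open(output_path, "w") as f:
        json.dump(records, f)
```

Reading each file in fixed-size chunks keeps memory use flat even for very large files; it also shows why the -n option speeds things up so much, since without checksums the crawler never reads file contents at all.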
