Welcome to Metamail

Metamail is a mail analysis tool based on Hadoop. The project is composed of three subprojects:

  • Mail Analyzer. Hadoop jobs for analysing mail.
  • Enron Mail Importer. Imports the 'Enron Email Dataset' into an HBase database.
  • Metamail UI. Web UI to visualize the results of the Hadoop jobs.

Each subproject folder contains a README. Please read them to learn how to run Metamail.

There is a demo version of the Metamail UI available at http://igalia-metamail.herokuapp.com

What statistics does it include?

Metamail includes the following statistics:

  • Mails by size
  • Mails by thread length
  • Mails per day of the week
  • Mails per hour of the day
  • Mails per month
  • Mails per year
  • Mails received
  • Mails sent
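
As a rough illustration of how one of these statistics can be computed, the sketch below counts mails per day of the week in a map/reduce style. It is not Metamail's actual Hadoop job: the function names and the sample `Date` headers are hypothetical, and it runs in plain Python rather than on a cluster.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def map_day_of_week(date_header):
    """Map step (hypothetical): emit (day name, 1) for one Date header."""
    sent = parsedate_to_datetime(date_header)
    return DAYS[sent.weekday()], 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the emitted counts per key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample headers in RFC 2822 format, for illustration only.
headers = [
    "Mon, 14 May 2001 16:39:00 -0700",
    "Tue, 15 May 2001 09:12:00 -0700",
    "Mon, 21 May 2001 08:00:00 -0700",
]
mails_per_day = reduce_counts(map_day_of_week(h) for h in headers)
```

The same map/reduce shape applies to the per-hour, per-month and per-year statistics by changing the key that the map step emits.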

Can I use it for analyzing my own email?

Yes. Metamail is a general-purpose tool and should work on any email dataset. Since it is based on Hadoop, Metamail can analyze large amounts of data. However, note that Metamail was built as a learning tool, so you will very likely need to tune some parts to make it work with your own dataset. Patches are welcome.

Some things to consider if you would like to use Metamail in your own organization:

  • First, use the database importer (enron-importer) to import your email dataset.
  • Build the MapReduce jobs and execute them. Check that all jobs finish successfully.
  • Copy the results of the MapReduce jobs to the web UI. This step requires you to put the output in the right format (check the current .csv files). You can do it either manually or by writing your own scripts.
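
The last step can be scripted. The sketch below assumes the job output is in Hadoop's default text format (one `key<TAB>value` pair per line) and that the target file is a two-column CSV with a header row. Both the function name and the `key,count` header are hypothetical; check the current .csv files in the web UI for the real layout before relying on it.

```python
import csv
import io


def reducer_output_to_csv(lines, header=("key", "count")):
    """Convert 'key<TAB>value' lines into CSV text with a header row.

    The header names are placeholders, not the format Metamail UI expects.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        writer.writerow([key, int(value)])
    return out.getvalue()


# Made-up sample lines, shaped like the contents of a part-r-00000 file.
sample = ["Monday\t1200\n", "Tuesday\t1350\n"]
csv_text = reducer_output_to_csv(sample)
```

To process a real output file, pass an open file object instead of the sample list, for example `reducer_output_to_csv(open("part-r-00000"))`.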

What data does Metamail come with?

Metamail uses the 'Enron Email Dataset' as example data. This dataset, which is about 500MB, was made public and published on the web during the Enron trial. Many researchers now use it as sample email data. The Enron Email Dataset can be found at http://www.cs.cmu.edu/~enron/.

Visualization

The data used for visualization differs from the raw output of the MapReduce jobs: it is simplified to make the charts easier to read. For instance, 'Mails by thread length' may return thousands of threads, many with just one email, which is not useful to plot. In this case, the chart shows only the top 50 entries by thread length.

  • Mails by size. Mails are grouped in 20KB intervals (0-20KB, 20KB-40KB, etc.). As almost all mails fall in the 0-20KB interval, that interval was discarded (1 unit = 10 emails)
  • Mails by thread length. Top 50 mails by thread length
  • Mails per day of the week. From Monday-Sunday, mails on each day
  • Mails per hour of the day. From 0 hours to 23 hours, mails by hour
  • Mails per month. Only the emails received in 2001 are shown
  • Mails per year
  • Mails received. Top 50 people who received the most email
  • Mails sent. Top 50 people who sent the most email

In all charts 1 unit = 1000 emails, except 'Mails by size'.
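
The simplifications above can be sketched in a few helper functions. This is a minimal illustration, not the code Metamail uses: the function names are hypothetical, the interval label format is illustrative, and 1KB is assumed to be 1024 bytes.

```python
def top_n(counts, n=50):
    """Keep the n entries with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def size_bucket(size_bytes, width_kb=20):
    """Label the 20KB-wide interval a mail size falls into (1KB = 1024 bytes assumed)."""
    low = (size_bytes // (width_kb * 1024)) * width_kb
    return f"{low}KB-{low + width_kb}KB"


def to_units(count, emails_per_unit=1000):
    """Scale a raw count to chart units (1 unit = 1000 emails by default)."""
    return count / emails_per_unit


# Made-up sample data, for illustration only.
thread_lengths = {"thread-a": 3, "thread-b": 5, "thread-c": 1}
longest = top_n(thread_lengths, n=2)
```

For the 'Mails by size' chart, `to_units(count, emails_per_unit=10)` matches the 1 unit = 10 emails scale, and entries labelled with the first interval would be dropped before plotting.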

Contact

If you have any request regarding Metamail, you can contact me at [email protected]. If you would like to contribute code, please check the repo on GitHub: https://github.com/dpino/Metamail.