enron

The expected file structure is:

enron_root_directory

-- edrm-enron-v1

-- edrm-enron-v2

-- lost+found

My program expects enron_root_directory/edrm-enron-v2 as input. It needs only read access to that folder and its contents, not write access. Some of the test cases create zip files (and delete them after the test), so they require write access to the directory where the code resides.

Set up: If you are using a virtualenv, activate the virtual environment, then install the required packages:

$ pip install -r requirements.txt

Testing: The simplest way to test is to run $ pytest -v (add -s to see console output).

Running: python script.py /path/to/enron/edrm/v2

Example output is provided in the repository.

Hardware requirements: I had to use a t2.small instance because the memory of a t2.micro was not sufficient. [The program had consumed more than 90% of the 1 GB of RAM at 93% of the runtime, and swapping was required.] The typical run time on the full dataset is 12-14 minutes. I think unzipping the archives and then processing them would be much slower than reading them with the zipfile module.
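To illustrate reading an archive without unzipping it, here is a minimal sketch using the standard zipfile module. It is not the code from script.py; the function name read_single_xml and the member names in the demo are hypothetical, and it assumes the single XML file can be recognised by its .xml extension.

```python
import io
import zipfile


def read_single_xml(zip_source):
    """Return the bytes of the only .xml member of a zip archive.

    zip_source may be a path or a file-like object. Nothing is extracted
    to disk, so read access to the archive is enough.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the metadata file is identified by its extension.
        xml_names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
        if len(xml_names) != 1:
            raise ValueError(f"expected exactly one XML file, found {len(xml_names)}")
        return archive.read(xml_names[0])


if __name__ == "__main__":
    # Build a tiny archive in memory so the demo touches no real data.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("metadata.xml", "<Root/>")      # hypothetical name
        archive.writestr("text_000/1.txt", "hello")      # hypothetical name
    print(read_single_xml(buffer))
```

Reading members directly keeps disk usage flat, but each call to archive.read holds the whole member in memory, which may matter on a small instance.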

Assumptions:

1). A word in the email body is any run of alphanumeric content of the form *some_alpha_numeric_content*, where the surrounding * stands for a delimiter such as ',', ' ', '.', '*', etc.

2). Every zip file contains exactly one XML file, which holds the metadata for the emails.

3). The email body is standard MIME and can be parsed by any MIME parser.

4). An email address must contain @ to be considered valid; I remove any special characters from the word containing @.
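Assumptions 1, 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in script.py: the function names are hypothetical, a word is taken to be a run of ASCII letters and digits, and the set of "special characters" stripped from an address is my guess, since the exact rule is not spelled out above.

```python
import re
from email import message_from_string
from email.policy import default

# Assumption: a word is a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")
# Assumption: everything except these characters counts as "special".
ADDRESS_JUNK_RE = re.compile(r"[^A-Za-z0-9@._+-]")


def plain_body(raw_email):
    """Parse a raw MIME message and return its text/plain body, or ''."""
    message = message_from_string(raw_email, policy=default)
    part = message.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def split_words(body):
    """Split a body into words; any non-alphanumeric character delimits."""
    return WORD_RE.findall(body)


def clean_address(token):
    """Return the token without special characters, or None if it has no '@'."""
    if "@" not in token:
        return None
    return ADDRESS_JUNK_RE.sub("", token)


if __name__ == "__main__":
    raw = "From: a@example.com\nSubject: hi\n\nHello, world.\n"
    print(split_words(plain_body(raw)))
    print(clean_address("<john.doe@example.com>,"))
    print(clean_address("nobody"))
```

Under these rules a token such as "@@" would still pass the validity check, so a stricter pattern may be needed if address quality matters.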
