Code Monkey home page Code Monkey logo

mailing-list-problem's Introduction

Problem

Given a table 'mailing':

CREATE TABLE mailing ( addr VARCHAR(255) NOT NULL );

The mailing table will initially be empty. New addresses will be added on a daily basis. It is expected that the table will store at least 10,000,000 email addresses and 100,000 domains.

Write a Python script that updates another table which holds a daily count of email addresses by their domain name.

Use this table to report the top 50 domains by count sorted by percentage growth of the last 30 days compared to the total.

** NOTE **

  • The original mailing table should not be modified.

  • All processing must be done in Python (eg. no complex queries or sub-queries)

  • Submit a compressed file(tar/zip) with the files required to run your script.

Solution

The following assumptions were made for the solution:

  • "percentage growth of the last 30 days compared to the total" meant that the percentage growth for a domain would be the increase/decrease in the last 30 days over the total initial count. (see Example 2 below)
  • the second table can contain an additional date column

This solution utilizes the peewee ORM (a small and simple Object-relational mapping library for MySQL). peewee can be obtained from: http://peewee.readthedocs.org/en/latest/index.html# or by running "pip install peewee".

Util.py

This file contains:

  • the credentials for the database
  • a simple function to get the first and last element in a data set
  • a function to format and print the final solution

handle_addresses.py

This file handles adding and removing addrs from the Mailing table.

run_update.py

This file contains the main solution to the problem. It begins be creating the Mailing and Domain Model classes, which represent the corresponding tables. A connection to the database is then created and the tables are creates if necessary.

The Domain table is where the domain counts are stored with their respective dates. A domain name and count entry is only created if there is no previous count or the count has changed. This means that the upper bound to the table size would be 100,000 * 30 rows while the lower bound would be 100,000. Because the probability of new addrs from all domains being added is low, we will likely never use 100,000 * 30 rows.

Example 1:

+----+-------------+-------+------------+
| id | name        | count | date       |
+----+-------------+-------+------------+
|  3 | gmail.com   |     1 | 2015-05-28 |
|  4 | yahoo.com   |     3 | 2015-05-29 |
|  5 | yahoo.com   |     6 | 2015-05-30 |
+----+-------------+-------+------------+

During each update, entries older than 30 days are deleted since they no longer effect the final output.

To get the top 50 domains by count, a MySQL query is used. This query sorts the rows in descending order by count, while grouping rows with the same domain name. Then using the LIMIT feature, the top 50 rows are obtained.

To get the percentage growth for each top 50 domain, the difference between the newest and oldest count is obtained and then divided by the oldest count.

Example 2:

Using the table from Example 1 above, we can see that the oldest entry for yahoo.com has a count of 3 and the newest entry has a count of 6.

6-3 ___ x 100 = 100% 3

Therefore, yahoo.com had a percentage growth of 100% in the last 30 days. Note that if the oldest entry was older than 30 days, it would have been removed. In that case, the growth would be 0%, as would be the case for gmail.com.

tests.py

This file contains several tests, each with a description. The file tests things such as sanity test, negative growth, sorting by growth, load test, etc.

To run the tests, execute the following line: python test.py

Alternatively, you can view the sample output in tests_output.txt

Note: the addresses for the load test were obtained from http://www.sistersofspam.co.uk/Scam_Email_Addresses_1.php

mailing-list-problem's People

Contributors

farhan3 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.