csc591_dic_f17_project's Introduction

MapReduce performance evaluation on EC2 (HDFS HA) vs EMR

Application - Online spam reviewers detection

CSC 591 (002) Data Intensive Computing, Fall ’17

Team 3
Divya Guttikonda (dguttik), Nithya Kumar (nkumar8), Sahithi Guddeti (sguddet)

Abstract
Online product reviews, such as those on Amazon, are increasingly prevalent because users regard them as trustworthy, and many shoppers rely on them when deciding what to buy. One major problem is ‘opinion spamming’, where fraudulent reviewers write manipulative spam reviews to promote or demote a product. Because this is a large-scale distributed problem, analyzing and handling such huge volumes of data is limited by storage and cost constraints. In this project, we address the problem by proposing a MapReduce application framework to detect online spam reviewers, and by building and comparing two infrastructure models for handling huge volumes of input data: Hadoop HA on EC2 (with EBS) and EMR (with S3). Our project successfully detects spammers in several large product-review datasets [1] and evaluates the performance of MapReduce on the EC2 and EMR clusters we built. The performance evaluation considers application execution time, instance costs, storage costs, the number of input splits, the number of map and reduce tasks, and the number of output file partitions. Our conclusions focus on the optimality and cost-effectiveness of our MapReduce application in detecting spammers, and on which infrastructure to choose under different data-intensive scenarios.
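To illustrate the general shape of a per-reviewer MapReduce job, here is a minimal single-process sketch in Python. It is not the project's actual application (see the source code linked below for that). The record layout, the function names, and the "mostly extreme ratings" heuristic with its thresholds are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical review records: (reviewer_id, product_id, star_rating).
REVIEWS = [
    ("r1", "p1", 5), ("r1", "p2", 5), ("r1", "p3", 5), ("r1", "p4", 5),
    ("r2", "p1", 4), ("r2", "p2", 2),
    ("r3", "p3", 1), ("r3", "p1", 1), ("r3", "p2", 1),
]


def map_review(record):
    """Map phase: emit (reviewer_id, flag), where flag is 1 for a 1- or 5-star review."""
    reviewer_id, _product_id, rating = record
    yield reviewer_id, 1 if rating in (1, 5) else 0


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_reviewer(reviewer_id, flags, min_reviews=3, min_extreme_ratio=0.9):
    """Reduce phase: emit reviewers whose reviews are almost all extreme.

    The thresholds are illustrative assumptions, not values from the project.
    """
    total = len(flags)
    ratio = sum(flags) / total
    if total >= min_reviews and ratio >= min_extreme_ratio:
        yield reviewer_id, (total, ratio)


def detect_spammers(reviews):
    """Run map, shuffle, and reduce in one process and collect the output."""
    pairs = (kv for record in reviews for kv in map_review(record))
    grouped = shuffle(pairs)
    suspects = {}
    for reviewer_id in sorted(grouped):
        for key, stats in reduce_reviewer(reviewer_id, grouped[reviewer_id]):
            suspects[key] = stats
    return suspects


if __name__ == "__main__":
    for reviewer_id, (total, ratio) in detect_spammers(REVIEWS).items():
        print(reviewer_id, total, ratio)
```

On a real cluster the map and reduce functions would run as distributed tasks over input splits, and the framework would perform the grouping step. The in-memory `shuffle` here only stands in for that behavior.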

Project Claims

Project Report

Project Presentation

Hadoop cluster over EC2 instances

EMR Architecture

Spam Detection MapReduce Application Source Code

Hadoop over EC2 results

EMR results
