

BigData-Spark

The idea of this project is to get used to working with Kafka, Spark, and HBase, and to graph the findings in Jupyter, doing some analysis on each new submission coming from Reddit.

To start the project, please install Kafka and move it to the home directory. (Note: make sure the directory is renamed to Kafka; this is needed by the provided scripts that start the ZooKeeper/Kafka servers.)

First-tier component: the Kafka producer, which is written in Python. Before running it, start: 1) zookeeper-server.sh 2) kafka-server.sh
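The provided zookeeper-server.sh and kafka-server.sh wrapper scripts are not shown in this README; under a default Kafka layout installed to ~/Kafka as described above, they would amount to something like the following (paths are assumptions):

```shell
# Start ZooKeeper first (the Kafka broker depends on it), then Kafka itself.
# Paths assume the Kafka distribution was moved to ~/Kafka as described above.
~/Kafka/bin/zookeeper-server-start.sh ~/Kafka/config/zookeeper.properties &
~/Kafka/bin/kafka-server-start.sh ~/Kafka/config/server.properties &
```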

It is recommended to have Python 3 installed and to use pip3 to install the following modules before running the producer: kafka, praw.
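The install step would look like the following (on PyPI the Kafka client module is published as kafka-python and PRAW as praw):

```shell
# Install the Kafka client and the Reddit API wrapper for the producer.
pip3 install kafka-python praw
```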

The data source we are using is Reddit, accessed through the PRAW Python module; the producer then streams this data to Kafka on localhost under the topic final-lab-topic.
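A minimal sketch of such a producer (only the topic name final-lab-topic comes from this README; the credentials, the r/all subreddit, and the payload fields are placeholders/assumptions):

```python
import json

TOPIC = "final-lab-topic"

def submission_payload(submission):
    """Build the JSON-serializable record sent to Kafka for one submission."""
    return {
        "id": submission.id,
        "title": submission.title,
        "subreddit": str(submission.subreddit),
        "created_utc": submission.created_utc,
    }

def run():
    # kafka/praw are imported lazily so the payload helper above is usable
    # (and testable) without either library installed.
    from kafka import KafkaProducer  # pip3 install kafka-python
    import praw                      # pip3 install praw

    # Placeholder credentials: obtain real ones from reddit.com/prefs/apps.
    reddit = praw.Reddit(client_id="YOUR_ID",
                         client_secret="YOUR_SECRET",
                         user_agent="bigdata-spark-producer")
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    # Stream every new submission and publish it onto the Kafka topic.
    for submission in reddit.subreddit("all").stream.submissions():
        producer.send(TOPIC, submission_payload(submission))

if __name__ == "__main__":
    run()
```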

Second-tier component: we currently have two Kafka consumers that listen on this topic:

  1. Java consumer: reads the streamed data using Kafka and Spark, then adds it to the database using Spark SQL. This must run on a different machine with the following installed: A) HBase B) Hadoop 2.x+ C) Spark. The dependencies are all included in the pom.xml file; you can use the IntelliJ IDE to download any missing ones.
  2. Python consumer: currently reads the data on the same machine; it is built for a future implementation of a machine learning model that will run on the streamed data. Install pyspark before running this one.
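The Python consumer's code is not shown in this README; a sketch of one way it could look with Spark Structured Streaming is below. The schema fields mirror the producer sketch above and are assumptions, and running it requires the spark-sql-kafka connector package on the Spark classpath:

```python
import json

TOPIC = "final-lab-topic"

def decode_record(raw):
    """Decode one Kafka message value (UTF-8 JSON bytes) into a dict."""
    return json.loads(raw.decode("utf-8"))

def run():
    # pyspark is imported lazily so decode_record stays usable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, DoubleType)

    spark = SparkSession.builder.appName("reddit-consumer").getOrCreate()
    schema = StructType([
        StructField("id", StringType()),
        StructField("title", StringType()),
        StructField("subreddit", StringType()),
        StructField("created_utc", DoubleType()),
    ])
    # Read the stream from the local broker and parse the JSON payload.
    (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", TOPIC)
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*")
          .writeStream
          .format("console")
          .start()
          .awaitTermination())

if __name__ == "__main__":
    run()
```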

Third component, data visualization: currently reads the data from the JSON file produced by the Python producer; in the future it will read the data directly from the HBase instance hosted on the other machine.

Install Jupyter Notebook, pandas, plotly, and matplotlib before running this one.
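A sketch of what such a notebook cell could do (the JSON-lines file format, its name, and the submissions-per-subreddit chart are assumptions; the README does not show the visualization code):

```python
import json
from collections import Counter

def load_records(path):
    """Read the producer's output file, assuming one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def submissions_per_subreddit(records):
    """Count how many streamed submissions came from each subreddit."""
    return Counter(r["subreddit"] for r in records)

def run(path):
    # pandas/matplotlib are imported lazily so the helpers stay testable.
    import pandas as pd
    import matplotlib.pyplot as plt

    counts = submissions_per_subreddit(load_records(path))
    pd.Series(counts).sort_values(ascending=False).plot(kind="bar")
    plt.ylabel("submissions")
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    run("submissions.json")  # placeholder path; the real file name is not given
```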

TO DO: In the data visualization phase, make the code read directly from the remote machine that hosts Hive/HBase, using Spark SQL instead of reading a local file.

In the Python Kafka consumer, implement a way to find the most common categories people are talking about and use it for data visualization.
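One possible starting point for this TO DO item (treating frequent title words as a crude proxy for "categories" is an assumption; the project may intend something more sophisticated):

```python
import re
from collections import Counter

# Tiny stop-word list for illustration; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "for", "on"}

def most_common_topics(titles, n=10):
    """Rank the words that appear most often across submission titles,
    skipping stop words; a rough stand-in for 'most common categories'."""
    words = (w for t in titles for w in re.findall(r"[a-z']+", t.lower()))
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)
```

For example, `most_common_topics(["Spark vs Kafka", "Kafka tips"], n=1)` returns `[("kafka", 2)]`.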

Contributors

verlich
