
kafka_spark_structured_streaming

Information


This repo illustrates a Spark Structured Streaming pipeline.

It gets random names from an API and sends the name data to a Kafka topic every 10 seconds using Airflow. Every message is then read by a Kafka consumer using Spark Structured Streaming and written to a Cassandra table at regular intervals.

stream_to_kafka_dag.py -> The DAG script that sends the API data to the Kafka topic every 10 seconds.

stream_to_kafka.py -> The script that gets the data from the API and sends it to the Kafka topic (see the sketch below).

spark_streaming.py -> The script that consumes the data from the Kafka topic with Spark Structured Streaming.

response.json -> Sample response coming from the API
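
For reference, here is a minimal sketch of what the producer logic might look like, assuming the randomuser.me API and the kafka-python client; the broker address and function names are illustrative assumptions, not necessarily the repo's exact code:

# stream_to_kafka.py -- a minimal sketch; the API endpoint, broker address,
# and function names are assumptions, not necessarily the repo's exact code.
import json
import time

import requests
from kafka import KafkaProducer


def create_record() -> dict:
    """Fetch one random person from the API and flatten the fields we store."""
    res = requests.get("https://randomuser.me/api/?results=1").json()["results"][0]
    location = res["location"]
    return {
        "full_name": f"{res['name']['first']} {res['name']['last']}",
        "gender": res["gender"],
        "location": location["street"]["name"],
        "city": location["city"],
        "country": location["country"],
        "postcode": location["postcode"],  # stored as int in Cassandra
        "latitude": float(location["coordinates"]["latitude"]),
        "longitude": float(location["coordinates"]["longitude"]),
        "email": res["email"],
    }


def stream_data() -> None:
    """Send one JSON record to the random_names topic every 10 seconds."""
    producer = KafkaProducer(
        bootstrap_servers=["kafka1:19092"],  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        producer.send("random_names", create_record())
        producer.flush()
        time.sleep(10)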

Apache Airflow

Run the following command to clone the necessary repo to your local machine:

git clone https://github.com/dogukannulu/docker-airflow.git

After cloning the repo, run the following command only once:

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .

Then replace the docker-compose-LocalExecutor.yml file with the one in this repo and add the requirements.txt file to the same folder. This binds the Airflow container to the Kafka and Spark containers, and the necessary modules are installed automatically:

docker-compose -f docker-compose-LocalExecutor.yml up -d

Now you have a running Airflow container, and you can access the UI at http://localhost:8080.

Apache Kafka

docker-compose.yml will create a multi-node Kafka cluster. We can set the replication factor to 3 since there are 3 nodes (kafka1, kafka2, kafka3). The Kafka UI is available at localhost:8888.

We only need to run:

docker-compose up -d

After accessing the Kafka UI, we can create the topic random_names. Then, we can see the messages arriving in the topic.
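
Alternatively, the topic can be created from inside one of the broker containers; a sketch, assuming the container is named kafka1, the broker listens on port 9092 inside the Docker network, and the image ships the kafka-topics binary under that name:

docker exec -it kafka1 kafka-topics --bootstrap-server localhost:9092 --create --topic random_names --replication-factor 3 --partitions 3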


Cassandra

docker-compose.yml will also create a Cassandra server. All environment variables are defined in docker-compose.yml; they are also defined in the scripts.

By running the following command, we can access the Cassandra server:

docker exec -it cassandra /bin/bash

Once inside the container, we can run the following command to access the cqlsh CLI:

cqlsh -u cassandra -p cassandra

Then, we can run the following commands to create the keyspace spark_streaming and the table random_names:

CREATE KEYSPACE spark_streaming WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE spark_streaming.random_names(
    full_name text PRIMARY KEY,
    gender text,
    location text,
    city text,
    country text,
    postcode int,
    latitude float,
    longitude float,
    email text);

DESCRIBE spark_streaming.random_names;


Running DAGs

We should move the stream_to_kafka.py and stream_to_kafka_dag.py scripts into the dags folder of the docker-airflow repo. Then we can see that random_people_names appears on the DAGs page.

When we toggle the DAG from OFF to ON, the data starts being sent to the Kafka topic every 10 seconds. We can check this from the Kafka UI as well.
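
For reference, a minimal sketch of what stream_to_kafka_dag.py might look like, assuming the producer loop lives in a stream_data function as sketched earlier; the schedule and import paths are assumptions:

# stream_to_kafka_dag.py -- a minimal sketch; the schedule and function
# name are assumptions, not necessarily the repo's exact code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import used by the puckel image

from stream_to_kafka import stream_data  # hypothetical producer entry point

with DAG(
    dag_id="random_people_names",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # the task itself loops, producing every 10 seconds
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_data,
    )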


Spark

First of all, we should copy the local PySpark script into the container:

docker cp spark_streaming.py spark_master:/opt/bitnami/spark/

We should then access the Spark container and download the necessary JAR files into the jars directory:

docker exec -it spark_master /bin/bash

We should run the following commands to download the necessary JAR files for Spark version 3.3.0. Note that both JARs must match the Scala version of the Spark build, which is 2.12 for this image:

cd jars
curl -O https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.3.0/spark-cassandra-connector_2.12-3.3.0.jar
curl -O https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.0/spark-sql-kafka-0-10_2.12-3.3.0.jar

While the API data is being sent to the Kafka topic random_names regularly, we can submit the PySpark application and write the topic data to the Cassandra table:

cd ..
spark-submit --master local[2] --jars /opt/bitnami/spark/jars/spark-sql-kafka-0-10_2.12-3.3.0.jar,/opt/bitnami/spark/jars/spark-cassandra-connector_2.12-3.3.0.jar spark_streaming.py
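
For reference, a minimal sketch of what spark_streaming.py might contain; the hostnames, ports, and option values are assumptions based on the docker-compose setup described above:

# spark_streaming.py -- a minimal sketch; hostnames, ports, and option values
# are assumptions based on the docker-compose setup described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (FloatType, IntegerType, StringType,
                               StructField, StructType)

# Schema matching the spark_streaming.random_names table
schema = StructType([
    StructField("full_name", StringType(), False),
    StructField("gender", StringType(), True),
    StructField("location", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("postcode", IntegerType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
    StructField("email", StringType(), True),
])

spark = (SparkSession.builder
         .appName("kafka_to_cassandra")
         .config("spark.cassandra.connection.host", "cassandra")  # assumed hostname
         .getOrCreate())

# Read the topic as a stream and parse the JSON payload into columns
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:19092")  # assumed broker address
      .option("subscribe", "random_names")
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("data"))
      .select("data.*"))

# Stream the parsed rows into Cassandra; a checkpoint location is required
query = (df.writeStream
         .format("org.apache.spark.sql.cassandra")
         .option("keyspace", "spark_streaming")
         .option("table", "random_names")
         .option("checkpointLocation", "/tmp/checkpoint")
         .start())

query.awaitTermination()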

After running the spark-submit command, we can see that the data is populated into the Cassandra table.
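
We can verify this from within cqlsh:

SELECT * FROM spark_streaming.random_names LIMIT 10;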


Enjoy :)
