IOT Demo

Spark Streaming + Kafka + Kudu

This is a demonstration of how to use Spark and Spark Streaming to read from Kafka and insert data into Kudu, all in Python. The streaming data comes from the Particle.io event stream, which requires an API key to consume. I believe this may be the first demonstration of reading from and writing to Kudu from Spark Streaming using Python.

Content Description
  • particlespark.py: SSEClient reading from the Particle.io event stream and publishing to a Kafka topic
  • iot_demo.py: Spark Streaming application reading from the Kafka topic and inserting into a Kudu table (sketched below, after this list)
  • event_count.py: Spark Streaming application reading from the Kafka topic, counting the unique event words of the last 20 seconds and upserting them into a Kudu table every 20 seconds (shows update capabilities; see the windowed-count sketch after the spark-submit commands)
  • data_count.py: Same as above, but counting words in the data field instead of the event field
  • event_count_total.py: Spark batch job that reads from the master event table (particle_test), counts the total occurrences of each word across all time, and upserts the results (shows update capabilities)
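
The streaming path in iot_demo.py boils down to the pattern below. This is a minimal sketch, not the exact contents of the script: the broker address and port, the batch interval, the JSON field handling, and the kudu-spark DataSource name ('org.kududb.spark.kudu', the package the 0.10 connector shipped under) are assumptions to verify against your environment.

# Hedged sketch of iot_demo.py: Kafka direct stream -> DataFrame -> Kudu.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

KAFKA_BROKERS = "ip-10-0-0-224.ec2.internal:9092"  # assumption: broker host/port
KUDU_MASTER = "ip-10-0-0-224.ec2.internal:7051"

sc = SparkContext(appName="iot_demo_sketch")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 5)  # 5-second batches (assumption)

# Direct stream from the 'particle' topic; values are the JSON events
# published by particlespark.py.
stream = KafkaUtils.createDirectStream(ssc, ["particle"],
                                       {"metadata.broker.list": KAFKA_BROKERS})
events = stream.map(lambda kv: json.loads(kv[1]))

def write_to_kudu(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd.map(lambda e: Row(
        coreid=e.get("coreid"),
        published_at=e.get("published_at"),
        data=e.get("data"),
        event=e.get("event"),
        ttl=e.get("ttl"))))
    # kudu-spark 0.10 DataSource; verify the format string and options
    # against the kudu-spark jar you pass to spark-submit.
    df.write.format("org.kududb.spark.kudu") \
        .option("kudu.master", KUDU_MASTER) \
        .option("kudu.table", "particle_test") \
        .mode("append") \
        .save()

events.foreachRDD(write_to_kudu)
ssc.start()
ssc.awaitTermination()
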
Versions
  • CDH 5.8
  • Kafka 2.0.2-1.2.0.2.p0.5
  • Kudu 0.10.0-1.kudu0.10.0.p0.7
  • Impala_Kudu 2.6.0-1.cdh5.8.0.p0.17
Python Dependencies
sudo pip install sseclient
sudo pip install kafka-python
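
For orientation, the producer side (particlespark.py) follows the pattern below. This is a minimal sketch assuming the public Particle.io event-stream endpoint and a Kafka broker on the default port; the token handling, endpoint, and message format are placeholders rather than the script's exact code.

# Hedged sketch of particlespark.py: Particle.io SSE stream -> Kafka.
import json
from sseclient import SSEClient
from kafka import KafkaProducer

ACCESS_TOKEN = "<your-particle-api-token>"         # assumption: token passed in the URL
STREAM_URL = "https://api.particle.io/v1/events?access_token=" + ACCESS_TOKEN
KAFKA_BROKERS = "ip-10-0-0-224.ec2.internal:9092"  # assumption: broker host/port

producer = KafkaProducer(bootstrap_servers=KAFKA_BROKERS)

# SSEClient yields server-sent events; Particle puts the JSON payload in
# msg.data and the event name in msg.event.
for msg in SSEClient(STREAM_URL):
    if not msg.data:
        continue  # skip keep-alive messages
    payload = json.loads(msg.data)
    payload["event"] = msg.event
    producer.send("particle", json.dumps(payload).encode("utf-8"))
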
Impala create tables:
CREATE TABLE `particle_test` (
`coreid` STRING,
`published_at` STRING,
`data` STRING,
`event` STRING,
`ttl` BIGINT
)
DISTRIBUTE BY HASH (coreid) INTO 16 BUCKETS
TBLPROPERTIES(
 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
 'kudu.table_name' = 'particle_test',
 'kudu.master_addresses' = 'ip-10-0-0-224.ec2.internal:7051',
 'kudu.key_columns' = 'coreid,published_at',
 'kudu.num_tablet_replicas' = '3'
);
CREATE TABLE `particle_counts_last_20_data` (
`data_word` STRING,
`count` BIGINT
)
DISTRIBUTE BY HASH (data_word) INTO 16 BUCKETS
TBLPROPERTIES(
 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
 'kudu.table_name' = 'particle_counts_last_20_data',
 'kudu.master_addresses' = 'ip-10-0-0-224.ec2.internal:7051',
 'kudu.key_columns' = 'data_word',
 'kudu.num_tablet_replicas' = '3'
);
CREATE TABLE `particle_counts_last_20` (
`event_word` STRING,
`count` BIGINT
)
DISTRIBUTE BY HASH (event_word) INTO 16 BUCKETS
TBLPROPERTIES(
 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
 'kudu.table_name' = 'particle_counts_last_20',
 'kudu.master_addresses' = 'ip-10-0-0-224.ec2.internal:7051',
 'kudu.key_columns' = 'event_word',
 'kudu.num_tablet_replicas' = '3'
);
CREATE TABLE `particle_counts_total` (
`event_word` STRING,
`count` BIGINT
)
DISTRIBUTE BY HASH (event_word) INTO 16 BUCKETS
TBLPROPERTIES(
 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
 'kudu.table_name' = 'particle_counts_total',
 'kudu.master_addresses' = 'ip-10-0-0-224.ec2.internal:7051',
 'kudu.key_columns' = 'event_word',
 'kudu.num_tablet_replicas' = '3'
);
CREATE TABLE `particle_counts_total_data` (
`data_word` STRING,
`count` BIGINT
)
DISTRIBUTE BY HASH (data_word) INTO 16 BUCKETS
TBLPROPERTIES(
 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
 'kudu.table_name' = 'particle_counts_total_data',
 'kudu.master_addresses' = 'ip-10-0-0-224.ec2.internal:7051',
 'kudu.key_columns' = 'data_word',
 'kudu.num_tablet_replicas' = '3'
);
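
Once data is flowing, these tables can be queried directly from Impala, for example (an illustrative query, not part of the repo):

SELECT event_word, `count`
FROM particle_counts_total
ORDER BY `count` DESC
LIMIT 10;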

Kafka create topic:

kafka-topics --create --zookeeper ip-10-0-0-224.ec2.internal:2181 --replication-factor 1 --partitions 1 --topic particle
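
To confirm that events published by particlespark.py are reaching the topic, the stock console consumer can be pointed at it (illustrative; adjust the ZooKeeper address to your cluster):

kafka-console-consumer --zookeeper ip-10-0-0-224.ec2.internal:2181 --topic particle --from-beginning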

spark-submit:

spark-submit --master yarn --jars kudu-spark_2.10-0.10.0.jar --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 iot_demo.py
spark-submit --master yarn --jars kudu-spark_2.10-0.10.0.jar --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 event_count.py
spark-submit --master yarn --jars kudu-spark_2.10-0.10.0.jar --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 data_count.py
spark-submit --master yarn --jars kudu-spark_2.10-0.10.0.jar --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 event_count_total.py
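
For reference, the 20-second windowed count in event_count.py amounts to the standard Spark Streaming pattern sketched below. It continues the iot_demo.py sketch above (reusing events, sqlContext, Row, and KUDU_MASTER), and the Kudu write should be read as a placeholder: whether a DataFrame append performs an insert or an upsert depends on the connector, so check event_count.py for the exact upsert call.

# Hedged sketch of the 20-second windowed word count in event_count.py.
word_counts = (events
               .flatMap(lambda e: (e.get("event") or "").split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 20, 20))

def write_counts(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(
        rdd.map(lambda kv: Row(event_word=kv[0], count=kv[1])))
    # Placeholder write to the particle_counts_last_20 Kudu table; the repo's
    # actual upsert mechanism may differ.
    df.write.format("org.kududb.spark.kudu") \
        .option("kudu.master", KUDU_MASTER) \
        .option("kudu.table", "particle_counts_last_20") \
        .mode("append") \
        .save()

word_counts.foreachRDD(write_counts)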

python producer:

python particlespark.py
