Practices - Data engineering

Have a Stack Overflow account: https://stackoverflow.com/

Have a GitHub account: https://github.com/

And a GitHub repo to push your code to.

Fork the repo on your own Github account

https://github.com/polomarcus/tp/fork

Docker and Compose

Take time to read and install

https://docs.docker.com/get-started/overview/

docker --version
Docker version 20.10.14

https://docs.docker.com/compose/

docker-compose --version
docker-compose version 1.29.2

TP2 - Functional programming for data engineering

You're the new data engineer on a scientific team in charge of monitoring CO2 levels in the atmosphere, which are at their highest in 800,000 years.

You have to give your best estimate of CO2 levels for 2050.

Your engineering team is known for taking great care of the developer experience: using types, small functions (using .map, .filter, .reduce), tests, and logs.

Your goal is to map, parse, and filter atmospheric CO2 concentration levels recorded by an observatory in Hawaii from 1950 to 2022.

The CO2 concentration data has already been placed inside utils/ClimateService.

How to write a Scala application ?

Scala IDE : https://www.jetbrains.com/idea/
Scala https://docs.scala-lang.org/getting-started/index.html
Scala build tool (SBT) : https://www.scala-sbt.org/download.html

If you could install SBT on your machine

sbt run

This should give you an "an implementation is missing" error:

(...)
2022-05-16 18:33:48 [run-main-0] INFO  com.github.polomarcus.main.Main$ - Starting the app
[error] (run-main-0) scala.NotImplementedError: an implementation is missing

The same goes for sbt test.

If you couldn't install SBT on your machine

Tip: having trouble installing IDEA, SBT, or Scala? You can use Docker and Docker Compose to run this code, and write it in your usual IDE or in a web IDE such as https://scastie.scala-lang.org/:

docker-compose build my-scala-app
docker-compose run my-scala-app bash # connect to your container to access SBT
> sbt test
# or 
> sbt run

Continuous build and test

Pro Tips : https://www.scala-sbt.org/1.x/docs/Running.html#Continuous+build+and+test

Make a command run when one or more source files change by prefixing the command with ~. For example, in sbt shell try:

sbt
> ~ testQuick

Test Driven Development (TDD) - Write a function and its tests that detect climate-related sentences

Look at the "isClimateRelated" tests in test/scala/ClimateService and add one more test
Look at and update the "isClimateRelated" function inside main/scala/com/github/polomarcus/utils/ClimateService
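As a starting point for the TDD loop, here is a minimal sketch of a keyword-based check. The signature, the object name and the keyword list are assumptions made for illustration; the real function in ClimateService may look different.

```scala
object ClimateSentenceSketch {
  // Hypothetical keyword list: adapt it to what your tests expect.
  private val climateKeywords: List[String] =
    List("climate change", "global warming", "ipcc")

  /** True when the lower-cased sentence contains at least one keyword. */
  def isClimateRelated(sentence: String): Boolean = {
    val normalized = sentence.toLowerCase
    climateKeywords.exists(keyword => normalized.contains(keyword))
  }
}
```

Write the failing test first (a sentence that should match, one that should not), then grow the function until sbt test passes.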

Write a function that uses `Option[T]` to handle CO2 records

With data coming from Hawaii about CO2 concentration in the atmosphere, iterate over it and find the difference between the max and the min value.

Look at the "parseRawData" tests in test/scala/ClimateService and add one more test
Look at and update the "parseRawData" function inside main/scala/com/github/polomarcus/utils/ClimateService
Create your own function to find the min and max values. Write unit tests and run sbt test.

Tips: def getMinMax() : (Int, Int)
Use the Scala API to get the max and min from a list : https://www.w3resource.com/scala-exercises/list/scala-list-exercise-6.php
You can also use "reduce functions" such as foldLeft : https://alvinalexander.com/scala/how-to-walk-scala-collections-reduceleft-foldright-cookbook/
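The steps above (parse, keep valid readings, find the min and max) can be sketched as follows. This is only an illustration under assumptions: the CO2Record fields, the tuple input format, the "ppm must be positive" validity rule and the Option/Double return types are all hypothetical, and they differ from the (Int, Int) signature suggested in the tip.

```scala
object Co2ParsingSketch {
  // Hypothetical record type; the real CO2Record in ClimateService may differ.
  final case class CO2Record(year: Int, month: Int, ppm: Double)

  /** Keeps a reading as Some(record) when ppm is positive, None otherwise. */
  def parseRawData(raw: List[(Int, Int, Double)]): List[Option[CO2Record]] =
    raw.map { case (year, month, ppm) =>
      if (ppm > 0) Some(CO2Record(year, month, ppm)) else None
    }

  /** Single-pass (min, max) over the valid readings; None when there are none. */
  def getMinMax(records: List[Option[CO2Record]]): Option[(Double, Double)] =
    records.flatten.map(_.ppm) match {
      case Nil => None
      case head :: tail =>
        Some(tail.foldLeft((head, head)) { case ((min, max), ppm) =>
          (math.min(min, ppm), math.max(max, ppm))
        })
    }

  /** Difference between the max and the min, when both exist. */
  def maxMinDifference(records: List[Option[CO2Record]]): Option[Double] =
    getMinMax(records).map { case (min, max) => max - min }
}
```

Returning an Option forces the caller to decide what to do with an empty dataset, instead of throwing on an empty list the way .min and .max do.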

Create your own function to find the min and max values for a specific year. Write unit tests. Tips:

Reuse getMinMax to create this function :
def getMinMaxByYear(year: Int) : (Int, Int)
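One possible way to reuse getMinMax is to filter on the year first and delegate the rest. The sketch below assumes a hypothetical CO2Record type and passes the records explicitly, so its signatures are not the ones from the tip.

```scala
object MinMaxByYearSketch {
  // Hypothetical record type, used only for this illustration.
  final case class CO2Record(year: Int, month: Int, ppm: Double)

  /** (min, max) of the ppm values, or None for an empty list. */
  def getMinMax(records: List[CO2Record]): Option[(Double, Double)] =
    if (records.isEmpty) None
    else {
      val values = records.map(_.ppm)
      Some((values.min, values.max))
    }

  /** Same computation, restricted to the readings of one year. */
  def getMinMaxByYear(records: List[CO2Record], year: Int): Option[(Double, Double)] =
    getMinMax(records.filter(_.year == year))
}
```

A year with no readings yields None, which is a useful case to cover in a unit test.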

Create your own function to compute the difference between the max and the min. Write unit tests.

Tips:

Iteration - filter

Remove all December (12) values, since winter makes the data unreliable there, by implementing filterDecemberData inside main/scala/com/github/polomarcus/utils/ClimateService
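A minimal sketch of such a filter, assuming the input is a list of optional records and that the function should also drop the missing ones; both the types and that choice are assumptions, not the repository's actual contract.

```scala
object FilterSketch {
  // Hypothetical record type, used only for this illustration.
  final case class CO2Record(year: Int, month: Int, ppm: Double)

  /** Drops missing readings (None) and every December reading (month 12). */
  def filterDecemberData(records: List[Option[CO2Record]]): List[CO2Record] =
    records.flatten.filter(_.month != 12)
}
```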

Iteration - map

Implement showCO2Data inside main/scala/com/github/polomarcus/utils/ClimateService
Make your Main program work using sbt run
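As an illustration of the map step, the sketch below turns each record into a display line and then prints the lines. The output format, the formatCO2Data helper and the record type are all hypothetical; check the comments in ClimateService for what showCO2Data is really expected to log.

```scala
object ShowSketch {
  // Hypothetical record type, used only for this illustration.
  final case class CO2Record(year: Int, month: Int, ppm: Double)

  /** Pure part: map each record to a human-readable line. */
  def formatCO2Data(records: List[CO2Record]): List[String] =
    records.map(r => s"${r.year}-${r.month}: ${r.ppm} ppm")

  /** Side-effecting part: print the formatted lines. */
  def showCO2Data(records: List[CO2Record]): Unit =
    formatCO2Data(records).foreach(println)
}
```

Keeping the formatting in a pure function makes it easy to unit test without capturing console output.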

Bonus

Estimate CO2 levels for 2050 based on past data.
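One simple way to extrapolate is to fit a straight line through (year, ppm) points with ordinary least squares and evaluate it at 2050. The sketch below is only that: a linear model chosen for simplicity, with hypothetical function names. Whether a straight line is a good model for this dataset is part of the exercise.

```scala
object TrendSketch {
  /** Fits y = intercept + slope * x by ordinary least squares.
    * Returns (slope, intercept), or None when the fit is undefined. */
  def fitLine(points: List[(Double, Double)]): Option[(Double, Double)] = {
    val n = points.size
    if (n < 2) None
    else {
      val meanX = points.map(_._1).sum / n
      val meanY = points.map(_._2).sum / n
      val sxx = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
      val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      if (sxx == 0.0) None // all x values identical: no unique line
      else {
        val slope = sxy / sxx
        Some((slope, meanY - slope * meanX))
      }
    }
  }

  /** Evaluates the fitted line at x, when a fit exists. */
  def predict(points: List[(Double, Double)], x: Double): Option[Double] =
    fitLine(points).map { case (slope, intercept) => intercept + slope * x }
}
```

For example, TrendSketch.predict(yearlyAverages, 2050.0), where yearlyAverages is a list of (year, average ppm) pairs that you build from the parsed records.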

What would you do if the data arrived as a continuous stream ?

Tips: Batch processing / Stream processing ?
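To make the batch versus stream difference concrete: a batch job sees the whole list before computing min and max, while a streaming job has to update its result as each value arrives. The sketch below imitates the second case with a plain Iterator; it is a toy model with a hypothetical function name, not a stream processing framework.

```scala
object StreamSketch {
  /** Emits the running (min, max) after each incoming value.
    * The iterator is consumed lazily, one element at a time. */
  def runningMinMax(values: Iterator[Double]): Iterator[(Double, Double)] =
    if (!values.hasNext) Iterator.empty
    else {
      val first = values.next()
      values.scanLeft((first, first)) { case ((min, max), v) =>
        (math.min(min, v), math.max(max, v))
      }
    }
}
```

Only the current (min, max) pair is kept as state, so memory use does not grow with the number of values received.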

Continuous Integration (CI)

If it works on your machine, congrats !

Now test it on a remote server thanks to a Continuous Integration (CI) system such as GitHub Actions :

Have a look at the .github/workflows folder and files
Something weird ? Have a look at their documentation : https://github.com/features/actions
Ready to run a CI job ? Go to your GitHub fork/clone of this repo and find the "Actions" tab
Find your CI job running
Create a CI workflow using Docker to run the sbt test command (inspiration : https://github.com/polomarcus/television-news-analyser/blob/main/.github/workflows/docker-compose.yml#L7-L17)

Tools

Scala IDE : https://www.jetbrains.com/idea/

TP1 - Apache Kafka

Communication problems

Why Kafka ?

https://kafka.apache.org/documentation/#introduction

Answer these questions with what you can find in the documentation :

What problems does Kafka solve ?
Which use cases ?
What is a producer ?
What is a consumer ?
What are consumer groups ?
What is an offset ?
Why use partitions ?
Why use replication ?
What are In-Sync Replicas (ISR) ?

Try to install Kafka without docker

https://kafka.apache.org/documentation/#gettingStarted

Use kafka with docker

Start multiple Kafka servers (called brokers) by downloading a Docker Compose recipe :

https://github.com/conduktor/kafka-stack-docker-compose#single-zookeeper--multiple-kafka

Check on Docker Hub which image is used :

https://hub.docker.com/r/confluentinc/cp-kafka

Verify

docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS         PORTS                                                                                  NAMES
b015e1d06372   confluentinc/cp-kafka:7.0.1       "/etc/confluent/dock…"   10 seconds ago   Up 9 seconds   0.0.0.0:9092->9092/tcp, :::9092->9092/tcp, 0.0.0.0:9999->9999/tcp, :::9999->9999/tcp   kafka1
(...)

Getting started with Kafka

Connect to your Kafka cluster with 2 command-line interfaces (CLI)

Using Docker exec

docker exec -ti my_kafka_container_name bash
> pwd

> kafka-topics 
# will give you help to use this command
> kafka-topics --describe --bootstrap-server localhost:9092 
# will give you an error

Read this blog article to fix the "Broker may not be available" error : https://rmoff.net/2018/08/02/kafka-listeners-explained/

Pay attention to the KAFKA_ADVERTISED_LISTENERS config from the docker-compose file.

Create a "mailbox" - a topic with the default config : https://kafka.apache.org/documentation/#quickstart_createtopic
Check on which Kafka broker the topic is located using --describe
Send events to a topic on one terminal : https://kafka.apache.org/documentation/#quickstart_send
Keep reading events from a topic from one terminal : https://kafka.apache.org/documentation/#quickstart_consume

try the default config
what does the --from-beginning option do ?
what about using the --group option for your consumer ?

stop reading
Keep sending some messages to the topic

Partition

Check your consumer group with kafka-consumer-groups : https://kafka.apache.org/documentation/#basic_ops_consumer_group

notice if there is lag for your group

read from a new group, what happened ?
read from an already existing group, what happened ?
Recheck the consumer group
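The lag reported for a group is, per partition, the log end offset minus the offset the group has committed. A small pure-Scala sketch of that arithmetic, with made-up numbers and hypothetical names (it does not talk to Kafka):

```scala
object LagSketch {
  /** Lag per partition = log end offset minus the group's committed offset.
    * Partitions without a committed offset are left out of the result. */
  def lagByPartition(
      logEndOffsets: Map[Int, Long],
      committedOffsets: Map[Int, Long]
  ): Map[Int, Long] =
    logEndOffsets.collect {
      case (partition, end) if committedOffsets.contains(partition) =>
        partition -> (end - committedOffsets(partition))
    }
}
```

A group that stopped reading while the producer kept sending shows a growing lag; once it resumes and catches up, the lag returns to 0.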

Replication - High Availability

Increase replication in case one of your brokers goes down : https://kafka.apache.org/documentation/#topicconfigs
Stop one of your brokers with docker
Describe your topic and check the ISR (in-sync replicas) : https://kafka.apache.org/documentation/#design_ha
Restart your stopped broker
Check your topic again
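What you should observe in these steps can be modelled roughly as follows. This is a deliberately simplified sketch with hypothetical names: it treats a replica as in sync as soon as its broker is up, whereas a real broker also has to catch up with the leader before rejoining the ISR.

```scala
object IsrSketch {
  /** Simplified view: a replica counts as in sync only if its broker is up. */
  def inSyncReplicas(replicas: List[Int], liveBrokers: Set[Int]): List[Int] =
    replicas.filter(liveBrokers.contains)

  /** A partition is under-replicated when its ISR is smaller than its replica list. */
  def isUnderReplicated(replicas: List[Int], liveBrokers: Set[Int]): Boolean =
    inSyncReplicas(replicas, liveBrokers).size < replicas.size
}
```

With replicas on brokers 1, 2 and 3 and broker 2 stopped, the model gives an ISR of 1 and 3 and flags the partition as under-replicated; after the restart, the ISR should grow back to all three.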

TP 3 - Kafka Streams to read and write to Kafka

https://kafka.apache.org/documentation/streams/
https://github.com/polomarcus/Spark-Structured-Streaming-Examples
Kafka User Interface (UI) : https://www.conduktor.io/download/

Continuous Integration (CI)

If it works on your machine, congrats. Now test it on a remote server thanks to a Continuous Integration (CI) system such as GitHub Actions :

How to use containers inside a CI : https://docs.github.com/en/github-ae@latest/actions/using-containerized-services/about-service-containers
A Github Action example : https://github.com/conduktor/kafka-stack-docker-compose/blob/master/.github/workflows/main.yml
