
datastreaming-nanodegree-class's Introduction

Data Engineer @ Kakao | Builds and operates real-time data pipelines | Develops Kafka-based platforms | Speaks at lectures and seminars on streaming data | Best-selling author on Apache Kafka

About
- Developed a data pipeline platform based on Kafka Connect
- Built large-scale stateful applications on Kafka Streams
- Improved and operated data ingestion pipelines
- Developed and operated high-volume data processing and encryption/decryption applications with Spark
- Serves internal and external customers with a proactive attitude and a positive mindset
- Strives to grow together with colleagues
- Active in sharing and evangelizing technology inside and outside the company

Books

아파치 카프카 애플리케이션 프로그래밍 with 자바 (Apache Kafka Application Programming with Java)

Written for developers preparing for new development trends with Apache Kafka, this book aims to be a comprehensive Apache Kafka resource. It is the first Korean book to cover Kafka's core MirrorMaker 2 feature, along with Spring Kafka and Kafka in the cloud, so it offers room to grow not only for developers about to adopt Apache Kafka but also for those already using it. A hands-on project modeled on production-style architectures, together with 38 example source files, teaches the techniques and code used in real work.

실시간 데이터 파이프라인 아키텍처 (Real-Time Data Pipeline Architecture)

This book provides clear guidance on how to design and operate a real-time data architecture that meets business goals. Starting from a definition of streaming data, it explains the role of each architectural layer, how it works, its trade-offs, and which option fits which situation. It closes with hands-on coding exercises that put the streaming-architecture concepts into practice. Because it covers everything from basic concepts through architecture design to code practice, it will help developers, engineers, and team leads who are thinking about real-time data processing.

datastreaming-nanodegree-class's People

Contributors

andersonchoi


datastreaming-nanodegree-class's Issues

Kafka CLI Tools

In this exercise, you will learn how to use the most common Kafka CLI tools.

Listing topics

First, let's see how to list topics that already exist within Kafka.

kafka-topics --list --zookeeper localhost:2181

When you run this command, you should see output like the following:

__confluent.support.metrics
__consumer_offsets
_confluent-ksql-ksql_service_docker_command_topic
_schemas
connect-config
connect-offset
connect-status

The --list switch tells the kafka-topics CLI to list all known topics.

The --zookeeper localhost:2181 switch tells the kafka-topics CLI where to find the
ZooKeeper ensemble that Kafka uses. Note that in newer versions of Kafka the
--zookeeper switch is deprecated in favor of a --bootstrap-server switch that points directly at the Kafka brokers. The --zookeeper switch still works, but will likely be dropped in a future major release of Kafka.
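
For reference, the broker-based form of the same command looks like this (a sketch assuming Kafka 2.2 or newer, with a broker listening on localhost:9092):

kafka-topics --list --bootstrap-server localhost:9092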

We haven't created any topics yet, so what you're seeing are system topics that Kafka and other
Kafka ecosystem tools make use of.

Creating topics

Now that we've seen what topics exist, let's create one.

kafka-topics --create --topic "my-first-topic" --partitions 1 --replication-factor 1 --zookeeper localhost:2181

When you run this command, it should exit silently after a few moments.

The --topic "my-first-topic" switch tells Kafka what to name the topic.

The --partitions 1 and --replication-factor 1 switches are required configuration options,
which we will explore further in the next lesson.

To check that our topic was successfully created, let's repeat the command to list topics with a
slight modification:

kafka-topics --list --zookeeper localhost:2181 --topic "my-first-topic"

Now, a single topic should be printed, like so:

my-first-topic
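
To inspect the partition and replication settings you just specified, the --describe switch prints per-topic details (the exact output format varies slightly between Kafka versions):

kafka-topics --describe --topic "my-first-topic" --zookeeper localhost:2181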

Producing data

Now that we have a topic, let's add some data.

kafka-console-producer --topic "my-first-topic" --broker-list PLAINTEXT://localhost:9092

The --broker-list switch serves the same purpose for kafka-console-producer that --zookeeper
serves for kafka-topics: it tells the tool where to find Kafka.

When you hit enter, you should be dropped into an interactive terminal.

Try typing out a few messages and hitting enter.

root@6b48dc2bd81c:/# kafka-console-producer --topic "my-first-topic" --broker-list PLAINTEXT://localhost:9092
>hello
>world!
>my
>first
>kafka
>events!
>
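
Because the console producer reads from standard input, you can also publish a file of messages non-interactively, one message per line (messages.txt here is a hypothetical file used for illustration):

kafka-console-producer --topic "my-first-topic" --broker-list PLAINTEXT://localhost:9092 < messages.txt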

Consuming data

While it's great that we've produced data, it would be more exciting if we could see it being
consumed.

Open a new terminal tab and run the following command:

kafka-console-consumer --topic "my-first-topic" --bootstrap-server PLAINTEXT://localhost:9092

Notice that nothing prints out? Remember that by default Kafka doesn't provide historical messages to new consumers. Return to the producer and enter a few new messages. You should see them come across the screen!

Hit Ctrl+C to exit the consumer. Let's try this again, but ask Kafka to provide all the messages that have been published to the topic:

kafka-console-consumer --topic "my-first-topic" --bootstrap-server PLAINTEXT://localhost:9092 --from-beginning

The --from-beginning switch tells kafka-console-consumer to read data from the beginning of the topic, not just data from when we connect.

You should now see output that includes all of the messages you've produced:

root@6b48dc2bd81c:/# kafka-console-consumer --topic "my-first-topic" --bootstrap-server PLAINTEXT://localhost:9092 --from-beginning
hello
world!
my
first
kafka
events!
hello again!
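
Consumers can also join a named consumer group so that Kafka tracks their offsets and balances partitions across the group's members (my-first-group is an arbitrary name chosen for this example):

kafka-console-consumer --topic "my-first-topic" --bootstrap-server PLAINTEXT://localhost:9092 --group my-first-group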

Deleting topics

Now that we've explored working with the CLI tools, let's clean up our topic.

kafka-topics --delete --topic "my-first-topic" --zookeeper localhost:2181

This command is nearly identical to the earlier --create command, except that it uses the --delete switch instead.

This command does not print any output if it's successful. To check that your topic is actually deleted, list the topics one more time:

kafka-topics --list --zookeeper localhost:2181

my-first-topic should no longer appear in the list of topics.

Batch vs. Stream Processing

Batch Processing

  • Runs on a scheduled basis
  • May run for a longer period of time and write results to a SQL-like store
  • May analyze all historical data at once
  • Typically works with mutable data and data stores

Stream Processing

  • Runs at whatever frequency events are generated
  • Typically runs quickly, updating in-memory aggregates
  • Stream Processing applications may simply emit events themselves, rather than write to an event store
  • Typically analyzes trends over a limited period of time due to data volume
  • Typically analyzes immutable data and data stores

Batch and Stream processing are not mutually exclusive. Batch systems can create events to feed into stream processing applications, and similarly, stream processing applications can be part of batch processing analyses.

Streaming Data Store

  • May look like a message queue, as is the case with Apache Kafka
  • May look like a SQL store, as is the case with Apache Cassandra
  • Responsible for holding all of the immutable event data in the system
  • Provides a guarantee that data is stored in the order it was produced
  • Provides a guarantee that data is delivered to consumers in the order it was received
  • Provides a guarantee that the events it stores are immutable

Stream Processing Application and Framework

  • Stream Processing applications sit downstream of the data store
  • Stream Processing applications ingest real-time event data from one or more data streams
  • Stream Processing applications aggregate, join, and find differences in data from these streams
  • Common Stream Processing Application Frameworks in use today include:
    • Confluent KSQL
    • Kafka Streams
    • Apache Flink
    • Apache Samza
    • Apache Spark Structured Streaming
    • Faust Python Library

Further Optional Reading on Message Queues

  • RabbitMQ
  • ActiveMQ

Benefits of Stream Processing

  • Faster for scenarios where a limited set of recent data is needed
  • More scalable due to distributed nature of storage
  • Provides a useful abstraction that decouples applications from each other
  • Allows one set of data to satisfy many use-cases which may not have been predictable when the dataset was originally created
  • Built-in ability to replay events and observe exactly what occurred, and in what order, provides more opportunities to recover from error states or dig into how a particular result was arrived at

Kafka Connect Troubleshooting Tips

If you run into trouble with Kafka Connect in the workspace exercise below, or during your project, here are some tips to help with your debugging:

  • First, use the REST API to check the connector status: curl http://<connect_url>/connectors/<your_connector>/status shows the current status of your connector.

  • Next, use the REST API to check the task status for the connector: curl http://<connect_url>/connectors/<your_connector>/tasks/<task_id>/status shows the current status of the task.
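
For example, assuming Connect's REST API is listening on localhost at its default port 8083 and the connector is named my-connector (both values are illustrative), the two checks look like this:

curl http://localhost:8083/connectors/my-connector/status
curl http://localhost:8083/connectors/my-connector/tasks/0/status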

If you can't deduce the failure from these two checks, the next best bet is to examine the Kafka Connect logs. A tool like tail or less is useful for this. On Linux systems, Kafka Connect logs are often available in /var/log/kafka/. Kafka Connect logging is often verbose and will usually indicate the issue it is experiencing.
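
A minimal sketch of tailing the log while filtering for problems, assuming a log file named connect.log in the directory above (the exact file name varies by installation):

tail -f /var/log/kafka/connect.log | grep -i -E 'error|exception'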

If you are familiar with Java Management Extensions (JMX) and have access to the server, you may also opt to inspect its JMX metrics for information on failures. However, JMX is most useful for automated monitoring, so you likely will not receive any additional insights from using JMX vs the API or the logs.
