
1brc_streaming's Introduction

1brc challenge with streaming solutions for Apache Kafka

Inspired by the original 1brc challenge created by Gunnar Morling: https://www.morling.dev/blog/one-billion-row-challenge

⚠️ This is still a WIP project

⚠️ This challenge does not aim to be competitive with the original challenge. It is a challenge dedicated to streaming technologies that integrate with Apache Kafka. Results will be evaluated taking into consideration completely different measures.

Prerequisites

  • Docker Engine and Docker Compose
  • about XXGB of free space
  • the challenge will run only on these supported architectures:
    • Linux - x86_64
    • Darwin (Mac) - x86_64 and arm
    • Windows

Simulation Environment

  • Kafka cluster with 3 brokers. The cluster must be local only. Reserve approximately XXGB for data.
  • Input topic named data with 32 partitions, replication factor 3 and LogAppendTime timestamps
  • Output topic named results with 32 partitions and replication factor 3
  • The Kafka cluster must be started with the script run/bootstrap.sh from this repository; bootstrap.sh will also create the input and output topics (a sketch of the equivalent topic setup follows this list).
  • Brokers will listen on ports 9092, 9093 and 9094. No authentication, no SSL.
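
For reference, this is a minimal Java AdminClient sketch of the topic setup that bootstrap.sh performs, assuming the broker addresses above; the class name is illustrative and the real script may differ:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        try (AdminClient admin = AdminClient.create(props)) {
            // Input topic: 32 partitions, replication factor 3, broker-side timestamps.
            NewTopic data = new NewTopic("data", 32, (short) 3)
                    .configs(Map.of("message.timestamp.type", "LogAppendTime"));
            // Output topic: 32 partitions, replication factor 3.
            NewTopic results = new NewTopic("results", 32, (short) 3);
            admin.createTopics(List.of(data, results)).all().get();
        }
    }
}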

Rules

  • Implement a solution with the Kafka APIs, Kafka Streams, Flink, ksqlDB, Spark, NiFi, Camel Kafka, Spring Kafka... reading input data from the data topic and sinking results to the results topic, then run it! This is not limited to Java!

  • Ingest data into a Kafka topic:

    • Create 10 CSV files using the script run/data.sh or run/windows/data.exe from this repository. Reserve approximately 19GB for them. This will take several minutes to complete.
    • Each row is one record in the format <string: customer id>;<string: order id>;<double: price in EUR>, with the price value having exactly 2 fractional digits:
    ID672;IQRWG;363.81
    ID016;OEWET;9162.02
    ID002;IOIUD;15017.20
    ..........
    
    • There are 999 different customers
    • Price value: a non-null double between 0.00 (inclusive) and 50000.00 (inclusive), always with 2 fractional digits
    • Read from the CSV files AND continuously send data to the data topic using the script run/producer.sh from this repository
  • Output topic must contain messages with key/value and no additional headers:

    • Key: customer id, example ID672
    • Value: order count | count of orders with price > 40000 | min price | max price, example 1212 | 78 | 4.22 | 48812.22, grouped by key (a minimal Kafka Streams sketch of this aggregation follows this list)
    • Expected to contain 999 different messages, one per customer
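
To make the expected aggregation concrete, here is a minimal Kafka Streams sketch of one possible solution, assuming the input/output formats above. It is only an illustration (the string-based accumulator is chosen for brevity, not speed) and is unrelated to the sample app in the challenge folder:

import java.util.Locale;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class AggregationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "1brc-aggregation-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("data")
                // Each value is "<customer id>;<order id>;<price>"; re-key by customer id.
                .selectKey((key, value) -> value.split(";")[0])
                .groupByKey()
                // Accumulator layout: "count|countOver40000|min|max".
                .aggregate(
                        () -> "0|0|50000.00|0.00",
                        (customer, value, agg) -> {
                            double price = Double.parseDouble(value.split(";")[2]);
                            String[] p = agg.split("\\|");
                            long count = Long.parseLong(p[0]) + 1;
                            long over = Long.parseLong(p[1]) + (price > 40000 ? 1 : 0);
                            double min = Math.min(Double.parseDouble(p[2]), price);
                            double max = Math.max(Double.parseDouble(p[3]), price);
                            return count + "|" + over + "|" + min + "|" + max;
                        },
                        Materialized.with(Serdes.String(), Serdes.String()))
                .toStream()
                // Render in the expected "count | over | min | max" layout.
                .mapValues(agg -> {
                    String[] p = agg.split("\\|");
                    return String.format(Locale.ROOT, "%s | %s | %.2f | %.2f",
                            p[0], p[1], Double.parseDouble(p[2]), Double.parseDouble(p[3]));
                })
                .to("results", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}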

💡 The Kafka cluster runs cp-kafka, the official Confluent Docker image for Kafka (Community Version), version 7.6.0, shipping Apache Kafka 3.6.x

💡 Verify the messages published to the data topic with the run/consumer.sh script, which uses https://raw.githubusercontent.com/confluentinc/librdkafka/master/examples/consumer.c. To run the consumer, verify that you have librdkafka installed.
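
If you prefer to verify from Java instead of the librdkafka consumer, this is a minimal sketch of an equivalent check, assuming the local broker on port 9092; the class and group names are illustrative:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VerifySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "verify-data");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("data"));
            // Print a few batches of records, then exit.
            for (int i = 0; i < 10; i++) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(rec.key() + " -> " + rec.value());
                }
            }
        }
    }
}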

How to test the challenge

  1. Run the script run/data.sh or run/windows/data.exe to create 1B rows split across 10 CSV files.
  2. Run the script run/bootstrap.sh to set up the Kafka cluster and the required topics.
  3. Deploy your solution and run it, publishing results to the results topic.
  4. Run the script run/producer.sh in a new terminal. The producer will read from the input files and publish to the data topic.

At the end, clean up with the script run/tear-down.sh.
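
For orientation, here is a minimal sketch of what the producing step does, assuming a single CSV file and string serialization. The real run/producer.sh is multi-threaded, and keying by customer id here is an assumption, not a documented behaviour:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = Files.newBufferedReader(Path.of("data_1.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Key by customer id (assumption) so a customer's orders share a partition.
                String customerId = line.split(";")[0];
                producer.send(new ProducerRecord<>("data", customerId, line));
            }
        }
    }
}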

How to participate in the challenge

  1. Fork this repo
  2. Add your solution to folder challenge-YOURNAME, example challenge-hifly
  3. Open a Pull Request detailing your solution with instructions on how to deploy it

✅ Your solution will be tested using the same docker-compose file. Results will be published on this page.

💻 Solutions will be tested on a (TODO) server

💡 A sample Kafka Streams implementation is present in the challenge folder. Test it with:

cd challenge
mvn clean compile && mvn exec:java -Dexec.mainClass="io.hifly.onebrcstreaming.SampleApp"

Contributors

hifly81, paoven, ram-pi


Issues

Producer: threads running < number of csv files

There are 10 CSV files in the folder but the producer is using only 4 threads; I see that this is hard-coded in the producer Java file:

#! /usr/bin/java --class-path ./kafka_2.13-3.6.1/libs/* --source 17 -XX:ActiveProcessorCount=4

Is there a way to determine that value from the host machine?

Run producer...
Producing...
Number of files: 10
Thread ID: 24 Processing file: data_1.csv
Thread ID: 25 Processing file: data_10.csv
Thread ID: 1 Processing file: data_6.csv
Thread ID: 26 Processing file: data_3.csv
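
One possible answer, as a sketch: size the pool from Runtime.getRuntime().availableProcessors() instead of hard-coding it. Note that the -XX:ActiveProcessorCount=4 flag in the shebang above caps exactly that value, so it would have to be removed as well; names here are illustrative:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizeSketch {
    public static void main(String[] args) {
        int files = 10; // number of CSV files to process
        // availableProcessors() reflects the host, unless capped by -XX:ActiveProcessorCount.
        int threads = Math.min(files, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println("Using " + threads + " threads for " + files + " files");
        pool.shutdown();
    }
}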
