
Comments (4)

onurbaran commented on September 28, 2024

Caminte is an interesting project for hybrid data storage solutions. We can use it if we need to separate our data into hot and cold segments.
Let's go step by step and define our high-level components and requirements. (It will be a bit of a long story, to warm myself up.)

[image: high-level pipeline diagram]

1. Streaming Processor
This stage is where we touch incoming data for the first time. One of the most important points to pay attention to here is that our communication with the API layer must continue without being blocked.

First of all, we must think about which technologies, tools, and concepts we need. Generally, we should check our requirements and goals at this step. Our first goal is to handle incoming requests reliably under high traffic and to create the ideal pipeline for real-time data processing. If we are playing the game on data-driven projects, we must think a few (i++) steps ahead of the current requirements/needs. So I added analytics and long-term storage stages to my high-level diagram to describe the general approach. I also have some analytics plans about implementing Hidden Markov Models on real-time click-stream data.

We can use Apache Kafka to handle the API's data and distribute streaming events to the Real Time, Analytics, and Storage parts of the project with Kafka topics.
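
To make the fan-out idea concrete, here is a minimal Go sketch, assuming the segmentio/kafka-go client and a local broker; the topic names and event payload are illustrative placeholders for the real-time, analytics, and storage branches, not a fixed design.

```go
// Hypothetical sketch: fan a click event out to per-concern Kafka topics.
// Assumes github.com/segmentio/kafka-go and a broker on localhost:9092.
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()
	event := []byte(`{"page":"home","user":"u42","ts":1609968000}`)

	// One writer per downstream concern: real-time, analytics, long-term storage.
	for _, topic := range []string{"clicks.realtime", "clicks.analytics", "clicks.storage"} {
		w := &kafka.Writer{
			Addr:     kafka.TCP("localhost:9092"),
			Topic:    topic,
			Balancer: &kafka.LeastBytes{},
		}
		if err := w.WriteMessages(ctx, kafka.Message{Key: []byte("home"), Value: event}); err != nil {
			log.Fatalf("write to %s failed: %v", topic, err)
		}
		w.Close()
	}
}
```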

2. Streaming Engine
The streaming engine's main behaviour is to consume the processor's (1) data and invoke some pre-calculation jobs. Pre-calculation jobs can be invoked according to Kafka topics or the incoming streaming data's attributes. We can use Apache Druid, Spark, KSQL (Kafka SQL), Apache Flink, or custom solutions as a consumer. For example, if we want to continue with a custom solution, we can use Haskell (yeah, the applied mathematician guy is here), Go, Scala, Python... I recommend using Go and Go channels for the consumer part of this project, because we can invoke some mini hidden processing pipelines with Go channels. "Hidden processing pipeline" here means "auto-generated processing rules". You can think of these operations as a simple map-reduce flow; see the sketch after the diagram below. Of course, the biggest difference is that we will manage our processor on the fly, like Spark processors.

[image: streaming engine / channel pipeline diagram]
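
To illustrate the "mini processing pipeline" idea, here is a minimal stdlib-only Go sketch: a map stage projects each event down to its page name, and a reduce stage folds the stream into running counts. The event shape and the two-stage split are assumptions made for the example, not part of the actual design.

```go
// Minimal sketch of a channel-based map/reduce style pipeline in Go.
package main

import "fmt"

type click struct{ Page, User string }

func main() {
	events := make(chan click)
	pages := make(chan string)

	// Map stage: project each event down to the attribute we aggregate on.
	go func() {
		for e := range events {
			pages <- e.Page
		}
		close(pages)
	}()

	// Feed a few sample events into the pipeline.
	go func() {
		for _, e := range []click{{"home", "u1"}, {"pricing", "u2"}, {"home", "u3"}} {
			events <- e
		}
		close(events)
	}()

	// Reduce stage: fold the stream into per-page counts.
	counts := map[string]int{}
	for p := range pages {
		counts[p]++
	}
	fmt.Println(counts) // map[home:2 pricing:1]
}
```

In a real consumer, the map stage would be generated on the fly from the "auto-generated processing rules" mentioned above, and the reduce stage would push its results to the real-time store.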

So we can pass pre-calculated data to our real-time stage. The real-time stage means keeping only hot data and serving dashboards, alerts, and similar tools over an API, sockets, and so on. We can use different kinds of technologies with some "analytics-first thinking" techniques. For example, we can use Redis with a timestamp:page_name -> count notation (sketched below). Or we can use Elastic, VoltDB, MongoDB... All these technologies have pros and cons, so we need to focus on our goals and base requirements.
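
As a rough sketch of that Redis notation, assuming the go-redis client, a local Redis, and per-minute time buckets (all illustrative choices, not requirements):

```go
// Hedged sketch of the "timestamp:page_name -> count" hot-path counters.
// Assumes github.com/redis/go-redis/v9 and Redis on localhost:6379.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// One counter per minute bucket and page; dashboards read the last N buckets.
	bucket := time.Now().UTC().Format("200601021504")
	key := fmt.Sprintf("%s:%s", bucket, "home")

	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		panic(err)
	}
	rdb.Expire(ctx, key, 24*time.Hour) // hot data only; cold data lives elsewhere
	fmt.Println(key, count)
}
```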


muratdemirci commented on September 28, 2024

We can use a cross-DB ORM for adaptation, which means that if someone already has an end product (a website, web app, mobile app, etc.), they can adapt it to the current product smoothly.
Example: https://github.com/biggora/caminte


cagataycali commented on September 28, 2024

https://druid.apache.org/


okankaraduman commented on September 28, 2024

Hello,
I think it's very important to check best practices. Below is a schema of the Snowplow data processing pipeline (a do-it-yourself solution for clickstream data).
[image: Snowplow architecture diagram]
They offer various trackers, but that is not our case. After we take our data from the JavaScript tracker, there is an API (written in Scala) for collecting it and publishing that data to Kafka topics. (In GCP we could consider Pub/Sub, or AWS Kinesis, but they are far more expensive.) There is an enrichment process in which we enrich the data and write it to another topic (geolocation, for example; worth checking out here: https://docs.snowplowanalytics.com/docs/enriching-your-data/available-enrichments/ip-lookup-enrichment/).
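
As a rough illustration of what such an enrichment step could look like in Go (the geoLookup helper and the field names are hypothetical placeholders standing in for a real IP database such as MaxMind):

```go
// Illustrative sketch of an enrichment step in the spirit of Snowplow's
// IP-lookup enrichment: read a raw event, attach derived fields, and emit
// the result to an "enriched" stream.
package main

import "fmt"

type rawEvent struct {
	IP   string
	Page string
}

type enrichedEvent struct {
	rawEvent
	Country string
	City    string
}

// geoLookup stands in for a real IP-geolocation database lookup.
func geoLookup(ip string) (country, city string) {
	return "TR", "Istanbul"
}

func enrich(e rawEvent) enrichedEvent {
	country, city := geoLookup(e.IP)
	return enrichedEvent{rawEvent: e, Country: country, City: city}
}

func main() {
	out := enrich(rawEvent{IP: "203.0.113.7", Page: "home"})
	fmt.Printf("%+v\n", out) // would be written to the enriched Kafka topic
}
```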

Finally, there is a loader API for loading data into the database. In GCP, their loader API simply starts Dataflow jobs to ingest data into BigQuery. (It's expensive.)

My Thoughts About the Data Pipeline

I fully agree with @onurbaran about using Go for the collector, enrichment, and loader APIs (the world is moving that way). For streaming data, I think the only real option is Kafka. There is a managed version of it offered by Confluent, if we do not want to operate it ourselves: https://www.confluent.io/confluent-cloud/
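
As a rough idea of a Go collector endpoint (only a sketch: the /collect path, the port, and the publish placeholder standing in for a real Kafka producer are all assumptions):

```go
// Minimal sketch of a collector: accept tracker payloads over HTTP and hand
// them to the streaming layer.
package main

import (
	"io"
	"log"
	"net/http"
)

// publish is a placeholder; a real collector would write to a Kafka topic here.
func publish(payload []byte) error {
	log.Printf("collected %d bytes", len(payload))
	return nil
}

func main() {
	http.HandleFunc("/collect", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil || len(body) == 0 {
			http.Error(w, "empty payload", http.StatusBadRequest)
			return
		}
		if err := publish(body); err != nil {
			http.Error(w, "publish failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```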

Finally, the schema below encapsulates all my thoughts about the pipeline. I took it from a slide that I found deep, deep in the internet :) Don't sue me because of the resolution :(

[image: pipeline overview schema from a slide]

