Comments (4)

istreeter commented on July 17, 2024

Hi @AkhtemWays, this is an interesting topic of discussion, which I had not considered before. Can you give any more details about how often you are seeing duplicate IDs? How do you know the duplicate IDs were created by enrich, and not by the trackers that sent the events?

wanted to know the reason for going towards UUID strategy.

I can't answer this, because the design decision pre-dates when I joined Snowplow!

But... I had never considered it a bad decision. To the best of my understanding, the Java implementation of UUID.randomUUID() is extremely unlikely to generate duplicates.
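To put a rough number on "extremely unlikely": `UUID.randomUUID()` returns a version 4 UUID, which carries 122 random bits. The sketch below applies the standard birthday-bound approximation to estimate the chance of at least one collision among `n_ids` such IDs. It assumes the random bits are uniformly distributed, and the function name is only illustrative.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 128 bits, 6 of which are fixed


def collision_probability(n_ids: int, random_bits: int = RANDOM_BITS) -> float:
    """Approximate P(at least one collision) among n_ids uniformly random IDs."""
    space = 2 ** random_bits
    pairs = n_ids * (n_ids - 1) / 2  # number of distinct pairs of IDs
    # 1 - exp(-pairs / space), computed with expm1 to keep precision for tiny values
    return -math.expm1(-pairs / space)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>15,d} IDs -> p = {collision_probability(n):.3e}")
```

Even for a trillion IDs the estimate stays below 1e-12, which is why a group of dozens of identical event IDs points to a cause other than the generator.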

We do often see duplicate event IDs downstream of enrich. But in our experience those duplicates arise from one of the following:

  • Artifacts of the pipeline, especially during autoscaling up or down.
  • Browsers sending the same event multiple times because of network connectivity issues.
  • Badly behaved tracking implementations that re-use the same event id.

AkhtemWays commented on July 17, 2024

When I ran a GROUP BY query on event_id and ordered by count, I found up to 45 rows sharing the same event_id in the DWH.
As far as I understand, none of the UUID versions actually guarantees total uniqueness.
This affects joins when the data moves further downstream to other data sources: we are forced to join on many fields, which degrades query performance and increases overall CPU load.
If the primary goal is to generate absolutely universally unique IDs, then the Twitter Snowflake strategy would be a good choice, I suppose. The algorithm ensures unique ID generation because each ID is tied to the unix_timestamp plus the machine_id of the machine that generates it, which basically means that at any one point in time one machine can generate only one ID. If the script generates IDs for multiple objects at the same time, the solution could be to sleep for 1 nanosecond, or to use some other strategy.
I think this would solve the three problems you described, and the uniqueness problem overall.
A configurable env parameter could be added to specify the machine_id, I guess.
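For readers unfamiliar with the idea, here is a minimal Python sketch of a Snowflake-style generator. It is not something enrich implements. It follows the commonly described 64-bit layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence), where a per-millisecond sequence counter takes the place of sleeping between IDs. The class name and the epoch value are arbitrary choices made for illustration.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """Illustrative 64-bit ID: 41-bit ms timestamp | 10-bit machine id | 12-bit sequence."""

    EPOCH_MS = 1_577_836_800_000  # arbitrary custom epoch (2020-01-01 UTC), an assumption

    def __init__(self, machine_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self._machine_id << 12) | self._sequence


if __name__ == "__main__":
    gen = SnowflakeLikeGenerator(machine_id=7)
    print([gen.next_id() for _ in range(3)])
```

Uniqueness in this scheme depends on every running instance having a distinct machine_id and on the clock never moving backwards.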

istreeter commented on July 17, 2024

I am not surprised that you see duplicate events in the DWH. But I think you are looking in the wrong place for the problem if you think it's because of our UUID generator.

If you find two events with the same event id, then interesting questions to look at next are:

  • Do they share the same collector_tstamp? If yes, then this is probably a full duplicate copy of the same single event that was received by the collector.
  • Do they share the same dvce_created_tstamp? If yes (but collector_tstamp different) then this is probably a duplicate copy of the same single event which a tracker sent multiple times to the collector, e.g. because of a network failure.
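The two checks above can be sketched in Python, assuming the duplicated rows have been exported from the DWH as dictionaries with `event_id`, `collector_tstamp` and `dvce_created_tstamp` keys. The function name and the three labels are hypothetical, and a group whose timestamps only partly match is simply flagged for manual investigation.

```python
from collections import defaultdict


def classify_duplicates(events):
    """Label each duplicated event_id by the most likely cause of the duplication."""
    groups = defaultdict(list)
    for event in events:
        groups[event["event_id"]].append(event)

    labels = {}
    for event_id, group in groups.items():
        if len(group) < 2:
            continue  # this event_id appears only once, so it is not a duplicate
        collector = {e["collector_tstamp"] for e in group}
        device = {e["dvce_created_tstamp"] for e in group}
        if len(collector) == 1:
            # same collector_tstamp: copies of one event received by the collector
            labels[event_id] = "full_duplicate"
        elif len(device) == 1:
            # same dvce_created_tstamp, different collector_tstamp: tracker resend
            labels[event_id] = "tracker_resend"
        else:
            labels[event_id] = "needs_investigation"
    return labels


if __name__ == "__main__":
    rows = [
        {"event_id": "a", "collector_tstamp": "10:00:00", "dvce_created_tstamp": "09:59:59"},
        {"event_id": "a", "collector_tstamp": "10:00:00", "dvce_created_tstamp": "09:59:59"},
        {"event_id": "b", "collector_tstamp": "10:00:01", "dvce_created_tstamp": "09:59:58"},
        {"event_id": "b", "collector_tstamp": "10:00:07", "dvce_created_tstamp": "09:59:58"},
    ]
    print(classify_duplicates(rows))
```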

If you investigate the duplicate IDs in your DWH further, I am sure you will find there are other explanations, unrelated to how we generate UUIDs.

miike commented on July 17, 2024

To add to @istreeter's comments above: although it is possible to get event ID duplicates, it is generally rare to see genuine collisions unless duplicates are being sent.

We are unlikely to introduce any technology (e.g., Twitter Snowflake) to produce truly globally unique ids as this is very computationally expensive and we cannot rely on sources of server information (e.g., worker and shard numbers) originating from the client.
