Comments (4)
Hi @AkhtemWays, this is an interesting topic of discussion, which I had not considered before. Can you give any more details about how often you are seeing duplicate IDs? How do you know the duplicate IDs were created by enrich, and not by the trackers that sent the events?
I wanted to know the reason for going towards the UUID strategy.
I can't answer this, because the design decision pre-dates when I joined Snowplow!
But... I had never considered it a bad decision. To the best of my understanding, the Java implementation of UUID.randomUUID()
is extremely unlikely to generate duplicates.
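To give a sense of just how unlikely, here is a rough birthday-bound estimate. This is a sketch for illustration only: the event count of one billion is an assumption, not a figure from this thread.

```java
import java.util.UUID;

public class UuidCollisionBound {
    public static void main(String[] args) {
        // A version-4 UUID like this one carries 122 random bits
        // (4 bits encode the version, 2 bits the variant).
        UUID example = UUID.randomUUID();
        System.out.println("sample v4 UUID: " + example);

        // Birthday-bound approximation for n random values drawn
        // uniformly from a space of 2^122: p(collision) ~ n^2 / 2^123.
        double n = 1e9; // hypothetical: one billion events
        double pCollision = (n * n) / Math.pow(2, 123);
        System.out.println("approx collision probability: " + pCollision);
    }
}
```

Even at a billion events, the chance of any two v4 UUIDs colliding is on the order of 1e-19, far below the rate of duplicates produced by pipeline retries or trackers.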
We do often see duplicate event IDs downstream of enrich. But in our experience those duplicates arise from either:
- Artifacts of the pipeline, especially during autoscaling up or down.
- Browsers sending the same event multiple times because of network connectivity issues.
- Badly behaved tracking implementations that re-use the same event id.
from enrich.
When I ran a GROUP BY query on event_id and ordered by the counts, I found that at most 45 rows in the DWH shared the same event_id.
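The duplicate count described here can be sketched in miniature with Java streams. The sample event_id values are made up for illustration; in practice this would be a SQL GROUP BY over the warehouse table.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateEventIds {
    public static void main(String[] args) {
        // Hypothetical sample of event_id values pulled from the warehouse.
        List<String> eventIds = List.of("e1", "e2", "e1", "e3", "e1", "e2");

        // Equivalent of:
        //   SELECT event_id, COUNT(*) FROM events
        //   GROUP BY event_id HAVING COUNT(*) > 1
        //   ORDER BY COUNT(*) DESC
        Map<String, Long> counts = eventIds.stream()
                .collect(Collectors.groupingBy(id -> id, Collectors.counting()));

        counts.entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        // prints "e1: 3" then "e2: 2"
    }
}
```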
As far as I understand, none of the UUID versions actually guarantees total uniqueness.
This affects joins when the data flows further down to other data sources: at that point we are forced to join on many fields, which degrades query performance and increases CPU load overall.
If the primary goal is generating absolutely universally unique IDs, then I suppose the Twitter Snowflake strategy would be a good choice. The algorithm ensures unique ID generation because each ID is tied to a unix timestamp plus the machine_id that generated it, which basically means that at one point in time one machine can generate only one ID. If a process generates IDs for multiple objects at the same time, a fix could be to sleep for one nanosecond, or some other strategy.
I think this would solve the three problems you described, and the uniqueness problem overall.
One configurable env parameter could be added to specify the machine_id, I guess.
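For reference, a Snowflake-style generator can be sketched as below. This is an illustrative toy, not Snowplow code: the epoch, bit widths and class name follow Twitter's published layout, the usual per-millisecond sequence counter replaces the nanosecond sleep, and handling of a backwards-moving clock is omitted.

```java
public class SnowflakeIdSketch {
    private static final long CUSTOM_EPOCH = 1288834974657L; // Twitter's epoch
    private final long machineId;      // 10 bits, from a config/env parameter
    private long lastTimestamp = -1L;
    private long sequence = 0L;        // 12-bit counter within one millisecond

    public SnowflakeIdSketch(long machineId) {
        this.machineId = machineId & 0x3FF;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            // Same millisecond: bump the sequence instead of sleeping.
            sequence = (sequence + 1) & 0xFFF;
            if (sequence == 0) {
                // 4096 ids in one millisecond exhausted: wait for the next tick.
                while (ts <= lastTimestamp) {
                    ts = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = ts;
        // Layout: 41 bits timestamp | 10 bits machine id | 12 bits sequence.
        return ((ts - CUSTOM_EPOCH) << 22) | (machineId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeIdSketch gen = new SnowflakeIdSketch(1);
        long a = gen.nextId();
        long b = gen.nextId();
        System.out.println(a + " " + b + " distinct=" + (a != b));
    }
}
```

Note that uniqueness here depends entirely on every machine being assigned a distinct machine_id, which is exactly the server-side coordination discussed further down in this thread.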
I am not surprised that you see duplicate events in the DWH. But I think you are looking in the wrong place for the problem if you think it's because of our UUID generator.
If you find two events with the same event id, then interesting questions to look at next are:
- Do they share the same `collector_tstamp`? If yes, then this is probably a full duplicate copy of the same single event that was received by the collector.
- Do they share the same `dvce_created_tstamp`? If yes (but `collector_tstamp` is different) then this is probably a duplicate copy of the same single event which a tracker sent multiple times to the collector, e.g. because of a network failure.
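The decision tree above can be written out as a small classifier. This is a hypothetical helper, not Snowplow code; the enum names and method signature are invented for illustration, and it assumes the two events already share the same event_id.

```java
import java.time.Instant;
import java.util.Objects;

public class DuplicateKindClassifier {
    enum Kind { PIPELINE_DUPLICATE, TRACKER_RESEND, ID_REUSE }

    // Apply the timestamp heuristic to two events sharing an event_id.
    static Kind classify(Instant collectorA, Instant dvceCreatedA,
                         Instant collectorB, Instant dvceCreatedB) {
        if (Objects.equals(collectorA, collectorB)) {
            // Same collector_tstamp: one event duplicated inside the pipeline.
            return Kind.PIPELINE_DUPLICATE;
        }
        if (Objects.equals(dvceCreatedA, dvceCreatedB)) {
            // Same dvce_created_tstamp, different collector_tstamp:
            // the tracker probably re-sent the event, e.g. after a network failure.
            return Kind.TRACKER_RESEND;
        }
        // Otherwise a tracker likely reused one event id for distinct events.
        return Kind.ID_REUSE;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        Instant t1 = Instant.parse("2024-01-01T00:00:05Z");
        System.out.println(classify(t0, t0, t0, t0)); // PIPELINE_DUPLICATE
        System.out.println(classify(t0, t0, t1, t0)); // TRACKER_RESEND
        System.out.println(classify(t0, t0, t1, t1)); // ID_REUSE
    }
}
```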
If you investigate further the duplicate IDs in your DWH I am sure you will find there are other explanations, unrelated to how we generate UUIDs.
To add to @istreeter's comments above: although it is possible to get event id duplicates, it is generally rare to see genuine collisions unless duplicates are being sent.
We are unlikely to introduce any technology (e.g., Twitter Snowflake) to produce truly globally unique IDs, as this is very computationally expensive and we cannot rely on sources of server information (e.g., worker and shard numbers) originating from the client.
Related Issues (20)
- enrich-pubsub: set user-agent header in Pubsub publisher and consumer
- enrich-kafka: add blob storage support
- Add Snowplow Community License
- Use cron expressions for assets refresh
- Enricher logs unnecessary line while validating date-time fields - "[ERROR] com.networknt.schema.DateTimeValidator - Invalid date-time: Invalid timezone offset: 123"
- Upgrade to Cats Effect 3 ecosystem
- enrich-kafka: support for multiple Azure blob storage account
- Remove config logging
- Move to Snowplow Limited Use License
- Add mandatory SLULA license acceptance flag
- Add headset to the list of valid platform codes
- Switch from Blaze client to Ember client
- Add Cross Navigation Enrichment
- Use SLF4J for Cats Effect starvation warning message
- Remove lacework workflow
- Issue when updating to snowplow-enrich-kinesis-4.0.0.jar
- Stop publishing fat jars
- enrich-kafka: authenticate with Event Hubs using OAuth2
- Allow multiple javascript enrichments
- Allow passing an object of parameters to the JS enrichment