Is there any reason that UUIDs are used for each id needed? UUIDs are very 'heavy' obj

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Alright, I've been thinking about this one (in light of <a class="issue-link js-issue-

UUIDs,about dyn4j/dyn4j

Comments (17)

wnbittle commented on May 10, 2024

This one is interesting - here are some thoughts of mine:

We use UUIDs on the following classes: Shape, Fixture=>BodyFixture, Collidable=>Body, Joint, and World. I wouldn't expect the creation of them to be an issue since these objects are all long lived. I would also expect that their use elsewhere in the library is minimal and reasonably efficient. So I think the question is really about memory footprint right?
One easy optimization (if we didn't want to bite off an overhaul) might be to have the Fixture=>BodyFixture id reference the shape's id instead of creating it's own id. There's a 1-to-1 relationship between a Shape and a Fixture=>BodyFixture. The Fixture class already checks for type equality before comparing the ids.
If we wanted to remove the UUIDs completely it would be a breaking change since they are public. I thought that I marked the id field as something that could change in the future, but I can't find that now.
I'd like to avoid a global id generator. Something at the class level would be fine in my mind. So all the different classes that use ids would start from 0 and just count up. We'd need to make sure that this doesn't cause key collisions in the library internals. Something defined at the World level would be best, but then a Shape or the other objects wouldn't have an id until they were added to a World.
I don't think we need to worry about overflow since it's unlikely to have 2 (or 4 for unsigned) billion shapes in a given world at one time. You'd run into performance issues well before that.
I think UUIDs are convenient for networked applications. If we just use ints the developer would probably need to introduce another identifier.
I wonder if the savings are worth it. Take JDK8 UUID, which is 24 bytes. At a 1000 shapes + 1000 fixtures + 1000 bodies + 500 joints = 3500 UUIDs we have 84KB for UUID vs 14KB for int. For Android UUID is 56 bytes so we'd have 196KB vs 14KB which is a much larger difference. But I'd expect that you'd hit performance problems we'll before the memory becomes an issue.
On the other hand it would make the core library faster and leaner and the userData fields are available to store whatever a developer needs to uniquely identify them if needed.

I go back and forth as to whether this would be a good change. It would be nice if the user base would offer their opinion, but they're probably just as busy as I am.

from dyn4j.

mtsamis commented on May 10, 2024

Thanks for sharing your thoughts, here are some more from me as well:
This is indeed mostly a memory issue, although there's some performance gain exists as well. Namely the equality operator == for two int will be faster compared to UUID's equals method. But this doesn't make a change indeed. But about the memory cost I think there is bigger penalty than what you describe. Here are some additional thoughts and notes:

Apart from the content of UUID, each created object has a header of around 8 - 16 bytes
In some cases the reference in classes with UUID can cost some bytes more due to alignment (this is somewhat trivial)
UUIDs are objects so add extra work for the GC to do apart from taking up more space
Extra memory dereference for accessing (memory cost can be high here)
1K bodies is sort of a underestimation. Currently simulations with 10K bodies are possible in desktop
It was actually a surprise Android uses such a schema, this surely makes it even more important for mobile
Surely HotSpot can do better optimizations and inline with more ease a single int than method calls on objects
Per class id generators are good yeah, better not expose any part of the generator publicly. I don't think there's problem with different things (eg Shape and Joint) having the same id
I'm not sure, but I think UUIDs are not unique in the absolute sense. They're generated from a PRNG and have a (really small) collision probability. So in a sense the incremental id generator is free from those collisions (before overflowing anyway)

Maybe this is getting a bit technical, but what I'm saying is that there are various gains from the change, more than simple byte count; The only problem is that this is indeed a 'breaking change' at the library level.
More discussion is welcome

from dyn4j.

mtsamis commented on May 10, 2024

@wnbittle
Hey William, I'm bringing this back for discussion.
The reason is that I (accidentaly) found that UUID takes up a huge portion of body creation time. On very modern hardware (my desktop) a single Body with a Single shape can take up to 2000 ns, and roughly 1600 of those are for creating UUIDs. This is quite bad (aside from memory consumption).

One will say that body creation is an upfront cost that is not significant, but I imagined some situation where (example in a game) after an event a big object is broken into small fragments in a realtime simulation.
With worse hardware than mine, or on a mobile device, this could easily cause a lag spike which is unpleasant.

Replacing UUIDs with ints will be easy and clean in the codebase, I think it has to happen at some point. I shouldn't be very annoying for users to port their code, since the logic is the same. I think the name can remain identical as well (i.e. getUUID).

from dyn4j.

wnbittle commented on May 10, 2024

There's a usecase where ints could be problematic. Imagine a simulation that's run server side where users can join and exit at will. In that simulation, imagine that the users can generate and destroy objects that have unique ids: shapes, bodies, etc. It's within reason, that you could run out of ints after a short amount of time (maybe hours or days).

A contrived example - sure, but a valid one in my opinion. Let's meet somewhere in the middle. You are concerned about CPU cost at creation time and memory footprint during operation. I'm concerned about uniqueness. I think we can support both by allowing the user of the library choose the id generation scheme, with a small amount of overhead:

package org.dyn4j;
public interface Identifiable {
   public Object getId();
}

package org.dyn4j;
public interface IdentifierGenerator {
   // pass in the calling type just in case an implementer wants to make a decision based on the type
   public Object generate(Class<?> clazz);
}

package org.dyn4j;
public final class UUIDGenerator implements IdentifierGenerator {
   public Object generate(Class<?> clazz) {
       return UUID.random();
   }
}

package org.dyn4j;
public final class IntGenerator implements IdentifierGenerator {
   private int sequenceNumber = 0;
   public Object generate(Class<?> clazz) {
       return this.sequenceNumber++;
   }
}

package org.dyn4j;
// global and not thread-safe on purpose
public final class IdentifierFactory {
   // the `IntGenerator` could be the default since your usecase is more common
   private static IdentifierGenerator generator = new UUIDGenerator();
   public static final Identifier generate(Class<?> clazz) {
      return this.generator.generate(clazz);
   }
   public static final void setGenerator(IdentifierGenerator generator) {
      this.generator = generator;
   }
}

// then have Shape, Fixture, Body, etc. implement the Identifiable interface
// each would call the global factory method IdentifierFactory.generate(this.getClass()); in the constructor (like they do today)

The only thing this doesn't do is force uniqueness. There's internal machinery that requires uniqueness (at a world level). I think we can solve this with proper Javadoc documentation, highlighting the necessity of uniqueness for the library to function properly. It's probably very rare anyone would need something outside of the Int or UUID designs anyway.

from dyn4j.

mtsamis commented on May 10, 2024

This is actually a pretty interesting. Your concern is actually right, under rare but existent situations their value space could be exhausted. I was thinking that it probably won’t be a problem because when the ints are exhausted they’ll just wrap around, hence start creating ‘new Ids’. But again for extreme uses this can cause problem (or from some bad actor? ).

The identifiable interface won’t do in my opinion. It essentially adds some JVM level jump overhead for getId (because of multiple implementations possible) and the int will be boxed anyway.

After this discussion I think that there’s actually a solution that satisfies both needs.

We’ll just use long instead of int. And actually a good thing is that long will be even more unique than UUID. UUIDs are essentially 128 bit numbers but randomly generated. So by the time we reach as many UUIDs as longs can be there’s a 50% chance that UUIDs will have collisions and hence are problematic as well (by the birthday paradox, as sqrt(2^128) = 2^64).

So in my opinion long are fast enough, have zero cost allocation, are not boxed/objects, guaranty uniqueness up to their limit and are not random, and provide ‘a similar order of collision resistance’ with UUIDs (that last one is a bit ad hoc but you get the point). Should be sufficient for all use cases.

from dyn4j.

wnbittle commented on May 10, 2024

We can drop the Identifiable interface - it's not required. I think the impact of this is negligible though - the optimizer can detect that nothing overrides the method at runtime.

Regarding boxing - I don't think boxing is the issue here - it would only happen once. The only issue I see with Integer and Long is the increased memory footprint. 24 bytes for both Long and Integer vs their 8 and 4 byte primitive cousins.

Regarding uniqueness, a long has 1.8 x 10^19 unique values while a UUID has 5.3×10^36. Collisions are possible, sure, but the uniqueness range is larger. Would we ever reach either limit? Probably not.
However, uniqueness isn't just a function of how many unique values there are. Consider the scenario where we need to share objects between JVMs at runtime. A classic example would be a simulation running server side and client side where the client can create objects locally. In this case, a universally unique ID is required. Whether that's a UUID or a user provided class that generates unique ids across JVMs doesn't matter.

I'm still in favor of my solution for these reasons:

It gives the user of the library the option for much faster execution and less memory
It leaves open the possibility for using more complex identifiers (like UUID) if needed
It may not be as efficient as an int or long, but it's clearly better than the current state in every way

from dyn4j.

mtsamis commented on May 10, 2024

When I mentioned boxing as an issue I actually meant the exact same thing; There is a memory overhead and an extra indirection to access the value.
Also uniqueness seems like an issue to me. They way UUIDs are used to implement equals methods a collision can make the simulation misbehave or glitch, and there is no feasible way to detect a UUID collision at runtime (I understand this needs a lot of bad lack to happen, but on the other side you can instantly know if you run out of longs).

Can you help me understand the server-client scenario please? I did not understand why unique ids are needed in this case.

from dyn4j.

wnbittle commented on May 10, 2024

I've answered your comments/questions below more specifically, but I wanted to make sure that my thought process is being understood.

I'm not debating whether UUID, Long, long, or any other type is better, more suited, or more unique than another. Instead, I'm debating that allowing the user of the library to choose what they want is worth the trade-off in performance. In fact, as it stands right now, my design is still a net gain in performance.

Now, let's take it a step further and ask the question, would the extra memory and indirection of using a boxed type vs a primitive even show up on a profiler? I doubt it - especially in this type of library where we're doing thousands upon thousands of floating point operations per iteration of the engine.

So, in my view, it's a no-brainer. We improve performance by swapping out UUID with Integer as the default and at the same time allow the user of the library to choose whatever unique identifier they want at design time.

Also uniqueness seems like an issue to me. They way UUIDs are used to implement equals methods a collision can make the simulation misbehave or glitch, and there is no feasible way to detect a UUID collision at runtime (I understand this needs a lot of bad lack to happen, but on the other side you can instantly know if you run out of longs).

This is unbelievably unlikely. According to Wikipedia: Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion. Wikipedia only references 3 cases where collisions were reported and they've been in use for decades.

Can you help me understand the server-client scenario please? I did not understand why unique ids are needed in this case.

The scenario is a multi-client + server simulation. Imagine a simulation running server-side and client-side - a networked game for example. The bodies for all simulations need to be synchronized so that when you create an object, I see that object and vise-versa. Each client and the server are running in their own VMs with their own sequence of ids. So you may add an object with id 100, and I might add an object with id 100. Unless the clients and server communicate or we force the server to do all creation, we cannot guarantee that objects created in any of the networked simulations have unique ids. UUIDs, on the other hand, would allow every client and the server to freely create objects without communication with statistical certainty of uniqueness.

from dyn4j.

mtsamis commented on May 10, 2024

Alright, I've been thinking about this one (in light of #71 as well) and I believe your approach is a good compromise of performance and customizability. The only downside will probably be that getUUID will return untyped Objects, but this will be hard to solve without polluting a lot of classes with generics.

@wnbittle I can go ahead and implement this change roughly as you described here #47 (comment) , with some minor changes.

from dyn4j.

mtsamis commented on May 10, 2024

Update: After starting to implement this, as described above, I made some more realizations that would like to further discuss.

It occurred to me that UUIDs/IDs are actually not needed anywhere in the library (correct me if I'm wrong). The only -limited- uses inside the dyn4j codebase are essentially of the form
obj1.getId().equals(obj2.getId()) and obj.getId().hashCode()
But since those unique IDs are tied to a single object those are equivalent with
obj1 == obj2 and obj.hashCode(), and should be replaced anyway for simplicity and performance reasons.

Observation 1: Dyn4j has not to rely internally on IDs at all and hence the only possible use for IDs is in user code. This means that if the user does not need them it wouldn't be a problem if they did not exist at all. So if we had used the Identifiable scheme it would make sense to have an option to not create IDs (i.e. generateId() { return null; }). But now this is somewhat ugly because this cannot fit any Identifiable class contract and can be prone to bugs (lack of good interpretation for a method generateId to return null or a constant).

'Since IDs are only used by the user shouldn't they reside in userData?' (This question is just to make a point)
Observation 2: In theory they could but it would be quite awkward for existing code. I believe the usefulness here is that IDs currently are automatically created and attached to objects. At the moment it's also useful that IDs are typed so the getId() method returns UUID and can be used without a cast, but that would be lost if the Identifiable scheme is implemented.

From those observations I can see a possible scenario:

Use the name Tag instead of Id. A tag in this scenario will be an optional, automatically generated Object attached to some dyn4j objects with a possibly user defined tag function.
Replace getId everywhere with getTag, but provide built-in UUID Tag generator for better migration (So if you used getId previously you can just replace it with getTag) The change in name intends to make the users aware as well.
Make the default to be no tags -> maximum memory/speed efficiency. If not needed internally integer tags do not make much sense for the user to use. UUID tags by default can be deemed expensive and this is part of the effort to replace them.

Of course there exists the alternative 'Do nothing of the above and keep current implementation' or similar.

So for now I stopped implementing the Identifiable scheme and seek for more discussion/ideas in order to find the best course of action.

from dyn4j.

wnbittle commented on May 10, 2024

Great observations - i would need to confirm your assertion, but I think you are right - we shouldn’t need an id at all internally and just use reference equals and the default Object.hashcode implementation.

I agree with you that if the user needs an id, they should put it in the user data. If the need more than that, they need to create a complex type to hold what they need.

My only concern is that the current built-in id is convenient. I like your idea of replacing it with a getTag method, but I’m thinking we scrap it all and force the use of userData.

What’s your opinion?

from dyn4j.

mtsamis commented on May 10, 2024

In my opinion the correct approach is to remove IDs. Since there is explicitly space for user defined values, keeping IDs will only be a matter of legacy support.

In general it shouldn't be a major pain to change this in user code. There are least two decent alternatives: 1) storing IDs in userData (either directly or in a complex type as you mentioned or 2) even better if you have extended the Body class (as a lot of examples from dyn4j do) it's completely painless to add an extra ID field of whatever form (with the added bonus of being typed). Depending on the scenario there may be more options as well (for example if the only use of IDs is storing the objects in a file there is no need for the IDs to be constant from body creation).

Because there have been some useful commits from 3.3.0 and it would be nice to allow drop in upgrade to 3.3.1, what I would suggest is:
Remove the use of IDs for comparisons but keep the public API while marking it with @Deprecated(forRemoval = true). Then after the 3.3.1 release make a follow up 3.3.x release with all IDs and public API removed. So users can update with ease while also getting the warning for API removal.

from dyn4j.

wnbittle commented on May 10, 2024

That sounds like a good plan to me. Just to confirm we're on the same page:

Mark the getId methods as Deprecated (both javadoc and annotation)
Update internal code to use reference equals and default Object.hashcode
Ensure that classes that have getId do not override equals
In the next release we'll remove them.

from dyn4j.

mtsamis commented on May 10, 2024

Yeah, that's the plan. I implemented points 1, 2 and 3 alongside other minor changes and housekeeping in #77. I have plans to implement the removal PR in time.

from dyn4j.

wnbittle commented on May 10, 2024

Can we go ahead and update the toString methods to use object.hashcode?

from dyn4j.

mtsamis commented on May 10, 2024

Yeah, done in #85

from dyn4j.

wnbittle commented on May 10, 2024

@mtsamis Should we close this and use #71 as the issue to come back to remove them?

from dyn4j.

UUIDs about dyn4j HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent