bepuphysics2's People

Contributors

0x5143, 7bpencil, frooxius, justinbritt, kaladrius2trip, kunalspathak, lidgren, liserdarts, noahstolk, paulbartrum, rossnordby, saalvage, shin1m

bepuphysics2's Issues

Cross Platform Support

Given that the whole Bepu2 codebase targets .NET Standard, it is theoretically cross platform by default, which is great... but...

I am seeing lots of aggressive optimizations: heavy use of vector math operations, memory tricks with Span, and even unmanaged memory for storing buffers.

I've been contributing to SixLabors.ImageSharp, which also does lots of these performance optimizations... and they're already having a lot of issues on Xamarin/Android and UWP due to some optimizations not being available, the compiler not optimizing as expected, or even producing invalid code.

Maybe you could consider, at some point in the roadmap, porting the demo to Xamarin and/or UWP, just for the sake of exercising and validating Bepu2 on platforms other than desktop.

A UWP port is probably quite trivial, since UWP is supported by SharpDX; the interesting thing is that UWP uses the .NET Native AOT compiler, which gives a quite impressive performance boost but is also very buggy with Span and System.Numerics.Vectors.

Android/iOS would probably need an OpenGL rendering head, which might prove more complicated.

Alternative pose integrators and custom gravity handling

At the moment, gravity is assumed to be a single simulation wide constant by the PoseIntegrator. Some simulations do not require gravity, and others may want to use per-body gravity of some form.

We can easily support the different use cases by letting the user supply a gravity callback in the same way that we support zero overhead narrowphase callbacks (generic structs). Since the pose integrator relies on body data in AOS format, it wouldn't be too terrible to rely on the user to provide the desired implementation.

The same sort of approach could be used to implement damping. The callback would effectively be a general purpose velocity modifier.
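
A rough sketch of what such a callback could look like, mirroring the generic struct approach used by the narrowphase callbacks. The interface name, signature, and per-body gravity storage here are all illustrative assumptions, not the engine's API:

using System.Numerics;

// Hypothetical callback interface; the real one would be defined by the engine.
public interface IVelocityIntegrationCallbacks
{
    // Called once per body during pose integration, before velocities are applied to poses.
    void IntegrateVelocity(int bodyIndex, ref Vector3 linearVelocity, float dt);
}

// Per-body gravity with zero dispatch overhead: the struct is supplied as a generic type
// parameter, so the JIT can specialize and inline this method into the integration loop.
public struct PerBodyGravityCallbacks : IVelocityIntegrationCallbacks
{
    // Hypothetical application-maintained gravity values, indexed in body order.
    public Vector3[] Gravities;

    public void IntegrateVelocity(int bodyIndex, ref Vector3 linearVelocity, float dt)
    {
        linearVelocity += Gravities[bodyIndex] * dt;
    }
}

A damping callback would follow the same shape, scaling the velocity instead of adding to it.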

Csproj project files incorrectly defined.

I've seen this line of code and I was wondering why you need it:

<PropertyGroup>
    <!-- TODO: Why is this necessary? It refers to 1.6.1 rather than 2.0 without this, despite the target framework being set to 2.0 above. -->
    <NetStandardImplicitPackageVersion>2.0</NetStandardImplicitPackageVersion>
  </PropertyGroup>

I think it's because you have defined the project in a rather unusual way:

<Project>
  <!--Note workaround for macro evaluation issue.-->
  <Import Sdk="Microsoft.NET.Sdk" Project="Sdk.props" />
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <!-- ... -->
    <LangVersion>7.2</LangVersion>
  </PropertyGroup>
  <Import Sdk="Microsoft.NET.Sdk" Project="Sdk.targets" />
</Project>

I think if you define the project as in the default templates, you can simplify the project definition:

<Project Sdk="Microsoft.NET.Sdk">  
  <PropertyGroup>    
    <TargetFramework>netstandard2.0</TargetFramework>
    <!-- ... -->
    <LangVersion>7.2</LangVersion>
  </PropertyGroup> 
</Project>

Memory management API needs filling out

As a part of moving to a more explicit API and leaning on manual memory management, we should provide in depth options for controlling the engine's active set.

This control is provided by the Simulation's Clear, EnsureCapacity, Compact, Resize, and Dispose functions. Many systems already implement the needed functions, but the newer ones (broad phase, narrow phase, shapes) don't.

As things progress, the SimulationAllocationSizes type will also need to be extended to match the new systems' requirements.

Nonconvex manifolds may require additional contacts

The narrowphase currently assumes manifolds have a maximum of 4 contacts. This is perfectly fine for convex manifolds, but it is unlikely that a constraint based on 4 contacts will be able to handle all the necessary degrees of freedom in a complex nonconvex pair.

One pathological example: a mesh with 6 spikes, with one point touching each face of a cube. While a use-maximum-penetration-depth heuristic could retroactively catch the worst penetration and keep the cube from falling through the mesh, there will always be some spike which lacks a contact. Any force pushing in that direction will not be counteracted until that spike's contact is included in the reduced manifold.

The actual impact should be tested. If the behavior is not acceptable, there are some obvious options:

  1. Allow more contacts in nonconvex manifolds. For example, 8 contacts may be sufficient to eliminate noticeable issues in all practical cases. The end result depends heavily on the heuristics chosen to reduce the many submanifolds.
  2. Allow arbitrarily many submanifolds per pair. This is the v1 solution, and it's not great. It either sacrifices SIMD occupancy or explodes the number of constraint batches.

My guess for now is that 4 contact manifolds will work most of the time, even for complex nonconvexes, and that 6-8 contact manifolds will suffice for the remaining cases. The heuristic tuning is a little worrisome- high quality heuristics that take into account constrained directions might be too expensive.

Kinematics should not be added to constraint batch handle sets

Given that a kinematic body's velocities cannot be modified by constraints, there is no danger in allowing multiple references to the same kinematic body within a constraint batch.

This can be important- imagine five hundred dynamics sitting inside a big kinematic spaceship. Without making this change, there would be five hundred constraint batches (at least until #15 exists, but even if it did, there's no good reason to trigger that fallback with kinematics).

This is pretty easy to implement- the only difficulty is keeping everything synchronized when a body shifts from kinematic to dynamic. All of its constraints would need to be reallocated.

Review usages of ref vs in

If there are no compatibility blockers with in parameters, there are a whole bunch of places where they would be useful for reducing syntax noise. Consider reviewing most of the user-facing APIs for opportunities.
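
For illustration, the kind of change such a review would enable. The method here is a made-up example, not a specific engine API; in parameters require C# 7.2, which the projects already target:

using System.Numerics;

public static class Example
{
    // ref forces the caller to write 'ref' even for read-only inputs:
    public static void AddRef(ref Vector3 a, ref Vector3 b, out Vector3 result) => result = a + b;
    // call site: AddRef(ref a, ref b, out var sum);

    // C# 7.2 'in' keeps the by-reference pass but drops the call site noise:
    public static void AddIn(in Vector3 a, in Vector3 b, out Vector3 result) => result = a + b;
    // call site: AddIn(a, b, out var sum);
}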

Contact constraint zoo requires filling out

As of this writing, only 3 of 16 (and possibly 24) contact constraints exist.

We'll need to create the rest of them, but we need to run the numbers on memory bandwidth. The effective mass jacobian baking trick might be worth it on one-contact constraints.

This is made slightly more complex by the need to keep things minimally duplicated. Nonconvex manifolds should tend to be a bunch of convex one-contact manifolds lacking twist friction, so it shouldn't be too hellish.

Memory management and garbage collection

Hi,

I'm still new to bepuphysics, so I would like to better understand how memory management works.
It seems to use a lot of structs and manual buffers, which reminds me of 3D engines in C++ :)

However, I noticed that, when running the demos app, the memory usage starts around 338 MB. Then, as I switch through the demos, it goes up to about 1 GB before garbage collections (I assume) occur. Then it drops to around 700 MB and never lower than that.

Is that normal behavior?
I would have expected the memory to reset back to around 338 MB when switching back to the MeshDemo.

My guess is that there are buffers that stay at a certain size to accommodate future needs, and I would need to explicitly reduce them...

Convex shape zoo

At the time of writing, testing focuses on spheres for simplicity while getting all the core systems up and running. But games need more than spheres.

In rough order of priority, the engine needs:

  1. Spheres
  2. Boxes
  3. Capsules
  4. Convex hulls
  5. Triangles (though only as a part of other group types, like meshes; pretty sure no one ever used v1's TriangleShape by itself)
  6. Cylinders
  7. Cones

Convex hulls will be by far the most difficult of the bunch. All of their collision pairs are pretty complicated and don't SIMDify easily.

And an important note: we are avoiding the use of GJK/MPR/EPA for narrow phase collision detection across the board. While they work reasonably well, they come with too many implicit requirements and tuning knobs that trip people up. Avoiding them means dozens of dedicated pair types will be required instead.

Group types

The engine currently lacks nonconvex collidable types (and, at the time of writing, also most convex types!).

To support nonconvex types, we should use a top level abstraction of group types. A group is a collidable bunch of shapes, but it prescribes no specific data structure or form. For practicality's sake, there will likely be two top level group types:

  1. General shape groups containing any convex type, and
  2. Meshes, or in other words, dedicated collections of triangles. Creating a special case for triangles is likely worth it due to their frequency of use and some special properties.

An ideal group type has a few functions:

  1. Find all overlapping shapes in a bounding box for subpair dispatch.
  2. Find results of other queries, like ray casts or sweep tests.
  3. Ideally, boundary smoothing- a way to treat adjacent convex objects as a continuous surface with no 'bumps' at borders.

These properties do not require any specific data layout. Most likely, we will ship with:

  1. Vertex/index list based meshes with scale/rotation/translation local transforms. They can use the Tree type to handle all the queries. Boundary smoothing is somewhat complicated, but is fundamentally similar to v1's implementation.
  2. Terrain 'windows'. The idea here is that you can have multiple terrain collidables, each covering a portion of the underlying data set while still supporting some form of boundary smoothing at the collidable boundary due to the shared underlying data. The data source can be generic, so these windows could cover a procedural infinite terrain or anything else. To avoid virtual dispatches at runtime, specific generic versions will have to be registered in the collision task registry. We may want to just show examples of custom data sources in the demo rather than shipping in-engine registered versions.
  3. Compounds. Similar to v1's CompoundShape, being able to stick a bunch of other shapes together is important. No compound nesting though- that was a pretty goofy feature. It may be possible to efficiently smooth boundaries within a compound- if all shapes come with a speedy 'get shortest vector to shape surface from point', you could check to see if the contacts you've found are obstructed by adjacent shapes in the group. That's pretty experimental- probably should wait on implementing that.

But there's nothing stopping you from doing some more interesting things. For example:

  1. Voxel regions. Similar idea to terrain windows, except instead of triangles, you create boxes. Hardest part is still just the boundary smoothing.
  2. Dynamically triangulated isosurfaces. With appropriate acceleration structures, you could go directly from something like a distance field to a rigid surface.

An example simple voxel region in the demos could be useful for users learning how to customize the engine.

MeshReduction cannot get rid of all speculative bumps

At the moment, MeshReduction checks the set of triangles detected in the collision task for contact blocking. This assumes that all triangles capable of blocking a generated contact will be available.

Unfortunately, speculative contacts can be generated outside of any collision task query bounding box. If a contact isn't contained by one of the queries, there's no guarantee that a necessary blocking triangle will make it into the set.

In practice, bumps caused by this are pretty rare- it requires a fairly large speculative margin and quick movement. It would be nice to guarantee zero bumps, though.

Some options:

  1. More aggressive bounding box expansion to make it impossible to generate a contact in an area that isn't covered by a bounding box query. Works, but causes additional collision tests.
  2. Dynamic lookups. If each contact queried for a set of blocking triangles, there would be no issue. Explicitly doing a query for each contact would be too expensive, but there might be some more clever options (like a single lookup over the bounding box of contacts).
  3. Light precomputation. If every triangle knows what triangles it touches, you can directly look up the potential blockers without an acceleration structure traversal. Most of these approaches would mostly keep working during deformation, but not through topological changes.
  4. Heavy precomputation. Can just bake all contact blocking information. A bit less flexible, but would cut costs a bit.
  5. Workaround: use a smaller speculative margin.

Crash occurs when creating too many static meshes

Description

  • Writing a system to spawn in terrain chunks around a central point.
  • For each terrain chunk, I create a static mesh from a height map, then add a static to the simulation with a collidable description for that mesh.
  • This works fine, up until the 258th chunk is spawned, at which point the program consistently crashes.
    • Currently 1922 triangles per mesh.
  • "Game.exe has stopped working" shown by windows, no exceptions thrown.
  • Event viewer doesn't show much useful information, but I'm seeing Faulting module name: clr.dll, version: 4.7.3163.0.

It's possible that this is a bug in some unmanaged code that occurs when the buffer overflows, but either way it seems like this should be handled more gracefully, e.g. by throwing an exception when it happens.

Any help resolving this issue would be greatly appreciated.

Environment Information

  • Version: BepuPhysics 2.0.0.0-beta, BepuUtilities 2.0.0.0-beta
  • OS: Windows 10 x64

Consider region allocators for per-thread ephemeral work

While the power pools are relatively speedy, if you ever see temporary heap allocations in profiling, consider swapping all ephemeral usages out for a region allocator. Could use more memory in some cases, but less in others.

Pretty low value; don't worry about it until there's a reason to worry about it.

Activity management

At the moment, all bodies are permanently active. That's not ideal!

Activity management has two parts:

  1. Graph-traversing deactivation. In a deferred fashion, look at the active set to find deactivation candidates. Once found, remove them from the active set and push them into the inactive set.
  2. Narrowphase-triggered activation. When a new constraint is created, any inactive involved body must be woken up. That brings its whole island with it.

Deactivation thresholds should be defined on a per-body basis to avoid issues with global thresholds being inappropriate at different scales.

Deactivated objects should be near-zero overhead. They are taken out of the simulation completely, with the exception of their collidable residing in the inactive/static tree.

Some questions remain about what API interactions force islands to wake up. Is the user allowed to move an object in a deactivated island? Can you remove an object directly from a deactivated island, or does it force the island awake before performing the removal? Answers likely depend on what requires the least development work.

Shape casts

It's often useful to get an exact time of impact and associated information for a pair of shapes with given poses and swept motion.

Outside of special cases like sphere casts, sweeps are often implemented as extensions of GJK/MPR or, in the rotating case, full conservative advancement.

We would rather not implement a ton of special cases for sweep tests alone, and we'd also rather not use GJK/MPR for numerical reasons.

Notably, in its substepping form, the current CCD implementation uses a combination of speculative contacts and multiple discrete tests. Since the discrete tests all generate contacts even when not intersecting, there is no need for per-pair type specialized logic, nor MPR/GJK fallbacks. This works reasonably well because the narrow phase tests are actually pretty cheap thanks to wide SIMD usage- in fact, for many shapes, the actual manifold generation is a tiny fraction of the narrowphase's execution time.

We can use the same sort of system to avoid dedicated shape cast tests. When you can do 4+ speculative-generating discrete tests for the price of one, a lot of options become available. Full conservative advancement could even work reasonably well with the help of more general root finding schemes.

V1 vs. V2 API Differences?

I'm trying to give Bepuphysics a try, incorporating it into a .NET Core 2.1 testbed project.
After grokking through a lot of V1-related samples etc. that refer to the "Space" class, I'm not seeing that class exist in V2. Unless I'm "completely" misunderstanding something, it almost looks like "Simulation" has taken over the role of "Space"? But there are method/property differences. I'm terribly confused!
The V1 tutorial fundamentals seemed to make sense: adding things with orientations etc. to the "space" and making periodic calls to Update. Is there a list of V1/V2 differences that might show the way? Or a V2 tutorial?
Thanks for doing this project, BTW... it looks Really Awesome!
Jim

Deterministic body layout optimization

One of the few inherent sources of nondeterminism is the body layout optimizer's multithreaded execution. The result depends on which thread reaches a given body first.

While all systems have ways of guarding against the nondeterminism that this introduces, compensating for it can introduce complexity and slowdowns.

The deactivator is the most recent and most severe example. It currently uses a sort over all active bodies according to their handles to guarantee a reliable order of deactivation traversal attempts. This can become a significant chunk of frametime, and alternatives aren't that much better- for example, marching the handles array is cheaper per element, but includes all inactive bodies and potentially empty slots in the search.

If body memory layout was deterministic, that slowdown and complexity would vanish.

Deterministic body layout optimization is a little tricky, though. Maintaining determinism with an internally multithreaded approach often results in severely weakened optimization power (e.g. sorting disjoint subsets of bodies). At that point, a simple single threaded version of the current approach would likely end up more effective at optimization per unit time spent.

Given the significant complexity, sync overhead, and nondeterminism of the multithreaded body layout optimizer, there is strong reason to suspect that using a locally sequential version hidden alongside some compatible existing stage would be a net win.

There is basically only one place that fits the requirements and doesn't access bodies at all: the broad phase refit/refine phase. It already has to deal with jobs from multiple sources due to the different refit/refine logic required for actives vs statics. Plugging in another independent task alongside it would be reasonably easy.

In terms of workload, it's not a great match- the broad phase is very memory heavy, and the optimizer is pretty much memory only. However, they both have quite a few inherent cache stalls, so it might still mix reasonably well.

In the worst case, body layout optimization should always be significantly cheaper than the broad phase requirements, so the fact that it's locally sequential should never cause a significant load balancing issue.

Consider angular-only constraint specialization

Many constraints access only angular velocities, yet currently the linear velocities are also gathered/scattered.

While avoiding the linear gather/scatter would not meaningfully change memory bandwidth requirements (since both are effectively always in the same cache line), there is still a little execution overhead associated with handling the unused linear components. Given that the gather/scatters occur several times per constraint every frame, this can add up.

Realistically, it's still not going to be a big deal, but if you find a hyperbolic time chamber, this could be worth doing.

Consider twist friction for contact constraints with 1 contact

  • Nonconvex manifolds with more than one contact will be composed of a bunch of separate contact constraints that each have their own tangent friction constraint (and so get twist friction for 'free').
  • Convex manifolds with more than one contact use an explicit twist friction constraint.

But single contact manifolds do not currently attempt to stop twisting motion. This is not entirely unreasonable- perfectly rigid bodies won't have much twist friction with a single contact point. But perfectly rigid bodies don't exist in reality, and even the 'rigid' bodies we simulate have user-tuned springiness.

So we have two options:

  1. Use twist friction for the single contact case. Heuristically compute the maximum force based on some stand-in for manifold extent. Minimum shape radius as a proxy? (See the sketch after this list.)
  2. Don't bother; rely on damping or other collisions to slow such an object down.
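
A sketch of the heuristic from option 1. The radius proxy and the friction model here are assumptions for illustration, not a decided design:

// Twist friction behaves like tangential friction acting at a lever arm roughly equal to the
// manifold's extent; with only one contact, approximate that extent with the smaller shape's radius.
static float ComputeTwistFrictionLimit(float frictionCoefficient, float normalImpulse, float minimumShapeRadius)
{
    return frictionCoefficient * normalImpulse * minimumShapeRadius;
}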

Broad phase static/inactive tree dedicated refinement

Currently, the static/inactive tree of the broad phase uses the exact same level of aggressiveness as the active tree. That's extremely wasteful, given that static and inactive objects generally do not move at all.

Further, the static tree's jobs are not handled in parallel with the active tree's. That pointlessly creates sync overhead.

To improve this:

  1. Make the multithreaded refinement context create jobs for use in an external scheduler. Do not internally dispatch them.
  2. Build a dedicated static/inactive refinement that takes advantage of the unique assumptions. This most likely involves explicit refits on add/remove/move, followed by low effort periodic refinements. The fact that we probably won't have more than one static refinement per frame is totally fine since it should run in the same dispatch as the active tree's refinement.

There may be value in some form of batched refit. The API isn't immediately obvious, but if you changed a thousand bodies, they'd each do a lot of duplicate work near the top of the tree. On the other hand, the worst case might cost a handful of microseconds, so further optimization effort could be silly.

Check compound bounding box velocity expansion

I'm pretty sure when I implemented compound bounding box calculation initially, I did not bother to properly handle velocity expansion. It just uses the compound velocity to expand every child. That works okay for linear motion, but fails to handle the full arc induced by angular motion.

When it's time to revisit CCD, go back and confirm/fix this.

Accumulated impulse scaling helper for timestep changes

Impulses applied by constraints are instantaneous. In order to counteract a force of X with a time step duration of 1/60, a constraint has to apply an impulse of X/60 each time step.

Impulses are remembered from frame to frame for warm starting. If the time step duration changes significantly, then the previous frame's result will no longer be a good guess. Consider what happens in the example above when the next frame uses a time step duration of 1/180: the impulse is three times too big.

Updating the cached accumulated impulses isn't free, so ideally we wouldn't do it every frame under the assumption that the time step has changed. Instead, expose a function to do it explicitly. Could implement it on a per-set basis to let the user choose whether to apply the changes to inactive objects as well.
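
A minimal sketch of what such a helper could do (illustrative names, not an existing engine function):

// Rescale a warm start impulse when the timestep duration changes. Using the example above,
// going from dt = 1/60 to dt = 1/180 scales the cached impulse by (1/180) / (1/60) = 1/3.
static void ScaleAccumulatedImpulse(ref float accumulatedImpulse, float previousDt, float newDt)
{
    accumulatedImpulse *= newDt / previousDt;
}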

Consider world space overlap finders for compound/mesh collision tasks and sweeps

All shape acceleration structures are in the shape's local space. Collision tests and sweeps have to create local space bounding boxes to use those acceleration structures.

The local space bounds of a query can become enormous if the target compound/mesh has significant angular velocity.

In most cases, working in world space would significantly shrink the bounding box at the cost of increased per-node traversal cost. You could heuristically choose which path to take based on the difference in bounds and other per type information.

This is primarily concerning for long sweep tests. Collision tests tend to involve far smaller angular displacements.

(Some pair types may benefit from direct tree-tree overlap tests which would require much of the same logic as a world space traversal of a local acceleration structure.)

Simulation-wide queries

At the moment, the simulation lacks any query support. It would be helpful to offer convenience functions that find and dispatch tests for candidates.

Using a generic struct parameter as a filter would be useful for pruning out unwanted candidates during the broad phase queries.
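
A hypothetical shape for that kind of filter, shown only to illustrate the generic struct idea; none of these names are engine API:

// The filter is supplied as a generic struct type parameter, so the JIT can specialize the
// traversal and inline the test, making filtering effectively free.
public interface IQueryFilter
{
    bool AllowCandidate(CollidableReference candidate);
}

public struct AllowAllFilter : IQueryFilter
{
    public bool AllowCandidate(CollidableReference candidate) => true;
}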

Some commonly requested things:

  1. Ray casts.
  2. Sphere casts.
  3. General shape casts.
  4. Bounding volume queries.
  5. General shape volume queries.
  6. Batched variants of the above. Realistically, I suspect the batched variants will actually be the native form, and the one-shot overloads will be convenience wrappers. There are pretty huge potential speedups to be had by, for example, performing 1000 ray casts at once rather than doing 1 ray cast 1000 times.

Material properties review

Currently, constraints (including contact constraints) are tuned using spring settings defined by a natural frequency and damping ratio pair. This is a pretty convenient representation for tuning thanks to the mass independence, but there remain some concerns:

  1. We haven't yet settled on a good default tuning for contacts.
  2. Softness makes things like velocity targets (in contacts, MaximumRecoveryVelocity) difficult to tune.
  3. Contacts could benefit from a MaximumForce. Overhead of implementing this isn't clear, but if it's sufficiently cheap, it would be an easy way to support some interesting gameplay use cases- like swords biting into surfaces, or a car slamming into a breakable wall without negating its forward velocity on impact.
  4. There are still some user-visible pi values floating around the frequency tuning that we could get rid of.
  5. We have no convenience functions for converting between damping ratio + frequency and stiffness + damping. (A sketch of the conversion follows this list.)
  6. Contacts do not even support bounciness at the moment, except through the frequency/ratio induced bounce. I doubt frequency/ratio tuning will be able to take the place of the bounciness bias velocity hack.
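
The conversion mentioned in item 5 is just the standard spring-damper relations; a per-unit-mass sketch, keeping the mass independence of the frequency/ratio representation:

using System;

public static class SpringSettingsHelper
{
    // For a unit mass oscillator: angular frequency w = 2 * pi * frequency,
    // stiffness k = w^2, damping coefficient c = 2 * dampingRatio * w.
    public static void FrequencyAndRatioToStiffnessAndDamping(
        float frequency, float dampingRatio, out float stiffness, out float damping)
    {
        var angularFrequency = 2 * (float)Math.PI * frequency;
        stiffness = angularFrequency * angularFrequency;
        damping = 2 * dampingRatio * angularFrequency;
    }
}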

Bodies/Statics unification and API improvements

The bodies and statics collections are extremely similar. A great deal of the implementation could likely be shared usefully. Consider what this might look like.

Further, both could benefit from additional convenience functions, particularly around navigating handle->index indirections.

We probably should create a Body/Static wrapper-helper-struct thing, too, if only in the demos to show how it can be done. It would be something like:

struct Body
{
    // Wraps a handle and routes property access through the handle->index indirection.
    Bodies bodies;
    int handle;
    public Body(Bodies bodies, int handle) { this.bodies = bodies; this.handle = handle; }
    public ref RigidPose Pose => ref bodies.Pose[bodies.HandleToIndex[handle]];
}

Contact helpers

While the engine does not track collision data outside of generated constraints, a way to enumerate a body's existing contact constraints interpreted as contacts would be convenient for application logic. Contact mutability is a question mark; contact callbacks are the best place for that logic, but we could still expose it in the enumerator.

It would also be nice to have an example in the demos showing how to use contact callbacks to capture collision information in cases where constraints are not being generated.

Consider pushing simulation-level Bodies/Statics/Solver manipulation into subsystems

While Simulation.Add(ref BodyDescription) doesn't seem too strange by itself, things get more questionable when considering the existence of the Simulation.Bodies set and all of its exposed state. Things get even weirder when looking at something like Simulation.ApplyDescription(int, ref BodyDescription).

From an external perspective, it seems likely that keeping all the Bodies related stuff in the actual Bodies set (and likewise for Statics and the Solver) would be beneficial.

The only downside is that the Simulation level functions existed to handle changes in systems outside of the Bodies/Statics/Solver. For example, adding a body requires potentially adding a collidable to the broad phase.

There's nothing technically stopping the Bodies set from reaching out and making that modification itself, but it does imply greater explicit coupling between the different systems.

(Realistically, though, that coupling is already there. Making it explicit doesn't really make things worse.)

Consider box-box contact normal plane clipping

Box-box uses ye olde style face clipping on the plane of one of the two faces rather than the contact plane. It doesn't produce results as bad as triangle-box or triangle-triangle would with the same approach thanks to the representative face never being too far away, but it still isn't as good as performing all clipping on the contact normal plane.

Not super high value, but it would marginally improve contact quality and stability on edges near face representative transitions.

Consider vectorizing nonconvex manifold reduction

While it uses some weak AOS-ish vectorization already, some of the heavy lifting could be pushed into a wider format reasonably easily.

This is primarily a concern for cases where there are a large number of populated input manifolds. Dense meshes would benefit.

Value is a little bit limited; nonconvex reduction doesn't eat much of the frame. Probably best to only consider after all core features are done.

Simple collision filtering for demos

At the moment, the demos narrowphase callback just accepts all collisions regardless. We should instead provide an implementation (or two) of configurable collision filtering as examples.

One obvious option is a per-collidable int or long, where collision is permitted when maskA & maskB > 0. We may also want to show a filtering scheme similar to how v1 worked- but then again, maybe not, because v1's collision filtering was extremely complicated and slow.
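
The mask test itself is tiny; a sketch of the per-collidable data and the check, independent of however the narrowphase callback ends up wired, and with hypothetical names:

// Hypothetical application-side filter data, one entry per collidable.
public struct CollisionFilter
{
    public long Mask;
}

public static class FilterHelper
{
    // Collision is permitted only when the two masks share at least one bit.
    public static bool AllowCollision(in CollisionFilter a, in CollisionFilter b)
    {
        return (a.Mask & b.Mask) != 0;
    }
}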

The most difficult part of doing this is creating and maintaining per collidable data. It isn't stored inside the Bodies/Statics collections themselves, so you'll either have to mirror the structure, or store it in a handle-aligned array, or something more application specific. There may be room for some sort of Bodies/Statics helper- something that easily lets users create object aligned data, and which stays in sync as the bodies/statics are moved around in memory by simulation execution.

High precision poses

While the vast majority of the engine works in relative space where single precision floating point numbers are good enough, there are two places where it can be problematic:

  1. Body/static world space poses, and
  2. Broad phase bounding boxes.

While it wouldn't be a trivial change, it is relatively localized. We could use conditional compilation to swap out singles for doubles or even fixed point representations without much issue in poses.

For medium size worlds, we could avoid changing the broad phase bounding box representation by simply being conservative and expanding the bounding box to the next FP32 value. For extreme cases (e.g. planet scale and up, relative to human sized objects), you would need to change the broad phase's representation. Changing the broad phase would be quite a bit more painful and would come with a measurable performance penalty.
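
The conservative expansion mentioned above amounts to rounding the double precision bounds outward to the nearest representable floats; a sketch, assuming a runtime where MathF.BitDecrement/BitIncrement are available (.NET Core 3.0+):

using System;

public static class BoundsHelper
{
    // Convert high precision bounds to single precision, always expanding outward so the
    // float interval fully contains the double interval.
    public static void ToConservativeSingle(double min, double max, out float singleMin, out float singleMax)
    {
        singleMin = MathF.BitDecrement((float)min);
        singleMax = MathF.BitIncrement((float)max);
    }
}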

This feature will likely have to wait a while given the cost/benefit ratio.

Tree batch insertion

  • Broad phase incremental insertion makes up the majority of sleep/wake cost.
  • Incremental insertion is sequential in nature and destroys effective load balancing.
  • Incremental insertion tends to build pretty crappy trees.

You can do quite a bit better by exploiting the fact that islands are almost always composed of bodies that are close to each other. For example, throw a sweep builder at each island before adding the whole subtree at once. Even though the sweep builder is much more expensive per element, avoiding N traversals of the upper tree for incremental insertions would likely make it faster overall- and it produces a much higher quality tree.

Different islands can be built in parallel trivially, and larger islands could be internally multithreaded as well. This would likely remove broad phase manipulation as a bottleneck in activity management.

Subtree builders could even be made to realize that a given island would be best represented in multiple nodes in terms of SAH, but I doubt that this would be necessary in practice. Refinement would deal with such inefficiencies pretty quickly, and it would almost never be worse than incremental insertion.

This should be treated as depending on #6. The additional complexity associated with scheduling island constructions may be enough to warrant a more cohesive job system as well.

Consider removing CollidableReference kinematic vs dynamic distinction

At the moment, it doesn't look like the kinematic state is used anywhere. It was a holdover from when special case handling was being considered. Unless we bring back something like #17, there doesn't seem to be a reason for this.

It's also a little confusing for a user to get a CollidableReference in a callback that has three potential mobility types plus a handle, but there are only two potential handle spaces- bodies and statics.

Determinism test expansion

The current test (DeterminismTest) is pretty weak. It covers the universally shared bookkeeping, but fails to capture many complex types.

Probably wise to allow it to evaluate any generic demo.

Mesh-mesh support

The beta is shipping with no mesh-mesh support. There is no fundamental reason; I just didn't want to delay a packaged beta for a fairly low value feature.

The most significant chunk of work required to bring mesh-mesh online is fixing triangle-triangle. The current version sort-of-works, but I moved on before perfecting it. Currently, it can generate some goofy contacts at oblique angles. The key change is doing all contact clipping on the plane defined by the contact normal, similar to what was already done in the box-triangle case.

Given that mesh-mesh would tend to imply fairly complicated shapes, it would also benefit from dedicated tree vs tree traversal rather than the general purpose 'all children of A against local tree of B'.

Broad phase improvements

While the rewritten broad phase works pretty well and tends to be a pretty small chunk of simulation time, it still has a bunch of room to improve:

  1. It uses an explicit cache optimizer phase. This is pretty difficult to multithread meaningfully, and attempts to do so can actually end up producing an inferior result. The current implementation suffers from this.
  2. While refinement can keep quality reasonably high by SAH, it tends to stay below a full tree sweep builder. This is not a fundamental limit of incremental refinement.
  3. Self-tree and intertree testing uses almost entirely scalar codepaths. Attempts were made early on to vectorize it, but they all ended up slower. This isn't fundamental, and I am pretty sure we can do better with currently available codegen.
  4. The memory layout of nodes includes a bunch of metadata that is only accessed in specific stages. There is no reason for tree queries (which are vastly more performance sensitive) to suffer cache line inefficiency caused by locally unused fluff; split out what cannot be removed outright.

Both 1 and 2 above can be fixed together by tweaking refinement. The refit traversal can be used to directly identify optimal locations for nodes, even when running multithreaded. Note that the heuristically optimal memory location of any node can be computed from the location of the preceding subtree and the number of leaf nodes held within that preceding subtree. In other words, you don't need to actually traverse a subtree to know its exact memory region.

Further, we can specialize the root's traversal. Rather than using a fixed size for every refinement, we could use an internally multithreaded root refinement plus supplementary locally sequential refinements that run in parallel with it. A larger root refinement combined with a better heuristic for choosing which parts of the tree to include in the root refinement could significantly improve the tree's SAH, even sometimes exceeding that of a single shot sweep builder.

Extending the above refinement improvements to the static tree will require a little more effort, since you ideally wouldn't refit an entire static/inactive tree- it could be an order of magnitude or two larger than the active one!

Can't compile with preview nugets

When trying to compile the project I get errors with System.Numerics.Vectors and System.Runtime.CompilerServices.Unsafe not being found.

Reverting to stable v4.4.0 versions on both packages fixes the problem.

If you don't require any feature present in the nuget previews, I would suggest switching back to the v4.4.0 nugets.

Tree-including shapes require referenceless representation for buffer storage

To avoid storing GC references in the untyped memory backing the ShapeBatch buffers, a mesh shape that requires a reference to a tree will need some form of referenceless proxy, or we need to revamp the tree to be referenceless.

Making the tree itself referenceless would not be very difficult. The only concern would be confusing users who are used to reference type behavior. However, direct use of the Tree type is likely going to be rare, and we already have precedent for such behavior in the Quick* collections.

Consider reordering collision task subdispatches for instruction cache friendliness

On more complex pair types like box-mesh, the instruction pipeline can become a bottleneck. It's possible for execution to jump from convex-mesh, to box-triangle, to mesh reduction, to nonconvex reduction, to continuation callback, and back again- potentially multiple times in a single batch.

By deferring dispatch of subtasks until after an entire batch has been processed, instruction cache eviction should become far less frequent and severe. Making this change will require a little more temporary cache usage, but it would likely be worth it in nested cases.

Total impact of this change would likely be small- perhaps 5% off of relevant simulations. Implementation is pretty simple.

Non-contact constraint zoo

We have a ball socket joint, but no other constraints at the moment. We should implement the most used of the v1 constraints, and possibly build block solved variants of some of the more common 4-5 DOF constraints.

Important v1 types:

  1. Ball socket
  2. Distance
  3. Point on line
  4. Point on plane
  5. Linear and angular welds
  6. Revolute angular
  7. Swivel hinge angular
  8. Distance limit
  9. Linear axis limit
  10. Swing limit
  11. 3DOF angular motor
  12. 1DOF linear motor
  13. 1DOF angular motor

Good candidates for dedicated block solving:

  1. Welds
  2. Line slider
  3. Plane slider
  4. Prismatic
  5. Revolute
  6. Swivel hinge

Probably won't bother block solving limits with the rest of the constraint; it's a lot of complexity for not-particularly-huge benefit. Motors would probably also be kept separate for the sake of simplicity.

Less important:

  1. Twist/universal
  2. Twist limit
  3. Twist motor
  4. Angular twist motor (complements twist/universal joint)
  5. Revolute limit (possibly subsumed by swing limit)

Much less important:

  1. EllipseSwingLimit

Some time should be spent considering ways to simplify and unify the zoo. Consolidating a few constraints into one (should that be possible) would likely be worth a small performance hit, unless specific configurations are overwhelmingly common.

Some questions remain about single body constraints. They would likely be redundant with existing two body constraints, but the one body variant would have a performance advantage. They would have to be extremely common to warrant consideration.

Jacobi hybrid solver fallback for excessive batch counts

Constraint batches can only contain one reference to any one body. If two constraints involve the same body, there will be at least two constraint batches in the solver.

If five hundred constraints involve the same body, there will be at least five hundred constraint batches in the solver.

Each constraint batch requires a sync point in multithreaded execution. In pathological cases, this can become a significant cost.

To avoid this, a jacobi solver could be used for all constraints that would have otherwise belonged to constraint batches beyond some (possibly user set) threshold. So, rather than having 500 constraint batches, you would have N constraint batches and 1 jacobi batch.

This would make use of the usual trick to ensure stability- treat each body as only part of the whole, and then average the velocities in a post-solve gather step. Equivalent to treating each body as a bunch of separate bodies connected by 6DOF weld constraints.

The downside to the mass splitting trick (and the reason why having a user-set threshold is a good idea) is that the jacobi batch will tend to be less rigid than PGS/SI batches. Fortunately, the case where you have a ton of constraints associated with one body also tends to be the case where that one body is very, very heavy, and it probably won't be noticed. Think about a bunch of characters running around on a spaceship.
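
A sketch of the post-solve gather step described above (names are illustrative, not engine API):

using System;
using System.Numerics;

public static class JacobiGather
{
    // Each constraint in the jacobi batch solved against its own copy of the body's velocity;
    // averaging the copies produces the body's final velocity, as if the duplicates were
    // welded back together.
    public static Vector3 GatherLinearVelocity(ReadOnlySpan<Vector3> perConstraintVelocities)
    {
        var sum = Vector3.Zero;
        foreach (var velocity in perConstraintVelocities)
            sum += velocity;
        return sum / perConstraintVelocities.Length;
    }
}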

BodyDescription convenience constructors are inconvenient

Too many overloads. Too hard to intuit parameter order. Often less confusing to just use the initializer style, which means the constructors have failed at their only job. Prune out some of them and try to lay out the parameters so adding more optional properties flows naturally.

NUnit tests

Just wondering... why not have NUnit tests?

They're very useful for debugging parts of a library, and they can also help with learning how to use it.

Batched child lookups for tree-involving collision tasks

While child lookups in mesh/compound tasks are not a dominant cost, they still show up quite clearly in profiling.

By virtue of dealing with a smattering of different convexes colliding with potentially different meshes or different areas of the same mesh, memory latency can be a concern. Currently, the per-pair lookups are handled sequentially, so the CPU spends a lot of time stalled. The intersection routines are also not vectorized.

Both of these issues can be addressed by dispatching a bunch of independent loads and then performing vectorized tests on the results.

This will require some overhead. There is no shared state between the different traversals, unlike the RayBatcher. Each step of the traversal will generate a new list of node targets which need to be gathered into a vector friendly format. Given the lack of knowns, that gathering process will likely need to also pull the query data into position.

This concept can also extend to ray tests. It's unlikely to beat the RayBatcher for bulk testing, but it's possible some variant could win on small packets. Ray tests will require a little more care in traversal ordering as usual, but there's nothing blocking that conceptually.

Potential issues:

  1. While this can hide a lot of memory latency, it introduces a bunch of gather work and instruction overhead. It could easily end up slower. Platform intrinsics could help if it turns out codegen issues are the main concern.
  2. The parallel traversal stack can become relatively large. If the batched test count is too high, it could easily start evicting other task batch data from L1. For collision tasks, batch size will be limited.

Consider cutting down on pool-related noise in Quick* collections.

Right now, in order to resize a QuickDictionary, the user has to supply three different pools, each specialized for the appropriate type. This is annoying.

You could avoid the need for multiple specialized pools by changing the pool interface to have a function that is able to return buffers of generic type. A little bit of trickiness required to minimize syntax noise from struct wrapping a BufferPool, but in the worst case it will still be a strict improvement over the status quo.
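
A sketch of what that pool interface change might look like (illustrative only; the actual types would live in BepuUtilities):

using BepuUtilities.Memory;

// A single pool that can hand out buffers of any element type removes the need to pass one
// specialized pool per element type into the Quick* collections.
public interface IUnmanagedPool
{
    void Take<T>(int minimumCount, out Buffer<T> buffer) where T : struct;
    void Return<T>(ref Buffer<T> buffer) where T : struct;
}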

CCD type dispatchers need to be hooked up

While the contact reduction heuristics for substepped and inner sphere pairs are (ostensibly) implemented, the narrow phase does not yet create entries in the streaming batcher or continuations set for non-discrete pairs.

Check compound collision task child indices

There seems to be some wonkiness in the handling of the flip mask in compound collision task child indices. If they're wrong, it'll make callbacks pretty difficult to interpret. Double check it.
