In the last post, we generated a constrained delaunay triangulation so that we can use it as a navmesh for some world entities. First, we want to make sure that we can generate this fast enough for a good number of points, because if we can’t we won’t be able to use it in any real time scenario. In my game, each entity is composed of 4 points (rectangles), so 2000 points means 500 entities. We will use this as our test case as this seems like a pretty high bar for one real case triangulation. I created a test case outside of the game for faster iterations, with a random splat of 2000 points.
The first implementation is very naive, everything is stored in a Triangle array, with triangle being:
union Triangle
{
struct
{
u32 a;
u32 b;
u32 c;
};
u32 E[3];
};
Triangle* trianglesArray = PushArray(arena, Triangle, maxTriangleCount);
u32 triangleCount = 0;
I have no acceleration structure of any kind, any search needs to search over all triangles everytime.
For measurements, I used msvc version 19.34.31937, with 02, and use the rdtsc intrinsics. We are testing triangulation, followed by a delaunay transformation, and applying very simple 3 edges constraints to it.
The very first naive implementation is horrrible. It takes around 570ms on average to do a triangulation, with 99.7% of the time being in the delaunay step.
Adding Functionnality: Adjacency
One of the main feature missing from the previous post, was triangle adjacency. We will store for each edge of the triangle that is on the other side of that edge. We will need this for navigation for sure, but this could be a significant speed up as we iterate over all triangles to find the triangle that shares a specific edge for now.
The main idea is that the algorithm is a hull expansion so each time we incorporate a new point, we share an edge with the hull but we know which one so we can store the information.
We store the hull as an array of point indices, but the edges are already represented as two consecutive indices in this array. We will store an array of the same size as the hull with the triangle index and each element will correspond to the triangle adjacent to the edge hull[i]- hull[i+1]. On the image, if our index is 0 we are on point 0, and the corresponding edge will be [0,1] which is an edge of triangle A.
With this set up, when we add a triangle, we know which hull edge generated it. So let’s store the same information for each triangle. Each time we add a triangle after selecting a point, we know that one edge is shared with the hull. And each time we expand the hull forward or backward 2 edges are shared. Then update the triangle we’re linked to.
With this implemented, we can speed up the delaunay pass, as instead of looking for shared edges with every triangle, we just take the 3 we are interested in with the adjacency structure.
Triangle* trianglesArray = PushArray(arena, Triangle, maxTriangleCount);
Triangle* adjacencyArray = PushArray(arena, Triangle, maxTriangleCount);
u32 triangleCount = 0;
// This is how you get the adjacent triangle
Triangle* adjacent = trianglesArray + adjacencyArray[triangleIndex].E[edgeIndex];
By adding this, we go down from 570ms to 5.4ms, which is a 10x speedup, making it better but not quite fast enough. Here is the breakdown after this pass :
Delaunay 8 600 000 cycles, 52% of the time
Enforcing Constraints, 5 900 000 cycles, 40% of the time
Triangulation 1 420 000 cycles, 8% of the time
Incorporating fix delaunay into the triangulation
My current version has the transformation into a delaunay triangulation as a second pass but this is wasteful, as we have to iterate over every triangle to check circumcircles until we see no ill triangles. But we can do this at a much better time: when creating the triangles. Even better, every time we know which edge is shared with the hull, so no need to check the other ones.
We must be careful thought to correctly update the hull triangles when you flip an edge, as its corresponding triangle may change. With this simple change, we see another speed up, which brings us to 3.5ms.
Gains: -3 500 000 cycles
Improving Line Intersection
According to my profiler, a lot of time is being spent in the setup of the enforce edge function, specifically in the line intersection. This is a pretty handwavy version, with conditions and divisions :
// @Note lines are f1-t1 vs f2-t2
internal bool intersects(Vector2 f1, Vector2 t1, Vector2 f2, Vector2 t2)
{
bool result = false;
Vector2 edgeDir = t1 - f1;
Vector2 dir = t2 - f2;
r32 m1 = edgeDir.y / edgeDir.x;
r32 c1 = f1.y - m1 * f1.x;
r32 m2 = dir.y / dir.x;
r32 c2 = f2.y - m2 * f2.x;
if(m1 - m2 != 0)
{
r32 interX = (c2 - c1) / (m1 - m2);
r32 interY = m1 * interX + c1;
Vector2 point = v2(interX, interY);
bool onfirst = (point.x >= Minimum(f1.x, t1.x)) && (point.x <= Maximum(f1.x, t1.x)) && (point.y >= Minimum(f1.y, t1.y)) && (point.y <= Maximum(f1.y, t1.y));
bool onsec = (point.x >= Minimum(f2.x, t2.x)) && (point.x <= Maximum(f2.x, t2.x)) && (point.y >= Minimum(f2.y, t2.y)) && (point.y <= Maximum(f2.y, t2.y));
result = onfirst && onsec;
}
return result;
}
Let’s write an intersection routine that is only multiplication and substractions:
// @Note lines are f1-t1 vs f2-t2
internal bool intersects(Vector2 f1, Vector2 t1, Vector2 f2, Vector2 t2)
{
r32 x1x3 = (f1.x - f2.x);
r32 x1x2 = (f1.x - t1.x);
r32 y3y4 = (f2.y - t2.y);
r32 x3x4 = (f2.x - t2.x);
r32 y1y3 = (f1.y - f2.y);
r32 y1y2 = (f1.y - t1.y);
r32 tUp = x1x3 * y3y4 - y1y3 * x3x4;
r32 tDown = x1x2 * y3y4 - y1y2 * x3x4;
#define SAME_SIGN_32(a, b) ((((u32)(a) ^ (u32)(b)) & 0b10000000000000000000000000000000) == 0)
bool tIntersec = SAME_SIGN_32(tDown, tUp) && (absolute(tDown) >= absolute(tUp));
r32 uUp = x1x3 * y1y2 - y1y3 * x1x2;
bool uIntersec = SAME_SIGN_32(tDown, uUp) && (absolute(tDown) >= absolute(uUp));
return tIntersec && uIntersec;
}
Gains: -200 000 cycles
Update: The version above is slightly wrong, as it does not handle parrallels correctly (we need to check that the result would not be 0), and the same sign is wrong for the [-1, 0] range, as -0.2f will be (u32)-0.2f = 0
So here the revised version:
internal bool intersects(Vector2 f1, Vector2 t1, Vector2 f2, Vector2 t2)
{
r32 x1x3 = (f1.x - f2.x);
r32 x1x2 = (f1.x - t1.x);
r32 y3y4 = (f2.y - t2.y);
r32 x3x4 = (f2.x - t2.x);
r32 y1y3 = (f1.y - f2.y);
r32 y1y2 = (f1.y - t1.y);
r32 tUp = x1x3 * y3y4 - y1y3 * x3x4;
r32 tDown = x1x2 * y3y4 - y1y2 * x3x4;
#define SAME_SIGN_32(a, b) (((*(u32*)(&(a))) ^ (*(u32*)(&(b)))) & 0b10000000000000000000000000000000) == 0
r32 tDownAbs = absolute(tDown);
r32 tUpAbs = absolute(tUp);
bool tIntersec = SAME_SIGN_32(tDown, tUp) && (tDownAbs >= tUpAbs) && (tUpAbs > GAME_FLOAT_EPSILON);
r32 uUp = x1x3 * y1y2 - y1y3 * x1x2;
r32 uUpAbs = absolute(uUp);
bool uIntersec = SAME_SIGN_32(tDown, uUp) && (tDownAbs >= uUpAbs) && (uUpAbs > GAME_FLOAT_EPSILON);
return (tDownAbs > GAME_FLOAT_EPSILON) && tIntersec && uIntersec;
}
Remove waste
One question I always ask myself when optimizing is : are we doing some operations that are just not necessary ? And there is indeed such a thing here. Each time before flipping an edge, we check if the 4 points forms a convex polygon, because flipping an edge of a concave polygon would not make sense. But it appears that isInCircleDefinedByTriangle
is a strong enough guarantee to not have ill triangles.
if(isInCircleDefinedByTriangle(p[t->a], p[t->b], p[t->c], p[t2->E[r2.uniqueT]])
&& formAConvexPolygon(r, r2, *t, *t2, p))
{
flipEdge(triangleIndex, j, triangles, adjacency, r, r2);
}
can become :
if(isInCircleDefinedByTriangle(p[t->a], p[t->b], p[t->c], p[t2->E[r2.uniqueT]]))
{
flipEdge(triangleIndex, j, triangles, adjacency, r, r2);
}
Gain : - 300 000 cycles
Hull point selection
Another improvement is to have a smater plan to find the first point of the hull to expand (reminder if you forgot how it works). The one that is used is a simple loop over everybody :
u32 hullIndex = 0;
r32 best = R32Min;
for(u32 j = 0; j < hullCount; ++j)
{
Vector2* v2 = points + hull[j];
r32 r = dotProduct(*v2, projectAgainst);
if(r > best)
{
best = r;
hullIndex = j;
}
}
The problem is that we iterate over all the points here, which we do not have to because we know one thing : the hull is convex so the derivative of our projections on the selection axis will change sign at some points, as we iterate over the hull. Let’s use a diagram to illustrate what I mean:
What I can do instead is take the first 2 consecutive points of the hull and project them. Taking the difference between the two, I can know which way to go to find the maximum. Next time I find a solution smaller than the one I have, I have found my closest point.
The code is a little uglier maybe, but we know that we will check only half of the points, which is an improvement, even if the hull is never really huge. Here what the code looks like :
u32 hullIndex = 0;
r32 best = R32Min;
r32 zeroScore = dotProduct(points[hull[0]], projectAgainst);
r32 oneScore = dotProduct(points[hull[1]], projectAgainst);
// @Note positive delta
if(zeroScore < oneScore)
{
hullIndex = 1;
best = oneScore;
r32 previous = oneScore;
for(u32 j = 2; j < hullCount; ++j)
{
Vector2* v2 = points + hull[j];
r32 r = dotProduct(*v2, projectAgainst);
if(r > best)
{
best = r;
hullIndex = j;
}
if(r < previous)
{
break;
}
previous = r;
}
}
else // @Note negative delta
{
best = zeroScore;
r32 previous = zeroScore;
for(u32 j = hullCount - 1; j > 1; --j)
{
Vector2* v2 = points + hull[j];
r32 r = dotProduct(*v2, projectAgainst);
if(r > best)
{
best = r;
hullIndex = j;
}
if(r < previous)
{
break;
}
previous = r;
}
Gain: - 130 000 cycles
Determining third point
Once we know the point on the hull that is closest, we need to find a second one to form a triangle, but we need to decide wether to take the next point on the hull or the previous. What I did to determine this was kinf of too complex. I took the edge of the hull from the selected to the next, took the perpendicular, and projected the third point onto it. If the result was greater than the original point, then this third point was on the correct side, and we selected it.
What I can do instead is a cheaper operation, to test if it is clockwise. If A-C-Next is clockwise, select next, otherwise select previous.
Gain: - 50 000 cycles
More direct adjacency indexing
Upon inspection, a lot of time in this version was spent trying to know the index of the shared edge in a triangle. A triangle is composed of 3 points, but my adjacency structure only contains an index to a triangle, but not which edge in that triangle it belongs to. Which means that when updating an adjacency triangle, we have to find the correct edge, which implies a small loop over the 3 edges, every time we add a triangle or flip an edge.
internal void fixAdjacency(u32 adjacencyIndex, u32 a, u32 b, u32 index, Triangle* triangles, Triangle* adjacency)
{
if(adjacencyIndex)
{
u32 fix = index + 1;
u32 ind = (adjacencyIndex - 1);
Triangle* tAdj = triangles + ind;
u32 adjIndex = 2;
for(u32 i = 0; i < 2; ++i) // @Speed
{
if(((tAdj->E[i] == a) && (tAdj->E[i + 1] == b)) || (((tAdj->E[i] == b) && (tAdj->E[i + 1] == a))))
{
adjIndex = i;
break;
}
}
adjacency[ind].E[adjIndex] = fix;
}
}
Instead of this, let’s store directly this sub index with 2 bits.
<30 bits: triangle Index> | <2 bits: edge index>.
This incurs a limitation of only 2^30 triangles but this will be fine for my purposes. Tis means that the above function is now only :
adjacency[triangleIndex].c = triangleFrom;
adjacency[(triangleFrom >> 2) - 1].E[triangleFrom & 0b11] = (triangleCount << 2) | 0b10;
Gain: -495 000 cycles.
Better sort strategy
This one may be due to a poor implementation on my part as, as everything else I wrote my own sort routines. And we need to sort points due to their distance to the the first circle center. I used mergesort, but just switching to a radix sort instead (which is very easy, as the key is already 32 bits) gives us a pretty big gain. Maybe my merge sort is slower than it should be, which I should check soon as it is the one use for sorting 2d layers.
Gain: -300 000 cycles.
Use look up table instead of mod operation
The adjacency structure now gives us directly the shared edge, but we do need the other two points for flipping edge which implies:
u32 pointIndex = adjacency.E[triangleIndex] & 0b11;
u32 pointIndex2 = (pointIndex + 1) & 3;
u32 unique = (pointIndex + 2) & 3;
This code gets called a lot, and the mod operation is not the cheapest. So using a very small look up table instead, give some gains
globalVariable u8 table[3][2] =
{
{1,2},
{2,0},
{0,1},
};
u32 pointIndex = adjacency.E[triangleIndex] & 0b11;
u32 pointIndex2 = table[pointIndex][0];
u32 unique = table[pointIndex][1];
Gain: -30 000 cycles
Shared fixed delaunay stack
When we flip an edge, we must check that the other adjacent triangles still holds the Delaunay condition. For this we use a small stack that looks like this.
struct DelaunayStack
{
u32 triangleIndex;
u32 pointIndex;
};
// And in fixDelaunay function
void fixDelaunay(u32 triangleIndex, u32 pointIndex, Triangle* triangles, Triangle* adjacency)
{
DelaunayStack stack[128];
stack[0].triangleIndex = triangleIndex;
stack[0].pointIndex = pointIndex;
u32 stackCount = 1;
while(stackCount)
{
DelaunayStack s = stack[--stackCount];
// @Note flipEdge stuff
}
}
And I also had multiple fixDelaunayCalls, as I could only pass one triangle index and pointIndex in. Instead I now inject the stack in the call preparing and sharing the stack outside.
Gains: -85 000
Failed SIMD attempt
I tried to accelerate the hull update when flipping an edge with SIMD. When you flip an edge, some edges of the hull have their correpsonding triangle change and for that I use a simple array iteration, looking for the desired edge. I though that maybe a SIMD version could speed that up, allowing a faster search comparing 4 elements at a time. But in the end my version was slower because of the number of indices in the hull probably not being large enough to get decent speed out of it.
Comparisons with existing Libraries
Now that we have pretty decent speed ( almost half a millisecond for 2000 points), let see how we compare to existing libraries in this space. Here are the one I tested against :
Delaunator, bl4ckb0ne, Geompack and CDT, all libraries being distirbuted on Github and easy to integrate. There are some more libraries out there like CGAL, but they are often huge, and hard to integrate in a small test project.
Each run is the average of 100 triangulations of 2 000 random points, generated with the rand() C library function. We calculate the number of cycles using the rdtsc intrinsic. There is no edge constraints here only triangulation, as all libraries does not provide this capability. Kiroxas is my version.
As you can see, I’m doing pretty good on this, as I’m the fastest of the bunch. It is good to keep in mind that this shows that my solution is the best for my needs, as the other libraries may provide lots of intersting features that I do not have or handle cases I do not handle. But if you flip this around, why would I want to pay the price for features that I do not need ? And if you do need a library for this, shoutout to Delaunator which is pretty fast, and very simple to integrate.
So, we now have a pretty fast triangulation routine which means I can now generate dynamic navmesh without it being a bottleneck. But we did not do anything interesting with them yet. This is now the next step, being able to generate path on those navmesh so that entity in the game can use them.
Thank you for reading.