
The impossible mission of 1-pixel-thick lines with GL_LINES

Started by Aybe One, August 30, 2024 09:45 PM
35 comments, last by Aybe One 1 week, 1 day ago

Bad news…

I switched to 3840x2160 to “assess” the power of compute shaders, got an amazing 5 FPS…
Profiled, switched to ComputeShader.DispatchIndirect; it's smooth but GPU use is now 95%!!!*

Tried changing thread group sizes, sending nothing; the call itself is still as slow as before…
Doing nothing but assigning a color in the shader brings it down to 35% usage, but that's useless.

The only thing I need is to draw 3840 vertical lines, but that appears to be a problem today, lol.
This, besides the hackish approach I've posted before using the GL class, which works fast.

Honestly, I don't know what to think at this point…

* it is smooth but only renders 1 group (256x256); not sure why, and not sure it's worth figuring out either

Aybe One said:
The only thing I need is to draw 3840 vertical lines, but that appears to be a problem today,

Nothing new to rendering.

Be sure you are making one call to render 3840 segments, not 3840 calls to render 1 segment each.

And again, make sure you learn about the diamond exit rule that you keep avoiding. Line rasterization is not pixel placement into a frame buffer; make sure your line segments actually fill pixels.


Aybe One said:
I switched to 3840x2160 to “assess” the power of compute shaders, got an amazing 5 FPS… Profiled, switched to ComputeShader.DispatchIndirect, it's smooth but GPU use is now 95%!!!*

You iterate over ALL lines in each thread of each invocation of your shader:

for (int i = 0; i < LinesCount; i++)

In other words: for each of (all) 8 million pixels you iterate over (all) 4 thousand lines - roughly 30 billion loop iterations per frame. /:O\

This is like saying ‘hey, I have 4k little cores - let's make sure that each of those cores processes the same and as much work as possible’.

Of course it's slow. It's as inefficient as possible. You deserve the bad results as a reminder that you should do at least a bit of optimization.

A trivial approach would be:

Notice each invocation draws a tile of 8x8 pixels.

1. Bin the lines to those tiles, so each tile gets its own list of lines intersecting it.
Technically this can be done with a shader processing one line per thread, increasing an atomic counter per tile in VRAM. That's not the fastest option, but it's simple and should be fast enough.
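
A minimal sketch of this counting pass, assuming made-up names (TileCounts, TilesX, TILE) and a Line struct mirroring the fields your shader uses - this is illustration, not your actual code:

#define TILE 8

struct Line { int X1; int Y1; int Y2; float4 Color; };

StructuredBuffer<Line> Lines;
RWStructuredBuffer<uint> TileCounts; // one counter per 8x8 tile, cleared to 0 beforehand
uint LinesCount;
uint TilesX; // horizontal resolution / TILE

[numthreads(64, 1, 1)]
void CountLines(uint3 id : SV_DispatchThreadID)
{
	if (id.x >= LinesCount)
		return;

	const Line data = Lines[id.x];

	// a vertical line only touches tiles in a single column
	// (assumes non-negative pixel coordinates)
	const uint tx = data.X1 / TILE;
	const uint ty0 = min(data.Y1, data.Y2) / TILE;
	const uint ty1 = max(data.Y1, data.Y2) / TILE;

	for (uint ty = ty0; ty <= ty1; ty++)
		InterlockedAdd(TileCounts[ty * TilesX + tx], 1);
}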

2. After the counts for all tiles (or ‘bins’) are known, calculate a prefix sum over the counters to know how much memory you need to store all the per-tile lists. Each tile then knows the beginning and end of its local list of lines.
Technically this involves one counter per tile, i.e. (horizontal res / 8) x (vertical res / 8) counters to scan.
If it's only 4K lines, I would do the prefix sum in a single workgroup of maximum size (which is 1024); this should be faster and much simpler than using and synchronizing multiple workgroups, which would also increase the number of dispatches.
At this point you can also set up indirect dispatches for the tiles which are not empty for the next step.
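
A sketch of that single-workgroup scan - a generic chunked Hillis-Steele scan over the counters, same made-up names as above, untested:

RWStructuredBuffer<uint> TileCounts;  // per-tile counts from the pass above
RWStructuredBuffer<uint> TileOffsets; // exclusive prefix sum: start of each tile's list
uint NumTiles;

groupshared uint temp[1024];
groupshared uint carry; // running total of all previous chunks

[numthreads(1024, 1, 1)]
void PrefixSum(uint3 gtid : SV_GroupThreadID) // dispatched as a single workgroup
{
	const uint t = gtid.x;
	if (t == 0)
		carry = 0;
	GroupMemoryBarrierWithGroupSync();

	// process the tiles in chunks of 1024
	for (uint base = 0; base < NumTiles; base += 1024)
	{
		const uint i = base + t;
		const uint v = (i < NumTiles) ? TileCounts[i] : 0;
		temp[t] = v;
		GroupMemoryBarrierWithGroupSync();

		// Hillis-Steele inclusive scan of this chunk
		for (uint s = 1; s < 1024; s <<= 1)
		{
			const uint add = (t >= s) ? temp[t - s] : 0;
			GroupMemoryBarrierWithGroupSync();
			temp[t] += add;
			GroupMemoryBarrierWithGroupSync();
		}

		// exclusive offset = inclusive sum - own value + total of previous chunks
		if (i < NumTiles)
			TileOffsets[i] = carry + temp[t] - v;
		GroupMemoryBarrierWithGroupSync();

		if (t == 0)
			carry += temp[1023]; // add this chunk's total
		GroupMemoryBarrierWithGroupSync();
	}
}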

3. Finally, draw the lines. We get one invocation for each tile that is not empty, and we have a list of lines covering this tile. So we draw the lines, but we need to clip them to the tile so we do not draw pixels multiple times from lines that span multiple tiles.
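
For the draw pass, a sketch; it assumes a scatter pass (not shown) has filled a TileLines buffer with per-tile line indices using the offsets above, and for simplicity it dispatches one 8x8 workgroup per tile (TilesX x TilesY) instead of indirectly over only the non-empty ones:

struct Line { int X1; int Y1; int Y2; float4 Color; }; // as above

RWTexture2D<float4> Result;
StructuredBuffer<Line> Lines;
StructuredBuffer<uint> TileLines;   // line indices, grouped per tile by the scatter pass
StructuredBuffer<uint> TileOffsets; // start of each tile's slice in TileLines
StructuredBuffer<uint> TileCounts;
uint TilesX;
float4 Clear;

[numthreads(8, 8, 1)]
void DrawTile(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID)
{
	const uint tile = gid.y * TilesX + gid.x;
	const int2 xy = int2(gid.xy) * 8 + int2(gtid.xy); // assumes the resolution is a multiple of 8

	// each pixel is owned by exactly one thread, so there are no write hazards
	// and no double-drawn pixels even when a line spans several tiles
	float4 color = Clear;

	const uint begin = TileOffsets[tile];
	const uint end = begin + TileCounts[tile];
	for (uint i = begin; i < end; i++)
	{
		const Line data = Lines[TileLines[i]];
		if (xy.x == data.X1 && xy.y >= min(data.Y1, data.Y2) && xy.y <= max(data.Y1, data.Y2))
			color = data.Color;
	}
	Result[xy] = color;
}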

Alternatively, we could make sure each line appears in only one bin, by treating it as a point (the center or first point of the segment). Then there is no need for clipping and one thread just draws one whole line.
I guess this would be a bit faster if all your segments have a similar length.

The whole process is very similar to the common example of binning many lights to a low-res screenspace grid, often done in deferred rendering to handle large amounts of lights.
It's a good exercise to learn some parallel programming basics, but it's still a serious effort just to draw some lines. So if you have no interest in GPU and parallel programming right now, there should still be another way that is fast enough.
I see you only use vertical lines. But then it should be possible to match your compute reference with standard line rendering. Since they are all vertical, fill rules and subpixel conventions should not cause issues or confusion, I would assume.

Another property (which I have ignored in my general proposal above) is the fact that your lines are probably already sorted horizontally, and that in each pixel column only one line will be drawn.
This property should allow for a simpler, more specific optimization, ideally not requiring multiple dispatches for binning (which is always some kind of sorting as well).
That's surely the way to go if conventional triangle rasterization indeed fails.

…thinking of it, the simplest compute solution would be probably:

Create one GPU thread per line, and draw the complete line within this single thread using a loop.
Only one dispatch is needed, and there is no binning and no regular grid acceleration structure.

Potential problems:

It can be inefficient, e.g. if all segments have a length of 3 pixels but one line has a length of 1000 pixels, since the short threads end up waiting for the longest one. But I guess this can't happen to you.

Multiple threads might attempt to draw the same pixel at the same time, creating flicker due to write hazards. So you probably need to use atomicMax to draw a pixel.
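
A minimal sketch of this variant; it assumes the target is additionally bound as an R32_UINT UAV (atomics need a uint view), and the packing helper is made up:

struct Line { int X1; int Y1; int Y2; float4 Color; };

StructuredBuffer<Line> Lines;
RWTexture2D<uint> ResultU; // the render target viewed as R32_UINT
uint LinesCount;

uint PackColor(float4 c) // hypothetical 8:8:8:8 packing; 'larger' wins the atomic
{
	const uint4 b = uint4(saturate(c) * 255.0 + 0.5);
	return (b.a << 24) | (b.r << 16) | (b.g << 8) | b.b;
}

[numthreads(64, 1, 1)]
void DrawLines(uint3 id : SV_DispatchThreadID)
{
	if (id.x >= LinesCount)
		return;

	const Line data = Lines[id.x];
	const uint packedColor = PackColor(data.Color);
	const int x = data.X1;

	for (int y = min(data.Y1, data.Y2); y <= max(data.Y1, data.Y2); y++)
	{
		// concurrent writers resolve deterministically to the max value
		InterlockedMax(ResultU[int2(x, y)], packedColor);
	}
}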

frob said:

Aybe One said:
The only thing I need is to draw 3840 vertical lines, but that appears to be a problem today,

Nothing new to rendering.

Be sure you are making one call to render 3840 segments, not 3840 calls to render 1 segment each.

And again, make sure you learn about the diamond exit rule that you keep avoiding. Line rasterization is not pixel placement into a frame buffer; make sure your line segments actually fill pixels.

Yes… I think I am starting to understand that N-calls issue, for the compute shader specifically…

I looked at the diamond exit rule, but the OpenGL specs aren't very helpful (page 65)…

But one thing they do say is that an implementation might differ; more on that later.

Looking at DirectX rasterization rules, it's already clearer as there's an image:

[Illustration: examples of aliased line rasterization]

I was then able to “live-test” using that handy line-rasterization project:

So in practice, I do the following for a pixel-perfect vertical line, but with extra vagueness (sketched in code after the list):

  • X fractional part must be 0.5 else it overlaps between 2 pixels, OK
  • Y2 must be Y1 + line height, i.e. 1 extra pixel that won't get drawn, OK
    • BUT that extra for Y2 can be from ~0.75 to ~1.25, WEIRD
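
Expressed as code, the rules I ended up with look roughly like this (names are mine, not an API):

// pixel-perfect vertical line covering 'height' pixels starting at pixel (x, y)
void VerticalLineEndpoints(int x, int y, int height, out float2 p1, out float2 p2)
{
	p1 = float2(x + 0.5, y);          // X fraction must be .5 or the line straddles 2 columns
	p2 = float2(x + 0.5, y + height); // 1 extra pixel at the end that won't get drawn
	// oddly, the extra on p2.y tolerates anything from ~0.75 to ~1.25
}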

In the end, the rules I inferred from the somewhat helpful docs I could find are no different from what I found by experimenting manually; i.e. the problem is solved, although I admit I don't fully get it…

Also, to me, GL.LoadPixelMatrix is doing something special we're not told about. To confirm this, I did a manual orthographic projection, and although it works, it isn't as stable as theirs: either the result differs between MSAA and no MSAA, or the rules for horizontal/vertical lines have to be swapped… No idea what, but it's doing something.
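
For reference, the manual mapping I tried was along these lines (a sketch, not my exact code; the Y flip and any half-pixel offset are exactly where things seem to diverge from GL.LoadPixelMatrix):

// maps pixel coordinates [0..w] x [0..h] to clip space [-1..1] x [-1..1]
float4 PixelToClip(float2 p, float w, float h)
{
	// pixel (0,0) at the top-left; whether a half-pixel offset belongs here
	// (and per API) is the unstable part
	return float4(2.0 * p.x / w - 1.0, 1.0 - 2.0 * p.y / h, 0.0, 1.0);
}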

In the end, I stick to GL.LoadPixelMatrix and abide by the rules it has put in place.

To be continued… 🤣

JoeJ said:

Aybe One said:
I switched to 3840x2160 to “assess” the power of compute shaders, got an amazing 5 FPS… Profiled, switched to ComputeShader.DispatchIndirect, it's smooth but GPU use is now 95%!!!*

You iterate over ALL lines in each thread of each invocation of your shader:

for (int i = 0; i < LinesCount; i++)

Yes, yes and yes… 🤣

I have yet to fully digest these suggestions before coming up with a fix.

However, after some quick testing, there are conceptual issues I'm having a hard time with:

  • in my case, I assume I can use [numthreads(128, 1, 1)] since it's horizontal work
    • but with that, only the 1st scan-line is drawn…
    • so I must increase the Y threads, but then GPU usage is up again…
  • although I tried segmenting the loop, which immediately reduced GPU usage
    • the latter threads are still processing the first lines (guess)
    • so the gain is there, but it isn't amazing (it should be, as far as I understand)
  • currently dissecting a frame with RenderDoc, still getting acquainted with it

So yes, as you mentioned, I clearly overlooked that part, looking into it.

Meanwhile, I improved the initial GL approach and it's much more solid; that'll be my backup plan.

But I must get that compute shader right, to learn something new that should be useful in the future.


Aybe One said:
in my case, I assume I can use [numthreads(128, 1, 1)] since it's horizontal work but with that, only the 1st scan-line is drawn…

This 3D grid to index the workload can be very confusing.
Personally I have never used the 2nd and 3rd indices. I always use just the first, as quoted. But for image / volume processing the 3D index might help with ideal memory access patterns. At least I guess that's why they use 3D indices.

However, to make your example work, you then need to do your own 1D index → 2D pixel coords mapping. I'll try:

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
	// map the 1D global thread index to 2D pixel coordinates; assumes a
	// Width constant holding the image width and a dispatch of Width * Height threads in total
	int2 xy = int2(id.x % Width, id.x / Width);

	Result[xy] = Clear;

	for (int i = 0; i < LinesCount; i++)
	{
		const Line data = Lines[i];

		if (xy.x == data.X1)
		{
			const int y1 = data.Y1;
			const int y2 = data.Y2;

			if (xy.y >= min(y1, y2) && xy.y <= max(y1, y2))
			{
				Result[xy] = data.Color;
			}
		}
	}
}

After that it should work as before with 64 threads. But perf should not really change.

Just FYI.

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
	int i = id.x; // SV_DispatchThreadID.x is the global thread index

	if (i < LinesCount) // check is needed because some threads of the last workgroup will exceed, if LinesCount is not a multiple of 64
	{
		//Result[xy] = Clear; // no - you must clear it before with its own dispatch
		// here we only draw line pixels

		const Line data = Lines[i];

		const int x = data.X; // assuming you have that
		const int y1 = data.Y1;
		const int y2 = data.Y2;

		// min/max so the loop works regardless of the endpoint order
		for (int y = min(y1, y2); y < max(y1, y2); y++)
			Result[int2(x, y)] = data.Color;
	}
}

I thought I'd rewrite your shader as described above.

JoeJ said:
…thinking of it, the simplest compute solution would be probably: Create one GPU thread per line, and draw the complete line within this single thread using a loop. Only one dispatch is needed, and there is no binning and no regular grid acceleration structure. Potential problems: It can be inefficient, e.g. if all segments have a length of 3 pixels but one line has a length of 1000 pixels. But I guess this can't happen to you. Multiple threads might attempt to draw the same pixel at the same time, creating flicker due to write hazards. So you probably need to use atomicMax to draw a pixel.

Each thread draws one line, so your dispatch size needs to provide enough threads for all lines, i.e. ceil(LinesCount / 64) workgroups with the 64-thread groups above.
If all your lines are short or have similar lengths, this should be the best solution, and it's also the simplest one.

LOL, I was replying while you posted V2 so I thought I'd give it a try first:

Artifacts are okay but unfortunately GPU usage was ~85%.

Anyway… just had a revelation a few minutes ago…

Not sure what I had in mind before but this works perfectly:

[numthreads(32, 32, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
	// one thread per pixel; the dispatch covers the whole render target
	int2 xy = id.xy;

	Result[xy] = Clear;

	// one vertical line per screen column, so the column index doubles
	// as the line index
	const Line data = Lines[xy.x];

	if (xy.x == data.X1)
	{
		const int y1 = data.Y1;
		const int y2 = data.Y2;

		if (xy.y >= min(y1, y2) && xy.y <= max(y1, y2))
		{
			Result[xy] = data.Color;
		}
	}
}

Perfectly… or almost… GPU is ~42% for drawing 3840 lines. 🤣

More or less the feeling I had, that it'd work but would be very hungry; as for the outcome, indeed…

I think I'll polish my cheap approach with GL, it literally works for free.

Aybe One said:
Perfectly… or almost… GPU is ~42% for drawing 3840 lines. 🤣

Wait - you think it's an improvement because it does the same thing with less GPU utilization?

I think I can explain why it goes down.
By using 32*32*1, your workgroup size is at the maximum of 1024 threads.
The problem with large workgroups is that a whole CU is needed for a single workgroup, and so it can no longer hide VRAM latency by switching to other workgroups while waiting on the memory.

So I guess you just made it slower? If so, you should use timestamps or a profiling tool instead of observing the utilization percentage.

Aybe One said:
I think I'll polish my cheap approach with GL, it literally works for free.

Personally I would surely be happy with a line strip. With AA it should also give better quality. But I never noticed missing-pixel artifacts when I did this.

However - the proposed ‘line per thread’ compute shader might be faster, since there is no need for triangle setup, culling, or edge rasterization.
