My Raytracer History

BlindSide · Post by **BlindSide** » Tue Aug 18, 2009 11:25 am

I posted this on another forum but I thought you guys might find it interesting. It details my work in progress raytracer driver for Irrlicht, which I haven't touched in a long time now.

Just thought I'd share something.

When I first started writing my raytracer I used a very simple test scene (I recommend you keep the scene you have now as a reference to when you improve your performance.)

This is what it looked like (But without the soft shadows, I can't seem to find the original image):

Now I'll tell you a timeline of my modifications and performance changes throughout the development cycle of the raytracer. I'd like to call this "My journey to realtime". All of this was conducted on a single core 1.6ghz AMD Turion laptop with 320mb of ram.

First I started off with a very simple raytracer, shoot rays for every pixel in the screen, loop over every single triangle in every mesh and intersect those with the ray, then do simple lighting and read a texture for those coordinates. Everything is pure C++ no SSE, etc. I also blitted to the screen every screenline so I can see the image update in realtime. This took approximately 8 seconds to render the above image at 512x512 (Without the shadows).

After this I did some tweaks here and there, added a simple KDTree and KDTree builder, lots of artifacts, missing triangles etc from my buggy KDTree code, but the render time suddenly jumped down to 3 seconds for the same image.

After this I removed the constant blitting to the screen, and this significantly reduced the render time to 1 second with the KDTree, and I did a few tweaks here and there to the KDTree algorithm (Early exit, first hit, etc), got rid of a few artifacts, but there were still some rendering problems here and there, and the time only got down to something like 900ms, which isn't a big improvement overall.

After this I got fed up with the KDTree and min/max. Sure the traversal loop only needed like 2 or 3 lines of code, but what's the point when memory bandwidth is usually the biggest bottleneck? So I moved to a BVH. Instantly my render artifacts dissappeared but my render time jumped back up to 1.5 seconds.

I finally bit the bullet and implemented SSE intrinsics. I wrote a simple wrapper vector class that could be switched with a simple compile define to use either SSE intrinsics or simple fpu code (For platforms that didn't support it), and another SOA vector wrapper on top of that that made dealing with 4 vectors at a time easy. This, as revealed by several threads on here, isn't an efficient way to implement SSE (Especially since I was using MSVC which is known for producing especially bad code in this case), but I was happy, with 4 ray packets and this SSE vector class traversing through a BVH, my render time jumped down to 500ms, I finally felt like I was getting really close.

After this I read a few papers and came across Wald et al's excellent "Using BVH with dynamic geometry paper" (Not the main Wald paper on interactive raytracing, but quite popular too). Following this as a guide I implemented frustum traversal, early hit exit, storying ray ids, and all sorts of lil tweaks described in the paper and some that I just thought of myself while staring blankly at the code. My render time after adding frustums went down to around 300ms, and then 250ms, and then finally after much tweaking and optimizing, 180ms (~5fps!). This number I recall very distinctly because I couldn't find any easy way to make it any faster than this. Soon after I got a new computer and that beast could render that same scene using the same code at a steady 20 fps (AMD Phenom 9950 with 2 GB ram). But this laptop's rendering speed I could not forget, as it was the first time I ray traced at an "interactive" frame rate.

Haha sorry that got a bit long but I hope theres something useful for you in there to reach your desired performance goals!

wITTus · Post by **wITTus** » Tue Aug 18, 2009 2:40 pm

I understand now why you stopped the development. You simply wait another few years till you've got a better machine and 100fps.

Lonesome Ducky · Post by **Lonesome Ducky** » Tue Aug 18, 2009 4:17 pm

Very interesting, do you still have it laying around on your hard drive somewhere?

lulzfish · Post by **lulzfish** » Tue Aug 18, 2009 4:21 pm

That's quite awesome.
Any idea how the rendering time would scale if you added another couple of scene nodes?

torleif · Post by **torleif** » Wed Aug 19, 2009 9:59 am

Nice! I'd love to see it put into the irrlicht engine for some fun :p

devsh · Post by **devsh** » Wed Aug 19, 2009 12:17 pm

You got a phenom... i think you "might have" (i think you didn't but anyway incase you did) forgotten about multiple cores. you could split the 512x512 into 256x256 and process 4 quads of 256x256 so all your CPU cores are busy.
I think if you have a quad core... you could have quadrupled your fps to almost 80. You coulg get ready for eight-cores by splitting the 512x512 into 256x128 segments and 8 threads.

Also some rays might fail more often or be traced quicker than others in one quad and the cpu that was challenged with the easier quad could be sitting around doing nothing... a solution to that would be asynchronous rendering/ray tracing meaning each thread out of the four has their own image buffer. The first thread to be finished with their portion of the screen ray traced... updates the camera and events and node movement + animation (note: the other threads function on copy's made from these, so changing these values while they are still ray tracing should have no imapact).
Then it starts tracing the new frame, while the others finish the last. If the thread(s) tracing the new scene finish before the last scene is finished, it/they store the new finished scene parts into buffers and "help out" to finish tracing the old scene. When the last thread is done... the threads that are already ahead one frame present their buffers to the screen and start on the next.

personally i think you could take a hybrid approach to ray tracing

render the scene onto two MRTs

1st - screen normals (E vector as the call it in specular highlights) reflected by normals (GLSL reflect(E,normal) )
2nd - color (or diffuse or just the texture thing)
3rd - depth buffer

then in software (CPU) download the image from VRAM and reconstruct pixel world position and there you have your bounced off rays, and the hardest part is yet ahead finding where they intersect other stuff. I would reccomend a concept similar to matrices, only that you dont contruct one. You multiply stuff with it as if it had no scale and was a 3x3 matrix and was orthogonal.
This way you have your 2 bounce raytracer.

ofcourse the render time would be 2 seconds and there would be absolutely no use to it, but it could be taken a step forward into OpenCL or nvidia CUDA where you wouldnt have to download the MRTs to RAM which cuts your fps to 3. In OCL or CUDA you would be able to use the texture straight in video RAM so no download overhead.

Valmond · Post by **Valmond** » Wed Aug 19, 2009 7:34 pm

It would be cool to check out though (if you have a 'lil demo).

And total respect

BlindSide · Post by **BlindSide** » Thu Aug 20, 2009 1:34 am

devsh I used Intel TBB to schedule the multi-core stuff, it automatically balances the load across cores. This is a pure software renderer, no GPUs were harmed in the making. I will look into OpenCL etc when the technology matures, I still feel it's not ready.

I posted some demos here a few months ago.

I was thinking of releasing a header/lib only package for testing sometime but I just haven't gotten around to it as I'm busy with other stuff.

devsh · Post by **devsh** » Thu Aug 20, 2009 11:22 am

nvidia optix is what you need... it will give you an interactive FPS for sure

bitplane · Post by **bitplane** » Thu Aug 20, 2009 11:31 am

*ignores devsh*

That's pretty interesting BlindSide. Any plans to support multi-threading? Machines are getting faster, but the number of cores is increasing faster than the clock speed is doubling.

BlindSide · Post by **BlindSide** » Thu Aug 20, 2009 12:03 pm

Looks like you ignored my post too because I just mentioned that I use TBB for multithreading.

Yeah NVidia Optix is nice but it's not officially released yet, I'll be sure to check it out eventually.

devsh · Post by **devsh** » Thu Aug 20, 2009 12:46 pm

I think that in 4 years it will be the end of "drawing engines", it will be sorrow...

hybrid · Post by **hybrid** » Thu Aug 20, 2009 2:10 pm

Just as tv killed radio, video killed cinema, ... So, it simply won't happen. Maybe most high-end games will provide a ray-tracer engine, but hand-helds and mobile phones still require rendering engines, because not more than a hand ful of cores will be available there. And what will happen in 15 years is not forseeable.

BlindSide · Post by **BlindSide** » Fri Aug 21, 2009 2:16 am

Yes but cellphones tend to have very small screens, that's why you see so many raycasting (Not raytracing) engines on cellphones, because the performance scales very well with smaller screen sizes (And very badly with larger ones

).

Mel · Post by **Mel** » Fri Aug 21, 2009 12:27 pm

I think that in 4 years it will be the end of "drawing engines", it will be sorrow...

I don't think so. The algorithm for raytracing takes into account for every pixel, all the lights the pixel can see. So, either you use a small amount of lights, or use simple scenes or else you might have a trouble when the number of objects/lights grows. Even with the optimizations, the complexity of the raytracing increases with each light and object you add, so, it would be very limiting to have an engine based only in raytracing.