tiegz
New Member
Posts: 5
Post by tiegz on Jul 21, 2020 1:11:52 GMT
My RTC implementation in Go originally used values everywhere. I was trying to head off shared-state issues by using value types for struct fields, value types for function parameters, and the Null object pattern instead of nil (you can't use nil with value fields in Go anyway).
This week I tried replacing the above with pointers instead, and the performance boost was way better than I expected! Some examples from the book:
Table Scene: 19:01 -> 1:20
Cover Scene: 11:15 -> 0:39
3 Spheres in Room: 0:20 -> 1.7
Anyone else tried this too?
Post by chrisd on Jul 22, 2020 1:39:28 GMT
I haven't done a lot of profiling since adding matrix-inversion caching, so I don't know how much room I have left for optimization. I have used a lot of value objects, allocating new objects every time instead of updating existing values. You might have given me an incentive to go back and do some profiling.
Out of curiosity, what are your rendering resolutions and CPU specs (model, # cores/threads, GHz)?
By default I'm using 640x360 on an AMD A10-9850K (state-of-the-art 2014 technology! ha-ha), 4 cores (single-threaded cores), 3.69 GHz; implemented in Java 8 with parallelized rendering (4 threads).
Table scene: 00:48.73
Cover scene: 00:43.81
Three spheres: 00:19.46
tiegz
New Member
Posts: 5
Post by tiegz on Jul 22, 2020 14:38:50 GMT
Neat that you're running w/4 threads! I picked Go because of its concurrency primitives but haven't even gotten to that point yet, so hopefully soon.
Table Scene: 1600x800
Cover: 400x400
Three Spheres (I think from Ch. 6): 320x240
I've only tested this on my MacBook Pro so far: 3.1 GHz dual-core Intel i7, 16 GB RAM, Intel Iris Graphics 6100 1536 MB (2015! 😅)
Post by signal11 on Aug 24, 2020 5:52:06 GMT
late reply, but thought i would chime in anyway. i have a decade (at least) old "Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz" with 4 cores and 8 threads. my implementation is in c++ and so far i have gotten up to chapter-10 (taking the scenic route). with 3 spheres on the plane (with a blended pattern) i am able to render a 1280x1024 pixel image in approx. 786ms. image is here: Render With Pattern
Post by bezdomniy on Sept 5, 2020 1:15:25 GMT
That's really great performance! I also have a C++ implementation that I am trying to optimise for speed.
I am getting about 640ms on the same type of 3-sphere 1280x1024 scene, but on a much bigger CPU (Ryzen 3600X - 6 cores, 12 threads).
I'm interested - are you using a library to multithread your ray tracer, or did you implement it yourself? I'm using TBB, which isn't speeding it up as much as I'd like (about 2x). Also, do you roll your own vector maths objects? I'm using glm now, which is good, but I'm wondering whether there is a faster way.
Cheers,
Ilia.
Post by signal11 on Sept 17, 2020 11:31:08 GMT
hi ilia, first of all, please excuse me for the tardiness in my reply. it's been a while since i logged on here due to some other engagements. now, coming to the real thing:

for this particular case, i am using a concurrent mpmc lockless queue, aka 'work-queue', for dispatching work items to workers/renderers. each 'work-item' comprises 'N' pixels worth of rendering in a specific row. 'N' is computed from `std::thread::hardware_concurrency()` i.e. the number of h/w cores on the machine, and the number of x-pixels making up the scene. thus, if a machine has 8 physical cores, and the image is 320 pixels wide (in the x-direction), 'N' would be '320 / 8 == 40'. a worker picks an item from the queue, and renders all pixels specified in the work item. since all work items (that make up the entire canvas) contain disjoint sets of pixels, no locking is required. once done with the current work-item, the renderer picks up the next one from the queue. this goes on till all work items in the queue are done.

this _seems_ to be a better approach than slicing up the entire scene into multiple stripes, and then letting each of the threads handle a specific stripe. depending on the scene, some threads might get a 'free-ride' by just rendering stripes that contain few non-black pixels. yes, i did experiment with that model as well.

regarding the libraries: the mpmc queue is from moodycamel (https://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++) [*]. everything else is pretty much vanilla i.e. either stl or home-grown.

please do let me know if you might need some more information.

-- best regards
signal-11

[*] sorry for posting the link as text, apparently the forum s/w is stripping the trailing '++' from the url.
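in case a sketch helps, the dispatch boils down to something like this (heavily simplified -- `WorkItem` and `render_span` are made-up stand-ins for this post, not my actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>
#include "concurrentqueue.h" // moodycamel::ConcurrentQueue, header-only

// made-up work item: a run of consecutive pixels in one row
struct WorkItem {
    std::size_t row;
    std::size_t x_begin;
    std::size_t x_end; // exclusive
};

// stand-in for the actual per-pixel ray tracing
void render_span(const WorkItem& item) { /* trace rays for item.x_begin .. item.x_end */ }

int main() {
    const std::size_t width = 320, height = 240;
    const std::size_t cores = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t n = width / cores; // e.g. 320 / 8 == 40 pixels per work item

    moodycamel::ConcurrentQueue<WorkItem> queue;

    // main thread enqueues every work item up front (disjoint pixel ranges)
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; x += n)
            queue.enqueue(WorkItem{y, x, std::min(x + n, width)});

    // each worker keeps pulling items until the queue runs dry
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < cores; ++i)
        workers.emplace_back([&queue] {
            WorkItem item;
            while (queue.try_dequeue(item))
                render_span(item);
        });

    for (auto& w : workers) w.join();
}
```

since workers pull items dynamically, a thread that lands on cheap spans simply grabs more of them, which is exactly where the fixed-stripe approach loses out.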
Post by bezdomniy on Sept 21, 2020 9:50:32 GMT
That's a great approach - with splitting the scene, I thought the same thing about not letting any threads free-ride on empty-space pixels, though I just randomly sampled pixels (without replacement) and sent them to each thread. The lock-free queue sounds very interesting! I read the blog post and the ops/second/thread charts look very impressive indeed. I am going to try to replace my TBB for-each loop (which is not lock-free) with this and see if I can improve my concurrency. Thanks for sharing your approach!
Edit: just did it with a simple try_dequeue and it improved concurrency a lot! Small improvement on small scenes like the 3-sphere one mentioned above, but a pretty large improvement on rendering a 3D model. Shaved about 1/3 off the time for a scene where I have about 300,000 triangles.
Post by signal11 on Sept 22, 2020 4:59:19 GMT
hi ilia, it's good to know that the approach works well for you. having a work-queue on which work-items are put and worked upon is a good general-purpose approach for tasks that are, to use a cliched term, 'embarrassingly parallel'. rendering tasks, imho, have an embarrassment of riches in that regard.

i am quite sure that more can be squeezed out of processors if you really want, f.e. via intrinsics etc., but that is slightly more involved, and with modern optimizing compilers, i am not even sure if it would be worth the effort.

another minor thing: for my implementation, the final output ppm-canvas is created in the main thread, and passed as a reference parameter to the renderers. and since (as pointed out earlier) each pixel on the canvas is independent, there is no locking at all. thus the work-queue is of the spmc (single-producer-multi-consumer) variant.

currently i am still working on chapter-10 of the book and playing around with 2d textures (specifically perlin noise and such). it is so seductive!

-- best regards
signal-11
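to illustrate that last point, the renderer ends up with roughly this shape (`Canvas` and `Color` here are bare-bones stand-ins, not my actual types):

```cpp
#include <cstddef>
#include <vector>

struct Color { float r, g, b; };

// minimal stand-in for the ppm canvas
struct Canvas {
    std::size_t width, height;
    std::vector<Color> pixels;
    Canvas(std::size_t w, std::size_t h) : width(w), height(h), pixels(w * h) {}
};

// every renderer thread gets the one canvas by reference; each work item
// covers a disjoint pixel range, so these writes never race and need no mutex
void render_span(Canvas& canvas, std::size_t row,
                 std::size_t x_begin, std::size_t x_end) {
    for (std::size_t x = x_begin; x < x_end; ++x)
        canvas.pixels[row * canvas.width + x] = Color{1, 1, 1}; // placeholder colour
}

int main() {
    Canvas canvas(320, 240);       // created once, in the main thread
    render_span(canvas, 0, 0, 40); // in real code, workers call this from their own threads
}
```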