We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose
a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively
parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums
for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized
ray-primitive intersections. We use the well-known parallel primitives scan and segmented scan to
process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages.
Our ray sorting stage is based on assigning hash values to individual rays, followed by ray stream compression, sorting, and decompression.
Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and
a parallel scan at each BVH level.
We demonstrate our algorithm with area light sources to get ...
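
As a rough illustration of how a scan can replace a stack in breadth-first traversal, the following is a minimal sequential sketch in Python, not the CUDA implementation. It assumes several simplifications that are not taken from the text: each frustum is approximated by an axis-aligned box, the BVH is a plain dictionary of nodes, and the helper names (`exclusive_scan`, `compact`, `traverse_level`, `traverse`) are hypothetical. On each level, every active (frustum, node) pair is tested against the node's children, an exclusive scan over the hit flags yields output offsets, and the surviving pairs are scattered into the list for the next level.

```python
from itertools import accumulate

def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i])."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]

def compact(items, flags):
    """Stream compaction: keep items whose flag is 1, using scan offsets."""
    offsets = exclusive_scan(flags)
    total = offsets[-1] + flags[-1] if flags else 0
    out = [None] * total
    for i, (item, flag) in enumerate(zip(items, flags)):
        if flag:
            out[offsets[i]] = item  # each kept item has a unique slot
    return out

def overlaps(a, b):
    """Box-box overlap test; boxes are (min_xyz, max_xyz) tuples.
    Assumption: stands in for a real frustum-box test."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))

def traverse_level(pairs, frusta, nodes):
    """Expand active (frustum_id, node_id) pairs by one BVH level."""
    candidates = [(f, c) for f, n in pairs for c in nodes[n]["children"]]
    flags = [1 if overlaps(frusta[f], nodes[c]["box"]) else 0
             for f, c in candidates]
    return compact(candidates, flags)

def traverse(frusta, nodes, root=0):
    """Level-by-level traversal without a stack; returns (frustum, leaf) hits."""
    pairs = [(f, root) for f in range(len(frusta))
             if overlaps(frusta[f], nodes[root]["box"])]
    leaf_hits = []
    while pairs:
        leaf_hits += [p for p in pairs if not nodes[p[1]]["children"]]
        pairs = traverse_level(pairs, frusta, nodes)
    return leaf_hits

# Tiny hypothetical scene: a root with two leaf children.
nodes = {
    0: {"box": ((0, 0, 0), (4, 1, 1)), "children": [1, 2]},
    1: {"box": ((0, 0, 0), (2, 1, 1)), "children": []},
    2: {"box": ((2.5, 0, 0), (4, 1, 1)), "children": []},
}
frusta = [((0.5, 0, 0), (1, 1, 1)), ((3, 0, 0), (3.5, 1, 1))]
hits = traverse(frusta, nodes)
```

In a GPU version the loops inside `compact` and `traverse_level` would presumably run as one thread per element, since the scan offsets guarantee that no two threads write to the same output slot.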