Modern graphics hardware architectures excel at compute-intensive tasks such as ray-triangle intersection, making them attractive target platforms for raytracing. To date, most GPU-based raytracers have relied upon uniform grid acceleration structures. In contrast, the kd-tree has gained widespread use in CPU-based raytracers and is regarded as the best general-purpose acceleration structure. We demonstrate two kd-tree traversal algorithms suitable for GPU implementation and integrate them into a streaming raytracer. We show that for scenes with many objects at different scales, our kd-tree algorithms are up to 8 times faster than a uniform grid. In addition, we identify load balancing and input data recirculation as two fundamental sources of inefficiency when raytracing on current graphics hardware.

Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Graphics processors; I.3.7 [Computer Graphics]: Raytracing