We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on the existing Eager Binary Splitting (EBS) work-stealing scheduler implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease of programming. In its simplest form (SP), EBS requires manual tuning: the application must be run repeatedly under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code. This threshold limits parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to a particular dataset, platform, and calling context of the do-all loop, resulting in poor performance portability. LBS overcomes both the performance-portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Au...
Alexandros Tzannes, George C. Caragea, Rajeev Baru
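To make the role of the stop-splitting threshold concrete, the following is a minimal sequential sketch of eager binary splitting over a do-all loop. It is an illustration only, not TBB's API or the authors' implementation: the names `eager_binary_split`, `count_tasks`, and `body` are hypothetical, and the work-stealing deque is replaced by plain recursion so that the number of leaf tasks produced for a given `sst` can be inspected directly.

```python
def eager_binary_split(lo, hi, sst, body, tasks):
    """Recursively halve the range [lo, hi) until a chunk holds at most sst iterations.

    Hypothetical sketch: a real work-stealing runtime would push one half onto
    a deque for other workers to steal; here both halves run in order.
    """
    if hi - lo <= sst:
        # Leaf chunk: run its iterations sequentially and record it as one task.
        for i in range(lo, hi):
            body(i)
        tasks.append((lo, hi))
        return
    mid = lo + (hi - lo) // 2
    eager_binary_split(lo, mid, sst, body, tasks)
    eager_binary_split(mid, hi, sst, body, tasks)


def count_tasks(n, sst):
    """Run a toy do-all loop of n iterations and return (number of leaf tasks, results)."""
    out = [0] * n
    tasks = []

    def body(i):
        out[i] = i * i  # stand-in for the loop body

    eager_binary_split(0, n, sst, body, tasks)
    return len(tasks), out
```

With `n = 16`, a threshold of `sst = 4` yields 4 leaf tasks, `sst = 1` yields 16, and `sst = 16` yields a single task. In this sketch the threshold is fixed before the loop runs, which is the tuning burden the abstract attributes to EBS in its simplest form.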