While modern CPUs offer an increasing number of cores with shared caches, prevailing execution engines for business processes, workflows, or Web service compositions have not been optimized for properly exploiting the abundant processing resources of such CPUs. One factor limiting performance is the inefficient thread scheduling by the operating system, which can result in suboptimal use of shared caches. In this paper we study performance of the JOpera business process execution engine on a recent multicore machine. By analyzing the engine's architecture and by binding threads that are likely to access shared data to cores with a common cache, we achieve speedups up to 13% for a variety of workloads, without modifying the engine's architecture and implementation, apart from binding threads to CPUs. As the engine is implemented in Java, we provide a new Java library to manage thread bindings and hardware performance counters. We also leverage hardware performance counters to ...