Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and