Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization