We report on our work in developing a fine-grained multithreaded solution for the communicationintensive Conjugate Gradient (CG) problem. In our recent work, we developed a simple yet efficient program for sparse matrix-vector multiply on a multithreaded system. This paper presents an effective mechanism for the reduction-broadcast phase, which is integrated with the sparse MVM, resulting in a scalable implementation of the complete CG application. Three major observations from our experiments on the EARTH multithreaded testbed are: (1) The scalability of our CG implementation is impressive, e.g., absolute speedup is 90 on 120 processors for the NAS CG class B input. (2) Our dataflow-style reductionbroadcast network based on fine-grain multithreading is twice as fast as a serial reduction scheme on the same system. (3) By slowing down the network by a factor of 2, no notable degradation of overall CG performance was observed.
Kevin B. Theobald, Gagan Agrawal, Rishi Kumar, Ger