Heterogeneous supercomputers that combine general-purpose and accelerated CPUs promise to become the dominant future architecture because of their wide-ranging generality and superior performance/power ratio. However, developing applications that scale effectively remains very difficult and is in fact unproven on large-scale machines in such a combined setting; most past results have been limited to single-node applications. We show that effective large-scale computation on such heterogeneous machines requires careful analysis of the application algorithm and virtualization of the underlying compute resources to compensate for differences in their performance, so that applications written with homogeneous assumptions can be ported. We demonstrate our methodology with High-Performance Linpack (HPL) on the TSUBAME heterogeneous supercomputer. We efficiently balanced the load across more than 10,000 general-purpose CPU cores and 360 SIMD accelerators on TSUBAME and ach...