On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures