Purdue University operates one of the largest academic cycle recovery systems based on the Condor workload management system. This system is a valuable cyberinfrastructure (CI) resource supporting research and education for campus and national users. While building and operating this CI, we encountered many unforeseen challenges and benefits unique to an actively used infrastructure of this size. The most significant problems were integrating Condor with existing campus HPC resources, managing resource and user growth, coping with the distributed ownership of compute resources around campus, and integrating this CI with the TeraGrid and Open Science Grid. In this paper, we describe some of our experiences and establish some best practices, which we believe will be useful to other academic institutions seeking to operate a production campus cyberinfrastructure of similar scale and utility.
Preston M. Smith, Thomas J. Hacker, C. X. Song