Extending SLURM with Support for GPU Ranges

Abstract: The SLURM resource management system is used on many TOP500 supercomputers. In this work, we present enhancements to AUCSCHED, our heterogeneous CPU-GPU scheduler plug-in, whose first version was released in December 2012. The new version, called AUCSCHED2, contributes two enhancements. The first is an extension of SLURM to support GPU ranges. The current version of SLURM supports the specification of node ranges but not of GPU ranges. Such a feature can be very useful to runtime auto-tuning applications and systems that can make use of a variable number of GPUs. The second enhancement is the implementation of a new integer programming formulation in AUCSCHED2 that drastically reduces the number of variables. This allows the problem to be solved faster and a larger number of bids to be generated. SLURM emulation results are presented for the heterogeneous 1408-node Tsubame supercomputer, which has 12 cores and 3 GPUs on each of its nodes. AUCSCHED2 is available at http://code.google.com/p/slurm-ipsched/.