Balancing act

October 2013
Filed under: Pacific Northwest

Krishnamoorthy and his team worked on porting or optimizing the codes to run on Jaguar, Oak Ridge National Laboratory’s Cray XT5 system. Now the PNNL scientists are doing something similar on Jaguar’s successor, Titan.

Recently, the research team demonstrated that the most computationally intensive portion of the calculation can be run on 210,000 processor cores of Titan, a Cray XK7 at Oak Ridge’s Leadership Computing Facility, achieving more than 80 percent parallel efficiency.

“When I joined PNNL we were still looking at codes that run on a few thousand cores,” Krishnamoorthy says. When Chinook, the lab’s computational chemistry machine, arrived, “we were immediately jumping up to 18,000 cores.” Doing one calculation at a time and constantly shuttling data to disk drives “was not going to suffice.”

Solving the problems posed by the jump to highly parallel computing, and rethinking how the system deals with workload and faults, earned Krishnamoorthy a DOE Early Career Research Program award. It grants him $2.5 million over five years to explore ways to extend his ideas to exascale computing. Shortly thereafter, Krishnamoorthy learned he had also been awarded PNNL’s 2013 Ronald L. Brodzinski Award for Early Career Exceptional Achievement.

He’s now broadening the methods developed for computational chemistry to apply to any algorithm that has load-imbalance issues.

“You want a dynamic environment where the execution keeps on going and the user doesn’t have to worry about statically scheduling everything – the run-time engine just automatically adapts to changes in the machine and in the problem itself,” he says.

The current framework is called Task Scheduling Library (TASCEL) for Load Balancing and Fault Tolerance.

“We are now trying to adapt this method to the new codes as they are developed,” Krishnamoorthy says. By automating the process, he wants to make moving a program from one version to the next generation seamless.

“You have to write the program in terms of collections of independent work or tasks and the relationships among them in terms of who depends on who and what data they access,” he says. “As long as it is written this way, the run-time can take over and do this load balancing, communication management and fault management automatically for you.”
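The programming model Krishnamoorthy describes can be illustrated with a minimal sketch. This is not TASCEL or its API: the function name `run_task_graph`, the `tasks` and `deps` dictionaries, and the use of Python threads on a single machine are all assumptions made for illustration. The program states only the tasks and who depends on whom; idle workers then pick up whatever task becomes ready next, rather than following a fixed schedule. Fault handling is left out.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task_graph(tasks, deps, num_workers=4):
    """Run each task as soon as all the tasks it depends on have finished.

    tasks: dict mapping a task name to a function of its dependencies' results
    deps:  dict mapping a task name to the list of task names it depends on
    (Both structures are hypothetical, chosen only for this sketch.)
    """
    # Count unfinished dependencies for each task and record reverse edges.
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, parents in deps.items():
        for parent in parents:
            dependents[parent].append(name)

    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:

        def submit(name):
            # Pass the results of the dependencies, in declared order.
            inputs = [results[p] for p in deps.get(name, [])]
            return pool.submit(tasks[name], *inputs)

        # Start with the tasks that depend on nothing.
        running = {submit(n): n for n, count in remaining.items() if count == 0}
        while running:
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                # A finished task may make some of its dependents ready.
                for child in dependents[name]:
                    remaining[child] -= 1
                    if remaining[child] == 0:
                        running[submit(child)] = child
    return results


# Four tasks: two independent inputs, then two dependent steps.
tasks = {
    "a": lambda: 2,
    "b": lambda: 3,
    "sum": lambda a, b: a + b,
    "square": lambda s: s * s,
}
deps = {"sum": ["a", "b"], "square": ["sum"]}
results = run_task_graph(tasks, deps)
print(results["square"])
```

Because no task is assigned to a particular worker in advance, a slow task delays only the tasks that depend on it. That is the sense in which the scheduling here is dynamic rather than static.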

His methods address two of the most daunting challenges facing exascale computing: load imbalance and fault tolerance. He’s contemplating not what computers will look like in the next two to three years but the challenges applications will face running on exascale computers eight to 10 years from now.

The cost of failure can be steep, but Krishnamoorthy is making it less so every day.

 

About the Author

Karyn Hede is news editor of the Nature Publishing Group journal Genetics in Medicine and a correspondent for the Journal of the National Cancer Institute. Her freelance writing has appeared in Science, New Scientist, Technology Review and elsewhere. She teaches scientific writing at the University of North Carolina, Chapel Hill, where she earned advanced degrees in journalism and biology.
