AOSM2022: Using actors to increase scalability and fault tolerance of SUMMA
Section 1: Publication
Authorship or Presenters
Kyle Klenk, Raymond J. Spiteri, Reza Zolfaghari, Kevin R. Green
Using actors to increase scalability and fault tolerance of SUMMA
Hydrology and Terrestrial Ecosystems
10-minute oral presentation
Kyle Klenk, Raymond J. Spiteri, Reza Zolfaghari, Kevin R. Green (2022). Using actors to increase scalability and fault tolerance of SUMMA. Proceedings of the GWF Annual Open Science Meeting, May 16-18, 2022.
Section 2: Abstract
Plain Language Summary
SUMMA is a modeling framework used for hydrological simulations over large-scale domains, such as the North American continent, which consists of more than half a million hydrological response units (HRUs). In the standard approach to performing such simulations on shared computing resources such as Compute Canada, the HRUs are divided into batches, and the batches are then submitted as individual jobs. For the continental North America run described, the batch size is around 500, which results in approximately 1000 jobs. There are a few issues with this approach. First, each job can only utilize one CPU. Second, if any HRU fails, the job is halted. The failed HRU then has to be identified, have its settings adjusted, and be resubmitted manually. Besides the labour-intensive nature of this task, resubmission to the queue risks further delay because the priority within the queue may decrease with each job submission. In other words, the current approach to running large simulations is neither scalable nor fault tolerant. To address these issues, we redesigned SUMMA to leverage the actor model to separate SUMMA's state from the global structure of HRUs. The actor model is an abstraction of concurrent computation that uses actors as the basic units of computation. An actor has a private state and its own thread of execution, and it can communicate with other actors only through messages. We developed a new implementation, known as SUMMA-Actors, that represents each HRU as an HRU-Actor. Separating HRUs into actor components allows us to run them concurrently, which increases scalability because a job can utilize more CPUs and therefore finish sooner. We have observed essentially perfect scaling when solving one job of 500 HRUs with one, two, and four CPUs, with run-times (HH:MM:SS) of 16:24:50, 08:17:48, and 04:06:53, respectively. By comparison, the standard implementation has a run-time of 14:32:22.
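The scaling claim can be checked directly from the run-times reported above. The following is a small illustrative calculation, not part of SUMMA-Actors; the helper names `to_seconds` and `scaling_table` are made up for this sketch, and the only inputs are the four run-times quoted in the abstract.

```python
# Illustrative check of the reported scaling; names here are hypothetical.

def to_seconds(hhmmss: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Run-times quoted in the abstract for one job of 500 HRUs.
ACTOR_RUNTIMES = {1: "16:24:50", 2: "08:17:48", 4: "04:06:53"}
STANDARD_RUNTIME = "14:32:22"

def scaling_table(runtimes: dict) -> list:
    """Return (cpus, seconds, speedup, efficiency) rows relative to 1 CPU."""
    base = to_seconds(runtimes[1])
    rows = []
    for cpus in sorted(runtimes):
        elapsed = to_seconds(runtimes[cpus])
        speedup = base / elapsed        # how many times faster than 1 CPU
        efficiency = speedup / cpus     # 1.0 would be perfect scaling
        rows.append((cpus, elapsed, speedup, efficiency))
    return rows

if __name__ == "__main__":
    for cpus, elapsed, speedup, efficiency in scaling_table(ACTOR_RUNTIMES):
        print(f"{cpus} CPU(s): {elapsed} s, speedup {speedup:.2f}, "
              f"efficiency {efficiency:.1%}")
    standard = to_seconds(STANDARD_RUNTIME)
    four_cpu = to_seconds(ACTOR_RUNTIMES[4])
    print(f"standard / 4-CPU actors: {standard / four_cpu:.2f}")
```

With these figures, the speedups relative to one CPU come out at about 1.98 on two CPUs and 3.99 on four, which is what "essentially perfect scaling" refers to. The same numbers also show that the one-CPU SUMMA-Actors run takes roughly 13% longer than the standard implementation, while the four-CPU run is about 3.5 times faster than it.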
To enable fault tolerance, SUMMA-Actors combines state separation, which contains failures within a single HRU, with a hierarchical supervision strategy provided by the actor model. The former prevents an HRU failure from causing the whole job to fail, so the remaining HRUs can continue. The latter allows for the implementation of a supervisor actor called the job-actor. The job-actor allows SUMMA-Actors to address failures at run-time: it can modify the settings of a failed HRU and restart that HRU without the job going back into the queue. All told, SUMMA-Actors provides a substantive reduction in the wall-clock time and human effort required to complete large-scale SUMMA simulations.
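The supervision idea described above can be sketched in a few lines. This is a toy illustration using Python threads and a queue as the mailbox, not the SUMMA-Actors implementation. Every name in it (`HRUActor`, `job_actor`, `toy_solve`, the `dt` setting, and the "halve the time step" recovery rule) is an assumption made for the example.

```python
# Toy supervisor sketch; all names and the retry rule are hypothetical.
import queue
import threading

class HRUActor(threading.Thread):
    """One HRU with private settings; it reports only by sending messages."""

    def __init__(self, hru_id, settings, solve, supervisor_mailbox):
        super().__init__()
        self.hru_id = hru_id
        self._settings = dict(settings)      # private copy, never shared
        self._solve = solve
        self._supervisor = supervisor_mailbox

    def run(self):
        try:
            result = self._solve(self.hru_id, self._settings)
            self._supervisor.put(("done", self.hru_id, result))
        except Exception:
            # The failure stays inside this actor; only a message leaves it.
            self._supervisor.put(("failed", self.hru_id, dict(self._settings)))

def job_actor(hru_ids, settings, solve, max_retries=3):
    """Supervise one HRU actor per id and restart failed ones in-process."""
    mailbox = queue.Queue()
    attempts = {hru_id: 1 for hru_id in hru_ids}
    results, gave_up, actors = {}, {}, []

    def spawn(hru_id, hru_settings):
        actor = HRUActor(hru_id, hru_settings, solve, mailbox)
        actors.append(actor)
        actor.start()

    for hru_id in hru_ids:
        spawn(hru_id, settings)

    pending = len(attempts)
    while pending:
        kind, hru_id, payload = mailbox.get()
        if kind == "done":
            results[hru_id] = payload
            pending -= 1
        elif attempts[hru_id] <= max_retries:
            attempts[hru_id] += 1
            retry_settings = dict(payload)
            retry_settings["dt"] /= 2        # assumed recovery rule
            spawn(hru_id, retry_settings)
        else:
            gave_up[hru_id] = payload        # record and move on
            pending -= 1

    for actor in actors:
        actor.join()
    return results, attempts, gave_up

def toy_solve(hru_id, settings):
    """Stand-in solver: HRU 3 only succeeds once dt is 0.25 or smaller."""
    if hru_id == 3 and settings["dt"] > 0.25:
        raise RuntimeError("did not converge")
    return settings["dt"]

if __name__ == "__main__":
    print(job_actor(range(5), {"dt": 1.0}, toy_solve))
```

In this sketch the failure of one HRU never interrupts the others, and the job-actor retries the failed HRU with adjusted settings inside the same job, which mirrors the two mechanisms the abstract describes. A production actor framework would add persistent mailboxes for each actor and a bounded thread pool, which are omitted here for brevity.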
Section 3: Miscellany
University of Saskatchewan
First Author: Kyle Klenk, University of Saskatchewan
Additional Authors: Raymond J. Spiteri, University of Saskatchewan; Reza Zolfaghari, University of Saskatchewan; Kevin R. Green, University of Saskatchewan
Section 4: Download
T-2022-04-24-s108Wrs39s1EEqEN6h0hnMx4w Conference Publication 1.0