Using actors to increase scalability and fault tolerance of SUMMA

Section 1: Publication

Publication Type

Authorship

Klenk Kyle, Spiteri Raymond J., Zolfaghari Reza, Green Kevin R.

Title

Using actors to increase scalability and fault tolerance of SUMMA

Year

2022

Publication Outlet

AOSM2022

DOI

ISBN

ISSN

Citation

Kyle Klenk, Raymond J. Spiteri, Reza Zolfaghari, Kevin R. Green (2022). Using actors to increase scalability and fault tolerance of SUMMA. Proceedings of the GWF Annual Open Science Meeting, May 16-18, 2022.

Abstract

SUMMA is a modeling framework that is used for hydrological simulations over large-scale domains, such as the North American continent, which consists of more than half a million hydrological response units (HRUs). In the standard approach to performing such simulations on shared computing resources such as Compute Canada, the HRUs are divided into batches, and the batches are then submitted as individual jobs. For the continental North America run described, a batch size of around 500 results in approximately 1000 jobs. There are a few issues with this approach. First, each job can only utilize one CPU. Second, if any HRU fails, the job is halted. The failed HRU then has to be identified, have its settings adjusted, and be resubmitted manually. Besides the labour-intensive nature of this task, the resubmission to the queue risks further delay because the priority within the queue may decrease with each job submission. In other words, the current approach to running large simulations is neither scalable nor fault tolerant. To address these issues, we redesigned SUMMA to leverage the actor model to separate SUMMA's state from the global structure of HRUs. The actor model is an abstraction of concurrent computation that uses actors as the basic units of computation. An actor has a private state and its own thread of execution and can only communicate with other actors through messages. We developed a new implementation known as SUMMA-Actors that represents each HRU as an HRU-Actor. Separating HRUs into actor components allows us to run them concurrently, thus increasing scalability because jobs can utilize more CPUs, which decreases run-time. We have observed essentially perfect scaling when solving one job of 500 HRUs with one, two, and four CPUs, with run-times (HH:MM:SS) of 16:24:50, 08:17:48, and 04:06:53, respectively. By comparison, the standard implementation has a run-time of 14:32:22.
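The scaling claim can be checked directly from the run-times quoted above. The following is a minimal sketch, not part of SUMMA-Actors: the helper name to_seconds and the dictionary names are illustrative, and speedup is taken here relative to the one-CPU SUMMA-Actors run.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Run-times quoted in the abstract for one job of 500 HRUs.
actor_runtimes = {1: "16:24:50", 2: "08:17:48", 4: "04:06:53"}  # CPUs -> run-time
standard_runtime = "14:32:22"  # standard implementation

baseline = to_seconds(actor_runtimes[1])
standard = to_seconds(standard_runtime)

# Speedup relative to the one-CPU SUMMA-Actors run, and parallel efficiency.
speedup = {n: baseline / to_seconds(t) for n, t in actor_runtimes.items()}
efficiency = {n: speedup[n] / n for n in speedup}

# Ratio of the standard run-time to each SUMMA-Actors run-time.
vs_standard = {n: standard / to_seconds(t) for n, t in actor_runtimes.items()}

for n in sorted(speedup):
    print(f"{n} CPU(s): speedup {speedup[n]:.2f}, "
          f"efficiency {efficiency[n]:.3f}, "
          f"vs. standard {vs_standard[n]:.2f}x")
```

With these figures the speedups on two and four CPUs are about 1.98 and 3.99, which is what "essentially perfect scaling" refers to. The same arithmetic shows that the one-CPU SUMMA-Actors run takes about 13% longer than the standard implementation, so the net gain over the standard run is about 1.75x on two CPUs and 3.53x on four.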
To enable fault tolerance, SUMMA-Actors uses state separation to contain failures within a single HRU and a hierarchical supervision strategy provided by the actor model. The former prevents the failure of one HRU from causing the whole job to fail, allowing the remaining HRUs to continue. The latter allows for the implementation of a supervisor actor called the job-actor. The job-actor allows SUMMA-Actors to address failures at run-time, modify the settings of the failed HRU, and restart it without going back into the queue. All told, SUMMA-Actors provides a substantive reduction in the wall-clock time and human effort required to complete large-scale SUMMA simulations.
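The supervision pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SUMMA-Actors implementation: Python threads and a queue stand in for an actor framework, and simulate_hru, the time_step setting, the rule of halving it after a failure, and max_restarts are all hypothetical. Python threads also do not provide real CPU parallelism for this kind of work, so the sketch shows the structure only, not the performance.

```python
import queue
import threading


def simulate_hru(hru_id, settings):
    """Hypothetical stand-in for solving one HRU.

    HRU 3 'fails to converge' unless time_step <= 0.25.
    """
    if hru_id == 3 and settings["time_step"] > 0.25:
        raise RuntimeError("solver did not converge")
    return hru_id * settings["time_step"]


class HRUActor(threading.Thread):
    """One actor per HRU: private state, own thread, reports only by message."""

    def __init__(self, hru_id, settings, supervisor_mailbox):
        super().__init__()
        self.hru_id = hru_id
        self.settings = dict(settings)  # private copy, never shared
        self.supervisor_mailbox = supervisor_mailbox

    def run(self):
        try:
            result = simulate_hru(self.hru_id, self.settings)
            self.supervisor_mailbox.put(("done", self.hru_id, result))
        except Exception as exc:
            # The failure is contained: it becomes a message, not a job crash.
            self.supervisor_mailbox.put(("failed", self.hru_id, str(exc)))


def run_job(hru_ids, settings, max_restarts=3):
    """Job-actor sketch: supervise HRU actors and restart the ones that fail."""
    hru_ids = list(hru_ids)
    mailbox = queue.Queue()
    current = {h: dict(settings) for h in hru_ids}
    restarts = {h: 0 for h in hru_ids}
    results, gave_up = {}, {}

    for h in hru_ids:
        HRUActor(h, current[h], mailbox).start()

    pending = len(hru_ids)
    while pending:
        kind, h, payload = mailbox.get()
        if kind == "done":
            results[h] = payload
            pending -= 1
        elif restarts[h] < max_restarts:
            # Adjust the settings of the failed HRU and restart it in place.
            restarts[h] += 1
            current[h]["time_step"] /= 2
            HRUActor(h, current[h], mailbox).start()
        else:
            gave_up[h] = payload
            pending -= 1
    return results, restarts, gave_up


results, restarts, gave_up = run_job(range(5), {"time_step": 1.0})
print(results, restarts, gave_up)
```

In this run, the hypothetical HRU 3 fails twice and succeeds on its second restart, while the other four HRUs finish without being affected and nothing is resubmitted to a queue.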

Plain Language Summary

Section 2: Additional Information

Program Affiliations

GWF: Global Water Futures

Project Affiliations

GWF-CS: Computer Science

Submitters

Name	Role	Email	Institution
Kyle Klenk	Submitter/Presenter	kyle.c.klenk@gmail.com	University of Saskatchewan

Publication Stage

Theme

Hydrology and Terrestrial Ecosystems

Presentation Format

10-minute oral presentation

Additional Information

AOSM2022 core-CS
First Author: Kyle Klenk, University of Saskatchewan
Additional Authors: Raymond J. Spiteri, University of Saskatchewan; Reza Zolfaghari, University of Saskatchewan; Kevin R. Green, University of Saskatchewan