Time to Update the Split-Sample Approach in Hydrological Model Calibration v1.1


Section 1: Overview

Name of Research Project

Related Project	Part

Program Affiliations

GWF: Global Water Futures

Related Research Project(s)

Related Project	Part
GWF-IMPC: Integrated Modelling Program for Canada

Dataset Title

Time to Update the Split-Sample Approach in Hydrological Model Calibration v1.1

Additional Information

Creators and Contributors

Name	Email	Institution
Hongren Shen	hongren.shen@uwaterloo.ca	University of Waterloo
Bryan Tolson		University of Waterloo
Juliane Mai		University of Waterloo

Abstract

Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.

Purpose

Plain Language Summary

Keywords

Keyword
CAMELS dataset
Split-sample test
Large-sample study
Raven hydrological modeling
Model calibration
Model validation
Model testing

Citations

All versions: https://doi.org/10.5281/zenodo.5915373

Dataset v1.1:
H. Shen, B. A. Tolson, and J. Mai (2022). Time to Update the Split-Sample Approach in Hydrological Model Calibration. Zenodo. http://doi.org/10.5281/zenodo.5915374

Article v1.1:
Shen, H., Tolson, B. A., & Mai, J. (2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523

Original CAMELS dataset v1.0:
A. Newman, K. Sampson, M. P. Clark, A. Bock, R. J. Viger, and D. Blodgett (2014). A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://dx.doi.org/10.5065/D6MW2F4D

Original Article v1.0:
A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). Development of a large-sample watershed-scale hydrometeorological dataset for the contiguous USA: dataset characteristics and assessment of regional variability in hydrologic model performance. Hydrol. Earth Syst. Sci., 19, 209-223, http://doi.org/10.5194/hess-19-209-2015

Section 3: Status and Provenance

Dataset Version

1.1

Dataset Creation Date

2022-05-19

Status of data collection/production

○ Planned

○ In Progress

○ Abandoned

◉ Complete

Dataset Completion or Abandonment Date

Data Update Frequency

○ Continually

○ Daily

○ Weekly

○ Biweekly

○ Monthly

○ Annually

○ As needed

○ Irregular

○ None planned

◉ Unknown

Creation Software

Software	Version	File Formats

Primary Source of Data

◻ Unknown/Unspecified

◻ Census

◻ Field collected samples

◻ Field experiment

◻ Field observation

◻ Field survey

◻ Human biological samples

◻ Lab experiment

◻ Model simulation

◻ Previously collected

◻ Qualitative (from observations or interviews)

◻ Social survey

◻ Traditional knowledge

◻ Other Source of Data (Please specify in field below)

Other Source of Data (if applicable)

Data Lineage (if applicable). Please include versions (e.g., input and forcing data, models, and coupling modules; instrument measurements; surveys; sample collections; etc.)

The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information about each catchment, and 463 subfolders, one per catchment. Each subfolder holds four files:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave radiation in MJ/m2/day, and day length in days) from January 1, 1980 to December 31, 2014, in the format required by the Raven hydrological modeling framework.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from January 1, 1980 to December 31, 2014, in the format required by the Raven hydrological modeling framework.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics for the calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics for the calibration, validation and testing periods.
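
After downloading and unpacking the archive, the folder layout described above can be checked with a short script. The following is a minimal sketch in Python, not part of the dataset itself. The function name find_incomplete_catchments and the location of the unpacked data folder are assumptions made for illustration; only the four file names are taken from the description above.

```python
from pathlib import Path

# File names as listed in the data lineage description above.
EXPECTED_FILES = (
    "Raven_Daymet_forcing.rvt",
    "Raven_USGS_streamflow.rvt",
    "GR4J_metrics.txt",
    "HMETS_metrics.txt",
)


def find_incomplete_catchments(data_dir):
    """Map each catchment subfolder name to the expected files it lacks.

    Hypothetical helper: only subfolders with at least one missing file
    appear in the returned dictionary.
    """
    incomplete = {}
    for sub in sorted(Path(data_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip loose files such as CAMELS_463_gauge_info.txt
        missing = [name for name in EXPECTED_FILES if not (sub / name).is_file()]
        if missing:
            incomplete[sub.name] = missing
    return incomplete
```

Calling find_incomplete_catchments on the path of the unpacked data folder should return an empty dictionary when every catchment subfolder contains all four files. The script checks file presence only and makes no assumption about the internal layout of the .rvt or metrics files.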

Section 4: Access and Downloads

Access to the Dataset

Does the data have access restrictions?

▣ No restriction (data is currently open to public)

◻ Limited (data is currently under embargo until publication)

◻ Limited (data involves intellectual property issues related to local or traditional knowledge)

◻ Limited (release of data may cause harm to the environment or to the public)

◻ Limited (pre-existing data has been used and is subject to access restrictions)

◻ Limited (data involves human subjects)

◻ Limited (data is supported by industry partnerships)

◻ Limited (data is supported by government partnerships)

Downloading and Characteristics of the Dataset

Download Links and Instructions

Version 1.1
https://zenodo.org/record/6578924#.ZDiNauzMJHY

Version 1.0

Total Size of all Dataset Files (GB)

File formats and online databases

◻ Link to online database or web services (e.g., WISKI, ECCC)

◻ Archive files (.zip, .rar, .7z, .tar, .tgz, .tar.gz, etc.)

◻ CSV files (.csv - comma or tab separated value files)

◻ Excel document files (.xlsx, .xls)

◻ Image files (e.g., .tiff, .jpeg, .png, .gif, etc.)

◻ NetCDF files (.netcdf, .nc)

◻ Text files (.txt)

◻ Word document files (.docx, .doc)

◻ Other (Please specify in field below)

Other Data Formats (if applicable)

List of Parameters and Variables

Parameter/Variable	Unit	Frequency	Source


West Boundary Longitude
East Boundary Longitude
North Boundary Latitude
South Boundary Latitude

Begin Date	End Date