# Set up the PEST(++) interface around the modified Freyberg model

In this notebook, we will construct a complex model independent (non-intrusive) interface around an existing MODFLOW6 model using pyEMU. We assume that the reader is at least partially familiar with PEST(++) file formats and working philosophy.

The modified Freyberg groundwater flow model has been constructed and is described in a previous notebook from this series. We will construct the entire PEST(++) interface from scratch here. This setup will be built upon in subsequent tutorials.

We will rely heavily on the pyemu.PstFrom class. Although here we employ it with a MODFLOW6 model, PstFrom is designed to be general and software independent (mostly). Some features are only available for MODFLOW models (e.g. SpatialReference).

The PstFrom class automates the construction of high-dimensional PEST(++) interfaces with all the bells and whistles. It provides easy-to-use functions to process model input and output files into PEST(++) datasets. It can assist with setting up spatio-temporally varying parameters. It handles the generation of geostatistical prior covariance matrices and ensembles. It automates writing a "model run" script. It provides tools to add custom pre- and post-processing functions to this script. It makes adding tweaks and fixes to the PEST(++) interface a breeze. All of this from the comfort of your favourite Python IDE.

During this tutorial we are going to construct a PEST dataset. Amongst other things, we will demonstrate:

• how to add observations & parameters from model output & input files;
• how to add pre- and post-processing functions to the “model run” script;
• how to generate geostatistical structures for spatially and temporally correlated parameters;
• how to edit parameter/observation data sections;
• how to generate a prior parameter covariance matrix and prior parameter ensemble;

First, let’s get our model files and sort out some admin.

import sys
import os
import shutil
import platform
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# for this course we use locally stored version of pyemu and flopy to avoid version conflicts
sys.path.insert(0,os.path.join("..", "..", "dependencies"))
import pyemu
import flopy
assert "dependencies" in flopy.__file__
assert "dependencies" in pyemu.__file__
# herebedragons.py contains a series of utility functions that help us manage the tutorial repository
sys.path.insert(0,"..")
import herebedragons as hbd


We will be calling a few external programs throughout this tutorial, namely MODFLOW 6 and PESTPP-GLM. For the purposes of the tutorial(s), we have included executables in the tutorial repository. They are in the bin folder, organized by operating system, and will be programmatically copied into the working directories as needed.

Some may prefer that executables be located in a folder that is cited in your computer’s PATH environment variable. Doing so allows you to run them from a command prompt open to any other folder without having to include the full path to these executables in the command to run them.

However, in situations where someone has several active projects and each may use different versions of compiled binary codes, this may not be practical. In such cases, we can simply place the executables in the folder from which they will be executed. So, let's copy the necessary executables into our working folder using a simple helper function:

Let’s copy the original model folder into a new working directory, just to ensure we don’t mess up the base files.

# folder containing original model files
org_d = os.path.join('..', '..', 'models', 'monthly_model_files_1lyr_newstress')

# a dir to hold a copy of the org model files
tmp_d = os.path.join('freyberg_mf6')

if os.path.exists(tmp_d):
    shutil.rmtree(tmp_d)
shutil.copytree(org_d, tmp_d)

# get executables
hbd.prep_bins(tmp_d)
# get dependency folders
hbd.prep_deps(tmp_d)


If you inspect the model folder, you will see that all the MODFLOW6 model files have been written "externally". This is key for working with the PstFrom class (or with PEST(++) in general, really). Essentially, all pertinent model inputs have been written as independent files in either array or list format. This makes it easier for us to programmatically access and re-write the values in these files.

Array files contain values of a single data type (usually floating-point numbers). List files have a few columns that contain index information, followed by columns of floating-point values (they have a tabular format; think .csv files or DataFrames). The PstFrom class provides methods for processing these file types into a PEST(++) dataset.
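As a minimal illustration of the difference (stdlib only; the file contents below are made up, not taken from the Freyberg model): an array file is just whitespace-delimited numbers, while a list file carries index columns followed by value columns.

```python
import io

# array-type file: whitespace-delimited floats, one line per model row
array_text = "1.0 2.0 3.0\n4.0 5.0 6.0\n"
array = [[float(v) for v in line.split()] for line in io.StringIO(array_text)]

# list-type file: index columns (e.g. layer, row, col) then a value column
list_text = "1 2 3 0.25\n1 2 4 0.50\n"
records = []
for line in io.StringIO(list_text):
    lay, row, col, val = line.split()
    records.append(((int(lay), int(row), int(col)), float(val)))
```

PstFrom's add_parameters() and add_observations() methods handle both layouts for you; this is only to fix the mental picture.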

os.listdir(tmp_d)


Now we need just a tiny bit of info about the spatial discretization of the model - this is needed to work out separation distances between parameters to build a geostatistical prior covariance matrix later.

Here we will load the flopy sim and model instance just to help us define some quantities later - flopy is not required to use the PstFrom class. Neither is MODFLOW. However, at the time of writing, support for SpatialReference to spatially locate parameters is limited to structured grid models.

Load the simulation. Run it once to make sure it works and to make sure that model output files are in the folder.

# load simulation
sim = flopy.mf6.MFSimulation.load(sim_ws=tmp_d, verbosity_level=0)
gwf = sim.get_model()

# run the model once to make sure it works
pyemu.os_utils.run("mf6",cwd=tmp_d)
# run modpath7
pyemu.os_utils.run(r'mp7 freyberg_mp.mpsim', cwd=tmp_d)


### Spatial Reference

Now we can instantiate a SpatialReference. This will later be passed to PstFrom to assist with spatially locating parameters (e.g. pilot points and/or cell-by-cell parameters). You can also use the flopy modelgrid class instance that is attached to the simulation, but SpatialReference is cleaner and faster for structured grids…

sr = pyemu.helpers.SpatialReference.from_namfile(
os.path.join(tmp_d, "freyberg6.nam"),
delr=gwf.dis.delr.array, delc=gwf.dis.delc.array)
sr


### Instantiate PstFrom

Now we can start to construct the PEST(++) interface by instantiating a PstFrom class instance. There are a few things that we need to specify up front:

• the folder in which we currently have model files (e.g. tmp_d). PstFrom will copy all the files from this directory into a new “template” folder.
• template folder: this is a folder in which the PEST dataset will be constructed - this folder will hold the model files plus all of the files needed to run PEST(++). This folder/dataset will form the template for subsequent deployment of PEST(++).
• longnames: for backwards compatibility with PEST and PEST_HP (i.e. non-PEST++ versions), which have upper limits on parameter/observation name lengths (PEST++ does not). Setting this value to False is only recommended if required.
• Whether the model is zero based or not.
• (optional) the spatial reference, as previously discussed. This is only required if using pyEMU to define parameter spatial correlation. Alternatively, you can define these yourself or use utilities available in the PEST-suite.
# specify a template directory (i.e. the PstFrom working folder)
template_ws = os.path.join("freyberg6_template")
start_datetime="1-1-2008"
# instantiate PstFrom
pf = pyemu.utils.PstFrom(original_d=tmp_d, # where the model is stored
                         new_d=template_ws, # the PEST template folder
                         remove_existing=True, # ensures a clean start
                         longnames=True, # set False if using PEST/PEST_HP
                         spatial_reference=sr, # the spatial reference we generated earlier
                         zero_based=False, # does the MODEL use zero-based indices? MODFLOW does NOT
                         start_datetime=start_datetime, # required when specifying temporal correlation between parameters
                         echo=False) # stops PstFrom from writing lots of information to the notebook; experiment by setting it to True to see the difference; useful for troubleshooting

os.listdir(template_ws)


So we see that when PstFrom is instantiated, it starts by copying the original_d to the new_d. Sweet as!

### Observations

We now have a PstFrom instance assigned to the variable pf. For now it is only an empty container to which we can start adding “observations”, “parameters” and other bits and bobs.

Let's start with observations because they are easier. MODFLOW6 makes life even easier by recording observations in nicely organized .csv files. Isn't that a peach!

#### Freyberg Recap

As you may recall from the “intro to Freyberg” tutorial, the model is configured to record time series of head at observation wells, and flux at three locations along the river. These are recorded in external .csv files named heads.csv and sfr.csv, respectively. You should be able to see these files in the model folder.

Recall that each .csv houses records of observation time-series. Outputs are recorded for each simulated stress-period. The model starts with a single steady-state stress-period, followed by 24 monthly transient stress-periods. The steady-state and first 12 transient stress-periods simulate the history-matching period. The last 12 transient stress periods simulate future conditions (i.e. the prediction period).

# check the output csv file names
for i in gwf.obs:
    print(i.output.obs_names)


df = pd.read_csv(os.path.join(template_ws,"heads.csv"),index_col=0)
df.head()


As you can see, there are many columns, one for each observation site. Conveniently, * cough * they are named according to the cell layer, row and column.

The values in the .csv file were generated by running the model. IMPORTANT: PstFrom assumes that the values in this file are the target observation values, and they will be used to populate the PEST(++) dataset. This lets the user quickly verify that the PstFrom process reproduces the same model output files - an important thing to test!

Now, you can and should change the observation values later on for the quantities that correspond to actual observation data. This is the standard workflow when using PstFrom because it allows users to separate the PEST interface setup from the always-important process of setting observation values and weights. We address this part of the workflow in a separate tutorial.

First, we will use the PstFrom.add_observations() method to add observations to our pf object. This method can use list-type files, where the data are organized in column/tabular format with one or more index columns and one or more data columns. This method can also use array-type files, where the data are organized in a 2-D array structure (we will see this one later…)

We are going to tell pf which columns of this file contain observations. Values in these columns will be assigned to observation values.

We can also inform it if there is an index column (or columns). Values in this column will be included in the observation names.

We could also specify which rows to include as observations. But observations are free…so why not keep them all!

Let’s add observations from heads.csv. The first column of this file records the time at which the value is simulated. Let’s use that as the index column (this becomes useful later on to post-process results). We want all other columns as observation values.

hds_df = pf.add_observations("heads.csv", # the model output file to read
                             insfile="heads.csv.ins", # optional, the instruction file name
                             index_cols="time", # column header to use as index; can also use the (zero-based) column number instead of the header name
                             use_cols=list(df.columns.values), # names of columns that contain observation values; can also use (zero-based) column numbers
                             prefix="hds") # prefix for all observation names; choose something logical and easy to find. We use it later on to select observations


Let’s inspect what we just created.

We can see that the .add_observations() method returned a dataframe with lots of useful info:

• the observation names that were formed (see the obsnme column); note that these include lots of useful metadata such as the column name, index value and so on;
• the values that were read from heads.csv (see obsval column);
• some generic weights and group names; note that observations are grouped according to the column of the model output .csv. Alternatively, we could have specified a list of observation group names.
hds_df.head()


At this point, no PEST control file has been created, we have simply prepared to add these observations to the control file later. Everything is still only stored in memory. However, a PEST instruction file has been created in the template folder (template_ws):

[f for f in os.listdir(template_ws) if f.endswith(".ins")]


Blimey, wasn't that easy? Adding thousands of model output quantities as observations to a PEST dataset becomes a breeze!

Let’s quickly do the same thing for the SFR observations.

df = pd.read_csv(os.path.join(template_ws, "sfr.csv"), index_col=0)

# add the observations to pf
sfr_df = pf.add_observations("sfr.csv", # the model output file to read
                             insfile="sfr.csv.ins", # optional, the instruction file name
                             index_cols="time", # column header to use as index
                             use_cols=list(df.columns.values), # names of columns that contain observation values
                             prefix="sfr") # prefix for all observation names


We also want to add observations of particle travel time and status. Unfortunately, the file written by MODPATH7 is not easily ingestible by PstFrom. So we are going to need to "manually" construct an instruction file. We could add that now with the PstFrom.add_observations_from_ins() method, but we will wait and add these after constructing the Pst object - soon!

### Parameters

The PstFrom.add_parameters() method reads model input files and adds parameters to the PEST(++) dataset. Parameterisation can be configured in several ways.

• model input files can be in array or list format;
• parameters can be set up as different "types". Each value in a model input file can (1) be a separate parameter ("grid" scale parameters), (2) be grouped into "zones" or (3) be treated, together with all the others, as a single parameter ("constant" type). Alternatively, (4) parameters can be assigned to pilot points, from which individual parameter values are subsequently interpolated. PstFrom adds the relevant pre-processing steps to assign parameter values directly into the "model run" script.
• parameter values can be set up as "direct", "multiplier" or "addend". This means the "parameter value" which PEST(++) sees can be (1) the same value the model sees, (2) a multiplier on the value in the existing/original model input file, or (3) a value which is added to the value in the existing/original model input file. This is very nifty and allows for some pretty advanced parameterization schemes by allowing mixtures of different types of parameters. PstFrom is designed to preferentially use parameters set up as multipliers (that is the default parameter style). This lets us preserve the existing model inputs and treat them as the mean of the prior parameter distribution. Once again, relevant pre-processing scripts are automatically added to the "model run" script (discussed later) so that the multiplicative and additive parameterization process is not something the user has to worry about.
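The three styles amount to simple arithmetic. A sketch with illustrative values only; at runtime, PstFrom's "model run" script performs the equivalent bookkeeping via the multiplier files it writes:

```python
# value in the existing/original model input file
original = 2.5

# "direct": the parameter value PEST sees IS the model input value
par_direct = 3.0
model_sees_direct = par_direct

# "multiplier" (the PstFrom default): model input = original * parameter
par_mult = 1.2
model_sees_mult = original * par_mult

# "addend": model input = original + parameter
par_add = -0.5
model_sees_add = original + par_add
```

Note that a multiplier of 1.0 (or an addend of 0.0) reproduces the original model input exactly, which is why multipliers let us treat the existing inputs as the prior mean.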

#### Freyberg Recap

As discussed, all model inputs are stored in external files. Some are arrays. Others are lists. Recall that our model has 1 layer. It is transient. Hydraulic properties (Kh, Kv, Ss, Sy) vary in space. Recharge varies over both space and time. We have GHB, SFR and WEL boundary conditions. GHB parameters are constant over time, but vary spatially. SFR inflow varies over time. Pumping rates of individual wells are uncertain in space and time.

All of these have some degree of spatial and/or temporal correlation.

#### Geostatistical Structures

Parameter correlation plays a role in (1) regularization when giving preference to the emergence of patterns of spatial heterogeneity and (2) when specifying the prior parameter probability distribution (which is what regularization is enforcing!). Since we are all sophisticated and recognize the importance of expressing spatial and temporal uncertainty (e.g. heterogeneity) in the model inputs (and the corresponding spatial correlation in those uncertain inputs), let’s use geostatistics to express uncertainty. To do that we need to define “geostatistical structures”.

For the sake of this tutorial, let's assume that heterogeneity in all spatially distributed parameters shares the same statistical characteristics. Likewise for temporally varying parameters. We will therefore only construct two geostatistical structures.

# exponential variogram for spatially varying parameters
v_space = pyemu.geostats.ExpVario(contribution=1.0, #sill
a=1000, # range of correlation; length units of the model. In our case 'meters'
anisotropy=1.0, #name says it all
bearing=0.0 #angle in degrees East of North corresponding to anisotropy ellipse
)

# geostatistical structure for spatially varying parameters
grid_gs = pyemu.geostats.GeoStruct(variograms=v_space, transform='log')

# plot the gs if you like:
_ = grid_gs.plot()

# exponential variogram for time varying parameters
v_time = pyemu.geostats.ExpVario(contribution=1.0, #sill
a=60, # range of correlation; length time units (days)
anisotropy=1.0, #do not change for 1-D time
bearing=0.0 #do not change for 1-D time
)

# geostatistical structure for time varying parameters
temporal_gs = pyemu.geostats.GeoStruct(variograms=v_time, transform='none')
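For intuition, an exponential variogram implies a covariance between two parameters that decays with their separation distance. A sketch assuming the simple form c·exp(−h/a) (the exact scaling convention pyemu applies internally may differ; treat this as illustrative, not as pyemu's implementation):

```python
import math

def exp_cov(h, contribution=1.0, a=1000.0):
    """Illustrative covariance between two points a distance h apart,
    assuming the simple exponential model c * exp(-h / a)."""
    return contribution * math.exp(-h / a)

# correlation decays with separation distance
c_near = exp_cov(100.0)   # points 100 m apart: strongly correlated
c_far = exp_cov(3000.0)   # points 3000 m apart: weakly correlated
```

The range a controls how quickly correlation dies off: points much closer than a are treated as highly correlated, points much further apart as nearly independent.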


Let’s start by adding parameters of hydraulic properties that vary in space (but not time) and which are housed in array-type files (e.g. Kh, Kv, Ss, Sy). We will start by demonstrating step-by-step for Kh.

First, find all the external array files that contain Kh values. In our case, these are the files with “npf_k_” in the file name.

tag = "npf_k_"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
print(files)


Let’s setup multiple spatial scales of parameters for Kh. To do this we will use three of the parameter “types” described above. The coarse scale will be a constant single value for each array. The medium scale will pilot points. The finest scale will use parameters as the grid scale (a unique parameter for each model cell!)

Each scale of parameters works with the others as multipliers on the existing Kh arrays. (This all happens at runtime as part of the "model run" script.) Think of the scales as dials that PEST(++) can turn to improve the fit. The "coarse" scale is one big dial that allows PEST to move everything at once - that is, change the mean of the entire Kh array. The "medium" dials are few (but not too few) and allow PEST to adjust broad areas without moving everything. The "fine" scale provides lots of small dials that give PEST(++) very detailed control, tweaking parameter values within very small areas.

However, because we are working with parameter multipliers, we will need to specify two sets of parameter bounds:

• upper_bound and lower_bound are the standard control file bounds (the bounds on the parameters that PEST sees), while
• ult_ubound and ult_lbound are bounds that are applied at runtime to the resulting (multiplied out) model input array that MODFLOW reads.

Since we are using sets of multipliers, it is important to make sure we keep the resulting model input arrays within the range of realistic values.
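A sketch of what this means at runtime (illustrative numbers only): after all multipliers are applied, the resulting model input values are clipped to the ultimate bounds before the model reads them.

```python
# original model input values (e.g. Kh in three cells)
original = [0.02, 3.0, 40.0]
# the net multiplier for each cell (the product of all scales)
multiplier = [0.2, 1.0, 5.0]
# ultimate bounds on the model input values
ult_lbound, ult_ubound = 0.01, 100.0

multiplied = [o * m for o, m in zip(original, multiplier)]
# clip to the ultimate bounds so the model never sees unrealistic values
clipped = [min(max(v, ult_lbound), ult_ubound) for v in multiplied]
```

Here the first cell would have fallen below the realistic range (0.004) and the third above it (200.0); both are pulled back to the ultimate bounds, while the middle cell passes through untouched.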

#### Array Files

We will first demonstrate step-by-step for freyberg6.npf_k_layer1.txt. We will start with grid scale parameters. These are multipliers assigned to each individual value in the array.

We start by getting the idomain array. As our model has inactive cells, this helps us avoid adding unnecessary parameters. It is also required later when generating pilot points.

# get the IDOMAIN array; in our case we only have one layer
ib = gwf.dis.idomain.array[0]
plt.imshow(ib)

f = 'freyberg6.npf_k_layer1.txt'

# grid (fine) scale parameters
df_gr = pf.add_parameters(filenames=f, # the model input file to parameterize
                          zone_array=ib, # as we have inactive model cells, we can avoid assigning these as parameters
                          par_type="grid", # specify the type; these will be unique parameters for each cell
                          geostruct=grid_gs, # the geostatistical structure for spatial correlation
                          par_name_base=f.split('.')[1].replace("_","")+"gr", # a parameter name base that lets us easily identify the filename and parameter type; "gr" for "grid", and so forth
                          pargp=f.split('.')[1].replace("_","")+"gr", # likewise for the parameter group name
                          lower_bound=0.2, upper_bound=5.0, # parameter lower and upper bounds
                          ult_ubound=100, ult_lbound=0.01 # ultimate bounds for the multiplied-out model input values. Here we state that, after accounting for all multipliers, Kh cannot exceed these values. Very important with multipliers!
                          )


As when adding observations, pf.add_parameters() returns a dataframe. Take a look. You may recognize a lot of the information that appears in a PEST * parameter data section. All of this is still only housed in memory for now. We will write the PEST control file later on.

df_gr.head()


This add_parameters() call also wrote a template file that PEST(++) will use to populate the multiplier array at runtime:

[f for f in os.listdir(template_ws) if f.endswith(".tpl")]


Remember! No PEST control file has been made yet. PstFrom is simply preparing to make a control file later…

Now, we add pilot point (medium scale) multiplier parameters to the same model input file. These multipliers are assigned to pilot points, which are subsequently interpolated to values in the array.

You can add pilot points in two ways:

1. PstFrom can generate them for you on a regular grid or
2. you can supply PstFrom with existing pilot point location information in the form of a dataframe or a point-coverage shapefile.

When you change par_type to "pilotpoints", by default a regular grid of pilot points is set up using a default pp_space value of 10 (which is every 10th row and column). You can change this spacing by passing an integer to pp_space (as demonstrated below).

Alternatively you can specify a filename or dataframe with pilot point locations. If you supply pp_space as a str it is assumed to be a filename. The extension is the guide: “.csv” for dataframe, “.shp” for shapefile (point-type). Anything else and the file is assumed to be a pilot points file type. The dataframe (or .csv file) must have “name”, “x”, and “y” as columns - it can have more, but must have those.
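A minimal sketch of constructing such a table (stdlib csv module, made-up point names and coordinates; in practice you would likely build a pandas DataFrame and pass it, or save it to a .csv):

```python
import csv
import io

# a pilot point table needs at least "name", "x" and "y" columns
points = [("pp_0", 250.0, 9750.0),
          ("pp_1", 1250.0, 9750.0),
          ("pp_2", 250.0, 8750.0)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "x", "y"])  # required column headers
writer.writerows(points)
csv_text = buf.getvalue()
```

Extra columns (e.g. a zone identifier) are allowed, but those three are mandatory; the x/y coordinates must be in the same coordinate system as the SpatialReference.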

# pilot point (medium) scale parameters
df_pp = pf.add_parameters(filenames=f,
                          zone_array=ib,
                          par_type="pilotpoints",
                          geostruct=grid_gs,
                          par_name_base=f.split('.')[1].replace("_","")+"pp",
                          pargp=f.split('.')[1].replace("_","")+"pp",
                          lower_bound=0.2, upper_bound=5.0,
                          ult_ubound=100, ult_lbound=0.01,
                          pp_space=5) # PstFrom will generate a uniform grid of pilot points in every 5th row and column

fig,ax = plt.subplots(1,1,figsize=(4,6))
ax.set_aspect("equal")
ax.pcolormesh(sr.xcentergrid, sr.ycentergrid,ib)
ax.scatter(df_pp.x,df_pp.y)


Lastly, add the constant (coarse) parameter multiplier. This is a single multiplier value applied to all values in the array.

# constant (coarse) scale parameters
df_cn = pf.add_parameters(filenames=f,
                          zone_array=ib,
                          par_type="constant",
                          geostruct=grid_gs,
                          par_name_base=f.split('.')[1].replace("_","")+"cn",
                          pargp=f.split('.')[1].replace("_","")+"cn",
                          lower_bound=0.2, upper_bound=5.0,
                          ult_ubound=100, ult_lbound=0.01)


Now we see three template files have been created:

[f for f in os.listdir(template_ws) if f.endswith(".tpl")]


Feel free to navigate to the template_ws and inspect these files.

We are going to be repeating this for each parameter type, so let's write a function.

def add_mult_pars(f, lb=0.2, ub=5.0, ulb=0.01, uub=100, add_coarse=True):
    if isinstance(f, str):
        base = f.split(".")[1].replace("_","")
    else:
        # a list of files; use the first to build the name base
        base = f[0].split(".")[1].replace("_","")
    # grid (fine) scale parameters
    pf.add_parameters(filenames=f,
                      zone_array=ib,
                      par_type="grid", # unique parameters for each cell
                      geostruct=grid_gs, # the geostatistical structure for spatial correlation
                      par_name_base=base+"gr", # a name base that lets us easily identify the filename and parameter type; "gr" for "grid", and so forth
                      pargp=base+"gr", # likewise for the parameter group name
                      lower_bound=lb, upper_bound=ub, # parameter lower and upper bounds
                      ult_ubound=uub, ult_lbound=ulb # ultimate bounds for the multiplied-out model input values. Very important with multipliers!
                      )
    # pilot point (medium) scale parameters
    pf.add_parameters(filenames=f,
                      zone_array=ib,
                      par_type="pilotpoints",
                      geostruct=grid_gs,
                      par_name_base=base+"pp",
                      pargp=base+"pp",
                      lower_bound=lb, upper_bound=ub,
                      ult_ubound=uub, ult_lbound=ulb,
                      pp_space=5) # a uniform grid of pilot points in every 5th row and column
    if add_coarse:
        # constant (coarse) scale parameters
        pf.add_parameters(filenames=f,
                          zone_array=ib,
                          par_type="constant",
                          geostruct=grid_gs,
                          par_name_base=base+"cn",
                          pargp=base+"cn",
                          lower_bound=lb, upper_bound=ub,
                          ult_ubound=uub, ult_lbound=ulb)
    return


Let’s speed through the other array parameter files.

# for Kv
tag = "npf_k33"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
for f in files:
    add_mult_pars(f)

# for Ss
tag = "sto_ss"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
# only for layer 2 and 3; we aren't monsters
for f in files[1:]:
    add_mult_pars(f)

# for Sy
tag = "sto_sy"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
# only for layer 1
f = files[0]
add_mult_pars(f)

# for porosity
tag = "ne_"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
for f in files:
    add_mult_pars(f)


len([f for f in os.listdir(template_ws) if f.endswith(".tpl")])

[f for f in os.listdir(template_ws) if f.endswith(".tpl")]


Boom! We just conquered property parameterization in a big way!

#### Spatial and Temporal Correlation

Now, you may be thinking “shouldn’t recharge have temporal correlation as well?”.

Damn straight it should. Now, this requires a little trickery, because natively handling spatiotemporal correlation is hard to do. So what we are going to do is split this correlation into two sets of multiplier parameters. One set of parameters will be constant in space but vary (and be correlated) in time. The other set of multiplier parameters will be constant in time but vary (and be correlated) in space. Since both of these sets of parameters are multipliers, we implicitly represent the concept that recharge is uncertain and correlated in both space and time. Easy as!
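A toy sketch of the idea (illustrative values only): because both sets are multipliers, a space-only field and a time-only series combine multiplicatively, so the recharge the model sees carries variation in both dimensions.

```python
# spatial multipliers: one per cell, constant over time
space_mult = [0.8, 1.0, 1.3]
# temporal multipliers: one per stress period, constant over space
time_mult = [1.1, 0.9]
# original recharge value (uniform here, for simplicity)
original = 1.0e-4

# effective recharge for each cell in each stress period
recharge = [[original * s * t for s in space_mult] for t in time_mult]
```

Each entry differs from its neighbours both across the row (space) and down the column (time), even though neither multiplier set varies in both dimensions on its own.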

First we need to construct a container of stress period datetimes. (This relies on specifying the start_datetime argument when instantiating PstFrom.) These datetime values will specify the position of parameters on the time-axis.

# build up a container of stress period start datetimes - this will
# be used to specify the datetime of each multipler parameter

dts = pd.to_datetime(start_datetime) + pd.to_timedelta(np.cumsum(sim.tdis.perioddata.array["perlen"]),unit='d')

dts


If you use the same parameter group name (pargp) and the same geostruct, PstFrom will treat parameters set up across different calls to add_parameters() as correlated. WARNING: do not try to express spatial and temporal correlation together - as discussed above, #badtimes. In this case, we want to express temporal correlation in the recharge multiplier parameters that are "constant" in space, so that there is one recharge multiplier parameter per stress period, sharing a parameter group name across the different calls to add_parameters(). So we use the same parameter group name for each stress-period data file, and specify the datetime and geostruct arguments.

Including temporal correlation introduces an additional challenge. Interpolation between points that share a common coordinate creates all types of trouble. We are going to have many parameters during each stress period (a single point on the time-axis). To get around this challenge we need to be a bit sneaky.

First, we will apply the multiple spatial scales of parameter multipliers (constant, pilot point and grid) as we did for the hydraulic properties. We do this for all recharge files at once, which tells PstFrom to broadcast (e.g. share) the same parameters across all of those files:

# for Recharge;
tag = "rch_recharge"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
sp = [int(f.split(".")[1].split('_')[-1]) for f in files]
d = {s:f for s,f in zip(sp,files)}
sp.sort()
files = [d[s] for s in sp]
print(files)
# the spatial multiplier parameters; just use the same function,
# passing all the files at once so the parameters are shared across them
add_mult_pars(files)

len([f for f in os.listdir(template_ws) if f.endswith(".tpl")])


Then, we will assign an additional constant multiplier parameter for each recharge stress-period file (so, a single multiplier for all recharge parameters for each stress period). We will specify temporal correlation for these constant multipliers. These will all have the same parameter group name, as discussed above.

for f in files:
    # get the (zero-based) stress period number from the file name
    kper = int(f.split('.')[1].split('_')[-1]) - 1
    # add the constant multiplier parameter (with temporal correlation)
    pf.add_parameters(filenames=f,
                      zone_array=ib,
                      par_type="constant",
                      par_name_base=f.split('.')[1]+"tcn",
                      pargp="rchtcn", # the same group name for every stress period; together with geostruct, this is what makes them correlated
                      lower_bound=0.5, upper_bound=1.5,
                      ult_ubound=1e-3, ult_lbound=0,
                      datetime=dts[kper], # this places the parameter value on the "time axis"
                      geostruct=temporal_gs)


Sweet! Done!

### List Files

Adding parameters from list-type files follows similar principles. As with observation files, they must be tabular. Certain columns are specified as index columns and are used to populate parameter names, as well as provide the parameters’ spatial location. Other columns are specified as containing parameter values.

Parameters can be grid or constant. As before, values can be assigned directly, as multipliers or as additives.

We will demonstrate for the boundary-condition input files.

Starting off with GHBs. Let’s inspect the folder. As you can see, there is a single input file (here we assume GHB parameters do not vary over time).

tag = "ghb_stress_period_data"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
print(files)


Since these boundaries are likely to be very influential, we want to include a robust representation of their uncertainty - both head and conductance and at multiple scales.

Let’s parameterize both GHB conductance and head:

• For conductance, we shall use two scales of multiplier parameters (constant and grid).

• For heads, multipliers are not ideal. Instead we will use additive parameters. Again, with a coarse and fine scale.

ATTENTION!

Additive parameters by default get assigned an initial parameter value of zero. This can be problematic later on when computing the derivatives. Be sure to either apply a parameter offset, or use “absolute” increment types in the parameter group section (we will implement the latter option further on in the current tutorial.)
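A sketch of why this matters (the derinc value here is assumed, purely for illustration): with "relative" increments, the finite-difference perturbation is proportional to the current parameter value, so it collapses to zero for a zero-valued additive parameter, while an "absolute" increment does not.

```python
# the initial value of an additive parameter is typically zero
parval = 0.0
derinc = 0.01  # an assumed increment setting, for illustration

# "relative" increment: proportional to the current parameter value
relative_increment = abs(parval) * derinc  # zero perturbation -> no derivative!

# "absolute" increment: fixed, independent of the parameter value
absolute_increment = derinc                # a usable perturbation
```

Hence the advice above: either shift the parameter away from zero with an offset, or switch that parameter group to absolute increments.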

tag = "ghb_stress_period_data"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]

for f in files:
    # constant and grid scale multiplier conductance parameters
    name = 'ghbcond'
    pf.add_parameters(filenames=f,
                      par_type="grid",
                      geostruct=grid_gs,
                      par_name_base=name+"gr",
                      pargp=name+"gr",
                      index_cols=[0,1,2], # columns containing lay, row, col
                      use_cols=[4], # column containing conductance values
                      lower_bound=0.1, upper_bound=10.0)
    pf.add_parameters(filenames=f,
                      par_type="constant",
                      geostruct=grid_gs,
                      par_name_base=name+"cn",
                      pargp=name+"cn",
                      index_cols=[0,1,2],
                      use_cols=[4],
                      lower_bound=0.1, upper_bound=10.0,
                      ult_lbound=0.01, ult_ubound=100) # absolute limits

    # constant and grid scale additive head parameters
    name = 'ghbhead'
    pf.add_parameters(filenames=f,
                      par_type="grid",
                      geostruct=grid_gs,
                      par_name_base=name+"gr",
                      pargp=name+"gr",
                      index_cols=[0,1,2],
                      use_cols=[3], # column containing head values
                      lower_bound=-2.0, upper_bound=2.0,
                      par_style="a", # additive parameters
                      transform="none", # specify no log-transform
                      ult_lbound=32.5, ult_ubound=42) # absolute limits; make sure head is never lower than the bottom of layer 1
    pf.add_parameters(filenames=f,
                      par_type="constant",
                      geostruct=grid_gs,
                      par_name_base=name+"cn",
                      pargp=name+"cn",
                      index_cols=[0,1,2],
                      use_cols=[3],
                      lower_bound=-2.0, upper_bound=2.0,
                      par_style="a",
                      transform="none",
                      ult_lbound=32.5, ult_ubound=42)


Easy peasy.

Now, this will make some people uncomfortable, but how well do we really ever know historic water use flux rates in space and in time? Hmmm, not really! And just a little uncertainty in historic water use can result in large changes in simulated water levels…So let's add parameters to represent that uncertainty in the model inputs.

For wells it may not (or it may…) make sense to include spatial correlation. Here we will assume temporal correlation - it's reasonable that pumping rates today will be similar to pumping rates yesterday.

Pumping rates for different stress periods are in separate files. We will call .add_parameters() for each file. But we want to specify correlation between parameters in different files. As explained above for recharge, we do this with the parameter group name.

OK, let’s get started.

As discussed above, including temporal correlation introduces an additional challenge. We use the same approach described for recharge parameters:

• First, we will assign a constant multiplier parameter for each WEL stress-period file (so, a single multiplier for all well pumping rates for each stress period). We will specify temporal correlation for these constant multipliers.

• Then, we will also have grid type multiplier parameters for each WEL stress period file (so, multipliers for individual well pumping rate during each stress period). These will not include (temporal) correlation. (We could in principle include spatial correlation here if we wanted to; but let’s not).
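Under the hood, the temporal geostruct translates the time separation between parameters into a correlation. As a hedged sketch (assuming an exponential-variogram-style decay, correlation ≈ exp(-Δt/a), with a hypothetical range a of 30 days):

```python
import numpy as np

a = 30.0  # hypothetical correlation "range", in days
dt = np.array([0.0, 10.0, 30.0, 180.0])  # time separations between parameter pairs
corr = np.exp(-dt / a)  # exponential-variogram-style correlation decay
print(corr.round(3))
```

Parameters in adjacent stress periods end up strongly correlated in the prior, while those far apart in time are nearly independent.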

files = [f for f in os.listdir(template_ws) if "wel_stress_period_data" in f and f.endswith(".txt")]
sp = [int(f.split(".")[1].split('_')[-1]) for f in files]
d = {s:f for s,f in zip(sp,files)}
sp.sort()
files = [d[s] for s in sp]

for f in files:
    # get the stress period number from the file name
    kper = int(f.split('.')[1].split('_')[-1]) - 1

    # add the constant parameters (with temporal correlation)
    pf.add_parameters(f,
                      index_cols=[0,1,2], # columns that specify cell location
                      use_cols=[3],       # column with parameter values
                      par_type="constant", # a single multiplier for all wells in this stress period
                      par_name_base="welcst",
                      pargp="welcst",
                      upper_bound=4, lower_bound=0.25,
                      datetime=dts[kper], # this places the parameter value on the "time axis"
                      geostruct=temporal_gs)

    # add the grid parameters; each individual well will be adjustable
    pf.add_parameters(f,
                      index_cols=[0,1,2], # columns that specify cell location
                      use_cols=[3],       # column with parameter values
                      par_type="grid",
                      par_name_base="welgrd",
                      pargp="welgrd",
                      upper_bound=4, lower_bound=0.25,
                      datetime=dts[kper]) # this places the parameter value on the "time axis"



And finally, our favourite (not!) boundary-condition: SFR.

Let’s parameterize conductance (time-invariant) and inflow (time-variant).

# SFR conductance
tag = "sfr_packagedata"
files = [f for f in os.listdir(template_ws) if tag in f.lower() and f.endswith(".txt")]
assert len(files) == 1 # There can be only one! It is tradition. Jokes.
print(files)

f = files[0]
# constant and grid scale multiplier conductance parameters
name = "sfrcond"
pf.add_parameters(f,
                  par_type="grid",
                  geostruct=grid_gs,
                  par_name_base=name+"gr",
                  pargp=name+"gr",
                  index_cols=[0,2,3],
                  use_cols=[9],
                  lower_bound=0.1, upper_bound=10.0)
pf.add_parameters(f,
                  par_type="constant",
                  geostruct=grid_gs,
                  par_name_base=name+"cn",
                  pargp=name+"cn",
                  index_cols=[0,2,3],
                  use_cols=[9],
                  lower_bound=0.1, upper_bound=10.0,
                  ult_lbound=0.001, ult_ubound=100) # absolute limits

# SFR inflow
files = [f for f in os.listdir(template_ws) if "sfr_perioddata" in f and f.endswith(".txt")]
sp = [int(f.split(".")[1].split('_')[-1]) for f in files]
d = {s:f for s,f in zip(sp,files)}
sp.sort()
files = [d[s] for s in sp]
print(files)
for f in files:
    # get the stress period number from the file name
    kper = int(f.split('.')[1].split('_')[-1]) - 1
    pf.add_parameters(f,
                      index_cols=[0], # reach number
                      use_cols=[2],   # column with parameter values
                      par_type="grid",
                      par_name_base="sfrgr",
                      pargp="sfrgr",
                      upper_bound=10, lower_bound=0.1, # don't need ult_bounds because it is a single multiplier
                      datetime=dts[kper], # this places the parameter value on the "time axis"
                      geostruct=temporal_gs)

[f for f in os.listdir(template_ws) if f.endswith(".tpl")]


Damn! We just parameterized many recognized sources of model input uncertainty at several spatial and temporal scales. And we expressed spatial and temporal correlation in those parameters. One last set of parameters that we will need later for sequential data assimilation - initial conditions:

files = [f for f in os.listdir(template_ws) if "ic_strt" in f and f.endswith(".txt")]
files

for f in files:
    base = f.split(".")[1].replace("_","")
    df = pf.add_parameters(f,
                           par_type="grid",
                           pargp=base, par_name_base=base,
                           upper_bound=50, lower_bound=15,
                           zone_array=ib, transform="none")
    print(df.shape)


### The Forward Run Script

OK! So, we almost have all the base building blocks for a PEST(++) dataset. We have some (1) observations and some (2) parameters. We are still missing (3) the “forward run” script. Recall that in the PEST world, the “model” is not just the numerical model (e.g. MODFLOW). Instead it is a composite of the numerical model (or models) and pre- and post-processing steps, encapsulated in a “forward run” script which can be called from the command line. This command line instruction is what PEST(++) sees as “the model”. During execution, PEST(++) writes values to parameter files, runs “the model”, and then reads values from the observation files.

PstFrom automates the generation of such a script when constructing the PEST control file. The script is written to a file named forward_run.py. It is written in Python (this is not a PEST(++) requirement, merely a convenience…we are working in Python after all…).
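For orientation, a generated forward_run.py typically does something like the hedged skeleton below. This is an illustration only - the file names and steps are placeholders, and the real script (generated for you, and printed further down) differs in its details:

```python
# illustrative skeleton of a PstFrom-style forward_run.py (NOT the generated file)
import os

def main():
    # 1. remove stale model outputs so results from a previous run
    #    can't be read by mistake (hypothetical file names)
    for fname in ["heads.csv", "sfr.csv"]:
        if os.path.exists(fname):
            os.remove(fname)
    # 2. pre-processing: turn multiplier/additive parameters into model inputs
    #    (the real script calls pyemu.helpers.apply_list_and_array_pars())
    # 3. run the simulator(s), e.g. pyemu.os_utils.run("mf6")
    # 4. post-processing: compute secondary observations, extract arrays, etc.

if __name__ == "__main__":
    main()
```

PEST(++) simply executes this script from the command line; everything between parameter files in and observation files out is "the model".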

How about we see that in action? Magic time! Let’s create the PEST control file.

pst = pf.build_pst()


Boom! Done. (Well almost.) Check the folder. You should see a new .pst file and the forward_run.py file. By default, the .pst file is named after the original model folder name.

[f for f in os.listdir(template_ws) if f.endswith(".py") or f.endswith(".pst") ]


We will get to the pst object later on (see also the “intro to pyemu” tutorial notebook). For now, let’s focus on the forward_run.py script. It is printed out below.

This script does a bunch of things:

• it removes model output files to avoid the possibility of files from a previous model run being read by mistake;
• it runs pre-processing steps (see pyemu.helpers.apply_list_and_array_pars());
• it executes system commands (usually running the simulator, i.e. MODFLOW). (This is still missing. We will demonstrate next.)
• it executes post-processing steps; (for now there aren’t any)
• …it washes the dishes (sorry, no it doesn’t…this feature is still in development).

_ = [print(line.rstrip()) for line in open(os.path.join(template_ws,"forward_run.py"))]


That’s pretty amazing. But as we just saw, we still need to add commands to actually run the model.

PstFrom allows you to pass a list of system commands which will be executed in sequence. It also has methods for including Python functions that run before or after the system commands. These make pre-/post-processing a piece of cake. In fact, we have already started to add to it. Remember all of the multiplier and additive parameters we set up? These all require pre-processing steps to convert the PEST-generated multipliers into model input values. PstFrom automatically adds these functions to the forward_run.py script. Nifty, hey?

Next we will demonstrate how to specify the system commands and add Python functions as processing steps.

#### Sys Commands

Let’s start by adding a command line instruction. These are stored as a list in PstFrom.mod_sys_cmds, which is currently empty.

pf.mod_sys_cmds


To run a MODFLOW6 model from the command line, you can simply execute mf6 in the model folder. So, we can add this command by appending it to the list. (Do this only once! Every time you append ‘mf6’ results in an additional call to MODFLOW6, meaning the model would be run multiple times.)

PstFrom will add a line to forward_run.py which executes this command via pyemu.os_utils.run().

pf.mod_sys_cmds.append("mf6") #do this only once
pf.mod_sys_cmds


We also need to run MODPATH7, so we need to add that to the list of system commands. In this case we also need to specify the modpath sim file:

pf.mod_sys_cmds.append("mp7 freyberg_mp.mpsim") #do this only once
pf.mod_sys_cmds


OK, now let’s re-build the Pst control file and check out the changes to the forward_run.py script.

You should see that pyemu.os_utils.run(r'mf6') has been added after the pre-processing functions.

pst = pf.build_pst()

_ = [print(line.rstrip()) for line in open(os.path.join(template_ws,"forward_run.py"))]


#### Extra pre- and post-processing functions

You will almost certainly need to include some additional processing steps. These are supported through PstFrom.pre_py_cmds and PstFrom.post_py_cmds, which are lists of pre- and post-model-run python commands, and PstFrom.pre_sys_cmds and PstFrom.post_sys_cmds, which are lists of pre- and post-model-run system commands (these are wrapped in pyemu.os_utils.run()).
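To make the mechanics concrete, here is a small illustration (not pyemu source code; the command strings are hypothetical) of how such command lists could end up as lines in forward_run.py - python commands are written into the script verbatim, while system commands are wrapped in pyemu.os_utils.run():

```python
# hypothetical command lists mimicking PstFrom.pre_py_cmds / post_sys_cmds
pre_py_cmds = ["shutil.rmtree('mult_backup', ignore_errors=True)"]
post_sys_cmds = ["python postproc.py"]

script_lines = []
script_lines += pre_py_cmds  # python statements are written into the script as-is
script_lines += ["pyemu.os_utils.run(r'{0}')".format(cmd) for cmd in post_sys_cmds]
print(script_lines)
```

In other words, anything you append to these lists becomes part of "the model" that PEST(++) runs.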

But what if your additional steps are actually an entire python function? Well, we've got that covered too: PstFrom.add_py_function(). This method allows you to take functions from another (pre-prepared) python source file and add them to the forward_run.py script. We will demonstrate this to post-process model observations after each run.

Now let's see this py-sauce in action: we are going to add a little post-processing function to extract the final simulated water level for all model cells for the last stress period from the MF6 binary head-save file and save them to ASCII format so that PEST(++) can read them with instruction files. And, while we are at it, let's also extract the global water budget info from the MF6 listing file and store it in dataframes - these are usually good numbers to watch! We will need the simulated water level arrays later for sequential data assimilation (wouldn't it be nice if MF6 supported the writing of ASCII format head arrays?). Anyway, this function is stored in the "helpers.py" script which you can find in the tutorial folder.

pf.add_py_function("helpers.py","extract_hds_arrays_and_list_dfs()",is_pre_cmd=False)


That last argument, is_pre_cmd, tells PstFrom whether the python function should be treated as a pre-processor or a post-processor. So we have added the post-processor, but we still need to set up PEST observations for those ASCII head arrays. Let's do that by first calling the function once within template_ws to generate the arrays; then we can add them with add_observations():

import helpers
helpers.test_extract_hds_arrays(template_ws)

files = [f for f in os.listdir(template_ws) if f.startswith("hdslay")]
files

for f in files:
    pf.add_observations(f, prefix=f.split(".")[0], obsgp=f.split(".")[0])

for f in ["inc.csv","cum.csv"]:
    df = pd.read_csv(os.path.join(template_ws, f))
    pf.add_observations(f,
                        index_cols=[df.columns[0]], # first column is the time index
                        use_cols=list(df.columns.values[1:]),
                        prefix=f.split('.')[0], obsgp=f.split(".")[0])


Crushed it!

#### Secondary Observations

Often it is useful to include “secondary model outcomes” as observations. These can be important components in a history-matching dataset to tease out specific aspects of system behaviour (e.g. head differences between aquifer layers to inform vertical permeabilities). Or they may be simple summaries of modelled outputs which are of interest for a prediction (e.g. minimum simulated head over a given period).

If you inspect the tutorial folder you will find a file named helpers.py. This is a python source file which we have prepared for you. (Open it to see how it is organized.) It contains a function named process_secondary_obs(). This function reads the model output .csv files, processes them and writes a series of new observation .csv files. These new files contain the temporal differences between head and SFR observations. The new .csv files are named heads.tdiff.csv and sfr.tdiff.csv, respectively.
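The exact arithmetic lives in helpers.py, but a temporal-difference observation is typically just each time series minus its value at the first output time. A minimal pandas sketch (the site name and values are hypothetical):

```python
import pandas as pd

# hypothetical head time series at one monitoring site
heads = pd.DataFrame({"time": [1.0, 2.0, 3.0],
                      "trgw-0-9-1": [35.0, 34.5, 34.0]}).set_index("time")
tdiff = heads - heads.iloc[0]  # difference relative to the first output time
print(tdiff["trgw-0-9-1"].tolist())
```

Differencing like this removes any constant bias shared by the simulated and observed series, which is exactly why such secondary observations are useful for history matching.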

First, let's load the function here and run it so you can see what happens. (And to make sure that the observation files are in the template folder!)

Run the next cell, then inspect the template folder. You should see several new csv files. These are the new secondary observations calculated by the post-processing function.

# run the helper function
helpers.process_secondary_obs(ws=template_ws)

[f for f in os.listdir(template_ws) if f.endswith(".csv")]


OK, so now let’s add this function to the forward_run.py script.

pf.add_py_function("helpers.py", # the file which contains the function
                   "process_secondary_obs(ws='.')", # the function, making sure to specify any arguments it may require
                   is_pre_cmd=False) # whether it runs before the model system command, or after; in this case, after


And, boom! Bob’s your uncle. As easy as that.

Now, of course we want to add these observations to PstFrom as well:


df = pd.read_csv(os.path.join(template_ws, "sfr.tdiff.csv"), index_col=0)
pf.add_observations("sfr.tdiff.csv",
                    insfile="sfr.tdiff.csv.ins", # optional, the instruction file name
                    index_cols="time", # column header to use as index; can also use the (zero-based) column number instead of the header name
                    use_cols=list(df.columns.values), # names of columns that contain observation values
                    prefix="sfrtd") # prefix for all observation names

df = pd.read_csv(os.path.join(template_ws, "heads.tdiff.csv"), index_col=0)
pf.add_observations("heads.tdiff.csv",
                    insfile="heads.tdiff.csv.ins", # optional, the instruction file name
                    index_cols="time", # column header to use as index; can also use the (zero-based) column number instead of the header name
                    use_cols=list(df.columns.values), # names of columns that contain observation values
                    prefix="hdstd") # prefix for all observation names


Remember to re-build the Pst control file:

pst = pf.build_pst()

_ = [print(line.rstrip()) for line in open(os.path.join(template_ws,"forward_run.py"))]


Now we see that extract_hds_arrays_and_list_dfs() has been added to the forward run script and is called after MF6 runs.

obs = pst.observation_data
obs


### After Building the Control File

At this point, we can make additional, problem-specific modifications that would typically be required. Here we can tweak the setup, specifying things such as observation weights, parameter bounds, transforms, control data, etc.

Note that any modifications made after calling PstFrom.build_pst() will only exist in memory - you need to call pf.pst.write() to record these changes to the control file on disk. Also note that if you call PstFrom.build_pst() after making some changes, these changes will be lost.

For the current case, the main things we haven’t addressed are the observations from custom *.ins files, observation weights, parameter group INCTYPs and forecasts.

We will do so now.

#### Add Observations from INS files

Recall that we wish to include observations of particle end time and status. As mentioned earlier, MP7 output files are not in a nicely organized tabular format - so we need to construct a custom instruction file. We will do this now:

# write a really simple instruction file to read the MODPATH end point file
out_file = "freyberg_mp.mpend"
ins_file = out_file + ".ins"
with open(os.path.join(template_ws, ins_file), 'w') as f:
    f.write("pif ~\n")
    f.write("l7 w w w w !part_status! w w !part_time!\n")
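Reading that instruction line: l7 advances seven lines into the output file, each w jumps past the next whitespace-delimited token, and !name! reads a value into the named observation. In plain Python terms it behaves roughly like the sketch below (the endpoint-file line shown is hypothetical):

```python
# hypothetical 7th line of the MPEND file; with the instruction line above,
# part_status is the 5th whitespace-delimited token and part_time the 8th
line7 = "1 417 1 2 5 0.0 0.0 18250.0"
tokens = line7.split()
part_status = float(tokens[4])  # cursor position after four 'w' markers
part_time = float(tokens[7])    # after two more 'w' markers
print(part_status, part_time)
```

This is why custom instruction files are so handy: a single line of marker syntax can pluck values out of otherwise awkward output formats.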


Now add these observations to the Pst:

pst.add_observations(ins_file=os.path.join(template_ws, ins_file),
                     out_file=os.path.join(template_ws, out_file),
                     pst_path='.')

# and then check what changed
obs = pst.observation_data
obs.loc[obs.obsnme=='part_status', 'obgnme'] = 'part'
obs.loc[obs.obsnme=='part_time', 'obgnme'] = 'part'

obs.iloc[-2:]


#### Parameters with Zero as Initial Value

Recall that we assigned additive parameters to the GHB heads. Our initial parameter values for these parameter types were set to 0 (zero). This creates a wee bit of trouble when calculating derivatives. There are a couple of ways we could get around it. One way is to add an “offset” to the parameter initial values and to the parameter bounds. Another is to use “absolute” increment types (INCTYP). See the PEST manual or PEST++ user guides for descriptions of increment types.

We will apply both here.

We will assign INCTYP as ‘absolute’. We will leave DERINC as 0.01 (the default). It is a reasonable value in this case.

head_pargps = [i for i in pst.adj_par_groups if 'head' in i]

pst.parameter_groups.loc[head_pargps, 'inctyp'] = 'absolute'


Now add the “offset” to parameter data entries:

par = pst.parameter_data
par_names = par.loc[par.parval1==0].parnme


offset = -10
par.loc[par_names, 'offset'] = offset
par.loc[par_names, ['parval1', 'parlbnd', 'parubnd']] -= offset
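A quick sanity check of that arithmetic: PEST computes the value the model sees as parval1*scale + offset (scale defaults to 1), so shifting parval1 and the bounds by -offset leaves the model value unchanged while the adjustable parval1 becomes non-zero:

```python
parval1 = 0.0    # additive head parameter, initially zero
offset = -10
parval1 -= offset               # what PEST adjusts is now 10.0
model_value = parval1 + offset  # what the model receives (scale defaults to 1)
print(parval1, model_value)
```

So the derivative machinery works with a healthy non-zero parval1, and the model still starts from the same additive value of zero.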



#### Forecasts

For most models there is a forecast/prediction that someone needs. Rather than waiting until the end of the project, the forecast should be entered into your thinking and workflow right at the beginning. Here we do this explicitly by monitoring the forecasts as “observations” in the control file. This way, for every PEST(++) analysis we do, we can watch what is happening to the forecasts - #winning

The optional PEST++ ++forecasts control variable allows us to provide the names of one or more observations featured in the “observation data” section of the PEST control file; these are treated as predictions in FOSM predictive uncertainty analysis by PESTPP-GLM. It is also a convenient way to keep track of “forecast” observations (makes post-processing a wee bit easier later on).

Recall that, for our synthetic case we are interested in forecasting:

• groundwater level in the upper layer at row 9 and column 1 (site named “trgw-0-9-1”) in stress period 22 (time=640);
• the “tailwater” surface-water/groundwater exchange during stress period 13 (time=367); and
• the “headwater” surface-water/groundwater exchange at stress period 22 (time=640).
• the particle travel time.

forecasts = [
    'oname:sfr_otype:lst_usecol:headwater_time:4383.5',
    'oname:sfr_otype:lst_usecol:tailwater_time:4383.5',
    'oname:hds_otype:lst_usecol:trgw-0-9-1_time:4383.5',
    'part_time'
]

forecasts

fobs = obs.loc[forecasts,:]
fobs


We will just set this optional pest++ argument because it will trigger certain automatic behavior later in PESTPP-GLM:

pst.pestpp_options['forecasts'] = forecasts


### Re-write the Control File!

Make sure to re-write the PEST control file. But beware, if you re-build the Pst, all these changes will be lost.

pst.write(os.path.join(template_ws, 'freyberg_mf6.pst'))


So that was pretty epic. We now have a (very) high-dimensional PEST interface that includes secondary observations, as well as forecasts, ready to roll.

If you inspect the folder, you will see the PEST control file and all the necessary instruction and template files. Because we have >10k parameters, version 2 of the PEST control file was written by default.

Shall we check that it works? Let’s run PEST once (i.e. with NOPTMAX=0). Now, by default, noptmax is set to zero. But just to check:

pst.control_data.noptmax


OK, so when we run PEST it will call the model once and then stop. If the next cell is successful, then everything is working. Check the folder; you should see PEST output files. (We will go into these, and how to process PEST outcomes, in subsequent tutorials.)

pyemu.os_utils.run('pestpp-glm freyberg_mf6.pst', cwd=template_ws)


Recall that we assigned observation values generated from the “base model run”. If we set up everything correctly, PEST should have obtained residuals very close to zero. As mentioned, this is a good way to check for problems early on.
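Phi here is the usual weighted sum of squared residuals. Since the "observed" values were taken from the base run's own outputs, every residual (and hence Phi) should be essentially zero. A tiny illustration with hypothetical values:

```python
# weighted sum of squared residuals (Phi)
simulated = [35.0, 34.5, 12.0]
observed  = [35.0, 34.5, 12.0]  # taken from the base model run outputs
weights   = [1.0, 1.0, 0.5]
phi = sum((w * (o - s)) ** 2 for s, o, w in zip(simulated, observed, weights))
print(phi)
```

Any non-trivial Phi at this stage would point to a template, instruction, or processing bug rather than a genuine model-to-measurement misfit.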

Let’s check the Phi recorded in the *.iobj file (could also check the *.rec or *.rei files).

# read the file
iobj = pd.read_csv(os.path.join(template_ws, "freyberg_mf6.iobj"))

# check value in phi column
iobj.total_phi


Sweet! Zero. All is well.

### Prior Parameter Covariance Matrix

One of the major reasons PstFrom was built is to help with building the Prior - both the covariance matrix and the ensemble - with geostatistical correlation. Remember all that business above related to geostatistical structures and correlations? This is where it pays off.

Let’s see how this works. For cases with less than about 30,000 parameters, we can actually generate and visualize the prior parameter covariance matrix. If you have more parameters, this matrix may not fit in memory. But, not to worry, PstFrom has some trickery to help generate the geostatistical prior ensemble.
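Why ~30,000 parameters is about the practical limit: a dense float64 covariance matrix needs 8*n² bytes, so memory grows quadratically with parameter count:

```python
# memory footprint of a dense float64 n-by-n covariance matrix
for n in (10_000, 30_000, 100_000):
    gb = 8 * n * n / 1e9
    print(n, "pars ->", round(gb, 1), "GB")
```

At 30,000 parameters the matrix already needs roughly 7 GB; at 100,000 it would need around 80 GB, which is why PstFrom draws the ensemble without ever forming the full matrix in that regime.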

# build the prior covariance matrix and store it as a compressed binary file (otherwise it can get huge!)
# depending on your machine, this may take a while...
if pf.pst.npar < 35000:  # if you have more than about 35K pars, the cov matrix becomes hard to handle
    cov = pf.build_prior(fmt='coo', filename=os.path.join(template_ws,"prior_cov.jcb"))
    # and take a peek at a slice of the matrix
    try:
        x = cov.x.copy()
        x[x==0] = np.NaN
        plt.imshow(x[:1000,:1000])
    except:
        pass


Snap! That big block must be a grid-scale parameter group…

cov.row_names[:10]


And now generate a prior parameter ensemble. This step is relevant for using pestpp-ies in subsequent tutorials. Note: you do not have to call build_prior() before calling draw()!

pe = pf.draw(num_reals=500, use_specsim=True) # draw parameters from the prior distribution
pe.enforce() # enforce parameter bounds
pe.to_binary(os.path.join(template_ws,"prior_pe.jcb")) # write the parameter ensemble to a binary file
assert pe.shape[1] == pst.npar


Let’s now test-run one of these geostatistical realizations (always a good idea!). We do this by replacing the parval1 values in the control file with a row from pe:

pst.parameter_data.loc[:,"parval1"] = pe.loc[pe.index[0],pst.par_names].values
pst.parameter_data.parval1.values

pst.control_data.noptmax = 0
pst.write(os.path.join(template_ws,"test.pst"))
pyemu.os_utils.run("pestpp-glm test.pst",cwd=template_ws)


If all went well, that’s it! The PEST interface is set up and tested, and we have our prior prepared. We should be good to go!

### Bonus: Understanding Multiplier-Parameters

Now the multiplier files in the “template_ws/mult” folder and the MF6 input files in the template_ws folder contain the values corresponding to this realization, so we can visualize the multiplier parameter process:
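Conceptually, each model input value is the original ("org") value times every multiplier that touches it (plus any additive parameters), clipped to the ult bounds as the final step. A tiny numeric sketch with made-up values and hypothetical bounds:

```python
import numpy as np

org = np.array([2.0, 40.0])     # original model input values
cn_mult = 1.5                   # constant-scale multiplier
gr_mult = np.array([0.5, 2.0])  # grid-scale multipliers
new = org * cn_mult * gr_mult   # apply every multiplier that touches each value
new = np.clip(new, 0.01, 100.0) # ult_lbound / ult_ubound enforced last
print(new.tolist())
```

Note the second value: the stacked multipliers push it to 120, but the ult_ubound caps it at 100 before it ever reaches the model input file.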

df = pd.read_csv(os.path.join(template_ws,"mult2model_info.csv"))
kh1_df = df.loc[df.model_file.str.contains("npf_k_layer1"),:]
kh1_df

org_arr = np.loadtxt(os.path.join(template_ws,kh1_df.org_file.iloc[0]))
mlt_arrs = [np.loadtxt(os.path.join(template_ws,afile)) for afile in kh1_df.mlt_file]
inp_arr = np.loadtxt(os.path.join(template_ws,kh1_df.model_file.iloc[0])) # the final MF6 input array
arrs = [org_arr]
arrs.extend(mlt_arrs)
arrs.append(inp_arr)
names = ["org"]
names.extend([mf.split('.')[0].split('_')[-1] for mf in kh1_df.mlt_file])
names.append("MF6 input")
fig,axes = plt.subplots(1,kh1_df.shape[0]+2,figsize=(5*kh1_df.shape[0]+2,5))
for i,ax in enumerate(axes.flatten()):
    arr = np.log10(arrs[i])
    arr[ib==0] = np.NaN
    cb = ax.imshow(arr)
    plt.colorbar(cb,ax=ax)
    ax.set_title(names[i],loc="left")
plt.tight_layout()