# Distributed Parallel Computing for Pressurized-Flow Dynamic Simulations

Gerardo Riaño-Briceño, Lina Sela, Ben Hodges National Center for Infrastructure Modeling The University of Texas at Austin

## Abstract

Modeling the dynamics of fluids flowing in pressurized pipes is crucial to design and operate networked pipelines. For example, in water distribution systems, dynamic models can be used to determine how actuators such as valves and pumps have to be operated to optimally regulate pressure levels. While excessive pressure causes pipe ruptures and water losses, low pressure prevents the system from delivering water to consumers. Computing the solution of hyperbolic partial differential equations (PDEs) is necessary to describe this phenomenon. Although commercial software packages with dynamic models are available, current tools do not scale up to large realistic networks unless decreasing simulation accuracy. In this work, a parallel algorithm to compute the dynamics of pressurized flow in pipes is developed. The proposed algorithm can run on supercomputers with multiple computing nodes, providing highly-accurate results by relying on python libraries such as NumPy, Numba, and mpi4py.

## Key Findings

• Models of networked systems whose dynamics are described by hyperbolic partial differential equations (PDEs) can be solved using a vectorized implementation of the method of characteristics, which can be further optimized using distributed parallel computing.
• The vectorized implementation of the method of characteristics achieves maximum performance when optimized using just-in-time compilation with Numba and aligned memory access. The optimized code runs almost twice as fast as a vectorized code with Numpy for which memory access is not aligned.
• Simulation results for a test case with 3.2 M solution points and 138,000 time steps, running with up to 100 processors, demonstrate that python tools are scalable and fast, given that the algorithm runs thousands of times faster than the traditional sequential implementation when computing large sets of coupled hyperbolic PDEs.

## Modeling Water Distribution Systems

The equations that describe the motion of pressurized water in a single pipe are given by a set of partial differential equations (PDEs) corresponding to laws of conservation of mass and momentum [1] as follows,

where $H(x,t)$﻿ denotes the head, $\omega$﻿ is the speed at which pressure waves propagate in a pipe, $Q(x,t)$﻿ is the flow rate, $g$﻿ denotes the acceleration of gravity, and $A$﻿ corresponds to the cross-section area of the pipe. The term $F(x,t)$﻿ in equation $(2)$﻿, corresponds to the friction head loss per unit length, which is estimated based on a friction model. Friction models for WDSs can be categorized as steady-state, quasi-steady, and unsteady friction models [1]. In this work, we solve equation $(2)$﻿ using a steady-state friction model for which

$f$﻿ is the steady-state friction factor coefficient, which is found using the Darcy-Weisbach equation [1], and $d$﻿ is the diameter of the pipe. The resulting system of PDEs is hyperbolic.

The system of hyperbolic PDEs is solved using the method of characteristics (MOC) so that an equivalent system of ordinary differential equations (ODEs) is found and then numerically solved. The resulting ODEs are composed by characteristic curves whose solution coincide with the solution of equations $(1)$﻿ and $(2)$﻿ as shown in Figure 1. Characteristic curves are categorized as $\mathcal{C}^+$﻿ and $\mathcal{C}^-$﻿ curves and described by the following equations:

Characteristic equations are numerically solved using Euler's explicit method. For this, we discretize the spatial and time domains such that the Courant–Friedrichs–Lewy (CFL) condition is saisfied for every pipe in the system, thus avoiding instabilities in the numerical scheme. First a time step $\tau$﻿ is defined and then, the pipe is divided in $n$﻿ segments of length $\delta$﻿ such that $c = \omega\frac{\tau}{\delta}\leq 1$﻿, where $c$﻿ is the Courant number.

The resulting system of discretized equations is given by:

where $D(x,t)$﻿, $U(x,t)$﻿, $B(x,t)$﻿, and $W(x,t)$﻿ correspond to auxiliary variables which are defined to provide a more compact representation of pressure and flow equations. Auxiliary variables are given by the following expressions:

where $b = \frac{\omega}{gA}$﻿ and $r =\frac{f\delta}{2gdA^2}$﻿.

The system of discrete equations is solved using an explicit time-marching scheme, such that the values of head and flow for time $t+\tau$﻿ depend on values of $Q$﻿ and $H$﻿ at time $t$﻿ only. Head and flow values are computed at the extremes of pipe segments, hereafter denominated as solution points.

The dynamics of interconnected pipes emerge with the definition of boundary conditions. Particularly, boundary conditions can be used to model reservoirs, junctions, and control elements such as valves. While reservoirs correspond to points with constant head, junctions refer to the interconnection of two or more pipes, and valves correspond to points with fixed flow. Regardless of the type of element, boundary conditions are coupled with the MOC equations as done in [1], which results in equations (3), (4), and (5) for reservoirs, junctions, and valves respectively.

Reservoirs

Reservoirs refer to sources of water whose head remains constant at all times. Therefore,

Junctions

Junctions refer to connections of two or more pipes at their extremes. The head at a junction $j$﻿ is given by

where $\mathcal{J}_j$﻿ is the set of extreme points connected to junction $j$﻿.

Valves

The flowrate at a valve $v$﻿ is given by

where $C_d$﻿ is the valve loss coefficient, and $A$﻿ corresponds to the cross-section area of the valve.

## Vectorization

Solving the MOC equations can be done using vector operations to take advantage of SIMD instructions running in parallel at the core level. For this, it is necessary to properly arrange the values of $Q$﻿, $H$﻿, $U$﻿, $B$﻿, $D$﻿, and $W$﻿ such that solution points remain allocated next to their neighbors. To illustrate, the water distribution network shown in Figure 2 is discretized as shown in Figure 3. Solution points for each of the pipes are defined based on the values of $\delta$﻿ and $\tau$﻿ as described in [1]. Then, memory is allocated for each solution point using a partial order, such that the $i$﻿-th entry corresponds to a solution point that is downstream the point associated to the $(i-1)$﻿-th entry and upstream the point associated to the $(i+1)$﻿-th entry.

Based on the partial order defined in Figure 3 and the previously defined MOC and boundary equations, we define two functions to compute an iteration of the vectorized MOC method, namely moc_step_vector, junction_step_vector, and valve_step. Each function updates the values of $Q$﻿ and $H$﻿ for each solution point in the network based on initial conditions.

In order to run a simulation step, the model presented in Figure 2 has to be loaded first. The model is provided as a pickled object containing all the necessary initial conditions to run the simulation as well as the model properties, which can be downloaded here

Once downloaded, the following script can be used to load the model and to define vector functionalities that solve both MOC and boundary equations.

import pickle
import numpy as np

with open('model.pkl', 'rb') as f:

# Vectorized MOC Equations
def moc_step_vector(H, Q, U, B, D, W, r, b, not_end, not_start):
# Auxiliary variables for array extremes will always be associated with boundaries
U[0] = H[0,1] - b[0]*Q[0,1]; D[-1] = \
H[0,-2] + b[-2]*Q[0,-2]
B[0] = b[0] + r[0]*abs(Q[0,1]); W[-1] = \
b[-2] + r[-2]*abs(Q[0,-2])

# --- Auxiliary variables
# Recall that U and B are not defined for end boundary points, since x <= pipe length
# Similarly, D and W are not defined for start boundary points, since x >= 0
# Because of that, not_end and not_start are used such that undefined auxiliary
# variables take a value of zero
# The entries of not_end are equal to 1 if the i-th entry is not associated to an end
# boundary point and zero otherwise
# The entries of not_start are equal to 1 if the i-th entry is not associated to a
# start boundary point and zero otherwise
U[1:-1] = (H[0,2:] - b[1:-1]*Q[0,2:]) * not_end[1:-1]
B[1:-1] = (b[1:-1] + r[1:-1]*abs(Q[0,2:])) * not_end[1:-1]
D[1:-1] = (H[0,0:-2] + b[1:-1]*Q[0,0:-2]) * not_start[1:-1]
W[1:-1] = (b[1:-1] + r[1:-1]*abs(Q[0,0:-2])) * not_start[1:-1]

# --- Update Q and H
H[1,1:-1] = (D[1:-1]*B[1:-1] + U[1:-1]*W[1:-1]) \
/ (W[1:-1] + B[1:-1])
Q[1,1:-1] = (D[1:-1] - U[1:-1]) / (W[1:-1] + B[1:-1])

# Junction Boundary Conditions
def junction_step_vector(\
H, Q, U, B, D, W, r, b, \
not_end, not_start, junction_points, \
junction_nodes, pipe_junctions, \
start_points, end_points, reservoirs):

U[start_points] /= B[start_points]
D[end_points] /= W[end_points]
B[start_points] = 1 / B[start_points]
W[end_points] = 1 / W[end_points]
U[junction_points] + D[junction_points], pipe_junctions)
B[junction_points] + W[junction_points], pipe_junctions)
HH =  sc / sb
H[1,junction_points] = HH[junction_nodes]
H[1,reservoirs] = H[0,reservoirs] # Reservoir
Q[1,start_points] = \
H[1,start_points] * B[start_points] - U[start_points]
Q[1,end_points] = \
D[end_points] - H[1,end_points] * W[end_points]

def valve_step(model, valve_flow):
model.Q[1,model.valve] = valve_flow
model.H[1,model.valve] = \
(model.Q[1,model.valve] / (model.valve_coefficient*model.valve_max_area))**2 / \
(2*model.g)

# --- Runing a Simulation step
moc_step_vector(
model.H,
model.Q,
model.U,
model.B,
model.D,
model.W,
model.r,
model.b,
model.not_end,
model.not_start)

junction_step_vector(\
model.H, model.Q, model.U, model.B, model.D, model.W, model.r, model.b, \
model.not_end, model.not_start, model.junction_points, \
model.junction_nodes, model.pipe_junctions, \
model.start_points, model.end_points, model.reservoirs)

valve_step(model, model.Q[0,model.valve])

Notice that the moc_step_vector function computes the MOC equations for all the solution points in the system, producing erroneous updates of $H$﻿ and $Q$﻿ for boundary points. Nonetheless, erroneous values are then corrected using the junction_step_vector function. Although some may argue that this approach is inneficient due to the unnecessary computations of solutions for boundary points, it actually offers a significant improvement in performance.

Even though there is an overhead due to unnecessary computations, the inefficiencies are compensated by the fact that operations based on SIMD instructions achieve maximum performance when memory is coalesced. To illustrate, we developed three additional functions that compute the MOC equations using three different approaches. Namely, a sequential implementation, a vectorized implementation with non-sequential memory access, and an optimized version of the vector implementation using Numba's just-in-time compiler. Among all the test cases the optimized vector implementation with Numba is the fastest.

from numba import njit

# Sequential
def moc_step_sequential(H, Q, U, B, D, W, r, b, not_end, not_start):
U[0] = H[0,1] - b[0]*Q[0,1]; D[-1] = H[0,-2] + b[-2]*Q[0,-2]
B[0] = b[0] + r[0]*abs(Q[0,1]); W[-1] = b[-2] + r[-2]*abs(Q[0,-2])
for i in range(1,len(U)-1):
U[i] = (H[0,i+1] - b[i]*Q[0,i+1]) * not_end[i]
B[i] = (b[i] + r[i]*abs(Q[0,i+1])) * not_end[i]
D[i] = (H[0,i-1] + b[i]*Q[0,i-1]) * not_start[i]
W[i] = (b[i] + r[i]*abs(Q[0,i-1])) * not_start[i]
H[1,i] = (D[i]*B[i] + U[i]*W[i]) / (W[i] + B[i])
Q[1,i] = (D[i] - U[i]) / (W[i] + B[i])

# Non-aligned memory access
def moc_step_nonaligned(H, Q, U, B, D, W, r, b, not_end, not_start, inner_points):
U[0] = H[0,1] - b[0]*Q[0,1]; D[-1] = H[0,-2] + b[-2]*Q[0,-2]
B[0] = b[0] + r[0]*abs(Q[0,1]); W[-1] = b[-2] + r[-2]*abs(Q[0,-2])
U[1:-1] = (H[0,2:] - b[1:-1]*Q[0,2:]) * not_end[1:-1]
B[1:-1] = (b[1:-1] + r[1:-1]*abs(Q[0,2:])) * not_end[1:-1]
D[1:-1] = (H[0,0:-2] + b[1:-1]*Q[0,0:-2]) * not_start[1:-1]
W[1:-1] = (b[1:-1] + r[1:-1]*abs(Q[0,0:-2])) * not_start[1:-1]
H[1,inner_points] = \
(D[inner_points]*B[inner_points] + U[inner_points]*W[inner_points]) \
/ (W[inner_points] + B[inner_points])
Q[1,inner_points] = \
(D[inner_points] - U[inner_points]) / (W[inner_points] + B[inner_points])

# Optimized implementation with Numba
@njit
def run_step_numba(H, Q, U, B, D, W, r, b, not_end, not_start):
U[0] = H[0,1] - b[0]*Q[0,1]; D[-1] = H[0,-2] + b[-2]*Q[0,-2]
B[0] = b[0] + r[0]*abs(Q[0,1]); W[-1] = b[-2] + r[-2]*abs(Q[0,-2])
for i in range(1,len(U)-1):
U[i] = (H[0,i+1] - b[i]*Q[0,i+1]) * not_end[i]
B[i] = (b[i] + r[i]*abs(Q[0,i+1])) * not_end[i]
D[i] = (H[0,i-1] + b[i]*Q[0,i-1]) * not_start[i]
W[i] = (b[i] + r[i]*abs(Q[0,i-1])) * not_start[i]
H[1,i] = (D[i]*B[i] + U[i]*W[i]) / (W[i] + B[i])
Q[1,i] = (D[i] - U[i]) / (W[i] + B[i])

The four implementations of the MOC step functionality are tested for several test cases, varying the number of solution points from 30 to 1,000,000. Wall-clock times are measured to quantify the time spent when running a single MOC step. The results presented in Figures 4 and 5 show that the fastest function is the optimized version of the vectorized MOC implementation using Numba. Moreover, it is evidenced that even though the implementation with non-aligned memory access is significantly more efficient than the sequential implementation, it is not as efficient as the vector implementation. Overall, vectorized implementations run several hundreds of times faster than the sequential implementation.

All the tests with the vectorized MOC implementation were computed using an Intel Core i7-6600U processor.

## Distributed Method of Characteristics

The MOC algorithm is now implemented using distributed parallel computing. The vectorized implementation of MOC provided significant improvements in performance, though it ran using a single processor. Therefore, it is possible to leverage on the usage of multiple processors to achieve higher speedups.

The distributed MOC implementation consists of subdividing the discretized WDS in almost even parts so that multiple processors solve the flow and head equations for each subset of solution points. For example, for the system presented in Figure 6, the network is divided in three parts so that three processors concurrently solve the MOC problem. Notice that some solution points of processor 1 depend on information of processor 2. Because of this, it is necessary that both processor 1 and 2 exchange information at every iteration. For the points that do not depend on external data, the vectorized MOC implementation can be used as previously described.

In order to implement the distributed MOC algorithm, we used the mpi4py library which includes functionalities to exchange information among processors, allowing us to define our own partitions. For simplicity and considering that partitioning algorithms are out of the scope of this work, we performed a simple partitioning algorithm evenly dividing the arrays that contain the the simulation variables.

## Large-scale Test Case

The distributed algorithm is tested for a large-scale network retrieved from [2], with 3.2 M solution points, running 138,000 time steps. The test case consists of rapidly closing a valve in the main pipeline of a system that initially remains in steady-state. As a result, the flowrate is no longer steady and a pressure wave propagates throughout the system as shown in Figure 7. The location of the valve is marked by a red cross in Figure 7, and the change in flowrate is plotted for each of the pipes. The large-scale test case is computed using the Stampede 2 supercomputer [3].

Using multiple processors improves the performance of the algorithm eight-fold. While the speedup with respect to the single-core MOC implementation with Numba increases as the number of processors increases, the efficiency [4] of the distributed algorithm drops i.e., processors are not optimally utilized. This has to do with the fact that the overhead of communication increases as more partitions are generated.

Even though communcation becomes the bottleneck of the algorithm, the distributed MOC algorithm achieves outstanding speedups with respect to the traditional sequential implementation. Moreover, the computation times are close to real-time, modeling the dynamics of the large-scale test case at 5 frames per second.

## References

[1] Wylie, E. B., Streeter, V. L., & Suo, L. (1993). Fluid transients in systems (Vol. 1, p. 464). Englewood Cliffs, NJ: Prentice Hall.

[2] Ostfeld, A., Uber, J. G., Salomons, E. (2008). The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. Journal of Water Resources Planning and Management, 134(6), 556-568.

[3] Stanzione, D., Barth, B., Gaffney, N. (2017). Stampede 2: The evolution of an xsede supercomputer. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact (pp. 1-8).

[4] Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction to parallel computing (Vol. 110). Redwood City, CA: Benjamin/Cummings.

## Questions?

If you have questions or would like to get early access to the open source library that runs the distributed MOC algorithm send us an email (griano@utexas.edu)

## Acknowledgements

This work was developed under Cooperative Agreement No. 83595001 awarded by the U.S. Environmental Protection Agency to The University of Texas at Austin. It has not been formally reviewed by EPA. The views expressed in this website are solely those of the authors, and do not necessarily reflect those of the Agency. EPA does not endorse any products or commercial services mentioned in this publication. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu