TNO Intern

Commit d4b17916 authored by Hen Brett

adding documentation on parallelization and on assessment

parent a1640676
+31.7 KiB
+1 −0
@@ -11,6 +11,7 @@ The underlying theory of the geothermal doublet simulation is explained in the [
- [map run and analysis](maprun_analysis.md) page demonstrating running on maps, including some plotting examples to illustrate the outputs.
- [portfolio run and analysis](portfoliorun_analysis.md) page demonstrating running on a portfolio of prospective locations, including some plotting examples to illustrate the outputs.
- [customised stochastic simulation](customised_stochastic_simulations.md) page demonstrates how to develop your own stochastic frameworks around the core pythermogis doublet simulation functionality.
- [parallelization](parallelization.md) page describes how to parallelize simulations and how to determine the optimal chunk size for your hardware.

!!! info "Plotting, calculations and result analysis"
    pyThermoGIS is designed to enable users to run geothermal doublet simulations. 
+24 −0
# Parallelization

When running `calculate_doublet_performance` or `calculate_doublet_performance_stochastic` (described in more detail in [deterministic doublet](deterministic_doublet.md) and [stochastic doublet](stochastic_doublet.md)), each combination of input reservoir properties is an independent simulation.

This makes it a good target for parallelization: using more hardware resources to run simulations simultaneously decreases the execution time.

Traditionally, parallelizing code in Python has been tricky. Built-in modules such as [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) exist for this task, but a lot of custom code usually still has to be written to get the right setup for your specific problem.
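For comparison, here is a minimal sketch of that manual, standard-library approach. The `simulate` function and the chunking helpers are illustrative stand-ins, not part of pythermogis; `ThreadPoolExecutor` is used to keep the sketch self-contained, whereas CPU-bound simulation work would typically use `ProcessPoolExecutor` instead:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(x: float) -> float:
    # stand-in for one expensive doublet simulation
    return x * x

def run_chunk(chunk: list[float]) -> list[float]:
    # each worker runs the simulations within its chunk in series
    return [simulate(x) for x in chunk]

def run_chunked(inputs: list[float], chunk_size: int) -> list[float]:
    # split the inputs into chunks and hand each chunk to a worker
    chunks = [inputs[i:i + chunk_size] for i in range(0, len(inputs), chunk_size)]
    with ThreadPoolExecutor() as pool:
        chunk_results = pool.map(run_chunk, chunks)
    # flatten the per-chunk results back into one list
    return [result for chunk in chunk_results for result in chunk]
```

Even this toy version shows how much boilerplate (chunking, dispatching, flattening) the user has to maintain by hand.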

pythermogis, however, uses [xarray](https://docs.xarray.dev/en/latest/index.html), which under the hood uses [dask](https://www.dask.org/) to run operations in parallel. For more details on how xarray utilizes dask for easy parallelization, we direct the reader to [Parallel Computing with Dask](https://docs.xarray.dev/en/latest/user-guide/dask.html).

This framework is already wired into pythermogis: the user simply sets the `chunk_size` parameter when calling either `calculate_doublet_performance` or `calculate_doublet_performance_stochastic`, and the doublet simulations will run in parallel.

See below for an explanation of what `chunk_size` is and how to determine its optimal value.

## What is chunk size and how to determine it?
dask parallelization works by applying an operation (in this case `simulate_doublet`) across 'chunks' of the input data: groups of inputs where the chunks are processed in parallel with each other, while the inputs within a chunk are processed in series.

Let's say we wish to compute 1000 doublet simulations. The smallest possible `chunk_size` is 1, meaning every simulation is sent to a processor as an independent job, while the largest is 1000, meaning a single job is sent to one processor and all simulations run in series.

The first extreme is inefficient because chunking carries a computational cost for organising the inputs and outputs, and that cost is paid 1000 times. The second extreme is inefficient because every simulation runs in series. The optimal `chunk_size` therefore lies somewhere between these two values.
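This trade-off can be made concrete with a toy cost model (a pure-Python sketch; the per-simulation time, per-chunk overhead, and worker count below are invented for illustration and say nothing about real pythermogis timings):

```python
import math

def chunk_count(n_simulations: int, chunk_size: int) -> int:
    # number of independent jobs the scheduler has to dispatch
    return math.ceil(n_simulations / chunk_size)

def estimated_time(n_simulations: int, chunk_size: int, n_workers: int = 8,
                   per_sim: float = 0.01, per_chunk_overhead: float = 0.05) -> float:
    # toy cost model: chunks are spread over the workers in rounds, and every
    # chunk pays a fixed scheduling/IO overhead on top of its simulations
    n_chunks = chunk_count(n_simulations, chunk_size)
    chunk_time = per_chunk_overhead + chunk_size * per_sim
    rounds = math.ceil(n_chunks / n_workers)
    return rounds * chunk_time
```

With these made-up constants, a chunk size of 1 pays the overhead 1000 times, a chunk size of 1000 leaves all but one worker idle, and an intermediate chunk size beats both, which is the qualitative shape seen in the figure below.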

The following figure shows how different chunk sizes affect the overall compute time. The most efficient chunk size (for the hardware this example was run on) is 100 simulations per chunk.

![parallelization](../images/parallelization.png)
+2 −1
@@ -7,4 +7,5 @@ from pythermogis.postprocessing.pos import *
from pythermogis.doublet_simulation.deterministic_doublet import *
from pythermogis.doublet_simulation.stochastic_doublet import *
from pythermogis.plotting.plot_exceedance import plot_exceedance
from pythermogis.dask_utils.chunk_utils  import auto_chunk_dataset
 No newline at end of file
from pythermogis.dask_utils.auto_chunk import auto_chunk_dataset
from pythermogis.dask_utils.assess_optimal_chunk_size import assess_optimal_chunk_size
 No newline at end of file
+73 −0
import timeit
from pathlib import Path

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

from pythermogis.doublet_simulation.deterministic_doublet import calculate_doublet_performance

def assess_optimal_chunk_size(n_simulations: int = 1000, plot_outfile: str | Path | None = None):
    """Run the same simulations with different chunk sizes to find the optimal chunk size for this hardware."""

    # generate simulation samples across desired reservoir properties
    thickness_samples = np.random.uniform(low=150, high=300, size=n_simulations)
    porosity_samples = np.random.uniform(low=0.5, high=0.8, size=n_simulations)
    ntg_samples = np.random.uniform(low=0.25, high=0.5, size=n_simulations)
    depth_samples = np.random.uniform(low=1800, high=2200, size=n_simulations)
    permeability_samples = np.random.uniform(low=200, high=800, size=n_simulations)
    reservoir_properties = xr.Dataset(
        {
            "thickness": (["sample"], thickness_samples),
            "porosity": (["sample"], porosity_samples),
            "ntg": (["sample"], ntg_samples),
            "depth": (["sample"], depth_samples),
            "permeability": (["sample"], permeability_samples),
        },
        coords={"sample": np.arange(n_simulations)}
    )
    n_attempts = 3  # run the same operation n_attempts times and average their durations

    # non-parallel benchmark
    time_attempt = []
    for attempt in range(n_attempts):
        start = timeit.default_timer()
        simulation_benchmark = calculate_doublet_performance(reservoir_properties)
        time_attempt.append(timeit.default_timer() - start)
    normal_time = np.mean(time_attempt)
    print(f"non-parallel simulation took {normal_time:.1f} seconds, {n_simulations / normal_time:.1f} samples per second")

    sample_chunks = np.arange(1, n_simulations, 100)  # chunk sizes to test: 1, 101, 201, ...

    # run in parallel:
    mean_time = []
    std_time = []
    for sample_chunk in sample_chunks:
        time_attempt = []
        for attempt in range(n_attempts):
            start = timeit.default_timer()
            simulations_parallel = calculate_doublet_performance(reservoir_properties, chunk_size=sample_chunk, print_execution_duration=False)
            time_attempt.append(timeit.default_timer() - start)

            # parallel results should match the benchmark to within floating-point tolerance
            xr.testing.assert_allclose(simulation_benchmark, simulations_parallel)
        mean_time.append(np.mean(time_attempt))
        std_time.append(np.std(time_attempt))
        print(f"parallel simulation, chunk size: {sample_chunk}, took {mean_time[-1]:.1f} seconds to run {n_simulations} simulations, {n_simulations / mean_time[-1]:.1f} samples per second")

    if plot_outfile is None:
        return

    fig, ax = plt.subplots(1, 1, figsize=(8, 5))

    # mean timings with standard deviation as error bars, connected by a line
    ax.errorbar(sample_chunks, mean_time, yerr=std_time, fmt="o-", capsize=5, label="parallel simulation")
    ax.axhline(normal_time, label="non-parallel simulation", color="tab:orange", linestyle="--")
    ax.set_xlabel("chunk size")
    ax.set_ylabel("time (s)")
    ax.set_title("Chunk size assessment")
    ax.legend()
    fig.savefig(plot_outfile)
 No newline at end of file