1. ExaNBody: Framework for N-Body Simulations on HPC Platforms

This framework, developed at the French Alternative Energies and Atomic Energy Commission (CEA), is tailored for N-Body simulations on High-Performance Computing (HPC) platforms. Originally designed for the ExaSTAMP Molecular Dynamics code, it has since been extended to cover various N-Body problems.

Key Characteristics:

  • Language: Implemented in C++17, leveraging modern language features for efficiency and versatility.

  • Parallelization: Hybrid approach integrating:
    • Vectorization for CPU optimization.

    • Thread-parallelization using OpenMP for multi-core architectures.

    • GPU-parallelization via CUDA to harness GPU computational power.

    • MPI-parallelization for distributed memory systems.

  • Spatial Domain Decomposition: Utilizes spatial domain decomposition techniques for efficient workload distribution among processors.

  • Load Balancing (RCB): Implements Load Balancing using Recursive Coordinate Bisection (RCB) for optimal task distribution among processing units.

  • Parallel IO: Enables efficient handling of parallel Input/Output operations. Supports checkpoint files, parallel Paraview files, and diagnostics in a parallelized manner.

  • In-situ Analysis: Provides real-time data analysis capabilities during simulation execution, minimizing data movement and storage overhead.

1.1. ExaDEM Variant

ExaDEM is a Discrete Element Method (DEM) code developed within the exaNBody framework, which provides the basis for its DEM functionalities and performance optimizations. A notable aspect of ExaDEM is its hybrid parallelization approach, combining MPI (Message Passing Interface) and OpenMP threads to reduce computation times and make large simulations more manageable.

Additionally, ExaDEM supports MPI+GPU execution, using the CUDA programming model (through the Onika layer). This provides the option to leverage GPU processing power for potential performance gains in simulations. Written in C++17, ExaDEM is built on a contemporary codebase and aims to provide researchers and engineers with a tool for addressing DEM simulations.

1.2. ExaSPH Variant

1.3. ExaSTAMP Variant

2. Installing ExaNB

To proceed with the installation, your system must meet the minimum prerequisites. The first step involves the installation of exaNBody:

git clone https://github.com/Collab4exaNBody/exaNBody.git
mkdir build-exaNBody/ && cd build-exaNBody/
cmake ../exaNBody/ -DCMAKE_INSTALL_PREFIX=path_to_install
make install
export exaNBody_DIR=path_to_install

3. N-Body Background

3.1. N-body methods

N-body methods encompass a variety of techniques used to model the behavior and interactions of a set of particles over time. These methods consist of solving Newton's equation of motion f = ma for each particle at each time step, where f is the sum of the forces applied to the particle, a its acceleration, and m its mass. The forces are deduced from the interactions between particles according to their types, i.e. contact, short-range, or long-range interactions, and from external forces applied to the sample (e.g. gravity). Velocities are then deduced from the accelerations and subsequently used to update the particle positions at the next time step. This process is repeated, typically with a fixed time step ∆t, according to an integration scheme, until the desired duration is reached. The collection of particle configurations over time allows the study of a wide range of phenomena, from granular media movements with the Discrete Element Method (DEM), to material crystal plasticity at the atomic scale using Molecular Dynamics (MD), up to galaxy formation with Smoothed-Particle Hydrodynamics (SPH).
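
A commonly used explicit scheme is velocity Verlet, given here for reference only (a standard textbook form; exaNBody does not mandate a particular scheme):

\begin{aligned}
v\left(t+\tfrac{\Delta t}{2}\right) &= v(t) + \tfrac{\Delta t}{2}\, a(t) \\
x(t+\Delta t) &= x(t) + \Delta t\, v\left(t+\tfrac{\Delta t}{2}\right) \\
v(t+\Delta t) &= v\left(t+\tfrac{\Delta t}{2}\right) + \tfrac{\Delta t}{2}\, a(t+\Delta t)
\end{aligned}

where a(t+∆t) = f(x(t+∆t))/m is recomputed from the updated positions.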

3.2. N-body simulation codes

The development of an N-body code is driven by the need to determine the neighborhood of each particle at every timestep in order to process interactions of different kinds.

Particle interactions can be categorized as short-range and long-range. Short-range interactions are considered negligible beyond a specified cut-off radius. To optimize calculations, neighboring particle detection algorithms are employed to eliminate unnecessary computations with distant particles. Each N-Body method employs a wide variety of short-range interactions that capture different particle physics. For example, visco-elastic contacts in the Discrete Element Method follow Hooke's law or Hertz's law to model contact elasticity between rigid particles, while pair potentials like Lennard-Jones or Morse are used for gas or liquid atom interactions in classical MD. Long-range interactions, on the other hand, can sometimes not be neglected and lead to an algorithmic complexity of O(N²). Such interactions, like gravitation in astrophysics or electrostatic forces in Molecular Dynamics, are typically modeled using the Ewald summation method. Fortunately, calculation approaches such as the fast multipole method can achieve a complexity of O(N), thanks to an octree structure, and can be efficiently parallelized. Although this paper primarily focuses on short-range interactions, both types of interaction can be dealt with in exaNBody.
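
As an illustration of such a pair potential, the standard Lennard-Jones form (with well depth ε and length parameter σ, truncated at the cut-off radius) reads:

V_{LJ}(r) = 4\varepsilon\left[\left(\tfrac{\sigma}{r}\right)^{12} - \left(\tfrac{\sigma}{r}\right)^{6}\right], \qquad r < r_{cut}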

Neighbor lists are built, using different strategies, to shorten the process of finding the neighbors of a particle within the simulation domain. They replace the default algorithm, which tests every pair of particles and therefore has a complexity of O(N²) (where N is the number of particles). The most common strategy, suitable for any kind of simulation (static or dynamic, homogeneous or heterogeneous density), is a combination of the linked-cell method (see figure 2) and the Verlet list method (see figure 1). This combination has a complexity of O(N) and a refresh rate that depends on the displacement of the fastest particle (see the sketch after the figures below). This algorithm is easily thread-parallelized. Other, less-used neighbor search strategies have been developed to address specific simulations.

Figure 1: A method for building neighbor lists using Verlet lists. rcut is the cut-off radius and rVerlet is the Verlet radius.

Figure 2: A method for building neighbor lists using linked cells. rcut is the cut-off radius.
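
The refresh criterion mentioned above can be sketched as follows; this is a minimal illustration of the classic Verlet-margin test, not the exaNBody implementation:

#include <vector>

// Lists built with a radius rVerlet > rcut remain valid as long as no particle
// has moved farther than (rVerlet - rcut) / 2 since the last rebuild.
bool neighbor_lists_need_rebuild(const std::vector<double>& displacement2, // squared displacement of each particle since the last rebuild
                                 double rcut, double rVerlet)
{
  const double margin = 0.5 * (rVerlet - rcut);
  for (double d2 : displacement2)
  {
    if (d2 > margin * margin) return true; // one fast particle forces a rebuild of all lists
  }
  return false;
}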

Domain decomposition is usually employed in N-Body methods to address distributed-memory parallelization, assigning one subdomain to each MPI process. This implies the addition of ghost areas (replicated particles) around subdomains to ensure each particle has access to its neighborhood. Over time, many algorithms have been designed to improve load balancing, such as Recursive Coordinate Bisection (RCB), Recursive Inertial Bisection (RIB), Space Filling Curves (SFC), or graph methods with ParMETIS. Note that the Zoltan library gathers the most popular methods. To ease neighbor list construction (i.e. employing the linked-cell method), the simulation domain is described as a Cartesian grid of cells, each of which contains embedded particles. Each subdomain then consists of a grid of entire cells assigned to one MPI process. In turn, concurrent iteration over the cells of one subdomain's grid provides the basis for thread parallelization at the NUMA node level.

Commonalities are shared across N-body simulation codes, such as numerical schemes, neighbor particle detection, and short/long-range interactions. The computation time dedicated to interaction and force calculations can be significant (over 80% of the total time) depending on the complexity of the studied phenomenon. Additional factors, such as the computational cost of neighbor list construction, can impact overall simulation time. For dynamic simulations involving rapidly moving particles and computationally inexpensive interactions, more than 50% of the total time may be spent on neighbor list construction. The computationally intensive sections of the code vary depending on the methods and phenomena studied, requiring optimizations such as:

  • MPI parallelization

  • Vectorization

  • Multi-threading

  • GPU usage

4. Software stack of ExaNBody

4.1. Software Stack Layers:

  1. External Libraries: This layer encompasses external dependencies utilized by the software. These libraries are leveraged to access functionalities that are not internally developed, enhancing the overall capabilities of the system.

    • HPC Libraries: High-Performance Computing (HPC) libraries serve as powerful tools within the software stack, providing optimized routines and functionalities specifically designed for high-computational tasks.

  2. Low-Level and Runtime Support (Onika): Foundational infrastructure and runtime support are provided by this layer. Onika serves as the backbone, offering essential utilities for the execution and management of the software. It’s important to note that ongoing efforts are directed towards integrating frameworks like OpenMP or CUDA.

  3. N-Body Framework: Positioned at the core of the software stack, the N-Body framework provides a large variety of N-Body features to efficiently build N-Body simulations, such as traditional integration schemes, IO, or partitioning methods.

  4. Application Layer: This layer encapsulates the specific applications built upon the software stack. It provides the interface for users to interact with the functionalities offered by the stack and serves as the platform for N-Body application-specific implementations and functionalities.

4.2. Guidelines for operators:

  • Common Operators: These operators serve as N-Body fundamental operations utilized across various components of the software, facilitating consistent functionality and reducing redundancy.

  • Application-Specific Operators: Tailored operators designed to cater to the unique requirements of specific applications built on top of the software stack.

  • Mutualization: This process involves the consolidation and centralization of specific operators shared by multiple applications. It aims to optimize resources and streamline the execution of these shared functionalities.

  • Specific Operators Shared Among Applications: For increased efficiency and ease of access, specific operators that are shared among multiple applications are centralized within exaNBody. This consolidation not only ensures optimized utilization but also simplifies management and maintenance, benefiting all associated applications.

5. Performance and portability

The complex and ever-changing architectures of modern supercomputers make it difficult to maintain software performance. exaNBody aims at providing performance portability and sustainability on those supercomputers with robust domain decomposition, automated inter-process communications algorithms, adaptable particle data layout, and a set of hybrid (CPU/GPU) parallelization templates specialized for N-Body problems.

5.1. Application level specialization

First of all, the internal units to be used are specified, as well as the physical quantities to be stored as particle attributes. These quantities (or fields) are defined using a symbolic name associated with a type, e.g. velocity as a 3D vector. A field set is a collection of declared fields. One of the available field sets is selected and used at runtime, depending on simulation-specific needs. As depicted in the figure below, particles are dispatched into the cells of a Cartesian grid spanning the simulation domain. In short, the data structure containing all particle data is shaped as a Cartesian grid of cells, each cell containing all fields of all particles it (geometrically) contains. More specifically, the reason why fields and field sets are defined at compile time is that particle data storage at the cell level is handled via a specific structure guaranteeing access performance and a low memory footprint.
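
As a purely conceptual illustration (the names below are illustrative; the actual container, described in section 5.4, adds alignment, padding and memory-footprint guarantees), the resulting layout can be pictured as one structure of arrays per cell, gathered in a Cartesian grid:

#include <vector>
#include <cstddef>

// Conceptual sketch only: one array per declared field, one such structure per cell.
struct CellSketch
{
  std::vector<double> rx, ry, rz;  // position fields
  std::vector<double> vx, vy, vz;  // velocity field declared as a 3D vector
  std::size_t size() const { return rx.size(); } // particles geometrically contained in this cell
};

using GridSketch = std::vector<CellSketch>;  // Cartesian grid of cells spanning the (sub)domain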

5.2. Spatial domain decomposition and inter-process communications

The coarsest parallelization level can become the main bottleneck due to network latencies and load imbalance issues. To take advantage of this first level of parallelization, the simulation domain is divided into subdomains using a recursive coordinate bisection (RCB) algorithm, as depicted in the figure below, assigning one subdomain to each MPI process.

Figure 1: Overview of domain decomposition and inter-process communications in the exaNBody framework.

This is achieved thanks to three main components: a cell cost estimator, RCB domain decomposition, and particle migration. Particle migration can be used as-is by any N-Body application, thanks to the underlying generic particle data storage. It supports heavily multi-threaded, large-scale simulations while lowering peak memory usage. Additionally, the migration algorithm is customizable to fit specific application needs while keeping the core implementation unchanged. For instance, molecular dynamics simulations may transport per-cell data fields and discrete element method simulations may migrate friction information related to pairs of particles. Finally, ghost particle updates are available to any N-Body application via customizable components.

5.3. Intra-node parallelization API

This API is available in exaNBody to help developers express parallel computations within an MPI process. It offers a set of parallelization templates associated with three types of computation kernels:

  • Local calculations on a particle (such as resetting force operator)

  • calculations coupled with reduction across all particles (such as getting the total number of particles or the current temperature)

  • calculations involving each particle and its neighbors (such as computing potential forces in molecular dynamics or contact forces in the discrete element method)

When a developer injects a compute function into these templates, the computation may be routed to the CPU or the GPU, as illustrated in the figure below.

Figure 2: Example of a particle-centered computation executable on both CPU and GPU. Three ingredients: a user functor (the kernel), static execution properties (via traits specialization), and a ready-to-use parallelization function template.

While thread parallelization on the CPU is powered by OpenMP, CUDA is employed to execute the computation kernel on the GPU, using the same provided function. The main difference between the two execution modes is that each cell is a unitary work unit for a single thread in the OpenMP context, whereas it is processed by a block of threads in CUDA. These two parallelization levels (multi-core and GPU) are easily accessible to developers thanks to the execution support layer of Onika. Onika is the low-level software interface that powers exaNBody building blocks. It is responsible for the aforementioned data containers and memory management (unified with the GPU), and it is the foundation of the hybrid execution abstraction layer.

5.4. Particle data layout and auxiliary data structures

Particle data layout and auxiliary data structures are two essential features for maximizing performance at the NUMA node level. In exaNBody, particle data are packed independently in each cell using a container specifically designed to match both the CPU's SIMD and the GPU's thread block requirements concerning data alignment and vectorization-friendly padding. This generic container, available in the Onika toolbox, not only adapts to specific hardware characteristics at compile time, but also ensures a minimal memory footprint, with as little as 16 bytes of overhead per cell regardless of the number of data fields, allowing for very large and sparse simulation domains. N-Body simulations also heavily depend on the particle neighbor search algorithm and its storage structure. The search usually leverages the grid-of-cells structure to speed up the process, and the neighbor list data structure holds this information for several iterations. However, depending on the simulation, particles may move rapidly while their distribution may be heterogeneously dense. These two factors respectively impact the neighbor list update frequency and its memory footprint. On the one hand, exaNBody takes advantage of an Adaptive Mesh Refinement (AMR) grid to accelerate (frequent) neighbor list reconstructions. On the other hand, a compressed neighbor list data structure saves up to 80% of memory (compared to uncompressed lists) while still ensuring fast access from both the CPU and the GPU.

5.5. Flexible and user friendly construction of N-Body simulations

A crucial aspect of software sustainability is maintaining performance over time while managing software complexity and evolution. Complex and rapidly evolving scientific software often encounters common pitfalls, such as code duplication, uncontrolled growth of software inter-dependencies, and obscure data/control flows. This observation has led us to develop a component-based model to avoid these pitfalls. In our model, individual software components are implemented using C++17 and are application-structure oblivious, meaning they only know their input and output data flows. Application obliviousness is a crucial aspect of the present design, promoting reusability while preventing uncontrolled growth of internal software dependencies. Each component is developed as a class inheriting from the base class OperatorNode and explicitly declares its input and output slots (data flow connection points). Once compiled, these components are hierarchically assembled at runtime into a Sequential Task Flow (STF), described with a YAML syntax, as shown in the figure below.

Figure 3: Illustrative sample of component assembly using a YAML description. 1) C++-developed components are assembled and connected in the manner of an STF, creating a batch component. 2) and 3) illustrate the aggregation of batch components into higher and higher level components, up to the full simulation task flow.

A set of base components is already available to developers, embedded within exaNBody, such as common computations, checkpoint/restart, visualization, and in-situ analytics, allowing developers to focus on their application-specific components. We also observed that this component-based approach not only prevents some development pitfalls but also enables various simulation code structures. The YAML-formatted component configuration makes it simple for a user to amend or fine-tune the simulation process. For instance, it can be used to change the numerical scheme or to insert in-situ analysis components at specific stages of the simulation process, leveraging in-situ processing to limit disk I/O. Finally, this component-based splitting of the code gives exaNBody the opportunity to provide integrated profiling features that automatically produce meaningful performance metrics for each part of the simulation. These let the user access the computation time spent on CPU and GPU, as well as an imbalance indicator. They can also interoperate with Nsight Systems from NVIDIA and summarize the memory footprint with detailed consumption.

6. Algorithm Patterns

This is Developer Documentation.

The aim of this section is to explain to developers the tools used in exaNBody to propose default parallel algorithms. For this purpose, we will describe the 4 main algorithmic patterns:

  • block_parallel_for: exposes a simple parallelization scheme (like a parallel for in OpenMP), but with a notion of “thread blocks” and the possibility of asynchronous execution. It is defined directly in the Onika layer, as it is not specific to particle systems.

  • compute_cell_particles: performs a parallel processing on all particles contained within a grid (Grid type pattern).

  • reduce_cell_particles: performs an operation on each particle returning a value, the values obtained being reduced to a single one (e.g., a sum).

  • compute_cell_particle_pairs: the most important and most used pattern, allows calculations on particles and their neighbors.

Guidelines:

  • Support hybrid execution, automatically directing calculations either to the CPU cores or to the GPU.

  • Avoid code duplication between components.

  • Avoid the complexities of setting up OpenMP/CUDA parallelism.

  • Allow asynchronous parallel operations (with results retrieved later in the code).

  • Automatically adapt to the different OpenMP parallelism modes supported by the component system (simple, nested, and task-based).

  • Facilitate the implementation of a single computation function for both CPU and GPU.

  • Manage dynamic load balancing among the SMs of the GPUs themselves.

6.1. Block Parallel For

This construction is the most rudimentary; it is equivalent to an OpenMP parallel for directive but with many additional possibilities. We will look at a step-by-step example where a fixed value is added to all elements of an array of arrays. This two-dimensional array illustrates the two-level multi-threaded parallelism discussed above, which resembles the grid of cells, each containing a set of particles, typically processed in exaNBody.

6.1.1. Basic usage

Before calling block_parallel_for, it is necessary to define the functor (the calculation function) that will be applied in parallel to our data.

namespace exaStamp
{
  using namespace exanb;
  // a functor = function applied in parallel
  // it is defined by a class (or struct) with the call operator ()
  struct ValueAddFunctor
  {
    double **array;      // our 2D array
    size_t columns;      // number of columns in our array
    double m_value_to_add; // value to add

    ONIKA_HOST_DEVICE_FUNC       // works on CPU and GPU
    void operator () (size_t i)  // call operator with i in [0;n[
    {                            // a whole block (all its threads) execute iteration i
      ONIKA_CU_BLOCK_SIMD_FOR(size_t, j, 0, columns)  // parallelization among the threads of the current block
      {                                               // for iterations on j in [0;columns[
        array[i][j] += m_value_to_add;               // each thread executes 0, 1, or multiple iterations of j
      }
    }
  };
}

  • ONIKA_CU_BLOCK_SIMD_FOR: A macro that enables the second level of parallelism, among the threads of the same CUDA block.

  • Usage:

    • GPU: Ensures that the functor's code is executed across all threads within a CUDA block with a consistent iteration index i. When compiled for the GPU, it distributes iterations based on ONIKA_CU_BLOCK_SIZE and ONIKA_CU_THREAD_IDX.

    • CPU: On the host, this construct becomes a straightforward for loop decorated with an OpenMP SIMD directive, ensuring loop vectorization even with only one CPU thread per block, provided CPU and compiler support is available.

Next, we need to inform the compiler whether our functor is compatible with CUDA. In some situations, our code may have a strong incompatibility (such as calling a non-compatible library) that would prevent it from compiling correctly with the nvcc layer. This needs to be known at compile time (not just at runtime) because otherwise, compilation errors would occur.

namespace onika
{
  namespace parallel
  {
    template<>
    struct BlockParallelForFunctorTraits<exaStamp::ValueAddFunctor>
    {
      static constexpr bool CudaCompatible = true; // or false to prevent the code from being compiled with CUDA
    };
  }
}

The final step is to launch the parallel operation in the code of our component:

namespace exaStamp
{
  class MyValueAdd : public OperatorNode
  {
    ADD_SLOT(Array2D, my_array, INPUT, REQUIRED);
    ADD_SLOT(double, my_value, INPUT, 1.0);
    public:
    inline void execute() override final {
      ValueAddFunctor func = { my_array->data()     // data of our array (double**), initializes the 'array' member of our functor
                             , my_array->columns()  // number of columns in our 2D array
                             , *my_value };         // value to add to the elements of the array
      // Launching the parallel operation, which can execute on GPU if the execution context allows
      onika::parallel::block_parallel_for(my_array->rows()             // number of iterations, parallelize at the first level over rows
                                         , func                         // the function to call in parallel
                                         , parallel_execution_context() // returns the parallel execution context associated with this component
                                         );
    }
  };
}

6.1.2. Asynchronous usage

The following example builds on the previous one: optional parameters are added to control asynchronicity or even disable execution on the GPU.

namespace exaStamp
{
  class MyValueAdd : public OperatorNode
  {
    ADD_SLOT(Array2D, my_array, INPUT, REQUIRED);
    ADD_SLOT(double, my_value, INPUT, 1.0);
public:
    inline void execute() override final {
      size_t cols = my_array->columns();
      size_t rows = my_array->rows();
      bool enable_gpu = (cols * rows) > 1000;     // enable GPU execution only if the matrix has more than 1000 values
      ValueAddFunctor func = { my_array->data(), cols, *my_value };
      auto control = onika::parallel::block_parallel_for(
                       rows
                     , func
                     , parallel_execution_context()
                     , enable_gpu                    // enable or disable GPU execution
                     , true                          // request asynchronous execution
                     );
      std::cout << "Meanwhile, the parallel operation is executing..." << std::endl;
      control->wait();                               // wait for the operation to complete and results to be ready to read
      std::cout << "Parallel operation completed!" << std::endl;
    }
  };
}

If the execution context allows it, the parallel operation will proceed in the background, occupying either the free threads (other than those executing this code) or the GPU. This can be very useful, especially for overlapping computations and MPI message sends.

Warning

The operation is asynchronous only if the execution context permits it. Otherwise, it will proceed synchronously and complete before the block_parallel_for function returns. In such cases, calling control->wait() will simply have no effect.
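
As noted above, asynchronous execution is typically used to overlap computation with MPI communication. The following sketch reuses the asynchronous call from the previous example; the MPI arguments (send_buf, count, dest, tag, comm) are hypothetical placeholders prepared elsewhere in the component:

// Launch the parallel operation asynchronously (same call as in the example above).
auto control = onika::parallel::block_parallel_for(rows, func, parallel_execution_context(), true, true);

// Overlap: start an MPI exchange while the kernel runs in the background.
MPI_Request req;
MPI_Isend(send_buf, count, MPI_DOUBLE, dest, tag, comm, &req);

MPI_Wait(&req, MPI_STATUS_IGNORE);  // complete the communication first
control->wait();                    // then wait for the parallel operation's results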

6.2. Compute Cell Particles

This is the first specialized algorithmic pattern for particle systems. Therefore, it can only be applied to an instantiation of the Grid type from exaNBody. This pattern is the simplest; it will perform an independent operation in parallel on all particles in the system. To understand how this works in the implementation of a component, the following code presents a fully commented example aimed at increasing the velocity of all particles by a constant vector.

#include <exanb/core/operator.h>
#include <exanb/core/operator_slot.h>
#include <exanb/core/operator_factory.h>
#include <exanb/core/make_grid_variant_operator.h> // because we use make_grid_variant_operator for component registration
#include <exanb/compute/compute_cell_particles.h>  // provides the compute_cell_particles function

namespace exaStamp {
  struct AddVec3Functor        // our functor, adds a vector to the 3 components of particle velocity
  {
    const Vec3d vec_to_add;    // functor parameter: the velocity to add

    ONIKA_HOST_DEVICE_FUNC
    inline void operator ()(double& vx, double& vy, double& vz) const // parameters defined by compute_fields_t declared below
    {
      vx += vec_to_add.x;  // perform the desired operation
      vy += vec_to_add.y;  // particle attributes are passed by reference (&), so they can be modified
      vz += vec_to_add.z;  // here, we only need to implement the local operation on the particle, without worrying about thread block levels
    }
  };
}

namespace exanb {
  template<> struct ComputeCellParticlesTraits<exaStamp::AddVec3Functor>
  {
    static inline constexpr bool RequiresBlockSynchronousCall = false; // no collaboration between threads of the same block
    static inline constexpr bool CudaCompatible = true;                // compatible with Cuda (thanks to ONIKA_HOST_DEVICE_FUNC usage)
  };
}

namespace exaStamp
{
  template< class GridT                                                        // our component adapts to any grid type
          , class = AssertGridHasFields<GridT,field::_vx,field::_vy,field::_vz> > // as long as particles have the vx, vy, and vz attributes
  class ParticleVelocityAdd : public OperatorNode
  {
      ADD_SLOT(Vec3d, vec, INPUT, REQUIRED);    // vector to add to particle velocities
      ADD_SLOT(GridT, grid, INPUT_OUTPUT);      // grid containing cells and all particles within the sub-domain
public:
    inline void execute() override final {
      using compute_fields_t = FieldSet<field::_vx, field::_vy, field::_vz>; // fields on which our functor operates
      AddVec3Functor func = { *vec };                                     // instantiate the functor with the velocity provided as input to the component
      compute_cell_particles(*grid                    // grid containing particles
                            , false                    // do not apply our function in ghost zones
                            , func                     // the functor to apply
                            , compute_fields_t{}       // attributes on which to compute ⇒ defines the functor's call parameters
                            , parallel_execution_context() // component's parallel execution context
                            );
    }
  };
  // component registration
  template<class GridT> using ParticleVelocityAddTmpl = ParticleVelocityAdd<GridT>;
  CONSTRUCTOR_FUNCTION {
    OperatorNodeFactory::instance()->register_factory("add_velocity", make_grid_variant_operator<ParticleVelocityAddTmpl>);
  }
}

6.3. Reduce Cell Particles

This algorithm pattern is very similar to the previous one, but it allows for computing the reduction of a calculation result across all particles. As we will see in the commented example below, the usage of this construction will be very similar to compute_cell_particles with one important exception: the functor can be called in multiple ways (multiple implementations of the operator () are possible). This reflects the need to perform reduction in three stages for performance reasons: local reduction within each thread, reduction of partial sums computed by threads within the same thread block, and finally the overall reduction that sums the partial contributions from different thread blocks.

#include <exanb/core/operator.h>
#include <exanb/core/operator_slot.h>
#include <exanb/core/operator_factory.h>
#include <exanb/core/make_grid_variant_operator.h> // because we use make_grid_variant_operator for component registration
#include <exanb/compute/reduce_cell_particles.h>   // provides the reduce_cell_particles function

namespace exaStamp {
  struct ReduceVec3NormFunctor // our functor calculates the sum of norms of forces
  {
    ONIKA_HOST_DEVICE_FUNC                                    // operator for local reduction within a thread
    inline void operator ()( double & sum_norm                // reference to accumulate contributions
                           , double fx, double fy, double fz  // particle force, parameters determined by reduce_fields_t declared below
                           , reduce_thread_local_t            // phantom parameter to differentiate call forms (here local thread reduction)
                           ) const {
      sum_norm += sqrt( fx*fx + fy*fy + fz*fz );              // compute norm and add contribution
    }

    ONIKA_HOST_DEVICE_FUNC                          // operator for internal reduction within a thread block to a single block value
    inline void operator ()( double& sum_norm       // reference to accumulate contributions from threads within the block
                           , double other_sum       // one of the partial sums to accumulate
                           , reduce_thread_block_t  // indicates block-level reduction
                           ) const {
      ONIKA_CU_ATOMIC_ADD( sum_norm , other_sum );  // atomic addition function (thread-safe), works in both CUDA and CPU context
    }

    ONIKA_HOST_DEVICE_FUNC                       // operator for reduction across all thread blocks
    inline void operator ()( double& sum_norm    // reference to global result
                           , double other_sum    // contribution from one block to add
                           , reduce_global_t     // indicates final reduction
                           ) const {
      ONIKA_CU_ATOMIC_ADD( sum_norm , other_sum );
    }
  };
}

namespace exanb {
  template<> struct ReduceCellParticlesTraits<exaStamp::ReduceVec3NormFunctor>
  {
    static inline constexpr bool RequiresBlockSynchronousCall = false; // does not use intra-block thread collaboration
    static inline constexpr bool RequiresCellParticleIndex = false;    // no additional parameters (cell/particle indices)
    static inline constexpr bool CudaCompatible = true;                // CUDA compatible
  };
}

namespace exaStamp {
template< class GridT                                                      // our component adapts to any grid type
        , class = AssertGridHasFields<GridT,field::_fx,field::_fy,field::_fz> > // as long as particles have fx, fy, and fz attributes
class SumForceNorm : public OperatorNode
{
    ADD_SLOT(GridT, grid, INPUT, REQUIRED);    // grid containing cells and all particles within the sub-domain
    ADD_SLOT(double, sum_norm, OUTPUT);       // output value = sum of force norms
public:
  inline void execute() override final {
    using reduce_fields_t = FieldSet<field::_fx, field::_fy, field::_fz>; // fields on which our functor operates
    ReduceVec3NormFunctor func = {};                                    // instantiate the functor for summing force norms
    *sum_norm = 0.0;
    reduce_cell_particles(*grid                         // grid containing particles of the system
                         , false                         // do not compute in ghost zones
                         , func                          // functor to use
                         , *sum_norm                     // initial value for reduction input, final result output
                         , reduce_fields_t{}             // fields used for reduction, defines call parameters
                         , parallel_execution_context()  // current component's parallel execution context
                         );
  }
};

// component registration similar to the one in the example for compute_cell_particles
}
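
For completeness, a registration sketch mirroring the one shown for compute_cell_particles; the operator name "sum_force_norm" is only illustrative:

namespace exaStamp
{
  // component registration (illustrative operator name)
  template<class GridT> using SumForceNormTmpl = SumForceNorm<GridT>;
  CONSTRUCTOR_FUNCTION {
    OperatorNodeFactory::instance()->register_factory("sum_force_norm", make_grid_variant_operator<SumForceNormTmpl>);
  }
}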

This code defines a component (SumForceNorm) that computes the sum of norms of forces acting on particles within a grid (grid). Key points to note:

  • It uses a functor (ReduceVec3NormFunctor) with multiple operator overloads to perform reduction across particles in three stages: local to each thread, within each thread block, and globally across all thread blocks.

  • Traits (ReduceCellParticlesTraits) are specialized to specify that the functor supports CUDA and does not require intra-block thread synchronization.

  • The component is registered using OperatorNodeFactory, making it accessible under a specific name for instantiation.

This example demonstrates advanced parallel computation techniques within a particle system framework (exaNBody), focusing on efficient reduction operations across large datasets.

6.4. Compute Pair Interaction

Pattern Overview:

  • Purpose: Compute interactions (potentials or forces) between pairs of particles based on their proximity (rij < rcut).

  • Components: Utilizes both particle grids and neighbor lists (see the neighbor lists section).

  • Two Invocation Modes:

    • With Buffer: Collects attributes of neighboring particles into a buffer before invoking the user-defined functor once with this buffer.

    • Without Buffer: Computes interactions “on-the-fly” without pre-collecting neighbors’ attributes.

  • Usage Examples:

    • Scenario 1: Interaction potentials require knowledge of all neighboring particles of at least one of the particles involved. Here, compute_cell_particle_pairs first accumulates neighbors’ attributes into a buffer and then invokes the user functor.

    • Scenario 2: Interaction potentials only require attributes of the two particles involved, making it efficient to compute interactions without an intermediate buffer.

  • Flexibility:

    • Functors can be implemented to support both invocation protocols (with and without buffer), allowing the subsystem to choose the optimal method based on the execution environment and computational requirements.

    • Functors can implement only one of these protocols if the scenario allows for a straightforward implementation.

  • Implementation:

    • The pattern involves defining a functor (e.g. LennardJonesForceFunctor below) that encapsulates the logic for computing interactions between pairs of particles.

    • Traits (ComputePairTraits) are used to specify compatibility with CUDA and whether synchronous block calls are required.

Header files:

#pragma xstamp_enable_cuda                           // Enable compilation with nvcc, allowing GPU code generation
#include <exanb/core/operator.h>                     // Base to create a component
#include <exanb/core/operator_factory.h>             // For registering the component
#include <exanb/core/operator_slot.h>                // For declaring component slots
#include <exanb/core/make_grid_variant_operator.h>   // For registering a template component

#include <exanb/core/grid.h>                         // Defines the Grid type containing particles
#include <exanb/core/domain.h>                       // Domain type, represents the simulation domain
#include <exanb/particle_neighbors/chunk_neighbors.h>// GridChunkNeighbors type ⇒ lists of neighbors
#include <exanb/compute/compute_cell_particle_pairs.h>// For compute_cell_particle_pairs function

Functor Creation:

For this example (the Lennard-Jones potential), we implement both possible forms of invocation so that our functor can be used with or without an intermediate buffer. This allows compute_cell_particle_pairs to choose the most efficient method based on the context. In practice, you can implement only one of these forms (if your potential formulation allows) or both; either way, it will work on both CPU and GPU.

namespace microStamp {
  using namespace exanb;

  struct LennardJonesForceFunctor {
    const LennardJonesParms m_params;  // parameters of our interaction potential
    // Operator for using an intermediate buffer containing all neighbors
    template<class ComputePairBufferT, class CellParticlesT>
    ONIKA_HOST_DEVICE_FUNC
    inline void operator () (
        size_t n,                                // number of neighboring particles
        const ComputePairBufferT& buffer,         // intermediate buffer containing neighbor information
        double& e,                                // reference to energy, where we accumulate energy contribution
        double& fx, double& fy, double& fz,       // references to force components where we add interaction contribution
        CellParticlesT* cells                    // array of all cells, in case additional information is needed
    ) const {
      double _e = 0.0;                           // local contributions, initialized to 0
      double _fx = 0.0;
      double _fy = 0.0;
      double _fz = 0.0;

      for (size_t i = 0; i < n; ++i) {           // loop over neighbors in the buffer
        const double r = std::sqrt(buffer.d2[i]);
        double pair_e = 0.0, pair_de = 0.0;

        lj_compute_energy(m_params, r, pair_e, pair_de);  // calculate energy and its derivative

        const auto interaction_weight = buffer.nbh_data.get(i);
        pair_e *= interaction_weight;                    // weight the interaction and normalize over distance rij
        pair_de *= interaction_weight / r;

        _fx += pair_de * buffer.drx[i];                  // add contributions from the i-th neighbor
        _fy += pair_de * buffer.dry[i];
        _fz += pair_de * buffer.drz[i];
        _e += 0.5 * pair_e;
      }

      e += _e;   // add local contributions to energy and force fields of the central particle
      fx += _fx;
      fy += _fy;
      fz += _fz;
    }

    // Operator for without buffer: one call per neighboring particle
    template<class CellParticlesT>
    ONIKA_HOST_DEVICE_FUNC
    inline void operator () (
        Vec3d dr,                           // relative position of the neighboring particle
        double d2,                          // square of the distance to the neighboring particle
        double& e, double& fx, double& fy, double& fz,  // references to variables to update with energy and force contributions of the pair
        CellParticlesT* cells,              // array of all cells, in case additional information is needed
        size_t neighbor_cell,               // index of the cell where the neighboring particle resides
        size_t neighbor_particle,           // index of the neighboring particle within its cell
        double interaction_weight           // weighting to apply on the interaction
    ) const {
      const double r = sqrt(d2);
      double pair_e = 0.0, pair_de = 0.0;

      lj_compute_energy(m_params, r, pair_e, pair_de);  // calculate energy and its derivative
      pair_e *= interaction_weight;                    // weight and normalize by distance
      pair_de *= interaction_weight / r;

      fx += pair_de * dr.x;                            // add contributions of the pair to energy and forces of the central particle
      fy += pair_de * dr.y;
      fz += pair_de * dr.z;
      e += 0.5 * pair_e;
    }
  };
}

Define the compile-time characteristics of the functor:

namespace exanb
{
  template<> struct ComputePairTraits< microStamp::LennardJonesForceFunctor > // specialization for our functor
  {
    static inline constexpr bool RequiresBlockSynchronousCall = false; // no collaboration between threads within a block
    static inline constexpr bool ComputeBufferCompatible = true;  // allows invocation with a compute buffer
    static inline constexpr bool BufferLessCompatible = true;     // allows invocation without a buffer, for each neighbor
    static inline constexpr bool CudaCompatible = true;           // compatible with Cuda
  };
}

Performing the calculation in parallel within the execute method of our component:

template<class GridT,                                                            // our component adapts to any type of grid
  class=AssertGridHasFields<GridT,field::_ep,field::_fx,field::_fy,field::_fz> > // as long as particles have the ep, fx, fy, and fz attributes
class LennardJonesForce : public OperatorNode {
  ADD_SLOT( LennardJonesParms        , config         , INPUT, REQUIRED);  // parameters of our potential
  ADD_SLOT( double                   , rcut           , INPUT, 0.0);       // cutoff radius
  ADD_SLOT( exanb::GridChunkNeighbors, chunk_neighbors, INPUT, exanb::GridChunkNeighbors{}); // neighbor lists
  ADD_SLOT( bool                     , ghost  , INPUT , false);            // indicates if we calculate in ghost zones
  ADD_SLOT( Domain                   , domain , INPUT , REQUIRED);         // simulation domain
  ADD_SLOT( GridT                    , grid   , INPUT_OUTPUT );            // grid containing particles in the local subdomain
  using ComputeBuffer = ComputePairBuffer2<>;                              // shortcut for the type "buffer containing neighbors"
  using CellParticles = typename GridT::CellParticles;                     // shortcut for the type "grid cell"
  using NeighborIterator = exanb::GridChunkNeighborsLightWeightIt<false>;  // shortcut for the type "neighbor iterator"
  using ComputeFields = FieldSet<field::_ep    // type defining the list of fields on which our functor operates
                                ,field::_fx    // this list determines part of the functor's call parameters
                                ,field::_fy    // we declare ep, fx, fy, and fz, which are of type double, so we will find
                                ,field::_fz>;  // references to these attributes as parameters of our functor.
public:
  inline void execute () override final
  {
    auto optional = make_compute_pair_optional_args(        // builds the list of particle traversal options
                      NeighborIterator{ *chunk_neighbors }  // iterator over neighbor lists
                    , ComputePairNullWeightIterator{}       // iterator over weighting values (no weighting here)
                    , LinearXForm{ domain->xform() }        // transformation to apply on particle positions (domain deformation here)
                    , ComputePairOptionalLocks<false>{});   // accessor to particle access locks (no locking here)
    compute_cell_particle_pairs( *grid                                     // grid containing particles
                               , *rcut                                     // cutoff radius
                               , *ghost                                    // whether to calculate in ghost zones or not
                               , optional                                  // particle traversal options
                               , make_compute_pair_buffer<ComputeBuffer>() // creates neighbor buffers
                               , LennardJonesForceFunctor{ *config }       // instantiation of our functor with user parameters
                               , ComputeFields{}                           // attributes on which our functor operates
                               , DefaultPositionFields{}                   // uses default attributes (rx, ry, and rz) for particle positions
                               , parallel_execution_context() );           // parallel execution context of the current component
  }
};

7. List of Plugins

7.1. DefBox

7.1.1. Operator a

7.1.2. Operator b

7.1.3. Operator c

7.2. Core features

Independently from any loaded plugins, a set of core features is always available for use. This includes, but is not limited to, domain description, the particle grid, MPI core features, and more.

7.2.1. Domain

The domain struct holds information about the simulation domain size and properties. It is initialized by an operator named “domain” in the YAML input file. The main domain properties are:

  • domain lower and upper bounds

  • cell size

  • cell grid dimensions

  • boundary conditions

Here is an example of domain initialization using the “domain” operator:

domain:
  cell_size: 10.0 ang
  bounds: [ [ 0.0 , 0.0 , 0.0 ] , [ 10.0 , 10.0 , 10.0 ] ]
  periodic: [false,true,true]
  mirror: [ X- ]

Periodic indicates which directions have periodic boundary conditions. Mirror tells which sides have mirror (reflective) boundary conditions. A reflective mirror side can be applied on any side of the domain box. A direction with at least one mirror boundary cannot be periodic, and vice versa. As an example, if the X direction is not periodic, it can have a mirror condition at the lower end (X-) and/or the upper end (X+). If the user sets X to be periodic and adds an X-/X+ mirror, X periodicity is disabled automatically. Finally, the mirror property is a list of sides on which mirror conditions are applied, among X-, X+, X, Y-, Y+, Y, Z-, Z+ and Z. The mirror side X is a shortcut for enabling X- and X+ at the same time, as are Y and Z.

7.3. Logic

7.4. AMR

7.5. ParticleNeighbors

7.6. GridCellParticles

7.6.1. Cartesian Field Grid

  • Operator Name: set_cell_values

  • Description: This operator initializes the values of a specific cell value field, creating it if needed. Optionally, the initialization can be restricted to a specified region, the rest of the field being set to 0. This operator can also be used to refine the grid.

  • Parameters:

    • region: Region of the field where the value is to be applied.

    • grid_subdiv: Number of (uniform) subdivisions required for this field. Note that the refinement is an octree.

    • field_name: Name of the field.

    • value: List of values assigned to the field.

YAML example:

- set_cell_values:
   field_name: jet
   grid_subdiv: 30
   value: [1, 0, 0, 20]
   region: GEYSERE

7.7. IO

7.8. Extra Storage

The purpose of this plugin is to add dynamic storage of any type associated with a particle. This package provides the data structure for storage, the associated MPI buffers, and operators to move data between grid cells (move particles and migrate cell particles). This package was designed to meet the needs of exaDEM, as friction force values must be preserved from one time step to the next. Additionally, this package allows the extra storage to be included in dump files.

To use this additional storage, you need the data structure GridExtraDynamicDataStorageT instantiated with your data type, and the slot name ges (for grid extra storage). This structure is a grid of CellExtraDynamicDataStorageT, which itself contains an array of your data type (m_data) and a particle information array (m_info) containing, for each particle, the start index in the m_data array, the number of stored items, and the particle index.

template<typename ItemType> struct CellExtraDynamicDataStorageT
{
  using UIntType = uint64_t;
  using InfoType = ExtraStorageInfo;
  onika::memory::CudaMMVector<InfoType> m_info; /**< Info vector storing, for each particle, the [start index, number of items, particle id] of its extra dynamic data in m_data. */
  onika::memory::CudaMMVector<ItemType> m_data; /**< Data vector storing the cell's extra dynamic data. */
};

The filling of these data structures is your responsibility; however, it is possible to instantiate an operator to verify that the filling has been done correctly:

namespace exanb
{
  template<class GridT> using CheckInfoConsistencyYourDataTypeTmpl = CheckInfoConsistency<GridT, GridExtraDynamicDataStorageT<YOUR_DATA_TYPE>>;

  // === register factories ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "check_es_consistency_your_data_type", make_grid_variant_operator< CheckInfoConsistencyYourDataTypeTmpl > );
  }
}

The following code is an example of how to correctly fill the data structure (type = Interaction):

typedef GridExtraDynamicDataStorageT<Interaction> GridCellParticleInteraction;
ADD_SLOT( GridT, grid, INPUT_OUTPUT , REQUIRED );
ADD_SLOT( GridCellParticleInteraction , ges , INPUT_OUTPUT );
auto& g = *grid;
const auto cells = g.cells();
const size_t n_cells = g.number_of_cells(); // nbh.size();
auto & ces = ges->m_data;
assert( ces.size() == n_cells );
const IJK dims = g.dimension();
const int gl = g.ghost_layers();

#pragma omp parallel
{
  Interaction item;
  GRID_OMP_FOR_BEGIN(dims-2*gl,_,block_loc, schedule(guided) )
  {
    IJK loc_a = block_loc + gl;
    size_t cell_a = grid_ijk_to_index( dims , loc_a );
    const unsigned int n_particles = cells[cell_a].size();
    auto& storage = ces[cell_a];
    auto& data = storage.m_data;
    auto& info = storage.m_info;
    // auto& history = extract_history(data);
    // You can extract data before initialize.
    storage.initialize(n_particles);
    for(size_t i = 0 ; i < n_particles ; i++)
    {
      // Do some stuff and fill item.
      // You can add several items here.
      auto& [offset, size, id] = info[i];
      size++;
      data.push_back(item);
      // you can update the particle offset here.
    }
  }
  GRID_OMP_FOR_END
  // you can fit offsets here instead of in the omp loop. (offset(i) = offset(i-1) + size(i-1))
}
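
The offset fix-up mentioned in the last comment can be sketched as a per-cell prefix sum over the info array, assuming the same ces layout and [offset, size, id] triplet as above:

// Rebuild offsets from sizes once the fill loop is done: offset(i) = offset(i-1) + size(i-1).
for(size_t cell_a = 0 ; cell_a < ces.size() ; cell_a++)
{
  auto& info = ces[cell_a].m_info;
  uint64_t offset = 0;
  for(size_t i = 0 ; i < info.size() ; i++)
  {
    auto& [off, size, id] = info[i];
    off = offset;      // start index of this particle's items in m_data
    offset += size;    // the next particle's items start right after
  }
}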

Warning:

  • This package allows for as many external storages as there are types; however, it’s not possible to have two additional storages of the same type.

  • Don’t forget to adjust the size of this storage to the number of cells in the grid when first using it.

  • This package does not integrate with routines for particle-level calculations such as compute_cell_particles.

Tip:

  • Before sending or writing data, consider removing unnecessary information. For example, in DEM, if the friction is equal to (0,0,0), you can drop this data to save space (for more details, see the compress_interaction operator in exaDEM).

7.8.1. Extra Data Checker

  • Operator: check_es_consistency_double

    • Description: This operator checks, for each particle, that the offset and size information is consistent.

    • ges: Your grid of additional data storage.

YAML example:

check_es_consistency_double

7.8.2. Migrate Cell Particles With Extra Storage

  • Operator: migrate_cell_particles_double (example)

    • Description: migrate_cell_particles does two things:

        1. It repartitions the data across MPI processes, as described by lb_block.

        2. It reserves space for ghost particles, but does not populate ghost cells with particles. The ghost layer thickness (in number of cells) depends on ghost_dist. Inputs from different MPI processes may have overlapping cells (but no duplicate particles); the resulting grids (across all MPI processes) never have overlapping cells. The ghost cells are always empty after this operator.

    • ges: Your grid of additional data storage.

    • bes : Your buffer used for particles moving outside the box

    • buffer_size : Performance tuning parameter. Size of send/receive buffers in number of particles.

    • copy_task_threshold : Performance tuning parameter. Number of particles in a cell above which an asynchronous OpenMP task is created to pack particles to send buffer.

    • extra_receive_buffers: Performance tuning parameter. Number of extra receive buffers allocated, allowing for asynchronous (OpenMP task) particle unpacking. A negative value n is interpreted as -n*NbMpiProcs.

    • force_lb_change : Force particle packing/unpacking to and from send buffers even if a load balancing has not been triggered

    • otb_particles : Particles outside of local processor’s grid

  • In practice, do not tune this operator yourself.

How to create your operator:

#include <exanb/extra_storage/migrate_cell_particles_es.hpp>
namespace exanb
{
  template<class GridT> using MigrateCellParticlesYourDataTypeTmpl = MigrateCellParticlesES<GridT, GridExtraDynamicDataStorageT<your_data_type>>;

  // === register factory ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "migrate_cell_particles_your_data_type", make_grid_variant_operator<MigrateCellParticlesYourDataTypeTmpl> );
  }
}

YAML example:

migrate_cell_particles_double

7.8.3. Move Particles With Extra Storage

  • Operator: move_particles_double (example)

    • Description : This operator moves particles and extra data storage (es) across cells.

    • ges: Your grid of additional data storage.

    • bes : Your buffer used for particles moving outside the box

    • otb_particles: Particles outside of the local processor’s grid

    • In practice, do not tune this operator yourself

How to create your operator:

#include <exanb/extra_storage/move_particles_es.hpp>
namespace exanb
{
  template<class GridT> using MoveParticlesYourDataTypeTmpl = MigrateCellParticlesWithES<GridT, GridExtraDynamicDataStorageT<your_data_type>>;

  // === register factory ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "move_particles_your_data_type", make_grid_variant_operator<MoveParticlesYourDataTypeTmpl> );
  }
}

YAML example:

move_particles_double

7.8.4. IO Writer With Extra Data

There is no ready-made operator in exaNBody for writing dump files with extra storage, because the fields to store must be specified explicitly. However, a non-instantiated templated operator is provided for this purpose. Below is an example with exaDEM and the Interaction data type.

#include <exaDEM/interaction/grid_cell_interaction.hpp>
#include <exanb/extra_storage/sim_dump_writer_es.hpp>
#include <exanb/extra_storage/dump_filter_dynamic_data_storage.h>

namespace exaDEM
{
  using namespace exanb;
  using DumpFieldSet = FieldSet<field::_rx,field::_ry,field::_rz, field::_vx,field::_vy,field::_vz, field::_mass, field::_homothety, field::_radius, field::_orient , field::_mom , field::_vrot , field::_arot, field::_inertia , field::_id , field::_shape >;

  template<typename GridT> using SimDumpWriteParticleInteractionTmpl = SimDumpWriteParticleES<GridT, exaDEM::Interaction, DumpFieldSet>;

  // === register factories ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "write_dump_particle_interaction" , make_grid_variant_operator<SimDumpWriteParticleInteractionTmpl> );
  }
}

For the description of the operator slots, see write_dump_particle_interaction in the exaDEM documentation. Tip: compress the extra storage before writing the dump data file.

YAML example:

dump_data_particles:
  - timestep_file: "exaDEM_%09d.dump"
  - message: { mesg: "Write dump " , endl: false }
  - print_dump_file:
      rebind: { mesg: filename }
      body:
        - message: { endl: true }
  - compress_interaction
  - stats_interactions
  - write_dump_particle_interaction
  - chunk_neighbors_impl

7.8.5. IO Reader With Extra Data

There is no operator in exaNBody for reading dump files with extra storage, because the fields to read must be specified explicitly. However, a non-instantiated templated operator is provided for this purpose. The example below uses exaDEM and its Interaction data type.

#include <exaDEM/interaction/grid_cell_interaction.hpp>
#include <exanb/extra_storage/sim_dump_reader_es.hpp>

namespace exaDEM
{
  using namespace exanb;
  using DumpFieldSet = FieldSet<field::_rx,field::_ry,field::_rz, field::_vx,field::_vy,field::_vz, field::_mass, field::_homothety, field::_radius, field::_orient , field::_mom , field::_vrot , field::_arot, field::_inertia , field::_id , field::_shape >;

  template<typename GridT> using SimDumpReadParticleInteractionTmpl = SimDumpReadParticleES<GridT, exaDEM::Interaction, DumpFieldSet>;

  // === register factories ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "read_dump_particle_interaction" , make_grid_variant_operator<SimDumpReadParticleInteractionTmpl> );
  }
}

For the description of the operator slots, see read_dump_particle_interaction in the exaDEM documentation.

YAML example:

read_dump_particle_interaction:
   filename: last.dump
   override_domain_bounds: false
   #scale_cell_size: 0.5

7.9. MPI

7.9.1. Update Ghost Layers

  • Operator: ghost_update_r and ghost_update_all
    • Description : These operators are in charge of updating ghost zones between two sub-domains, copying the information required at sub-domain boundaries and for periodic conditions. The ghost_update_r operator copies only the positions, while ghost_update_all copies all fields defined in your grid type.

    • gpu_buffer_pack : boolean value [false] deciding whether the pack/unpack routines are run on GPU.

    • async_buffer_pack : boolean value [false] enabling overlap of successive pack and unpack calls (buffers are sent as soon as possible).

    • staging_buffer : boolean value [false] enabling a copy to a pure CPU buffer before MPI calls (highly recommended when packing on GPU).

    • serialize_pack_send : boolean value [false] forcing all send buffers to be built before the first one is sent.

Example in your msp file:

- ghost_update_r:
   gpu_buffer_pack: true
   async_buffer_pack: true
   staging_buffer: true

Note that you can define a customized ghost_update_XXX operator for your application, for example:

namespace exaDEM
{
        using namespace exanb;
        using namespace UpdateGhostsUtils;
        // === register factory ===
        template<typename GridT> using UpdateGhostsYourFields = UpdateGhostsNode< GridT , FieldSet<field::_rx, field::_ry, field::_rz , list_of_your_fields > , false >;

        CONSTRUCTOR_FUNCTION
        {
                OperatorNodeFactory::instance()->register_factory( "ghost_update_XXX",     make_grid_variant_operator<UpdateGhostsYourFields> );
        }
}
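
Once registered, such an operator can be invoked from the input file like the built-in ones. A minimal sketch, assuming the customized operator exposes the same tuning slots as ghost_update_r and ghost_update_all:

- ghost_update_XXX:
   gpu_buffer_pack: true
   async_buffer_pack: true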

7.9.2. MPI Barrier

This operator is used to create synchronization points between MPI processes. In practice, it is used to obtain accurate timing information from operators during performance studies; otherwise, timings accumulate in operators containing MPI collective routines such as displ_over.

  • Operator : mpi_barrier

    • Description : Adds an MPI_Barrier on the provided communicator (MPI_COMM_WORLD by default).

    • mpi : MPI_Comm, default is MPI_COMM_WORLD

YAML Example:

- mpi_barrier

8. ExaNBody Command Lines

ExaNBody offers a range of command-line options to facilitate debugging, profiling, and configuring OpenMP settings, providing users with enhanced control and flexibility over the behavior and performance of the code. These options enable developers to diagnose issues, analyze performance characteristics, and optimize parallel execution on multicore architectures. From enabling debug mode to specifying thread affinity and scheduling policies, users can tailor ExaNBody’s behavior to suit their specific requirements and achieve optimal performance. This section outlines the various command-line options available, empowering users to leverage ExaNBody’s capabilities effectively.

8.1. Command line and input file interaction

exaNBody based applications treat the command line the same way as input files: command-line arguments are just YAML elements expressed with another syntax. The YAML document processed by an exaNBody application is formed by a set of included files (included either implicitly or explicitly through the includes list) and the user input file passed as the first command-line argument, as in the following example:

./exaStamp myinput.msp

The YAML document built from the user input and its included files has 3 reserved dictionary entries: configuration, reserved for configuring the execution sub-system; includes, reserved for including other YAML files; and simulation, which is interpreted as the root batch operator representing the whole simulation to run.

When it comes to interpreting command-line arguments, the exaNBody based application parses them and internally converts them to a YAML document, which is merged with the previously read YAML inputs (user input file and included files). Conversion from command-line arguments to YAML proceeds as follows:

  • If an argument starts with --set-, it is understood as a generic YAML dictionary entry, and each ‘-’ is interpreted as a marker for an inner dictionary entry. The following value is understood as being the value associated with the given dictionary key. If no value is found after a --set-xxx style argument, the value true is implicitly used. For instance, --set-global-rcut_inc ‘1.2 m’ is equivalent to including a YAML file which contains:

global:
  rcut_inc: 1.2 m
  • If a command-line argument starts with --xxx, with xxx being anything but ‘set’, then similar rules apply as for the --set-xxx arguments, but the YAML block is understood as a dictionary entry inside the configuration block. For instance, --logging-debug true is equivalent to the inclusion of a YAML file containing:

configuration:
  logging:
    debug: true
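
For instance, combining both argument styles on a single command line (the values are those used in the examples above):

./exaStamp myinput.msp --logging-debug true --set-global-rcut_inc '1.2 m'

is equivalent to merging the following YAML document with myinput.msp and its included files:

configuration:
  logging:
    debug: true
global:
  rcut_inc: 1.2 m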

8.2. Commands ‘Help’ for your application

exaNBody provides several help commands to assist users in understanding the available command lines, plugins, and operators. Below are the descriptions and usage of each help command.

8.2.1. General Help Command

To display the general help information:

./exaDEM --help

This command shows the available help command lines.

8.2.2. Command-Line Options Help

To display the command line options available:

./exaDEM --help command-line

This command lists the command line options you can use to customize the configuration block. See the next section named “Index of command line options to customize configuration block” for detailed information.

8.2.3. Plugins Help

To print all operators and plugins available in your application:

./exaDEM --help plugins

This command prints every operator and plugin available in exaNBody.

8.2.4. Show Plugins Help

To show detailed information for every operator, including descriptions and slot operators:

./exaDEM --help show-plugins

This command provides detailed information for all operators, including their descriptions and slot operators.

8.2.5. Operator-Specific Help

To show details for a specific operator, including its description and slot operators:

./exaDEM --help [operator-name]

Replace [operator-name] with the name of the operator you want details for.
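
For instance, assuming a gravity operator is registered under the name gravity_force (a hypothetical name used here for illustration), its description and slots can be displayed with:

./exaDEM --help gravity_force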

8.3. Index of command line options to customize configuration block

Option                               Value type     Default
--logging-parallel                   bool           false
--logging-debug                      bool           false
--logging-log_file                   std::string    ""
--logging-err_file                   std::string    ""
--logging-dbg_file                   std::string    ""
--profiling-resmem                   bool           false
--profiling-exectime                 bool           false
--profiling-summary                  bool           false
--profiling-filter                   StringVector   {}
--profilingtrace-enable              bool           false
--profilingtrace-format              std::string    "yaml"
--profilingtrace-file                std::string    "trace"
--profilingtrace-color               std::string    "operator"
--profilingtrace-total               bool           false
--profilingtrace-idle                bool           true
--profilingtrace-trigger             std::string    ""
--profilingtrace-trigger_interval    IntVector      {}
--profilingtrace-idle_resolution     long           8192
--profilingtrace-idle_smoothing      long           32
--debug-plugins                      bool           false
--debug-config                       bool           false
--debug-yaml                         bool           false
--debug-graph                        bool           false
--debug-ompt                         bool           false
--debug-graph_addr                   bool           false
--debug-graph_lod                    int            1
--debug-graph_fmt                    std::string    "console"
--debug-graph_rsc                    bool           false
--debug-files                        bool           false
--debug-rng                          std::string    ""
--debug-graph_filter                 StringVector   {}
--debug-filter                       StringVector   {}
--debug-particle                     UInt64Vector   {}
--debug-particle_nbh                 bool           false
--debug-particle_ghost               bool           false
--debug-fpe                          bool           false
--debug-verbose                      int            0
--debug-graph_exec                   bool           false
--onika-parallel_task_core_mult      int            ONIKA_TASKS_PER_CORE
--onika-parallel_task_core_add       int            0
--onika-gpu_sm_mult                  int            ONIKA_CU_MIN_BLOCKS_PER_SM
--onika-gpu_sm_add                   int            0
--onika-gpu_block_size               int            ONIKA_CU_MAX_THREADS_PER_BLOCK
--onika-gpu_disable_filter           StringVector   {}
--nogpu                              bool           false
--mpimt                              bool           true
--pinethreads                        bool           false
--threadrotate                       int            0
--omp_num_threads                    int            -1
--omp_max_nesting                    int            -1
--omp_nested                         bool           false
--plugin_dir                         std::string    USTAMP_PLUGIN_DIR
--plugin_db                          std::string    ""
--plugins                            StringVector   {}
--generate_plugins_db                bool           false
--help                               std::string    ""
--run_unit_tests                     bool           false
--set                                YAML::Node     (none)

8.4. Tune your run with OpenMP

Harnessing the power of OpenMP parallelization, ExaNBody provides users with the ability to fine-tune their execution environment for optimal performance on multicore architectures. Through a selection of command-line options, users can customize thread management, affinity settings, and loop scheduling to maximize parallel efficiency. This subsection introduces the command-line options available for configuring OpenMP behavior within ExaNBody.

ExaNBody OpenMP Command Lines

  • Pin OpenMP threads
    • Command line : --pinethreads true
    • Description : Controls thread affinity settings within the OpenMP runtime, influencing how threads are bound to CPU cores for improved performance, particularly on NUMA architectures.
    • Default : false

  • Set the number of threads
    • Command line : --omp_num_threads 10
    • Description : Specifies the number of threads to be used for parallel execution, allowing users to control the degree of parallelism based on system resources and workload characteristics.
    • Default : the maximum number of threads available

  • Maximum level of nested parallelism
    • Command line : --omp_max_nesting [max_nesting_level]
    • Description : Specifies the maximum level of nested parallelism allowed within OpenMP, controlling the depth at which parallel regions can be nested.
    • Default : -1

  • Nested parallelism within OpenMP
    • Command line : --omp_nested [true/false]
    • Description : Enables or disables nested parallelism within OpenMP, allowing parallel regions to spawn additional parallel regions. true may be replaced by an integer indicating how many levels of nested parallelism are allowed.
    • Default : false

  • MPI_THREAD_MULTIPLE feature request
    • Command line : --mpimt [true|false]
    • Description : If set to true, requires that the MPI implementation supports the MPI_THREAD_MULTIPLE feature.
    • Default : true
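
A typical invocation combining these options (the thread count is illustrative and should match your node topology):

./exaDEM myinput.msp --pinethreads true --omp_num_threads 16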

8.5. Tune GPU execution options

Harnessing the power of GPU parallelization, ExaNBody provides users with the ability to fine-tune their execution environment for optimal performance on GPU accelerators. Through a selection of command-line options, users can customize GPU configuration management.

ExaNBody GPU Command Lines

  • Disable GPU
    • Command line : --nogpu
    • Description : Disables the use of GPU accelerators, even if some are available.
    • Default : false

  • Workgroup / block size
    • Command line : --onika-gpu_block_size N
    • Description : Sets the default thread block size to N.
    • Default : 128

  • Selectively disable GPU
    • Command line : --onika-gpu_disable_filter [ "sim.loop.compute_force.hook_force" , ".*hook_force" ]
    • Description : A list of regular expressions matching operator paths for which GPU execution will be disabled.
    • Default : []
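
Hedged command-line examples of these options (the block size value is illustrative):

# run entirely on CPU, even if GPUs are available
./exaDEM myinput.msp --nogpu
# keep the GPU but use 256-thread blocks
./exaDEM myinput.msp --onika-gpu_block_size 256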

8.6. Profiling tools available in exaNBody

ExaNBody offers a comprehensive suite of performance profiling tools designed to help users analyze and optimize their parallel applications. These tools provide valuable insights into runtime behavior, resource utilization, and performance bottlenecks, enabling developers to fine-tune their code for maximum efficiency. From CPU profiling to memory analysis, ExaNBody’s profiling tools offer a range of capabilities to meet diverse profiling needs. This section introduces the profiling tools available within ExaNBody. Some of them are accessible through configuration flags, while others require specific operator nodes to be inserted in the simulation description.

ExaNBody Profiling Tools Command Lines

  • Timers (configuration flag)
    • Command line / Operator : --profiling-summary [true|false]
    • Default : true
    • Description : Displays timer information for every bottom-level or batch operator node in the simulation graph. If two percentages are shown, the second one is the portion of time spent in the nearest enclosing loop batch operator (i.e. compute_loop).

  • VITE trace (configuration flag)
    • Command line / Operator : --profilingtrace-file [true|false]
    • Default : ""
    • Description : Generates a VITE trace describing CPU thread occupation (GPU threads are not covered).

  • Resident memory (configuration flag)
    • Command line / Operator : --profiling-resmem [true|false]
    • Default : false
    • Description : Displays the evolution of resident memory during execution. Helps track memory leaks.

  • Performance adviser (operator)
    • Command line / Operator : not present by default; can be inserted after the initialization phase or after the first iteration, e.g. using +first_iteration: [ performance_adviser ]
    • Default : performance_adviser: { verbose: true }
    • Description : Displays tips about parameters to tweak to improve simulation performance (cell size, number of MPI processes, etc.).
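
Following the command-line-to-YAML conversion rules described earlier, these flags can also be set directly in the configuration block of the input file. A minimal sketch enabling the timer summary and resident-memory tracking:

configuration:
  profiling:
    summary: true
    resmem: true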

8.7. Using Timers with MPI and GPU

In ExaNBody, timers are essential tools for measuring performance in MPI and GPU-accelerated computations. This section explores their use within ExaNBody’s parallel implementations, providing insights into runtime behavior and performance characteristics.

This tool provides the list of timers for every operator in a hierarchical form:
  • Number of calls

  • CPU Time

  • GPU Time

  • Imbalance time between MPI processes (average and maximum)

  • Execution time ratio

The imbalance value is computed as: ` I = T_max / T_ave - 1 = (T_max - T_ave) / T_ave `. A short worked example follows the list of variables below.

With the variables:
  • T_max is the execution time of the slowest MPI process.

  • T_ave is the average time spent over MPI processes.

  • I is the imbalance value.
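
For instance, if the slowest MPI process spends T_max = 12 s in an operator while the average over all processes is T_ave = 10 s, the reported imbalance is I = (12 - 10) / 10 = 0.2.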

Note that if you force your simulation to stop, the timers are automatically printed in your terminal.

Output with OpenMP:

Profiling .........................................  tot. time  ( GPU )   avginb  maxinb     count  percent
sim ...............................................  2.967e+04            0.000   0.000         1  100.00%
... // some logs
  loop ............................................  2.964e+04            0.000   0.000         1  99.88%
    scheme ........................................  2.881e+04            0.000   0.000    100000  97.09%
      combined_compute_prolog .....................  2.300e+03            0.000   0.000    100000   7.75%
      check_and_update_particles ..................  1.016e+04            0.000   0.000    100000  34.25%
        particle_displ_over .......................  2.154e+03            0.000   0.000    100000   7.26%
        update_particles_full .....................  6.482e+03            0.000   0.000      5961  21.84%
          update_particles_full_body ..............  6.474e+03            0.000   0.000      5961  21.82%
            compact_neighbor_friction .............  1.621e+02            0.000   0.000      5961   0.55%
            move_particles_friction ...............  6.347e+02            0.000   0.000      5961   2.14%
            trigger_load_balance ..................  2.591e+02            0.000   0.000      5961   0.87%
              trigger_lb_tmp ......................  6.095e+00            0.000   0.000      5961   0.02%
                nth_timestep ......................  3.342e+00            0.000   0.000      5961   0.01%
              extend_domain .......................  2.389e+02            0.000   0.000      5961   0.80%
...

Output with MPI:

Profiling .........................................  tot. time  ( GPU )   avginb  maxinb     count  percent
sim ...............................................  2.376e+04            0.000   0.000         1  100.00%
... // some logs
  loop ............................................  2.372e+04            0.000   0.000         1  99.82%
    scheme ........................................  2.308e+04            0.086   2.249    100000  97.13%
      combined_compute_prolog .....................  5.779e+02            0.280   2.937    100000   2.43%
      check_and_update_particles ..................  1.687e+04            0.454   2.770    100000  70.97%
        particle_displ_over .......................  4.067e+03            0.687   2.643    100000  17.11%
        update_particles_full .....................  1.159e+04            0.167   0.812      6001  48.78%
          update_particles_full_body ..............  1.159e+04            0.167   0.813      6001  48.76%
            compact_neighbor_friction .............  7.170e+01            0.387   0.876      6001   0.30%
            move_particles_friction ...............  1.797e+02            0.254   0.853      6001   0.76%
            trigger_load_balance ..................  9.340e+01            0.674   1.787      6001   0.39%
              trigger_lb_tmp ......................  2.582e+00            0.187   2.836      6001   0.01%
                nth_timestep
              extend_domain .......................  8.655e+01            0.733   2.016      6001   0.36%
...

8.8. Debug features in exaNBody

ExaNBody is equipped with a range of debug features tailored to aid developers in the debugging process. This section outlines the debug functionalities available within ExaNBody, providing developers with essential tools to diagnose and resolve issues effectively.

ExaNBody Debug Command Lines

  • Simulation graph
    • Command line : --debug-graph [true|false]
    • Description : Prints the simulation graph as built after the user input file, included files, and command-line arguments are processed.

  • User input YAML document
    • Command line : --debug-yaml [true|false]
    • Description : Prints the final YAML document used by the application after the user input file, included files, and command-line arguments are merged.

  • Output ldbg
    • Command line : --logging-debug true
    • Description : Prints the debug logs added with ldbg <<.

  • Filtering debug output
    • Command line : --debug-filter ["regexp1","regexp2",…]
    • Description : Filters which operator nodes output debug messages with ldbg << "…". Each regexp is a regular expression matched against the operator pathname, i.e. its name within blocks and sub-blocks; for instance, "sim.first_iteration.compute_force.lj_force" can be filtered differently than "sim.compute_loop.compute_force.lj_force". Alternatively, a filter expression such as ".*lj_force" activates debug messages for all instances of the lj_force operator.

How to use the ldbg output:

It can be activated for specific operators only:
  • Command line : --logging-debug true --debug-filter [".*operator1",".*operator2",…]

  • Input file : via the logging and debug configuration entries, as in the example below

Example in your input file (.msp):

configuration:
  logging: { debug: false , parallel: true }
  debug:
    filter: [ ".*init_neighbor_friction" , ".*move_particles_friction" , ".*check_nbh_friction" , ".*compact_neighbor_friction" , ".*extend_domain" ]

9. Tutorials

9.1. Add a new plugin

You can create new plugins if you wish to develop a new set of operators that can be compiled independently of others or require specific plugin compilation. Here’s how to define and add a new plugin.

First of all, create your new plugin directory and register it in src/CMakeLists.txt:

mkdir my_new_plugin
# register the plugin by adding the following line to src/CMakeLists.txt:
#   add_subdirectory(my_new_plugin)
cd my_new_plugin
vi CMakeLists.txt

Create a CMakeLists.txt and use the provided macros to plug your operators into the exaNBody support, such as:

set(list_of_plugins_required exanbIO exanbDefBox exanbParticleNeighbors exadem_numerical_scheme exadem_friction exadem_force_field)
set(exadem_my_new_plugin_LINK_LIBRARIES ${list_of_plugins_required})
xstamp_add_plugin(exadem_my_new_plugin ${CMAKE_CURRENT_SOURCE_DIR})

Warning: Do not forget to run make UpdatePluginDataBase.

9.2. Add a new operator

Initially, it’s crucial to establish a precise definition of the intended kernel, the targeted data, and the method for executing this kernel.

9.2.1. Define your kernel

We recommend writing the computation kernel (in a functor) in a .h file within an include/exaDEM folder of your plugin directory. For instance, if one intends to apply gravitational force to a particle, it’s necessary to have knowledge of the external force and access to mutators for modifying the force fields. The kernel for a particle is then written in a file include/exaDEM/gravity_force.h:

namespace exaDEM
{
  struct GravityForceFunctor
  {
    exanb::Vec3d g = { 0.0 , 0.0 , -9.807};
    ONIKA_HOST_DEVICE_FUNC inline void operator () (double mass, double& fx, double& fy, double& fz ) const
    {
      fx += g.x * mass;
      fy += g.y * mass;
      fz += g.z * mass;
    }
  };
}

It is possible to provide specific kernel traits that will be used by the data traversal functions (e.g. ComputeCellParticles). For example, to declare GPU compatibility for this kernel:

namespace exanb
{
  template<> struct ComputeCellParticlesTraits<exaDEM::GravityForceFunctor>
  {
    static inline constexpr bool CudaCompatible = true;
  };
}

9.2.2. Define an operator:

To begin with, all operators inherit from the OperatorNode class, where you declare the physical fields required to apply your operator. Following that, it is essential to define the slots that establish the execution graph among operators. Finally, you are required to provide an execute() function, which the graph will run, along with a documentation() function.

This is an example of an empty operator:

#include <exaDEM/your_kernel.h>
namespace exaDEM
{
  using namespace exanb;
  template<typename GridT
    , class = AssertGridHasFields< GridT, list_of_fields>
    >
  class YourOperator : public OperatorNode
  {
    using ComputeFields = FieldSet< list_of_fields>;
    static constexpr ComputeFields compute_field_set {};
    ADD_SLOT( GridT  , grid     , INPUT_OUTPUT );
  public:

    inline std::string documentation() const override final
    {
      return R"EOF(Your operator.)EOF";
    }

    inline void execute () override final
    {
      YourFunctor func;
      compute_cell_particles( *grid , false , func , compute_field_set , gpu_execution_context() , gpu_time_account_func() );
    }
  };

  template<class GridT> using YourOperatorTmpl = YourOperator<GridT>;

  // === register factories ===
  CONSTRUCTOR_FUNCTION
  {
    OperatorNodeFactory::instance()->register_factory( "your_operator_name", make_grid_variant_operator< YourOperatorTmpl > );
  }
}

9.2.2.1. Comments:

  • Please refer to src/forcefields/gravity_force.cpp for a simple example.

  • AssertGridHasFields eliminates grid types that do not provide the required fields.

  • If ComputeFields does not correspond to the input parameters of your functor, the operator will not compile.

  • The ADD_SLOT macro works as ADD_SLOT(Type, Name, TypeOfSlot, DefaultValue, Documentation) with:
    • Type: double, int, ...

    • TypeOfSlot: INPUT, OUTPUT or INPUT_OUTPUT

    • Documentation: a DocString type: DocString{slot documentation}

  • To access a slot value, dereference it with *.

  • Constant values are carried by slots and passed to the functor; for example, the gravity functor is initialized as GravityForceFunctor func { *gravity };, where gravity is a slot (see the sketch after this list).

  • Note that you need to specialize your operator for a given grid and register the operator name in your dictionary.
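
Putting these remarks together, here is a minimal sketch of a constant-valued slot declared with a default value and a DocString, then dereferenced to initialize the functor. The slot type, default value, and documentation string are illustrative, not necessarily those used in gravity_force.cpp:

    // hypothetical Vec3d-valued INPUT slot with a default value and a documentation string
    ADD_SLOT( Vec3d , gravity , INPUT , Vec3d{0.0, 0.0, -9.807} , DocString{"gravity vector applied to all particles"} );

    inline void execute () override final
    {
      // a slot is dereferenced with * to obtain its value
      GravityForceFunctor func { *gravity };
      compute_cell_particles( *grid , false , func , compute_field_set , gpu_execution_context() , gpu_time_account_func() );
    }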

9.2.2.2. Comment:

You can write your own traversal function by explicitly iterating over cells. Example:

ADD_SLOT( GridT  , grid            , INPUT_OUTPUT );
inline void execute () override final
{
  auto cells = grid->cells();                   // per-cell particle storage
  IJK dims = grid->dimension();                 // grid dimensions, including ghost layers
  size_t ghost_layers = grid->ghost_layers();
  IJK dims_no_ghost = dims - (2*ghost_layers);  // inner (non-ghost) region
# pragma omp parallel
  {
    GRID_OMP_FOR_BEGIN(dims_no_ghost,_,loc_no_ghosts)
    {
      IJK loc = loc_no_ghosts + ghost_layers;   // shift back to coordinates including ghost layers
      size_t cell_i = grid_ijk_to_index(dims,loc);
      auto& cell_ptr = cells[cell_i];
      const size_t n = cells[cell_i].size();
      auto* __restrict__ _my_field = cell_ptr[field::my_field];
#     pragma omp simd
      for(size_t j=0;j<n;j++)
        _my_field[j] += 27;
    }
  }
}

10. ExaNBody Publications:

Main citation:

Carrard, T., Prat, R., Latu, G., Babilotte, K., Lafourcade, P., Amarsid, L., & Soulard, L. (2023). ExaNBody: a HPC framework for N-Body applications. (Carrard)

Field High Performance Computing:

Carrard, T., Prat, R., Latu, G., Babilotte, K., Lafourcade, P., Amarsid, L., & Soulard, L. (2023). ExaNBody: a HPC framework for N-Body applications. (Carrard)

Prat, R., Carrard, T., Soulard, L., Durand, O., Namyst, R., & Colombet, L. (2020). AMR-based molecular dynamics for non-uniform, highly dynamic particle simulations. Computer Physics Communications, 253, 107177.

Prat, R., Colombet, L., & Namyst, R. (2018, August). Combining task-based parallelism and adaptive mesh refinement techniques in molecular dynamics simulations. In Proceedings of the 47th International Conference on Parallel Processing (pp. 1-10).

Dirand, E., Colombet, L., & Raffin, B. (2018). Tins: A task-based dynamic helper core strategy for in situ analytics. In Supercomputing Frontiers: 4th Asian Conference, SCFA 2018, Singapore, March 26-29, 2018, Proceedings 4 (pp. 159-178). Springer International Publishing.

Cieren, E., Colombet, L., Pitoiset, S., & Namyst, R. (2014). ExaStamp: A parallel framework for molecular dynamics on heterogeneous clusters. In Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II 20 (pp. 121-132). Springer International Publishing.

Field Physics / Molecular Dynamics:

Soulard, L., Durand, O., Burie, J. R., & Babilotte, K. (2024). Micro-jetting: Areal density calculation from a triangular groove. Journal of Applied Physics, 135(20).

Soulard, L. (2023). Micro-jetting: A semi-analytical model to calculate the velocity and density of the jet from a triangular groove. Journal of Applied Physics, 133(8).

Soulard, L., Carrard, T., & Durand, O. (2022). Molecular dynamics study of the impact of a solid drop on a solid target. Journal of Applied Physics, 131(13).

Soulard, L., Durand, O., Prat, R., & Carrard, T. (2021). High velocity impact of a spherical particle on a surface: Theory and simulation of the jet formation. Journal of Applied Physics, 129(20).

Soulard, L., & Durand, O. (2020). Observation of phase transitions in shocked tin by molecular dynamics. Journal of Applied Physics, 127(16).

Durand, O., Soulard, L., Colombet, L., & Prat, R. (2020). Influence of the phase transitions of shock-loaded tin on microjetting and ejecta production using molecular dynamics simulations. Journal of Applied Physics, 127(17).

Durand, O., Jaouen, S., Soulard, L., Heuze, O., & Colombet, L. (2017). Comparative simulations of microjetting using atomistic and continuous approaches in the presence of viscosity and surface tension. Journal of Applied Physics, 122(13).

11. Example for rst usage (math, code etc)

11.1. Example of citation:

To cite a paper, place the appropriate reference in the file source/Bibliography.rst. Then, you can cite it using this kind of command:

My paper is written by (AaBbCcDd et al.)

11.2. Example of math with latex as equation :

(1)\[\frac{ \sum_{t=0}^{N}f(t,k) }{N}\]

11.3. Example of math with latex as inline text :

When \(a > 0\), there are two solutions to \(ax^2 + bx + c = 0\), and they are exactly $$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

The double dollar sign indicates a “standalone” equation on its own line.

11.4. Example of bash code :

mkdir my_new_plugin
add_subdirectory(my_new_plugin)
cd my_new_plugin
vi CMakeLists.txt

11.5. Example of cpp code :

namespace exaSPH
{
  struct TestFunctor
  {
    exanb::Vec3d a = { 1.0 , 2.0 , 3.0};
    ONIKA_HOST_DEVICE_FUNC inline void operator () (double mass, double& fx, double& fy, double& fz ) const
    {
      fx += a.x * mass;
      fy += a.y * mass;
      fz += a.z * mass;
    }
  };
}

11.6. Example of yaml code :

domain:
  cell_size: 2 m
  periodic: [false,true,false]

11.7. Example of simple figure :

map to buried treasure

Figure 1: This is the caption of the figure (a simple paragraph).