5.2. Parallelization

5.2.1. MPI Parallelization

5.2.1.1. Domain Decomposition

5.2.1.2. Cost Model

The weight associated with each cell, used for load distribution with the RCB method, is computed as follows:

\[W_i = 1 + 3\,P_i + \sum_{j=0}^{N_i-1} F\left(I_i[j].\mathrm{type}\right)\]

with \(W_i\) the weight of cell \(i\) (the array \(W\) is indexed by cell id), \(P_i\) the number of particles in cell \(i\), \(I_i\) the interactions in cell \(i\), \(N_i\) the number of interactions, and \(F\) the cost function associated with the interaction type, tabulated below. A worked sketch of this computation follows the table.

F values

Value   Type                    F(Value)
-----   ---------------------   --------
0       Vertex - Vertex         1
1       Vertex - Edge           3
2       Vertex - Face           5
3       Edge - Edge             4
4       Vertex - Cylinder       1
5       Vertex - Surface        1
6       Vertex - Ball           1
7       Vertex - Vertex (STL)   1
8       Vertex - Edge (STL)     3
9       Vertex - Face (STL)     5
10      Edge - Edge (STL)       4
11      Vertex (STL) - Edge     3
12      Vertex (STL) - Face     5

5.2.2. Thread Parallelization

5.2.3. GPU Support

5.2.4. Benchmarks

5.2.4.1. Blade - CPU (Polyhedron)

Information:

  • Blade mesh: 25,952 faces, 77,856 edges, 28,497 vertices

  • Number of timesteps: 10,000

  • Number of polyhedra: 1,651,637

  • This benchmark reproduces the tutorial simulation (Blade) with polyhedra five times smaller; see the images below.

  • Tests the hybrid MPI + OpenMP parallelization

  • Machine used: CCRT-IRENE, with:

    • Nodes: dual-socket Intel Skylake

    • 24 cores per processor

    • Frequency: 2.7 GHz

    • Runs scaled up to 16,384 cores

../../_images/full.0031.png
../../_images/tranche.0031.png
../../_images/exadem-blade.png

Some remarks:

  • For this test, the simulation domain is split into 39,000 cells and then distributed among the MPI processes. Some configurations fail because there are not enough cells per MPI process.

  • Best results are obtained with 24 OpenMP threads per MPI process, which matches the number of cores per socket; NUMA effects are likely to reduce performance beyond this. A minimal sketch of the hybrid pattern follows.
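
As a rough illustration of the MPI + OpenMP pattern exercised here, the sketch below initializes MPI with funneled thread support and parallelizes a per-cell loop with OpenMP inside each rank. It is a minimal standalone example under assumed names and workloads, not ExaDEM's driver code.

// Minimal hybrid MPI + OpenMP sketch (not ExaDEM's actual code):
// e.g. one MPI rank per socket, one OpenMP thread per core.
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  int provided;
  // FUNNELED: only the main thread issues MPI calls.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Each rank owns a share of the cells (39,000 in this benchmark);
  // with too many ranks, some would end up with too few cells.
  std::vector<double> cells(39000 / nprocs, 1.0);

  // Threads of a rank share its cells, e.g. 24 threads per rank to
  // match one Skylake socket and limit NUMA traffic.
  #pragma omp parallel for
  for (std::size_t c = 0; c < cells.size(); ++c)
    cells[c] *= 2.0;  // stand-in for the per-cell force kernel

  if (rank == 0)
    std::printf("%d ranks x %d threads\n", nprocs, omp_get_max_threads());
  MPI_Finalize();
  return 0;
}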

5.2.4.2. Rotating Drum - CPU (Sphere)

This benchmark was presented in 2023 in the paper []: “ExaNBody: a HPC framework for N-Body applications”.

Note

We are no longer able to reproduce this performance, as the code has changed (hopefully for the better); the results below show the performance of the first prototype. The input files are no longer available either: the operators and the data structures have changed, notably with the improvements to interactions and the switch, in OpenMP/GPU, to parallelization per interaction instead of per particle.

Description:

Different OpenMP/MPI configurations (number of cores and threads per MPI process) were tested to balance the multi-level parallelism. Both simulations were instrumented over 1,000 representative iterations. The performance of ExaDEM was evaluated on up to 256 cluster nodes, each with two 64-core AMD EPYC Milan 7763 processors running at 2.45 GHz and 256 GB of RAM.

../../_images/mpi-dem-example-100M-modified.png

DEM simulation of 100 million spheres in a rotating drum.

ExaDEM’s performance is evaluated with a simulation of a rotating drum containing 100 million spherical particles, shown above. This setup is a tough benchmark: due to gravity, particles move rapidly throughout a heterogeneously dense domain. Additionally, the contact force model employed has a low arithmetic intensity, and ExaDEM must handle pairwise friction information that is updated at every kernel call and must migrate between MPI processes when subdomains are redistributed; a sketch of this constraint follows.
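
To illustrate that last point, the sketch below shows one possible shape for per-pair friction state and how it could be shipped as raw bytes when a cell changes owner. The record layout and the migrate_friction function are assumptions for illustration, not ExaDEM's implementation.

// Hypothetical per-pair friction record; ExaDEM's real layout differs.
#include <mpi.h>
#include <cstdint>
#include <vector>

struct FrictionPair {
  std::uint64_t id_i, id_j;  // global ids of the interacting particles
  double ft[3];              // accumulated tangential (friction) force
};

// When RCB hands a cell to another rank, its friction history must
// follow the particles. POD records can be sent as raw bytes.
void migrate_friction(const std::vector<FrictionPair>& outgoing,
                      int dest, MPI_Comm comm) {
  MPI_Send(outgoing.data(),
           static_cast<int>(outgoing.size() * sizeof(FrictionPair)),
           MPI_BYTE, dest, /*tag=*/0, comm);
}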

../../_images/mpi-dem-example-100M-mpi.png

Domain decomposition of 100 million spheres in a rotating drum

../../_images/drum_dem_100M.png

Speedup for different OpenMP/MPI configurations: ExaDEM simulation with 1, 8, and 128 threads per MPI process.

Note

Note that each Milan node comprises 128 cores spread over 8 NUMA nodes, and we observed that NUMA effects reduce overall performance.

../../_images/drum_dem_100M_comp.png

Operator speedup according to the total number of cores used.

../../_images/drum_dem_100M_comp_pourcentage.png

Operator time ratios at different parallelization scales.

5.2.4.3. Rotating Drum - GPU (Polyhedron)

This example is defined in the repository: https://github.com/Collab4exaNBody/exaDEM-benchmark/tree/main/rotating-drum-poly . This simulation is run on an A100 GPU with 32 CPU cores. Result format: loop time (Update Particles / Force Field).

GPU Benchmarks

Version             Case 10k           Case 80k
------------------  -----------------  -----------------
v1.0.1 (06/24)      28.1 (17.1/6.8)    71.6 (37.8/26.0)
v1.0.2 (11/24)      23.3 (17.7/4.1)    48.9 (33.0/13.8)
master (05/12/24)   6.38 (2.61/2.49)   17.6 (11.65/4.3)
v1.1.0 (27/03/25)   6.99 (2.43/2.56)   17.3 (10.94/4.4)