GPU and profiling

In this example, the ring network created in an earlier tutorial is used to run the model on a GPU. In addition, it is shown how to profile the performance difference between running with and without a GPU. Only the differences with that tutorial will be described.

Note

Concepts covered in this example:

  1. Building an arbor.context that'll use a GPU. This requires that you have built Arbor with GPU support enabled.

  2. Building an arbor.domain_decomposition and providing an arbor.partition_hint.

  3. Profiling an Arbor simulation using arbor.meter_manager.

The hardware context

An execution context describes the hardware resources on which the simulation will run. It contains the thread pool used to parallelise work on the local CPU, and optionally describes GPU resources and the MPI communicator for distributed simulations. In some other examples, the arbor.single_cell_model object created the execution context arbor.context behind the scenes. The details of the execution context can be customized by the user: we may specify the number of threads in the thread pool, select the id of the GPU to be used, or provide our own MPI communicator.

Step (11) creates a hardware context where we set the gpu_id. This requires that you have built Arbor manually, with GPU support enabled (see here how to do that). On a regular consumer device with a single GPU, the index to pass is 0. Change the value to run the example with or without a GPU. The number of threads is (when no MPI is used) set to arbor.env.thread_concurrency(), which corresponds to the number of locally available threads, as best as Arbor can establish at the start of the program.

Note

If you use GPUs in combination with MPI, consider using find_private_gpu().
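
For the MPI case, a minimal sketch of how this could look (assuming an Arbor build with both MPI and GPU support, and mpi4py installed; not needed for the single-GPU run below):

# Sketch: one private GPU per MPI rank. Assumes Arbor was built with
# MPI and GPU support, and that mpi4py is available.
from mpi4py import MPI
import arbor as A

# Assign each rank a GPU that is not shared with other ranks on the same node.
gpu_id = A.env.find_private_gpu(MPI.COMM_WORLD)
context = A.context(mpi=MPI.COMM_WORLD, gpu_id=gpu_id)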

# (11) Set up the hardware context
# gpu_id set to None will not use a GPU.
# gpu_id=0 instructs Arbor to use the first GPU present in your system.
context = A.context(gpu_id=None)
print(context)
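
The thread count mentioned above can also be set explicitly; a short sketch using the environment helper from the text:

# Sketch: pin the thread pool to the detected hardware concurrency,
# and run without a GPU.
context = A.context(threads=A.env.thread_concurrency(), gpu_id=None)
print(context)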

Profiling

Arbor comes with an arbor.meter_manager to help you profile your simulations. In this case, you can run the example with gpu_id=None and gpu_id=0 and observe the difference with the meter_manager. If you are interested in a more detailed report, Arbor also offers a region-based profiler, which is aimed at developers and must be enabled at build time.
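
For reference, a sketch of what using that region-based profiler could look like (assuming a build with profiling enabled, e.g. -DARB_WITH_PROFILING=ON):

# Sketch: region-based profiling; only meaningful when Arbor was
# built with profiling enabled.
A.profiler_initialize(context)  # call once, after the context is created
# ... build and run the simulation as in the steps below ...
print(A.profiler_summary())     # per-region timing breakdown after the run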

Step (12) sets up the meter manager and starts it using the (only) context. This way, only Arbor-related execution is measured, not Python code.

Step (13) instantiates the recipe and sets the first checkpoint on the meter manager. We now have the time it took to construct the recipe.

# (12) Set up and start the meter manager
meters = A.meter_manager()
meters.start(context)

# (13) Instantiate recipe
ncells = 50
recipe = ring_recipe(ncells)
meters.checkpoint("recipe-create", context)

The domain decomposition

The domain decomposition describes the distribution of the cells over the available computational resources. The arbor.single_cell_model also handled this behind the scenes in the previous examples. Now, we have to define it ourselves.

The arbor.domain_decomposition class can be created manually by the user, by deciding which cells go on which ranks, or we can use a load balancer that partitions the cells across ranks according to certain rules. Arbor provides arbor.partition_load_balance, which, using the recipe and execution context, creates the arbor.domain_decomposition object for us.

A way to customize arbor.partition_load_balance is by providing an arbor.partition_hint. Hints let you configure how cells are distributed over the resources in the context, without requiring you to know the precise configuration of a context up front. Whether you run your simulation on your laptop CPU, desktop GPU, CPU cluster or GPU cluster, using partition hints you can just say: use GPUs, if available. You only have to change the context to define which hardware Arbor will actually execute on.

Step (14) creates an arbor.partition_hint, telling it to put up to 1000 cells in a group allocated to a GPU, and to prefer utilisation of the GPU if present. In fact, the default distribution strategy of arbor.partition_load_balance already spreads cells out as evenly as possible over CPUs, and groups them (up to 1000) on GPUs, so strictly speaking it was not necessary to give that part of the hint. Lastly, a dictionary is created in which hints are assigned to a particular arbor.cell_kind. Different kinds may favour different execution, hence the option. In this simulation there are only cells of arbor.cell_kind.cable, so we assign the hint to that kind.

Step (15) creates a arbor.partition_load_balance with the recipe, context and hints created above. Another checkpoint will help us understand how long creating the load balancer took.

# (14) Define a hint for the execution.
hint = A.partition_hint()
hint.prefer_gpu = True
hint.gpu_group_size = 1000
print(hint)
hints = {A.cell_kind.cable: hint}

# (15) Domain decomp
decomp = A.partition_load_balance(recipe, context, hints)
print(decomp)
meters.checkpoint("load-balance", context)

The simulation

Step (16) creates an arbor.simulation, switches on the recording of spikes, creates handles for sampling the membrane potential of every cell, and makes another checkpoint.

# (16) Simulation init and set spike generators to record
sim = A.simulation(recipe, context, decomp)
sim.record(A.spike_recording.all)
handles = [
    sim.sample((gid, "Um"), A.regular_schedule(1 * U.ms)) for gid in range(ncells)
]
meters.checkpoint("simulation-init", context)

The execution

Step (17) runs the simulation. Since we have more cells this time, which are connected in series, it will take some time for the action potential to propagate. In the ring network we saw that it takes about 5 ms for the signal to propagate through one cell, so let's set the runtime to 5*ncells ms. Then, another checkpoint, so that we'll know how long the simulation took.

# (17) Run simulation
sim.run(ncells * 5 * U.ms)
print("Simulation finished")
meters.checkpoint("simulation-run", context)

The results

The scientific results should be similar to those in the ring network, apart from the number of cells, so we'll not discuss them here.
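
If you do want to inspect the traces, the handles created in step (16) can be read out after the run; a minimal sketch:

# Sketch: read back the sampled membrane potentials.
# sim.samples(handle) returns a list of (data, meta) pairs, where data
# holds time in column 0 and the sampled value in column 1.
for gid, handle in enumerate(handles):
    data, meta = sim.samples(handle)[0]
    print(f"gid {gid}: {len(data)} samples, max Um = {data[:, 1].max():.2f} mV")

Now, let's turn our attention to the meter_manager.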

# (18) Results
# Print profiling information
print(f"{A.meter_report(meters, context)}")

Step (18) shows how arbor.meter_report can be used to read out the meter_manager. It generates a table with the time and memory consumed between checkpoints. As an example, the following table is the result of a run on a 2019 laptop CPU:

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
recipe-create                   0.000           0.059
load-balance                    0.000           0.007
simulation-init                 0.012           0.662
simulation-run                  0.037           0.319
meter-total                     0.049           1.048

The full code

You can find the full code of the example at python/examples/network_ring_gpu.py.