GPU and profiling¶
In this example, the ring network created in an earlier tutorial will be used to run the model with a GPU. In addition, it is shown how to profile the performance difference. Only the differences with that tutorial will be described.
Note
Concepts covered in this example:
Building a
arbor.context
that’ll use a GPU. This requires that you have built Arbor with GPU support enabled.Build a
arbor.domain_decomposition
and provide aarbor.partition_hint
.Profile an Arbor simulation using
arbor.meter_manager
.
The hardware context¶
An execution context describes the hardware resources on
which the simulation will run. It contains the thread pool used to parallelise
work on the local CPU, and optionally describes GPU resources and the MPI
communicator for distributed simulations. In some other examples, the
arbor.single_cell_model
object created the execution context
arbor.context
behind the scenes. The details of the execution context
can be customized by the user. We may specify the number of threads in the
thread pool; determine the id of the GPU to be used; or create our own MPI
communicator.
Step (11) creates a hardware context where we set the
gpu_id
. This requires that you have built
Arbor manually, with GPU support (see here how to do
that). On a regular consumer device with a single GPU, the index you should pass
is 0
. Change the value to run the example with and without GPU. The number
of threads threads
are (when no MPI is used) set to
arbor.env.thread_concurrency()
. This value corresponds to the number of
locally available threads as best as can be established by Arbor at the start of
the program.
Note
If you use GPUs in combination with MPI, consider using find_private_gpu()
.
# (11) Set up the hardware context
# gpu_id set to None will not use a GPU.
# gpu_id=0 instructs Arbor to the first GPU present in your system
context = A.context(gpu_id=None)
print(context)
Profiling¶
Arbor comes with a arbor.meter_manager
to help you profile your
simulations. In this case, you can run the example with gpu_id=None
and
gpu_id=0
and observe the difference with the meter_manager
.
If you are interested in more detailled report, Arbor also offers a region based
profiler which is aimed at developers and must be enabled at build time.
Step (12) sets up the meter manager and starts it using the (only) context. This way, only Arbor related execution is measured, not Python code.
Step (13) instantiates the recipe and sets the first checkpoint on the meter manager. We now have the time it took to construct the recipe.
# (12) Set up and start the meter manager
meters = A.meter_manager()
meters.start(context)
# (13) Instantiate recipe
ncells = 50
recipe = ring_recipe(ncells)
meters.checkpoint("recipe-create", context)
The domain decomposition¶
The domain decomposition describes the distribution of the cells over the
available computational resources. The arbor.single_cell_model
also
handled that without our knowledge in the previous examples. Now, we have to
define it ourselves.
The arbor.domain_decomposition
class can be manually created by the
user, by deciding which cells go on which ranks. Or we can use a load balancer
that can partition the cells across ranks according to some rules. Arbor
provides arbor.partition_load_balance
, which, using the recipe and
execution context, creates the arbor.domain_decomposition
object for
us.
A way to customize arbor.partition_load_balance
is by providing a
arbor.partition_hint
. They let you configure how cells are distributed
over the resources in the context
, but without requiring you to
know the precise configuration of a context
up front. Whether
you run your simulation on your laptop CPU, desktop GPU, CPU cluster of GPU
cluster, using partition hints
you can just say:
use GPUs, if available. You only have to change the context
to
actually define which hardware Arbor will execute on.
Step (14) creates a arbor.partition_hint
, and tells it to put 1000
cells in a groups allocated to GPUs, and to prefer the utilisation of the GPU if
present. In fact, the default distribution strategy of
arbor.partition_load_balance
already spreads out cells as evenly as
possible over CPUs, and groups (up to 1000) on GPUs, so strictly speaking it was
not necessary to give that part of the hint. Lastly, a dictionary is created
with which hints are assigned to a particular arbor.cell_kind
.
Different kinds may favor different execution, hence the option. In this
simulation, there are only arbor.cell_kind.cable
, so we assign the hint
to that kind.
Step (15) creates a arbor.partition_load_balance
with the recipe,
context and hints created above. Another checkpoint will help us understand how
long creating the load balancer took.
# (14) Define a hint at to the execution.
hint = A.partition_hint()
hint.prefer_gpu = True
hint.gpu_group_size = 1000
print(hint)
hints = {A.cell_kind.cable: hint}
# (15) Domain decomp
decomp = A.partition_load_balance(recipe, context, hints)
print(decomp)
meters.checkpoint("load-balance", context)
The simulation¶
Step (16) creates a arbor.simulation
, sets the spike recorders to
record, creates a handle to their eventual results and makes another
checkpoint.
# (16) Simulation init and set spike generators to record
sim = A.simulation(recipe, context, decomp)
sim.record(A.spike_recording.all)
handles = [
sim.sample((gid, "Um"), A.regular_schedule(1 * U.ms)) for gid in range(ncells)
]
meters.checkpoint("simulation-init", context)
The execution¶
Step (17) runs the simulation. Since we have more cells this time, which are
connected in series, it will take some time for the action potential to
propagate. In the ring network we could see it
takes about 5 ms for the signal to propagate through one cell, so let’s set the
runtime to 5*ncells
. Then, another checkpoint, so that we’ll know how long
the simulation took.
# (17) Run simulation
sim.run(ncells * 5 * U.ms)
print("Simulation finished")
meters.checkpoint("simulation-run", context)
The results¶
The scientific results should be similar, other than number of cells, to those
in ring network, so we’ll not discuss them here.
Let’s turn our attention to the meter_manager
.
# (18) Results
# Print profiling information
print(f"{A.meter_report(meters, context)}")
Step (18) shows how arbor.meter_report
can be used to read out the
meter_manager
. It generates a table with the time between
checkpoints. As an example, the following table is the result of a run on a 2019
laptop CPU:
---- meters -------------------------------------------------------------------------------
meter time(s) memory(MB)
-------------------------------------------------------------------------------------------
recipe-create 0.000 0.059
load-balance 0.000 0.007
simulation-init 0.012 0.662
simulation-run 0.037 0.319
meter-total 0.049 1.048
The full code¶
You can find the full code of the example at python/examples/network_ring_gpu.py
.