Distribute BQSKit Across a Cluster
This guide describes how to launch a BQSKit Runtime Server in detached mode on one or more computers, connect to it, and perform compilations on the server. The detached mode allows for greater parallelization from cluster computing all the way up to supercomputers.
Detached Runtime Architecture
When distributing a runtime across many computers, we need to start it in
detached mode, where the server is “detached” from the client, i.e., started
and stopped independently. As opposed to the attached mode, which happens
transparently to the user whenever they instantiate a Compiler
or call the compile()
function.
The detached runtime architecture consists of four types of entities:
Clients who submit compilation tasks
A Server that acts as a central controller
Managers that manage workers and act as a liaison between the server and workers
Workers that perform the actual computation
Starting a Detached Server
To start a detached server, we first must start the managers, who will spin up workers, then wait to connect to a server. To do this, launch the following executable:
bqskit-manager
See bqskit-manager --help
for info on options. A manager should be started
on all shared memory (typically individual computers/nodes) systems first.
Then, we launch the server and provide the IP addresses of all the managers
to connect to:
bqskit-server <manager-ip-1> <manager-ip-2> ...
See bqskit-server --help
for info on potential options.
Optionally Configuring Worker Ranks
There is an optional, slightly altered way to start the managers that allows more fine-grained control over worker ranks. This method is helpful if you need to specify environment variables to different groups of workers on the same system. For example, in a multi-GPU system, you may use environment variables to control access to GPUs and want to split the workers to different GPUs but still have one manager responsible for the lot.
To accomplish this, we first start the manager, then the workers independently,
and finally the server as usual. In this start-up method, we need to tell
the manager to connect to the workers rather than spawn them. This is
done with -x
flag:
bqskit-manager -x
Then, you start the workers and connect them to the manager:
ENV_VAR=SOMETHING bqskit-worker <num_workers_in_rank>
See bqskit-worker --help
for more info. You can also use the --cpus
flag to pin workers to cores.
Spawning a Manager Hierarchy
Typically, in a small cluster, managers will directly manage workers and report to the central server. However, this is not a necessity. Managers can manage other managers and work under different managers. This hierarchical, potentially-unbalanced architecture may be helpful in heterogenous clusters or large supercomputers where more explicit control over communication is desired.
While spawning the server, we spawn the lower-level managers first, then work our way up until we complete the runtime with a server. The lowest-level workers and managers can be spawned the same as above, but now we spawn the next-level up managers by passing in the IP address of the lower-level managers:
bqskit-manager -m <lower-level-manager-ip-1> <lower-level-manager-ip-2> ...
Lastly, the server is started the same way:
bqskit-server <most-senior-manager-ip-1> <most-senior-manager-ip-2> ...
Shutting Down a Server
Interrupting the central server will properly shut down the entire runtime architecture and clean up all resources. A client cannot shut down the runtime in detached mode.
Connecting to and Compiling with a Server
Once a bqskit server has been started in detached mode, we can connect to it
by simply passing its IP address into a Compiler
constructor or as a
keyword argument in the compile()
method:
with Compiler(ip='server_ip') as compiler:
compiler.compile(...)
or
bqskit.compile(..., ip='server_ip')
Sample SLURM Script Template
Below is a sample SLURM script template that can be filled in to start a detached runtime and execute a client script all-in-one.
#!/bin/bash
#SBATCH --job-name={name}
#SBATCH -A {account}
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH -t {timelimit}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --output={out_log}
#SBATCH --error={err_log}
module load python &> /dev/null
conda activate {pythonenv} &> /dev/null
echo "Starting runtime managers..."
srun --output {manager_log}-%t bqskit-manager -n{num_workers} -v &
count_started_managers() {{
for log in $(find $(dirname {manager_log}) -wholename "{manager_log}-*");
do
grep -q "Started outgoing thread." $log && printf ".";
done | wc -m
}}
while [ $(count_started_managers | xargs echo -n) -lt {nodes} ]; do
sleep 1
done
echo "Starting runtime server..."
bqskit-server $(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\\n' ' ') -v &> {server_log} &
server_pid=$!
sleep 1
echo "Running compilation workflow..."
python {python_script} localhost
echo "Shutting down server."
kill -2 $server_pid