qsp-hpc-tools: HPC orchestration and caching for QSP simulation
What it does
QSP calibration and virtual-population work often need 10K to 100K simulations per scenario, and a clinical-prediction sweep across treatments multiplies that. qsp-hpc-tools runs these campaigns on a SLURM cluster over SSH and exposes a single Python interface that wraps either a MATLAB/SimBiology or a C++ qsp_sim backend. A content-hashed cache survives model edits, so re-running a calibration loop after tweaking a target or a scenario only re-simulates what actually changed.
Architecture overview
There’s one Python Simulator class per backend (QSPSimulator for MATLAB, CppSimulator for C++), both with the same call signature. Each handles syncing the codebase to the cluster over SSH, submitting and monitoring SLURM array jobs, and consulting a three-tier content-hashed cache that decides per call whether to hit the local pool, fetch a derived observable from the cluster, or run a fresh simulation array.
Multi-scenario sweeps (e.g. baseline vs. nab-paclitaxel + GVAX vs. resection-only) are built into the cache key: a parameter set used under multiple scenarios is simulated separately under each, but a scenario-agnostic derived observable is reused from the cache across scenarios.
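A minimal sketch of how such a sweep might look from the Python side; the import path, constructor arguments, file name, and scenario strings are assumptions, and only the shared simulate_with_parameters entry point and the backend symmetry come from this README.

```python
# Illustrative only: constructor arguments and scenario names are assumptions.
import numpy as np
from qsp_hpc_tools import CppSimulator  # assumed import path; QSPSimulator takes the same calls

theta = np.load("virtual_population.npy")  # (n, n_params) parameter matrix

for scenario in ("baseline", "nab_paclitaxel_gvax", "resection_only"):
    # Hypothetical constructor arguments.
    sim = CppSimulator(project_dir=".", scenario=scenario, host="hpc-login")
    # The same theta rows are simulated once per scenario, but scenario-agnostic
    # derived observables are reused from the cache across scenarios.
    results = sim.simulate_with_parameters(theta, backend="hpc")
```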
SLURM orchestration
HPCJobManager is the SSH-side glue. On submission it rsyncs the project directory to the cluster, writes a per-call array-job sbatch script, and polls squeue until the array finishes. Failed array indices are retried up to a configurable budget; anything still missing comes back as a NaN row in the result frame rather than as a hard exception, so a single bad parameter combination doesn't sink the whole sweep. A --dependency=afterok chain runs an observable-derivation job after the C++ simulation array, so the small per-sim summary values are computed on the cluster and only those are pulled back over SSH; the raw trajectories (gigabytes per sweep) stay on the cluster.
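The pattern HPCJobManager drives could be sketched roughly as below; the script contents, host name, and the derive_observables.py step are illustrative assumptions, not the package's actual templates.

```python
# Rough sketch of an array job plus an afterok-chained derivation job over SSH.
import subprocess

N_SIMS = 10_000
sbatch_array = f"""#!/bin/bash
#SBATCH --job-name=qsp_sim
#SBATCH --array=0-{N_SIMS - 1}%500
#SBATCH --time=00:30:00
./qsp_sim params/${{SLURM_ARRAY_TASK_ID}}.xml out/${{SLURM_ARRAY_TASK_ID}}.bin
"""

def submit(script: str, dependency: str = "") -> str:
    """Submit a script over SSH with sbatch and return the job id (hypothetical helper)."""
    cmd = ["ssh", "hpc-login", "sbatch", "--parsable"]
    if dependency:
        cmd.append(f"--dependency=afterok:{dependency}")
    out = subprocess.run(cmd, input=script, text=True, capture_output=True, check=True)
    return out.stdout.strip()

sim_job = submit(sbatch_array)
# Chain the observable-derivation job so it starts only after every array index
# succeeded; only its small per-sim summary values come back over SSH.
submit("#!/bin/bash\npython derive_observables.py out/", dependency=sim_job)
```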
Authentication is whatever your SSH config already does (key-based, jump hosts, MFA prompts).
Three-tier content-hashed cache
The cache pays off most when the model, priors, and scenarios are stable and you’re iterating on observable code or downstream analysis (e.g. adding a new observable to compute against an existing simulation pool, or re-running a posterior-predictive analysis with the same theta draws). When the model, a scenario, or the prior changes, the relevant tier invalidates and a fresh sweep runs. The tiers, in lookup order:
- Local pool. Simulation results cached on the laptop or workstation, keyed by model version, scenario, and the parameter set. Cache hits here cost nothing and are enough for re-running a posterior-predictive analysis or changing how a downstream observable is computed.
- HPC summary observables. Already-derived observable values stored on the cluster, keyed by the same identity. Pulled down on demand and cheaper than re-deriving from raw trajectories.
- HPC full simulations. Raw per-sim trajectories on the cluster, used when a new observable needs to be computed against an existing simulation pool.
If none hit, the simulator submits a fresh SLURM array. The config_hash covers the simulator binary content, the parameter-XML template, and the scenario YAML, so a simulator rebuild or a scenario edit invalidates exactly the affected tier with no manual cache management. Adding a new observable against an existing simulation pool only triggers tier 3, not a fresh sweep.
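A rough sketch of that content-hashed identity, assuming SHA-256 digests; the helper names and field layout are illustrative, but the idea is the one described above: hash file content rather than timestamps, and fold the scenario into the per-row key.

```python
# Sketch of a content-hashed cache identity (names and fields are assumptions).
import hashlib
import json
from pathlib import Path

def _digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def config_hash(binary: Path, param_xml: Path, scenario_yaml: Path) -> str:
    """Identity shared by all tiers: simulator binary + parameter-XML template + scenario YAML."""
    joined = "".join(_digest(p) for p in (binary, param_xml, scenario_yaml))
    return hashlib.sha256(joined.encode()).hexdigest()

def row_key(config: str, theta_row) -> str:
    """Per-simulation key: the same theta under a different scenario hashes differently."""
    payload = config + json.dumps([float(x) for x in theta_row])
    return hashlib.sha256(payload.encode()).hexdigest()
```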
The C++ backend
CppSimulator is the preferred path for new work; the MATLAB backend is the legacy path. C++ sources are generated by qsp-codegen from your SBML export, so model authoring stays in SimBiology and the C++ binary is a build artifact. qsp-hpc-tools handles the cluster-side build (pip-installing qsp-codegen into a project venv, running cmake, and caching the binary by content hash) and converts the raw-binary trajectory format the binary emits into MATLAB-compatible Parquet so downstream caching and observable derivation work unchanged.
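A hedged sketch of what the build-and-cache step might look like; the commands, paths, and helper name are assumptions, with only the cmake build and content-hash caching taken from the description above.

```python
# Hypothetical build cache: build the generated sources once per content hash.
import hashlib
import subprocess
from pathlib import Path

def build_or_reuse(src_dir: Path, cache_dir: Path) -> Path:
    """Build the qsp-codegen C++ sources once per content hash, then reuse the binary."""
    sources = sorted(src_dir.rglob("*.cpp")) + sorted(src_dir.rglob("*.hpp"))
    digest = hashlib.sha256(b"".join(p.read_bytes() for p in sources)).hexdigest()[:16]
    cached = cache_dir / f"qsp_sim-{digest}"
    if cached.exists():
        return cached  # same generated sources: reuse the previous build
    build_dir = src_dir / "build"
    subprocess.run(["cmake", "-S", str(src_dir), "-B", str(build_dir)], check=True)
    subprocess.run(["cmake", "--build", str(build_dir), "-j"], check=True)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (build_dir / "qsp_sim").replace(cached)  # keep the artifact keyed by content hash
    return cached
```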
Speedups on representative QSP scenarios run 25× to 87× over the MATLAB workers; the speedup grows with simulation length because the per-sim setup overhead in MATLAB amortises poorly over short sims. The C++ binary writes species, compartments, and assignment-rule values to companion files alongside the trajectory, so any model parameter referenced by a calibration target is available downstream as species_dict[name].
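Purely illustrative guess at how those companion values could be consumed downstream; the file layout and column names are invented, and only the species_dict[name] lookup comes from the text.

```python
# Hypothetical companion-file read; the actual on-disk layout may differ.
import pandas as pd

trajectory = pd.read_parquet("out/sim_000123.parquet")        # MATLAB-compatible frame
companion = pd.read_parquet("out/sim_000123_values.parquet")  # assumed companion file
species_dict = dict(zip(companion["name"], companion["value"]))

# Any model quantity a calibration target references can now be looked up by name.
tumor_volume = species_dict["tumor_volume"]                   # hypothetical key
```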
Inference-time use: simulate_with_parameters
simulate_with_parameters(theta, backend="local"|"hpc") is the entry point that posterior predictive checks and OBED retraining sweeps call. The argument is an (n, n_params) matrix of thetas (typically posterior draws from qsp-inference), not a sample size to draw from a prior. Each row is hashed, the cache is consulted, and only the rows that miss are simulated. Local and HPC backends return the same DataFrame shape so calling code doesn’t need to branch on backend.
Passing restriction_classifier_dir=... rejection-samples the prior against a qsp_inference.inference.RestrictionClassifier before any simulator job is submitted, so simulator time isn't burned on draws that always fail (typically tumors that never reach detectable size). Lognormal, normal, uniform, and beta priors are supported in the prior CSV.
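A usage sketch under assumed constructor arguments and file names; simulate_with_parameters, backend, and restriction_classifier_dir are the only names taken from this section.

```python
# Inference-time usage sketch (import path and constructor arguments are assumptions).
import numpy as np
from qsp_hpc_tools import CppSimulator  # assumed import path

theta = np.load("posterior_draws.npy")  # (n, n_params): one posterior draw per row
sim = CppSimulator(project_dir=".", host="hpc-login")  # hypothetical arguments

# Each row is hashed against the cache; only misses are simulated, and the local
# and HPC backends return a DataFrame of the same shape.
ppc = sim.simulate_with_parameters(theta, backend="hpc")

# Prior-predictive sweep: draws a trained RestrictionClassifier predicts will
# always fail (e.g. tumors that never reach detectable size) are rejected before
# any SLURM job is submitted.
prior_draws = np.load("prior_draws.npy")
sweep = sim.simulate_with_parameters(
    prior_draws, backend="hpc",
    restriction_classifier_dir="classifiers/restriction",
)
```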
Burn-in caching across scenarios
Many QSP scenarios share a long pre-treatment “evolve to diagnosis” segment that is the dominant cost per simulation. Setting evolve_trajectory_dir=... on a simulation call dumps each per-sim post-evolve ODE state to disk; subsequent calls under different scenarios pick up from the cached state instead of re-evolving. The dumps are packed into a single LMDB file, and invalidation is keyed to a content hash of the upstream parameter set.
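A sketch of that reuse pattern, assuming the same hypothetical constructor as above; only evolve_trajectory_dir is documented here.

```python
# Burn-in reuse across scenarios (constructor arguments and paths are assumptions).
import numpy as np
from qsp_hpc_tools import CppSimulator  # assumed import path

theta = np.load("virtual_population.npy")
shared_evolve = "cache/evolve_to_diagnosis"  # LMDB-backed dump of post-evolve ODE states

for scenario in ("nab_paclitaxel_gvax", "resection_only"):
    sim = CppSimulator(project_dir=".", scenario=scenario, host="hpc-login")
    # The first call pays for the long pre-treatment segment and dumps each
    # per-sim post-evolve state; later scenarios resume from the dump.
    sim.simulate_with_parameters(theta, backend="hpc",
                                 evolve_trajectory_dir=shared_evolve)
```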
assemble_evolve_trajectory_long pivots the per-sim dumps into a long-form pandas DataFrame so trajectory plots can reuse the same observable definition the inference was conditioned on (via qsp_inference.inference.evaluate_calibration_target_over_trajectory).
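A heavily hedged sketch of that last step; only the two function names come from this README, and their argument lists and the observable name are guesses.

```python
# Hypothetical call pattern; actual signatures may differ.
from qsp_hpc_tools import assemble_evolve_trajectory_long  # assumed import path
from qsp_inference.inference import evaluate_calibration_target_over_trajectory

# Long-form frame: one row per (simulation, time point, state variable).
long_df = assemble_evolve_trajectory_long("cache/evolve_to_diagnosis")

# Apply the same observable definition the inference was conditioned on.
target = evaluate_calibration_target_over_trajectory(long_df, "tumor_diameter")
```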
Stack
- qsp-codegen: generates the C++ CVODE simulator from the SBML export; this package builds and runs it
- qsp-hpc-tools: this package
- qsp-inference: Bayesian inference consumer of the simulator