A major part of SalvusFlow is its remote job execution framework, which is used every time a simulation is run with SalvusFlow. Whenever that happens, the site at which you want to run the simulation has to be specified.
A site is a set of configuration parameters which describe how to run SalvusCompute on a local or remote machine. Every site must have a unique name.
Most of the Salvus tutorials use a site called "local" with a local site type. We recommend setting this up as well if you want to follow along with the tutorials.
Which site type is suitable for a given machine is usually an obvious choice. Salvus currently supports the following site types; please contact Mondaic if your cluster's job management system is not listed.
local: Runs simulations on the same machine as SalvusFlow. This is the only site type that does not use SSH.
ssh: For simulations on remote machines/work stations without a job queuing system. Uses SSH for remote communication.
slurm: For clusters with the Slurm job submission system. Uses SSH for remote communication.
pbs: For clusters with the PBS Pro job submission system. Uses SSH for remote communication.
lsf: For clusters using the IBM Spectrum LSF job management system. Uses SSH for remote communication.
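To make these type names concrete, a site entry in the configuration file might look roughly like the sketch below. Apart from site_type, run_directory, and tmp_directory (all discussed on this page), treat the layout and key placement as illustrative assumptions and consult the library of example site configurations for the exact schema of your Salvus version.

```toml
# Hypothetical entry for a site named "local". The section layout is an
# illustrative assumption; see the example site configurations for the
# exact schema of your Salvus version.
[sites.local]
    site_type = "local"                  # one of: local, ssh, slurm, pbs, lsf
    run_directory = "~/salvus_data/run"  # inputs and most outputs
    tmp_directory = "~/salvus_data/tmp"  # large temporary/volumetric outputs
```

After adding or changing such an entry, remember to initialize the site with salvus-cli init-site as described below.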
SSH configuration is sometimes a major hurdle for people who don't regularly use it, thus we have a separate page to help with that.
There are three recommended ways to add a new site:
Use the interactive site setup of the salvus-cli command line tool, which will also download SalvusCompute if necessary.
Use the Salvus Configuration Builder.
Have a look at our library of example site configurations, copy a suitable one into the configuration TOML file, and adjust it to your system.
All sites are defined in a global TOML configuration file. The exact location of this file is system dependent, but it can be queried with salvus-cli print-config-paths. A convenient way to edit the file is to open it with your preferred editor (specified via the $EDITOR environment variable).
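The $EDITOR convention works as sketched below; note that the fallback to vi is an assumption for illustration, not documented salvus-cli behavior.

```shell
# Open a file in the user's preferred editor, falling back to vi when
# $EDITOR is unset or empty. The vi fallback is an assumption for
# illustration, not documented salvus-cli behavior.
open_in_editor() {
  "${EDITOR:-vi}" "$1"
}

# Example: EDITOR=nano open_in_editor /path/to/config.toml
```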
After the configuration has been edited you have to initialize the site by calling
salvus-cli init-site SITE_NAME
This command runs a series of checks on the configuration to make sure it is correct. If something goes wrong, it offers extensive debugging output to pinpoint the issue. Once the site initialization has succeeded, the site is ready to be used. Any time a site is updated or changed, it has to be initialized again!
Most parameters, together with the provided comments/documentation, should be self-explanatory, but the run_directory and tmp_directory parameters warrant further explanation. Both of them specify directories at the local or remote site that will be managed by SalvusFlow.
run_directory: Every job run on this site will get its own directory here. SalvusFlow will use that directory to store all inputs and most output files.
tmp_directory: Every job that produces a lot of output (e.g. volumetric
data output or checkpoints for adjoint simulations) will get a folder in
this directory, as these special output files are often orders-of-magnitude
larger in size than standard output files. Many HPC systems have multiple
file systems: one which stores a limited amount of user data (e.g. where
your "home" directory is located), and one which can performantly store
large quantities of data, but which also comes with no guarantee of file
persistence (e.g. where your "scratch" directory is located). In these
cases we recommend pointing the
tmp_directory parameter to a folder on
the latter file system, keeping in mind that such data may be cleared from
time-to-time by the system's maintenance routines.
Note that both folders must be readable and writable from the compute nodes.
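A quick way to verify this is a small shell check along the following lines. This is a sketch: the two paths are placeholders for your configured run_directory and tmp_directory, and on a cluster it should be run from a compute node rather than only the login node.

```shell
# check_rw DIR: succeeds only if DIR exists and is both readable and
# writable by the current user.
check_rw() {
  [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}

# Placeholder paths -- substitute the directories from your site
# configuration before running.
for d in /path/to/salvus_data/run /path/to/salvus_data/tmp; do
  if check_rw "$d"; then
    echo "$d: ok"
  else
    echo "$d: missing or lacking read/write permissions" >&2
  fi
done
```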
Scenario A: Single filesystem, keep everything in the same folder.
run_directory = "/path/to/salvus_data/run"
tmp_directory = "/path/to/salvus_data/tmp"
Scenario B: Single filesystem. Use the actual /tmp directory for the large files. Please keep in mind that the /tmp directory is cleared upon restart on many systems.
run_directory = "/path/to/salvus_data/run"
tmp_directory = "/tmp/salvus_tmp"
Scenario C: One smaller filesystem for most files, another large scratch space for the potentially very large other files.
run_directory = "/path/to/salvus_data/run"
tmp_directory = "/scratch/path/to/salvus_data/tmp"