-------------------------------------------------------------------------------
MPI and PBS applications...
-------------------------------------------------------------------------------
Configuration...

Adding a node...

  On the node, find how many CPUs and cores it has (look at the NUMA cpu count)
      lscpu | egrep -e "CPU|Thread|Core|Socket"
  Look at the "CPU(s):" entry

  Update the following files...
      vi /etc/clusters
      vi /etc/hosts
      vi /etc/ssh/ssh_known_hosts      # add the node's host keys

      # For MPI (do not do this, it interferes with normal use)
      #vi /etc/openmpi-x86_64/openmpi-default-hostfile
      # add nodes to "/etc/mpi_nodes" for users to use instead
      #vi /etc/mpi_nodes

      # For PBS
      vi /var/lib/torque/server_priv/nodes

  Distribute the "hosts" and "ssh_known_hosts" files to ALL nodes.

MPI Node Testing...

  Note: Only do this with small test programs such as "mpi_hello" or the
  "hostname" command.  Anything larger should be submitted using PBS, to be
  queued and run in batch as CPUs become available (see below).

  Run one job on ogre
      mpiexec -host ogre hostname

  Run on specific nodes (2 of them on node22)
      mpirun -host node22,node22,shrek,ogre hostname

  One process on every host
      mpirun --pernode --hostfile /etc/mpi_nodes hostname | sort

      mpirun --pernode --hostfile /etc/mpi_nodes \
         bash -c 'h=`hostname`; r=`uname -r`; echo $h $r' | sort

      mpirun --host $(pbsnodes | awk '/^[^ ]/{a=$0} /^ *state = free/{print a}' | tr \\012 ,) \
         bash -c 'h=`hostname`; r=`uname -r`; echo $h $r' | sort

  Two processes on every host
      mpirun --npernode 2 --hostfile /etc/mpi_nodes hostname | sort

  One process on every CORE in the cluster (auto-discovery)

    NB: The option "--hetero-nodes" is required to handle a mixed set of CPUs.
    Without it MPI can get the core counts wrong.  EG: it may mistakenly run
    only 2 processes instead of 4 on a quad-core machine.

      mpirun --hostfile /etc/mpi_nodes --hetero-nodes hostname

  Using a "mpi_hello" program...
      mpirun --pernode mpi_hello
      mpirun mpi_hello
      mpirun mpi_hello | sort -t\  -k5n

-------------------------------------------------------------------------------
Converting a $PBS_NODEFILE to an OpenMPI usable form

This is for use with the pure RPM versions of the Torque and OpenMPI RPMs.

=======8<--------
# Using a comma separated --host list
mpirun --host $(tr '\n' , < $PBS_NODEFILE) hostname

# OR adding the correct "slots=" entries to the hostfile
# Warning: the node order may be randomised
awk '{n[$0]++} END {for(i in n)print i,"slots="n[i]}' $PBS_NODEFILE > hostfile_slots.txt
mpirun --hostfile hostfile_slots.txt hostname
rm -f hostfile_slots.txt
=======8<--------

That is okay for personal use, but not if other users will be making use of
the system.  However, OpenMPI will ignore the --host and --hostfile options
when run under PBS, if it was compiled with the "--with-tm" configuration
option.
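
If you are stuck with a build of OpenMPI that does not have "--with-tm"
support, and find yourself using the comma separated --host workaround often,
it can be wrapped in a tiny helper script.  This is only a sketch: the script
name "pbs_hostlist" and its location are my own invention, not part of the
cluster setup.

=======8<--------
#!/bin/sh
# pbs_hostlist (hypothetical helper):
# print the $PBS_NODEFILE as a comma separated list for "mpirun --host"
tr '\n' ',' < "${PBS_NODEFILE:?this only works inside a PBS batch job}" |
    sed 's/,$//'
=======8<--------

Usage inside a PBS batch script would then be something like
    mpirun --host "$(pbs_hostlist)" mpi_hello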
===============================================================================
PBS

=======8<--------
#!/bin/bash
#
# PBS batch script to run a job on PBS nodes
#
# Place both stdout and stderr into the same file
#PBS -j oe
#
# Change the output file locations (disabled)
# #PBS -o stdout.log -e stderr.log
#
# Do not send mail
#PBS -m n
# Or set a mail address (disabled)
# #PBS -M @
#
# Specify that you want a whole quad-core machine (disabled)
# #PBS -l nodes=1:ppn=4:quadcore
#
# Maximum time allowed to be in the running state (1 minute)
#PBS -l walltime=00:01:00
#
echo "PBS Job Number      " $(echo $PBS_JOBID | sed 's/\..*//')
echo "PBS batch run on    " $(hostname)
echo "Time it was started " $(date +%F_%T)
echo "Current Directory   " $(pwd)
echo "Submitted work dir  " $PBS_O_WORKDIR
echo "Number of Nodes     " $PBS_NP
echo "Nodefile List       " $PBS_NODEFILE

cd "$PBS_O_WORKDIR"    # return to the correct sub-directory

#echo ----ENVIRONMENT----
#env | grep ^PBS_

echo ----NODEFILE----
cat $PBS_NODEFILE

# pbsdsh (PBS distributed shell) runs a command on each allocated node.
#echo ----PBSdsh----
#pbsdsh hostname

# OpenMPI on this system was built specifically to understand the
# Torque-PBS system; as such you can just use this and MPI
# will run the given program on each node PBS assigns,
# with the OpenMPI communication set up.
echo ----OpenMPI----
mpirun mpi_hello
=======8<--------

To queue a batch job...

  # run 1 process on 3 nodes (any that are available)
  qsub -l nodes=3:ppn=1 pbs_hello

  # run 1 process over ANY 16 nodes
  qsub -l nodes=16 pbs_hello

  # Submit 5 processes on one specific node
  qsub -l nodes=1:ppn=5:ogre.emperor pbs_hello

  # require 1 node with 4 free cpus
  qsub -l nodes=1:ppn=4 pbs_hello

  # require 4 nodes with 3 free cpus (12 total)
  qsub -l nodes=4:ppn=3 pbs_hello

  # You can limit the job to a specific set of nodes available in the cluster
  #   poweredge            shrek, ogre, or mou
  #   backnode quadcore    dell990   node08 to node20
  #   backnode dualcore    dell960   node21 to node25
  #
  qsub -l nodes=1:ppn=1:backnode  pbs_hello   # 1 process on ANY backnode
  qsub -l nodes=3:ppn=4:quadcore  pbs_hello   # 3 nodes with 4 cores (12 total)
  qsub -l nodes=5:ppn=2:dualcore  pbs_hello   # 10 processes on the dualcores
  qsub -l nodes=1:ppn=5:poweredge pbs_hello   # 5 processes on one poweredge
  qsub -l nodes=3:ppn=3:poweredge pbs_hello   # 9 over 3 poweredge machines
  qsub -l other=dualcore          pbs_hello   # just one process on a dual-core

You can also specify various "qsub" options directly in the batch file by
adding commented lines of the form...

    #PBS -l nodes=1:ppn=4:quadcore

See the commented option in the batch script above.  With that line in place
you can then simply submit your batch file

    qsub pbs_batch_script

When your job is complete you will have additional file(s) created in your
WORKDIR holding the output from the job run.  You will also receive mail
when the job starts, stops or aborts, unless you have a "#PBS -m n" line in
the PBS batch file.

---

To look at the current queue use
    qstat

What nodes each job is running on (needs a wide window for output)
    qstat -n

Look at everything about a specific job number
    qstat -f {number}

How is your job doing (or how did it do in the past), checking over the last 5 days
    tracejob -n 5 {number}

Remove your job from the queue
    qdel {number}

The Administrator can purge a job from the queue (when not running on a node)
    qdel -p {number}
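
Tying those together: "qsub" prints the new job's identifier on stdout, so it
can be captured and fed straight back into the commands above.  A minimal
sketch (the batch script name is the example one from above):

    jobid=$(qsub pbs_batch_script)    # e.g. returns "1306.shrek.emperor"
    qstat -f "$jobid"                 # full details while queued or running
    tracejob -n 5 "${jobid%%.*}"      # job history, using the numeric part only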
For more information about the current cluster setup...

    # What nodes are available and what jobs are running on them
    pbsnodes

    # Report on one specific node
    pbsnodes node20.emperor

    # Report just the states of the nodes (job-exclusive = node is busy)
    pbsnodes | awk '/^[^ ]/ {a=$0} /^ *state = / {print a, $3}'

    # Specific properties of a specific node
    # (Properties can be used in "qsub -l", see above)
    pbsnodes | awk '/^[^ ]/ {a=$0} /^ *prop/ {print a, $3}'

    # Server and job queue settings
    qmgr -c 'print server'

    # Detailed information about a specific node (like "pbsnodes")
    qmgr -c 'print node ogre.emperor'
    qmgr -c 'p n node21.emperor'

### Additional notes not in the server notes...

Arrays of Jobs...

  Submit the same pbs batch script multiple times (an array of jobs)
      qsub -t 0-3 -l nodes=1:ppn=2:dualcore pbs_hello

  The above runs 4 completely separate sets of dual-core jobs.

  To see the array in the queue as separate pbs jobs use
      qstat -t

  The environment variables PBS_JOBNAME and PBS_JOBID have an extra component
  that can be used to determine which array index the running job is.
      PBS_JOBNAME=pbs_hello-3
      PBS_JOBID=1306[3].shrek.emperor

-------------------------------------------------------------------------------
Server Handling

What does pbs_mom report about a node (debug level 3)...
    momctl -d 3 -h node25

Have the server automatically figure out the np= parameter
    qmgr -c 'set server auto_node_np = True'
The server will then ignore the np value in the nodes file.

Supervisor deletion of jobs can include a message
    qdel -m "hey! Stop abusing the NFS servers" 4807

What PBS jobs are running on this node
    ls /dev/cpuset/torque/

What CPUs are assigned to what job
    head /dev/cpuset/torque/*/cpuset.cpus

===============================================================================
Major Problems Encountered...

Remote nodes not working...

  MPI opens random ports to other random ports, so you can NOT use a firewall
  with the system.  That is why an intranet cluster is vital.

SSH Error...

      ssh: Could not resolve hostname node22: Name or service not known
      ORTE was unable to reliably start one or more daemons.

  In my case this only happened when more than 3 remote nodes are present in
  the hosts file.

  According to...
      http://users.open-mpi.narkive.com/O8OcyQEf/#post6
      https://stackoverflow.com/questions/43811232/
  the ORTE daemon may in fact run ssh from any node to any other node.
  That is, this must work for all combinations of nodes...

      ssh node21 ssh node22 hostname

  The error from ssh could be from ANY node, as MPI tries to ssh to ANY other
  node, often using just the short name of the node (no domainname).  As such
  all nodes must have up-to-date DNS or /etc/hosts files, and ssh_known_hosts
  keys, for it to work correctly.

  Options to "mpirun" that may help..
      --mca orte_keep_fqdn_hostnames 1
      --mca plm_rsh_no_tree_spawn 1      # disable tree based spawning

  Testing (including to themselves)...

      foreach node (Node1 Node2 Node3 Node4 ...)
        foreach other (Node1 Node2 Node3 Node4 ...)
          echo from $node to $other
          ssh $node ssh $other echo OK
        end
      end

Physical Cores versus Logical Cores

  Setting this MCA param is equivalent to --oversubscribe, so as to subscribe
  to the logical number of cores (for example 4), which is commonly used to
  determine how 'busy' a processor is, instead of the physical number of
  cores (for example 2).

      --mca rmaps_base_oversubscribe=1
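
  For example, to deliberately run more ranks than a test node has physical
  cores.  A sketch only: it assumes an OpenMPI version where the
  "--oversubscribe" option is available (as implied above), and the rank
  count is made up.

      # 8 ranks on a single node, allowing over-subscription (equivalent forms)
      mpirun --oversubscribe -np 8 --host node22 mpi_hello
      mpirun --mca rmaps_base_oversubscribe 1 -np 8 --host node22 mpi_hello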
-------------------------------------------------------------------------------
OpenMPI does not obey --hostfile as expected

OpenMPI does not use a hostfile as the list of nodes it must use, but as a
list of hosts on which it CAN run processes.  It then requests as many CORES
from each node as it can (using an auto-discovery method).

This was added somewhere between V1.6 and V1.10 of OpenMPI and was confirmed
on the mpi mailing list...   rhc@open-mpi.org

    If you don’t specify a slot count, we auto-discover the number of cores
    on each node and set #slots to that number.  If an RM is involved, then
    we use what they give us.

For example, using dual core machines...

    vi hostfile.txt
        node21.emperor
        node22.emperor
        node22.emperor
        node23.emperor

    PBS_NODEFILE=hostfile.txt     # pretend PBS-Torque is running it

    mpirun --hostfile $PBS_NODEFILE --display-allocation mpi_hello

      ======================   ALLOCATED NODES   ======================
             node21.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
             node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
             node23.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
      =================================================================
      Hello World! from process 0 / 6  on node21.emperor
      Hello World! from process 2 / 6  on node22.emperor
      Hello World! from process 1 / 6  on node21.emperor
      Hello World! from process 3 / 6  on node22.emperor
      Hello World! from process 4 / 6  on node23.emperor
      Hello World! from process 5 / 6  on node23.emperor

    As you can see we ran 6 processes (ranks) instead of only 4.

You can also get other forms of node mapping using
    --display-allocation --display-map
however this will report the resource manager allocation.  Adding
    --mca ras_base_verbose 5
will report a correct mapping, but that mapping is NOT applied.

Solutions...

  Using a comma separated host list with --host works...

      mpirun --host $(tr '\n' , < $PBS_NODEFILE) mpi_hello

        Hello World! from process 0 out of 4 on node21.emperor
        Hello World! from process 1 out of 4 on node22.emperor
        Hello World! from process 3 out of 4 on node23.emperor
        Hello World! from process 2 out of 4 on node22.emperor

  Add "slots=" information to the hostfile.

      awk ' { n[$0]++ }
            END { for(i in n) print i, "slots=" n[i] }
          ' $PBS_NODEFILE > hostfile_slots.txt

      cat hostfile_slots.txt       # note the order is randomized by awk!
        node23.emperor slots=1
        node22.emperor slots=2
        node21.emperor slots=1

      mpirun --hostfile hostfile_slots.txt mpi_hello

        Hello World! from process 0 out of 4 on node23.emperor
        Hello World! from process 1 out of 4 on node22.emperor
        Hello World! from process 3 out of 4 on node21.emperor
        Hello World! from process 2 out of 4 on node22.emperor

The recommended solution is to recompile OpenMPI so that the configuration
includes "--with-tm", so it understands PBS requests correctly.
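
A sketch of what that rebuild, and a quick check of the result, might look
like.  The source directory, version number and install prefix below are
assumptions (on an RPM based system you would more likely rebuild the
OpenMPI source RPM with the equivalent configure option):

=======8<--------
# Rebuild OpenMPI from source with Torque (tm) support
cd openmpi-1.10.7                      # unpacked source tree (version assumed)
./configure --with-tm --prefix=/opt/openmpi-tm
make all install

# Verify: a tm-enabled build should list "tm" components in its output,
# for example lines like "MCA ras: tm" and "MCA plm: tm"
ompi_info | grep -i ' tm '
=======8<--------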
---

After rebuilding OpenMPI so it understands PBS-Torque, we hit a bug in the
Torque-PBS package.  Specifically, the generated $PBS_NODEFILE is correct
(multiple nodes listed), but PBS runs all processes on the first node only,
even if that means over-subscribing that node.

EG: Commands in the "pbs_hello" script and the results...

    qsub -l nodes=5:ppn=1:dualcore pbs_hello

    cat $PBS_NODEFILE
      node21.emperor
      node25.emperor
      node24.emperor
      node23.emperor
      node22.emperor

    pbsdsh hostname
      node21.emperor
      node21.emperor
      node21.emperor
      node21.emperor
      node21.emperor

    mpirun hostname
      node21.emperor
      node21.emperor
      node21.emperor
      node21.emperor
      node21.emperor

Solution:  Install the latest version of Torque, 4.2.10-11, from
"epel-testing".  But we also needed to remove "num_node_boards=1" from
/var/lib/torque/server_priv/nodes

-------------------------------------------------------------------------------
"parallel" instead of "mpirun"

If you are just wanting to run jobs, and only some require MPI communication,
you may be able to use "parallel" to run the jobs on the supplied nodes
instead.

    parallel --sshloginfile $PBS_NODEFILE --nonall hostname

Other options to add..

    --sshloginfile FILE   The file of nodes to run things on
    --jobs NUM            Run NUM simultaneous subjobs per node.
    --workdir DIR         The working directory on all nodes is DIR.
    --env VAR             Copy environment variable VAR to the other nodes.
    --basefile FILE       Copy FILE to all nodes before starting subjobs.
                          Use an absolute path; the copies are cleaned up
                          afterwards.  WARNING: this can also clean up the
                          file on the 'masternode' if present!
    --nonall              Execute the given command on each node.
    --will-cite           Stop the "citation notice"
    --joblog FILE         Keep a log of what is processed
    --resume              Continue according to the joblog

Full example:

  The jobs do not need to be all the same, but can be multiple different
  jobs, as given in a list...  It is important to set the working directory,
  or "$HOME" will be used.

      #PBS -l nodes=25:ppn=8,walltime=1:00:00
      cd $SCRATCH/example
      module load gnu-parallel/20121022
      parallel --jobs 8 --sshloginfile $PBS_NODEFILE \
               --joblog progress.log --workdir $PWD < subjob.lst

  This runs 200 shell commands (subjobs) simultaneously over 25 nodes.
  There should be at least a couple of thousand jobs in "subjob.lst" for
  this to make sense.

For a deeper understanding see...
    https://wiki.scinet.utoronto.ca/wiki/images/7/7b/Tech-talk-gnu-parallel.pdf

-------------------------------------------------------------------------------
numactl

NUMA is used to bind a process and its children to a specific set of cpus
(nodes) and memory requirements.  Basically reserving them for that specific
process.

    numactl --show     # what resources the current process has access to

    # reserve two cores and show that they are reserved
    numactl --physcpubind=0-1 -- sh -c 'sleep 2; numactl -s; sleep 2'

Nothing seems to stop other processes using the reserved cores!!!!
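
A couple of related numactl invocations that may also be useful here.  A
sketch only: "./my_program" is a stand-in name.

    numactl --hardware     # show the NUMA layout: nodes, their cpus and memory

    # bind both the cpus and the memory allocations to NUMA node 0
    numactl --cpunodebind=0 --membind=0 -- ./my_program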
-------------------------------------------------------------------------------
Maui scheduler

As another user pointed out, '-l procs' doesn't work, but considering what
you said about the new cluster owners' needs, that may not be essential for
them; they can get away with -l nodes=N:ppn=P.

Make sure you point to the Torque installation when you install Maui

    --with-pbs=/Torque/installation/dir

(Yes, Maui still calls Torque "pbs".)  I also set --prefix=(directory where
Maui will be installed), and --with-spooldir= (same directory as --prefix).

In maui.cfg, make sure you have consistent values for

    SERVERHOST  your_torque_server_node_name
    # primary admin must be first in list
    ADMIN1      maui root        # I created a "maui" user to run maui

    # Resource Manager Definition
    RMCFG[MASTER] TYPE=PBS       # oh well, there comes PBS again ...

You can tweak more with the Maui configuration parameters, described in the
Maui Admin Guide:
    http://docs.adaptivecomputing.com/maui/

Some features of Maui work well, others not so much.  But, for instance, you
can set up separate queues on Torque, attach some nodes to specific queues
(assigning node properties in the nodes file and configuring the queue to
use only those nodes), have routing and execution queues, etc.

I hope this helps,
Gus Correa

-------------------------------------------------------------------------------
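A sketch of how the "attach some nodes to specific queues" idea can be set up
on the Torque side.  This is hedged: I have not verified it on this cluster,
the queue name "dualq" is my own invention, and I believe
"resources_default.neednodes" is the usual attribute for tying a queue to a
node property (check the Torque queue documentation).

    # /var/lib/torque/server_priv/nodes  (property added at the end of the line)
    node21.emperor np=2 dualcore

    # create an execution queue that only uses nodes with that property
    qmgr -c 'create queue dualq'
    qmgr -c 'set queue dualq queue_type = Execution'
    qmgr -c 'set queue dualq resources_default.neednodes = dualcore'
    qmgr -c 'set queue dualq enabled = True'
    qmgr -c 'set queue dualq started = True'

    # submit to that queue
    qsub -q dualq pbs_hello

-------------------------------------------------------------------------------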