Prerequisites

In this guide, we will need:

  • 1 controller node (Ubuntu 18.04.6 LTS)

  • 1 worker node (a Jetson Nano in my case, also running Ubuntu 18.04.6 LTS)

Set up the controller node

  1. Open a terminal on the controller node
  2. Run the following commands to install the dependencies:
    sudo apt update
    sudo apt upgrade -y
    sudo apt-get install -y openssh-server munge libmunge2 libmunge-dev slurm-wlm sudo && sudo apt-get clean
    
  3. Check that munge runs properly:
    munge -n | unmunge | grep STATUS
    

    The output should look like this: STATUS: Success (0)

  4. Check whether a key already exists at /etc/munge/munge.key; if it does not, generate a new one:
    sudo /usr/sbin/mungekey
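    To verify the key is in place (whether pre-existing or newly generated), a plain listing is enough; nothing beyond the path above is assumed:
    sudo ls -l /etc/munge/munge.key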
    
  5. Set up the correct permissions:
    sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
    sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
    sudo chmod 0755 /run/munge/
    sudo chmod 0700 /etc/munge/munge.key
    sudo chown -R munge: /etc/munge/munge.key
    
  6. Restart the munge service and configure it to run at startup:
    sudo systemctl enable munge
    sudo systemctl restart munge
    
  7. Check the status of the munge service:
    sudo systemctl status munge
    
  8. Copy the key to the worker:
    sudo scp /etc/munge/munge.key ubuntu@<IP of the worker>:/home/ubuntu/munge.key
    
  9. Use Slurm's handy configuration file generator, located at /usr/share/doc/slurmctld/slurm-wlm-configurator.html, to create your configuration file; you can open the configurator in your browser. Alternatively, use my sample configuration below (it may not fit your system exactly):
     # slurm.conf file generated by configurator.html.
     # Put this file on all nodes of your cluster.
     # See the slurm.conf man page for more information.
     #
     ControlMachine=controller
     #ControlAddr=
     #BackupController=
     #BackupAddr=
     #
     AuthType=auth/munge
     #CheckpointType=checkpoint/none
     CryptoType=crypto/munge
     #DisableRootJobs=NO
     #EnforcePartLimits=NO
     #Epilog=
     #EpilogSlurmctld=
     #FirstJobId=1
     #MaxJobId=999999
     #GresTypes=
     #GroupUpdateForce=0
     #GroupUpdateTime=600
     #JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
     #JobCredentialPrivateKey=
     #JobCredentialPublicCertificate=
     #JobFileAppend=0
     #JobRequeue=1
     #JobSubmitPlugins=1
     #KillOnBadExit=0
     #LaunchType=launch/slurm
     #Licenses=foo*4,bar
     #MailProg=/usr/bin/mail
     #MaxJobCount=5000
     #MaxStepCount=40000
     #MaxTasksPerNode=128
     MpiDefault=none
     #MpiParams=ports=#-#
     #PluginDir=
     #PlugStackConfig=
     #PrivateData=jobs
     ProctrackType=proctrack/pgid
     #Prolog=
     #PrologFlags=
     #PrologSlurmctld=
     #PropagatePrioProcess=0
     #PropagateResourceLimits=
     #PropagateResourceLimitsExcept=
     #RebootProgram=
     ReturnToService=1
     #SallocDefaultCommand=
     SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
     SlurmctldPort=6817
     SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
     SlurmdPort=6818
     SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
     SlurmUser=slurm
     #SlurmdUser=root
     #SrunEpilog=
     #SrunProlog=
     StateSaveLocation=/var/lib/slurm-llnl/slurmctld
     SwitchType=switch/none
     #TaskEpilog=
     TaskPlugin=task/none
     #TaskPluginParam=
     #TaskProlog=
     #TopologyPlugin=topology/tree
     #TmpFS=/tmp
     #TrackWCKey=no
     #TreeWidth=
     #UnkillableStepProgram=
     #UsePAM=0
     #
     #
     # TIMERS
     #BatchStartTimeout=10
     #CompleteWait=0
     #EpilogMsgTime=2000
     #GetEnvTimeout=2
     #HealthCheckInterval=0
     #HealthCheckProgram=
     InactiveLimit=0
     KillWait=30
     #MessageTimeout=10
     #ResvOverRun=0
     MinJobAge=300
     #OverTimeLimit=0
     SlurmctldTimeout=120
     SlurmdTimeout=300
     #UnkillableStepTimeout=60
     #VSizeFactor=0
     Waittime=0
     #
     #
     # SCHEDULING
     #DefMemPerCPU=0
     FastSchedule=1
     #MaxMemPerCPU=0
     #SchedulerRootFilter=1
     #SchedulerTimeSlice=30
     SchedulerType=sched/backfill
     SchedulerPort=7321
     SelectType=select/linear
     #SelectTypeParameters=
     #
     #
     # JOB PRIORITY
     #PriorityFlags=
     #PriorityType=priority/basic
     #PriorityDecayHalfLife=
     #PriorityCalcPeriod=
     #PriorityFavorSmall=
     #PriorityMaxAge=
     #PriorityUsageResetPeriod=
     #PriorityWeightAge=
     #PriorityWeightFairshare=
     #PriorityWeightJobSize=
     #PriorityWeightPartition=
     #PriorityWeightQOS=
     #
     #
     # LOGGING AND ACCOUNTING
     #AccountingStorageEnforce=0
     #AccountingStorageHost=
     #AccountingStorageLoc=
     #AccountingStoragePass=
     #AccountingStoragePort=
     AccountingStorageType=accounting_storage/none
     #AccountingStorageUser=
     AccountingStoreJobComment=YES
     ClusterName=cluster
     #DebugFlags=
     #JobCompHost=
     #JobCompLoc=
     #JobCompPass=
     #JobCompPort=
     JobCompType=jobcomp/none
     #JobCompUser=
     #JobContainerType=job_container/none
     JobAcctGatherFrequency=30
     JobAcctGatherType=jobacct_gather/none
     SlurmctldDebug=3
     SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
     SlurmdDebug=3
     SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
     #SlurmSchedLogFile=
     #SlurmSchedLogLevel=
     #
     #
     # POWER SAVE SUPPORT FOR IDLE NODES (optional)
     #SuspendProgram=
     #ResumeProgram=
     #SuspendTimeout=
     #ResumeTimeout=
     #ResumeRate=
     #SuspendExcNodes=
     #SuspendExcParts=
     #SuspendRate=
     #SuspendTime=
     #
     #
     # COMPUTE NODES
     NodeName=node[1-2] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
     PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
    
  10. Copy the configuration and paste it into /etc/slurm-llnl/slurm.conf:
    sudo nano /etc/slurm-llnl/slurm.conf
    
  11. Start the Slurm controller service (slurmctld) and configure it to start on boot:
    sudo systemctl enable slurmctld
    sudo systemctl restart slurmctld
    
  12. You can now check that your Slurm installation is running and your cluster is set up with the following commands:
    systemctl status slurmctld # returns status of slurm service
    sinfo		# returns cluster information
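    With the sample configuration above, the sinfo output should look roughly like the following (node names, counts, and states depend on your configuration, and nodes may show as down or unknown until the workers are up):
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    debug*       up   infinite      2   idle node[1-2]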
    
  13. Modify /etc/hosts with the hostname of the worker node and its IP address; an example is shown below.
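    A minimal sketch of the relevant entries, using the hostnames from the sample slurm.conf and placeholder addresses (replace 192.168.1.x with the real IPs on your network):
    # /etc/hosts on the controller (example values)
    127.0.0.1       localhost
    192.168.1.10    controller
    192.168.1.11    node1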

Set up the worker node

  1. Repeat steps 1-7 from the controller node setup
  2. Replace the key with the one we copied from the controller node:
    sudo cp /home/ubuntu/munge.key /etc/munge/munge.key
    
  3. Set up the correct permissions:
    sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
    sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
    sudo chmod 0755 /run/munge/
    sudo chmod 0700 /etc/munge/munge.key
    sudo chown -R munge: /etc/munge/munge.key
    
  4. Restart the munge service and configure it to run at startup:
    sudo systemctl enable munge
    sudo systemctl restart munge
    
  5. Test the munge connection to the controller node:
    munge -n | ssh <CONTROLLER_NODE> unmunge 
    

    If this is successful, you should see the credential decoded on the controller node, with STATUS: Success (0) in the output, just as in step 3 of the controller setup.
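    To check only the status line, the same grep filter from step 3 can be appended (replace <CONTROLLER_NODE> with the controller's hostname or IP address, as in the command above):
    munge -n | ssh <CONTROLLER_NODE> unmunge | grep STATUS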

  6. Modify /etc/hosts on the worker as well, adding the hostnames and IP addresses of the controller and worker nodes; an example is shown below.
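    A minimal sketch of the entries on the worker, again with the hostnames from the sample slurm.conf and placeholder addresses:
    # /etc/hosts on the worker (example values)
    127.0.0.1       localhost
    192.168.1.10    controller
    192.168.1.11    node1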

  7. Copy the configuration you generated for the controller and paste it into /etc/slurm-llnl/slurm.conf
    Note: the configuration must be identical on all nodes
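    One way to transfer the file, mirroring the scp approach used for the munge key (the ubuntu user and home directory are assumptions carried over from that step):
    # on the controller
    sudo scp /etc/slurm-llnl/slurm.conf ubuntu@<IP of the worker>:/home/ubuntu/slurm.conf
    # on the worker
    sudo cp /home/ubuntu/slurm.conf /etc/slurm-llnl/slurm.conf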
  8. Start the Slurm worker service (slurmd) and configure it to start on boot:
    sudo systemctl enable slurmd
    sudo systemctl restart slurmd
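    # if the node registered as down or unknown, return it to service: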
    sudo scontrol update nodename=node1 state=idle
    
  9. Then we can verify Slurm is set up correctly and running like so:
    sudo systemctl status slurmd
    sudo tail -f /var/log/slurm-llnl/slurmd.log
    

Testing

  1. Open a terminal on the controller node and run:
    srun hostname
    

    It should print the hostname of the worker node that executed the job.
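    With the sample configuration above, and assuming the worker's system hostname matches its NodeName in slurm.conf, the output would simply be:
    node1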