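Installing the Slurm cluster management software

Nodes are cn15872-cn15935; the IP ranges are 10.182.170.{1..32} and 10.182.171.{1..32}

1. Remove the remains of a failed installation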

yum remove mariadb-server mariadb-devel -y
yum remove slurm munge munge-libs munge-devel -y
userdel -r slurm
userdel -r munge
[pkill can kill a process by name, which helps when userdel refuses to remove a user that still owns running processes]

2. Install the database

yum install mariadb-server mariadb-devel -y

Note: the -y option automatically answers Y to any [Y/n] prompt that comes up during installation

3. Create the users

export MUNGEUSER=980
groupadd -g $MUNGEUSER munge
useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
export SLURMUSER=981
groupadd -g $SLURMUSER slurm
useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm
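
The commands above pin munge and slurm to fixed numeric IDs (980 and 981), presumably so that the accounts are identical on every node. A minimal sketch of a check that could be run on each node afterwards; the helper name check_id is made up for this note:

```shell
# check_id NAME EXPECTED_UID
# Succeeds only if user NAME exists and has exactly that UID.
# (check_id is a hypothetical helper, not part of Slurm or munge.)
check_id() {
    actual=$(id -u "$1" 2>/dev/null) || { echo "$1: no such user" >&2; return 1; }
    if [ "$actual" != "$2" ]; then
        echo "$1: UID is $actual, expected $2" >&2
        return 1
    fi
    echo "$1: UID $actual OK"
}

# Usage on a node, with the values from the commands above:
#   check_id munge 980
#   check_id slurm 981
```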

4. Install munge

yum install epel-release -y
yum install munge munge-libs munge-devel -y

Install rng-tools to generate the key that munge needs (the key file can be any file larger than 32 bytes)

[Configure munge.key]
yum install rng-tools -y
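/usr/sbin/create-munge-key -r
# Alternative: the dd line below overwrites the key created above, so either one of the two is enough
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key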
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
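
Every node in the cluster has to end up with the same munge.key, so it can be worth checking the file after it has been created or copied. A rough sketch, assuming GNU stat as shipped with CentOS; check_key is a made-up helper name, and the size limit is the one stated in the note above:

```shell
# check_key FILE
# Succeeds if FILE is larger than 32 bytes and has mode 400.
# (check_key is a hypothetical helper; stat -c is the GNU coreutils form.)
check_key() {
    size=$(wc -c < "$1") || return 1
    mode=$(stat -c '%a' "$1") || return 1
    if [ "$size" -le 32 ]; then
        echo "$1: only $size bytes" >&2
        return 1
    fi
    if [ "$mode" != "400" ]; then
        echo "$1: mode is $mode, expected 400" >&2
        return 1
    fi
    echo "$1: $size bytes, mode $mode"
}

# Usage:
#   check_key /etc/munge/munge.key
```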

[Remaining munge configuration]
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
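
Once the service is running, a quick way to see whether munge works is to encode a credential and decode it again. A small sketch that wraps this round trip; the function name munge_selftest is made up here:

```shell
# munge_selftest
# Encodes an empty credential with munge and decodes it with unmunge
# on the local host. (munge_selftest is a hypothetical wrapper.)
munge_selftest() {
    if munge -n | unmunge >/dev/null; then
        echo "munge OK"
    else
        echo "munge credential round trip failed" >&2
        return 1
    fi
}

# Usage:
#   munge_selftest
```

The same idea can be used across nodes by piping the credential through ssh to unmunge on another host, which only succeeds when both hosts hold the same key.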

5. Install slurm

yum reinstall pam.x86_64 -y
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel perl-ExtUtils-MakeMaker man2html libibmad libibumad -y
yum install rpm-build -y
mkdir /slurm
cd /slurm
wget https://download.schedmd.com/slurm/slurm-17.11.5.tar.bz2
rpmbuild -ta slurm-17.11.5.tar.bz2
cd /root/rpmbuild/RPMS/x86_64
mkdir /slurm/slurm-rpms
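cp *.rpm /slurm/slurm-rpms  [the *.rpm files are the binary packages that rpmbuild produced from the source tarball; the next command simply installs them]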
yum --nogpgcheck localinstall *.rpm -y

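6. Configure slurm

http://slurm.schedmd.com/configurator.easy.html
This page generates a basic slurm.conf online; later changes only need to be made on top of that file.
1) Create slurm.conf under /etc/slurm and write the generated parameters into it;
2) Copy this slurm.conf to every node;
3) scp can be used for the transfer.
Then carry out the following steps:

# Configuration on the control node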
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log

# Configuration on every node that runs the slurmd daemon
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
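
Copying slurm.conf to every node with scp, as mentioned in steps 2) and 3) above, could be sketched like this. It assumes that the node names cn15872-cn15935 resolve and that root can ssh to them; list_nodes, push_conf and the DRY_RUN switch are names made up for this note:

```shell
# list_nodes: print cn15872 .. cn15935, one name per line
# (node range taken from the top of this note).
list_nodes() {
    seq 15872 15935 | sed 's/^/cn/'
}

# push_conf [FILE]
# Copies FILE (default /etc/slurm/slurm.conf) to the same path on every node.
# With DRY_RUN=1, the default in this sketch, the scp commands are only printed.
push_conf() {
    conf=${1:-/etc/slurm/slurm.conf}
    list_nodes | while read -r node; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scp $conf root@$node:$conf"
        else
            # </dev/null keeps scp from swallowing the node list on stdin
            scp "$conf" "root@$node:$conf" </dev/null
        fi
    done
}

# Usage:
#   push_conf              # print what would be copied
#   DRY_RUN=0 push_conf    # actually copy
```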

Start the slurmd daemon

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service

Start the slurmctld daemon (on the control node)

systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service

7. Using slurm

scontrol show nodes  # show all nodes
srun -N5 hostname  # run hostname on 5 nodes
sinfo  # show node information
……
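
Besides interactive srun, jobs are usually submitted as batch scripts. A minimal sketch that writes such a script for the same 5-node hostname run; the helper write_job, the job name and the file names are made up for illustration:

```shell
# write_job FILE
# Writes a minimal batch script to FILE. (write_job is a hypothetical helper.)
write_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=5
#SBATCH --output=hello_%j.out
# %j in the output file name is replaced by the job ID.
srun hostname
EOF
}

# Usage:
#   write_job hello.sh
#   sbatch hello.sh    # submit the job
#   squeue             # watch the queue
```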

Installation reference: https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/