Nodes: cn15872-cn15935 (64 nodes); IP ranges: 10.182.170.{1..32} and 10.182.171.{1..32}
1. Remove the leftovers of the failed installation
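The node names and addresses above can be expanded into explicit lists, which is handy for later loops such as copying a file to every node. A minimal sketch: the function names list_nodes and list_ips are made up for illustration, and no particular name-to-address pairing is assumed.

```shell
# Print every node name from cn15872 to cn15935, one per line.
list_nodes() {
    local i
    for i in $(seq 15872 15935); do
        printf 'cn%d\n' "$i"
    done
}

# Print every address in 10.182.170.{1..32} and 10.182.171.{1..32}.
list_ips() {
    local subnet host
    for subnet in 170 171; do
        for host in $(seq 1 32); do
            printf '10.182.%d.%d\n' "$subnet" "$host"
        done
    done
}
```

Both lists contain 64 entries, one per node.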
yum remove mariadb-server mariadb-devel -y
yum remove slurm munge munge-libs munge-devel -y
userdel -r slurm
userdel -r munge
[pkill can kill a process by its process name]
2. Install the database
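If userdel refuses to remove an account because one of its daemons is still running, the processes can be stopped first with pkill. A rough sketch, assuming the daemon names are slurmctld, slurmd and munged; the run helper and the DRY_RUN variable are hypothetical and only print the commands unless DRY_RUN=0 is set.

```shell
# Echo the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop any daemons left over from the failed installation.
stop_daemons() {
    local name
    for name in slurmctld slurmd munged; do
        # -x matches the process name exactly; pkill exits non-zero when
        # nothing matched, which is not an error here.
        run pkill -x "$name" || true
    done
}
```

Calling stop_daemons with DRY_RUN=0 as root runs the pkill commands for real; the userdel lines above can then be repeated.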
yum install mariadb-server mariadb-devel -y
Note: the -y option automatically answers Y to any [Y/n] prompt during installation.
3. Create the users
export MUNGEUSER=980
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=981
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
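Because the IDs 980 and 981 are fixed by hand and should be the same on every node, it can help to confirm they are free before running groupadd and useradd. A small sketch using getent; the helper names uid_in_use and gid_in_use are illustrative only.

```shell
# Succeed if some account already owns the given numeric UID.
uid_in_use() {
    getent passwd "$1" >/dev/null
}

# Succeed if some group already owns the given numeric GID.
gid_in_use() {
    getent group "$1" >/dev/null
}

for id in 980 981; do
    if uid_in_use "$id" || gid_in_use "$id"; then
        echo "ID $id is already taken on $(uname -n)"
    else
        echo "ID $id is free on $(uname -n)"
    fi
done
```

If an ID turns out to be taken on some node, a different free value would have to be chosen and used consistently on all nodes.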
4. Install munge
yum install epel-release -y
yum install munge munge-libs munge-devel -y
Install rng-tools to generate the key that munge needs (the key file can be any file larger than 32 bytes)
[Configure munge.key]
yum install rng-tools -y
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
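The note above says the key may be any file larger than 32 bytes. A quick size check before starting munged might look like the sketch below; key_size_ok is a hypothetical helper, and the 32-byte threshold is taken from that note.

```shell
# Succeed if the file exists and is larger than 32 bytes.
key_size_ok() {
    local file=$1 size
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding wc may add around the number.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt 32 ]
}

if key_size_ok /etc/munge/munge.key; then
    echo "munge.key size looks fine"
else
    echo "munge.key is missing, unreadable or too small" >&2
fi
```

Every node in the cluster needs an identical copy of this key, so it is normally generated once and then copied to the other nodes with the same owner and mode.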
[Remaining munge configuration]
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
5. Install slurm
yum reinstall pam.x86_64 -y
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel perl-ExtUtils-MakeMaker man2html libibmad libibumad -y
yum install rpm-build -y
mkdir /slurm
cd /slurm
wget https://download.schedmd.com/slurm/slurm-17.11.5.tar.bz2
rpmbuild -ta slurm-17.11.5.tar.bz2
cd /root/rpmbuild/RPMS/x86_64
mkdir /slurm/slurm-rpms
cp *.rpm /slurm/slurm-rpms [the *.rpm files are the packages built from the source tarball; the next command simply installs them]
yum --nogpgcheck localinstall *.rpm -y
6. Configure slurm
http://slurm.schedmd.com/configurator.easy.html
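The page above generates a basic slurm.conf online; afterwards you only need to adjust the result:
1) create slurm.conf under /etc/slurm and write the generated parameters into it;
2) upload this slurm.conf to every node;
3) scp can be used for the transfer.
Then carry out the following steps:
# Configuration on the control node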
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
# Configuration on every node that runs the slurmd daemon
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
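The mkdir/chown/chmod/touch sequences above repeat the same pattern for each daemon. They could be folded into helpers like the ones below (the names prepare_dir and prepare_log are illustrative); using mkdir -p keeps the step repeatable when the directory already exists.

```shell
# Create a directory owned by the given user, with mode 755.
prepare_dir() {
    local dir=$1 owner=$2
    mkdir -p "$dir"
    chown "$owner:" "$dir"
    chmod 755 "$dir"
}

# Create an empty log file (if missing) owned by the given user.
prepare_log() {
    local file=$1 owner=$2
    touch "$file"
    chown "$owner:" "$file"
}

# Example for a compute node, to be run as root:
# prepare_dir /var/spool/slurmd slurm
# prepare_log /var/log/slurmd.log slurm
```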
Start the slurmd daemon
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
Start the slurmctld daemon
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
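For orientation, a slurm.conf matching the paths used in this step might contain entries like the ones below. This is only a sketch: the control node cn15872, the cluster name, the partition name debug and CPUs=1 are placeholders chosen for illustration, not values taken from this cluster, and the output of the configurator page should take precedence. ControlMachine is the keyword used by the 17.11 series. The helper names are hypothetical; push_conf_cmds only prints the scp commands instead of running them.

```shell
# Write a minimal slurm.conf sketch to the given path.
write_conf_sketch() {
    cat > "$1" <<'EOF'
# Placeholders: ClusterName, ControlMachine, CPUs and the partition name.
ClusterName=cluster
ControlMachine=cn15872
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=cn[15872-15935] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=cn[15872-15935] Default=YES MaxTime=INFINITE State=UP
EOF
}

# Print one scp command per node; review the list before executing it.
push_conf_cmds() {
    local i
    for i in $(seq 15872 15935); do
        echo "scp /etc/slurm/slurm.conf cn$i:/etc/slurm/slurm.conf"
    done
}
```

Printing the commands first makes it easy to check the target list; the output can then be piped to sh once it looks right.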
7. Using slurm
scontrol show nodes # show all nodes
srun -N5 hostname # run hostname on 5 nodes
sinfo # show node information
……
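Besides interactive srun, the same hostname test can be submitted as a batch job with sbatch. A minimal sketch; the file name hostname.sbatch, the job name and the output file pattern are arbitrary choices made for this example.

```shell
# Write a small batch script that runs hostname on 5 nodes.
cat > hostname.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=5
#SBATCH --output=hostname-%j.out
srun hostname
EOF

# Submit it (requires a working cluster):
# sbatch hostname.sbatch
# squeue    # list pending and running jobs
```

In the output file name, %j is replaced by the job ID.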
Installation reference: https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/