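Hi all, I need some help.
My machine is a 96-core 7R32 server. After submitting jobs through Slurm, I found that only one job runs at a time; two 48-core jobs cannot run simultaneously. Following the advice in this thread, http://bbs.keinsci.com/forum.php ... ht=slurm&page=1 , I changed slurm.conf from
SelectType=SELECT/LINEAR
to SelectType=select/cons_tres with SelectTypeParameters=CR_Core.
After that, jobs no longer run at all, as shown below: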
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1187 localhost vasp xingpu PD 0:00 1 (launch failed requeued held)
slurmd -c prints:
slurmd: fatal: Unable to determine this slurmd's NodeName
Here is my slurm.conf:
#
# See the slurm.conf man page for more information.
#
ControlMachine=localhost
ControlAddr=127.0.0.1
#
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=root
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
# COMPUTE NODES
NodeName=node1 Sockets=2 CoresPerSocket=48 ThreadsPerCore=1 State=UNKNOWN
NodeName=node2 Sockets=2 CoresPerSocket=48 ThreadsPerCore=1 State=UNKNOWN
PartitionName=localhost Nodes=all Default=YES MaxTime=INFINITE State=UP
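Since the slurmd error above is about resolving this machine's NodeName, it may help to compare the local short hostname with the NodeName entries in the file. The following is only a minimal sketch for that comparison, not a confirmed diagnosis: the helper names `list_node_names` and `host_in_conf` are hypothetical, the path /etc/slurm/slurm.conf is an assumed default, and bracketed host ranges such as node[1-2] are not expanded.

```shell
#!/bin/sh
# Hypothetical helper: print the NodeName values declared in a slurm.conf.
# Only plain "NodeName=<name> ..." lines are handled; ranges are not expanded.
list_node_names() {
    # $1: path to slurm.conf
    sed -n 's/^NodeName=\([^[:space:]]*\).*/\1/p' "$1"
}

# Hypothetical helper: succeed if the host name matches a NodeName entry exactly.
host_in_conf() {
    # $1: host name, $2: path to slurm.conf
    list_node_names "$2" | grep -qxF -- "$1"
}

# Assumed location; override with the SLURM_CONF environment variable.
CONF="${SLURM_CONF:-/etc/slurm/slurm.conf}"

if [ -r "$CONF" ]; then
    host="$(hostname -s)"
    echo "hostname -s: $host"
    echo "NodeName entries in $CONF:"
    list_node_names "$CONF"
    if ! host_in_conf "$host" "$CONF"; then
        echo "No NodeName entry matches '$host'"
    fi
fi
```

With the configuration posted above, the entries listed would be node1 and node2, while ControlMachine is localhost, so the output shows at a glance whether the machine's own hostname appears among them.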
Any help would be much appreciated. Thanks a lot!