Last edited by 万里云 on 2024-12-30 18:08
The cluster has five GPU nodes, each with two GPUs. The relevant configuration in slurm.conf is as follows:
# GPU nodes
NodeName=GPUnode67 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257592 Gres=gpu:2
NodeName=GPUnode66 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257592 Gres=gpu:2
NodeName=GPUnode68 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257592 Gres=gpu:2
NodeName=GPUnode70 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257592 Gres=gpu:2
NodeName=GPUnode69 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257592 Gres=gpu:2
# GPU partitions
PartitionName=gpu_test Nodes=GPUnode[66-70] Default=NO MaxTime=60 State=UP OverSubscribe=NO PriorityTier=3 AllowGroups=p0,p1,p2,quant
PartitionName=gpu_2d Nodes=GPUnode[66-70] Default=NO MaxTime=2880 State=UP OverSubscribe=NO PriorityTier=2 AllowGroups=p0,p1,p2
PartitionName=gpu_7d Nodes=GPUnode[66-70] Default=NO MaxTime=10080 State=UP OverSubscribe=NO PriorityTier=1 AllowGroups=p0,p1,p2
PartitionName=gpu_unlimited Nodes=GPUnode[66-70] Default=NO MaxTime=INFINITE State=UP OverSubscribe=NO PriorityTier=0 AllowGroups=p0,p1,p2
Users now report that once a job using a single GPU is allocated on a node, the remaining GPU sits idle and cannot be used by other jobs. After going through the slurm.conf manual, it looks like the OverSubscribe option would need to change, but the descriptions of its values all say they do not apply to GPU resources (GRES):
FORCE
Makes all resources (except GRES) in the partition available for oversubscription without any means for users to disable it.
YES
Makes all resources (except GRES) in the partition available for sharing upon request by the job.
How should the configuration be changed so that two jobs can each use one of the two GPUs on the same node?
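For reference, one direction suggested by the slurm.conf manual (untested here, and the settings below are my guess, not verified on this cluster): OverSubscribe only governs oversubscription of CPUs and memory, while whether two jobs can share distinct GPUs on one node is decided by the select plugin. A minimal sketch, assuming the cluster is currently on a whole-node plugin such as select/linear:

```
# Suspected relevant slurm.conf settings (assumption, not confirmed):
# cons_tres allocates individual cores/memory/GPUs instead of whole nodes,
# so two jobs each requesting --gres=gpu:1 can land on the same node.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
```

If this is the cause, OverSubscribe=NO can stay as-is: with a consumable-resource plugin, the two jobs are not oversubscribing anything, they are each allocated one distinct GPU.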