Last edited by naxiangzi on 2024-11-7 22:49
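When I test with sbatch test.sh, the job runs on the node I submit from, but nothing runs on the other node.
Setup: 2 machines, a master (control) node plus 1 compute node.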
CPU:
- Architecture: x86_64
- CPU op-mode(s): 32-bit, 64-bit
- Byte Order: Little Endian
- CPU(s): 152
- On-line CPU(s) list: 0-151
- Thread(s) per core: 2
- Core(s) per socket: 38
- Socket(s): 2
- NUMA node(s): 2
- Vendor ID: GenuineIntel
- BIOS Vendor ID: Intel(R) Corporation
- CPU family: 6
- Model: 106
- Model name: Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz
Memory:
- total used free shared buff/cache available
- Mem: 251Gi 5.3Gi 238Gi 4.0Gi 7.0Gi 239Gi
- Low: 251Gi 12Gi 238Gi
slurm.conf is configured as follows:
- ################################################
- # NODES #
- ################################################
- NodeName=master NodeAddr=192.168.0.100 CPUs=152 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=2450 Procs=1 State=UNKNOWN
- NodeName=node01 NodeAddr=192.168.0.101 CPUs=152 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=2450 Procs=1 State=UNKNOWN
- ################################################
- # PARTITIONS #
- ################################################
- PartitionName=compute Nodes=All Default=YES MaxTime=INFINITE State=UP
Script (copied from the internet, not sure whether it is correct):
- #!/bin/bash
- #SBATCH -J h5_group
- #SBATCH -p normal
- #SBATCH -N 2
- #SBATCH -n 1
- #SBATCH --mem=1G
- #SBATCH -D /public/home/xxx/xxx/HDF5/h5_test
- #SBATCH --gres=dcu:1
- #SBATCH -o h5_group.o%j
- #SBATCH -e h5_group.e%j
- echo "Start time: `date` "
- echo "SLURM_JOB_ID:$SLURM_JOB_ID"
- echo "SLURM_NNODES:$SLURM_NNODES"
- echo "SLURM_TASKS_PER_NODE:$SLURM_TASKS_PER_NODE"
- echo "SLURM_NTASKS:$SLURM_NTASKS"
- echo "SLURM_JOB_PARTITION:$SLURM_JOB_PARTITION"
- srun ./h5_group
- # (mpirun ./test)
- echo "End time: `date`"
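One thing that may be worth checking in the script above: it asks for -N 2 together with -n 1, which requests two nodes but only a single task, so srun would likely start just one process on one node. It also names partition normal, while the slurm.conf shown defines only compute. Below is a minimal sketch of a variant that asks for one task per node and prints each task's hostname. It assumes one process per node is what is wanted; --gres=dcu:1 and -D are left out because the slurm.conf excerpt shows no GRES definition and the working directory is site-specific.

```shell
#!/bin/bash
#SBATCH -J h5_group
#SBATCH -p compute            # the only partition defined in the slurm.conf shown above
#SBATCH -N 2                  # two nodes
#SBATCH --ntasks-per-node=1   # assumption: one task on each node is what is wanted
#SBATCH -o h5_group.o%j
#SBATCH -e h5_group.e%j

# Fall back to "unset" so the script also runs outside a Slurm job.
NNODES="${SLURM_NNODES:-unset}"
NTASKS="${SLURM_NTASKS:-unset}"
echo "SLURM_NNODES:$NNODES"
echo "SLURM_NTASKS:$NTASKS"

# Each task prints its hostname; with two tasks on two nodes,
# both master and node01 should appear in the job's output file.
if command -v srun >/dev/null 2>&1; then
    srun hostname
else
    echo "srun not found; submit this with sbatch on the cluster"
fi
```

If only one hostname shows up in h5_group.o<jobid>, the task layout rather than the program itself is probably the issue.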
squeue: both nodes are listed; when I run it again a little later, the output is empty.
- [hermit@master public]$ squeue
- JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
- 66 compute simple_m sutai R 1:06 2 master,node01
On master, htop shows CPU activity; on node01 it shows none.
[hermit@master ~]$ slurmd -C
NodeName=master CPUs=152 Boards=1 SocketsPerBoard=2 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=257268
UpTime=2-02:35:36
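For comparison, slurmd -C reports RealMemory=257268 (in MB) and 2 sockets for master, while the slurm.conf entry above has RealMemory=2450 and Procs=1. A sketch of the master line built from the slurmd -C output might look like the following. NodeAddr and State are carried over from the existing entry, and Procs=1 is dropped on the assumption that it conflicts with CPUs=152. The values for node01 are not assumed here; they would need to come from running slurmd -C on node01 itself.

```
# Node line taken from the slurmd -C output on master;
# NodeAddr and State are copied from the existing slurm.conf entry.
NodeName=master NodeAddr=192.168.0.100 CPUs=152 Boards=1 SocketsPerBoard=2 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=257268 State=UNKNOWN
```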