Create an HPC cluster in 10 minutes with one command: pcluster (ParallelCluster)
A brief introduction to pcluster on AWS
-- D.C
As the saying goes: pics or it didn't happen.
A few words up front
- For bioinformatics work the first step is not to read on, but to find the correct AMI. That image must come with the necessary pcluster components pre-installed, so this step is critical.
- Each pcluster version (currently 2.7.0) has its own set of AMIs, so make sure you pick the matching one. On the first screen of the EC2 launch wizard, search for parallelcluster-2.7.0; at the time of writing this returns 41 AMIs. Note down the AMI ID for the operating system you want. AMI IDs also differ between regions (a CLI alternative to the console search is sketched after the example table).
Example:
pcluster version | OS | Ningxia region AMI | Beijing region AMI |
---|---|---|---|
2.5.1 | ubuntu | ami-0202652c7cb199eb6 | ami-00881ffc995032786 |
2.5.1 | linux | ami-05038e0c41061b799 | ami-0bbadb6ff64415ab3 |
2.5.1 | ubuntu | ami-0b0ebbfcd0c50f225 | ami-0d3a6e7dd85085042 |
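If you prefer the CLI to the console search, here is a minimal sketch for listing the official AMIs (the --owners amazon filter and the region are assumptions; adjust or drop them if nothing comes back):
# list official ParallelCluster 2.7.0 AMIs in the Ningxia region (AMI ID + name)
aws ec2 describe-images \
    --region cn-northwest-1 \
    --owners amazon \
    --filters "Name=name,Values=aws-parallelcluster-2.7.0*" \
    --query 'Images[].[ImageId,Name]' \
    --output table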
- This walkthrough involves several machines:
Name | Description |
---|---|
Template machine | Launched from the official pcluster AMI; after you install your own software it is baked into the analysis AMI |
Control machine | Does not need to be launched from a pcluster AMI; used to install the pcluster CLI and control the cluster |
Master node | Schedules the cluster's jobs |
Compute nodes | Run the analysis workloads |
[Must read] Create the analysis AMI
- Launch an EC2 instance as the template machine (e.g. t2.micro), starting from the official ParallelCluster AMI [e.g. aws-parallelcluster-2.7.0-amzn-hvm-202005172030 (ami-0c7a09bc17088086c)].
- Deploy your analysis pipeline on this template machine, then create a new AMI from this EC2 instance and note the new AMI ID (e.g. ami-xxxxxxxxxxxxxx). This AMI becomes the launch AMI of the pcluster cluster, i.e. custom_ami in the template below (a CLI sketch for creating the AMI follows the flow summary).
- Note: for the one-command cluster to work, the ParallelCluster version installed on the control machine and the ParallelCluster version of the AMI chosen for the template machine must match exactly, e.g. both 2.7.0.
- So the overall flow is:
Official pcluster AMI (2.7, ami-0c7a09bc17088086c) --> launch template machine --> install analysis pipeline --> create an AMI snapshot --> obtain custom_ami
Control machine (default Amazon Linux 2 AMI) --> install pcluster 2.7 --> write the config --> run the create command to launch the cluster
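Baking the custom AMI can also be done from the CLI; a rough sketch where the instance ID and AMI name are placeholders for your own template machine:
# create an AMI from the template machine
aws ec2 create-image \
    --region cn-northwest-1 \
    --instance-id i-0123456789abcdef0 \
    --name "pcluster-2.7.0-my-pipeline" \
    --description "ParallelCluster 2.7.0 base AMI plus analysis pipeline"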
Launch a VM to act as the pcluster control server (v2.7.0)
- Pick an EC2 instance type
eg. t2.micro
Any OS works; Amazon Linux is the default choice.
- Remember your key pair name
eg. mykey.pem
- Log in to this machine, for example:
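(ec2-user is the default login for Amazon Linux; the IP is a placeholder for your control machine's public address.)
ssh -i mykey.pem ec2-user@<control-machine-public-ip>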
Install the pcluster control software
- Install python3, pip3, and pcluster
# install the python3 build dependencies
sudo yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
sudo yum install gcc -y
# install python3 and pip3
sudo yum -y install python3
python3 -m pip install --upgrade pip --user
# switch pip to the Tsinghua mirror (optional; faster inside China)
pip3 install pip -U
pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# install parallelcluster
pip3 install aws-parallelcluster --upgrade --user   # installs the latest pcluster version by default
pcluster version
TIPS:
If you installed the wrong pcluster version:
pip3 uninstall aws-parallelcluster          # remove the wrong version
pip3 install aws-parallelcluster==2.7.0     # install the exact version you need
- Configure the AWS credentials
$ aws configure
AWS Access Key ID [None]: AKXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: wJalrXUtnFEMI/KXXXXXXXXXXXXXXXXX
Default region name [us-east-1]: cn-northwest-1
Default output format [None]: json
# confirm the access key / secret key works (you can see your data on S3)
$ aws s3 ls
2020-01-29 11:10:27 lovevideo
Generate and edit the cluster config file, then launch pcluster with one command
mkdir -p ~/.parallelcluster && cd ~/.parallelcluster
vim config
Paste the configuration below into the file and edit it as indicated by the comments.
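As an aside, pcluster 2.x also ships an interactive generator that can write a starter version of this file for you; the hand-edited template below gives finer control:
pcluster configure    # interactive prompts for region, key pair, scheduler, and VPC/subnet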
Config template:
[aws]
aws_region_name = cn-northwest-1 # change if you want
[global]
cluster_template = myname # change if you want, MUST remember this name!
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[cluster myname] # change if you changed the name of cluster_template in [global] settings above
key_name = mytest # change to your keypair name
master_instance_type = c5.2xlarge # master node type
compute_instance_type = c5.large # compute node type
pre_install = https://awshcls.s3.cn-northwest-1.amazonaws.com.cn/efsfix.sh # needed when using EFS in the China regions
pre_install_args = NONE # needed when using EFS in the China regions
initial_queue_size = 1 # number of compute nodes when the cluster is created
max_queue_size = 30 # maximum number of compute nodes
maintain_initial_size = true # keep 1 compute node even when no jobs are submitted
master_root_volume_size = 25 # master root volume size in GB, 17 GB by default
compute_root_volume_size = 25 # compute root volume size in GB, 17 GB by default
cluster_type = spot # ondemand/spot
spot_price = 0.4 # set when using spot compute nodes; check the latest spot price for your instance type in the EC2 console
base_os = alinux2 # change if your template AMI does not use Amazon Linux 2
scheduler = sge # scheduler: sge, torque, slurm, or awsbatch
custom_ami = ami-xxxxxxxxxxxxxx # ami-0c7a09bc17088086c 2.7.0; ami-0c081e1551e30ee5a 2.6.1 ; change to your customized AMI based on pcluster ami
s3_read_resource = NONE
s3_read_write_resource = NONE
placement = compute
vpc_settings = default
#ebs_settings = custom1, custom2 # use EBS to be shared as NFS
efs_settings = custom1 # use EFS as the shared file system [recommended]
[vpc default]
vpc_id = vpc-6b111111 # change to your vpc id
master_subnet_id = subnet-a23v24c # change to your subnet id
#compute_subnet_id = subnet-a23v24c # if you want to put compute in private subnet for security
#[ebs custom1] # change or add more if you want
#shared_dir = data1 # the dir will show in your master or compute nodes
#volume_type = gp2
#volume_size = 80 # GB
#[ebs custom2] # change or add more if you want
#shared_dir = data2
#volume_type = gp2
#volume_size = 200 # GB
[efs custom1]
shared_dir = myefs # change as you like; it will be mounted as /myefs on the master and compute nodes
encrypted = false
performance_mode = generalPurpose
[scaling custom]
scaledown_idletime = 10 # compute nodes idle for longer than this many minutes are terminated to control cost
- Launch the cluster with:
pcluster <create/delete> <cluster_template>
In the example above cluster_template = myname. Cluster creation takes roughly 10 minutes.
$ pcluster create -c config myname
Beginning cluster creation for cluster: myname
Creating stack named: parallelcluster-myname
Status: parallelcluster-myname - CREATE_COMPLETE
MasterPublicIP: 161.111.111.111
ClusterUser: ec2-user
MasterPrivateIP: 172.11.11.11
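A few other pcluster 2.x subcommands are handy once the cluster is up (the cluster name and key file follow the example above):
pcluster list                        # list clusters in the current region
pcluster status myname               # CloudFormation status of the cluster
pcluster ssh myname -i mytest.pem    # SSH to the master node via the [aliases] entry
pcluster stop myname                 # stop the compute fleet; the master keeps running
pcluster delete myname               # tear the whole cluster down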
- PS: once the cluster has been created, the control machine can be shut down and started again only when needed, to save cost.
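If you prefer the CLI to the console for that (the instance ID is a placeholder for your control machine):
aws ec2 stop-instances --instance-ids i-0123456789abcdef0     # shut it down when idle
aws ec2 start-instances --instance-ids i-0123456789abcdef0    # bring it back when needed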
Log in to the master node and verify the cluster
- Go to the AWS console and check the status of the master and compute nodes.
- SSH to the master node:
ssh -i 'xx.pem' ec2-user@192.111.11.11
- If using Slurm (recommended):
$ cat > test.slurm <<'EOF'
#!/bin/bash
#SBATCH -J array
#SBATCH -p compute
#SBATCH -N 1
#SBATCH --cpus-per-task=1
#SBATCH -t 5:00
#SBATCH -a 0-2
input=(foo bar baz)
echo "This is job #\${SLURM_ARRAY_JOB_ID}, with parameter ${input[$SLURM_ARRAY_TASK_ID]}"
echo "There are \${SLURM_ARRAY_TASK_COUNT} task(s) in the array."
echo " Max index is \${SLURM_ARRAY_TASK_MAX}"
echo " Min index is \${SLURM_ARRAY_TASK_MIN}"
sleep 5
EOF
Submit the job with: sbatch test.slurm
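To confirm the array job actually ran, a couple of standard Slurm checks (sacct only returns data if job accounting is enabled):
squeue                # jobs still pending or running
sacct -j <jobid>      # state of a finished job, if accounting is enabled
ls slurm-*_*.out      # per-task output files written to the submit directory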
- If using PBS (Torque):
$ cat > test.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=2
sleep 600
EOF
qsub test.pbs
qstat
- If using SGE:
$ cat > test.sh <<'EOF'
#!/bin/bash
sleep 600
EOF
qsub -cwd -pe smp 2 -l vf=2G test.sh
qhost           # check the cluster execution hosts
df -h           # check the mounted volumes
qsub test.sh    # confirm that job submission works
qstat -f        # see job status
Submit your real jobs with something like: qsub -cwd -S /bin/bash -V -l vf=2G -pe smp 4 -o output -e output -q all.q yourscript.sh
About shared storage
We can usually build an NFS share on gp2 EBS volumes, which maps cleanly onto an on-premises setup, but high-IOPS workloads may hit a bottleneck there. Fortunately pcluster also offers other shared-storage options:
- EFS, the Elastic File System (the example above already uses EFS)
- FSx for Lustre
Taking EFS as an example, replace the EBS settings with the following:
efs_settings = customfs
[efs customfs]
shared_dir = efs
encrypted = false
performance_mode = generalPurpose
A cluster built this way gets reliable shared-storage performance from EFS, and capacity is elastic and billed by the space actually used. If you also make good use of EFS lifecycle management (configured on the EFS console page, or via the CLI sketch below), the cost stays well under control.
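For reference, the lifecycle policy can also be set from the CLI; a minimal sketch with a placeholder file system ID (AFTER_30_DAYS is one of the supported transition values):
# move files not accessed for 30 days to the cheaper EFS IA storage class
aws efs put-lifecycle-configuration \
    --file-system-id fs-xxxxxxxx \
    --lifecycle-policies TransitionToIA=AFTER_30_DAYS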
Note: reusing an existing EFS file system also works (settings below), but you must delete all of its mount targets beforehand. I have been bitten by this: when you create an EFS file system manually in the console, it automatically adds a mount target in every Availability Zone of the region. That is convenient for ordinary use, but such a file system cannot be used by pcluster; presumably the backend creates a mount target in the cluster's Availability Zone during cluster creation, and if one already exists the creation simply hangs.
The documentation says:
Specifying this option voids all other Amazon EFS options except for shared_dir. If you set this option to config_sanity, it only supports file systems:
That don't have a mount target in the stack's Availability Zone
OR
That do have an existing mount target in the stack's Availability Zone, with inbound and outbound NFS traffic allowed from 0.0.0.0/0.
To delete the mount targets: EFS console - click the file system ID - Network (bottom right) - Manage - remove the mount target in every Availability Zone - Save. The same can be done from the CLI, as sketched below.
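A rough CLI equivalent (using the same file system ID as the example below; each mount target ID comes from the describe call):
aws efs describe-mount-targets --file-system-id fs-302c28d5     # list the existing mount targets
aws efs delete-mount-target --mount-target-id fsmt-xxxxxxxx     # delete each one by its ID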
efs_settings = customfs
[efs customfs]
shared_dir = efs
efs_fs_id = fs-302c28d5
For debugging
vi ~/.bashrc
# add the following two lines:
export SGE_ROOT=/opt/sge
PATH=/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/amazon/openmpi/bin:$PATH
source ~/.bashrc
# restart the SGE master if needed:
sudo /etc/init.d/sgemaster.p6444 <start/stop/restart>
For autoscaling with a mix of instance types [advanced]
wget https://awshcls.s3.cn-northwest-1.amazonaws.com.cn/pcluster/asgmodify.json
# edit the ASG name, the LaunchTemplateName, and the instance types you want
vi asgmodify.json
aws autoscaling update-auto-scaling-group --cli-input-json file://asgmodify.json --profile zhy
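I am not reproducing the exact contents of asgmodify.json here, but the JSON that update-auto-scaling-group --cli-input-json expects looks roughly like the sketch below; the ASG name, launch template name, and instance types are placeholders to replace with your own values:
{
  "AutoScalingGroupName": "parallelcluster-myname-ComputeFleet-XXXXXXXX",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "your-compute-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "c5.large"},
        {"InstanceType": "c5.xlarge"},
        {"InstanceType": "c4.large"}
      ]
    },
    "InstancesDistribution": {
      "SpotAllocationStrategy": "lowest-price"
    }
  }
}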
The road has no end. Just get it done.