Advanced 11: Using the S3 data sync power tool, with all the data at my fingertips

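Advanced 9 mentioned an S3 power tool, the Amazon S3 multi-thread resumable migration tool, and today I finally have time to introduce it. For us bioinformatics folks it is a real blessing: overseas databases routinely run to gigabytes, and download speeds of a few KB/s used to drive us crazy. With this tool we no longer have to worry. -- D.C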

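The usual introduction first. The tool's full name is Amazon S3 MultiThread Resume Migration Solution. There is no official abbreviation, so for convenience let's call it SMRMS, which is easy to remember: SM on both sides and an R (Ren, "person" in Chinese) in the middle.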





In a typical test, migrating 1.2 TB of data from S3 in the US East region (us-east-1) to S3 in the China Ningxia region (cn-northwest-1) took only 1 hour.
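
To put that figure in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and exactly 3600 seconds, neither of which the test report states, so treat the result as an order of magnitude only.

```python
from typing import Tuple

def average_throughput(total_bytes: float, seconds: float) -> Tuple[float, float]:
    """Return the average rate as (MB/s, Gbit/s), using decimal units."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9

# Assumed inputs: 1.2 TB moved in one hour.
mb_per_s, gbit_per_s = average_throughput(1.2e12, 3600)
print("%.0f MB/s, %.2f Gbit/s" % (mb_per_s, gbit_per_s))  # 333 MB/s, 2.67 Gbit/s
```

Compared with the few KB/s mentioned at the top, that is several orders of magnitude faster.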


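Question: first think about what your use case is. Do you need to move a large amount of data, or do you only occasionally need to grab a database from overseas?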


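Version	Scenario
Single-node	One-off migration jobs. Three modes: LOCAL_TO_S3 (upload from local disk); S3_TO_S3 (light to medium volume, run once); ALIOSS_TO_S3 (Alibaba Cloud OSS to S3).
Cluster	Large numbers of files, with single files from 0 bytes up to the TB range. Scheduled scans or real-time data sync (S3 triggers SQS). Transfers can be triggered by new S3 files, or Jobsender can scan existing S3 files on a schedule.
Serverless	Light to medium volume (single files under 50 GB recommended), occasional transfers or real-time data sync. Thanks to resumable transfer and SQS redrive, the 15-minute Lambda timeout is not a concern. Transfers can be triggered by new S3 files, or Jobsender can scan existing S3 files on a schedule.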






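Requirements for the single-node version: Python 3.6 or later, plus boto3, the AWS SDK for Python (covered below). If you also need to pull data from Alibaba Cloud, install oss2, the Alibaba Cloud OSS SDK for Python, as well.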

PS: the downloaded package contains a requirements.txt file; if in doubt, run it through pip.

$ pip install -r requirements.txt --user
Requirement already satisfied: boto3 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (1.10.34)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.9.4)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.2.1)
Requirement already satisfied: botocore<1.14.0,>=1.13.34 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (1.13.34)
Requirement already satisfied: python-dateutil<2.8.1,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (2.8.0)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (0.15.2)
Requirement already satisfied: urllib3<1.26,>=1.20; python_version >= "3.4" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.25.7)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.8.1,>=2.1; python_version >= "2.7"->botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.13.0)
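
If you would rather check the prerequisites from Python itself, something like the sketch below works. The helper name missing_requirements is made up for this article and is not part of the tool; it only checks the Python version and whether a module can be found, without importing it.

```python
import importlib.util
import sys

def missing_requirements(modules, min_python=(3, 6)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ required" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append("missing module: " + name)
    return problems

# Add "oss2" to the list if you plan to use ALIOSS_TO_S3.
print(missing_requirements(["boto3"]) or "environment looks ready")
```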




python3 --version
Python 3.7.8
[ec2-user@ip-172-xx-xx-xxx ~]$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************6BXX         iam-role
secret_key     ****************6qXX         iam-role
    region                us-west-2      config-file    ~/.aws/config


#!/bin/bash -v
yum update -y
yum install git -y
yum install python3 -y
pip3 install boto3

# Setup BBR
echo "Setup BBR"
cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
exec /sbin/modprobe sch_fq >/dev/null 2>&1
EOF
chmod 755 /etc/sysconfig/modules/tcpcong.modules
echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
modprobe tcp_bbr
modprobe sch_fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

echo "Download application amazon-s3-resumable-upload.git"
cd /home/ec2-user/  || exit
git clone https://github.com/aws-samples/amazon-s3-resumable-upload

Assuming the script above is saved as ec2_init.sh, make it executable and run it:

$ chmod 755 ec2_init.sh
$ sudo bash ec2_init.sh

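My scenario is transferring between overseas AWS and AWS China, so BBR can be enabled to improve transfer efficiency. What is BBR? It stands for TCP Bottleneck Bandwidth and RTT, a congestion control algorithm supported by the Amazon Linux AMI kernel. It is off by default and can be turned on as follows: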

$ sudo modprobe tcp_bbr

$ sudo modprobe sch_fq

$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr


To keep BBR enabled after a reboot, persist the settings as root:

$ sudo su -

$ cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
> exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
> exec /sbin/modprobe sch_fq >/dev/null 2>&1
> EOF

$ chmod 755 /etc/sysconfig/modules/tcpcong.modules

$ echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
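
To confirm that the kernel really switched to BBR, the active algorithm can be read back from /proc. This is a minimal sketch for Linux; the function names are illustrative, and on a machine without that /proc entry it simply reports False.

```python
from pathlib import Path

# Linux exposes the active TCP congestion control algorithm here.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_congestion_control")

def current_congestion_control(path=SYSCTL_PATH):
    """Return the active algorithm name, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def bbr_enabled(path=SYSCTL_PATH):
    """True only when the kernel reports bbr as the active algorithm."""
    return current_congestion_control(path) == "bbr"

print("BBR enabled:", bbr_enabled())
```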


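The two profiles referenced in the configuration below (oregon for the source, ningxia for the destination) go into the credentials file, each under its own section header:

$ vi ~/.aws/credentials

[oregon]
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx

[ningxia]
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx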
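
A quick sanity check of the credentials file can save a failed run later. The sketch below assumes the two profile names used in this article, oregon and ningxia, and a hypothetical helper missing_profiles; it only checks that each profile has both keys, not that the keys are valid.

```python
import configparser
from pathlib import Path

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")

def missing_profiles(credentials_text, required):
    """Return the profile names in `required` that lack a complete key pair."""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(credentials_text)
    except configparser.Error:
        # E.g. keys written without a [profile] header: nothing is usable.
        return list(required)
    return [
        name for name in required
        if not all(parser.has_option(name, key) for key in REQUIRED_KEYS)
    ]

path = Path.home() / ".aws" / "credentials"
text = path.read_text() if path.is_file() else ""
print("missing profiles:", missing_profiles(text, ["oregon", "ningxia"]))
```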

Go to the /home/ec2-user/amazon-s3-resumable-upload/single_node folder, open the file s3_upload_config.ini in vim, and change mainly the following parameters:

JobType = S3_TO_S3  # change this
# 'LOCAL_TO_S3' | 'S3_TO_S3' | 'ALIOSS_TO_S3'

DesBucket = mybucket  # change this
# Destination S3 bucket name, type = str

S3Prefix = test  # change this; to sync one folder inside the bucket, put that folder's name here
# S3_TO_S3 mode: source S3 prefix, also used as the destination S3 prefix. LOCAL_TO_S3 mode: destination S3 prefix. type = str

SrcFileIndex = *
# File name to upload; use the wildcard "*" to upload all files. type = str

DesProfileName = ningxia  # change this; credentials profile name for the China region, ningxia in this article
# Profile name configured in ~/.aws/credentials that can access the destination S3. It is the destination account profile.

SrcDir = d:\mydir
# Source file directory on local disk. Unused in S3_TO_S3 mode. type = str

SrcBucket = mybucket  # change this
# Source bucket name. Unused in LOCAL_TO_S3 mode.

SrcProfileName = oregon  # change this; credentials profile name for the overseas region, oregon in this article
# Profile name configured in ~/.aws/credentials that can access the source S3. It is the source account profile. Unused in LOCAL_TO_S3 mode.

[ALIOSS_TO_S3] # for Alibaba Cloud OSS to AWS China, the Alibaba Cloud AK/SK must also be set here
ali_SrcBucket = img-process
ali_access_key_id = xxxx
ali_access_key_secret = xxx
ali_endpoint = oss-cn-beijing.aliyuncs.com

ChunkSize = 5
# File chunk size in MB, not less than 5 MB. The number of parts per file cannot exceed 10,000 (a limit of the S3 multipart upload API), so the application adjusts this value automatically according to file size; you normally do not need to change it. type = int

MaxRetry = 20
# Max retries when an S3 API call fails. type = int

MaxThread = 5
# Max threads for ONE file. type = int

MaxParallelFile = 5
# Max number of files processed in parallel, i.e. total concurrent threads = MaxParallelFile * MaxThread. type = int

StorageClass = STANDARD  # depends on what kind of files are being synced; this article downloads a database, so the default is kept

ifVerifyMD5 = False
# Whether to do a second MD5 check on the whole file.
# If True, after the parts of a file are merged, the ETag MD5 is verified a second time for the whole file.
# In S3_TO_S3 mode, True forces all already-transferred parts to be re-downloaded when resuming from a break point in order to calculate the MD5, but parts already uploaded are not re-uploaded.
# In LOCAL_TO_S3 mode, True re-reads the whole file after its upload finishes and calculates the local MD5 to compare with the S3 ETag.
# This switch does not affect the MD5 verification of each part upload; even when False, every part's MD5 is still verified.

DontAskMeToClean = True
# If True: when an unfinished upload already exists, the tool does not ask whether to clean the unfinished parts on the destination S3. It does not clean them and resumes the upload from the break point.

LoggingLevel = INFO
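
The comments on ChunkSize, MaxThread and MaxParallelFile above boil down to a little arithmetic. The sketch below is my own illustration of it, not the tool's actual code: plan_parts grows the chunk to the smallest whole number of MB that keeps a file within 10,000 parts, and total_threads is simply the product the config comment describes.

```python
MB = 1024 * 1024
MAX_PARTS = 10000   # S3 multipart upload limit on parts per object
MIN_CHUNK_MB = 5    # smallest part size the config allows

def plan_parts(file_size, chunk_size_mb=MIN_CHUNK_MB, max_parts=MAX_PARTS):
    """Return (chunk_size_bytes, part_count) with part_count <= max_parts."""
    chunk = max(chunk_size_mb, MIN_CHUNK_MB) * MB
    if file_size > chunk * max_parts:
        # Integer ceiling division: smallest whole-MB chunk that fits.
        chunk = -(-file_size // (max_parts * MB)) * MB
    parts = max(1, -(-file_size // chunk))
    return chunk, parts

def total_threads(max_parallel_file, max_thread):
    """Total concurrent threads, as described for MaxParallelFile."""
    return max_parallel_file * max_thread

for size_gb in (1, 100):
    chunk, parts = plan_parts(size_gb * 1024 * MB)
    print("%d GB file: %d MB chunks, %d parts" % (size_gb, chunk // MB, parts))
print("threads:", total_threads(5, 5))  # 25 with the values above
```

With the defaults shown, a 1 GB file stays at 5 MB chunks, while a 100 GB file needs larger chunks to fit under the part limit.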



MISSION ACCOMPLISHED - Time: 0:31:39.284875  - FROM: mybucket/test TO mybucket/test
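
If you want to double-check a finished object yourself, along the lines of the ifVerifyMD5 switch in the configuration, it helps to know how the ETag of a multipart upload is commonly derived: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The sketch below assumes an object without KMS encryption and that you know the chunk size that was used; it is an illustration, not the tool's own verification code.

```python
import hashlib

def multipart_etag(parts):
    """Compute the ETag S3 usually reports for a multipart upload.

    `parts` is an iterable of bytes objects, one per uploaded part.
    """
    digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return "{}-{}".format(combined, len(digests))

def split(data, chunk_size):
    """Yield consecutive chunk_size slices of data."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]

data = b"x" * (12 * 1024 * 1024)                     # stand-in for a 12 MB object
print(multipart_etag(split(data, 5 * 1024 * 1024)))  # suffix is "-3": three parts
```

For real files, read the parts from disk in chunks instead of holding the whole object in memory.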

