Advanced 11: With the S3 data sync powerhouse, your data is at your fingertips
Back in Advanced 9 I mentioned an S3 power tool, the Amazon S3 multithread resumable migration tool, and today I finally have time to introduce it. For bioinformatics folks like us it is a genuine blessing: overseas databases routinely run into the gigabytes, and download speeds of a few KB/s used to drive us crazy. With this tool, those days are over. -- D.C
The customary introduction: the tool's full name is Amazon S3 MultiThread Resume Migration Solution (multithread resumable migration for Amazon S3) [link]. There is no official abbreviation, so to make it easy to remember let's call it SMRMS: SM on both ends, with an R (for 人, ren, "person") squeezed in the middle.
Use cases:
- Uploading from a local server to Amazon S3, or downloading from it
- Data sync between overseas and China-region Amazon S3
- Migrating large amounts of data from Alibaba Cloud OSS to Amazon S3 (this one is impressive!)
There are three versions:
- Single-node version: source is local / S3 / Alibaba Cloud OSS, destination is S3 / local.
- Multi-server cluster version: source is S3, destination is S3 / local.
- Serverless AWS Lambda version: source is S3, destination is S3 / local.
It also supports S3 versioning, and either event-triggered transfer or scheduled scanning.
What makes it so nice:
- Multithreaded concurrency that makes full use of your bandwidth.
- Resumable transfers with automatic retries. This one is a lifesaver!
- Supports all S3 storage classes: Standard, Infrequent Access (IA), Glacier, and Deep Archive.
- Optimized flow control.
In a typical test, migrating 1.2 TB of data from S3 in us-east-1 (US East) to S3 in cn-northwest-1 (Ningxia, China) took only 1 hour.
Which version should I choose?
Question: first think about your own use case. Are you migrating a large volume of data, or do you just occasionally need to pull a database from overseas?
Here is a quick summary:
Version | Scenario |
---|---|
Single-node | One-off migration jobs. Three modes: LOCAL_TO_S3 (local upload); S3_TO_S3 (light-to-medium volume, run once); ALIOSS_TO_S3 (Alibaba Cloud OSS to S3). |
Cluster | Large numbers of files, single files from 0 bytes up to the TB range. Scheduled scanning or real-time data sync (S3 event triggers SQS). Supports triggering on newly added S3 objects, or a Jobsender that periodically scans existing S3 objects. |
Serverless | Light-to-medium volume (single file < 50 GB recommended), ad-hoc transfers or real-time data sync. Thanks to resumable transfers and SQS re-drive, you don't need to worry about the 15-minute Lambda timeout. Supports triggering on newly added S3 objects, or a Jobsender that periodically scans existing S3 objects. |
For more information, see the link at the top of this article; the README written by the AWS architects is almost absurdly detailed!
Preparation: installing the tool
My needs are simple: download one overseas database. So I went with the single-node version's S3_TO_S3 mode.
The figure below gives the complete overview of the single-node version.
Requirements for the single-node version: Python 3.6 or above, plus the AWS SDK for Python, boto3 (more on that later). If you also want to pull data from Alibaba Cloud, you additionally need the Alibaba Cloud SDK oss2.
PS: the downloaded package includes a requirements.txt file; if in doubt, just run it:
$ pip install -r requirements.txt --user
Requirement already satisfied: boto3 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (1.10.34)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.9.4)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.2.1)
Requirement already satisfied: botocore<1.14.0,>=1.13.34 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (1.13.34)
Requirement already satisfied: python-dateutil<2.8.1,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (2.8.0)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (0.15.2)
Requirement already satisfied: urllib3<1.26,>=1.20; python_version >= "3.4" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.25.7)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.8.1,>=2.1; python_version >= "2.7"->botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.13.0)
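If you want a quick sanity check that boto3 is actually importable under your Python 3, a one-liner does it (the version printed matches the pip output above):
$ python3 -c "import boto3; print(boto3.__version__)"
1.10.34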
For my scenario, the basic idea is:
- Create (or already have) an overseas AWS account (you just need to attach a credit card; it's open to individuals too), launch an EC2 instance to do the downloading, then use the aws s3 command to push the data into an S3 bucket.
- On the overseas instance, install the S3 tool and configure the access key / secret key (AK/SK) for both accounts, plus the SMRMS config file.
- Run the sync command.
Of course this is fairly crude; the gurus in the group could easily turn it into one-click automation with the AWS SDK. Today we'll just walk through the basic flow, starting with the sketch below.
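Assuming the bucket and prefix names used later in this article (mybucket, test/) and a placeholder database URL, step one on the overseas EC2 boils down to something like:
$ wget https://ftp.example.org/db/test.1a.zip          # placeholder URL for the overseas database
$ aws s3 cp test.1a.zip s3://mybucket/test/            # push it into the source bucket in us-west-2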
Launch the overseas VM
- I launched an instance in us-west-2 (Oregon), an m5a.large. Key point: the download workload is CPU-hungry, so do not pick a t-series instance; the C and M families are recommended.
- Then log in and check that the Python version is 3.6 or above:
python3 --version
Python 3.7.8
- Configure the AK/SK (here the source-side keys come from the instance's IAM role, as the output shows):
[ec2-user@ip-172-xx-xx-xxx ~]$ aws configure list
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************6BXX iam-role
secret_key ****************6qXX iam-role
region us-west-2 config-file ~/.aws/config
Install the SMRMS tool
- Since I'm going to run the tool on EC2, below is a shell script that quickly installs the required environment and packages. Copy it onto the EC2 instance and save it as ec2_init.sh. (PS: for other platforms, such as local Linux, Mac, or the local Windows GUI, see here.)
#!/bin/bash -v
yum update -y
yum install git -y
yum install python3 -y
pip3 install boto3
# Setup BBR
echo "Setup BBR"
cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
#!/bin/bash
exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
exec /sbin/modprobe sch_fq >/dev/null 2>&1
EOF
chmod 755 /etc/sysconfig/modules/tcpcong.modules
echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
modprobe tcp_bbr
modprobe sch_fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
echo "Download application amazon-s3-resumable-upload.git"
cd /home/ec2-user/ || exit
git clone https://github.com/aws-samples/amazon-s3-resumable-upload
- Then run the script; it will create a folder named amazon-s3-resumable-upload in the current directory:
chmod 755 ec2_init.sh
sudo bash ec2_init.sh
$ tree
.
├── cluster
│ ├── cdk-cluster
│ │ ├── app.py
│ │ ├── cdk
│ │ │ ├── cdk_ec2_stack.py
│ │ │ ├── cdk_resource_stack.py
│ │ │ ├── cdk_vpc_stack.py
│ │ │ ├── cw_agent_config.json
│ │ │ ├── __init__.py
│ │ │ ├── user_data_jobsender.sh
│ │ │ ├── user_data_part1.sh
│ │ │ ├── user_data_part2.sh
│ │ │ └── user_data_worker.sh
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── code
│ │ │ ├── requirements.txt
│ │ │ ├── s3_migration_cluster_config.ini
│ │ │ ├── s3_migration_cluster_jobsender.py
│ │ │ ├── s3_migration_cluster_worker.py
│ │ │ ├── s3_migration_ignore_list.txt
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── img
│ │ ├── 02-jobsender.png
│ │ ├── 02-new.png
│ │ ├── 03.png
│ │ ├── 04.png
│ │ ├── 05.png
│ │ ├── 07.png
│ │ ├── 08.png
│ │ ├── 09.png
│ │ └── 0a.png
│ ├── old-cdk-cluster-0.96
│ │ ├── app.py
│ │ ├── cdk
│ │ │ ├── cdk_ec2_stack.py
│ │ │ ├── cdk_resource_stack.py
│ │ │ ├── cdk_vpc_stack.py
│ │ │ ├── cw_agent_config.json
│ │ │ ├── user_data_jobsender.sh
│ │ │ ├── user_data_part1.sh
│ │ │ ├── user_data_part2.sh
│ │ │ └── user_data_worker.sh
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── code
│ │ │ ├── requirements.txt
│ │ │ ├── s3_migration_cluster_config.ini
│ │ │ ├── s3_migration_cluster_jobsender.py
│ │ │ ├── s3_migration_cluster_worker.py
│ │ │ ├── s3_migration_ignore_list.txt
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── README-English.md
│ └── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── img
│ ├── 01.png
│ └── 02.png
├── LICENSE
├── README.md
├── serverless
│ ├── cdk-serverless
│ │ ├── app.py
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── lambda
│ │ │ ├── lambda_function_jobsender.py
│ │ │ ├── lambda_function_worker.py
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── s3_migration_ignore_list.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── img
│ │ ├── 01.png
│ │ ├── 02-jobsender.png
│ │ ├── 02-new.png
│ │ ├── 05.png
│ │ ├── 06.png
│ │ ├── 07b.png
│ │ └── 09.png
│ ├── old-cdk-serverless-0.96
│ │ ├── app.py
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── lambda
│ │ │ ├── lambda_function_jobsender.py
│ │ │ ├── lambda_function_worker.py
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── s3_migration_ignore_list.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── README-English.md
│ └── README.md
├── single_node
│ ├── ec2_init.sh
│ ├── img
│ │ ├── img01.png
│ │ ├── img02.png
│ │ ├── img03.png
│ │ ├── img04.png
│ │ ├── img05.png
│ │ └── img06.png
│ ├── os_x
│ ├── README.md
│ ├── requestPayer-exampleCodeFrom-\344\270\201\345\217\257_s3_download.py
│ ├── requirements.txt
│ ├── s3_download_config.ini
│ ├── s3_download_config.ini.default
│ ├── s3_download.py
│ ├── s3_upload_config.ini
│ ├── s3_upload_config.ini.default
│ ├── s3_upload.py
│ └── win64
│ ├── s3_download.zip
│ └── s3_upload.zip
└── tools
├── analystic_dynamodb_table.py
├── clean_unfinished_multipart_upload.py
└── README.md
20 directories, 112 files
Key step! Configuring SMRMS
- Enable BBR.
Since my scenario transfers between overseas AWS and China-region AWS, I can enable BBR to improve transfer efficiency. What is BBR? TCP Bottleneck Bandwidth and RTT (BBR) congestion control, a feature supported by the Amazon Linux AMI kernel but not enabled by default. Enable it like this:
$ sudo modprobe tcp_bbr
$ sudo modprobe sch_fq
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
To enable it permanently:
$ sudo su -
$ cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
> #!/bin/bash
> exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
> exec /sbin/modprobe sch_fq >/dev/null 2>&1
> EOF
$ chmod 755 /etc/sysconfig/modules/tcpcong.modules
$ echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
- Configure the credentials file.
Since I'm transferring from overseas S3 to China-region S3, two accounts are involved, so two sets of AK/SK need to be configured.
$ vi ~/.aws/credentials
[ningxia]
region=cn-northwest-1
aws_access_key_id=xxxxxxxxxxxxxxx
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxx
[oregon]
region=us-west-2
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx
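(If you'd rather not edit the file by hand, aws configure --profile ningxia sets up the same entries interactively.) Before running the tool it is worth confirming that each profile can reach its own side; the bucket name below is the one used in this walkthrough:
$ aws s3 ls s3://mybucket --profile oregon             # source side: list the overseas bucket
$ aws s3 ls --profile ningxia --region cn-northwest-1  # destination side: list buckets in the Ningxia account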
- Configure the SMRMS config file.
Go into the /home/ec2-user/amazon-s3-resumable-upload/single_node folder, find the file s3_upload_config.ini, open it with vim, and edit mainly the following parameters:
[Basic]
JobType = S3_TO_S3 # change this
# 'LOCAL_TO_S3' | 'S3_TO_S3' | 'ALIOSS_TO_S3'
DesBucket = mybucket # change this
# Destination S3 bucket name, type = str
S3Prefix = test # change this; if syncing a specific folder in the bucket, put that folder name here
# S3_TO_S3 mode: source S3 prefix, same as destination S3 prefix; LOCAL_TO_S3 mode: this is the destination S3 prefix. type = str
SrcFileIndex = *
# Specify the file name to upload. Wildcard "*" to upload all. type = str
DesProfileName = ningxia # change this; the China-region credentials profile name, "ningxia" in this article
# Profile name configured in ~/.aws/credentials with access to the destination S3
[LOCAL_TO_S3]
SrcDir = d:\mydir
# Source file directory. Not used in S3_TO_S3 mode. type = str
[S3_TO_S3]
SrcBucket = mybucket # change this
# Source bucket name. Not used in LOCAL_TO_S3 mode.
SrcProfileName = oregon # change this; the overseas-region credentials profile name, "oregon" in this article
# Profile name configured in ~/.aws/credentials with access to the source S3. Not used in LOCAL_TO_S3 mode.
[ALIOSS_TO_S3] # if going from Alibaba Cloud OSS to China-region AWS, put the Alibaba AK/SK here as well
ali_SrcBucket = img-process
ali_access_key_id = xxxx
ali_access_key_secret = xxx
ali_endpoint = oss-cn-beijing.aliyuncs.com
[Advanced]
ChunkSize = 5
# File chunk size in MBytes, not less than 5 MB. A single file can have at most 10,000 parts (an S3 multipart upload API limit), so the application auto-adjusts this value to the file size; you normally don't need to change it. type = int
MaxRetry = 20
# Max retry count when an S3 API call fails. type = int
MaxThread = 5
# Max threads for ONE file. type = int
MaxParallelFile = 5
# Max files transferred in parallel, i.e. concurrent threads = MaxParallelFile * MaxThread. type = int
StorageClass = STANDARD # depends on what you are syncing; this article just pulls a database, so keep the default
# 'STANDARD'|'REDUCED_REDUNDANCY'|'STANDARD_IA'|'ONEZONE_IA'|'INTELLIGENT_TIERING'|'GLACIER'|'DEEP_ARCHIVE'
ifVerifyMD5 = False
# Whether to do a second, whole-file MD5 check.
# If True, after the parts are merged the ETag MD5 of the whole file is verified a second time.
# In S3_TO_S3 mode, True forces re-downloading all already-transferred parts on break-point resume in order to compute the MD5 (already-uploaded parts are not re-uploaded).
# In LOCAL_TO_S3 mode, True re-reads the whole file after the upload finishes and compares its MD5 with the S3 ETag.
# This switch does not affect per-part verification; even if False, every part's MD5 is still verified.
DontAskMeToClean = True
# If True: when an unfinished upload exists, do not ask whether to clean the unfinished parts on the destination S3; just carry on and resume the break-point upload.
LoggingLevel = INFO
# 'WARNING' | 'INFO' | 'DEBUG'
Save and quit.
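One note on the defaults: the overall concurrency is MaxParallelFile × MaxThread = 5 × 5 = 25 part transfers at most, but MaxThread is per file, so a single large database file like mine moves at most 5 parts at a time (which matches the transfer log below); on a larger instance that is presumably the knob to raise.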
Run the sync program SMRMS
- Download the database file test.1a.zip. Since this EC2 instance is in the US, the download finishes quickly; then push it with the aws s3 cp command to the overseas S3://mybucket/test/.
- With everything configured, you can now run the SMRMS program to sync. The speed is rock solid:
$ python3 /home/ec2-user/amazon-s3-resumable-upload/single_node/s3_upload.py --nogui
Reading config file: s3_upload_config.ini
Logging to file: /home/ec2-user/amazon-s3-resumable-upload/single_node/log/s3_upload-2020-08-23T13-06-09.log
Logging level: INFO
2020-08-23 13:06:09,201 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:09,242 INFO - Checking write permission for: davischen
2020-08-23 13:06:10,668 INFO - Get source file list
2020-08-23 13:06:10,686 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:10,708 INFO - Get s3 file list davischen
2020-08-23 13:06:10,764 INFO - Bucket list length:4
2020-08-23 13:06:10,765 INFO - Get s3 file list davischen
2020-08-23 13:06:11,028 INFO - Bucket list length:9
2020-08-23 13:06:11,028 INFO - Get unfinished multipart upload
2020-08-23 13:06:11,823 INFO - Start file: test/
2020-08-23 13:06:11,823 INFO - Start file: test/AtomSetup-x64.exe
2020-08-23 13:06:11,823 INFO - Duplicated. test/ same size, goto next file.
2020-08-23 13:06:11,823 INFO - Duplicated. test/AtomSetup-x64.exe same size, goto next file.
2020-08-23 13:06:11,824 INFO - Start file: test/test.1a.zip
2020-08-23 13:06:11,824 INFO - Start file: test/nt.26.tar.gz
2020-08-23 13:06:11,825 INFO - New upload: test/test.1a.zip
2020-08-23 13:06:11,825 INFO - Duplicated. test/nt.26.tar.gz same size, goto next file.
--->Downloading test/test.1a.zip - 1/5786
--->Downloading test/test.1a.zip - 2/5786
--->Downloading test/test.1a.zip - 3/5786
--->Downloading test/test.1a.zip - 4/5786
--->Downloading test/test.1a.zip - 5/5786
--->Uploading test/test.1a.zip - 1/5786
--->Uploading test/test.1a.zip - 4/5786
--->Uploading test/test.1a.zip - 3/5786
--->Uploading test/test.1a.zip - 2/5786
--->Uploading test/test.1a.zip - 5/5786
--->Complete test/test.1a.zip - 5/5786 0.02% - 1.7 MB/s
--->Downloading test/test.1a.zip - 6/5786
--->Complete test/test.1a.zip - 2/5786 0.03% - 1.6 MB/s
--->Downloading test/test.1a.zip - 7/5786
--->Uploading test/test.1a.zip - 6/5786
--->Uploading test/test.1a.zip - 7/5786
--->Complete test/test.1a.zip - 6/5786 0.05% - 5.4 MB/s
--->Downloading test/test.1a.zip - 8/5786
--->Uploading test/test.1a.zip - 8/5786
--->Complete test/test.1a.zip - 7/5786 0.07% - 4.5 MB/s
--->Downloading test/test.1a.zip - 9/5786
--->Uploading test/test.1a.zip - 9/5786
--->Complete test/test.1a.zip - 8/5786 0.09% - 5.6 MB/s
--->Downloading test/test.1a.zip - 10/5786
--->Complete test/test.1a.zip - 1/5786 0.10% - 1.0 MB/s
--->Downloading test/test.1a.zip - 11/5786
--->Complete test/test.1a.zip - 9/5786 0.12% - 6.2 MB/s
--->Downloading test/test.1a.zip - 12/5786
--->Uploading test/test.1a.zip - 10/5786
--->Uploading test/test.1a.zip - 11/5786
--->Uploading test/test.1a.zip - 12/5786
--->Complete test/test.1a.zip - 10/5786 0.14% - 5.2 MB/s
--->Downloading test/test.1a.zip - 13/5786
--->Complete test/test.1a.zip - 12/5786 0.16% - 5.5 MB/s
...
MISSION ACCOMPLISHED - Time: 0:31:39.284875 - FROM: mybucket/test TO mybucket/test
- A 28 GB file pulled in through this tool in just 31 minutes, whereas downloading over the public internet at a dozen KB/s... let's not even go there. Afterwards you can see the data already sitting in the Ningxia-region S3://mybucket/test/.
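A quick back-of-envelope check on those numbers: the log shows test/test.1a.zip split into 5,786 parts, which at the 5 MB ChunkSize is about 5,786 × 5 MB ≈ 28 GB, matching the file size; and 28 GB in roughly 31.5 minutes works out to around 15 MB/s (~120 Mbit/s) sustained from us-west-2 to cn-northwest-1.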
Closing remarks
- This article only tests the single-node version's S3_TO_S3 mode and does not walk through every other scenario, so if your use case is different, please refer to the official documentation linked at the top. This post is just meant to get the ball rolling, and everyone is welcome to discuss it further in the tech group.
Our greatest weakness lies in the craving to be recognized.