Advanced 11: With the S3 data sync powerhouse, your data is at your fingertips
Back in Advanced 9 I mentioned an S3 power tool, the Amazon S3 multithread resumable migration tool, and today I finally have time to introduce it. For bioinformatics folks like us it is a genuine blessing: overseas databases routinely run into the gigabytes, and download speeds of a few KB/s used to drive us crazy. With this tool, those days are over. -- D.C
The customary introduction: the tool's full name is Amazon S3 MultiThread Resume Migration Solution (multithread resumable migration for Amazon S3) [link]. There is no official abbreviation, so to make it easy to remember let's call it SMRMS: SM on both ends, with an R (for 人, ren, "person") squeezed in the middle.
Use cases:
- Uploading from a local server to Amazon S3, or downloading from it
- Data sync between overseas and China-region Amazon S3
- Migrating large amounts of data from Alibaba Cloud OSS to Amazon S3 (this one is impressive!)
There are three versions:
- Single-node version: source is local / S3 / Alibaba Cloud OSS, destination is S3 / local.
- Multi-server cluster version: source is S3, destination is S3 / local.
- Serverless AWS Lambda version: source is S3, destination is S3 / local.
It also supports S3 versioning, and either event-triggered transfer or scheduled scanning.
What makes it so nice:
- Multithreaded concurrency that makes full use of your bandwidth.
- Resumable transfers with automatic retries. This one is a lifesaver!
- Supports all S3 storage classes: Standard, Infrequent Access (IA), Glacier, and Deep Archive.
- Optimized flow control.
In a typical test, migrating 1.2 TB of data from S3 in us-east-1 (US East) to S3 in cn-northwest-1 (Ningxia, China) took only 1 hour.
Which version should I choose?
Question: first think about your own use case. Are you migrating a large volume of data, or do you just occasionally need to pull a database from overseas?
Here is a quick summary:
Version | Scenario |
---|---|
Single-node | One-off migration jobs. Three modes: LOCAL_TO_S3 (local upload); S3_TO_S3 (light-to-medium volume, run once); ALIOSS_TO_S3 (Alibaba Cloud OSS to S3). |
Cluster | Large numbers of files, single files from 0 bytes up to the TB range. Scheduled scanning or real-time data sync (S3 event triggers SQS). Supports triggering on newly added S3 objects, or a Jobsender that periodically scans existing S3 objects. |
Serverless | Light-to-medium volume (single file < 50 GB recommended), ad-hoc transfers or real-time data sync. Thanks to resumable transfers and SQS re-drive, you don't need to worry about the 15-minute Lambda timeout. Supports triggering on newly added S3 objects, or a Jobsender that periodically scans existing S3 objects. |
For more information, see the link at the top of this article; the README written by the AWS architects is almost absurdly detailed!
Preparation: installing the tool
My needs are simple: download one overseas database. So I went with the single-node version's S3_TO_S3 mode.
The figure below gives the complete overview of the single-node version.
Requirements for the single-node version: Python 3.6 or above, plus the AWS SDK for Python, boto3 (more on that later). If you also want to pull data from Alibaba Cloud, you additionally need the Alibaba Cloud SDK oss2.
PS: the downloaded package includes a requirements.txt file; if in doubt, just run it:
$ pip install -r requirements.txt --user
Requirement already satisfied: boto3 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (1.10.34)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.9.4)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.2.1)
Requirement already satisfied: botocore<1.14.0,>=1.13.34 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (1.13.34)
Requirement already satisfied: python-dateutil<2.8.1,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (2.8.0)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (0.15.2)
Requirement already satisfied: urllib3<1.26,>=1.20; python_version >= "3.4" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.25.7)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.8.1,>=2.1; python_version >= "2.7"->botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.13.0)
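If you want a quick sanity check that boto3 is actually importable under your Python 3, a one-liner does it (the version printed matches the pip output above):
$ python3 -c "import boto3; print(boto3.__version__)"
1.10.34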
For my scenario, the basic idea is:
- Create (or already have) an overseas AWS account (you just need to attach a credit card; it's open to individuals too), launch an EC2 instance to do the downloading, then use the aws s3 command to push the data into an S3 bucket.
- On the overseas instance, install the S3 tool and configure the access key / secret key (AK/SK) for both accounts, plus the SMRMS config file.
- Run the sync command.
Of course this is fairly crude; the gurus in the group could easily turn it into one-click automation with the AWS SDK. Today we'll just walk through the basic flow, starting with the sketch below.
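Assuming the bucket and prefix names used later in this article (mybucket, test/) and a placeholder database URL, step one on the overseas EC2 boils down to something like:
$ wget https://ftp.example.org/db/test.1a.zip          # placeholder URL for the overseas database
$ aws s3 cp test.1a.zip s3://mybucket/test/            # push it into the source bucket in us-west-2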
Launch the overseas VM
- I launched an instance in us-west-2 (Oregon), an m5a.large. Key point: the download workload is CPU-hungry, so do not pick a t-series instance; the C and M families are recommended.
- Then log in and check that the Python version is 3.6 or above:
python3 --version
Python 3.7.8
- Configure the AK/SK (here the source-side keys come from the instance's IAM role, as the output shows):
[ec2-user@ip-172-xx-xx-xxx ~]$ aws configure list
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************6BXX iam-role
secret_key ****************6qXX iam-role
region us-west-2 config-file ~/.aws/config
Install the SMRMS tool
- Since I'm going to run the tool on EC2, below is a shell script that quickly installs the required environment and packages. Copy it onto the EC2 instance and save it as ec2_init.sh. (PS: for other platforms, such as local Linux, Mac, or the local Windows GUI, see here.)
#!/bin/bash -v
yum update -y
yum install git -y
yum install python3 -y
pip3 install boto3
# Setup BBR
echo "Setup BBR"
cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
#!/bin/bash
exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
exec /sbin/modprobe sch_fq >/dev/null 2>&1
EOF
chmod 755 /etc/sysconfig/modules/tcpcong.modules
echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
modprobe tcp_bbr
modprobe sch_fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
echo "Download application amazon-s3-resumable-upload.git"
cd /home/ec2-user/ || exit
git clone https://github.com/aws-samples/amazon-s3-resumable-upload
- Then run the script; it will create a folder named amazon-s3-resumable-upload in the current directory:
chmod 755 ec2_init.sh
sudo bash ec2_init.sh
$ tree
.
├── cluster
│ ├── cdk-cluster
│ │ ├── app.py
│ │ ├── cdk
│ │ │ ├── cdk_ec2_stack.py
│ │ │ ├── cdk_resource_stack.py
│ │ │ ├── cdk_vpc_stack.py
│ │ │ ├── cw_agent_config.json
│ │ │ ├── __init__.py
│ │ │ ├── user_data_jobsender.sh
│ │ │ ├── user_data_part1.sh
│ │ │ ├── user_data_part2.sh
│ │ │ └── user_data_worker.sh
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── code
│ │ │ ├── requirements.txt
│ │ │ ├── s3_migration_cluster_config.ini
│ │ │ ├── s3_migration_cluster_jobsender.py
│ │ │ ├── s3_migration_cluster_worker.py
│ │ │ ├── s3_migration_ignore_list.txt
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── img
│ │ ├── 02-jobsender.png
│ │ ├── 02-new.png
│ │ ├── 03.png
│ │ ├── 04.png
│ │ ├── 05.png
│ │ ├── 07.png
│ │ ├── 08.png
│ │ ├── 09.png
│ │ └── 0a.png
│ ├── old-cdk-cluster-0.96
│ │ ├── app.py
│ │ ├── cdk
│ │ │ ├── cdk_ec2_stack.py
│ │ │ ├── cdk_resource_stack.py
│ │ │ ├── cdk_vpc_stack.py
│ │ │ ├── cw_agent_config.json
│ │ │ ├── user_data_jobsender.sh
│ │ │ ├── user_data_part1.sh
│ │ │ ├── user_data_part2.sh
│ │ │ └── user_data_worker.sh
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── code
│ │ │ ├── requirements.txt
│ │ │ ├── s3_migration_cluster_config.ini
│ │ │ ├── s3_migration_cluster_jobsender.py
│ │ │ ├── s3_migration_cluster_worker.py
│ │ │ ├── s3_migration_ignore_list.txt
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── README-English.md
│ └── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── img
│ ├── 01.png
│ └── 02.png
├── LICENSE
├── README.md
├── serverless
│ ├── cdk-serverless
│ │ ├── app.py
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── lambda
│ │ │ ├── lambda_function_jobsender.py
│ │ │ ├── lambda_function_worker.py
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── s3_migration_ignore_list.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── img
│ │ ├── 01.png
│ │ ├── 02-jobsender.png
│ │ ├── 02-new.png
│ │ ├── 05.png
│ │ ├── 06.png
│ │ ├── 07b.png
│ │ └── 09.png
│ ├── old-cdk-serverless-0.96
│ │ ├── app.py
│ │ ├── cdk.context.json
│ │ ├── cdk.json
│ │ ├── lambda
│ │ │ ├── lambda_function_jobsender.py
│ │ │ ├── lambda_function_worker.py
│ │ │ └── s3_migration_lib.py
│ │ ├── README.md
│ │ ├── requirements.txt
│ │ ├── s3_migration_ignore_list.txt
│ │ ├── setup.py
│ │ └── source.bat
│ ├── README-English.md
│ └── README.md
├── single_node
│ ├── ec2_init.sh
│ ├── img
│ │ ├── img01.png
│ │ ├── img02.png
│ │ ├── img03.png
│ │ ├── img04.png
│ │ ├── img05.png
│ │ └── img06.png
│ ├── os_x
│ ├── README.md
│ ├── requestPayer-exampleCodeFrom-\344\270\201\345\217\257_s3_download.py
│ ├── requirements.txt
│ ├── s3_download_config.ini
│ ├── s3_download_config.ini.default
│ ├── s3_download.py
│ ├── s3_upload_config.ini
│ ├── s3_upload_config.ini.default
│ ├── s3_upload.py
│ └── win64
│ ├── s3_download.zip
│ └── s3_upload.zip
└── tools
├── analystic_dynamodb_table.py
├── clean_unfinished_multipart_upload.py
└── README.md
20 directories, 112 files
Key step! Configuring SMRMS
- Enable BBR.
Since my scenario transfers between overseas AWS and China-region AWS, I can enable BBR to improve transfer efficiency. What is BBR? TCP Bottleneck Bandwidth and RTT (BBR) congestion control, a feature supported by the Amazon Linux AMI kernel but not enabled by default. Enable it like this:
$ sudo modprobe tcp_bbr
$ sudo modprobe sch_fq
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
To enable it permanently:
$ sudo su -
$ cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
> #!/bin/bash
> exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
> exec /sbin/modprobe sch_fq >/dev/null 2>&1
> EOF
$ chmod 755 /etc/sysconfig/modules/tcpcong.modules
$ echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
- Configure the credentials file.
Since I'm transferring from overseas S3 to China-region S3, two accounts are involved, so two sets of AK/SK need to be configured.
$ vi ~/.aws/credentials
[ningxia]
region=cn-northwest-1
aws_access_key_id=xxxxxxxxxxxxxxx
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxx
[oregon]
region=us-west-2
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx
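(If you'd rather not edit the file by hand, aws configure --profile ningxia sets up the same entries interactively.) Before running the tool it is worth confirming that each profile can reach its own side; the bucket name below is the one used in this walkthrough:
$ aws s3 ls s3://mybucket --profile oregon             # source side: list the overseas bucket
$ aws s3 ls --profile ningxia --region cn-northwest-1  # destination side: list buckets in the Ningxia account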
- Configure the SMRMS config file.
Go into the /home/ec2-user/amazon-s3-resumable-upload/single_node folder, find the file s3_upload_config.ini, open it with vim, and edit mainly the following parameters:
[Basic]
JobType = S3_TO_S3 # change this
# 'LOCAL_TO_S3' | 'S3_TO_S3' | 'ALIOSS_TO_S3'
DesBucket = mybucket # change this
# Destination S3 bucket name, type = str
S3Prefix = test # change this; if syncing a specific folder in the bucket, put that folder name here
# S3_TO_S3 mode: source S3 prefix, same as destination S3 prefix; LOCAL_TO_S3 mode: this is the destination S3 prefix. type = str
SrcFileIndex = *
# Specify the file name to upload. Wildcard "*" to upload all. type = str
DesProfileName = ningxia # change this; the China-region credentials profile name, "ningxia" in this article
# Profile name configured in ~/.aws/credentials with access to the destination S3
[LOCAL_TO_S3]
SrcDir = d:\mydir
# Source file directory. Not used in S3_TO_S3 mode. type = str
[S3_TO_S3]
SrcBucket = mybucket # change this
# Source bucket name. Not used in LOCAL_TO_S3 mode.
SrcProfileName = oregon # change this; the overseas-region credentials profile name, "oregon" in this article
# Profile name configured in ~/.aws/credentials with access to the source S3. Not used in LOCAL_TO_S3 mode.
[ALIOSS_TO_S3] # if going from Alibaba Cloud OSS to China-region AWS, put the Alibaba AK/SK here as well
ali_SrcBucket = img-process
ali_access_key_id = xxxx
ali_access_key_secret = xxx
ali_endpoint = oss-cn-beijing.aliyuncs.com
[Advanced]
ChunkSize = 5
# File chunk size in MBytes, not less than 5 MB. A single file can have at most 10,000 parts (an S3 multipart upload API limit), so the application auto-adjusts this value to the file size; you normally don't need to change it. type = int
MaxRetry = 20
# Max retry count when an S3 API call fails. type = int
MaxThread = 5
# Max threads for ONE file. type = int
MaxParallelFile = 5
# Max files transferred in parallel, i.e. concurrent threads = MaxParallelFile * MaxThread. type = int
StorageClass = STANDARD # depends on what you are syncing; this article just pulls a database, so keep the default
# 'STANDARD'|'REDUCED_REDUNDANCY'|'STANDARD_IA'|'ONEZONE_IA'|'INTELLIGENT_TIERING'|'GLACIER'|'DEEP_ARCHIVE'
ifVerifyMD5 = False
# Whether to do a second, whole-file MD5 check.
# If True, after the parts are merged the ETag MD5 of the whole file is verified a second time.
# In S3_TO_S3 mode, True forces re-downloading all already-transferred parts on break-point resume in order to compute the MD5 (already-uploaded parts are not re-uploaded).
# In LOCAL_TO_S3 mode, True re-reads the whole file after the upload finishes and compares its MD5 with the S3 ETag.
# This switch does not affect per-part verification; even if False, every part's MD5 is still verified.
DontAskMeToClean = True
# If True: when an unfinished upload exists, do not ask whether to clean the unfinished parts on the destination S3; just carry on and resume the break-point upload.
LoggingLevel = INFO
# 'WARNING' | 'INFO' | 'DEBUG'
Save and quit.
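One note on the defaults: the overall concurrency is MaxParallelFile × MaxThread = 5 × 5 = 25 part transfers at most, but MaxThread is per file, so a single large database file like mine moves at most 5 parts at a time (which matches the transfer log below); on a larger instance that is presumably the knob to raise.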
Run the sync program SMRMS
- Download the database file test.1a.zip. Since this EC2 instance is in the US, the download finishes quickly; then push it with the aws s3 cp command to the overseas S3://mybucket/test/.
- With everything configured, you can now run the SMRMS program to sync. The speed is rock solid:
$ python3 /home/ec2-user/amazon-s3-resumable-upload/single_node/s3_upload.py --nogui
Reading config file: s3_upload_config.ini
Logging to file: /home/ec2-user/amazon-s3-resumable-upload/single_node/log/s3_upload-2020-08-23T13-06-09.log
Logging level: INFO
2020-08-23 13:06:09,201 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:09,242 INFO - Checking write permission for: davischen
2020-08-23 13:06:10,668 INFO - Get source file list
2020-08-23 13:06:10,686 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:10,708 INFO - Get s3 file list davischen
2020-08-23 13:06:10,764 INFO - Bucket list length:4
2020-08-23 13:06:10,765 INFO - Get s3 file list davischen
2020-08-23 13:06:11,028 INFO - Bucket list length:9
2020-08-23 13:06:11,028 INFO - Get unfinished multipart upload
2020-08-23 13:06:11,823 INFO - Start file: test/
2020-08-23 13:06:11,823 INFO - Start file: test/AtomSetup-x64.exe
2020-08-23 13:06:11,823 INFO - Duplicated. test/ same size, goto next file.
2020-08-23 13:06:11,823 INFO - Duplicated. test/AtomSetup-x64.exe same size, goto next file.
2020-08-23 13:06:11,824 INFO - Start file: test/test.1a.zip
2020-08-23 13:06:11,824 INFO - Start file: test/nt.26.tar.gz
2020-08-23 13:06:11,825 INFO - New upload: test/test.1a.zip
2020-08-23 13:06:11,825 INFO - Duplicated. test/nt.26.tar.gz same size, goto next file.
--->Downloading test/test.1a.zip - 1/5786
--->Downloading test/test.1a.zip - 2/5786
--->Downloading test/test.1a.zip - 3/5786
--->Downloading test/test.1a.zip - 4/5786
--->Downloading test/test.1a.zip - 5/5786
--->Uploading test/test.1a.zip - 1/5786
--->Uploading test/test.1a.zip - 4/5786
--->Uploading test/test.1a.zip - 3/5786
--->Uploading test/test.1a.zip - 2/5786
--->Uploading test/test.1a.zip - 5/5786
--->Complete test/test.1a.zip - 5/5786 0.02% - 1.7 MB/s
--->Downloading test/test.1a.zip - 6/5786
--->Complete test/test.1a.zip - 2/5786 0.03% - 1.6 MB/s
--->Downloading test/test.1a.zip - 7/5786
--->Uploading test/test.1a.zip - 6/5786
--->Uploading test/test.1a.zip - 7/5786
--->Complete test/test.1a.zip - 6/5786 0.05% - 5.4 MB/s
--->Downloading test/test.1a.zip - 8/5786
--->Uploading test/test.1a.zip - 8/5786
--->Complete test/test.1a.zip - 7/5786 0.07% - 4.5 MB/s
--->Downloading test/test.1a.zip - 9/5786
--->Uploading test/test.1a.zip - 9/5786
--->Complete test/test.1a.zip - 8/5786 0.09% - 5.6 MB/s
--->Downloading test/test.1a.zip - 10/5786
--->Complete test/test.1a.zip - 1/5786 0.10% - 1.0 MB/s
--->Downloading test/test.1a.zip - 11/5786
--->Complete test/test.1a.zip - 9/5786 0.12% - 6.2 MB/s
--->Downloading test/test.1a.zip - 12/5786
--->Uploading test/test.1a.zip - 10/5786
--->Uploading test/test.1a.zip - 11/5786
--->Uploading test/test.1a.zip - 12/5786
--->Complete test/test.1a.zip - 10/5786 0.14% - 5.2 MB/s
--->Downloading test/test.1a.zip - 13/5786
--->Complete test/test.1a.zip - 12/5786 0.16% - 5.5 MB/s
...
MISSION ACCOMPLISHED - Time: 0:31:39.284875 - FROM: mybucket/test TO mybucket/test
- A 28 GB file pulled in through this tool in just 31 minutes, whereas downloading over the public internet at a dozen KB/s... let's not even go there. Afterwards you can see the data already sitting in the Ningxia-region S3://mybucket/test/.
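A quick back-of-envelope check on those numbers: the log shows test/test.1a.zip split into 5,786 parts, which at the 5 MB ChunkSize is about 5,786 × 5 MB ≈ 28 GB, matching the file size; and 28 GB in roughly 31.5 minutes works out to around 15 MB/s (~120 Mbit/s) sustained from us-west-2 to cn-northwest-1.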
Closing remarks
- This article only tests the single-node version's S3_TO_S3 mode and does not walk through every other scenario, so if your use case is different, please refer to the official documentation linked at the top. This post is just meant to get the ball rolling, and everyone is welcome to discuss it further in the tech group.
Our greatest weakness lies in the craving to be recognized.