本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。
使用 Nvidia RAPIDS Accelerator for Apache Spark
对于 Amazon EMR 发行版 6.2.0 及更高版本,您可以使用 Nvidia 的 RAPIDS Accelerator for Apache Spark 插件来通过 EC2 图形处理器(GPU)实例类型加速 Spark。RAPIDS Accelerator 无需更改代码即可 GPU-accelerate为您的 Apache Spark 3.0 数据科学管道,在大幅降低基础设施成本的同时加快数据处理和模型训练的速度。
以下各节将指导您配置 EMR 集群以使用适用于 Spark 的 Spark-RAPIDS 插件。
选择实例类型
要使用适用于 Spark 的 Nvidia Spark-RAPIDS 插件,核心实例组和任务实例组必须使用符合硬件要求的 EC2 GPU 实例类型 Spark-RAPIDS。要查看 Amazon EMR 支持的 GPU 实例类型的完整列表,请参阅《Amazon EMR 管理指南》中的支持的实例类型。主实例组的实例类型可以是 GPU 或非 GPU 类型,但不支持 ARM 实例类型。
为集群设置应用程序配置
1。使 Amazon EMR 能在您的新集群上安装插件
要安装插件,请在创建集群时提供以下配置:
{
"Classification":"spark",
"Properties":{
"enableSparkRapids":"true"
}
}
2。将 YARN 配置为使用 GPU
有关如何在 YARN 上使用 GPU 的详细信息,请参阅 Apache Hadoop 文档中的在 YARN 上使用 GPU。以下示例显示了 Amazon EMR 6.x 和 7.x 发行版的 YARN 配置示例:
- Amazon EMR 7.x
-
Amazon EMR 7.x 的 YARN 配置示例
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},{
"Classification":"container-executor",
"Properties":{
},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/spark-rapids-cgroup",
"yarn-hierarchy":"yarn"
}
}
]
}
- Amazon EMR 6.x
-
Amazon EMR 6.x 的 YARN 配置示例
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},{
"Classification":"container-executor",
"Properties":{
},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/sys/fs/cgroup",
"yarn-hierarchy":"yarn"
}
}
]
}
3。将 Spark 配置为使用 RAPIDS
以下是使 Spark 能够使用 RAPIDS 插件所需的配置:
{
"Classification":"spark-defaults",
"Properties":{
"spark.plugins":"com.nvidia.spark.SQLPlugin",
"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native"
}
}
XGBoost4J-Spark 在@@ 集群上启用 Spark RAPIDS 插件后,XGBoost 文档中的库也可用。您可以使用以下配置将 XGBoost 与您的 Spark 任务集成:
{
"Classification":"spark-defaults",
"Properties":{
"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar"
}
}
有关可用于调整 GPU-accelerated EMR 集群的其他 Spark 配置,请参阅文档中的适用于 Apache Spark 的 Rapids Accelerator 调优指南。 Nvidia.github.io
4。配置 YARN 容量调度器
必须配置 DominantResourceCalculator 来启用 GPU 调度和隔离。有关详细信息,请参阅 Apache Hadoop 文档中的 Using GPU On YARN。
{
"Classification":"capacity-scheduler",
"Properties":{
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
}
}
5。创建一个 JSON 文件以包含您的配置
您可以创建一个 JSON 文件,在其中包含您的配置,以便为 Spark 集群使用 RAPIDS 插件。您稍后在启动集群时需提供该文件。
您可以将文件存储在本地或 S3 上。有关如何为集群提供应用程序配置的详细信息,请参阅配置应用程序。
使用以下示例文件作为模板来构建自己的配置。
- Amazon EMR 7.x
-
Amazon EMR 7.x 的示例 my-configurations.json 文件
[
{
"Classification":"spark",
"Properties":{
"enableSparkRapids":"true"
}
},
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},
{
"Classification":"container-executor",
"Properties":{
},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/spark-rapids-cgroup",
"yarn-hierarchy":"yarn"
}
}
]
},
{
"Classification":"spark-defaults",
"Properties":{
"spark.plugins":"com.nvidia.spark.SQLPlugin",
"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
"spark.rapids.sql.concurrentGpuTasks":"1",
"spark.executor.resource.gpu.amount":"1",
"spark.executor.cores":"2",
"spark.task.cpus":"1",
"spark.task.resource.gpu.amount":"0.5",
"spark.rapids.memory.pinnedPool.size":"0",
"spark.executor.memoryOverhead":"2G",
"spark.locality.wait":"0s",
"spark.sql.shuffle.partitions":"200",
"spark.sql.files.maxPartitionBytes":"512m"
}
},
{
"Classification":"capacity-scheduler",
"Properties":{
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
}
}
]
- Amazon EMR 6.x
-
Amazon EMR 6.x 的示例 my-configurations.json 文件
[
{
"Classification":"spark",
"Properties":{
"enableSparkRapids":"true"
}
},
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},
{
"Classification":"container-executor",
"Properties":{
},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/sys/fs/cgroup",
"yarn-hierarchy":"yarn"
}
}
]
},
{
"Classification":"spark-defaults",
"Properties":{
"spark.plugins":"com.nvidia.spark.SQLPlugin",
"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
"spark.rapids.sql.concurrentGpuTasks":"1",
"spark.executor.resource.gpu.amount":"1",
"spark.executor.cores":"2",
"spark.task.cpus":"1",
"spark.task.resource.gpu.amount":"0.5",
"spark.rapids.memory.pinnedPool.size":"0",
"spark.executor.memoryOverhead":"2G",
"spark.locality.wait":"0s",
"spark.sql.shuffle.partitions":"200",
"spark.sql.files.maxPartitionBytes":"512m"
}
},
{
"Classification":"capacity-scheduler",
"Properties":{
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
}
}
]
为您的集群添加引导操作
有关如何在创建集群时提供引导操作脚本的更多信息,请参阅《Amazon EMR 管理指南》中的引导操作基础。
以下示例脚本展示了如何为 Amazon EMR 6.x 和 7.x 制作引导操作文件:
- Amazon EMR 7.x
-
Amazon EMR 7.x 的示例 my-bootstrap-action.sh 文件
要使用 YARN 管理 Amazon EMR 7.x 发行版的 GPU 资源,您必须在集群上手动挂载 CGroup v1。您可以使用引导操作脚本来执行此操作,如本示例所示。
#!/bin/bash
set -ex
sudo mkdir -p /spark-rapids-cgroup/devices
sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
sudo chmod a+rwx -R /spark-rapids-cgroup
- Amazon EMR 6.x
-
Amazon EMR 6.x 的示例 my-bootstrap-action.sh 文件
对于 Amazon EMR 6.x 发行版,您必须在集群上打开 YARN 的 CGroup 权限。您可以使用引导操作脚本来执行此操作,如本示例所示。
#!/bin/bash
set -ex
sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
sudo chmod a+rwx -R /sys/fs/cgroup/devices
启动您的集群。
最后一步是使用上述集群配置启动您的集群。以下是一个通过 Amazon EMR CLI 启动集群的命令示例:
aws emr create-cluster \
--release-label emr-7.13.0 \
--applications Name=Hadoop Name=Spark \
--service-role EMR_DefaultRole_V2 \
--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \
InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \
InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \
--configurations file:///my-configurations.json \
--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://amzn-s3-demo-bucket/my-bootstrap-action.sh