
Use the Nvidia RAPIDS Accelerator for Apache Spark - Amazon EMR



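With Amazon EMR release 6.2.0 and later, you can use Nvidia's RAPIDS Accelerator for Apache Spark plugin to accelerate Spark on EC2 graphics processing unit (GPU) instance types. RAPIDS Accelerator GPU-accelerates your Apache Spark 3.0 data science pipelines without code changes. It speeds up data processing and model training while substantially lowering infrastructure costs.

The following sections guide you through configuring your EMR cluster to use the Spark-RAPIDS plugin for Spark.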

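Choose instance types

To use the Nvidia Spark-RAPIDS plugin for Spark, the core and task instance groups must use EC2 GPU instance types that meet the hardware requirements of Spark-RAPIDS. For a complete list of the GPU instance types that Amazon EMR supports, see Supported instance types in the Amazon EMR Management Guide. The primary instance group can use either a GPU or a non-GPU instance type, but ARM instance types are not supported.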

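Set up application configurations for your cluster

1. Enable Amazon EMR to install the plugin on your new cluster

To install the plugin, supply the following configuration when you create your cluster: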

{
  "Classification": "spark",
  "Properties": {
    "enableSparkRapids": "true"
  }
}

2. Configure YARN to use GPU

For details on how to use GPUs on YARN, see Using GPU On YARN in the Apache Hadoop documentation. The following examples show sample YARN configurations for Amazon EMR 6.x and 7.x releases:

Amazon EMR 7.x

Example YARN configuration for Amazon EMR 7.x

{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
    "yarn.resource-types": "yarn.io/gpu",
    "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
    "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
    "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
    "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/spark-rapids-cgroup",
    "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
    "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
  }
},
{
  "Classification": "container-executor",
  "Properties": {},
  "Configurations": [
    {
      "Classification": "gpu",
      "Properties": {
        "module.enabled": "true"
      }
    },
    {
      "Classification": "cgroups",
      "Properties": {
        "root": "/spark-rapids-cgroup",
        "yarn-hierarchy": "yarn"
      }
    }
  ]
}

Amazon EMR 6.x

Example YARN configuration for Amazon EMR 6.x

{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
    "yarn.resource-types": "yarn.io/gpu",
    "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
    "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
    "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
    "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/sys/fs/cgroup",
    "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
    "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
  }
},
{
  "Classification": "container-executor",
  "Properties": {},
  "Configurations": [
    {
      "Classification": "gpu",
      "Properties": {
        "module.enabled": "true"
      }
    },
    {
      "Classification": "cgroups",
      "Properties": {
        "root": "/sys/fs/cgroup",
        "yarn-hierarchy": "yarn"
      }
    }
  ]
}
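The two samples above differ only in the cgroup path: /spark-rapids-cgroup on Amazon EMR 7.x and /sys/fs/cgroup on Amazon EMR 6.x. As a minimal sketch, the following Python helper generates both classifications from a single path so that the two places where the path appears stay in sync. The function name yarn_gpu_classifications is hypothetical and is not part of any AWS tooling.

```python
import json


def yarn_gpu_classifications(cgroup_root):
    """Hypothetical helper: build the yarn-site and container-executor
    classifications shown above from one cgroup path."""
    prefix = "yarn.nodemanager.linux-container-executor.cgroups."
    yarn_site = {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
            "yarn.resource-types": "yarn.io/gpu",
            "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
            "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
            prefix + "mount": "true",
            prefix + "mount-path": cgroup_root,  # first use of the path
            prefix + "hierarchy": "yarn",
            "yarn.nodemanager.container-executor.class":
                "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
        },
    }
    container_executor = {
        "Classification": "container-executor",
        "Properties": {},
        "Configurations": [
            {"Classification": "gpu", "Properties": {"module.enabled": "true"}},
            {
                "Classification": "cgroups",
                # second use of the path; must match mount-path above
                "Properties": {"root": cgroup_root, "yarn-hierarchy": "yarn"},
            },
        ],
    }
    return [yarn_site, container_executor]


emr7 = yarn_gpu_classifications("/spark-rapids-cgroup")  # Amazon EMR 7.x path
emr6 = yarn_gpu_classifications("/sys/fs/cgroup")        # Amazon EMR 6.x path

print(json.dumps(emr7, indent=2))
```

Generating the fragment from one argument is only a convenience; the JSON it prints is the same as the hand-written 7.x sample.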

3. Configure Spark to use RAPIDS

The following configuration is required for Spark to use the RAPIDS plugin:

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",
    "spark.executor.extraLibraryPath": "/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native"
  }
}

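The XGBoost4J-Spark library described in the XGBoost documentation is also available when the Spark RAPIDS plugin is enabled on your cluster. You can use the following configuration to integrate XGBoost with your Spark job: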

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar"
  }
}

For additional Spark configurations that you can use to tune a GPU-accelerated EMR cluster, see the RAPIDS Accelerator for Apache Spark tuning guide in the Nvidia.github.io documentation.
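As one illustration of how such settings interact, the sample my-configurations.json files in step 5 set spark.executor.cores to 2, spark.task.cpus to 1, spark.executor.resource.gpu.amount to 1, and spark.task.resource.gpu.amount to 0.5. Spark starts a task only when both a CPU slot and a GPU share are free, so the number of tasks that an executor runs at once is the smaller of the two limits. The following Python sketch works through that arithmetic. The function name is hypothetical, and the calculation is a simplified model of Spark's scheduling, not its actual code.

```python
from fractions import Fraction


def concurrent_tasks_per_executor(executor_cores, task_cpus,
                                  executor_gpus, task_gpu_amount):
    """Simplified model: tasks per executor = min(CPU slots, GPU slots).

    Arguments are strings, as they appear in the spark-defaults classification.
    """
    cpu_slots = int(executor_cores) // int(task_cpus)

    gpus = int(executor_gpus)
    share = Fraction(task_gpu_amount)  # Fraction("0.5") is exactly 1/2
    if share < 1:
        # A fractional amount lets floor(1 / share) tasks use each GPU.
        gpu_slots = gpus * (Fraction(1) // share)
    else:
        # A whole amount reserves that many GPUs per task.
        gpu_slots = gpus // int(share)

    return min(cpu_slots, gpu_slots)


# Values taken from the sample my-configurations.json files.
print(concurrent_tasks_per_executor("2", "1", "1", "0.5"))  # prints 2
```

With the sample values, two tasks can be scheduled on each executor and share its GPU. The sample files also set spark.rapids.sql.concurrentGpuTasks to 1, which is the RAPIDS setting for how many tasks may execute on a GPU concurrently.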

4. Configure the YARN Capacity Scheduler

You must configure DominantResourceCalculator to enable GPU scheduling and isolation. For more details, see Using GPU On YARN in the Apache Hadoop documentation.

{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
  }
}

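5. Create a JSON file to include your configurations

You can create a JSON file that contains the configurations your Spark cluster needs to use the RAPIDS plugin. You provide this file later, when you launch your cluster.

You can store the file locally or on S3. For more information on how to supply application configurations for your clusters, see Configure applications.

Use the following sample files as templates to build your own configurations.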

Amazon EMR 7.x

Sample my-configurations.json file for Amazon EMR 7.x

[
  {
    "Classification": "spark",
    "Properties": {
      "enableSparkRapids": "true"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
      "yarn.resource-types": "yarn.io/gpu",
      "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
      "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
      "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
      "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/spark-rapids-cgroup",
      "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
      "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
    }
  },
  {
    "Classification": "container-executor",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "gpu",
        "Properties": {
          "module.enabled": "true"
        }
      },
      {
        "Classification": "cgroups",
        "Properties": {
          "root": "/spark-rapids-cgroup",
          "yarn-hierarchy": "yarn"
        }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.plugins": "com.nvidia.spark.SQLPlugin",
      "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",
      "spark.executor.extraLibraryPath": "/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
      "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
      "spark.rapids.sql.concurrentGpuTasks": "1",
      "spark.executor.resource.gpu.amount": "1",
      "spark.executor.cores": "2",
      "spark.task.cpus": "1",
      "spark.task.resource.gpu.amount": "0.5",
      "spark.rapids.memory.pinnedPool.size": "0",
      "spark.executor.memoryOverhead": "2G",
      "spark.locality.wait": "0s",
      "spark.sql.shuffle.partitions": "200",
      "spark.sql.files.maxPartitionBytes": "512m"
    }
  },
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]

Amazon EMR 6.x

Sample my-configurations.json file for Amazon EMR 6.x

[
  {
    "Classification": "spark",
    "Properties": {
      "enableSparkRapids": "true"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
      "yarn.resource-types": "yarn.io/gpu",
      "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
      "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
      "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
      "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/sys/fs/cgroup",
      "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
      "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
    }
  },
  {
    "Classification": "container-executor",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "gpu",
        "Properties": {
          "module.enabled": "true"
        }
      },
      {
        "Classification": "cgroups",
        "Properties": {
          "root": "/sys/fs/cgroup",
          "yarn-hierarchy": "yarn"
        }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.plugins": "com.nvidia.spark.SQLPlugin",
      "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",
      "spark.executor.extraLibraryPath": "/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
      "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
      "spark.rapids.sql.concurrentGpuTasks": "1",
      "spark.executor.resource.gpu.amount": "1",
      "spark.executor.cores": "2",
      "spark.task.cpus": "1",
      "spark.task.resource.gpu.amount": "0.5",
      "spark.rapids.memory.pinnedPool.size": "0",
      "spark.executor.memoryOverhead": "2G",
      "spark.locality.wait": "0s",
      "spark.sql.shuffle.partitions": "200",
      "spark.sql.files.maxPartitionBytes": "512m"
    }
  },
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]

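Add a bootstrap action for your cluster

For more information on how to supply bootstrap action scripts when you create your cluster, see Bootstrap action basics in the Amazon EMR Management Guide.

The following sample scripts show how to create a bootstrap action file for Amazon EMR 6.x and 7.x: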

Amazon EMR 7.x

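Sample my-bootstrap-action.sh file for Amazon EMR 7.x

To use YARN to manage GPU resources with Amazon EMR 7.x releases, you must manually mount CGroup v1 on your cluster. You can do this with a bootstrap action script, as shown in this example.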

#!/bin/bash

set -ex

sudo mkdir -p /spark-rapids-cgroup/devices
sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
sudo chmod a+rwx -R /spark-rapids-cgroup

Amazon EMR 6.x

Sample my-bootstrap-action.sh file for Amazon EMR 6.x

For Amazon EMR 6.x releases, you must open CGroup permissions to YARN on your cluster. You can do this with a bootstrap action script, as shown in this example.

#!/bin/bash

set -ex

sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
sudo chmod a+rwx -R /sys/fs/cgroup/devices
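The cgroup path appears in three places: the yarn-site mount path, the container-executor cgroups root, and the bootstrap script. Pairing a 7.x configuration with a 6.x script, or the reverse, leaves them out of step. The following Python sketch checks that the three agree before you launch a cluster. The function check_cgroup_paths and its messages are hypothetical; the sketch assumes the layout of the sample my-configurations.json files and that the script refers to the devices directory under the cgroup root, as both sample scripts do.

```python
MOUNT_KEY = "yarn.nodemanager.linux-container-executor.cgroups.mount-path"


def check_cgroup_paths(configurations, bootstrap_script):
    """Hypothetical pre-launch check. Returns a list of problems;
    an empty list means the cgroup paths agree."""
    by_name = {c["Classification"]: c for c in configurations}

    # Path that YARN is told to use in yarn-site.
    mount_path = by_name.get("yarn-site", {}).get("Properties", {}).get(MOUNT_KEY)

    # Path in the nested cgroups classification of container-executor.
    nested = by_name.get("container-executor", {}).get("Configurations", [])
    roots = [c.get("Properties", {}).get("root")
             for c in nested if c.get("Classification") == "cgroups"]
    root = roots[0] if roots else None

    problems = []
    if mount_path is None:
        problems.append("yarn-site does not set " + MOUNT_KEY)
    if root != mount_path:
        problems.append(f"cgroups root {root!r} differs from mount-path {mount_path!r}")
    if mount_path and f"{mount_path}/devices" not in bootstrap_script:
        problems.append(f"bootstrap script never refers to {mount_path}/devices")
    return problems


# Trimmed-down copy of the Amazon EMR 7.x samples.
sample_config = [
    {"Classification": "yarn-site",
     "Properties": {MOUNT_KEY: "/spark-rapids-cgroup"}},
    {"Classification": "container-executor", "Properties": {}, "Configurations": [
        {"Classification": "gpu", "Properties": {"module.enabled": "true"}},
        {"Classification": "cgroups",
         "Properties": {"root": "/spark-rapids-cgroup", "yarn-hierarchy": "yarn"}},
    ]},
]
sample_script = """#!/bin/bash
set -ex
sudo mkdir -p /spark-rapids-cgroup/devices
sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
sudo chmod a+rwx -R /spark-rapids-cgroup
"""

print(check_cgroup_paths(sample_config, sample_script))  # prints []
```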

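Launch your cluster

The last step is to launch your cluster with the cluster configurations described above. The following example command launches a cluster through the Amazon EMR CLI: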

aws emr create-cluster \
--release-label emr-7.13.0 \
--applications Name=Hadoop Name=Spark \
--service-role EMR_DefaultRole_V2 \
--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \
                  InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \
                  InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \
--configurations file:///my-configurations.json \
--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://amzn-s3-demo-bucket/my-bootstrap-action.sh
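If you launch clusters from code rather than from the CLI, you can express the same request with the AWS SDK for Python (Boto3). The sketch below is a hand translation of the command above, not an official sample. The cluster name, the function names build_request and launch, and the local file path are hypothetical, and boto3 must be installed and configured with credentials before launch can succeed.

```python
import json


def build_request(configurations):
    """Map the CLI options above onto run_job_flow keyword arguments."""
    def group(role, instance_type):
        return {"InstanceRole": role, "InstanceType": instance_type, "InstanceCount": 1}

    return {
        "Name": "spark-rapids-cluster",         # hypothetical cluster name
        "ReleaseLabel": "emr-7.13.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "ServiceRole": "EMR_DefaultRole_V2",
        "JobFlowRole": "EMR_EC2_DefaultRole",   # InstanceProfile in the CLI
        "Instances": {
            "Ec2KeyName": "my-key-pair",
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running
            "InstanceGroups": [
                group("MASTER", "m4.4xlarge"),
                group("CORE", "g4dn.2xlarge"),
                group("TASK", "g4dn.2xlarge"),
            ],
        },
        # Parsed contents of my-configurations.json.
        "Configurations": configurations,
        "BootstrapActions": [{
            "Name": "My Spark Rapids Bootstrap action",
            "ScriptBootstrapAction": {
                "Path": "s3://amzn-s3-demo-bucket/my-bootstrap-action.sh",
            },
        }],
    }


def launch(config_path="my-configurations.json"):
    """Read the configuration file and start the cluster; returns the cluster ID."""
    import boto3  # imported here so build_request works without boto3 installed

    with open(config_path) as f:
        configurations = json.load(f)
    response = boto3.client("emr").run_job_flow(**build_request(configurations))
    return response["JobFlowId"]
```

Keeping build_request separate from launch lets you inspect or log the request without contacting AWS.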