Using shuffle-optimized disks
With Amazon EMR releases 7.1.0 and higher, you can use shuffle-optimized disks when you run Apache Spark or Hive jobs to improve performance for I/O-intensive workloads.
Compared to standard disks, shuffle-optimized disks provide higher IOPS (I/O operations per second), so data moves faster
with lower latency during shuffle operations. Shuffle-optimized disks support disk sizes of up to 2 TB per worker,
so you can configure capacity to match your workload requirements.
Key benefits
Shuffle-optimized disks provide the following benefits.
- High IOPS performance – Shuffle-optimized disks provide higher IOPS than standard disks, which makes data shuffling more efficient during Spark and Hive jobs and other shuffle-intensive workloads.
- Larger disk sizes – Shuffle-optimized disks support disk sizes from 20 GB to 2 TB per worker, so you can choose the appropriate capacity for your workloads.
Getting started
See the following steps to use shuffle-optimized disks in your workflows.
- Spark
-
Create an EMR Serverless release 7.1.0 application with the following command.
aws emr-serverless create-application \
    --type "SPARK" \
    --name my-application-name \
    --release-label emr-7.1.0 \
    --region <AWS_REGION>
-
Configure your Spark job with the spark.emr-serverless.driver.disk.type and/or spark.emr-serverless.executor.disk.type parameters to run with shuffle-optimized disks. You can use either parameter or both, depending on your use case.
aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
            "entryPointArguments": ["1"],
            "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi
                --conf spark.executor.cores=4
                --conf spark.executor.memory=20g
                --conf spark.driver.cores=4
                --conf spark.driver.memory=8g
                --conf spark.executor.instances=1
                --conf spark.emr-serverless.executor.disk.type=shuffle_optimized"
        }
    }'
For more information, see Spark job properties.
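If you submit jobs through an SDK instead of the AWS CLI, the same job driver payload can be assembled programmatically. The following Python sketch (a minimal illustration, not part of the EMR Serverless API; the helper name is our own) builds the sparkSubmit payload shown above, ready to pass to a call such as boto3's start_job_run:

```python
import json

def spark_job_driver(disk_type: str = "shuffle_optimized") -> dict:
    """Build a jobDriver payload for an EMR Serverless Spark run
    that requests shuffle-optimized disks for executors."""
    confs = {
        "spark.executor.cores": "4",
        "spark.executor.memory": "20g",
        "spark.driver.cores": "4",
        "spark.driver.memory": "8g",
        "spark.executor.instances": "1",
        # Request shuffle-optimized disks for the executors.
        "spark.emr-serverless.executor.disk.type": disk_type,
    }
    params = "--class org.apache.spark.examples.SparkPi " + " ".join(
        f"--conf {k}={v}" for k, v in confs.items()
    )
    return {
        "sparkSubmit": {
            "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
            "entryPointArguments": ["1"],
            "sparkSubmitParameters": params,
        }
    }

print(json.dumps(spark_job_driver(), indent=2))
```

Building the payload as a dictionary keeps the configuration in one place and avoids hand-editing the quoted JSON string in shell commands.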
- Hive
-
Create an EMR Serverless release 7.1.0 application with the following command.
aws emr-serverless create-application \
    --type "HIVE" \
    --name my-application-name \
    --release-label emr-7.1.0 \
    --region <AWS_REGION>
-
Configure your Hive job with the hive.driver.disk.type and/or hive.tez.disk.type parameters to run with shuffle-optimized disks. You can use either parameter or both, depending on your use case.
aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --job-driver '{
        "hive": {
            "query": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/query/hive-query.ql",
            "parameters": "--hiveconf hive.log.explain.output=false"
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [{
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/hive/scratch",
                "hive.metastore.warehouse.dir": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/hive/warehouse",
                "hive.driver.cores": "2",
                "hive.driver.memory": "4g",
                "hive.tez.container.size": "4096",
                "hive.tez.cpu.vcores": "1",
                "hive.driver.disk.type": "shuffle_optimized",
                "hive.tez.disk.type": "shuffle_optimized"
            }
        }]
    }'
For more information, see Hive job properties.
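As with the Spark example, the hive-site configuration overrides can be composed in code rather than in a quoted shell string. This Python sketch (an illustrative helper of our own, not an EMR Serverless API) builds the configurationOverrides structure used above for a given bucket:

```python
import json

def hive_site_overrides(bucket: str) -> dict:
    """Build configurationOverrides that request shuffle-optimized
    disks for both the Hive driver and Tez tasks."""
    properties = {
        "hive.exec.scratchdir": f"s3://{bucket}/emr-serverless-hive/hive/scratch",
        "hive.metastore.warehouse.dir": f"s3://{bucket}/emr-serverless-hive/hive/warehouse",
        "hive.driver.cores": "2",
        "hive.driver.memory": "4g",
        "hive.tez.container.size": "4096",
        "hive.tez.cpu.vcores": "1",
        # Request shuffle-optimized disks for the driver and Tez tasks.
        "hive.driver.disk.type": "shuffle_optimized",
        "hive.tez.disk.type": "shuffle_optimized",
    }
    return {
        "applicationConfiguration": [
            {"classification": "hive-site", "properties": properties}
        ]
    }

print(json.dumps(hive_site_overrides("DOC-EXAMPLE-BUCKET"), indent=2))
```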
Configuring an application with pre-initialized capacity
See the following examples to create applications
based on Amazon EMR release 7.1.0. These applications have the following properties:
- 5 pre-initialized drivers, each with 2 vCPU, 4 GB of memory, and 50 GB of shuffle-optimized disk.
- 50 pre-initialized executors, each with 4 vCPU, 8 GB of memory, and 500 GB of shuffle-optimized disk.
When an application runs jobs, it first consumes the pre-initialized workers and then scales
on-demand workers up to the maximum capacity of 400 vCPU and 1024 GB of memory.
Optionally, you can omit capacity for either DRIVER or EXECUTOR.
- Spark
-
aws emr-serverless create-application \
    --type "SPARK" \
    --name <my-application-name> \
    --release-label emr-7.1.0 \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "workerConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB",
                "disk": "50GB",
                "diskType": "SHUFFLE_OPTIMIZED"
            }
        },
        "EXECUTOR": {
            "workerCount": 50,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB",
                "disk": "500GB",
                "diskType": "SHUFFLE_OPTIMIZED"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'
- Hive
-
aws emr-serverless create-application \
    --type "HIVE" \
    --name <my-application-name> \
    --release-label emr-7.1.0 \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "workerConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB",
                "disk": "50GB",
                "diskType": "SHUFFLE_OPTIMIZED"
            }
        },
        "EXECUTOR": {
            "workerCount": 50,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB",
                "disk": "500GB",
                "diskType": "SHUFFLE_OPTIMIZED"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'