EMR Spark MagicCommitProtocol
Starting with Amazon EMR 6.15.0, MagicCommitProtocol is the default FileCommitProtocol for Spark when the S3A filesystem is used.
MagicCommitProtocol
The MagicCommitProtocol is an alternative implementation of the FileCommitProtocol interface. It is the default implementation used by Spark running on Amazon Elastic Map Reduce (EMR) when the S3A filesystem is used, and it internally uses the MagicV2Committer to perform the file writes to Amazon S3.
For static insert operations, the MagicCommitProtocol writes the files in the job's output location during the task commit phase. In contrast, for dynamic insert overwrite operations, the files written by task attempts only appear in the job's output location upon job commit. This is achieved by exporting the commit metadata back to the Spark driver on the task commit call.
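For illustration, the two cases above can be sketched as hypothetical spark-sql invocations; the table names sales and staging_sales are made up for this example.

```shell
# Static insert overwrite: the target partition is fixed up front, so the
# MagicCommitProtocol writes files to the output location at task commit.
spark-sql -e "INSERT OVERWRITE TABLE sales PARTITION (dt = '2024-01-01')
              SELECT id, amount FROM staging_sales WHERE dt = '2024-01-01';"

# Dynamic insert overwrite: the partitions are determined from the data, so
# the files appear in the output location only at job commit.
spark-sql --conf spark.sql.sources.partitionOverwriteMode=dynamic \
          -e "INSERT OVERWRITE TABLE sales PARTITION (dt)
              SELECT id, amount, dt FROM staging_sales;"
```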
Enabling MagicCommitProtocol
The MagicCommitProtocol is enabled by default for Spark running on Amazon Elastic Map Reduce (EMR) when using the S3A filesystem.
To use the S3A filesystem, you can either:

- Use the s3a:// file scheme when defining the table, partition, or directory.
- Set fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem in core-site.xml.
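If you take the configuration route, the entry in core-site.xml would look roughly like the following sketch; merge it into your cluster's existing configuration rather than replacing the file.

```xml
<!-- core-site.xml: route the s3:// scheme to the S3A filesystem -->
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
```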
Disabling the MagicCommitProtocol
- Set spark.sql.execution.datasources.SQLEmrOptimizedCommitProtocol.leverageMagicCommitProtocol to false by hard-coding it in a SparkConf, passing it as a --conf parameter in the Spark shell or the spark-submit and spark-sql tools, or setting it in conf/spark-defaults.conf. For more information, see Spark configuration in the Apache Spark documentation. The following example shows how to disable MagicCommitProtocol while running a spark-sql command.

  spark-sql \
    --conf spark.sql.execution.datasources.SQLEmrOptimizedCommitProtocol.leverageMagicCommitProtocol=false \
    -e "INSERT OVERWRITE TABLE target_table SELECT * FROM source_table;"
- Use the spark-defaults configuration classification to set the spark.sql.execution.datasources.SQLEmrOptimizedCommitProtocol.leverageMagicCommitProtocol property to false. For more information, see Configure applications.
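On Amazon EMR, configuration classifications are typically supplied as JSON when you create the cluster. A classification that disables MagicCommitProtocol would look along these lines:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.execution.datasources.SQLEmrOptimizedCommitProtocol.leverageMagicCommitProtocol": "false"
    }
  }
]
```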
MagicCommitProtocol considerations
- For static partition inserts, the MagicCommitProtocol consumes a small amount of memory on each Spark executor for every file written by a task attempt, until the task is committed or aborted. In most jobs, the amount of memory consumed is negligible. There is no extra memory requirement on the Spark driver.
- For dynamic partition inserts, the MagicCommitProtocol requires memory on the Spark driver to store metadata for each committed file until the job is committed or aborted. In most jobs, this overhead is negligible relative to the default Spark driver memory setting.
For jobs with long-running tasks that write a large number of files, the memory that the commit protocol consumes may be noticeable and may require adjusting the memory allocated to Spark, especially for Spark executors. You can tune memory using the spark.driver.memory property for Spark drivers and the spark.executor.memory property for Spark executors. As a guideline, a single task writing 100,000 files typically requires an additional 200 MB of memory. For more information, see Application properties in the Apache Spark Configuration documentation.
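Based on the guideline above (roughly 200 MB of extra memory per 100,000 files written by a single task), a quick back-of-the-envelope estimate can be scripted; the file count below is a made-up workload figure, not a measured value.

```shell
# Hypothetical estimate using the ~200 MB per 100,000 files guideline.
files_per_task=250000                          # made-up workload figure
extra_mb=$(( files_per_task * 200 / 100000 ))  # linear scaling of the guideline
echo "Estimated extra memory per task: ${extra_mb} MB"   # → 500 MB
```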