

# Programming AWS Glue ETL scripts in Scala
<a name="aws-glue-programming-scala"></a>

You can find Scala code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

In addition to the PySpark dialect, AWS Glue supports a Scala dialect for scripting extract, transform, and load (ETL) jobs. The following sections describe how to use the AWS Glue Scala library and the AWS Glue API in ETL scripts, and provide reference documentation for the library.

**Contents**
+ [Using Scala](glue-etl-scala-using.md)
  + [Testing on a DevEndpoint notebook](glue-etl-scala-using.md#aws-glue-programming-scala-using-notebook)
  + [Testing on a DevEndpoint REPL](glue-etl-scala-using.md#aws-glue-programming-scala-using-repl)
+ [Scala script example](glue-etl-scala-example.md)
+ [Scala API list](glue-etl-scala-apis.md)
  + [com.amazonaws.services.glue](glue-etl-scala-apis.md#glue-etl-scala-apis-glue)
  + [com.amazonaws.services.glue.ml](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-ml)
  + [com.amazonaws.services.glue.dq](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-dq)
  + [com.amazonaws.services.glue.types](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-types)
  + [com.amazonaws.services.glue.util](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-util)
  + [ChoiceOption](glue-etl-scala-apis-glue-choiceoption.md)
    + [ChoiceOption trait](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-trait)
    + [ChoiceOption object](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-object)
      + [Apply](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-object-def-apply)
    + [ChoiceOptionWithResolver](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class)
    + [MatchCatalogSchemaChoiceOption](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class)
  + [DataSink](glue-etl-scala-apis-glue-datasink-class.md)
    + [writeDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame)
    + [pyWriteDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame)
    + [writeDataFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame)
    + [pyWriteDataFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame)
    + [setCatalogInfo](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo)
    + [supportsFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat)
    + [setFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setFormat)
    + [withFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-withFormat)
    + [setAccumulableSize](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize)
    + [getOutputErrorRecordsAccumulable](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable)
    + [errorsAsDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame)
    + [DataSink object](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-object)
      + [recordMetrics](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-object-defs-recordMetrics)
  + [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md)
  + [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md)
    + [DynamicFrame class](glue-etl-scala-apis-glue-dynamicframe-class.md)
      + [errorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount)
      + [applyMapping](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping)
      + [assertErrorThreshold](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold)
      + [Count](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-count)
      + [dropField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField)
      + [dropFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields)
      + [dropNulls](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls)
      + [errorsAsDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame)
      + [Filter](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-filter)
      + [getName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getName)
      + [getNumPartitions](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions)
      + [getSchemaIfComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed)
      + [isSchemaComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed)
      + [javaToPython](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython)
      + [Join](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-join)
      + [Map](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-map)
      + [mergeDynamicFrames](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-merge)
      + [printSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema)
      + [recomputeSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema)
      + [Relationalize](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize)
      + [renameField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField)
      + [Repartition](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition)
      + [resolveChoice](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice)
      + [Schema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-schema)
      + [selectField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField)
      + [selectFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields)
      + [Show](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-show)
      + [SimplifyDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson)
      + [Spigot](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot)
      + [splitFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields)
      + [splitRows](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows)
      + [stageErrorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount)
      + [toDF](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF)
      + [Unbox](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox)
      + [Unnest](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest)
      + [unnestDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson)
      + [withFrameSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema)
      + [withName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withName)
      + [withTransformationContext](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext)
    + [DynamicFrame object](glue-etl-scala-apis-glue-dynamicframe-object.md)
      + [Def apply](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-apply)
      + [Def emptyDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame)
      + [Def fromPythonRDD](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD)
      + [Def ignoreErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors)
      + [Def inlineErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors)
      + [Def newFrameWithErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors)
  + [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md)
    + [addField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField)
    + [dropField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField)
    + [setError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError)
    + [isError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError)
    + [getError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError)
    + [clearError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError)
    + [Write](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-write)
    + [readFields](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields)
    + [Clone](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone)
    + [Schema](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema)
    + [getRoot](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot)
    + [toJson](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson)
    + [getFieldNode](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode)
    + [getField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField)
    + [hashCode](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode)
    + [Equals](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals)
    + [DynamicRecord object](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-object)
      + [Apply](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-object-defs-apply)
    + [RecordTraverser trait](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-recordtraverser-trait)
  + [GlueContext](glue-etl-scala-apis-glue-gluecontext.md)
    + [addIngestionTimeColumns](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-addIngestionTimeColumns)
    + [createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions)
    + [forEachBatch](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-forEachBatch)
    + [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink)
    + [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource)
    + [getJDBCSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getJDBCSink)
    + [getSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSink)
    + [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat)
    + [getSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSource)
    + [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat)
    + [getSparkSession](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSparkSession)
    + [startTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-start-transaction)
    + [commitTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-commit-transaction)
    + [cancelTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-cancel-transaction)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-1)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-2)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-3)
  + [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md)
    + [MappingSpec case class](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-case-class)
    + [MappingSpec object](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object)
    + [orderingByTarget](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-val-orderingbytarget)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-1)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-2)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-3)
  + [ResolveSpec](glue-etl-scala-apis-glue-resolvespec.md)
    + [ResolveSpec object](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object)
      + [Apply](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object-def-apply_1)
      + [Apply](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object-def-apply_2)
    + [ResolveSpec case class](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-case-class)
      + [Def methods](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-case-class-defs)
  + [ArrayNode](glue-etl-scala-apis-glue-types-arraynode.md)
    + [ArrayNode case class](glue-etl-scala-apis-glue-types-arraynode.md#glue-etl-scala-apis-glue-types-arraynode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-arraynode.md#glue-etl-scala-apis-glue-types-arraynode-case-class-defs)
  + [BinaryNode](glue-etl-scala-apis-glue-types-binarynode.md)
    + [BinaryNode case class](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class-defs)
  + [BooleanNode](glue-etl-scala-apis-glue-types-booleannode.md)
    + [BooleanNode case class](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class-defs)
  + [ByteNode](glue-etl-scala-apis-glue-types-bytenode.md)
    + [ByteNode case class](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class-defs)
  + [DateNode](glue-etl-scala-apis-glue-types-datenode.md)
    + [DateNode case class](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class-defs)
  + [DecimalNode](glue-etl-scala-apis-glue-types-decimalnode.md)
    + [DecimalNode case class](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class-defs)
  + [DoubleNode](glue-etl-scala-apis-glue-types-doublenode.md)
    + [DoubleNode case class](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class-defs)
  + [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)
    + [DynamicNode class](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-class-defs)
    + [DynamicNode object](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-object-defs)
  + [EvaluateDataQuality](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md)
    + [apply](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md#glue-etl-scala-apis-glue-dq-EvaluateDataQuality-defs-apply)
    + [Example](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md#glue-etl-scala-apis-glue-dq-EvaluateDataQuality-example)
  + [FloatNode](glue-etl-scala-apis-glue-types-floatnode.md)
    + [FloatNode case class](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class-defs)
  + [FillMissingValues](glue-etl-scala-apis-glue-ml-fillmissingvalues.md)
    + [Apply](glue-etl-scala-apis-glue-ml-fillmissingvalues.md#glue-etl-scala-apis-glue-ml-fillmissingvalues-defs-apply)
  + [FindMatches](glue-etl-scala-apis-glue-ml-findmatches.md)
    + [Apply](glue-etl-scala-apis-glue-ml-findmatches.md#glue-etl-scala-apis-glue-ml-findmatches-defs-apply)
  + [FindIncrementalMatches](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)
    + [Apply](glue-etl-scala-apis-glue-ml-findincrementalmatches.md#glue-etl-scala-apis-glue-ml-findincrementalmatches-defs-apply)
  + [IntegerNode](glue-etl-scala-apis-glue-types-integernode.md)
    + [IntegerNode case class](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class-defs)
  + [LongNode](glue-etl-scala-apis-glue-types-longnode.md)
    + [LongNode case class](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class-defs)
  + [MapLikeNode](glue-etl-scala-apis-glue-types-maplikenode.md)
    + [MapLikeNode class](glue-etl-scala-apis-glue-types-maplikenode.md#glue-etl-scala-apis-glue-types-maplikenode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-maplikenode.md#glue-etl-scala-apis-glue-types-maplikenode-class-defs)
  + [MapNode](glue-etl-scala-apis-glue-types-mapnode.md)
    + [MapNode case class](glue-etl-scala-apis-glue-types-mapnode.md#glue-etl-scala-apis-glue-types-mapnode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-mapnode.md#glue-etl-scala-apis-glue-types-mapnode-case-class-defs)
  + [NullNode](glue-etl-scala-apis-glue-types-nullnode.md)
    + [NullNode class](glue-etl-scala-apis-glue-types-nullnode.md#glue-etl-scala-apis-glue-types-nullnode-class)
    + [NullNode case object](glue-etl-scala-apis-glue-types-nullnode.md#glue-etl-scala-apis-glue-types-nullnode-case-object)
  + [ObjectNode](glue-etl-scala-apis-glue-types-objectnode.md)
    + [ObjectNode object](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-object-defs)
    + [ObjectNode case class](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-case-class-defs)
  + [ScalarNode](glue-etl-scala-apis-glue-types-scalarnode.md)
    + [ScalarNode class](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-class-defs)
    + [ScalarNode object](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-object-defs)
  + [ShortNode](glue-etl-scala-apis-glue-types-shortnode.md)
    + [ShortNode case class](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class-defs)
  + [StringNode](glue-etl-scala-apis-glue-types-stringnode.md)
    + [StringNode case class](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class-defs)
  + [TimestampNode](glue-etl-scala-apis-glue-types-timestampnode.md)
    + [TimestampNode case class](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class-defs)
  + [GlueArgParser](glue-etl-scala-apis-glue-util-glueargparser.md)
    + [GlueArgParser object](glue-etl-scala-apis-glue-util-glueargparser.md#glue-etl-scala-apis-glue-util-glueargparser-object)
      + [Def methods](glue-etl-scala-apis-glue-util-glueargparser.md#glue-etl-scala-apis-glue-util-glueargparser-object-defs)
  + [Job](glue-etl-scala-apis-glue-util-job.md)
    + [Job object](glue-etl-scala-apis-glue-util-job.md#glue-etl-scala-apis-glue-util-job-object)
      + [Def methods](glue-etl-scala-apis-glue-util-job.md#glue-etl-scala-apis-glue-util-job-object-defs)

# Using Scala to program AWS Glue ETL scripts
<a name="glue-etl-scala-using"></a>

You can automatically generate a Scala extract, transform, and load (ETL) program using the AWS Glue console, and modify it as needed before assigning it to a job. Or, you can write your own program from scratch. For more information, see [Configuring job properties for Spark jobs in AWS Glue](add-job.md). AWS Glue then compiles your Scala program on the server before running the associated job.

To ensure that your program compiles without errors and runs as expected, it's important that you load it on a development endpoint in a REPL (Read-Eval-Print Loop) or a Jupyter Notebook and test it there before running it in a job. Because the compile process occurs on the server, you will not have good visibility into any problems that happen there.

## Testing a Scala ETL program in a Jupyter notebook on a development endpoint
<a name="aws-glue-programming-scala-using-notebook"></a>

To test a Scala program on an AWS Glue development endpoint, set up the development endpoint as described in [Adding a development endpoint](add-dev-endpoint.md).

Next, connect it to a Jupyter Notebook that is either running locally on your machine or remotely on an Amazon EC2 notebook server. To install a local version of a Jupyter Notebook, follow the instructions in [Tutorial: Jupyter notebook in JupyterLab](dev-endpoint-tutorial-local-jupyter.md).

The only difference between running Scala code and running PySpark code in your notebook is that you should start each paragraph in the notebook with the following:

```
%spark
```

This prevents the notebook server from defaulting to the PySpark flavor of the Spark interpreter.
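For example, a notebook paragraph running Scala might look like the following (a minimal sketch; it assumes the interpreter provides the usual `sc` SparkContext):

```
%spark
val rdd = sc.parallelize(1 to 10)
rdd.count
```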

## Testing a Scala ETL program in a Scala REPL
<a name="aws-glue-programming-scala-using-repl"></a>

You can test a Scala program on a development endpoint using the AWS Glue Scala REPL. Follow the instructions in [Tutorial: Use a REPL shell](dev-endpoint-tutorial-repl.md), except at the end of the SSH-to-REPL command, replace `-t gluepyspark` with `-t glue-spark-shell`. This invokes the AWS Glue Scala REPL.
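With that substitution, the SSH command takes a form like the following (a sketch with placeholder values for the key file and the development endpoint address; see the linked tutorial for the exact command):

```
ssh -i private-key-file-path glue@dev-endpoint-public-address -t glue-spark-shell
```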

To close the REPL when you are finished, type `sys.exit`.

# Scala script example - streaming ETL
<a name="glue-etl-scala-example"></a>

**Example**  
The following example script connects to Amazon Kinesis Data Streams, uses a schema from the Data Catalog to parse a data stream, joins the stream to a static dataset on Amazon S3, and outputs the joined results to Amazon S3 in Parquet format.  

```
// This script connects to an Amazon Kinesis stream, uses a schema from the data catalog to parse the stream,
// joins the stream to a static dataset on Amazon S3, and outputs the joined results to Amazon S3 in parquet format.
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import java.util.Calendar
import org.apache.spark.SparkContext
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import scala.collection.JavaConverters._

object streamJoiner {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val staticData = sparkSession.read          // read() returns type DataFrameReader
      .format("csv")
      .option("header", "true")
      .load("s3://amzn-s3-demo-bucket/inputs/productsStatic.csv")  // load() returns a DataFrame

    val datasource0 = sparkSession.readStream   // readStream returns type DataStreamReader
      .format("kinesis")
      .option("streamName", "stream-join-demo")
      .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
      .option("startingPosition", "TRIM_HORIZON")
      .load                                     // load() returns a DataFrame

    val selectfields1 = datasource0.select(from_json($"data".cast("string"), glueContext.getCatalogSchemaAsSparkSchema("stream-demos", "stream-join-demo2")) as "data").select("data.*")

    val datasink2 = selectfields1.writeStream.foreachBatch { (dataFrame: Dataset[Row], batchId: Long) => {   //foreachBatch() returns type DataStreamWriter
      val joined = dataFrame.join(staticData, "product_id")
      val year: Int = Calendar.getInstance().get(Calendar.YEAR)
      val month: Int = Calendar.getInstance().get(Calendar.MONTH) + 1
      val day: Int = Calendar.getInstance().get(Calendar.DATE)
      val hour: Int = Calendar.getInstance().get(Calendar.HOUR_OF_DAY)

      if (dataFrame.count() > 0) {
        joined.write                           // joined.write returns type DataFrameWriter
          .mode(SaveMode.Append)
          .format("parquet")
          .option("quote", " ")
          .save("s3://amzn-s3-demo-bucket/output" + "/year=" + "%04d".format(year) + "/month=" + "%02d".format(month) + "/day=" + "%02d".format(day) + "/hour=" + "%02d".format(hour) + "/")
      }
    }
    }  // end foreachBatch()
      .trigger(Trigger.ProcessingTime("100 seconds"))
      .option("checkpointLocation", "s3://amzn-s3-demo-bucket/checkpoint/")
      .start().awaitTermination()              // start() returns type StreamingQuery
    Job.commit()
  }
}
```
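The hour-partitioned output prefix assembled inline in the `save()` call above can be sketched as a standalone helper (the object and method names here are illustrative, not part of the AWS Glue API):

```scala
object PartitionPath {
  // Build a zero-padded year=/month=/day=/hour= S3 prefix, matching the
  // layout the streaming example writes its Parquet output under.
  def build(base: String, year: Int, month: Int, day: Int, hour: Int): String =
    f"$base/year=$year%04d/month=$month%02d/day=$day%02d/hour=$hour%02d/"
}
```

For example, `PartitionPath.build("s3://amzn-s3-demo-bucket/output", 2024, 3, 7, 9)` yields `s3://amzn-s3-demo-bucket/output/year=2024/month=03/day=07/hour=09/`.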

# APIs in the AWS Glue Scala library
<a name="glue-etl-scala-apis"></a>

In addition to the PySpark dialect, AWS Glue supports a Scala dialect for scripting extract, transform, and load (ETL) jobs. The following sections describe the APIs in the AWS Glue Scala library.

## com.amazonaws.services.glue
<a name="glue-etl-scala-apis-glue"></a>

The **com.amazonaws.services.glue** package in the AWS Glue Scala library contains the following APIs:
+ [ChoiceOption](glue-etl-scala-apis-glue-choiceoption.md)
+ [DataSink](glue-etl-scala-apis-glue-datasink-class.md)
+ [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md)
+ [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md)
+ [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md)
+ [GlueContext](glue-etl-scala-apis-glue-gluecontext.md)
+ [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md)
+ [ResolveSpec](glue-etl-scala-apis-glue-resolvespec.md)

## com.amazonaws.services.glue.ml
<a name="glue-etl-scala-apis-glue-ml"></a>

The **com.amazonaws.services.glue.ml** package in the AWS Glue Scala library contains the following APIs:
+ [FillMissingValues](glue-etl-scala-apis-glue-ml-fillmissingvalues.md)
+ [FindIncrementalMatches](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)
+ [FindMatches](glue-etl-scala-apis-glue-ml-findmatches.md)

## com.amazonaws.services.glue.dq
<a name="glue-etl-scala-apis-glue-dq"></a>

The **com.amazonaws.services.glue.dq** package in the AWS Glue Scala library contains the following APIs:
+ [EvaluateDataQuality](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md)

## com.amazonaws.services.glue.types
<a name="glue-etl-scala-apis-glue-types"></a>

The **com.amazonaws.services.glue.types** package in the AWS Glue Scala library contains the following APIs:
+ [ArrayNode](glue-etl-scala-apis-glue-types-arraynode.md)
+ [BinaryNode](glue-etl-scala-apis-glue-types-binarynode.md)
+ [BooleanNode](glue-etl-scala-apis-glue-types-booleannode.md)
+ [ByteNode](glue-etl-scala-apis-glue-types-bytenode.md)
+ [DateNode](glue-etl-scala-apis-glue-types-datenode.md)
+ [DecimalNode](glue-etl-scala-apis-glue-types-decimalnode.md)
+ [DoubleNode](glue-etl-scala-apis-glue-types-doublenode.md)
+ [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)
+ [FloatNode](glue-etl-scala-apis-glue-types-floatnode.md)
+ [IntegerNode](glue-etl-scala-apis-glue-types-integernode.md)
+ [LongNode](glue-etl-scala-apis-glue-types-longnode.md)
+ [MapLikeNode](glue-etl-scala-apis-glue-types-maplikenode.md)
+ [MapNode](glue-etl-scala-apis-glue-types-mapnode.md)
+ [NullNode](glue-etl-scala-apis-glue-types-nullnode.md)
+ [ObjectNode](glue-etl-scala-apis-glue-types-objectnode.md)
+ [ScalarNode](glue-etl-scala-apis-glue-types-scalarnode.md)
+ [ShortNode](glue-etl-scala-apis-glue-types-shortnode.md)
+ [StringNode](glue-etl-scala-apis-glue-types-stringnode.md)
+ [TimestampNode](glue-etl-scala-apis-glue-types-timestampnode.md)

## com.amazonaws.services.glue.util
<a name="glue-etl-scala-apis-glue-util"></a>

The **com.amazonaws.services.glue.util** package in the AWS Glue Scala library contains the following APIs:
+ [GlueArgParser](glue-etl-scala-apis-glue-util-glueargparser.md)
+ [Job](glue-etl-scala-apis-glue-util-job.md)

# AWS Glue Scala ChoiceOption APIs
<a name="glue-etl-scala-apis-glue-choiceoption"></a>

**Topics**
+ [ChoiceOption trait](#glue-etl-scala-apis-glue-choiceoption-trait)
+ [ChoiceOption object](#glue-etl-scala-apis-glue-choiceoption-object)
+ [Case class ChoiceOptionWithResolver](#glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class)
+ [Case class MatchCatalogSchemaChoiceOption](#glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class)

**Package: com.amazonaws.services.glue**

## ChoiceOption trait
<a name="glue-etl-scala-apis-glue-choiceoption-trait"></a>

```
trait ChoiceOption extends Serializable 
```

## ChoiceOption object
<a name="glue-etl-scala-apis-glue-choiceoption-object"></a>


```
object ChoiceOption
```

A general strategy for resolving choices that applies to all `ChoiceType` nodes in a `DynamicFrame`.
+ `val CAST`
+ `val MAKE_COLS`
+ `val MAKE_STRUCT`
+ `val MATCH_CATALOG`
+ `val PROJECT`

### Def apply
<a name="glue-etl-scala-apis-glue-choiceoption-object-def-apply"></a>

```
def apply(choice: String): ChoiceOption
```
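As an illustration of how a string-keyed factory in this style can behave, here is a hypothetical, self-contained sketch. It is not the library source; only the constant names mirror the vals listed above, and the string values are assumptions:

```scala
// Hypothetical sketch of a string-validating factory in the style of
// ChoiceOption.apply. The constant values are assumptions, not the real ones.
object ChoiceOptionSketch {
  val CAST          = "cast"
  val MAKE_COLS     = "make_cols"
  val MAKE_STRUCT   = "make_struct"
  val MATCH_CATALOG = "match_catalog"
  val PROJECT       = "project"

  private val known = Set(CAST, MAKE_COLS, MAKE_STRUCT, MATCH_CATALOG, PROJECT)

  // Accepts a bare strategy ("make_struct") or a parameterized one ("cast:long").
  def apply(choice: String): String = {
    val strategy = choice.split(":", 2).head
    require(known.contains(strategy), s"unknown choice strategy: $strategy")
    choice
  }
}
```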



## Case class ChoiceOptionWithResolver
<a name="glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class"></a>

```
case class ChoiceOptionWithResolver(name: String, choiceResolver: ChoiceResolver) extends ChoiceOption {}
```



## Case class MatchCatalogSchemaChoiceOption
<a name="glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class"></a>

```
case class MatchCatalogSchemaChoiceOption() extends ChoiceOption {}
```



# Abstract DataSink class
<a name="glue-etl-scala-apis-glue-datasink-class"></a>

**Topics**
+ [Def writeDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame)
+ [Def pyWriteDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame)
+ [Def writeDataFrame](#glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame)
+ [Def pyWriteDataFrame](#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame)
+ [Def setCatalogInfo](#glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo)
+ [Def supportsFormat](#glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat)
+ [Def setFormat](#glue-etl-scala-apis-glue-datasink-class-defs-setFormat)
+ [Def withFormat](#glue-etl-scala-apis-glue-datasink-class-defs-withFormat)
+ [Def setAccumulableSize](#glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize)
+ [Def getOutputErrorRecordsAccumulable](#glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable)
+ [Def errorsAsDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame)
+ [DataSink object](#glue-etl-scala-apis-glue-datasink-object)

**Package: com.amazonaws.services.glue**

```
abstract class DataSink
```

The writer analog to a `DataSource`. `DataSink` encapsulates a destination and a format that a `DynamicFrame` can be written to.

## Def writeDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame"></a>

```
def writeDynamicFrame( frame : DynamicFrame,
                       callSite : CallSite = CallSite("Not provided", "")
                     ) : DynamicFrame
```



## Def pyWriteDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame"></a>

```
def pyWriteDynamicFrame( frame : DynamicFrame,
                         site : String = "Not provided",
                         info : String = "" )
```



## Def writeDataFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame"></a>

```
def writeDataFrame(frame: DataFrame,
                   glueContext: GlueContext,
                   callSite: CallSite = CallSite("Not provided", "")
                   ): DataFrame
```



## Def pyWriteDataFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame"></a>

```
def pyWriteDataFrame(frame: DataFrame,
                     glueContext: GlueContext,
                     site: String = "Not provided",
                     info: String = ""
                     ): DataFrame
```



## Def setCatalogInfo
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo"></a>

```
def setCatalogInfo(catalogDatabase: String, 
                   catalogTableName : String, 
                   catalogId : String = "")
```



## Def supportsFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat"></a>

```
def supportsFormat( format : String ) : Boolean
```



## Def setFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setFormat"></a>

```
def setFormat( format : String,
               options : JsonOptions
             ) : Unit
```



## Def withFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-withFormat"></a>

```
def withFormat( format : String,
                options : JsonOptions = JsonOptions.empty
              ) : DataSink
```



## Def setAccumulableSize
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize"></a>

```
def setAccumulableSize( size : Int ) : Unit
```



## Def getOutputErrorRecordsAccumulable
<a name="glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable"></a>

```
def getOutputErrorRecordsAccumulable : Accumulable[List[OutputError], OutputError]
```



## Def errorsAsDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame"></a>

```
def errorsAsDynamicFrame : DynamicFrame
```



## DataSink object
<a name="glue-etl-scala-apis-glue-datasink-object"></a>

```
object DataSink
```

### Def recordMetrics
<a name="glue-etl-scala-apis-glue-datasink-object-defs-recordMetrics"></a>

```
def recordMetrics( frame : DynamicFrame,
                   ctxt : String
                 ) : DynamicFrame
```



# AWS Glue Scala DataSource trait
<a name="glue-etl-scala-apis-glue-datasource-trait"></a>

**Package: com.amazonaws.services.glue**

A high-level interface for producing a `DynamicFrame`.

```
trait DataSource {

  def getDynamicFrame : DynamicFrame 

  def getDynamicFrame( minPartitions : Int,
                       targetPartitions : Int
                     ) : DynamicFrame 
  def getDataFrame : DataFrame
					 
  /** @param num: the number of records for sampling.
    * @param options: optional parameters to control sampling behavior. Parameters currently available for Amazon S3 sources:
    *  1. maxSamplePartitions: the maximum number of partitions the sampling will read. 
    *  2. maxSampleFilesPerPartition: the maximum number of files the sampling will read in one partition.
    */
  def getSampleDynamicFrame(num:Int, options: JsonOptions = JsonOptions.empty): DynamicFrame 

  def glueContext : GlueContext

  def setFormat( format : String,
                 options : String
               ) : Unit 

  def setFormat( format : String,
                 options : JsonOptions
               ) : Unit

  def supportsFormat( format : String ) : Boolean

  def withFormat( format : String,
                  options : JsonOptions = JsonOptions.empty
                ) : DataSource 
}
```

# AWS Glue Scala DynamicFrame APIs
<a name="glue-etl-scala-apis-glue-dynamicframe"></a>

**Package: com.amazonaws.services.glue**

**Contents**
+ [AWS Glue Scala DynamicFrame class](glue-etl-scala-apis-glue-dynamicframe-class.md)
  + [Val errorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount)
  + [Def applyMapping](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping)
  + [Def assertErrorThreshold](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold)
  + [Def count](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-count)
  + [Def dropField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField)
  + [Def dropFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields)
  + [Def dropNulls](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls)
  + [Def errorsAsDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame)
  + [Def filter](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-filter)
  + [Def getName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getName)
  + [Def getNumPartitions](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions)
  + [Def getSchemaIfComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed)
  + [Def isSchemaComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed)
  + [Def javaToPython](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython)
  + [Def join](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-join)
  + [Def map](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-map)
  + [Def mergeDynamicFrames](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-merge)
  + [Def printSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema)
  + [Def recomputeSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema)
  + [Def relationalize](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize)
  + [Def renameField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField)
  + [Def repartition](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition)
  + [Def resolveChoice](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice)
  + [Def schema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-schema)
  + [Def selectField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField)
  + [Def selectFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields)
  + [Def show](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-show)
  + [Def simplifyDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson)
  + [Def spigot](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot)
  + [Def splitFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields)
  + [Def splitRows](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows)
  + [Def stageErrorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount)
  + [Def toDF](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF)
  + [Def unbox](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox)
  + [Def unnest](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest)
  + [Def unnestDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson)
  + [Def withFrameSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema)
  + [Def withName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withName)
  + [Def withTransformationContext](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext)
+ [The DynamicFrame object](glue-etl-scala-apis-glue-dynamicframe-object.md)
  + [Def apply](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-apply)
  + [Def emptyDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame)
  + [Def fromPythonRDD](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD)
  + [Def ignoreErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors)
  + [Def inlineErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors)
  + [Def newFrameWithErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors)

# AWS Glue Scala DynamicFrame class
<a name="glue-etl-scala-apis-glue-dynamicframe-class"></a>

**Package: com.amazonaws.services.glue**

```
class DynamicFrame( val glueContext : GlueContext,
                    _records : RDD[DynamicRecord],
                    val name : String = "",
                    val transformationContext : String = DynamicFrame.UNDEFINED,
                    callSite : CallSite = CallSite("Not provided", ""),
                    stageThreshold : Long = 0,
                    totalThreshold : Long = 0,
                    prevErrors : => Long = 0,
                    errorExpr : => Unit = {}
                  ) extends Serializable with Logging
```

A `DynamicFrame` is a distributed collection of self-describing [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md) objects.

`DynamicFrame`s are designed to provide a flexible data model for ETL (extract, transform, and load) operations. They don't require a schema to create, and you can use them to read and transform data that contains messy or inconsistent values and types. A schema can be computed on demand for those operations that need one.

`DynamicFrame`s provide a range of transformations for data cleaning and ETL. They also support conversion to and from SparkSQL DataFrames to integrate with existing code and the many analytics operations that DataFrames provide.

The following parameters are shared across many of the AWS Glue transformations that construct `DynamicFrame`s:
+ `transformationContext` — The identifier for this `DynamicFrame`. The `transformationContext` is used as a key for job bookmark state that is persisted across runs.
+ `callSite` — Provides context information for error reporting. These values are automatically set when calling from Python.
+ `stageThreshold` — The maximum number of error records that are allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records that are present in the previous `DynamicFrame`.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames.

## Val errorsCount
<a name="glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount"></a>

```
val errorsCount
```

The number of error records in this `DynamicFrame`. This includes errors from previous operations.

## Def applyMapping
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping"></a>

```
def applyMapping( mappings : Seq[Product4[String, String, String, String]],
                  caseSensitive : Boolean = true,
                  transformationContext : String = "",
                  callSite : CallSite = CallSite("Not provided", ""),
                  stageThreshold : Long = 0,
                  totalThreshold : Long = 0
                ) : DynamicFrame
```
+ `mappings` — A sequence of mappings to construct a new `DynamicFrame`.
+ `caseSensitive` — Whether to treat source columns as case sensitive. Setting this to false might help when integrating with case-insensitive stores like the AWS Glue Data Catalog.

Selects, projects, and casts columns based on a sequence of mappings.

Each mapping is made up of a source column and type and a target column and type. Mappings can be specified as either a four-tuple (`source_path`, `source_type`, `target_path`, `target_type`) or a [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md) object containing the same information.

In addition to using mappings for simple projections and casting, you can use them to nest or unnest fields by separating components of the path with '`.`' (period). 

For example, suppose that you have a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- zip: int
```

You can make the following call to unnest the `state` and `zip` fields.

```
df.applyMapping(
  Seq(("name", "string", "name", "string"),
      ("age", "int", "age", "int"),
      ("address.state", "string", "state", "string"),
      ("address.zip", "int", "zip", "int")))
```

The resulting schema is as follows.

```
root
|-- name: string
|-- age: int
|-- state: string
|-- zip: int
```

You can also use `applyMapping` to re-nest columns. For example, the following inverts the previous transformation and creates a struct named `address` in the target.

```
df.applyMapping(
  Seq(("name", "string", "name", "string"),
      ("age", "int", "age", "int"),
      ("state", "string", "address.state", "string"),
      ("zip", "int", "address.zip", "int")))
```

Field names that contain '`.`' (period) characters can be quoted by using backticks (`` ` ``).

**Note**  
Currently, you can't use the `applyMapping` method to map columns that are nested under arrays.
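Outside of AWS Glue, the period-path semantics that `applyMapping` uses can be sketched on plain nested `Map`s. This is only an illustration of how dotted source paths select nested fields, not the library implementation:

```scala
// A dotted path like "address.state" walks into nested Maps,
// analogous to how applyMapping addresses fields inside structs.
def select(record: Map[String, Any], path: String): Option[Any] =
  path.split('.').foldLeft(Option[Any](record)) {
    case (Some(m: Map[String, Any] @unchecked), key) => m.get(key)
    case _                                           => None
  }

val record = Map(
  "name"    -> "Nancy",
  "address" -> Map("state" -> "WA", "zip" -> 98101))

// Unnest address.state and address.zip into top-level columns.
val unnested = Map(
  "name"  -> select(record, "name").get,
  "state" -> select(record, "address.state").get,
  "zip"   -> select(record, "address.zip").get)
```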

## Def assertErrorThreshold
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold"></a>

```
def assertErrorThreshold : Unit
```

An action that forces computation and verifies that the number of error records falls below `stageThreshold` and `totalThreshold`. Throws an exception if either condition fails.

## Def count
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-count"></a>

```
lazy val count
```

Returns the number of elements in this `DynamicFrame`.

## Def dropField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField"></a>

```
def dropField( path : String,
               transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0
             ) : DynamicFrame
```

Returns a new `DynamicFrame` with the specified column removed.

## Def dropFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields"></a>

```
def dropFields( fieldNames : Seq[String],   // The column names to drop.
                transformationContext : String = "",
                callSite : CallSite = CallSite("Not provided", ""),
                stageThreshold : Long = 0,
                totalThreshold : Long = 0
              ) : DynamicFrame
```

Returns a new `DynamicFrame` with the specified columns removed.

You can use this method to delete nested columns, including those inside of arrays, but not to drop specific array elements.

## Def dropNulls
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls"></a>

```
def dropNulls( transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0 )
```

Returns a new `DynamicFrame` with all null columns removed.

**Note**  
This only removes columns of type `NullType`. Individual null values in other columns are not removed or modified.

## Def errorsAsDynamicFrame
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame"></a>

```
def errorsAsDynamicFrame
```

Returns a new `DynamicFrame` containing the error records from this `DynamicFrame`.

## Def filter
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-filter"></a>

```
def filter( f : DynamicRecord => Boolean,
            errorMsg : String = "",
            transformationContext : String = "",
            callSite : CallSite = CallSite("Not provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Constructs a new `DynamicFrame` containing only those records for which the function '`f`' returns `true`. The filter function '`f`' should not mutate the input record.

## Def getName
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getName"></a>

```
def getName : String 
```

Returns the name of this `DynamicFrame`.

## Def getNumPartitions
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions"></a>

```
def getNumPartitions
```

Returns the number of partitions in this `DynamicFrame`.

## Def getSchemaIfComputed
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed"></a>

```
def getSchemaIfComputed : Option[Schema] 
```

Returns the schema if it has already been computed. Does not scan the data if the schema has not already been computed.

## Def isSchemaComputed
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed"></a>

```
def isSchemaComputed : Boolean 
```

Returns `true` if the schema has been computed for this `DynamicFrame`, or `false` if not. If this method returns false, then calling the `schema` method requires another pass over the records in this `DynamicFrame`.

## Def javaToPython
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython"></a>

```
def javaToPython : JavaRDD[Array[Byte]] 
```



## Def join
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-join"></a>

```
def join( keys1 : Seq[String],
          keys2 : Seq[String],
          frame2 : DynamicFrame,
          transformationContext : String = "",
          callSite : CallSite = CallSite("Not provided", ""),
          stageThreshold : Long = 0,
          totalThreshold : Long = 0
        ) : DynamicFrame
```
+ `keys1` — The columns in this `DynamicFrame` to use for the join.
+ `keys2` — The columns in `frame2` to use for the join. Must be the same length as `keys1`.
+ `frame2` — The `DynamicFrame` to join against.

Returns the result of performing an equijoin with `frame2` using the specified keys.
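These equijoin semantics can be sketched with plain Scala collections, with key columns matched pairwise between the two sides. A minimal illustration, not the Glue implementation:

```scala
// Equijoin two record sets on pairwise key columns, in the spirit of
// join(keys1, keys2, frame2): keys1(i) on the left matches keys2(i) on the right.
def equijoin(left: Seq[Map[String, Any]], keys1: Seq[String],
             keys2: Seq[String], right: Seq[Map[String, Any]]): Seq[Map[String, Any]] =
  for {
    l <- left
    r <- right
    if keys1.map(l.get) == keys2.map(r.get) // compare key values pairwise
  } yield l ++ r

val orders   = Seq(Map("product_id" -> 1, "qty" -> 2))
val products = Seq(Map("id" -> 1, "name" -> "widget"), Map("id" -> 2, "name" -> "gadget"))
val joined   = equijoin(orders, Seq("product_id"), Seq("id"), products)
// joined: Seq(Map("product_id" -> 1, "qty" -> 2, "id" -> 1, "name" -> "widget"))
```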

## Def map
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-map"></a>

```
def map( f : DynamicRecord => DynamicRecord,
         errorMsg : String = "",
         transformationContext : String = "",
         callSite : CallSite = CallSite("Not provided", ""),
         stageThreshold : Long = 0,
         totalThreshold : Long = 0
       ) : DynamicFrame
```

Returns a new `DynamicFrame` constructed by applying the specified function '`f`' to each record in this `DynamicFrame`.

This method copies each record before applying the specified function, so it is safe to mutate the records. If the mapping function throws an exception on a given record, that record is marked as an error, and the stack trace is saved as a column in the error record.

## Def mergeDynamicFrames
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-merge"></a>

```
def mergeDynamicFrames( stageDynamicFrame: DynamicFrame,  primaryKeys: Seq[String], transformationContext: String = "",
                         options: JsonOptions = JsonOptions.empty, callSite: CallSite = CallSite("Not provided"),
                         stageThreshold: Long = 0, totalThreshold: Long = 0): DynamicFrame
```
+ `stageDynamicFrame` — The staging `DynamicFrame` to merge.
+ `primaryKeys` — The list of primary key fields to match records from the source and staging `DynamicFrame`s.
+ `transformationContext` — A unique string that is used to retrieve metadata about the current transformation (optional).
+ `options` — A string of JSON name-value pairs that provide additional information for this transformation.
+ `callSite` — Used to provide context information for error reporting.
+ `stageThreshold` — A `Long`. The maximum number of errors allowed in this transformation before processing fails.
+ `totalThreshold` — A `Long`. The maximum number of total errors, up to and including this transformation, before processing fails.

Merges this `DynamicFrame` with a staging `DynamicFrame` based on the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not de-duplicated. If there is no matching record in the staging frame, all records (including duplicates) are retained from the source. If the staging frame has matching records, they overwrite the corresponding records in the source.

The returned `DynamicFrame` contains record `A` in the following cases:

1. If `A` exists in both the source frame and the staging frame, the `A` from the staging frame is returned.

1. If `A` is in the source frame and `A.primaryKeys` is not in the staging frame (that is, `A` is not updated in the staging table), the `A` from the source frame is returned.

The source frame and staging frame do not need to have the same schema.

**Example**  

```
val mergedFrame: DynamicFrame = srcFrame.mergeDynamicFrames(stageFrame, Seq("id1", "id2"))
```
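The merge rules above can be sketched with plain collections. This illustration assumes the stated semantics (staging records replace source records with matching primary keys; everything else passes through), and is not the actual Glue implementation:

```scala
// Merge by primary key: a staging record replaces every source record
// sharing its key; unmatched source records (duplicates included) are kept.
def mergeByKeys(source: Seq[Map[String, Any]], staging: Seq[Map[String, Any]],
                primaryKeys: Seq[String]): Seq[Map[String, Any]] = {
  def key(r: Map[String, Any]) = primaryKeys.map(r.get)
  val stagingKeys = staging.map(key).toSet
  source.filterNot(r => stagingKeys.contains(key(r))) ++ staging
}

val src   = Seq(Map("id" -> 1, "v" -> "old"), Map("id" -> 2, "v" -> "keep"))
val stage = Seq(Map("id" -> 1, "v" -> "new"))
val merged = mergeByKeys(src, stage, Seq("id"))
// merged: Seq(Map("id" -> 2, "v" -> "keep"), Map("id" -> 1, "v" -> "new"))
```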

## Def printSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema"></a>

```
def printSchema : Unit 
```

Prints the schema of this `DynamicFrame` to `stdout` in a human-readable format.

## Def recomputeSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema"></a>

```
def recomputeSchema : Schema 
```

Forces a schema recomputation. This requires a scan over the data, but it might "tighten" the schema if there are some fields in the current schema that are not present in the data.

Returns the recomputed schema.

## Def relationalize
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize"></a>

```
def relationalize( rootTableName : String,
                   stagingPath : String,
                   options : JsonOptions = JsonOptions.empty,
                   transformationContext : String = "",
                   callSite : CallSite = CallSite("Not provided"),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : Seq[DynamicFrame]
```
+ `rootTableName` — The name to use for the base `DynamicFrame` in the output. `DynamicFrame`s that are created by pivoting arrays start with this as a prefix.
+ `stagingPath` — The Amazon Simple Storage Service (Amazon S3) path for writing intermediate data.
+ `options` — Relationalize options and configuration. Currently unused.

Flattens all nested structures and pivots arrays into separate tables.

You can use this operation to prepare deeply nested data for ingestion into a relational database. Nested structs are flattened in the same manner as the [Unnest](#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest) transform. Additionally, arrays are pivoted into separate tables with each array element becoming a row. For example, suppose that you have a `DynamicFrame` with the following data.

```
 {"name": "Nancy", "age": 47, "friends": ["Fred", "Lakshmi"]}
 {"name": "Stephanie", "age": 28, "friends": ["Yao", "Phil", "Alvin"]}
 {"name": "Nathan", "age": 54, "friends": ["Nicolai", "Karen"]}
```

Run the following code.

```
df.relationalize("people", "s3://my_bucket/my_path", JsonOptions.empty)
```

This produces two tables. The first table is named "people" and contains the following.

```
{"name": "Nancy", "age": 47, "friends": 1}
{"name": "Stephanie", "age": 28, "friends": 2}
{"name": "Nathan", "age": 54, "friends": 3}
```

Here, the friends array has been replaced with an auto-generated join key. A separate table named `people.friends` is created with the following content.

```
{"id": 1, "index": 0, "val": "Fred"}
{"id": 1, "index": 1, "val": "Lakshmi"}
{"id": 2, "index": 0, "val": "Yao"}
{"id": 2, "index": 1, "val": "Phil"}
{"id": 2, "index": 2, "val": "Alvin"}
{"id": 3, "index": 0, "val": "Nicolai"}
{"id": 3, "index": 1, "val": "Karen"}
```

In this table, '`id`' is a join key that identifies which record the array element came from, '`index`' refers to the position in the original array, and '`val`' is the actual array entry.

The `relationalize` method returns the sequence of `DynamicFrame`s created by applying this process recursively to all arrays.

**Note**  
The AWS Glue library automatically generates join keys for new tables. To ensure that join keys are unique across job runs, you must enable job bookmarks.
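The pivot that `relationalize` applies to each array can be sketched in plain Scala. In this illustration the join key is a stand-in sequence number, whereas the library auto-generates its keys:

```scala
// Pivot an array column out of each record into a child table of
// (id, index, val) rows, leaving a join key behind in the parent record.
def pivotArray(records: Seq[Map[String, Any]], arrayCol: String)
    : (Seq[Map[String, Any]], Seq[Map[String, Any]]) = {
  val pairs = records.zipWithIndex.map { case (rec, i) =>
    val id = i + 1 // stand-in for the auto-generated join key
    val parent = rec.updated(arrayCol, id)
    val children = rec(arrayCol).asInstanceOf[Seq[String]].zipWithIndex.map {
      case (v, idx) => Map("id" -> id, "index" -> idx, "val" -> v)
    }
    (parent, children)
  }
  (pairs.map(_._1), pairs.flatMap(_._2))
}

val (people, friends) =
  pivotArray(Seq(Map("name" -> "Nancy", "friends" -> Seq("Fred", "Lakshmi"))), "friends")
```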

## Def renameField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField"></a>

```
def renameField( oldName : String,
                 newName : String,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```
+ `oldName` — The original name of the column.
+ `newName` — The new name of the column.

Returns a new `DynamicFrame` with the specified field renamed.

You can use this method to rename nested fields. For example, the following code would rename `state` to `state_code` inside the address struct.

```
df.renameField("address.state", "address.state_code")
```

## Def repartition
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition"></a>

```
def repartition( numPartitions : Int,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```

Returns a new `DynamicFrame` with `numPartitions` partitions.
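For example, the following sketch (assuming `df` is an existing `DynamicFrame`; this example is not part of the API reference) coalesces the data into a single partition so that a subsequent write produces one output file.

```scala
// Repartition into a single partition before writing
// so that the sink produces one output file.
val singlePartition = df.repartition(1)
```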

## Def resolveChoice
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice"></a>

```
def resolveChoice( specs : Seq[Product2[String, String]] = Seq.empty[ResolveSpec],
                   choiceOption : Option[ChoiceOption] = None,
                   database : Option[String] = None,
                   tableName : Option[String] = None,
                   transformationContext : String = "",
                   callSite : CallSite = CallSite("Not provided", ""),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : DynamicFrame
```
+ `choiceOption` — An action to apply to all `ChoiceType` columns not listed in the specs sequence.
+ `database` — The Data Catalog database to use with the `match_catalog` action.
+ `tableName` — The Data Catalog table to use with the `match_catalog` action.

Returns a new `DynamicFrame` by replacing one or more `ChoiceType`s with a more specific type.

There are two ways to use `resolveChoice`. The first is to specify a sequence of specific columns and how to resolve them. These are specified as tuples made up of (column, action) pairs.

The following are the possible actions:
+ `cast:type` — Attempts to cast all values to the specified type.
+ `make_cols` — Converts each distinct type to a column with the name `columnName_type`.
+ `make_struct` — Converts a column to a struct with keys for each distinct type.
+ `project:type` — Retains only values of the specified type.

The other mode for `resolveChoice` is to specify a single resolution for all `ChoiceType`s. You can use this in cases where the complete list of `ChoiceType`s is unknown before execution. In addition to the actions listed preceding, this mode also supports the following action:
+ `match_catalog` — Attempts to cast each `ChoiceType` to the corresponding type in the specified catalog table.

**Examples:**

Resolve the `user.id` column by casting to an int, and make the `address` field retain only structs.

```
df.resolveChoice(specs = Seq(("user.id", "cast:int"), ("address", "project:struct")))
```

Resolve all `ChoiceType`s by converting each choice to a separate column.

```
df.resolveChoice(choiceOption = Some(ChoiceOption("make_cols")))
```

Resolve all `ChoiceType`s by casting to the types in the specified catalog table.

```
df.resolveChoice(choiceOption = Some(ChoiceOption("match_catalog")),
                 database = Some("my_database"),
                 tableName = Some("my_table"))
```

## Def schema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-schema"></a>

```
def schema : Schema 
```

Returns the schema of this `DynamicFrame`.

The returned schema is guaranteed to contain every field that is present in a record in this `DynamicFrame`. But in a small number of cases, it might also contain additional fields. You can use the [Unnest](#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest) method to "tighten" the schema based on the records in this `DynamicFrame`.

## Def selectField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField"></a>

```
def selectField( fieldName : String,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```

Returns a single field as a `DynamicFrame`.

## Def selectFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields"></a>

```
def selectFields( paths : Seq[String],
                  transformationContext : String = "",
                  callSite : CallSite = CallSite("Not provided", ""),
                  stageThreshold : Long = 0,
                  totalThreshold : Long = 0
                ) : DynamicFrame
```
+ `paths` — The sequence of column names to select.

Returns a new `DynamicFrame` containing the specified columns.

**Note**  
You can only use the `selectFields` method to select top-level columns. You can use the [applyMapping](#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping) method to select nested columns.
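For example, the following sketch (assuming `df` is a `DynamicFrame` with top-level `name` and `age` columns; the column names are hypothetical) keeps only those two columns.

```scala
// Project to the top-level name and age columns only.
val projected = df.selectFields(Seq("name", "age"))
```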

## Def show
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-show"></a>

```
def show( numRows : Int = 20 ) : Unit 
```
+ `numRows` — The number of rows to print.

Prints rows from this `DynamicFrame` in JSON format.

## Def simplifyDDBJson
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson"></a>

DynamoDB exports created with the AWS Glue DynamoDB export connector result in JSON files with a specific nested structure. For more information, see [Data objects](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html). `simplifyDDBJson` simplifies nested columns in a `DynamicFrame` of this type of data and returns a new simplified `DynamicFrame`. If a List type contains multiple types or a Map type, the elements in the List are not simplified. This method only supports data in the DynamoDB export JSON format. Consider `unnest` to perform similar changes on other kinds of data.

```
def simplifyDDBJson() : DynamicFrame 
```

This method does not take any parameters.

**Example input**

Consider the following schema generated by a DynamoDB export:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

**Example code**

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> "ddbTableARN",
        "dynamodb.s3.bucket" -> "exportBucketLocation",
        "dynamodb.s3.prefix" -> "exportBucketPrefix",
        "dynamodb.s3.bucketOwner" -> "exportBucketAccountID",
      ))
    ).getDynamicFrame()
    
    val simplified = dynamicFrame.simplifyDDBJson()
    simplified.printSchema()

    Job.commit()
  }

}
```

### Example output
<a name="simplifyDDBJson-example-output"></a>

The `simplifyDDBJson` transform will simplify this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```

## Def spigot
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot"></a>

```
def spigot( path : String,
            options : JsonOptions = new JsonOptions("{}"),
            transformationContext : String = "",
            callSite : CallSite = CallSite("Not provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Passthrough transformation that returns the same records but writes out a subset of records as a side effect.
+ `path` — The path in Amazon S3 to write output to, in the form `s3://bucket/path`.
+ `options`  — An optional `JsonOptions` map describing the sampling behavior.

Returns a `DynamicFrame` that contains the same records as this one.

By default, writes 100 arbitrary records to the location specified by `path`. You can customize this behavior by using the `options` map. Valid keys include the following:
+ `topk` — Specifies the total number of records written out. The default is 100.
+ `prob` — Specifies the probability (as a decimal) that an individual record is included. Default is 1.

For example, the following call would sample the dataset by selecting each record with a 20 percent probability and stopping after 200 records have been written.

```
df.spigot("s3://my_bucket/my_path", JsonOptions(Map("topk" -> 200, "prob" -> 0.2)))
```

## Def splitFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields"></a>

```
def splitFields( paths : Seq[String],
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : Seq[DynamicFrame]
```
+ `paths` — The paths to include in the first `DynamicFrame`.

Returns a sequence of two `DynamicFrame`s. The first `DynamicFrame` contains the specified paths, and the second contains all other columns.

**Example**

This example takes a DynamicFrame created from the `persons` table in the `legislators` database in the AWS Glue Data Catalog and splits the DynamicFrame into two, with the specified fields going into the first DynamicFrame and the remaining fields going into a second DynamicFrame. The example then chooses the first DynamicFrame from the result.

```
val InputFrame = glueContext.getCatalogSource(database="legislators", tableName="persons", 
transformationContext="InputFrame").getDynamicFrame()

val SplitField_collection = InputFrame.splitFields(paths=Seq("family_name", "name", "links.note", 
"links.url", "gender", "image", "identifiers.scheme", "identifiers.identifier", "other_names.lang", 
"other_names.note", "other_names.name"), transformationContext="SplitField_collection")

val ResultFrame = SplitField_collection(0)
```

## Def splitRows
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows"></a>

```
def splitRows( paths : Seq[String],
               values : Seq[Any],
               operators : Seq[String],
               transformationContext : String,
               callSite : CallSite,
               stageThreshold : Long,
               totalThreshold : Long
             ) : Seq[DynamicFrame]
```

Splits rows based on predicates that compare columns to constants.
+ `paths` — The columns to use for comparison.
+ `values` — The constant values to use for comparison.
+ `operators` — The operators to use for comparison.

Returns a sequence of two `DynamicFrame`s. The first contains rows for which the predicate is true and the second contains those for which it is false.

Predicates are specified using three sequences: '`paths`' contains the (possibly nested) column names, '`values`' contains the constant values to compare to, and '`operators`' contains the operators to use for comparison. All three sequences must be the same length: The `n`th operator is used to compare the `n`th column with the `n`th value.

Each operator must be one of "`!=`", "`=`", "`<=`", "`<`", "`>=`", or "`>`".

As an example, the following call would split a `DynamicFrame` so that the first output frame would contain records of people over 65 from the United States, and the second would contain all other records.

```
df.splitRows(Seq("age", "address.country"), Seq(65, "USA"), Seq(">=", "="))
```

## Def stageErrorsCount
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount"></a>

```
def stageErrorsCount
```

Returns the number of error records created while computing this `DynamicFrame`. This excludes errors from previous operations that were passed into this `DynamicFrame` as input.
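As an illustrative sketch (assuming `parsed` is a `DynamicFrame` computed with a nonzero `stageThreshold`; the limit of 100 is hypothetical), you might fail the job explicitly when too many records errored in this stage.

```scala
// Fail fast if this stage produced more error records than we tolerate.
val errorCount = parsed.stageErrorsCount
if (errorCount > 100) {
  sys.error(s"Too many error records in this stage: $errorCount")
}
```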

## Def toDF
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF"></a>

```
def toDF( specs : Seq[ResolveSpec] = Seq.empty[ResolveSpec] ) : DataFrame 
```

Converts this `DynamicFrame` to an Apache Spark SQL `DataFrame` with the same schema and records.

**Note**  
Because `DataFrame`s don't support `ChoiceType`s, this method automatically converts `ChoiceType` columns into `StructType`s. For more information and options for resolving choice, see [resolveChoice](#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice).
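For example, the following sketch (assuming `df` and `glueContext` are in scope, and that the data has a numeric `age` column) drops down to Spark SQL for a filter and then wraps the result back into a `DynamicFrame` with the companion object's `apply` method.

```scala
// Convert to a Spark DataFrame, filter with Spark SQL, and convert back.
val sparkDf = df.toDF()
val adults = sparkDf.filter(sparkDf("age") >= 18)
val backToDynamic = DynamicFrame(adults, glueContext)
```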

## Def unbox
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox"></a>

```
def unbox( path : String,
           format : String,
           optionString : String = "{}",
           transformationContext : String = "",
           callSite : CallSite = CallSite("Not provided"),
           stageThreshold : Long = 0,
           totalThreshold : Long = 0
         ) : DynamicFrame
```
+ `path` — The column to parse. Must be a string or binary.
+ `format` — The format to use for parsing.
+ `optionString` — Options to pass to the format, such as the CSV separator.

Parses an embedded string or binary column according to the specified format. Parsed columns are nested under a struct with the original column name.

For example, suppose that you have a CSV file with an embedded JSON column.

```
name, age, address
Sally, 36, {"state": "NE", "city": "Omaha"}
...
```

After an initial parse, you would get a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: string
```

You can call `unbox` on the address column to parse the specific components.

```
df.unbox("address", "json")
```

This gives us a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- city: string
```

## Def unnest
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest"></a>

```
def unnest( transformationContext : String = "",
            callSite : CallSite = CallSite("Not Provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Returns a new `DynamicFrame` with all nested structures flattened. Names are constructed using the '`.`' (period) character.

For example, suppose that you have a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- city: string
```

The following call unnests the address struct.

```
df.unnest()
```

The resulting schema is as follows.

```
root
|-- name: string
|-- age: int
|-- address.state: string
|-- address.city: string
```

This method also unnests nested structs inside of arrays. But for historical reasons, the names of such fields are prepended with the name of the enclosing array and "`.val`".

## Def unnestDDBJson
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson"></a>

```
def unnestDDBJson( transformationContext : String = "",
                   callSite : CallSite = CallSite("Not Provided"),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : DynamicFrame
```

Unnests nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new unnested `DynamicFrame`. Columns that are of an array of struct types will not be unnested. Note that this is a specific type of unnesting transform that behaves differently from the regular `unnest` transform and requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

For example, the schema of reading an export with the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- ColA: struct
|    |    |-- S: string
|    |-- ColB: struct
|    |    |-- S: string
|    |-- ColC: struct
|    |    |-- N: string
|    |-- ColD: struct
|    |    |-- L: array
|    |    |    |-- element: null
```

The `unnestDDBJson()` transform would convert this to:

```
root
|-- ColA: string
|-- ColB: string
|-- ColC: string
|-- ColD: array    
|    |-- element: null
```

The following code example shows how to use the AWS Glue DynamoDB export connector, invoke a DynamoDB JSON unnest, and print the number of partitions:

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> "<test_source>",
        "dynamodb.s3.bucket" -> "<bucket name>",
        "dynamodb.s3.prefix" -> "<bucket prefix>",
        "dynamodb.s3.bucketOwner" -> "<account_id of bucket>",
      ))
    ).getDynamicFrame()
    
    val unnested = dynamicFrame.unnestDDBJson()
    print(unnested.getNumPartitions())

    Job.commit()
  }

}
```

## Def withFrameSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema"></a>

```
def withFrameSchema( getSchema : () => Schema ) : DynamicFrame 
```
+ `getSchema` — A function that returns the schema to use. Specified as a zero-parameter function to defer potentially expensive computation.

Sets the schema of this `DynamicFrame` to the specified value. This is primarily used internally to avoid costly schema recomputation. The passed-in schema must contain all columns present in the data.

## Def withName
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withName"></a>

```
def withName( name : String ) : DynamicFrame 
```
+ `name` — The new name to use.

Returns a copy of this `DynamicFrame` with a new name.

## Def withTransformationContext
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext"></a>

```
def withTransformationContext( ctx : String ) : DynamicFrame 
```

Returns a copy of this `DynamicFrame` with the specified transformation context.

# The DynamicFrame object
<a name="glue-etl-scala-apis-glue-dynamicframe-object"></a>

**Package: com.amazonaws.services.glue**

```
object DynamicFrame
```

## Def apply
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-apply"></a>

```
def apply( df : DataFrame,
           glueContext : GlueContext
         ) : DynamicFrame
```



## Def emptyDynamicFrame
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame"></a>

```
def emptyDynamicFrame( glueContext : GlueContext ) : DynamicFrame 
```



## Def fromPythonRDD
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD"></a>

```
def fromPythonRDD( rdd : JavaRDD[Array[Byte]],
                   glueContext : GlueContext
                 ) : DynamicFrame
```



## Def ignoreErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors"></a>

```
def ignoreErrors( fn : DynamicRecord => DynamicRecord ) : DynamicRecord 
```



## Def inlineErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors"></a>

```
def inlineErrors( msg : String,
                  callSite : CallSite
                ) : (DynamicRecord => DynamicRecord)
```



## Def newFrameWithErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors"></a>

```
def newFrameWithErrors( prevFrame : DynamicFrame,
                        rdd : RDD[DynamicRecord],
                        name : String = "",
                        transformationContext : String = "",
                        callSite : CallSite,
                        stageThreshold : Long,
                        totalThreshold : Long
                      ) : DynamicFrame
```



# AWS Glue Scala DynamicRecord class
<a name="glue-etl-scala-apis-glue-dynamicrecord-class"></a>

**Topics**
+ [Def addField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField)
+ [Def dropField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField)
+ [Def setError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError)
+ [Def isError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError)
+ [Def getError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError)
+ [Def clearError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError)
+ [Def write](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-write)
+ [Def readFields](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields)
+ [Def clone](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone)
+ [Def schema](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema)
+ [Def getRoot](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot)
+ [Def toJson](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson)
+ [Def getFieldNode](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode)
+ [Def getField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField)
+ [Def hashCode](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode)
+ [Def equals](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals)
+ [DynamicRecord object](#glue-etl-scala-apis-glue-dynamicrecord-object)
+ [RecordTraverser trait](#glue-etl-scala-apis-glue-recordtraverser-trait)

**Package: com.amazonaws.services.glue**

```
class DynamicRecord extends Serializable with Writable with Cloneable
```

A `DynamicRecord` is a self-describing data structure that represents a row of data in the dataset that is being processed. It is self-describing in the sense that you can get the schema of the row that is represented by the `DynamicRecord` by inspecting the record itself. A `DynamicRecord` is similar to a `Row` in Apache Spark.

## Def addField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField"></a>

```
def addField( path : String,
              dynamicNode : DynamicNode
            ) : Unit
```

Adds a [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) to the specified path.
+ `path` — The path for the field to be added.
+ `dynamicNode` — The [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) to be added at the specified path.
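As an illustrative sketch (assuming `rec` is a `DynamicRecord` inside a record-level transform; the field name, the value, and the use of `StringNode` from `com.amazonaws.services.glue.types` are assumptions for this example), you can attach a new string field:

```scala
import com.amazonaws.services.glue.types.StringNode

// Add a literal string field named "source" to the record.
rec.addField("source", new StringNode("dynamodb-export"))
```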

## Def dropField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField"></a>

```
 def dropField(path: String, underRename: Boolean = false): Option[DynamicNode]
```

Drops a [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) from the specified path and returns the dropped node if there is not an array in the specified path.
+ `path` — The path to the field to drop.
+ `underRename` — True if `dropField` is called as part of a rename transform, or false otherwise (false by default).

Returns a `scala.Option` containing the dropped [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md), if any.

## Def setError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError"></a>

```
def setError( error : Error )
```

Sets this record as an error record, as specified by the `error` parameter.

Returns a `DynamicRecord`.

## Def isError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError"></a>

```
def isError
```

Checks whether this record is an error record.

## Def getError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError"></a>

```
def getError
```

Gets the `Error` if the record is an error record. Returns `scala.Some(Error)` if this record is an error record; otherwise returns `scala.None`.

## Def clearError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError"></a>

```
def clearError
```

Sets the `Error` to `scala.None`.

## Def write
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-write"></a>

```
override def write( out : DataOutput ) : Unit 
```



## Def readFields
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields"></a>

```
override def readFields( in : DataInput ) : Unit 
```



## Def clone
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone"></a>

```
override def clone : DynamicRecord 
```

Clones this record to a new `DynamicRecord` and returns it.

## Def schema
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema"></a>

```
def schema
```

Gets the `Schema` by inspecting the record.

## Def getRoot
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot"></a>

```
def getRoot : ObjectNode 
```

Gets the root `ObjectNode` for the record.

## Def toJson
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson"></a>

```
def toJson : String 
```

Gets the JSON string for the record.

## Def getFieldNode
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode"></a>

```
def getFieldNode( path : String ) : Option[DynamicNode] 
```

Gets the field's value at the specified `path` as an option of `DynamicNode`.

Returns `scala.Some(`[DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)`)` if the field exists; otherwise returns `scala.None`.

## Def getField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField"></a>

```
def getField( path : String ) : Option[Any] 
```

Gets the field's value at the specified `path`.

Returns `scala.Some(value)` if the field exists; otherwise returns `scala.None`.
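For example, a sketch (assuming `rec` is a `DynamicRecord` and that the records may carry an `age` field; both are assumptions for this example) that reads an optional field:

```scala
// getField returns Some(value) when the field exists, None otherwise.
rec.getField("age") match {
  case Some(age) => println(s"age = $age")
  case None      => println("record has no age field")
}
```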

## Def hashCode
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode"></a>

```
override def hashCode : Int 
```



## Def equals
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals"></a>

```
override def equals( other : Any )
```



## DynamicRecord object
<a name="glue-etl-scala-apis-glue-dynamicrecord-object"></a>

```
object DynamicRecord
```

### Def apply
<a name="glue-etl-scala-apis-glue-dynamicrecord-object-defs-apply"></a>

```
def apply( row : Row,
           schema : SparkStructType )
```

Apply method to convert an Apache Spark SQL `Row` to a [DynamicRecord](#glue-etl-scala-apis-glue-dynamicrecord-class).
+ `row` — A Spark SQL `Row`.
+ `schema` — The `Schema` of that row.

Returns a `DynamicRecord`.

## RecordTraverser trait
<a name="glue-etl-scala-apis-glue-recordtraverser-trait"></a>

```
trait RecordTraverser {
  def nullValue(): Unit
  def byteValue(value: Byte): Unit
  def binaryValue(value: Array[Byte]): Unit
  def booleanValue(value: Boolean): Unit
  def shortValue(value: Short) : Unit
  def intValue(value: Int) : Unit
  def longValue(value: Long) : Unit
  def floatValue(value: Float): Unit
  def doubleValue(value: Double): Unit
  def decimalValue(value: BigDecimal): Unit
  def stringValue(value: String): Unit
  def dateValue(value: Date): Unit
  def timestampValue(value: Timestamp): Unit
  def objectStart(length: Int): Unit
  def objectKey(key: String): Unit
  def objectEnd(): Unit
  def mapStart(length: Int): Unit
  def mapKey(key: String): Unit
  def mapEnd(): Unit
  def arrayStart(length: Int): Unit
  def arrayEnd(): Unit
}
```

# AWS Glue Scala GlueContext APIs
<a name="glue-etl-scala-apis-glue-gluecontext"></a>

**Package: com.amazonaws.services.glue**

```
class GlueContext( @transient val sc : SparkContext,
                   val defaultSourcePartitioner : PartitioningStrategy )
  extends SQLContext(sc)
```

`GlueContext` is the entry point for reading and writing a [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md) from and to Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, JDBC, and so on. This class provides utility functions to create [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) and [DataSink](glue-etl-scala-apis-glue-datasink-class.md) objects that can in turn be used to read and write `DynamicFrame`s.

You can also use `GlueContext` to set a target number of partitions (default 20) in the `DynamicFrame` if the number of partitions created from the source is less than a minimum threshold for partitions (default 10).

## def addIngestionTimeColumns
<a name="glue-etl-scala-apis-glue-gluecontext-defs-addIngestionTimeColumns"></a>

```
def addIngestionTimeColumns(
         df : DataFrame, 
         timeGranularity : String = "") : DataFrame
```

Appends ingestion time columns such as `ingest_year`, `ingest_month`, `ingest_day`, `ingest_hour`, and `ingest_minute` to the input `DataFrame`. AWS Glue generates this call automatically in the script it creates when you specify a Data Catalog table with Amazon S3 as the target. The function updates the partitioning of the output table with ingestion time columns, so the output data is automatically partitioned on ingestion time without requiring explicit ingestion time columns in the input data.
+ `df` – The `DataFrame` to append the ingestion time columns to.
+ `timeGranularity` – The granularity of the time columns. Valid values are "`day`", "`hour`", and "`minute`". For example, if "`hour`" is passed in to the function, the original `DataFrame` will have "`ingest_year`", "`ingest_month`", "`ingest_day`", and "`ingest_hour`" time columns appended.

Returns the data frame after appending the time granularity columns.

Example:

```
glueContext.addIngestionTimeColumns(dataFrame, "hour")
```

## def createDataFrameFromOptions
<a name="glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions"></a>

```
def createDataFrameFromOptions( connectionType : String,
                         connectionOptions : JsonOptions,
                         transformationContext : String = "",
                         format : String = null,
                         formatOptions : JsonOptions = JsonOptions.empty
                       ) : DataFrame
```

Returns a `DataFrame` created with the specified connection and format. Use this function only with AWS Glue streaming sources.
+ `connectionType` – The streaming connection type. Valid values include `kinesis` and `kafka`.
+ `connectionOptions` – Connection options, which are different for Kinesis and Kafka. You can find the list of all connection options for each streaming data source at [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md). Note the following differences in streaming connection options:
  + Kinesis streaming sources require `streamARN`, `startingPosition`, `inferSchema`, and `classification`.
  + Kafka streaming sources require `connectionName`, `topicName`, `startingOffsets`, `inferSchema`, and `classification`.
+ `transformationContext` – The transformation context to use (optional).
+ `format` – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. For information about the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `formatOptions` – Format options for the specified format. For information about the supported format options, see [Data format options](aws-glue-programming-etl-format.md).

Example for Amazon Kinesis streaming source:

```
val data_frame_datasource0 = 
glueContext.createDataFrameFromOptions(transformationContext = "datasource0", connectionType = "kinesis", 
connectionOptions = JsonOptions("""{"streamName": "example_stream", "startingPosition": "TRIM_HORIZON", "inferSchema": "true", "classification": "json"}"""))
```

Example for Kafka streaming source:

```
val data_frame_datasource0 = 
glueContext.createDataFrameFromOptions(transformationContext = "datasource0", connectionType = "kafka", 
connectionOptions = JsonOptions("""{"connectionName": "example_connection", "topicName": "example_topic", "startingOffsets": "earliest", "inferSchema": "false", "classification": "json", "schema":"`column1` STRING, `column2` STRING"}"""))
```

## forEachBatch
<a name="glue-etl-scala-apis-glue-gluecontext-defs-forEachBatch"></a>

**`forEachBatch(frame, batch_function, options)`**

Applies the `batch_function` that is passed in to every micro batch that is read from the streaming source.
+ `frame` – The `DataFrame` containing the current micro batch.
+ `batch_function` – A function that is applied to every micro batch.
+ `options` – A collection of key-value pairs that holds information about how to process micro batches. The following options are required:
  + `windowSize` – The amount of time to spend processing each batch.
  + `checkpointLocation` – The location where checkpoints are stored for the streaming ETL job.
  + `batchMaxRetries` – The maximum number of times to retry a batch if it fails. The default value is 3. This option is only configurable for AWS Glue version 2.0 and later.

**Example:**

```
glueContext.forEachBatch(data_frame_datasource0, (dataFrame: Dataset[Row], batchId: Long) => 
   {
      if (dataFrame.count() > 0) 
        {
          val datasource0 = DynamicFrame(glueContext.addIngestionTimeColumns(dataFrame, "hour"), glueContext)
          // @type: DataSink
          // @args: [database = "tempdb", table_name = "fromoptionsoutput", stream_batch_time = "100 seconds", 
          //      stream_checkpoint_location = "s3://from-options-testing-eu-central-1/fromOptionsOutput/checkpoint/", 
          //      transformation_ctx = "datasink1"]
          // @return: datasink1
          // @inputs: [frame = datasource0]
          val options_datasink1 = JsonOptions(
             Map("partitionKeys" -> Seq("ingest_year", "ingest_month","ingest_day", "ingest_hour"), 
             "enableUpdateCatalog" -> true))
          val datasink1 = glueContext.getCatalogSink(
             database = "tempdb", 
             tableName = "fromoptionsoutput", 
             redshiftTmpDir = "", 
             transformationContext = "datasink1", 
             additionalOptions = options_datasink1).writeDynamicFrame(datasource0)
        }
   }, JsonOptions("""{"windowSize" : "100 seconds", 
         "checkpointLocation" : "s3://from-options-testing-eu-central-1/fromOptionsOutput/checkpoint/"}"""))
```

## def getCatalogSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink"></a>

```
def getCatalogSink( database : String,
        tableName : String,
        redshiftTmpDir : String = "",
        transformationContext : String = "",
        additionalOptions: JsonOptions = JsonOptions.empty,
        catalogId: String = null
) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes to a location specified in a table that is defined in the Data Catalog.
+ `database` — The database name in the Data Catalog.
+ `tableName` — The table name in the Data Catalog.
+ `redshiftTmpDir` — The temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `additionalOptions` – Additional options provided to AWS Glue. 
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Returns the `DataSink`.
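As a minimal sketch of how the returned sink is typically used, you can write a `DynamicFrame` to the location that the catalog table defines. The database, table, and frame names (`tempdb`, `myoutputtable`, `dyf`) below are placeholders:

```scala
// Write a DynamicFrame to the Amazon S3 location defined by a Data Catalog table.
val sink = glueContext.getCatalogSink(
  database = "tempdb",
  tableName = "myoutputtable",
  transformationContext = "datasink0")
sink.writeDynamicFrame(dyf)
```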

## def getCatalogSource
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource"></a>

```
def getCatalogSource( database : String,
                      tableName : String,
                      redshiftTmpDir : String = "",
                      transformationContext : String = "",
                      pushDownPredicate : String = " ",
                      additionalOptions: JsonOptions = JsonOptions.empty,
                      catalogId: String = null
                    ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a table definition in the Data Catalog.
+ `database` — The database name in the Data Catalog.
+ `tableName` — The table name in the Data Catalog.
+ `redshiftTmpDir` — The temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `pushDownPredicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additionalOptions` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`. Another supported option is `catalogPartitionPredicate`:

  `catalogPartitionPredicate` — You can pass a catalog expression to filter based on the index columns. This pushes down the filtering to the server side. For more information, see [AWS Glue Partition Indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses Spark SQL standard syntax and the latter uses the JSQL parser.
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Returns the `DataSource`.

**Example for streaming source**

```
val data_frame_datasource0 = glueContext.getCatalogSource(
    database = "tempdb",
    tableName = "test-stream-input", 
    redshiftTmpDir = "", 
    transformationContext = "datasource0", 
    additionalOptions = JsonOptions("""{
        "startingPosition": "TRIM_HORIZON", "inferSchema": "false"}""")
    ).getDataFrame()
```

## def getJDBCSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getJDBCSink"></a>

```
def getJDBCSink( catalogConnection : String,
                 options : JsonOptions,
                 redshiftTmpDir : String = "",
                 transformationContext : String = "",
                 catalogId: String = null
               ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes to a JDBC database that is specified in a `Connection` object in the Data Catalog. The `Connection` object has information to connect to a JDBC sink, including the URL, user name, password, VPC, subnet, and security groups.
+ `catalogConnection` — The name of the connection in the Data Catalog that contains the JDBC URL to write to.
+ `options` — A string of JSON name-value pairs that provide additional information that is required to write to a JDBC data store. This includes: 
  + *dbtable* (required) — The name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used. The following example shows an options parameter that points to a schema named `test` and a table named `test_table` in database `test_db`.

    ```
    options = JsonOptions("""{"dbtable": "test.test_table", "database": "test_db"}""")
    ```
  + *database* (required) — The name of the JDBC database.
  + Any additional options passed directly to the SparkSQL JDBC writer. For more information, see [Redshift data source for Spark](https://github.com/databricks/spark-redshift).
+ `redshiftTmpDir` — A temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Example code:

```
getJDBCSink(
  catalogConnection = "my-connection-name",
  options = JsonOptions("""{"dbtable": "my-jdbc-table", "database": "my-jdbc-db"}"""),
  redshiftTmpDir = "",
  transformationContext = "datasink4")
```

Returns the `DataSink`.

## def getSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSink"></a>

```
def getSink( connectionType : String,
             connectionOptions : JsonOptions,
             transformationContext : String = ""
           ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes data to a destination like Amazon Simple Storage Service (Amazon S3), JDBC, or the AWS Glue Data Catalog, or an Apache Kafka or Amazon Kinesis data stream. 
+ `connectionType` — The type of the connection. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `connectionOptions` — A string of JSON name-value pairs that provide additional information to establish the connection with the data sink. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.

Returns the `DataSink`.
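The following is a minimal sketch of using the returned sink, assuming a `DynamicFrame` named `dyf` and a placeholder bucket path:

```scala
// Create an Amazon S3 sink from connection options and write a DynamicFrame to it.
val sink = glueContext.getSink(
  connectionType = "s3",
  connectionOptions = JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/output/"}"""),
  transformationContext = "datasink0")
sink.writeDynamicFrame(dyf)
```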

## def getSinkWithFormat
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat"></a>

```
def getSinkWithFormat( connectionType : String,
                       options : JsonOptions,
                       transformationContext : String = "",
                       format : String = null,
                       formatOptions : JsonOptions = JsonOptions.empty
                     ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes data to a destination like Amazon S3, JDBC, or the Data Catalog, or an Apache Kafka or Amazon Kinesis data stream. Also sets the format for the data to be written out to the destination.
+ `connectionType` — The type of the connection. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `options` — A string of JSON name-value pairs that provide additional information to establish a connection with the data sink. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `format` — The format of the data to be written out to the destination.
+ `formatOptions` — A string of JSON name-value pairs that provide additional options for formatting data at the destination. See [Data format options](aws-glue-programming-etl-format.md).

Returns the `DataSink`.
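For example, the following sketch writes a `DynamicFrame` to Amazon S3 in Parquet format. The bucket path and the frame name `dyf` are placeholders:

```scala
// Create an S3 sink that writes Parquet, then write the frame.
val sink = glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/output/"}"""),
  transformationContext = "datasink0",
  format = "parquet")
sink.writeDynamicFrame(dyf)
```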

## def getSource
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSource"></a>

```
def getSource( connectionType : String,
               connectionOptions : JsonOptions,
               transformationContext : String = "",
               pushDownPredicate : String = " "
             ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog. Also supports Kafka and Kinesis streaming data sources.
+ `connectionType` — The type of the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `connectionOptions` — A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

  A Kinesis streaming source requires the following connection options: `streamARN`, `startingPosition`, `inferSchema`, and `classification`.

  A Kafka streaming source requires the following connection options: `connectionName`, `topicName`, `startingOffsets`, `inferSchema`, and `classification`.
+ `transformationContext` — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `pushDownPredicate` — Predicate on partition columns.

Returns the `DataSource`.

Example for Amazon Kinesis streaming source:

```
val kinesisOptions = jsonOptions()
val data_frame_datasource0 = glueContext.getSource("kinesis", kinesisOptions).getDataFrame()

private def jsonOptions(): JsonOptions = {
    new JsonOptions(
      s"""{"streamARN": "arn:aws:kinesis:eu-central-1:123456789012:stream/fromOptionsStream",
         |"startingPosition": "TRIM_HORIZON",
         |"inferSchema": "true",
         |"classification": "json"}""".stripMargin)
}
```

Example for Kafka streaming source:

```
val kafkaOptions = jsonOptions()
val data_frame_datasource0 = glueContext.getSource("kafka", kafkaOptions).getDataFrame()

private def jsonOptions(): JsonOptions = {
    new JsonOptions(
      s"""{"connectionName": "ConfluentKafka",
         |"topicName": "kafka-auth-topic",
         |"startingOffsets": "earliest",
         |"inferSchema": "true",
         |"classification": "json"}""".stripMargin)
 }
```

## def getSourceWithFormat
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat"></a>

```
def getSourceWithFormat( connectionType : String,
                         options : JsonOptions,
                         transformationContext : String = "",
                         format : String = null,
                         formatOptions : JsonOptions = JsonOptions.empty
                       ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog, and also sets the format of data stored in the source.
+ `connectionType` – The type of the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `options` – A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` – The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `format` – The format of the data that is stored at the source. You can specify `format` when the `connectionType` is "s3". Can be one of "avro", "csv", "grokLog", "ion", "json", "xml", "parquet", or "orc".
+ `formatOptions` – A string of JSON name-value pairs that provide additional options for parsing data at the source. See [Data format options](aws-glue-programming-etl-format.md).

Returns the `DataSource`.

**Examples**

Create a DynamicFrame from a data source that is a comma-separated values (CSV) file on Amazon S3:

```
val datasource0 = glueContext.getSourceWithFormat(
    connectionType = "s3",
    options = JsonOptions(s"""{"paths": ["s3://csv/nycflights.csv"]}"""),
    transformationContext = "datasource0",
    format = "csv",
    formatOptions = JsonOptions(s"""{"withHeader": "true", "separator": ","}""")
    ).getDynamicFrame()
```

Create a DynamicFrame from a PostgreSQL data source using a JDBC connection:

```
val datasource0 = glueContext.getSourceWithFormat(
    connectionType = "postgresql",
    options = JsonOptions(s"""{
      "url":"jdbc:postgresql://databasePostgres-1.rds.amazonaws.com:5432/testdb",
      "dbtable": "public.company",
      "redshiftTmpDir":"", 
      "user":"username", 
      "password":"password123"
    }"""),
    transformationContext = "datasource0").getDynamicFrame()
```

Create a DynamicFrame from a MySQL data source using a JDBC connection:

```
val datasource0 = glueContext.getSourceWithFormat(
    connectionType = "mysql",
    options = JsonOptions(s"""{
      "url":"jdbc:mysql://databaseMysql-1.rds.amazonaws.com:3306/testdb",
      "dbtable": "athenatest_nycflights13_csv",
      "redshiftTmpDir":"", 
      "user":"username", 
      "password":"password123"
    }"""),
    transformationContext = "datasource0").getDynamicFrame()
```

## def getSparkSession
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSparkSession"></a>

```
def getSparkSession : SparkSession 
```

Gets the `SparkSession` object associated with this `GlueContext`. Use this `SparkSession` object to register tables and UDFs for use with `DataFrame`s created from DynamicFrames.

Returns the SparkSession.
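For example, you might register a `DynamicFrame` as a temporary view and query it with Spark SQL. The frame name `dyf`, the view name, and the column are hypothetical:

```scala
val spark = glueContext.getSparkSession

// Convert the DynamicFrame to a DataFrame and register it as a temporary view.
dyf.toDF().createOrReplaceTempView("flights")

// Query the view with Spark SQL.
val longFlights = spark.sql("SELECT * FROM flights WHERE distance > 1000")
```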

## def startTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-start-transaction"></a>

```
def startTransaction(readOnly: Boolean):String
```

Start a new transaction. Internally calls the Lake Formation [startTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-StartTransaction) API.
+ `readOnly` – (Boolean) Indicates whether this transaction should be read only or read and write. Writes made using a read-only transaction ID will be rejected. Read-only transactions do not need to be committed.

Returns the transaction ID.

## def commitTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-commit-transaction"></a>

```
def commitTransaction(transactionId: String, waitForCommit: Boolean): Boolean
```

Attempts to commit the specified transaction. `commitTransaction` may return before the transaction has finished committing. Internally calls the Lake Formation [commitTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CommitTransaction) API.
+ `transactionId` – (String) The transaction to commit.
+ `waitForCommit` – (Boolean) Determines whether the `commitTransaction` returns immediately. The default value is true. If false, `commitTransaction` polls and waits until the transaction is committed. The amount of wait time is restricted to 1 minute using exponential backoff with a maximum of 6 retry attempts.

Returns a Boolean indicating whether the commit is complete.

## def cancelTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-cancel-transaction"></a>

```
def cancelTransaction(transactionId: String): Unit
```

Attempts to cancel the specified transaction. Internally calls the Lake Formation [CancelTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CancelTransaction) API.
+ `transactionId` – (String) The transaction to cancel.

Throws a `TransactionCommittedException` exception if the transaction was previously committed.
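Taken together, the three transaction methods support a commit-or-cancel pattern like the following sketch. The write step is elided because it depends on your sink:

```scala
// Start a read-write transaction against Lake Formation.
val txId = glueContext.startTransaction(readOnly = false)
try {
  // ... write to a Lake Formation governed table using txId ...
  glueContext.commitTransaction(txId, waitForCommit = true)
} catch {
  case e: Exception =>
    // Cancel the transaction so partial writes are not committed.
    glueContext.cancelTransaction(txId)
    throw e
}
```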

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-1"></a>

```
def this( sc : SparkContext,
          minPartitions : Int,
          targetPartitions : Int )
```

Creates a `GlueContext` object using the specified `SparkContext`, minimum partitions, and target partitions.
+ `sc` — The `SparkContext`.
+ `minPartitions` — The minimum number of partitions.
+ `targetPartitions` — The target number of partitions.

Returns the `GlueContext`.

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-2"></a>

```
def this( sc : SparkContext )
```

Creates a `GlueContext` object with the provided `SparkContext`. Sets the minimum partitions to 10 and target partitions to 20.
+ `sc` — The `SparkContext`.

Returns the `GlueContext`.

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-3"></a>

```
def this( sparkContext : JavaSparkContext )
```

Creates a `GlueContext` object with the provided `JavaSparkContext`. Sets the minimum partitions to 10 and target partitions to 20.
+ `sparkContext` — The `JavaSparkContext`.

Returns the `GlueContext`.
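In a job script, you typically construct the context from the active `SparkContext`; for example:

```scala
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()
// Uses the default minimum partitions (10) and target partitions (20).
val glueContext = new GlueContext(sc)
```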

# MappingSpec
<a name="glue-etl-scala-apis-glue-mappingspec"></a>

**Package: com.amazonaws.services.glue**

## MappingSpec case class
<a name="glue-etl-scala-apis-glue-mappingspec-case-class"></a>

```
case class MappingSpec( sourcePath: SchemaPath,
                        sourceType: DataType,
                        targetPath: SchemaPath,
                        targetType: DataType
                       ) extends Product4[String, String, String, String] {
  override def _1: String = sourcePath.toString
  override def _2: String = ExtendedTypeName.fromDataType(sourceType)
  override def _3: String = targetPath.toString
  override def _4: String = ExtendedTypeName.fromDataType(targetType)
}
```
+ `sourcePath` — The `SchemaPath` of the source field.
+ `sourceType` — The `DataType` of the source field.
+ `targetPath` — The `SchemaPath` of the target field.
+ `targetType` — The `DataType` of the target field.

A `MappingSpec` specifies a mapping from a source path and a source data type to a target path and a target data type. The value at the source path in the source frame appears in the target frame at the target path. The source data type is cast to the target data type.

It extends `Product4` so that you can handle any `Product4` in your `applyMapping` interface.

## MappingSpec object
<a name="glue-etl-scala-apis-glue-mappingspec-object"></a>

```
object MappingSpec
```

The `MappingSpec` object has the following members:

## Val orderingByTarget
<a name="glue-etl-scala-apis-glue-mappingspec-object-val-orderingbytarget"></a>

```
val orderingByTarget: Ordering[MappingSpec]
```



## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-1"></a>

```
def apply( sourcePath : String,
           sourceType : DataType,
           targetPath : String,
           targetType : DataType
         ) : MappingSpec
```

Creates a `MappingSpec`.
+ `sourcePath` — A string representation of the source path.
+ `sourceType` — The source `DataType`.
+ `targetPath` — A string representation of the target path.
+ `targetType` — The target `DataType`.

Returns a `MappingSpec`.

## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-2"></a>

```
def apply( sourcePath : String,
           sourceTypeString : String,
           targetPath : String,
           targetTypeString : String
         ) : MappingSpec
```

Creates a `MappingSpec`.
+ `sourcePath` — A string representation of the source path.
+ `sourceTypeString` — A string representation of the source data type.
+ `targetPath` — A string representation of the target path.
+ `targetTypeString` — A string representation of the target data type.

Returns a `MappingSpec`.

## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-3"></a>

```
def apply( product : Product4[String, String, String, String] ) : MappingSpec 
```

Creates a `MappingSpec`.
+ `product` — The `Product4` of the source path, source data type, target path, and target data type.

Returns a `MappingSpec`.
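As a sketch of how these overloads are typically used, a `MappingSpec` can be passed to `applyMapping` on a `DynamicFrame` because it extends `Product4`. The frame name `dyf` and the field names are hypothetical:

```scala
import org.apache.spark.sql.types.{LongType, StringType}

// Map source field "id" (long) to target field "userId" (string).
val spec = MappingSpec("id", LongType, "userId", StringType)

// MappingSpec extends Product4, so it can be passed to applyMapping.
val mapped = dyf.applyMapping(Seq(spec))
```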

# AWS Glue Scala ResolveSpec APIs
<a name="glue-etl-scala-apis-glue-resolvespec"></a>

**Topics**
+ [ResolveSpec object](#glue-etl-scala-apis-glue-resolvespec-object)
+ [ResolveSpec case class](#glue-etl-scala-apis-glue-resolvespec-case-class)

**Package: com.amazonaws.services.glue**

## ResolveSpec object
<a name="glue-etl-scala-apis-glue-resolvespec-object"></a>

 **ResolveSpec**

```
object ResolveSpec
```

### Def apply
<a name="glue-etl-scala-apis-glue-resolvespec-object-def-apply_1"></a>

```
def apply( path : String,
           action : String
         ) : ResolveSpec
```

Creates a `ResolveSpec`.
+ `path` — A string representation of the choice field that needs to be resolved.
+ `action` — A resolution action. The action can be one of the following: `Project`, `KeepAsStruct`, or `Cast`.

Returns the `ResolveSpec`.

### Def apply
<a name="glue-etl-scala-apis-glue-resolvespec-object-def-apply_2"></a>

```
def apply( product : Product2[String, String] ) : ResolveSpec 
```

Creates a `ResolveSpec`.
+ `product` — A `Product2` of the source path and the resolution action.

Returns the `ResolveSpec`.
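For illustration, specs constructed this way match the `(path, action)` pairs accepted by `resolveChoice` on a `DynamicFrame`. The frame name `dyf` and the column name `price` below are hypothetical, and the cast action is expressed in its string form:

```scala
// Resolve a choice column named "price" by casting it to double.
val specs = Seq(ResolveSpec("price", "cast:double"))
val resolved = dyf.resolveChoice(specs)
```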

## ResolveSpec case class
<a name="glue-etl-scala-apis-glue-resolvespec-case-class"></a>

```
case class ResolveSpec( path : SchemaPath,
                        action : String
                      ) extends Product2[String, String]
```

Creates a `ResolveSpec`.
+ `path` — The `SchemaPath` of the choice field that needs to be resolved.
+ `action` — A resolution action. The action can be one of the following: `Project`, `KeepAsStruct`, or `Cast`.

### ResolveSpec def methods
<a name="glue-etl-scala-apis-glue-resolvespec-case-class-defs"></a>

```
def _1 : String 
```

```
def _2 : String 
```

# AWS Glue Scala ArrayNode APIs
<a name="glue-etl-scala-apis-glue-types-arraynode"></a>

**Package: com.amazonaws.services.glue.types**

## ArrayNode case class
<a name="glue-etl-scala-apis-glue-types-arraynode-case-class"></a>

 **ArrayNode**

```
case class ArrayNode( value : ArrayBuffer[DynamicNode] ) extends DynamicNode
```

### ArrayNode def methods
<a name="glue-etl-scala-apis-glue-types-arraynode-case-class-defs"></a>

```
def add( node : DynamicNode )
```

```
def clone
```

```
def equals( other : Any )
```

```
def get( index : Int ) : Option[DynamicNode] 
```

```
def getValue
```

```
def hashCode : Int 
```

```
def isEmpty : Boolean 
```

```
def nodeType
```

```
def remove( index : Int )
```

```
def this
```

```
def toIterator : Iterator[DynamicNode] 
```

```
def toJson : String 
```

```
def update( index : Int,
            node : DynamicNode )
```

# AWS Glue Scala BinaryNode APIs
<a name="glue-etl-scala-apis-glue-types-binarynode"></a>

**Package: com.amazonaws.services.glue.types**

## BinaryNode case class
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class"></a>

 **BinaryNode**

```
case class BinaryNode( value : Array[Byte] ) extends ScalarNode(value, TypeCode.BINARY)
```

### BinaryNode val fields
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class-vals"></a>
+ `ordering`

### BinaryNode def methods
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

# AWS Glue Scala BooleanNode APIs
<a name="glue-etl-scala-apis-glue-types-booleannode"></a>

**Package: com.amazonaws.services.glue.types**

## BooleanNode case class
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class"></a>

 **BooleanNode**

```
case class BooleanNode( value : Boolean ) extends ScalarNode(value, TypeCode.BOOLEAN)
```

### BooleanNode val fields
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class-vals"></a>
+ `ordering`

### BooleanNode def methods
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala ByteNode APIs
<a name="glue-etl-scala-apis-glue-types-bytenode"></a>

**Package: com.amazonaws.services.glue.types**

## ByteNode case class
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class"></a>

 **ByteNode**

```
case class ByteNode( value : Byte ) extends ScalarNode(value, TypeCode.BYTE)
```

### ByteNode val fields
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class-vals"></a>
+ `ordering`

### ByteNode def methods
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala DateNode APIs
<a name="glue-etl-scala-apis-glue-types-datenode"></a>

**Package: com.amazonaws.services.glue.types**

## DateNode case class
<a name="glue-etl-scala-apis-glue-types-datenode-case-class"></a>

 **DateNode**

```
case class DateNode( value : Date ) extends ScalarNode(value, TypeCode.DATE)
```

### DateNode val fields
<a name="glue-etl-scala-apis-glue-types-datenode-case-class-vals"></a>
+ `ordering`

### DateNode def methods
<a name="glue-etl-scala-apis-glue-types-datenode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Int )
```

# AWS Glue Scala DecimalNode APIs
<a name="glue-etl-scala-apis-glue-types-decimalnode"></a>

**Package: com.amazonaws.services.glue.types**

## DecimalNode case class
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class"></a>

 **DecimalNode**

```
case class DecimalNode( value : BigDecimal ) extends ScalarNode(value, TypeCode.DECIMAL)
```

### DecimalNode val fields
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class-vals"></a>
+ `ordering`

### DecimalNode def methods
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Decimal )
```

# AWS Glue Scala DoubleNode APIs
<a name="glue-etl-scala-apis-glue-types-doublenode"></a>

**Package: com.amazonaws.services.glue.types**

## DoubleNode case class
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class"></a>

 **DoubleNode**

```
case class DoubleNode( value : Double ) extends ScalarNode(value, TypeCode.DOUBLE)
```

### DoubleNode val fields
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class-vals"></a>
+ `ordering`

### DoubleNode def methods
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala DynamicNode APIs
<a name="glue-etl-scala-apis-glue-types-dynamicnode"></a>

**Topics**
+ [DynamicNode class](#glue-etl-scala-apis-glue-types-dynamicnode-class)
+ [DynamicNode object](#glue-etl-scala-apis-glue-types-dynamicnode-object)

**Package: com.amazonaws.services.glue.types**

## DynamicNode class
<a name="glue-etl-scala-apis-glue-types-dynamicnode-class"></a>

**DynamicNode**

```
class DynamicNode extends Serializable with Cloneable 
```

### DynamicNode def methods
<a name="glue-etl-scala-apis-glue-types-dynamicnode-class-defs"></a>

```
def getValue : Any
```

Gets the plain value and binds it to the current record:

```
def nodeType : TypeCode
```

```
def toJson : String
```

A method for debugging:

```
def toRow( schema : Schema,
           options : Map[String, ResolveOption]
         ) : Row
```

```
def typeName : String 
```

## DynamicNode object
<a name="glue-etl-scala-apis-glue-types-dynamicnode-object"></a>

 **DynamicNode**

```
object DynamicNode
```

### DynamicNode def methods
<a name="glue-etl-scala-apis-glue-types-dynamicnode-object-defs"></a>

```
def quote( field : String,
           useQuotes : Boolean
         ) : String
```

```
def quote( node : DynamicNode,
           useQuotes : Boolean
         ) : String
```

# EvaluateDataQuality class
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality"></a>


|  | 
| --- |
|  AWS Glue Data Quality is in preview release for AWS Glue and is subject to change.  | 

**Package: com.amazonaws.services.glue.dq**

```
object EvaluateDataQuality
```

## Def apply
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality-defs-apply"></a>

```
def apply(frame: DynamicFrame,
          ruleset: String,
          publishingOptions: JsonOptions = JsonOptions.empty): DynamicFrame
```

Evaluates a data quality ruleset against a `DynamicFrame`, and returns a new `DynamicFrame` with results of the evaluation. To learn more about AWS Glue Data Quality, see [AWS Glue Data Quality](glue-data-quality.md).
+ `frame` – The `DynamicFrame` that you want to evaluate the data quality of.
+ `ruleset` – A Data Quality Definition Language (DQDL) ruleset in string format. To learn more about DQDL, see the [Data Quality Definition Language (DQDL) reference](dqdl.md) guide.
+ `publishingOptions` – A `JsonOptions` structure that specifies the following options for publishing evaluation results and metrics:
  + `dataQualityEvaluationContext` – A string that specifies the namespace under which AWS Glue should publish Amazon CloudWatch metrics and the data quality results. The aggregated metrics appear in CloudWatch, while the full results appear in the AWS Glue Studio interface.
    + Required: No
    + Default value: `default_context`
  + `enableDataQualityCloudWatchMetrics` – Specifies whether the results of the data quality evaluation should be published to CloudWatch. You specify a namespace for the metrics using the `dataQualityEvaluationContext` option.
    + Required: No
    + Default value: False
  + `enableDataQualityResultsPublishing` – Specifies whether the data quality results should be visible on the **Data Quality** tab in the AWS Glue Studio interface.
    + Required: No
    + Default value: True
  + `resultsS3Prefix` – Specifies the Amazon S3 location where AWS Glue can write the data quality evaluation results.
    + Required: No
    + Default value: "" (empty string)

## Example
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality-example"></a>

The following example code demonstrates how to evaluate data quality for a `DynamicFrame` before performing a `SelectFields` transform. The script verifies that all data quality rules pass before it attempts the transform.

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.dq.EvaluateDataQuality

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    // Create DynamicFrame with data
    val Legislators_Area = glueContext.getCatalogSource(database="legislators", tableName="areas_json", transformationContext="S3bucket_node1").getDynamicFrame()

    // Define data quality ruleset
    val DQ_Ruleset = """
      Rules = [ColumnExists "id"]
    """

    // Evaluate data quality
    val DQ_Results = EvaluateDataQuality.apply(
      frame = Legislators_Area,
      ruleset = DQ_Ruleset,
      publishingOptions = JsonOptions(
        """{"dataQualityEvaluationContext": "Legislators_Area", "enableDataQualityCloudWatchMetrics": "true", "enableDataQualityResultsPublishing": "true"}"""
      )
    )
    assert(DQ_Results.filter(_.getField("Outcome").contains("Failed")).count == 0, "Failing DQ rules for Legislators_Area caused the job to fail.")

    // Script generated for node Select Fields
    val SelectFields_Results = Legislators_Area.selectFields(paths=Seq("id", "name"), transformationContext="Legislators_Area")

    Job.commit()
  }
}
```

# AWS Glue Scala FloatNode APIs
<a name="glue-etl-scala-apis-glue-types-floatnode"></a>

**Package: com.amazonaws.services.glue.types**

## FloatNode case class
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class"></a>

**FloatNode**

```
case class FloatNode( value : Float ) extends ScalarNode(value, TypeCode.FLOAT)
```

### FloatNode val fields
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class-vals"></a>
+ `ordering`

### FloatNode def methods
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class-defs"></a>

```
def equals( other : Any )
```

# FillMissingValues class
<a name="glue-etl-scala-apis-glue-ml-fillmissingvalues"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FillMissingValues
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-fillmissingvalues-defs-apply"></a>

```
def apply(frame: DynamicFrame,
          missingValuesColumn: String,
          outputColumn: String = "",
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0): DynamicFrame
```

Fills a dynamic frame's missing values in a specified column and returns a new frame with estimates in a new column. For rows without missing values, the specified column's value is duplicated to the new column.
+ `frame` — The DynamicFrame in which to fill missing values. Required.
+ `missingValuesColumn` — The column containing missing values (`null` values and empty strings). Required.
+ `outputColumn` — The name of the new column that will contain estimated values for all rows whose value was missing. Optional; the default is the value of `missingValuesColumn` suffixed by `"_filled"`.
+ `transformationContext` — A unique string that is used to identify state information (optional).
+ `callSite` — Used to provide context information for error reporting (optional).
+ `stageThreshold` — The maximum number of errors that can occur in the transformation before it errors out (optional; the default is zero).
+ `totalThreshold` — The maximum number of errors that can occur overall before processing errors out (optional; the default is zero).

Returns a new dynamic frame with one additional column that contains estimations for rows with missing values and the present value for other rows.
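
The contract above can be sketched in plain Scala on a toy column. This is not the Glue API: the real transform trains a machine learning model to produce its estimates, while the sketch below substitutes the column mean as a stand-in estimator and models missing values as `None`.

```scala
// Illustrative stand-in for the FillMissingValues contract (not the Glue API):
// present values are copied to the output column unchanged; missing values
// (None, mirroring null/empty string) receive an estimate -- here, the
// column mean stands in for the ML-based estimator.
def fillMissing(column: Seq[Option[Double]]): Seq[Double] = {
  val present  = column.flatten
  val estimate = if (present.nonEmpty) present.sum / present.size else 0.0
  column.map(_.getOrElse(estimate))
}

val ages       = Seq(Some(30.0), None, Some(50.0))
val agesFilled = fillMissing(ages) // present rows copied, missing row estimated
```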

# FindMatches class
<a name="glue-etl-scala-apis-glue-ml-findmatches"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FindMatches
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-findmatches-defs-apply"></a>

```
def apply(frame: DynamicFrame,
          transformId: String,
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0,
          enforcedMatches: DynamicFrame = null,
          computeMatchConfidenceScores: Boolean = false): DynamicFrame
```

Finds matches in an input frame and returns a new frame with a new column containing a unique ID per match group.
+ `frame` — The DynamicFrame in which to find matches. Required.
+ `transformId` — A unique ID associated with the FindMatches transform to apply on the input frame. Required.
+ `transformationContext` — Identifier for this `DynamicFrame`. The `transformationContext` is used as a key for the job bookmark state that is persisted across runs. Optional.
+ `callSite` — Used to provide context information for error reporting. These values are automatically set when calling from Python. Optional.
+ `stageThreshold` — The maximum number of error records allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records present in the previous `DynamicFrame`. Optional. The default is zero.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames. Optional. The default is zero.
+ `enforcedMatches` — The frame for enforced matches. Optional. The default is `null`.
+ `computeMatchConfidenceScores` — A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new dynamic frame with a unique identifier assigned to each group of matching records.
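
The shape of the result can be sketched in plain Scala. This is not the Glue API: the real transform applies a trained matching model identified by `transformId`, while the sketch below substitutes exact, case-insensitive equality on a hypothetical `name` field to show only the output shape (one unique ID per group of matching records).

```scala
// Sketch of the output shape only: each group of matching records shares a
// unique match ID. Case-insensitive key grouping stands in for the ML matcher.
case class Record(id: Int, name: String)

def assignMatchIds(records: Seq[Record]): Seq[(Record, Long)] = {
  val groupIds = records.map(_.name.toLowerCase).distinct.zipWithIndex.toMap
  records.map(r => (r, groupIds(r.name.toLowerCase).toLong))
}

val rows    = Seq(Record(1, "Ann"), Record(2, "ann"), Record(3, "Bob"))
val matched = assignMatchIds(rows) // records 1 and 2 share a match ID
```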

# FindIncrementalMatches class
<a name="glue-etl-scala-apis-glue-ml-findincrementalmatches"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FindIncrementalMatches
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-findincrementalmatches-defs-apply"></a>

```
def apply(existingFrame: DynamicFrame,
          incrementalFrame: DynamicFrame,
          transformId: String,
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0,
          enforcedMatches: DynamicFrame = null,
          computeMatchConfidenceScores: Boolean = false): DynamicFrame
```

Finds matches across the existing and incremental frames and returns a new frame with a column containing a unique ID per match group.
+ `existingFrame` — An existing frame that has been assigned a matching ID for each group. Required.
+ `incrementalFrame` — An incremental frame used to find matches against the existing frame. Required.
+ `transformId` — A unique ID associated with the FindIncrementalMatches transform to apply on the input frames. Required.
+ `transformationContext` — Identifier for this `DynamicFrame`. The `transformationContext` is used as a key for the job bookmark state that is persisted across runs. Optional.
+ `callSite` — Used to provide context information for error reporting. These values are automatically set when calling from Python. Optional.
+ `stageThreshold` — The maximum number of error records allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records present in the previous `DynamicFrame`. Optional. The default is zero.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames. Optional. The default is zero.
+ `enforcedMatches` — The frame for enforced matches. Optional. The default is `null`.
+ `computeMatchConfidenceScores` — A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new dynamic frame with a unique identifier assigned to each group of matching records.

# AWS Glue Scala IntegerNode APIs
<a name="glue-etl-scala-apis-glue-types-integernode"></a>

**Package: com.amazonaws.services.glue.types**

## IntegerNode case class
<a name="glue-etl-scala-apis-glue-types-integernode-case-class"></a>

**IntegerNode**

```
case class IntegerNode( value : Int ) extends ScalarNode(value, TypeCode.INT)
```

### IntegerNode val fields
<a name="glue-etl-scala-apis-glue-types-integernode-case-class-vals"></a>
+ `ordering`

### IntegerNode def methods
<a name="glue-etl-scala-apis-glue-types-integernode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala LongNode APIs
<a name="glue-etl-scala-apis-glue-types-longnode"></a>

**Package: com.amazonaws.services.glue.types**

## LongNode case class
<a name="glue-etl-scala-apis-glue-types-longnode-case-class"></a>

 **LongNode**

```
case class LongNode( value : Long ) extends ScalarNode(value, TypeCode.LONG)
```

### LongNode val fields
<a name="glue-etl-scala-apis-glue-types-longnode-case-class-vals"></a>
+ `ordering`

### LongNode def methods
<a name="glue-etl-scala-apis-glue-types-longnode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala MapLikeNode APIs
<a name="glue-etl-scala-apis-glue-types-maplikenode"></a>

**Package: com.amazonaws.services.glue.types**

## MapLikeNode class
<a name="glue-etl-scala-apis-glue-types-maplikenode-class"></a>

**MapLikeNode**

```
class MapLikeNode( value : mutable.Map[String, DynamicNode] ) extends DynamicNode
```

### MapLikeNode def methods
<a name="glue-etl-scala-apis-glue-types-maplikenode-class-defs"></a>

```
def clear : Unit 
```

```
def get( name : String ) : Option[DynamicNode] 
```

```
def getValue
```

```
def has( name : String ) : Boolean 
```

```
def isEmpty : Boolean 
```

```
def put( name : String,
         node : DynamicNode
       ) : Option[DynamicNode]
```

```
def remove( name : String ) : Option[DynamicNode] 
```

```
def toIterator : Iterator[(String, DynamicNode)] 
```

```
def toJson : String 
```

```
def toJson( useQuotes : Boolean ) : String 
```

**Example:** Given this JSON: 

```
{"foo": "bar"}
```

If `useQuotes == true`, `toJson` yields `{"foo": "bar"}`. If `useQuotes == false`, `toJson` yields `{foo: bar}`.
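
The two renderings can be sketched in plain Scala, assuming a flat, string-valued map (the real `toJson` handles arbitrary nested `DynamicNode` values and escaping):

```scala
// Minimal sketch of the useQuotes behavior for flat, string-valued maps only.
// The real MapLikeNode.toJson handles nested DynamicNodes and escaping.
def toJsonSketch(m: Seq[(String, String)], useQuotes: Boolean): String = {
  val body = m.map { case (k, v) =>
    if (useQuotes) s""""$k": "$v"""" else s"$k: $v"
  }.mkString(", ")
  s"{$body}"
}

val quoted   = toJsonSketch(Seq("foo" -> "bar"), useQuotes = true)  // {"foo": "bar"}
val unquoted = toJsonSketch(Seq("foo" -> "bar"), useQuotes = false) // {foo: bar}
```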

# AWS Glue Scala MapNode APIs
<a name="glue-etl-scala-apis-glue-types-mapnode"></a>

**Package: com.amazonaws.services.glue.types**

## MapNode case class
<a name="glue-etl-scala-apis-glue-types-mapnode-case-class"></a>

 **MapNode**

```
case class MapNode( value : mutable.Map[String, DynamicNode] ) extends MapLikeNode(value)
```

### MapNode def methods
<a name="glue-etl-scala-apis-glue-types-mapnode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def this
```

# AWS Glue Scala NullNode APIs
<a name="glue-etl-scala-apis-glue-types-nullnode"></a>

**Topics**
+ [NullNode class](#glue-etl-scala-apis-glue-types-nullnode-class)
+ [NullNode case object](#glue-etl-scala-apis-glue-types-nullnode-case-object)

**Package: com.amazonaws.services.glue.types**

## NullNode class
<a name="glue-etl-scala-apis-glue-types-nullnode-class"></a>

 **NullNode**

```
class NullNode
```

## NullNode case object
<a name="glue-etl-scala-apis-glue-types-nullnode-case-object"></a>

 **NullNode**

```
case object NullNode extends NullNode 
```

# AWS Glue Scala ObjectNode APIs
<a name="glue-etl-scala-apis-glue-types-objectnode"></a>

**Topics**
+ [ObjectNode object](#glue-etl-scala-apis-glue-types-objectnode-object)
+ [ObjectNode case class](#glue-etl-scala-apis-glue-types-objectnode-case-class)

**Package: com.amazonaws.services.glue.types**

## ObjectNode object
<a name="glue-etl-scala-apis-glue-types-objectnode-object"></a>

**ObjectNode**

```
object ObjectNode
```

### ObjectNode def methods
<a name="glue-etl-scala-apis-glue-types-objectnode-object-defs"></a>

```
def apply( frameKeys : Set[String],
           v1 : mutable.Map[String, DynamicNode],
           v2 : mutable.Map[String, DynamicNode],
           resolveWith : String
         ) : ObjectNode
```

## ObjectNode case class
<a name="glue-etl-scala-apis-glue-types-objectnode-case-class"></a>

 **ObjectNode**

```
case class ObjectNode( val value : mutable.Map[String, DynamicNode] ) extends MapLikeNode(value)
```

### ObjectNode def methods
<a name="glue-etl-scala-apis-glue-types-objectnode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def this
```

# AWS Glue Scala ScalarNode APIs
<a name="glue-etl-scala-apis-glue-types-scalarnode"></a>

**Topics**
+ [ScalarNode class](#glue-etl-scala-apis-glue-types-scalarnode-class)
+ [ScalarNode object](#glue-etl-scala-apis-glue-types-scalarnode-object)

**Package: com.amazonaws.services.glue.types**

## ScalarNode class
<a name="glue-etl-scala-apis-glue-types-scalarnode-class"></a>

**ScalarNode**

```
class ScalarNode( value : Any,
                  scalarType : TypeCode ) extends DynamicNode
```

### ScalarNode def methods
<a name="glue-etl-scala-apis-glue-types-scalarnode-class-defs"></a>

```
def compare( other : Any,
             operator : String
           ) : Boolean
```

```
def getValue
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def toJson
```

## ScalarNode object
<a name="glue-etl-scala-apis-glue-types-scalarnode-object"></a>

 **ScalarNode**

```
object ScalarNode
```

### ScalarNode def methods
<a name="glue-etl-scala-apis-glue-types-scalarnode-object-defs"></a>

```
def apply( v : Any ) : DynamicNode 
```

```
def compare[T]( tv : Ordered[T],
                other : T,
                operator : String
              ) : Boolean
```
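
The operator-string dispatch that these `compare` methods imply can be sketched in plain Scala. The token set (`"<"`, `"<="`, `">"`, `">="`, `"=="`) is an assumption for illustration; the tokens the Glue library actually accepts are defined in its source.

```scala
// Hypothetical sketch of comparing two ordered values by an operator token.
// The operator tokens shown here are an assumption, not the Glue API's set.
def compareSketch[T](x: T, y: T, operator: String)(implicit ord: Ordering[T]): Boolean =
  operator match {
    case "<"   => ord.lt(x, y)
    case "<="  => ord.lteq(x, y)
    case ">"   => ord.gt(x, y)
    case ">="  => ord.gteq(x, y)
    case "=="  => ord.equiv(x, y)
    case other => throw new IllegalArgumentException(s"Unknown operator: $other")
  }
```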

```
def compareAny( v : Any,
                y : Any,
                o : String )
```

```
def withEscapedSpecialCharacters( jsonToEscape : String ) : String 
```

# AWS Glue Scala ShortNode APIs
<a name="glue-etl-scala-apis-glue-types-shortnode"></a>

**Package: com.amazonaws.services.glue.types**

## ShortNode case class
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class"></a>

**ShortNode**

```
case class ShortNode( value : Short ) extends ScalarNode(value, TypeCode.SHORT)
```

### ShortNode val fields
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class-vals"></a>
+ `ordering`

### ShortNode def methods
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala StringNode APIs
<a name="glue-etl-scala-apis-glue-types-stringnode"></a>

**Package: com.amazonaws.services.glue.types**

## StringNode case class
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class"></a>

 **StringNode**

```
case class StringNode( value : String ) extends ScalarNode(value, TypeCode.STRING)
```

### StringNode val fields
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class-vals"></a>
+ `ordering`

### StringNode def methods
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : UTF8String )
```

# AWS Glue Scala TimestampNode APIs
<a name="glue-etl-scala-apis-glue-types-timestampnode"></a>

**Package: com.amazonaws.services.glue.types**

## TimestampNode case class
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class"></a>

**TimestampNode**

```
case class TimestampNode( value : Timestamp ) extends ScalarNode(value, TypeCode.TIMESTAMP)
```

### TimestampNode val fields
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class-vals"></a>
+ `ordering`

### TimestampNode def methods
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Long )
```

# AWS Glue Scala GlueArgParser APIs
<a name="glue-etl-scala-apis-glue-util-glueargparser"></a>

**Package: com.amazonaws.services.glue.util**

## GlueArgParser object
<a name="glue-etl-scala-apis-glue-util-glueargparser-object"></a>

**GlueArgParser**

```
object GlueArgParser
```

This object behaves consistently with the Python version of `utils.getResolvedOptions` in the `AWSGlueDataplanePython` package.

### GlueArgParser def methods
<a name="glue-etl-scala-apis-glue-util-glueargparser-object-defs"></a>

```
def getResolvedOptions( args : Array[String],
                        options : Array[String]
                      ) : Map[String, String]
```

```
def initParser( userOptionsSet : mutable.Set[String] ) : ArgumentParser 
```

**Example: Retrieving arguments passed to a job**  
To retrieve job arguments, you can use the `getResolvedOptions` method. Consider the following example, which retrieves a job argument named `aws_region`.  

```
val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME","aws_region").toArray)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val region = args("aws_region")
println(region)
```
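
Job arguments reach the script as `--name value` pairs in `sysArgs`. The resolution step can be sketched in plain Scala (this is not the Glue implementation; validation and error handling are omitted, and the `--name value` pairing is an assumption about the argument layout):

```scala
// Plain-Scala sketch of resolving "--name value" pairs into a Map -- the idea
// behind getResolvedOptions. Not the Glue implementation: validation, reserved
// options, and error handling are omitted.
def resolveOptions(args: Array[String], options: Seq[String]): Map[String, String] = {
  val pairs = args.sliding(2, 2).collect {
    case Array(flag, value) if flag.startsWith("--") => flag.stripPrefix("--") -> value
  }.toMap
  options.map(name => name -> pairs(name)).toMap
}

val sysArgs = Array("--JOB_NAME", "my-job", "--aws_region", "us-east-1")
val region  = resolveOptions(sysArgs, Seq("JOB_NAME", "aws_region"))("aws_region") // "us-east-1"
```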

# AWS Glue Scala job APIs
<a name="glue-etl-scala-apis-glue-util-job"></a>

**Package: com.amazonaws.services.glue.util**

## Job object
<a name="glue-etl-scala-apis-glue-util-job-object"></a>

 **Job**

```
object Job
```

### Job def methods
<a name="glue-etl-scala-apis-glue-util-job-object-defs"></a>

```
def commit
```

```
def init( jobName : String,
          glueContext : GlueContext,
          args : java.util.Map[String, String] = Map[String, String]().asJava
        ) : this.type
```

```
def init( jobName : String,
          glueContext : GlueContext,
          endpoint : String,
          args : java.util.Map[String, String]
        ) : this.type
```

```
def isInitialized
```

```
def reset
```

```
def runId
```