RemoveDuplicates class
The RemoveDuplicates transform deletes an entire row if a duplicate value is encountered in a
selected source column.
Example
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.RemoveDuplicates.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_1",
    )
except:
    print("Unexpected Error happened ")
    raise
Output
The output will be a PySpark DataFrame with duplicates removed based on the
source_column_1 column. The resulting `df_output` DataFrame will contain the following rows:
```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|          13.12|          13.12|
|           null|          13.12|
+---------------+---------------+
```
Note that the rows with source_column_1 values of `13.12` and `null` appear only once in the output
DataFrame, as the duplicates have been removed based on the source_column_1 column.
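The deduplication behavior described here can be illustrated with a minimal sketch in plain Python. This is not the Glue implementation: the helper `remove_duplicates` is a hypothetical name, rows are modeled as dictionaries rather than a PySpark DataFrame, and keeping the first occurrence of each value is an assumption made so the result is deterministic. As in the output shown, `None` is treated as an ordinary value that can have duplicates.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def remove_duplicates(rows: List[Row], source_column: str) -> List[Row]:
    """Hypothetical helper: keep one row per distinct value in source_column.

    Assumption: the first row seen for each value is the one that is kept.
    None is treated like any other value.
    """
    seen = set()
    kept = []
    for row in rows:
        value = row[source_column]
        if value in seen:
            # A row with this value was already kept, so drop the whole row.
            continue
        seen.add(value)
        kept.append(row)
    return kept


# Same data as the example input, modeled as a list of dictionaries.
rows = [
    {"source_column_1": 105.111, "source_column_2": 13.12},
    {"source_column_1": 13.12, "source_column_2": 13.12},
    {"source_column_1": None, "source_column_2": 13.12},
    {"source_column_1": 13.12, "source_column_2": 13.12},
    {"source_column_1": None, "source_column_2": 13.12},
]

result = remove_duplicates(rows, "source_column_1")
for row in result:
    print(row)
```

Only `source_column_1` is compared. The values in `source_column_2` are identical in every row, yet they do not cause any rows to be removed.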
Methods
__call__(spark_context, data_frame, source_column)
The RemoveDuplicates transform deletes an entire row if a duplicate value is encountered in a
selected source column.
- source_column – The name of an existing column.
apply(cls, *args, **kwargs)
Inherited from GlueTransform apply.
name(cls)
Inherited from GlueTransform name.
describeArgs(cls)
Inherited from GlueTransform describeArgs.
describeReturn(cls)
Inherited from GlueTransform describeReturn.
describeTransform(cls)
Inherited from GlueTransform describeTransform.
describeErrors(cls)
Inherited from GlueTransform describeErrors.
describe(cls)
Inherited from GlueTransform describe.