MLSEC02-BP01 Design data encryption and obfuscation

Consider how to protect personal data. Use field-level encryption or obfuscation to protect personally identifiable data.

Desired outcome: You establish robust protection for sensitive information by implementing data encryption and obfuscation techniques in your machine learning workflows. You identify and secure personally identifiable information (PII) through field-level encryption and data masking, which improves your adherence to privacy regulations while maintaining data utility for ML models.

Common anti-patterns:

Storing personally identifiable information in plain text format.
Using the same encryption keys across different environments.
Implementing inconsistent data protection policies across ML pipelines.
Overlooking data protection requirements during the design phase.
Failing to audit data for attributes requiring special treatment.

Benefits of establishing this best practice:

Enhanced protection of sensitive and personally identifiable data.
Improved adherence to data privacy regulations.
Reduced risk of data breaches and unauthorized access.
Improved trust from users and stakeholders.
Ability to utilize sensitive data for ML training while maintaining privacy.

Level of risk exposed if this best practice is not established: High

Implementation guidance

When designing machine learning workflows, protect personal and sensitive data throughout the entire data lifecycle. You should evaluate your data early in the process to identify fields containing PII or other sensitive information requiring protection. Implementing field-level encryption or data obfuscation techniques maintains data utility for machine learning while safeguarding individual privacy.
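As a minimal sketch of the early evaluation described above, the following Python example scans a small sample of records and flags fields whose values look like PII. The patterns, the `find_pii_fields` function, the `min_fraction` threshold, and the sample records are all hypothetical names and values chosen for illustration. A managed detection service would cover far more data types and formats than these two regular expressions.

```python
import re

# Illustrative patterns only (assumption): real detectors cover many more
# formats, locales, and data types than these two.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii_fields(records, min_fraction=0.5):
    """Return {field: sorted PII types} for fields where at least
    min_fraction of the non-empty values match a pattern."""
    matches = {}  # (field, pii_type) -> number of matching values
    totals = {}   # field -> number of non-empty values
    for record in records:
        for field, value in record.items():
            if value is None or value == "":
                continue
            totals[field] = totals.get(field, 0) + 1
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    key = (field, pii_type)
                    matches[key] = matches.get(key, 0) + 1

    flagged = {}
    for (field, pii_type), count in matches.items():
        if count / totals[field] >= min_fraction:
            flagged.setdefault(field, []).append(pii_type)
    return {field: sorted(types) for field, types in flagged.items()}


# Hypothetical sample records.
sample = [
    {"user_id": 1, "contact": "ana@example.com", "tax_id": "123-45-6789", "age": 34},
    {"user_id": 2, "contact": "li@example.org", "tax_id": "987-65-4321", "age": 51},
]
flagged_fields = find_pii_fields(sample)
print(flagged_fields)
```

Running a scan like this on a sample before the pipeline is built gives you a first list of fields to review for encryption, masking, or removal. Treat the result as a starting point for review, not as a complete inventory.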

AWS provides multiple services to identify, classify, and protect sensitive data within your ML workflows. Services like AWS Glue can automatically detect PII, while AWS Key Management Service (AWS KMS) and AWS CloudHSM support robust encryption strategies. Establish consistent policies for handling sensitive data across your organization, and regularly audit your data protection measures to improve your adherence to privacy regulations.
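One way to keep handling consistent across pipelines is to express the policy as data and apply it with a single shared function. The sketch below assumes a hypothetical three-level classification (`direct_identifier`, `quasi_identifier`, `public`) and hypothetical field names. It tokenizes with a keyed hash (HMAC-SHA256) and masks by hiding trailing characters. The hard-coded key is for demonstration only; in practice the key would be retrieved from a key management service and would differ for each environment.

```python
import hashlib
import hmac

# Hypothetical policy: sensitivity class -> protection action.
POLICY = {
    "direct_identifier": "tokenize",
    "quasi_identifier": "mask",
    "public": "keep",
}

# Hypothetical classification of the fields in one dataset.
FIELD_CLASSES = {
    "email": "direct_identifier",
    "postal_code": "quasi_identifier",
    "plan": "public",
}


def tokenize(value, key):
    """Keyed hash: the same value and key always give the same token,
    so joins and group-bys on the tokenized field still work."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value, visible=2):
    """Keep the first `visible` characters and replace the rest with '*'."""
    text = str(value)
    return text[:visible] + "*" * max(len(text) - visible, 0)


def protect_record(record, key):
    """Apply the policy to every field of one record."""
    protected = {}
    for field, value in record.items():
        # Unclassified fields fall back to the strictest class.
        action = POLICY[FIELD_CLASSES.get(field, "direct_identifier")]
        if action == "tokenize":
            protected[field] = tokenize(value, key)
        elif action == "mask":
            protected[field] = mask(value)
        else:
            protected[field] = value
    return protected


# Demonstration key only (assumption): never hard-code real keys.
demo_key = b"demo-only-key-do-not-use"
row = {"email": "ana@example.com", "postal_code": "98101", "plan": "pro"}
protected_row = protect_record(row, demo_key)
print(protected_row)
```

Defaulting unclassified fields to the strictest action means a newly added column is protected until someone classifies it, which avoids silently passing unreviewed data to training jobs.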

Implementation steps

Audit data for attributes requiring special treatment. Identify fields containing data requiring special treatment, such as field-level encryption, data masking, or obfuscation. Use automated tools like AWS Glue to identify PII and sensitive data patterns within your datasets.
Establish a data classification framework. Develop a systematic approach to categorize data based on sensitivity levels. Define which categories require encryption, masking, or other protection techniques, and document these requirements in your organization's security policies.
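Implement field-level encryption. Apply encryption selectively to sensitive fields rather than entire datasets. Use AWS KMS to manage encryption keys, and encrypt the selected fields on the client side before writing them to data stores such as Amazon S3 or Amazon DynamoDB (for example, with the AWS Database Encryption SDK for DynamoDB). Server-side encryption in these services protects whole objects or tables, not individual fields.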
Apply data obfuscation techniques. Use methods such as tokenization, data masking, or anonymization to protect sensitive information while preserving data utility for machine learning. Consider using services like AWS Glue DataBrew for data transformation and masking operations.
Establish key rotation policies. Implement regular rotation of encryption keys to minimize the impact of potential key compromises. Configure AWS KMS to automate key rotation according to your security policies and regulatory requirements.
Secure ML model artifacts. Verify that trained models and their associated metadata do not inadvertently expose sensitive information. Use Amazon SageMaker AI's security features to encrypt model artifacts and secure API endpoints that serve predictions.
Implement access controls. Restrict access to sensitive data and encryption keys using AWS Identity and Access Management (IAM) policies. Apply the principle of least privilege so that only authorized users and roles can access protected information.
Monitor and audit access patterns. Implement continuous monitoring to detect unauthorized access attempts or unusual patterns that might indicate a security breach. Configure AWS CloudTrail and Amazon CloudWatch to track and alert on suspicious activities.
Implement differential privacy techniques. When training models on sensitive data, consider differential privacy techniques, which add calibrated statistical noise to the data, to aggregate results, or to training updates. This limits what can be learned about an individual record while maintaining overall data utility.
Establish mechanisms to stop model memorization. Implement safeguards to block AI models from memorizing and potentially reproducing sensitive information from training data, especially when using large language models.
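
For the differential privacy step above, the following is a minimal sketch of one standard technique, the Laplace mechanism, applied to a mean. It is an illustration under stated assumptions, not a method prescribed by this guidance: the function names, the bounds, the sample values, and the choice of epsilon are all hypothetical, and the number of records is treated as public. Production training workloads typically rely on a vetted differential privacy library instead of hand-written noise.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially-private estimate of the mean.

    Values are clipped to [lower, upper] so that changing one record
    moves the mean by at most (upper - lower) / n, the sensitivity.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must not be empty")
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical ages; bounds and epsilon are illustrative choices.
ages = [34, 51, 29, 42, 63]
rng = random.Random(0)  # seeded only to make the demonstration repeatable
noisy_mean = private_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)
print(noisy_mean)
```

A smaller epsilon gives stronger privacy but a noisier estimate, so the value is a policy decision to make alongside your data classification requirements.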

