Capability 3. Providing secure access to data and systems for generative AI
Retrieval Augmented Generation (RAG)
RAG enables the LLM to provide up-to-date, context-specific responses by dynamically pulling relevant information from enterprise data sources. However, this integration introduces critical security challenges. Securing RAG implementations requires extending defense-in-depth principles from Capability 1 and Capability 2 to address how LLMs securely use data from external sources. The following diagram illustrates recommended AWS services for the Generative AI account RAG capability.
The Generative AI account includes services for storing embeddings in a vector database, storing conversations for users, and maintaining a prompt store. The account includes security services to implement security guardrails and centralized security governance. Create Amazon Simple Storage Service (Amazon S3) gateway endpoints for the model invocation logs, prompt store, and knowledge base data source buckets in Amazon S3 that the VPC environment accesses. Create an Amazon CloudWatch Logs interface endpoint for the CloudWatch logs that the VPC environment accesses.
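As a minimal sketch, the endpoint configuration described above can be expressed as boto3 request payloads. The VPC, route table, and subnet IDs are hypothetical placeholders; note that gateway endpoints exist only for Amazon S3 and DynamoDB, so CloudWatch Logs uses an interface endpoint.

```python
def s3_gateway_endpoint(vpc_id: str, route_table_ids: list, region: str) -> dict:
    """Gateway endpoint that keeps Amazon S3 traffic on the AWS network."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        "RouteTableIds": route_table_ids,
    }

def cloudwatch_logs_interface_endpoint(vpc_id: str, subnet_ids: list, region: str) -> dict:
    """Interface endpoint for CloudWatch Logs (no gateway endpoint exists for this service)."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.logs",
        "VpcEndpointType": "Interface",
        "SubnetIds": subnet_ids,
        "PrivateDnsEnabled": True,
    }

# Pass either payload to the EC2 API, for example:
#   boto3.client("ec2").create_vpc_endpoint(
#       **s3_gateway_endpoint("vpc-0abc", ["rtb-0abc"], "us-east-1"))
```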
Rationale
RAG enhances FM responses by retrieving information from external, authoritative knowledge bases before generating answers. This approach overcomes FM limitations by providing access to up-to-date, context-specific data, improving the accuracy and relevance of generated responses.
RAG can be implemented across Scopes 2-5 of the Generative AI Security Scoping Matrix.
In Scope 3, you build a generative AI application using a pre-trained FM such as those offered on Amazon Bedrock. You control your application and any customer data your application uses. The FM provider controls the pre-trained model and its training data.
RAG systems face the following unique security risks:
- Data exfiltration of RAG data sources by threat actors
- Poisoning of RAG data sources with prompt injections or malware
- Unauthorized access to sensitive information through inadequate access controls
- Sensitive information disclosure through uncontrolled model outputs
- Lack of data provenance, leading to compliance and auditability challenges
Design considerations
Avoid customizing an FM with sensitive data (for more information, see Capability 2). Instead, use the RAG technique to interact with sensitive information. RAG provides the following advantages:
- Tighter control and visibility – Keep sensitive data separate from the model. You can edit, update, or remove data without retraining the model, ensuring data governance and compliance with regulatory requirements.
- Reduced sensitive information disclosure – RAG controls interactions with sensitive data during model invocation. This reduces the risk of unintended disclosure that occurs when you incorporate data directly into the model's parameters.
- Flexibility and adaptability – Update or modify sensitive information as data requirements or regulations change without retraining or rebuilding the language model.
- Enhanced security posture – Implement multiple security layers including metadata filtering, access controls, and data redaction at different stages of the RAG pipeline.
Multi-layered security strategy
Implement a defense-in-depth approach with security controls at the following stages:
- Ingestion time – Filter and validate data before it enters the knowledge base.
- Storage level – Encrypt data at rest and implement access controls.
- Retrieval time – Apply metadata filtering and role-based access controls.
- Inference time – Use guardrails to filter model outputs and detect sensitive information.
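The inference-time stage above can be sketched with the Amazon Bedrock ApplyGuardrail API, which evaluates content against a configured guardrail independently of model invocation. The guardrail identifier and version in this sketch are hypothetical placeholders.

```python
def apply_guardrail_request(guardrail_id: str, version: str, model_output: str) -> dict:
    """Build an ApplyGuardrail payload that checks a model response before returning it."""
    return {
        "guardrailIdentifier": guardrail_id,  # placeholder; use your guardrail's ID
        "guardrailVersion": version,
        "source": "OUTPUT",                   # use "INPUT" to screen user prompts instead
        "content": [{"text": {"text": model_output}}],
    }

# resp = boto3.client("bedrock-runtime").apply_guardrail(
#     **apply_guardrail_request("gr-example", "1", candidate_response))
# If resp["action"] is "GUARDRAIL_INTERVENED", the output was blocked or masked.
```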
Amazon Bedrock Knowledge Bases
Amazon Bedrock Knowledge Bases provides a fully managed solution for building RAG applications by securely connecting FMs to your organization's data. This service uses vector stores (such as Amazon OpenSearch Serverless) to retrieve relevant information efficiently, and the FM uses this information to generate responses. Amazon Bedrock synchronizes your data from Amazon S3 to the knowledge base and generates embeddings automatically.
Key features of Amazon Bedrock Knowledge Bases include the following:
- Source attribution – Knowledge bases include source attribution for all retrieved information to improve transparency and minimize hallucinations. This provenance tracking enables you to:
  - Verify the accuracy of generated responses.
  - Maintain audit trails for compliance.
  - Build user trust in AI-generated content.
  - Support troubleshooting and investigations during security events.
- Automated vector store management – Amazon Bedrock automatically creates and manages vector stores in OpenSearch Serverless, synchronizing data from Amazon S3 and generating embeddings for efficient retrieval.
- Metadata filtering – Knowledge bases support metadata filtering capabilities that enable access control by pre-filtering the vector store based on document metadata before searching for relevant documents. This filtering reduces noise, improves retrieval accuracy, and enforces data access policies.
- Multimodal support – Knowledge bases process documents with visual resources, extracting and retrieving images in responses to queries, which supports comprehensive document understanding.
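A minimal sketch of querying a knowledge base with the RetrieveAndGenerate API follows; the knowledge base ID and model ARN are hypothetical placeholders. The response's citations field carries the source attribution described above.

```python
def rag_request(kb_id: str, model_arn: str, question: str) -> dict:
    """Build a RetrieveAndGenerate payload for a Bedrock knowledge base query."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,     # placeholder
                "modelArn": model_arn,        # placeholder
            },
        },
    }

# resp = boto3.client("bedrock-agent-runtime").retrieve_and_generate(
#     **rag_request("KB12345", "arn:aws:bedrock:...:foundation-model/...", "What is our refund policy?"))
# resp["output"]["text"] is the answer; resp["citations"] lists the retrieved source chunks.
```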
For each vector database option, configure the following:
- Field mappings for vector embeddings, text chunks, and metadata
- Customer managed AWS KMS keys for encrypting secrets and data
- AWS Secrets Manager secrets for authentication credentials
- Network connectivity through AWS PrivateLink where supported
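For the Secrets Manager item above, a sketch of creating a vector database credential encrypted with a customer managed key follows. The secret name, credentials, and key ARN are hypothetical placeholders.

```python
import json

def vector_db_secret_request(name: str, username: str, password: str, kms_key_arn: str) -> dict:
    """Build a Secrets Manager CreateSecret payload encrypted with a customer managed KMS key."""
    return {
        "Name": name,
        "SecretString": json.dumps({"username": username, "password": password}),
        # Without KmsKeyId, Secrets Manager falls back to the AWS managed key.
        "KmsKeyId": kms_key_arn,
    }

# boto3.client("secretsmanager").create_secret(
#     **vector_db_secret_request("kb/vector-db", "kb_user", "example-pw",
#                                "arn:aws:kms:us-east-1:111122223333:key/example"))
```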
Security considerations
Generative AI RAG workloads face unique risks, including data exfiltration of RAG data sources. Another risk is indirect prompt injection attacks where threat actors insert malicious documents into the knowledge base to manipulate model outputs.
Amazon Bedrock knowledge bases provide security controls for data protection, access control, network security, logging and monitoring, and metadata filtering for secure retrieval. These controls address data exfiltration and unauthorized access risks. To mitigate indirect prompt injection attacks, implement input validation and content filtering on documents before ingestion.
Remediations
This section reviews the AWS services and features that address the risks that are specific to this capability.
Data protection
Encrypt your knowledge base data in transit and at rest using an AWS Key Management Service (AWS KMS) customer managed key. When you configure a data ingestion job for your knowledge base, encrypt the job with a customer managed key. If you let Amazon Bedrock create a vector store in Amazon OpenSearch Service for your knowledge base, Amazon Bedrock passes an AWS KMS key of your choice to OpenSearch Service for encryption.
You can encrypt sessions in which you generate responses from querying a knowledge base with an AWS KMS key. You store the data sources for your knowledge base in your Amazon S3 bucket. If you encrypt your data sources in Amazon S3 with a customer managed key, attach the required policies to your knowledge base service role.
If you configure vector stores with AWS Secrets Manager secrets, encrypt the secrets with customer managed keys and attach decryption permissions to the knowledge base service role. Ensure all data in transit uses TLS 1.2 or higher with secure cipher suites.
For more information and the policies to use, see Encryption of knowledge base resources in the Amazon Bedrock documentation.
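To enforce the customer managed key recommendation for the data source bucket, default bucket encryption can be sketched as follows; the bucket name and key ARN are hypothetical placeholders.

```python
def bucket_sse_kms_config(kms_key_arn: str) -> dict:
    """Build the ServerSideEncryptionConfiguration for PutBucketEncryption with SSE-KMS."""
    return {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": kms_key_arn,  # customer managed key, not the AWS managed default
            },
            "BucketKeyEnabled": True,  # S3 Bucket Keys reduce KMS request volume and cost
        }]
    }

# boto3.client("s3").put_bucket_encryption(
#     Bucket="kb-data-sources-example",
#     ServerSideEncryptionConfiguration=bucket_sse_kms_config(key_arn))
```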
Data classification and handling
Implement data classification schemes to categorize data based on sensitivity and criticality. Establish clear classification tiers (for example, Public, Internal, Confidential, and Restricted) with specific handling requirements for each level.
Classify data at the point of ingestion. Use automated tools like Amazon Macie to detect and classify sensitive data in Amazon S3 buckets that contain knowledge base data sources.
Use AWS resource tags to categorize sensitive data and monitor compliance with protection requirements. AWS Organizations tag policies enforce tagging standards across accounts.
Maintain a data catalog that maps data in your organization, its location, sensitivity level, and the controls in place to protect it. AWS Glue Data Catalog supports metadata storage and management.
Data lineage and provenance tracking
Implement comprehensive data provenance tracking to record the history of data as it progresses through your RAG workload.
Data lineage provides the following benefits:
- Regulatory compliance – Demonstrates data handling practices for audits and certifications
- Troubleshooting – Enables root cause analysis when data quality issues arise
- Security investigations – Provides audit trails during security incidents
- Data quality – Ensures confidence in data origin, transformations, and ownership
- Impact analysis – Identifies downstream effects of data changes
Implementation approaches for data provenance tracking include the following:
- AWS Glue Data Catalog – Store metadata and track lineage across data processing pipelines.
- Amazon SageMaker ML Lineage Tracking – Track model training data, hyperparameters, and deployment artifacts.
- AWS CloudTrail – Capture API activities across AI services for audit trails.
- Amazon CloudWatch – Monitor data quality, usage, and model drift with generative AI-driven debugging and root cause analysis.
- Third-party integration – Support open telemetry with integration to third-party observability tools.
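The CloudTrail approach above can be sketched as an audit query for Amazon Bedrock API activity over a recent window; the lookup attribute value is the Bedrock event source.

```python
from datetime import datetime, timedelta, timezone

def bedrock_audit_query(hours_back: int = 24) -> dict:
    """Build a CloudTrail LookupEvents payload for recent Amazon Bedrock API calls."""
    end = datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": "bedrock.amazonaws.com"}
        ],
        "StartTime": end - timedelta(hours=hours_back),
        "EndTime": end,
    }

# events = boto3.client("cloudtrail").lookup_events(**bedrock_audit_query())
# Each entry in events["Events"] records who called which Bedrock API, and when.
```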
Identity and access management
Create a custom service role for knowledge bases for Amazon Bedrock following the principle of least privilege. Create a trust relationship that allows Amazon Bedrock to assume this role so that it can create and manage knowledge bases on your behalf.
Attach identity policies to the custom knowledge base service role that grant permissions to access Amazon Bedrock models, data sources in Amazon S3, vector databases, and encryption keys. For the complete list of required permissions, see Create a service role for Amazon Bedrock Knowledge Bases in the Amazon Bedrock documentation.
Knowledge bases support security configurations to set up data access policies for your knowledge base and network access policies for your private Amazon OpenSearch Serverless knowledge base. For more information, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases in the Amazon Bedrock documentation.
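A sketch of the trust policy for the custom service role follows, using the aws:SourceAccount and aws:SourceArn conditions to prevent the confused deputy problem. The account ID and Region are hypothetical placeholders.

```python
import json

def kb_service_role_trust_policy(account_id: str, region: str) -> dict:
    """Trust policy allowing Amazon Bedrock to assume the knowledge base service role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                # Scope the trust to your account and to knowledge base ARNs only.
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {"aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:knowledge-base/*"},
            },
        }],
    }

# boto3.client("iam").create_role(
#     RoleName="BedrockKBServiceRole",  # placeholder name
#     AssumeRolePolicyDocument=json.dumps(kb_service_role_trust_policy("111122223333", "us-east-1")))
```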
Metadata filtering for secure retrieval
Amazon Bedrock Knowledge Bases supports metadata filtering, which enables fine-grained access control for RAG systems. By attaching metadata as key-value pairs to each vector during ingestion, you can do the following:
- Filter queries – Filter queries based on user attributes such as department, role, or clearance level. For example, metadata can include {"department": "finance", "classification": "confidential"} to restrict access to financial data.
- Enforce data classification policies – Tag vectors with sensitivity levels (public, internal, confidential, and restricted) and filter based on user permissions.
- Support multi-tenant architectures – Use metadata to isolate data between different tenants or business units, ensuring data segregation in shared infrastructure.
- Enable temporal access controls – Include timestamp metadata to implement time-based access restrictions or data retention policies.
The application or agent is responsible for attaching the correct metadata filter to each Amazon Bedrock API call so that results are restricted to the required key-value pairs.
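A sketch of such a filtered Retrieve call follows, using the department and classification example from the list above; the knowledge base ID is a hypothetical placeholder.

```python
def filtered_retrieve_request(kb_id: str, query: str, department: str, classification: str) -> dict:
    """Build a Retrieve payload that pre-filters the vector store on document metadata."""
    return {
        "knowledgeBaseId": kb_id,  # placeholder
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                "filter": {
                    # Both conditions must match; derive the values from the caller's
                    # verified identity, never from user-supplied input.
                    "andAll": [
                        {"equals": {"key": "department", "value": department}},
                        {"equals": {"key": "classification", "value": classification}},
                    ]
                },
            }
        },
    }

# boto3.client("bedrock-agent-runtime").retrieve(
#     **filtered_retrieve_request("KB12345", "Q3 revenue summary", "finance", "confidential"))
```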
Input and output validation
Input validation protects Amazon Bedrock knowledge bases from malicious content. Use malware protection in Amazon S3 to scan files for malicious content before uploading them to a data source. For an example implementation, see Integrating Malware Scanning into Your Data Ingestion Pipeline with Antivirus for Amazon S3.
Use Amazon Comprehend to detect and redact sensitive information in documents before indexing them in your RAG knowledge base. For an example implementation, see Protect sensitive data in RAG applications with Amazon Bedrock.
Use Amazon Macie to detect and generate alerts on potential sensitive data in Amazon S3 data sources to enhance security and compliance.
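The Comprehend redaction step can be sketched as follows. The helper takes the Entities list returned by the real DetectPiiEntities API and masks each span; the sample text and entity offsets are illustrative.

```python
def redact_pii(text: str, entities: list) -> str:
    """Mask PII spans in text.

    entities is the "Entities" list returned by
    boto3.client("comprehend").detect_pii_entities(Text=text, LanguageCode="en"),
    where each entry carries Type, BeginOffset, and EndOffset.
    """
    # Replace spans from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + "[" + e["Type"] + "]" + text[e["EndOffset"]:]
    return text

# Example with a hand-written entity (normally produced by Comprehend):
# redact_pii("Contact Jane at jane@example.com",
#            [{"Type": "EMAIL", "BeginOffset": 16, "EndOffset": 32}])
# -> "Contact Jane at [EMAIL]"
```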
Recommended AWS services
This section discusses the AWS services that are recommended to build this capability securely. In addition to the services in this section, use Amazon CloudWatch and AWS CloudTrail as explained in Capability 2.
Amazon OpenSearch Serverless
Amazon OpenSearch Serverless is an on-demand, auto-scaling configuration for Amazon OpenSearch Service. An OpenSearch Serverless collection is an OpenSearch cluster that scales compute capacity based on your application's needs. Amazon Bedrock knowledge bases use OpenSearch Serverless collections to store embeddings.
Implement authentication and authorization for your OpenSearch Serverless vector store following the principle of least privilege. With data access control in OpenSearch Serverless, you can allow users to access collections and indexes regardless of their access mechanisms or network sources. Enforce per-user access permissions at the generative AI application layer.
OpenSearch Serverless supports server-side encryption with AWS KMS to protect data at rest. Use a customer managed key to encrypt that data. To allow the creation of an AWS KMS key for transient data storage during data ingestion, attach a policy to your knowledge bases for the Amazon Bedrock service role.
Private access can apply to OpenSearch Serverless-managed VPC endpoints, supported AWS services such as Amazon Bedrock, or both. Use AWS PrivateLink to create a private connection between your VPC and OpenSearch Serverless endpoint services. Use network policy rules to specify Amazon Bedrock access.
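A sketch of such a network policy follows, denying public access to the collection and allowing only Amazon Bedrock as a source service. The collection name is a hypothetical placeholder.

```python
import json

def kb_network_policy(collection_name: str) -> dict:
    """Build a CreateSecurityPolicy payload restricting a collection to Amazon Bedrock."""
    policy = [{
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}
        ],
        "AllowFromPublic": False,                     # no access from the public internet
        "SourceServices": ["bedrock.amazonaws.com"],  # allow the Amazon Bedrock service only
    }]
    return {
        "name": f"{collection_name}-network",  # placeholder policy name
        "type": "network",
        "policy": json.dumps(policy),
    }

# boto3.client("opensearchserverless").create_security_policy(**kb_network_policy("kb-vectors"))
```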
Monitor OpenSearch Serverless using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. OpenSearch Serverless integrates with AWS CloudTrail, which captures API calls for OpenSearch Serverless as events. OpenSearch Service integrates with Amazon EventBridge to notify you of events that affect your domains.
Amazon S3
Store your data sources for your knowledge base in an Amazon S3 bucket. If you encrypted your data sources in Amazon S3 using a customer managed AWS KMS key (recommended), attach a policy to your knowledge base service role.
Use malware protection in Amazon S3 to scan objects for malicious content before they are ingested into your knowledge base.
For additional network security hardening, create a gateway endpoint for the S3 buckets that the VPC environment accesses. Log and monitor all access. Enable versioning if you have a business need to retain the history of Amazon S3 objects. Apply object-level immutability with Amazon S3 Object Lock. Use resource-based policies to control access to your Amazon S3 files.
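The Object Lock recommendation above can be sketched as follows; note that Object Lock requires bucket versioning, and the bucket name and retention period are hypothetical placeholders.

```python
def object_lock_config(retention_days: int) -> dict:
    """Build the ObjectLockConfiguration for PutObjectLockConfiguration."""
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                # COMPLIANCE mode prevents any user, including root, from
                # shortening or removing the retention period.
                "Mode": "COMPLIANCE",
                "Days": retention_days,
            }
        },
    }

# boto3.client("s3").put_object_lock_configuration(
#     Bucket="kb-data-sources-example",
#     ObjectLockConfiguration=object_lock_config(365))
```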
Amazon Comprehend
Amazon Comprehend uses natural language processing (NLP) to extract insights from document content. You can use Amazon Comprehend to detect and redact PII entities in English or Spanish text documents.
Integrate Amazon Comprehend into your data ingestion pipeline so that sensitive information is detected and redacted before documents are indexed.
With Amazon S3, you can encrypt your input documents when creating a text analysis, topic modeling, or custom Amazon Comprehend job. Amazon Comprehend integrates with AWS KMS to encrypt the data in the storage volume for Start* and Create* jobs, and it encrypts the output results of Start* jobs by using a customer managed key.

Use the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Comprehend gives another service to the resource. Use AWS PrivateLink to create a private connection between your virtual private cloud (VPC) and Amazon Comprehend endpoint services. Implement identity-based policies for Amazon Comprehend with the principle of least privilege.
Amazon Comprehend integrates with AWS CloudTrail, which captures API calls for Amazon Comprehend as events.
Amazon Macie
Macie identifies sensitive data in your knowledge bases that is stored as data sources, model invocation logs, and prompt stores in Amazon S3 buckets. For Macie security best practices, see the Amazon Macie section in Capability 2.
AWS KMS
Use AWS Key Management Service (AWS KMS) customer managed keys to encrypt the following:
- Data ingestion jobs for your knowledge base
- Amazon OpenSearch Service vector database
- Sessions in which you generate responses from querying a knowledge base
- Model invocation logs in Amazon S3
- The Amazon S3 bucket that hosts the data sources