Capability 3. Providing secure access to data and systems for generative AI - AWS Prescriptive Guidance

Capability 3. Providing secure access to data and systems for generative AI

Retrieval Augmented Generation (RAG) is a foundational pattern that enhances large language model (LLM) responses by retrieving information from external knowledge bases before generating answers. This approach addresses a core limitation of foundation models (FMs): They are trained on data with a fixed knowledge cutoff and lack access to current enterprise data, such as customer records, product catalogs, internal documentation, and data in business systems.

RAG enables the LLM to provide up-to-date, context-specific responses by dynamically pulling relevant information from enterprise data sources. However, this integration introduces critical security challenges. Securing RAG implementations requires extending defense-in-depth principles from Capability 1 and Capability 2 to address how LLMs securely use data from external sources. The following diagram illustrates recommended AWS services for the Generative AI account RAG capability.

AWS services for the Generative AI account RAG capability.

The Generative AI account includes services for storing embeddings in a vector database, storing conversations for users, and maintaining a prompt store. The account includes security services to implement security guardrails and centralized security governance. Create Amazon Simple Storage Service (Amazon S3) gateway endpoints for the model invocation logs, prompt store, and knowledge base data source buckets in Amazon S3 that the VPC environment accesses. Create an Amazon CloudWatch Logs interface endpoint for the CloudWatch logs that the VPC environment accesses.

Rationale

RAG enhances FM responses by retrieving information from external, authoritative knowledge bases before generating answers. This approach overcomes FM limitations by providing access to up-to-date, context-specific data, improving the accuracy and relevance of generated responses.

RAG can be implemented across Scopes 2 through 5 of the Generative AI Security Scoping Matrix. Scope 2 applications are scenarios where organizations use third-party AI services (such as Salesforce Einstein or ChatGPT) in which the service provider controls both the FM and the application layer. You control only the prompts and customer data that you provide to the service. You can enhance responses from third-party enterprise applications by implementing RAG to extract information from internal data, which augments the queries that the third-party application processes. In Scope 2, you implement RAG either by connecting to your organization's data sources or by uploading and referencing custom documents.

In Scope 3, you build a generative AI application using a pre-trained FM such as those offered on Amazon Bedrock. You control your application and any customer data your application uses. The FM provider controls the pre-trained model and its training data.

RAG systems face the following unique security risks:

  • Data exfiltration of RAG data sources by threat actors

  • Poisoning of RAG data sources with prompt injections or malware

  • Unauthorized access to sensitive information through inadequate access controls

  • Sensitive information disclosure through uncontrolled model outputs

  • Lack of data provenance leading to compliance and auditability challenges

Design considerations

Avoid customizing an FM with sensitive data (for more information, see Capability 2). Instead, use the RAG technique to interact with sensitive information. RAG provides the following advantages:

  • Tighter control and visibility – Keep sensitive data separate from the model. You can edit, update, or remove data without retraining the model, ensuring data governance and compliance with regulatory requirements.

  • Reduced sensitive information disclosure – RAG controls interactions with sensitive data during model invocation. This reduces the risk of unintended disclosure that occurs when you incorporate data directly into the model's parameters.

  • Flexibility and adaptability – Update or modify sensitive information as data requirements or regulations change without retraining or rebuilding the language model.

  • Enhanced security posture – Implement multiple security layers including metadata filtering, access controls, and data redaction at different stages of the RAG pipeline.

Multi-layered security strategy

Implement a defense-in-depth approach with security controls at the following stages:

  • Ingestion time – Filter and validate data before it enters the knowledge base.

  • Storage level – Encrypt data at rest and implement access controls.

  • Retrieval time – Apply metadata filtering and role-based access controls.

  • Inference time – Use guardrails to filter model outputs and detect sensitive information.

Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases provides a fully managed solution for building RAG applications by securely connecting FMs to your organization's data. This service uses vector stores (such as Amazon OpenSearch Serverless) to retrieve relevant information efficiently. The FM uses this information to generate responses. Amazon Bedrock synchronizes your data from Amazon S3 to the knowledge base and generates embeddings for efficient retrieval.

Key features of Amazon Bedrock Knowledge Bases include the following:

  • Source attribution – Knowledge bases include source attribution for all retrieved information to improve transparency and minimize hallucinations. This provenance tracking enables you to:

    • Verify the accuracy of generated responses.

    • Maintain audit trails for compliance.

    • Build user trust in AI-generated content.

    • Support troubleshooting and investigations during security events.

  • Automated vector store management – Amazon Bedrock automatically creates and manages vector stores in OpenSearch Serverless, synchronizing data from Amazon S3 and generating embeddings for efficient retrieval.

  • Metadata filtering – Knowledge bases support metadata filtering capabilities that enable access control by pre-filtering the vector store based on document metadata before searching for relevant documents. This filtering reduces noise, improves retrieval accuracy, and enforces data access policies.

  • Multimodal support – Knowledge bases process documents with visual resources, extracting and retrieving images in responses to queries, which supports comprehensive document understanding.
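The managed retrieval-and-generation flow described above can be sketched with the bedrock-agent-runtime API. This is a minimal sketch, assuming the knowledge base ID, model ARN, and Region shown are placeholders that you replace with your own resources.

```python
# Placeholder identifiers -- substitute your own knowledge base ID and model ARN.
KNOWLEDGE_BASE_ID = "KBEXAMPLE01"
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-haiku-20240307-v1:0"
)

def build_rag_request(query: str) -> dict:
    """Build a RetrieveAndGenerate request that grounds the model in the knowledge base."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    }

def query_knowledge_base(query: str) -> dict:
    """Call Amazon Bedrock. Requires boto3 and AWS credentials at runtime."""
    import boto3  # imported here so the module loads without the AWS SDK installed

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**build_rag_request(query))
    # Each citation in the response carries source attribution for the
    # retrieved passages, which supports the provenance benefits listed above.
    return response
```

The response's citations list is what enables the source attribution, audit trail, and troubleshooting use cases described in the feature list.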


Security considerations

Generative AI RAG workloads face unique risks, including data exfiltration of RAG data sources. Another risk is indirect prompt injection attacks where threat actors insert malicious documents into the knowledge base to manipulate model outputs.

Amazon Bedrock knowledge bases provide security controls for data protection, access control, network security, logging and monitoring, and metadata filtering for secure retrieval. These controls address data exfiltration and unauthorized access risks. To mitigate indirect prompt injection attacks, implement input validation and content filtering on documents before ingestion.

Remediations

This section reviews the AWS services and features that address the risks that are specific to this capability.

Data protection

Encrypt your knowledge base data in transit and at rest using an AWS Key Management Service (AWS KMS) customer managed key. When you configure a data ingestion job for your knowledge base, encrypt the job with a customer managed key. If you let Amazon Bedrock create a vector store in Amazon OpenSearch Service for your knowledge base, Amazon Bedrock passes an AWS KMS key of your choice to OpenSearch Service for encryption.

You can encrypt sessions in which you generate responses from querying a knowledge base with an AWS KMS key. You store the data sources for your knowledge base in your Amazon S3 bucket. If you encrypt your data sources in Amazon S3 with a customer managed key, attach the required policies to your knowledge base service role.

If you configure vector stores with AWS Secrets Manager secrets, encrypt the secrets with customer managed keys and attach decryption permissions to the knowledge base service role. Ensure all data in transit uses TLS 1.2 or higher with secure cipher suites.

For more information and the policies to use, see Encryption of knowledge base resources in the Amazon Bedrock documentation.
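The encryption settings above can be supplied when you create the knowledge base data source. The following is a sketch using the bedrock-agent CreateDataSource API; the KMS key ARN and bucket ARN are placeholders.

```python
# Placeholder ARNs -- replace with your own resources.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
BUCKET_ARN = "arn:aws:s3:::example-kb-data-source"

def build_data_source_request(knowledge_base_id: str) -> dict:
    """Request body that encrypts transient ingestion data with a customer managed key."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "name": "encrypted-s3-source",
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": BUCKET_ARN},
        },
        # Customer managed key for transient data during ingestion jobs.
        "serverSideEncryptionConfiguration": {"kmsKeyArn": KMS_KEY_ARN},
    }

def create_data_source(knowledge_base_id: str) -> dict:
    """Create the data source. Requires boto3 and AWS credentials at runtime."""
    import boto3

    return boto3.client("bedrock-agent").create_data_source(
        **build_data_source_request(knowledge_base_id)
    )
```

The knowledge base service role must hold kms:Decrypt and kms:GenerateDataKey permissions on the key for ingestion jobs to succeed.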

Data classification and handling

Implement data classification schemes to categorize data based on sensitivity and criticality. Establish clear classification tiers (for example, Public, Internal, Confidential, and Restricted) with specific handling requirements for each level.

Classify data at the point of ingestion. Use automated tools like Amazon Macie to detect and classify sensitive data in Amazon S3 buckets that contain knowledge base data sources.
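As a sketch, an on-demand Macie classification job over a knowledge base source bucket might look like the following; the account ID and bucket name are placeholders.

```python
def build_macie_job(account_id: str, bucket: str) -> dict:
    """One-time Macie classification job over a knowledge base source bucket."""
    return {
        "jobType": "ONE_TIME",
        "name": f"kb-sensitive-data-scan-{bucket}",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket]}
            ]
        },
    }

def start_macie_job(account_id: str, bucket: str) -> str:
    """Start the job. Requires boto3, AWS credentials, and Macie enabled in the account."""
    import boto3

    resp = boto3.client("macie2").create_classification_job(
        **build_macie_job(account_id, bucket)
    )
    return resp["jobId"]
```

For continuous coverage, a scheduled job (jobType SCHEDULED with a scheduleFrequency) can re-scan buckets as new documents are ingested.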

Use AWS resource tags to categorize sensitive data and monitor compliance with protection requirements. AWS Organizations tag policies enforce tagging standards across accounts.

Maintain a data catalog that maps data in your organization, its location, sensitivity level, and the controls in place to protect it. AWS Glue Data Catalog supports metadata storage and management.

Data lineage and provenance tracking

Implement comprehensive data provenance tracking to record the history of data as it progresses through your RAG workload.

Data lineage provides the following benefits:

  • Regulatory compliance – Demonstrates data handling practices for audits and certifications

  • Troubleshooting – Enables root cause analysis when data quality issues arise

  • Security investigations – Provides audit trails during security incidents

  • Data quality – Ensures confidence in data origin, transformations, and ownership

  • Impact analysis – Identifies downstream effects of data changes

Implementation approaches for data provenance tracking include the following:

  • AWS Glue Data Catalog – Store metadata and track lineage across data processing pipelines.

  • Amazon SageMaker ML Lineage Tracking – Track model training data, hyperparameters, and deployment artifacts.

  • AWS CloudTrail – Capture API activities across AI services for audit trails.

  • Amazon CloudWatch – Monitor data quality, usage, and model drift with generative AI-driven debugging and root cause analysis.

  • Third-party integration – Support open telemetry with integration to third-party observability tools.

Identity and access management

Create a custom service role for knowledge bases for Amazon Bedrock following the principle of least privilege. Create a trust relationship that allows Amazon Bedrock to assume this role so that it can create and manage knowledge bases on your behalf.

Attach identity policies to the custom knowledge base service role that grant permissions to access Amazon Bedrock models, data sources in Amazon S3, vector databases, and encryption keys. For the complete list of required permissions, see Create a service role for Amazon Bedrock Knowledge Bases in the Amazon Bedrock documentation.
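A least-privilege trust policy for this service role can be sketched as follows. The account ID and Region are placeholders; the aws:SourceAccount and aws:SourceArn condition keys scope the trust to knowledge bases in your own account, which guards against the confused-deputy problem.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID
REGION = "us-east-1"         # placeholder Region

# Trust policy that lets only Amazon Bedrock in this account assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": ACCOUNT_ID},
                "ArnLike": {
                    "aws:SourceArn": (
                        f"arn:aws:bedrock:{REGION}:{ACCOUNT_ID}:knowledge-base/*"
                    )
                },
            },
        }
    ],
}

def create_service_role(role_name: str) -> dict:
    """Create the role. Requires boto3 and AWS credentials at runtime."""
    import boto3

    return boto3.client("iam").create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
```

Identity policies granting access to the embedding model, the S3 data source, the vector store, and the KMS keys are then attached to this role as described in the linked documentation.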

Knowledge bases support security configurations to set up data access policies for your knowledge base and network access policies for your private Amazon OpenSearch Serverless knowledge base. For more information, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases in the Amazon Bedrock documentation.

Metadata filtering for secure retrieval

Amazon Bedrock Knowledge Bases supports metadata filtering to refine and secure contextual retrieval from vector stores. For every document added, you can supply metadata files (up to 10 KB each) with attributes such as tags, dates, project IDs, and business units.

Metadata filtering enables fine-grained access control for RAG systems. By attaching metadata as key-value pairs to each vector during ingestion, you can do the following:

  • Filter queries – Filter queries based on user attributes such as department, role, or clearance level. For example, metadata can include {"department": "finance", "classification": "confidential"} to restrict access to financial data.

  • Enforce data classification policies – Tag vectors with sensitivity levels (public, internal, confidential, and restricted) and filter based on user permissions.

  • Support multi-tenant architectures – Use metadata to isolate data between different tenants or business units, ensuring data segregation in shared infrastructure.

  • Enable temporal access controls – Include timestamp metadata to implement time-based access restrictions or data retention policies.

The application or agent is responsible for adding the correct metadata filter to each Amazon Bedrock API call so that results are filtered on the required key-value pairs.
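Applying such a filter in a Retrieve call can be sketched as follows, using the department and classification attributes from the earlier example. The filter keys and values are placeholders that must match the metadata you attached at ingestion.

```python
def build_metadata_filter(department: str, classification: str) -> dict:
    """Combine user attributes into a vector search pre-filter (andAll syntax)."""
    return {
        "andAll": [
            {"equals": {"key": "department", "value": department}},
            {"equals": {"key": "classification", "value": classification}},
        ]
    }

def retrieve_with_filter(kb_id: str, query: str, user_filter: dict) -> dict:
    """Query the knowledge base. Requires boto3 and AWS credentials at runtime."""
    import boto3

    return boto3.client("bedrock-agent-runtime").retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                # The filter is applied before the similarity search, so
                # out-of-scope documents are never returned to the model.
                "filter": user_filter,
            }
        },
    )
```

Because the application supplies the filter, derive its values from the authenticated user's verified identity attributes, never from user-supplied input.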

Input and output validation

Input validation protects Amazon Bedrock knowledge bases from malicious content. Use malware protection in Amazon S3 to scan files for malicious content before uploading them to a data source. For an example implementation, see Integrating Malware Scanning into Your Data Ingestion Pipeline with Antivirus for Amazon S3 (AWS Blog post).

Use Amazon Comprehend to detect and redact sensitive information in documents before indexing them in your RAG knowledge base. For an example implementation, see Protect sensitive data in RAG applications with Amazon Bedrock (AWS blog post). For more information, see Detecting PII entities in the Amazon Comprehend documentation.
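The detect-then-redact step can be sketched as follows. The DetectPiiEntities response reports character offsets, so replacing spans from right to left keeps the remaining offsets valid; the confidence threshold of 0.9 is an assumed value to tune for your data.

```python
def redact_pii(text: str, entities: list[dict], min_score: float = 0.9) -> str:
    """Replace detected PII spans with [TYPE] placeholders.

    Works right to left so earlier offsets stay valid after each replacement.
    """
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] >= min_score:
            text = (
                text[: ent["BeginOffset"]]
                + f"[{ent['Type']}]"
                + text[ent["EndOffset"]:]
            )
    return text

def detect_and_redact(text: str) -> str:
    """Run Comprehend PII detection. Requires boto3 and AWS credentials at runtime."""
    import boto3

    resp = boto3.client("comprehend").detect_pii_entities(
        Text=text, LanguageCode="en"
    )
    return redact_pii(text, resp["Entities"])
```

Run this step in the ingestion pipeline before documents are written to the knowledge base data source bucket, so that unredacted text never reaches the vector store.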

Use Amazon Macie to detect and generate alerts on potential sensitive data in Amazon S3 data sources to enhance security and compliance.

Recommended AWS services

This section discusses the AWS services that are recommended to build this capability securely. In addition to the services in this section, use Amazon CloudWatch and AWS CloudTrail as explained in Capability 2.

Amazon OpenSearch Serverless

Amazon OpenSearch Serverless is an on-demand, auto-scaling configuration for Amazon OpenSearch Service. An OpenSearch Serverless collection is an OpenSearch cluster that scales compute capacity based on your application's needs. Amazon Bedrock knowledge bases use OpenSearch Serverless for embeddings and Amazon S3 for the data sources that sync with the OpenSearch Serverless vector index.

Implement authentication and authorization for your OpenSearch Serverless vector store following the principle of least privilege. With data access control in OpenSearch Serverless, you can allow users to access collections and indexes regardless of their access mechanisms or network sources. Access permissions are enforced at the generative AI application layer.

OpenSearch Serverless supports server-side encryption with AWS KMS to protect data at rest. Use a customer managed key to encrypt that data. To allow the use of an AWS KMS key for transient data storage during data ingestion, attach the required policy to your Amazon Bedrock knowledge base service role.

Private access can apply to OpenSearch Serverless-managed VPC endpoints, supported AWS services such as Amazon Bedrock, or both. Use AWS PrivateLink to create a private connection between your VPC and OpenSearch Serverless endpoint services. Use network policy rules to specify Amazon Bedrock access.
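A network policy combining both private access paths can be sketched as follows; the collection name and VPC endpoint ID are placeholders. The policy blocks public access and allows traffic only from your VPC endpoint and from Amazon Bedrock.

```python
import json

# Placeholder collection name and VPC endpoint ID -- replace with your own.
network_policy = [
    {
        "Rules": [
            {"ResourceType": "collection", "Resource": ["collection/kb-vectors"]}
        ],
        "AllowFromPublic": False,               # no public network access
        "SourceVPCEs": ["vpce-0123456789abcdef0"],  # your PrivateLink endpoint
        "SourceServices": ["bedrock.amazonaws.com"],  # allow Amazon Bedrock
    }
]

def apply_network_policy(name: str) -> dict:
    """Create the policy. Requires boto3 and AWS credentials at runtime."""
    import boto3

    return boto3.client("opensearchserverless").create_security_policy(
        name=name,
        type="network",
        policy=json.dumps(network_policy),
    )
```

Pair this network policy with a data access policy that grants the knowledge base service role only the index permissions it needs on the collection.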

Monitor OpenSearch Serverless using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. OpenSearch Serverless integrates with AWS CloudTrail, which captures API calls for OpenSearch Serverless as events. OpenSearch Service integrates with Amazon EventBridge to notify you of events that affect your domains.

Amazon S3

Store your data sources for your knowledge base in an Amazon S3 bucket. If you encrypted your data sources in Amazon S3 using a customer managed AWS KMS key (recommended), attach a policy to your knowledge base service role.

Use malware protection in Amazon S3 to scan files for malicious content before uploading them to a data source. Host your model invocation logs and commonly used prompts as a prompt store in Amazon S3. Encrypt all buckets with a customer managed key.

For additional network security hardening, create a gateway endpoint for the S3 buckets that the VPC environment accesses. Log and monitor all access. Enable versioning if you have a business need to retain the history of Amazon S3 objects. Apply object-level immutability with Amazon S3 Object Lock. Use resource-based policies to control access to your Amazon S3 files.
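Two of the bucket-level controls above, versioning and default encryption with a customer managed key, can be sketched as follows; the KMS key ARN is a placeholder. Note that S3 Object Lock must be enabled when the bucket is created and cannot be added afterward.

```python
# Placeholder KMS key ARN -- replace with your customer managed key.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

# Default encryption rule: SSE-KMS with a customer managed key.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY_ARN,
            },
            "BucketKeyEnabled": True,  # S3 Bucket Keys reduce KMS request volume
        }
    ]
}

def harden_bucket(bucket: str) -> None:
    """Apply versioning and default encryption. Requires boto3 and credentials."""
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    s3.put_bucket_encryption(
        Bucket=bucket, ServerSideEncryptionConfiguration=encryption_config
    )
```

Combine these settings with bucket policies that restrict access to the knowledge base service role and the VPC gateway endpoint.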

Amazon Comprehend

Amazon Comprehend uses natural language processing (NLP) to extract insights from document content. You can use Amazon Comprehend to detect and redact PII entities in English or Spanish text documents.

Integrate Amazon Comprehend into your data ingestion pipeline to automatically detect and redact PII entities from documents before you index them in your RAG knowledge base. This approach helps to ensure compliance and protects user privacy. Depending on the document types, you can use Amazon Textract to extract and send text to Amazon Comprehend for analysis and redaction.

With Amazon S3, you can encrypt your input documents when creating a text analysis, topic modeling, or custom Amazon Comprehend job. Amazon Comprehend integrates with AWS KMS to encrypt the data in the storage volume for Start* and Create* jobs. Amazon Comprehend encrypts the output results of Start* jobs by using a customer managed key.

Use the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Comprehend gives another service to the resource. Use AWS PrivateLink to create a private connection between your virtual private cloud (VPC) and Amazon Comprehend endpoint services. Implement identity-based policies for Amazon Comprehend with the principle of least privilege.

Amazon Comprehend integrates with AWS CloudTrail, which captures API calls for Amazon Comprehend as events.

Amazon Macie

Macie identifies sensitive data in your knowledge bases that is stored as data sources, model invocation logs, and prompt stores in Amazon S3 buckets. For Macie security best practices, see the Amazon Macie section in Capability 2.

AWS KMS

Use AWS Key Management Service (AWS KMS) customer managed keys to encrypt the following: