Complete embeddings request and response schema

Complete synchronous schema

{ "schemaVersion": "nova-multimodal-embed-v1", "taskType": "SINGLE_EMBEDDING", "singleEmbeddingParams": { "embeddingPurpose": "GENERIC_INDEX" | "GENERIC_RETRIEVAL" | "TEXT_RETRIEVAL" | "IMAGE_RETRIEVAL" | "VIDEO_RETRIEVAL" | "DOCUMENT_RETRIEVAL" | "AUDIO_RETRIEVAL" | "CLASSIFICATION" | "CLUSTERING", "embeddingDimension": 256 | 384 | 1024 | 3072, "text": { "truncationMode": "START" | "END" | "NONE", "value": string, "source": SourceObject, }, "image": { "detailLevel": "STANDARD_IMAGE" | "DOCUMENT_IMAGE", "format": "png" | "jpeg" | "gif" | "webp", "source": SourceObject }, "audio": { "format": "mp3" | "wav" | "ogg", "source": SourceObject }, "video": { "format": "mp4" | "mov" | "mkv" | "webm" | "flv" | "mpeg" | "mpg" | "wmv" | "3gp", "source": SourceObject, "embeddingMode": "AUDIO_VIDEO_COMBINED" | "AUDIO_VIDEO_SEPARATE" } } }

The following list includes all of the parameters for the request:

  • schemaVersion (Optional) - The schema version for the multimodal embedding model request

    • Type: string

    • Allowed values: "nova-multimodal-embed-v1"

    • Default: "nova-multimodal-embed-v1"

  • taskType (Required) - Specifies the type of embedding operation to perform on the input content. "SINGLE_EMBEDDING" generates one embedding per model input. "SEGMENTED_EMBEDDING" first segments the model input according to your segmentation configuration and then generates one embedding per segment.

    • Type: string

    • Allowed values: Must be "SINGLE_EMBEDDING" for synchronous calls.

  • singleEmbeddingParams (Required)

    • embeddingPurpose (Required) - Nova Multimodal Embeddings lets you optimize your embeddings for the intended application. Examples include MM-RAG, Digital Asset Management for image and video search, similarity comparison of multimodal content, and document classification for Intelligent Document Processing. Use embeddingPurpose to specify the embedding use case, selecting the value that matches your scenario below.

      • Search and Retrieval: Embedding use cases like RAG and search involve two main steps: first, creating an index by generating embeddings for the content, and second, retrieving the most relevant content from the index during search. Use the following values when working with search and retrieval use cases:

        • Indexing:

          • "GENERIC_INDEX" - Creates embeddings optimized for use as indexes in a vector data store. This value should be used irrespective of the modality you are indexing.

        • Search/retrieval: Optimize your embeddings depending on the type of content you are retrieving:

          • "TEXT_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only text embeddings.

          • "IMAGE_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only image embeddings created with the "STANDARD_IMAGE" detailLevel.

          • "VIDEO_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only video embeddings or embeddings created with the "AUDIO_VIDEO_COMBINED" embedding mode.

          • "DOCUMENT_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only document image embeddings created with the "DOCUMENT_IMAGE" detailLevel.

          • "AUDIO_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only audio embeddings.

          • "GENERIC_RETRIEVAL" - Creates embeddings optimized for searching a repository containing mixed modality embeddings.

        • Example: In an image search app where users retrieve images using text queries, use embeddingPurpose = "GENERIC_INDEX" when creating the embedding index from the images, and use embeddingPurpose = "IMAGE_RETRIEVAL" when creating the embedding of the text query used to retrieve the images.

      • "CLASSIFICATION" - Creates embeddings optimized for performing classification.

      • "CLUSTERING" - Creates embeddings optimized for clustering.

    • embeddingDimension (Optional) - The size of the vector to generate.

      • Type: int

      • Allowed values: 256 | 384 | 1024 | 3072

      • Default: 3072

    • text (Optional) - Represents text content. Exactly one of text, image, video, audio must be present.

      • truncationMode (Required) - Specifies which part of the text will be truncated in cases where the tokenized version of the text exceeds the maximum supported by the model.

        • Type: string

        • Allowed values:

          • "START" - Omit characters from the start of the text when necessary.

          • "END" - Omit characters from the end of the text when necessary.

          • "NONE" - Fail if text length exceeds the model's maximum token limit.

      • value (Optional; Either value or source must be provided) - The text value for which to create the embedding.

        • Type: string

        • Max length: 8192 characters

      • source (Optional; Either value or source must be provided) - Reference to a text file stored in S3. Note that the bytes option of the SourceObject is not applicable for text inputs. To pass text inline as part of the request, use the value parameter instead.

        • Type: SourceObject (see "Common Objects" section)

    • image (Optional) - Represents image content. Exactly one of text, image, video, audio must be present.

      • detailLevel (Optional) - Dictates the resolution at which the image is processed: "STANDARD_IMAGE" uses a lower resolution, while "DOCUMENT_IMAGE" uses a higher resolution to better interpret text in the image.

        • Type: string

        • Allowed values: "STANDARD_IMAGE" | "DOCUMENT_IMAGE"

        • Default: "STANDARD_IMAGE"

      • format (Required)

        • Type: string

        • Allowed values: "png" | "jpeg" | "gif" | "webp"

      • source (Required) - An image content source.

        • Type: SourceObject (see "Common Objects" section)

    • audio (Optional) - Represents audio content. Exactly one of text, image, video, audio must be present.

      • format (Required)

        • Type: string

        • Allowed values: "mp3" | "wav" | "ogg"

      • source (Required) - An audio content source.

        • Type: SourceObject (see "Common Objects" section)

        • Maximum audio duration: 30 seconds

    • video (Optional) - Represents video content. Exactly one of text, image, video, audio must be present.

      • format (Required)

        • Type: string

        • Allowed values: "mp4" | "mov" | "mkv" | "webm" | "flv" | "mpeg" | "mpg" | "wmv" | "3gp"

      • source (Required) - A video content source.

        • Type: SourceObject (see "Common Objects" section)

        • Maximum video duration: 30 seconds

      • embeddingMode (Required)

        • Type: string

        • Values: "AUDIO_VIDEO_COMBINED" | "AUDIO_VIDEO_SEPARATE"

          • "AUDIO_VIDEO_COMBINED" - Will produce a single embedding combining both audible and visual content.

          • "AUDIO_VIDEO_SEPARATE" - Will produce two embeddings, one for the audible content and one for the visual content.

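To make the request shape concrete, the following is a minimal sketch of a synchronous text-embedding call using the Bedrock Runtime InvokeModel API in Python (boto3). The text, dimension, and purpose values are illustrative; the model ID is the one used in the asynchronous example later on this page.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical request: embed a short text string for indexing.
request_body = {
    "schemaVersion": "nova-multimodal-embed-v1",
    "taskType": "SINGLE_EMBEDDING",
    "singleEmbeddingParams": {
        "embeddingPurpose": "GENERIC_INDEX",
        "embeddingDimension": 1024,
        "text": {
            "truncationMode": "END",
            "value": "Sunset over the mountains"
        }
    }
}

response = bedrock_runtime.invoke_model(
    modelId="amazon.nova-2-multimodal-embeddings-v1:0",
    body=json.dumps(request_body),
    contentType="application/json",
    accept="application/json",
)

At query time, the same request shape would be used with embeddingPurpose set to the retrieval value that matches the indexed content, for example "TEXT_RETRIEVAL".
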
InvokeModel Response Body

When InvokeModel returns a successful result, the body of the response will have the following structure:

{ "embeddings": [ { "embeddingType": "TEXT" | "IMAGE" | "VIDEO" | "AUDIO" | "AUDIO_VIDEO_COMBINED", "embedding": number[], "truncatedCharLength": int // Only included if text input was truncated } ] }

The following list includes all of the parameters for the response:

  • embeddings (Required) - For most requests, this array contains a single embedding. For video requests where the "AUDIO_VIDEO_SEPARATE" embeddingMode was selected, this array contains two embeddings: one for the video content and one for the audio content.

    • Type: array of embeddings with the following properties

      • embeddingType (Required) - Reports the type of embedding that was created.

        • Type: string

        • Allowed values: "TEXT" | "IMAGE" | "VIDEO" | "AUDIO" | "AUDIO_VIDEO_COMBINED"

      • embedding (Required) - The embedding vector.

        • Type: number[]

      • truncatedCharLength (Optional) - Only applies to text embedding requests. Returned if the tokenized version of the input text exceeded the model's limitations. The value indicates the character after which the text was truncated before generating the embedding.

        • Type: int

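Continuing the hypothetical synchronous call sketched above, the response body can be parsed like this:

# Parse the InvokeModel response body (continues the earlier sketch).
result = json.loads(response["body"].read())

for item in result["embeddings"]:
    vector = item["embedding"]                # list of floats
    print(item["embeddingType"], len(vector))
    if "truncatedCharLength" in item:
        print("Text was truncated at character", item["truncatedCharLength"])
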
Complete asynchronous schema

You can generate embeddings asynchronously using the Amazon Bedrock Runtime API operations StartAsyncInvoke, GetAsyncInvoke, and ListAsyncInvokes. You must use the asynchronous API if you want Nova Embeddings to segment long content, such as a long passage of text, or video and audio longer than 30 seconds.

When calling StartAsyncInvoke, you must provide the modelId, outputDataConfig, and modelInput parameters.

response = bedrock_runtime.start_async_invoke(
    modelId="amazon.nova-2-multimodal-embeddings-v1:0",
    outputDataConfig=output_data_config,   # structure described below
    modelInput=model_input                 # structure described below
)

outputDataConfig specifies the S3 bucket to which you'd like to save the generated output. It has the following structure:

{ "s3OutputDataConfig": { "s3Uri": "s3://your-s3-bucket" } }

The s3Uri is the S3 URI of the destination bucket. For additional optional parameters, see the StartAsyncInvoke documentation.

The following structure is used for the modelInput parameter.

{ "schemaVersion": "nova-multimodal-embed-v1", "taskType": "SEGMENTED_EMBEDDING", "segmentedEmbeddingParams": { "embeddingPurpose": "GENERIC_INDEX" | "GENERIC_RETRIEVAL" | "TEXT_RETRIEVAL" | "IMAGE_RETRIEVAL" | "VIDEO_RETRIEVAL" | "DOCUMENT_RETRIEVAL" | "AUDIO_RETRIEVAL" | "CLASSIFICATION" | "CLUSTERING", "embeddingDimension": 256 | 384 | 1024 | 3072, "text": { "truncationMode": "START" | "END" | "NONE", "value": string, "source": { "s3Location": { "uri": "s3://Your S3 Object" } }, "segmentationConfig": { "maxLengthChars": int } }, "image": { "format": "png" | "jpeg" | "gif" | "webp", "source": SourceObject, "detailLevel": "STANDARD_IMAGE" | "DOCUMENT_IMAGE" }, "audio": { "format": "mp3" | "wav" | "ogg", "source": SourceObject, "segmentationConfig": { "durationSeconds": int } }, "video": { "format": "mp4" | "mov" | "mkv" | "webm" | "flv" | "mpeg" | "mpg" | "wmv" | "3gp", "source": SourceObject, "embeddingMode": "AUDIO_VIDEO_COMBINED" | "AUDIO_VIDEO_SEPARATE", "segmentationConfig": { "durationSeconds": int } } } }

The following list includes all of the parameters for the request:

  • schemaVersion (Optional) - The schema version for the multimodal embedding model request

    • Type: string

    • Allowed values: "nova-multimodal-embed-v1"

    • Default: "nova-multimodal-embed-v1"

  • taskType (Required) - Specifies the type of embedding operation to perform on the input content. "SINGLE_EMBEDDING" generates one embedding per model input. "SEGMENTED_EMBEDDING" first segments the model input according to your segmentation configuration and then generates one embedding per segment.

    • Type: string

    • Allowed values: Must be "SEGMENTED_EMBEDDING" for asynchronous calls.

  • segmentedEmbeddingParams (Required)

    • embeddingPurpose (Required) - Nova Multimodal Embeddings lets you optimize your embeddings for the intended application. Examples include MM-RAG, Digital Asset Management for image and video search, similarity comparison of multimodal content, and document classification for Intelligent Document Processing. Use embeddingPurpose to specify the embedding use case, selecting the value that matches your scenario below.

      • Search and Retrieval: Embedding use cases like RAG and search involve two main steps: first, creating an index by generating embeddings for the content, and second, retrieving the most relevant content from the index during search. Use the following values when working with search and retrieval use cases:

        • Indexing:

          • "GENERIC_INDEX" - Creates embeddings optimized for use as indexes in a vector data store. This value should be used irrespective of the modality you are indexing.

        • Search/retrieval: Optimize your embeddings depending on the type of content you are retrieving:

          • "TEXT_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only text embeddings.

          • "IMAGE_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only image embeddings created with the "STANDARD_IMAGE" detailLevel.

          • "VIDEO_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only video embeddings or embeddings created with the "AUDIO_VIDEO_COMBINED" embedding mode.

          • "DOCUMENT_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only document image embeddings created with the "DOCUMENT_IMAGE" detailLevel.

          • "AUDIO_RETRIEVAL" - Creates embeddings optimized for searching a repository containing only audio embeddings.

          • "GENERIC_RETRIEVAL" - Creates embeddings optimized for searching a repository containing mixed modality embeddings.

        • Example: In an image search app where users retrieve images using text queries, use embeddingPurpose = "GENERIC_INDEX" when creating the embedding index from the images, and use embeddingPurpose = "IMAGE_RETRIEVAL" when creating the embedding of the text query used to retrieve the images.

      • "CLASSIFICATION" - Creates embeddings optimized for performing classification.

      • "CLUSTERING" - Creates embeddings optimized for clustering.

    • embeddingDimension (Optional) - The size of the vector to generate.

      • Type: int

      • Allowed values: 256 | 384 | 1024 | 3072

      • Default: 3072

    • text (Optional) - Represents text content. Exactly one of text, image, video, audio must be present.

      • truncationMode (Required) - Specifies which part of the text will be truncated in cases where the tokenized version of the text exceeds the maximum supported by the model.

        • Type: string

        • Allowed values:

          • "START" - Omit characters from the start of the text when necessary.

          • "END" - Omit characters from the end of the text when necessary.

          • "NONE" - Fail if text length exceeds the model's maximum token limit.

      • value (Optional; Either value or source must be provided) - The text value for which to create the embedding.

        • Type: string

        • Max length: 8192 characters

      • source (Optional; Either value or source must be provided) - Reference to a text file stored in S3. Note that the bytes option of the SourceObject is not applicable for text inputs. To pass text inline as part of the request, use the value parameter instead.

      • segmentationConfig (Required) - Controls how text content should be segmented into multiple embeddings.

        • maxLengthChars (Optional) - The maximum length to allow for each segment. The model will attempt to segment only at word boundaries.

          • Type: int

          • Valid range: 800-50,000

          • Default: 32,000

    • image (Optional) - Represents image content. Exactly one of text, image, video, audio must be present.

      • format (Required)

        • Type: string

        • Allowed values: "png" | "jpeg" | "gif" | "webp"

      • source (Required) - An image content source.

        • Type: SourceObject (see "Common Objects" section)

      • detailLevel (Optional) - Dictates the resolution at which the image is processed: "STANDARD_IMAGE" uses a lower resolution, while "DOCUMENT_IMAGE" uses a higher resolution to better interpret text in the image.

        • Type: string

        • Allowed values: "STANDARD_IMAGE" | "DOCUMENT_IMAGE"

        • Default: "STANDARD_IMAGE"

    • audio (Optional) - Represents audio content. Exactly one of text, image, video, audio must be present.

      • format (Required)

        • Type: string

        • Allowed values: "mp3" | "wav" | "ogg"

      • source (Required) - An audio content source.

        • Type: SourceObject (see "Common Objects" section)

      • segmentationConfig (Required) - Controls how audio content should be segmented into multiple embeddings.

        • durationSeconds (Optional) - The maximum duration of audio (in seconds) to use for each segment.

          • Type: int

          • Valid range: 1-30

          • Default: 5

    • video (Optional) - Represents video content. Exactly one of text, image, video, audio must be present.

      • format (Required)

        • Type: string

        • Allowed values: "mp4" | "mov" | "mkv" | "webm" | "flv" | "mpeg" | "mpg" | "wmv" | "3gp"

      • source (Required) - A video content source.

        • Type: SourceObject (see "Common Objects" section)

      • embeddingMode (Required)

        • Type: string

        • Values: "AUDIO_VIDEO_COMBINED" | "AUDIO_VIDEO_SEPARATE"

          • "AUDIO_VIDEO_COMBINED" - Will produce a single embedding for each segment combining both audible and visual content.

          • "AUDIO_VIDEO_SEPARATE" - Will produce two embeddings for each segment, one for the audio content and one for the video content.

      • segmentationConfig (Required) - Controls how video content should be segmented into multiple embeddings.

        • durationSeconds (Optional) - The maximum duration of video (in seconds) to use for each segment.

          • Type: int

          • Valid range: 1-30

          • Default: 5

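As an end-to-end sketch of the pieces above, the following hypothetical call segments a video stored in S3 into 10-second chunks. The bucket names and object key are placeholders, and the S3 form of the video source object is assumed to follow the same s3Location structure shown for text in the schema above.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

model_input = {
    "schemaVersion": "nova-multimodal-embed-v1",
    "taskType": "SEGMENTED_EMBEDDING",
    "segmentedEmbeddingParams": {
        "embeddingPurpose": "GENERIC_INDEX",
        "embeddingDimension": 1024,
        "video": {
            "format": "mp4",
            "source": {"s3Location": {"uri": "s3://your-input-bucket/example.mp4"}},
            "embeddingMode": "AUDIO_VIDEO_COMBINED",
            "segmentationConfig": {"durationSeconds": 10}
        }
    }
}

output_data_config = {
    "s3OutputDataConfig": {"s3Uri": "s3://your-output-bucket"}
}

response = bedrock_runtime.start_async_invoke(
    modelId="amazon.nova-2-multimodal-embeddings-v1:0",
    modelInput=model_input,
    outputDataConfig=output_data_config,
)
invocation_arn = response["invocationArn"]
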
StartAsyncInvoke Response

The response from a call to StartAsyncInvoke will have the structure below. The invocationArn can be used to query the status of the asynchronous job using the GetAsyncInvoke function.

{ "invocationArn": "arn:aws:bedrock:us-east-1:xxxxxxxxxxxx:async-invoke/lvmxrnjf5mo3", }

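A simple polling sketch using GetAsyncInvoke might look like the following; the job status values shown ("InProgress", "Completed", "Failed") are the standard Bedrock asynchronous invocation states and should be treated as an assumption here.

import time

# Poll the asynchronous job until it leaves the in-progress state.
while True:
    job = bedrock_runtime.get_async_invoke(invocationArn=invocation_arn)
    status = job["status"]
    if status != "InProgress":
        break
    time.sleep(30)

print("Job finished with status:", status)
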
Asynchronous Output

When asynchronous embeddings generation is complete, output artifacts are written to the S3 bucket you specified as the output destination. The files will have the following structure:

amzn-s3-demo-bucket/
  job-id/
    segmented-embedding-result.json
    embedding-audio.jsonl
    embedding-image.jsonl
    embedding-text.jsonl
    embedding-video.jsonl
    manifest.json

The segmented-embedding-result.json file contains the overall job result and references to the corresponding .jsonl files that contain the actual embeddings for each modality. Below is a truncated example of the file:

{ "sourceFileUri": string, "embeddingDimension": 256 | 384 | 1024 | 3072, "embeddingResults": [ { "embeddingType": "TEXT" | "IMAGE" | "VIDEO" | "AUDIO" | "AUDIO_VIDEO_COMBINED", "status": "SUCCESS" | "FAILURE" | "PARTIAL_SUCCESS", "failureReason": string, // Granular error codes "message": string, // Human-readbale failure message "outputFileUri": string // S3 URI to a "embedding-modality.jsonl" file } ... ] }

Each embedding-<modality>.jsonl file contains the embedding output for one modality. Each line in the file adheres to the following schema:

{ "embedding": number[], // The generated embedding vector "segmentMetadata": { "segmentIndex": number, "segmentStartCharPosition": number, // Included for text only "segmentEndCharPosition": number, // Included for text only "truncatedCharLength": number, // Included only when text gets truncated "segmentStartSeconds": number, // Included for audio/video only "segmentEndSeconds": number // Included for audio/video only }, "status": "SUCCESS" | "FAILURE", "failureReason": string, // Granular error codes "message": string // Human-readable failure message }

The following list includes all of the parameters for the response. For text character positions and audio/video times, all starting and ending values are zero-based. Additionally, all ending text positions and audio/video time values are inclusive.

  • embedding (Required) — The embedding vector.

    • Type: number[]

  • segmentMetadata — The metadata for the segment.

    • segmentIndex — The index of the segment within the array provided in the request.

    • segmentStartCharPosition — For text only. The starting (inclusive) character position of the embedded content within the segment.

    • segmentEndCharPosition — For text only. The ending character (exclusive) position of the embedded content within the segment.

    • truncatedCharLength (Optional) — Returned if the tokenized version of the input text exceeded the model’s limitations. The value indicates the character after which the text was truncated before generating the embedding.

      • Type: integer

    • segmentStartSeconds — For audio/video only. The starting time position of the embedded content within the segment.

    • segmentEndSeconds — For audio/video only. The ending time position of the embedded content within the segment.

  • status — The status for the segment.

  • failureReason — The detailed reasons on the failure for the segment.

    • RAI_VIOLATION_INPUT_TEXT_DEFLECTION — Input text violates RAI policy.

    • RAI_VIOLATION_INPUT_IMAGE_DEFLECTION — Input image violates RAI policy.

    • INVALID_CONTENT — Invalid input.

    • RATE_LIMIT_EXCEEDED — Embedding request is throttled due to service unavailability.

    • INTERNAL_SERVER_EXCEPTION — An unexpected internal error occurred in the service.

  • message — Related failure message.

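As a hedged sketch, assuming the .jsonl output files have been downloaded locally from the output bucket, the per-segment embeddings could be collected like this (the file name is a placeholder):

import json

segments = []
with open("embedding-video.jsonl") as f:      # hypothetical local copy of an output file
    for line in f:
        record = json.loads(line)
        if record["status"] == "SUCCESS":
            meta = record["segmentMetadata"]
            segments.append({
                "index": meta["segmentIndex"],
                "start": meta.get("segmentStartSeconds"),
                "end": meta.get("segmentEndSeconds"),
                "vector": record["embedding"],
            })
        else:
            print("Segment failed:", record["failureReason"], record["message"])

print(f"Loaded {len(segments)} segment embeddings")
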
File limitations for Nova Embeddings

Synchronous operations can accept both S3 inputs and inline chunks. Asynchronous operations can only accept S3 inputs.

When generating embeddings asynchronously, you'll need to ensure that your file is separated into an appropriate number of segments. For text embeddings you cannot have more than 1900 segments. For audio and video embeddings you cannot have more than 1434 segments.
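
For example, at the default 5-second segment duration, the 1,434-segment cap corresponds to roughly 7,170 seconds (just under two hours) of audio or video, so content approaching the 2-hour asynchronous duration limit may require a larger durationSeconds value.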

Synchronous Input size limits

File Type               | Size Limit
------------------------|-------------------------
(Inline) All file types | 25 MB
(S3) Text               | 1 MB; 50,000 characters
(S3) Image              | 50 MB
(S3) Video              | 30 seconds; 100 MB
(S3) Audio              | 30 seconds; 100 MB

Note

The 25 MB limit for inline inputs applies after Base64 encoding, which inflates the file size by about 33%. In practice, a raw file passed inline should therefore be no larger than roughly 18-19 MB.

Asynchronous Input size limits

File Type  | Size Limit
-----------|---------------
(S3) Text  | 634 MB
(S3) Image | 50 MB
(S3) Video | 2 GB; 2 hours
(S3) Audio | 1 GB; 2 hours

Input file types

Modality      | File types
--------------|-----------------------------------------------
Image Formats | PNG, JPEG, WEBP, GIF
Audio Formats | MP3, WAV, OGG
Video Formats | MP4, MOV, MKV, WEBM, FLV, MPEG, MPG, WMV, 3GP