TwelveLabs Marengo Embed 3.0
The TwelveLabs Marengo Embed 3.0 model generates enhanced embeddings from video, text, audio, or image inputs. This latest version offers improved performance and accuracy for similarity search, clustering, and other machine learning tasks.
Provider — TwelveLabs
Model ID — twelvelabs.marengo-embed-3-0-v1:0
Marengo Embed 3.0 delivers several key enhancements:
- Extended video processing capacity – Process up to 4 hours of video and audio content and files up to 6 GB (double the capacity of previous versions), making it ideal for analyzing full sporting events, extended training videos, and complete film productions.
- Enhanced sports analysis – Significantly improved understanding of gameplay dynamics, player movements, and event detection.
- Global multilingual support – Expanded language coverage from 12 to 36 languages, enabling global organizations to build unified search and retrieval systems that work seamlessly across diverse regions and markets.
- Multimodal search precision – Combine images and descriptive text in a single embedding request, merging visual similarity with semantic understanding to deliver more accurate and contextually relevant search results.
- Reduced embedding dimension – The embedding dimension is reduced from 1024 to 512, cutting storage costs.
The TwelveLabs Marengo Embed 3.0 model supports the Amazon Bedrock Runtime operations in the following table.
- For more information about use cases for different API methods, see Learn about use cases for different model inference methods.
- For more information about model types, see How inference works in Amazon Bedrock.
- To see the AWS Regions in which TwelveLabs Marengo Embed 3.0 is supported, and for a list of model IDs, search for the model in the table at Supported foundation models in Amazon Bedrock.
- For a full list of inference profile IDs, see Supported Regions and models for inference profiles. The inference profile ID is based on the AWS Region.
| API operation | Supported model types | Input modalities | Output modalities |
|---|---|---|---|
| InvokeModel | US East (N. Virginia) – Base models and inference profiles; Europe (Ireland) – Inference profiles; Asia Pacific (Seoul) – Base models | Text, Image (text and image interleaved is also supported) | Embedding |
| StartAsyncInvoke | Base models | Video, Audio, Image, Text (text and image interleaved is also supported) | Embedding |
Note
Use InvokeModel to generate embeddings for search queries. Use StartAsyncInvoke to generate embeddings for assets at scale.
The following quotas apply to the input:
| Input modality | Maximum |
|---|---|
| Text | 500 tokens |
| Image | 5 MB per image |
| Video (S3) | 6 GB, 4 hours in length |
| Audio (S3) | 6 GB, 4 hours in length |
Note
If you provide audio or video inline as a Base64-encoded string, make sure that the request body payload doesn't exceed the 25 MB Amazon Bedrock model invocation quota.
TwelveLabs Marengo Embed 3.0 request parameters
When you make a request, the field in which the model-specific input is specified depends on the API operation:
- InvokeModel – In the request body.
- StartAsyncInvoke – In the modelInput field of the request body.
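For illustration, the following minimal Python sketch shows the same model input used with both operations. The field names inputType and inputText, the Region, and the S3 output URI are assumptions or placeholders, not values defined in this section.

```python
import json
import boto3

# Amazon Bedrock Runtime client (Region is a placeholder).
client = boto3.client("bedrock-runtime", region_name="us-east-1")

model_id = "twelvelabs.marengo-embed-3-0-v1:0"
model_input = {"inputType": "text", "inputText": "dog playing in a park"}  # assumed field names

# InvokeModel – the model-specific input is the request body itself.
sync_response = client.invoke_model(modelId=model_id, body=json.dumps(model_input))

# StartAsyncInvoke – the same model-specific input goes in the modelInput field.
async_response = client.start_async_invoke(
    modelId=model_id,
    modelInput=model_input,
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/embeddings/"}},
)
```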
The format of the model input depends on the input modality. The following sections describe the input parameters.
Modality for the embedding.
- Type: String
- Required: Yes
- Valid values: text | image | text_image | audio | video
Text to be embedded.
- Type: String
- Required: Yes (for compatible input types)
- Compatible input types: Text
Contains information about the media source.
- Type: Object
- Required: Yes (for compatible input types)
- Compatible input types: Image, Video, Audio

The format of the mediaSource object in the request body depends on whether the media is defined as a Base64-encoded string or as an S3 location.
- Base64-encoded string:
  { "mediaSource": { "base64String": "base64-encoded string" } }
  - base64String – The Base64-encoded string for the media.
- S3 location – Specify the S3 URI and the bucket owner:
  { "mediaSource": { "s3Location": { "uri": "string", "bucketOwner": "string" } } }
  - uri – The S3 URI containing the media.
  - bucketOwner – The AWS account ID of the S3 bucket owner.
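As a rough sketch, here is how each mediaSource form might be built in Python. The file path, S3 URI, and account ID are placeholders, and the inputType field name is an assumption.

```python
import base64

# Inline media: Base64-encode the file (keep the request payload under the 25 MB quota).
with open("photo.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

image_input_inline = {
    "inputType": "image",  # assumed field name for the input modality
    "mediaSource": {"base64String": encoded_image},
}

# S3 media: reference the object by its S3 URI and the bucket owner's account ID.
image_input_s3 = {
    "inputType": "image",
    "mediaSource": {
        "s3Location": {
            "uri": "s3://amzn-s3-demo-bucket/images/photo.jpg",
            "bucketOwner": "111122223333",
        }
    },
}
```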
Specifies which types of embeddings to retrieve.
- Type: List
- Required: No
- Valid values for list members:
  - visual – Visual embeddings from the video.
  - audio – Embeddings of the audio in the video.
  - transcription – Embeddings of the transcribed text.
- Default values:
  - Video: ["visual", "audio", "transcription"]
  - Audio: ["audio", "transcription"]
- Compatible input types: Video, Audio
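For example, a video request that asks for only visual and audio embeddings might look like the following sketch. The embeddingOption field name is inferred from the response fields shown later, the inputType field name is an assumption, and the S3 details are placeholders.

```python
video_input = {
    "inputType": "video",  # assumed field name for the input modality
    "mediaSource": {
        "s3Location": {
            "uri": "s3://amzn-s3-demo-bucket/videos/match.mp4",
            "bucketOwner": "111122223333",
        }
    },
    # Skip transcription embeddings for this asset (field name inferred from the response format).
    "embeddingOption": ["visual", "audio"],
}
```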
Specifies the scope of the embeddings to retrieve.
- Type: List
- Required: No
- Valid values for list members:
  - clip – Returns embeddings for each clip.
  - asset – Returns embeddings for the entire asset.
- Compatible input types: Video, Audio
The time point in seconds of the clip where processing should begin.
- Type: Double
- Required: No
- Minimum value: 0
- Default value: 0
- Compatible input types: Video, Audio
The time point in seconds where processing should end.
- Type: Double
- Required: No
- Minimum value: startSec + segment length
- Maximum value: Duration of the media
- Default value: Duration of the media
- Compatible input types: Video, Audio
Defines how the media is divided into segments for embedding generation.
- Type: Object
- Required: No
- Compatible input types: Video, Audio

The segmentation object contains a method field and method-specific parameters:
- method – The segmentation method to use. Valid values: dynamic | fixed
- dynamic – For video, uses shot boundary detection to divide content dynamically. Contains:
  - minDurationSec – Minimum duration for each segment in seconds. Type: Integer. Range: 1-5. Default: 4.
- fixed – Divides content into segments of equal duration. Contains:
  - durationSec – Duration of each segment in seconds. Type: Integer. Range: 1-10. Default: 6.

Default behavior:
- Video: Uses dynamic segmentation with shot boundary detection.
- Audio: Uses fixed segmentation, dividing content as evenly as possible into segments close to 10 seconds.
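The following sketch shows how the two segmentation methods might be configured in a request. The exact nesting of the method-specific parameters under a dynamic or fixed key is an assumption based on the description above.

```python
# Dynamic segmentation (video): shot boundary detection with a minimum segment length of 3 seconds.
dynamic_segmentation = {
    "segmentation": {
        "method": "dynamic",
        "dynamic": {"minDurationSec": 3},  # nesting assumed
    }
}

# Fixed segmentation: equal-length segments of 8 seconds each.
fixed_segmentation = {
    "segmentation": {
        "method": "fixed",
        "fixed": {"durationSec": 8},  # nesting assumed
    }
}
```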
Unique identifier for the inference request.
- Type: String
- Required: No
TwelveLabs Marengo Embed 3.0 response
The location of the output embeddings and associated metadata depends on the invocation method:
- InvokeModel – In the response body.
- StartAsyncInvoke – In the S3 bucket defined in s3OutputDataConfig, after the asynchronous invocation job completes.
If there are multiple embedding vectors, the output is a list of objects, each containing a vector and its associated metadata.
The format of the output embeddings vector is as follows:
{
  "data": {
    "embedding": [0.111, 0.234, ...],
    "embeddingOption": ["visual", "audio", "transcription" (for video input) | "audio", "transcription" (for audio input)],
    "embeddingScope": ["asset" | "clip"],
    "startSec": 0,
    "endSec": 4.2
  }
}
The embeddings are returned as an array of floats.
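Because each embedding is a plain vector of floats, downstream similarity search reduces to standard vector math. The helper below is an illustrative sketch, not part of the Amazon Bedrock API.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Example: rank clip embeddings against a query embedding.
# best_clip = max(clips, key=lambda c: cosine_similarity(query_vector, c["embedding"]))
```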
Where you see this response depends on the API method you used:
- InvokeModel – Appears in the response body.
- StartAsyncInvoke – Appears at the S3 location that you specified in the request. The StartAsyncInvoke response returns an invocationArn that you can use to get metadata about the asynchronous invocation, including the status and the S3 location to which the results are written.
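A minimal sketch of retrieving the results with boto3 follows. The Region, request field names, and invocation ARN are placeholders or assumptions; the results of an asynchronous job are read from the S3 output location that the job reports.

```python
import json
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# InvokeModel: read the embeddings directly from the response body.
response = bedrock.invoke_model(
    modelId="twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps({"inputType": "text", "inputText": "sunset over the ocean"}),  # assumed field names
)
print(json.loads(response["body"].read())["data"])

# StartAsyncInvoke: poll the job by its invocationArn, then read the results
# from the S3 output location once the job completes.
invocation_arn = "arn:aws:bedrock:us-east-1:111122223333:async-invoke/EXAMPLE"  # placeholder
job = bedrock.get_async_invoke(invocationArn=invocation_arn)
while job["status"] == "InProgress":
    time.sleep(10)
    job = bedrock.get_async_invoke(invocationArn=invocation_arn)

print("Status:", job["status"])
print("Results written to:", job["outputDataConfig"]["s3OutputDataConfig"]["s3Uri"])
```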
The response includes the following parameters:
Embedding vector representation of the input.
- Type: List of doubles
The type of embeddings.
- Type: String
- Possible values:
  - visual – Visual embeddings from the video.
  - audio – Embeddings of the audio in the video.
  - transcription – Embeddings of the transcribed text.
- Compatible input types: Video, Audio
The scope of the embedding.
- Type: String
- Possible values:
  - clip – The embedding is for a single clip.
  - asset – The embedding is for the entire asset.
The start offset of the clip, in seconds.
- Type: Double
- Compatible input types: Video, Audio
The end offset of the clip, in seconds. Not applicable for text, image, and text_image embeddings.
- Type: Double
- Compatible input types: Video, Audio
TwelveLabs Marengo Embed 3.0 code examples
This section shows how to use the TwelveLabs Marengo Embed 3.0 model with different input types using Python.
Note
Currently, InvokeModel supports text, image, and interleaved text and image input.
Put your code together in the following steps:
1. Define model-specific input
Define the model-specific input depending on your input type:
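For example, the model input for each of the currently supported input types might look like the following sketch. The field names inputType and inputText are assumptions based on the parameters described earlier, and the S3 details are placeholders.

```python
# Text input
text_input = {
    "inputType": "text",
    "inputText": "a person surfing a large wave",
}

# Image input (referenced from Amazon S3)
image_input = {
    "inputType": "image",
    "mediaSource": {
        "s3Location": {
            "uri": "s3://amzn-s3-demo-bucket/images/surfer.png",
            "bucketOwner": "111122223333",
        }
    },
}

# Interleaved text and image input
text_image_input = {
    "inputType": "text_image",
    "inputText": "scenes that look like this image but at night",
    "mediaSource": {
        "s3Location": {
            "uri": "s3://amzn-s3-demo-bucket/images/surfer.png",
            "bucketOwner": "111122223333",
        }
    },
}
```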
2. Run model invocation using the model input
Then, add the code snippet that corresponds to your model invocation method of choice.
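The following sketch shows both invocation methods with boto3. The Region and the S3 output URI are placeholders, and model_input refers to one of the dictionaries defined in step 1.

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
model_id = "twelvelabs.marengo-embed-3-0-v1:0"
model_input = text_input  # one of the dictionaries from step 1

# Option A: synchronous invocation with InvokeModel
# (text, image, and interleaved text and image input).
response = client.invoke_model(modelId=model_id, body=json.dumps(model_input))
print(json.loads(response["body"].read())["data"])

# Option B: asynchronous invocation with StartAsyncInvoke
# (large assets such as video and audio).
job = client.start_async_invoke(
    modelId=model_id,
    modelInput=model_input,
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/marengo-output/"}
    },
)
print("Invocation ARN:", job["invocationArn"])
```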