TwelveLabs Marengo Embed 2.7
The TwelveLabs Marengo Embed 2.7 model generates embeddings from video, text, audio, or image inputs. These embeddings can be used for similarity search, clustering, and other machine learning tasks. The model supports asynchronous inference through the StartAsyncInvoke API.
Provider — TwelveLabs
Categories — Embeddings, multimodal
Model ID — twelvelabs.marengo-embed-2-7-v1:0
Input modality — Video, Text, Audio, Image
Output modality — Embeddings
Max video size — 2 hours in duration (< 2 GB file size)
TwelveLabs Marengo Embed 2.7 request parameters
The following table describes the input parameters for the TwelveLabs Marengo Embed 2.7 model:
Field | Type | Required | Description
---|---|---|---
`inputType` | string | Yes | Modality for the embedding. Valid values: `video`, `text`, `audio`, `image`.
`inputText` | string | No | Text to embed when `inputType` is `text`. Required if `inputType` is `text`. Text input cannot be supplied by S3 URI; it must be provided in the `inputText` field.
`startSec` | double | No | The start offset in seconds from the beginning of the video or audio where processing should begin. Specifying 0 means starting from the beginning of the media. Default: 0. Min: 0.
`lengthSec` | double | No | The duration in seconds of video or audio to process, measured from `startSec`. Default: media duration. Max: media duration.
`useFixedLengthSec` | double | No | For audio or video input only. The desired fixed duration in seconds of each clip for which the platform generates an embedding. Min: 2. Max: 10. If omitted, video segments are divided dynamically by shot boundary detection, and audio segments are divided evenly so that each is as close to 10 seconds as possible (for example, a 50-second clip yields five 10-second segments, while a 16-second clip yields two 8-second segments).
`textTruncate` | string | No | For text input only. Specifies how the platform truncates text that exceeds 77 tokens. Valid values: `end` (truncate the end of the text), `none` (return an error if the text exceeds the limit). Default: `end`.
`embeddingOption` | list | No | For video input only. Specifies which types of embeddings to retrieve. Valid values: `visual-text` (visual embeddings optimized for text search), `visual-image` (visual embeddings optimized for image search), `audio` (audio embeddings). If not provided, all available embeddings are returned.
`mediaSource` | object | No | Describes the media source. Required when `inputType` is `image`, `video`, or `audio`.
`mediaSource.base64String` | string | No | Base64-encoded byte string for the media. Max: 36 MB. Either `base64String` or `s3Location` must be provided when `mediaSource` is used.
`mediaSource.s3Location.uri` | string | No | S3 URI from which the media can be downloaded. For video, max: 2 hours in duration (< 2 GB file size). Required if using `s3Location`.
`mediaSource.s3Location.bucketOwner` | string | No | AWS account ID of the bucket owner.
`minClipSec` | int | No | For video input only. Sets the minimum clip duration in seconds. `useFixedLengthSec` should be larger than this value. Default: 4. Min: 1. Max: 5.
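As a sketch of how these parameters fit together, the following Python dict shows a possible model input payload for a video stored in S3. The bucket name, object key, and account ID are placeholders; the full StartAsyncInvoke call that wraps this payload is sketched at the end of this page.

```python
# Hypothetical model input for video embedding, built from the parameters
# in the table above. All S3 and account values are placeholders.
model_input = {
    "inputType": "video",
    "mediaSource": {
        "s3Location": {
            "uri": "s3://amzn-s3-demo-bucket/videos/sample.mp4",  # placeholder URI
            "bucketOwner": "111122223333",  # placeholder AWS account ID
        }
    },
    "startSec": 0,                      # start at the beginning of the video
    "useFixedLengthSec": 10,            # 10-second clips instead of shot detection
    "minClipSec": 4,                    # must be smaller than useFixedLengthSec
    "embeddingOption": ["visual-text", "audio"],  # skip visual-image embeddings
}
```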
TwelveLabs Marengo Embed 2.7 response fields
The following table describes the output fields for the TwelveLabs Marengo Embed 2.7 model:
Field | Type | Description
---|---|---
`embedding` | list of doubles | The embedding values.
`embeddingOption` | string | The type of embedding for multi-vector output (only applicable for video). Valid values: `visual-text` (visual embeddings closely aligned with text embeddings), `visual-image` (visual embeddings closely aligned with image embeddings), `audio` (audio embeddings).
`startSec` | double | The start offset of the clip, in seconds. Not applicable for text and image embeddings.
`endSec` | double | The end offset of the clip, in seconds. Not applicable for text and image embeddings.
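Because inference is asynchronous, the result is written to the S3 output location configured in the StartAsyncInvoke request rather than returned inline. The following minimal sketch reads and parses that result; the `output.json` object key and the `"data"` wrapper key are assumptions for illustration, while the per-item fields (`embedding`, `embeddingOption`, `startSec`, `endSec`) come from the table above.

```python
import json

import boto3

s3 = boto3.client("s3")

# Assumption: the job result is a JSON document at the S3 output location
# configured in StartAsyncInvoke; bucket and key here are placeholders.
obj = s3.get_object(Bucket="amzn-s3-demo-bucket", Key="embeddings/output.json")
result = json.loads(obj["Body"].read())

# Assumption: embedding objects are returned as a list under a "data" key;
# the per-item fields are documented in the response table above.
for item in result.get("data", []):
    vector = item["embedding"]             # list of doubles
    option = item.get("embeddingOption")   # e.g. "visual-text" (video only)
    start, end = item.get("startSec"), item.get("endSec")
    print(option, start, end, len(vector))
```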
TwelveLabs Marengo Embed 2.7 request and response
The following examples show how to use the TwelveLabs Marengo Embed 2.7 model with different input types. Note that TwelveLabs Marengo Embed 2.7 uses the StartAsyncInvoke API for processing.
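As a starting point, here is a minimal end-to-end sketch using boto3 to submit a text-embedding job and poll for completion. The output bucket is a placeholder, and the polling interval is illustrative; `start_async_invoke` and `get_async_invoke` are the Bedrock Runtime operations for asynchronous inference.

```python
import time

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Submit an asynchronous text-embedding job. Text input must use the
# inputText field; it cannot be supplied via S3 URI.
response = bedrock.start_async_invoke(
    modelId="twelvelabs.marengo-embed-2-7-v1:0",
    modelInput={
        "inputType": "text",
        "inputText": "A man walking a dog on a beach at sunset",
        "textTruncate": "end",  # truncate text that exceeds 77 tokens
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": "s3://amzn-s3-demo-bucket/embeddings/"  # placeholder bucket
        }
    },
)
invocation_arn = response["invocationArn"]

# Poll until the job leaves the InProgress state, then read the result
# from the S3 output location as shown in the response-parsing sketch above.
while True:
    job = bedrock.get_async_invoke(invocationArn=invocation_arn)
    if job["status"] != "InProgress":
        break
    time.sleep(5)

print(job["status"], job["outputDataConfig"]["s3OutputDataConfig"]["s3Uri"])
```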