Multimodal embedding model that creates holistic video/audio/text/image embeddings for semantic search and video understanding.

Marengo 3.0 is TwelveLabs' multimodal embedding model, designed to analyze visuals, audio, and text from video and encode them into dense embeddings. It enables semantic video search, content indexing, and downstream tasks by producing high-quality multimodal representations that capture scene, audio, and textual context. The model is accessible via the TwelveLabs SDKs (Python and JavaScript) and fits into embedding pipelines, including asynchronous workflows on platforms like AWS Bedrock, to build searchable indexes, frame- and shot-level analyses, and transcript-linked embeddings. Its value is in providing holistic, human-like video understanding that supports image and natural-language queries for efficient retrieval and analysis.
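
As a concrete illustration of the asynchronous embedding workflow described above, the sketch below submits a video through the TwelveLabs Python SDK and reads back segment-level embeddings. This is a minimal sketch, not an authoritative example: the method names (embed.task.create, wait_for_done, retrieve), the model identifier, and the segment field names follow one published version of the SDK's embed examples and can differ between releases, and the API key and video URL are hypothetical placeholders.

    # Minimal sketch of the asynchronous video-embedding workflow with the
    # TwelveLabs Python SDK. Method, model, and field names are assumptions
    # drawn from published SDK examples and may vary across SDK versions;
    # the API key and video URL are hypothetical placeholders.
    from twelvelabs import TwelveLabs

    client = TwelveLabs(api_key="YOUR_API_KEY")

    # Video embeddings are produced asynchronously: create a task, poll
    # until it completes, then retrieve the resulting embeddings.
    task = client.embed.task.create(
        model_name="Marengo-retrieval-2.7",        # assumed model identifier
        video_url="https://example.com/clip.mp4",  # hypothetical input video
    )
    task.wait_for_done(sleep_interval=5)

    result = task.retrieve()
    if result.video_embedding and result.video_embedding.segments:
        for segment in result.video_embedding.segments:
            # Each segment carries a dense embedding vector plus its time
            # span, which supports frame- and shot-level analysis.
            print(
                segment.start_offset_sec,
                segment.end_offset_sec,
                len(segment.embeddings_float),
            )

Submitting the embedding job as a task keeps long videos from blocking the client, which is the same reason platforms like AWS Bedrock expose the model through asynchronous invocations.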

