Easy Multimodal Music Search with S3 Vectors
With AWS S3 Vectors, you can do vector-based similarity search using S3 alone — no separate vector DB required.
In this post, I embed audio from the MusicCaps music dataset (including captions) with CLAP, and wire up a multimodal music search system that retrieves songs by text query — built in moments with aws-cdk and uv.
The demo app, built with Gradio, lets you search for songs with free-text queries.
The full source code is in the repository below.
atsukoba/MusicCap-S3VectorsSearch-CLAP
S3 Vectors
Amazon S3 Vectors is S3's native vector storage feature, introduced in July 2025.
Amazon S3 Vectors is the first cloud storage with native vector support at scale. — AWS Blog: Introducing Amazon S3 Vectors
Previously, setting up vector search required dedicated infrastructure — Pinecone, Weaviate, pgvector, etc. S3 Vectors introduces a new "vector bucket" type where you can store and search vectors with the same feel as ordinary object storage.
- Cost: Up to 90% cost reduction compared to managed vector databases
- Scale: Up to 10,000 vector indexes per bucket, each holding tens of millions of vectors
- Serverless: No infrastructure provisioning required, pay-per-use
- AWS ecosystem: Native integration with Amazon Bedrock Knowledge Bases and SageMaker Unified Studio (not used in this post)
Using boto3
S3 Vectors is operated via a new boto3 client called s3vectors. Inserting vectors uses put_vectors and searching uses query_vectors.
```python
import boto3

s3vectors = boto3.client("s3vectors", "ap-northeast-1")

s3vectors.put_vectors(
    vectorBucketName="music-cap-vectors",
    indexName="music-embeddings",
    vectors=[
        {
            "key": "track_001",
            "data": {"float32": [0.12, -0.34, ...]},
            # arbitrary key-value metadata
            "metadata": {"caption": "A soft piano melody ...", "ytid": "abc123"},
        },
    ],
)

response = s3vectors.query_vectors(
    vectorBucketName="music-cap-vectors",
    indexName="music-embeddings",
    queryVector={"float32": [0.05, -0.28, ...]},
    topK=10,
    returnMetadata=True,
    returnDistance=True,
)
```
MusicCaps Dataset
MusicCaps is a music captioning dataset published by Google, containing 5,521 music clip–text description pairs.
- Each clip is a 10-second audio segment sourced from YouTube (via AudioSet timing metadata)
- Descriptions are written by professional musicians — roughly 4 sentences on average covering genre, instrumentation, mood, tempo, and timbre
- An aspect list of keywords ("mellow piano melody", "fast-paced drums", etc.) is also included
- License: CC-BY-SA 4.0
The dataset has columns such as ytid (YouTube ID), start_s, end_s, caption, and aspect_list. The actual audio must be fetched from YouTube with yt-dlp.
Note: Downloading videos from YouTube is prohibited under YouTube's Terms of Service. The methods described in this article are for research and educational purposes only and should be used at your own risk.
```python
from datasets import load_dataset

ds = load_dataset("google/MusicCaps", split="train", token=HF_TOKEN)
# 5521 examples
# Features: ytid, start_s, end_s, audioset_positive_labels, aspect_list, caption, ...
```
Fetching audio with yt-dlp using --download-sections to clip the exact segment:
```shell
yt-dlp --quiet --no-warnings -x --audio-format wav -f bestaudio \
  -o "{ytid}.wav" \
  --download-sections "*{start_s}-{end_s}" \
  "https://www.youtube.com/watch?v={ytid}"
```
Note: some clips may no longer be available on YouTube. Without HF_TOKEN set in .env, the dataset is limited to 32 samples.
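For scripted downloads, the yt-dlp invocation above can be wrapped in Python. The helper below is a hypothetical sketch (the function names and the failure handling are mine, not from the repo); it returns False for unavailable clips instead of raising:

```python
import subprocess

def build_ytdlp_cmd(ytid: str, start_s: float, end_s: float) -> list[str]:
    # Hypothetical helper mirroring the yt-dlp flags shown above
    return [
        "yt-dlp", "--quiet", "--no-warnings",
        "-x", "--audio-format", "wav", "-f", "bestaudio",
        "-o", f"{ytid}.wav",
        "--download-sections", f"*{start_s}-{end_s}",
        f"https://www.youtube.com/watch?v={ytid}",
    ]

def download_clip(ytid: str, start_s: float, end_s: float) -> bool:
    # Unavailable clips surface as a non-zero exit code from yt-dlp
    result = subprocess.run(build_ytdlp_cmd(ytid, start_s, end_s))
    return result.returncode == 0
```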

CLAP: Joint Audio-Text Embeddings
CLAP (Contrastive Language-Audio Pretraining) is an open-source multimodal model from LAION — think of it as the audio counterpart of OpenAI's CLIP (images × text).
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation — Wu et al., ICASSP 2023 (arXiv:2211.06687)
Architecture
CLAP has an audio encoder (HTSAT-based) and a text encoder (RoBERTa-based), both trained contrastively so that their embeddings land in the same latent space.

Because audio and text embeddings share a common space, computing cosine distance is all it takes to do text → music retrieval.
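As a toy illustration of the retrieval math (the 3-dimensional vectors below are made-up stand-ins for the 512-dimensional CLAP embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text_vec = np.array([0.10, 0.30, -0.20])   # stand-in for a text embedding
audio_vec = np.array([0.12, 0.28, -0.18])  # stand-in for an audio embedding
score = cosine_similarity(text_vec, audio_vec)  # close to 1.0 for similar vectors
```

S3 Vectors performs exactly this comparison server-side when the index is configured with the cosine metric.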
Loading via HuggingFace Transformers
This project uses laion/clap-htsat-unfused loaded via HuggingFace Transformers.
```python
import torch
from transformers import AutoModel, AutoTokenizer, AutoProcessor, ClapModel

MODEL_ID = "laion/clap-htsat-unfused"

# Use float32 on Apple Silicon (float16 causes MPS broadcast errors); float16 elsewhere
dtype = torch.float32 if torch.backends.mps.is_available() else torch.float16

model: ClapModel = AutoModel.from_pretrained(
    MODEL_ID, dtype=dtype, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```
Text Embeddings
```python
import numpy as np
import torch

def get_text_embeddings(texts: list[str]) -> list[float]:
    inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # get_text_features returns the projected embeddings directly, shape (batch, 512)
        features = model.get_text_features(**inputs)
    return features[0, :].cpu().numpy().astype(np.float32).tolist()
```
Audio Embeddings
```python
import numpy as np
import torch
from datasets import Dataset, Audio

def get_audio_embeddings(audio_path: str) -> list[float]:
    # Decode the file at CLAP's expected 48 kHz sampling rate
    audio_array = Dataset.from_dict({"audio": [audio_path]}).cast_column(
        "audio", Audio(sampling_rate=48000)
    )[0]["audio"]["array"]
    input_features = processor(
        audio=audio_array, sampling_rate=48000, return_tensors="pt"
    )["input_features"].to(model.device)
    with torch.no_grad():
        # get_audio_features returns the projected embeddings directly, shape (batch, 512)
        features = model.get_audio_features(input_features=input_features)
    return features[0, :].cpu().numpy().astype(np.float32).tolist()
```
Both get_text_features and get_audio_features return a 512-dimensional L2-normalized vector, so cosine similarity between them reduces to a dot product.
Implementation Walkthrough
Download the Dataset and Audio Files
scripts/create_demo_datasets.py loads the MusicCaps metadata from HuggingFace Hub and downloads audio clips with yt-dlp. datasets.map with num_proc handles parallel downloads.
```python
from datasets import Audio, load_dataset

ds = load_dataset(
    "google/MusicCaps",
    split="train",
    cache_dir=str(data_dir / "MusicCaps"),
    token=HF_TOKEN,
).select(range(samples_to_load))

ds = ds.map(
    lambda example: _process(example, data_dir / "audio"),
    num_proc=4,
).cast_column("audio", Audio(sampling_rate=44100))
```
Create Embeddings with CLAP
scripts/create_embeddings.py processes each downloaded clip through CLAP's audio encoder and saves the result as a per-ytid .npy file.
```python
import os

import numpy as np
from datasets import Audio, load_dataset
from torch import Tensor
from tqdm import tqdm

musiccaps_ds = (
    load_dataset("google/MusicCaps", split="train", ...)
    .map(link_audio_fn(str(data_dir / "audio")), batched=True)
    .filter(lambda d: os.path.exists(d["audio"]))
    .cast_column("audio", Audio(sampling_rate=48000))
    .map(process_audio_fn(processor, sampling_rate=48000))
)

for _data in tqdm(musiccaps_ds):
    _input = Tensor(_data["input_features"]).unsqueeze(0).to(model.device)
    # get_audio_features returns the projected embeddings, shape (1, 512)
    audio_embed = model.get_audio_features(_input)
    np.save(
        data_dir / "embeddings" / f"{_data['ytid']}.npy",
        audio_embed[0, :].detach().cpu().numpy(),
    )
```
Provision S3 Vectors with AWS CDK
The vector bucket and index are provisioned with AWS CDK (TypeScript). aws-cdk-lib/aws-s3vectors provides L1 constructs that map directly to the CloudFormation resource types.
```typescript
// cdk/lib/cdk-stack.ts
import * as s3vectors from "aws-cdk-lib/aws-s3vectors";

// Vector bucket
const vectorBucket = new s3vectors.CfnVectorBucket(
  this,
  "MusicCapVectorBucket",
  {
    vectorBucketName: "music-cap-vectors",
  },
);

// 512-dim float32 index with cosine similarity
const vectorIndex = new s3vectors.CfnIndex(this, "MusicEmbeddingsIndex", {
  vectorBucketName: vectorBucket.vectorBucketName!,
  indexName: "music-embeddings",
  dataType: "float32",
  dimension: 512,
  distanceMetric: "cosine",
});
vectorIndex.addDependency(vectorBucket);
```
Deploy with the CDK CLI:
```shell
cd cdk/
pnpm install
pnpm cdk bootstrap  # first time only
pnpm cdk deploy
```
On success you get:
```
CdkStack.VectorBucketArn = arn:aws:s3vectors:ap-northeast-1:123456789012:bucket/music-cap-vectors
CdkStack.VectorIndexArn = arn:aws:s3vectors:ap-northeast-1:123456789012:bucket/music-cap-vectors/index/music-embeddings
```
PUT Vectors with boto3
scripts/upload_vectors.py handles ingestion. It first paginates through list_vectors to fetch existing keys so only new vectors are uploaded (incremental ingestion).
```python
import boto3
import numpy as np

S3_BUCKET_NAME = "music-cap-vectors"
S3_INDEX_NAME = "music-embeddings"
BATCH_SIZE = 100

s3vectors = boto3.client("s3vectors", region_name="ap-northeast-1")

vectors_to_put = []
for sample in musiccaps_ds:
    ytid = sample["ytid"]
    embedding = np.load(f"data/embeddings/{ytid}.npy").astype(np.float32)
    vec = embedding.squeeze()
    vectors_to_put.append({
        "key": ytid,
        "data": {"float32": vec.tolist()},
        "metadata": {
            "ytid": ytid,
            "caption": str(sample["caption"]),
            "aspect_list": ", ".join(sample["aspect_list"]),
            "start_s": str(sample["start_s"]),
            "end_s": str(sample["end_s"]),
        },
    })
    if len(vectors_to_put) >= BATCH_SIZE:
        s3vectors.put_vectors(
            vectorBucketName=S3_BUCKET_NAME,
            indexName=S3_INDEX_NAME,
            vectors=vectors_to_put,
        )
        vectors_to_put = []

# Flush the final partial batch
if vectors_to_put:
    s3vectors.put_vectors(
        vectorBucketName=S3_BUCKET_NAME,
        indexName=S3_INDEX_NAME,
        vectors=vectors_to_put,
    )
```
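The incremental part, paginating through list_vectors to collect keys that are already indexed, can be sketched like this. I am assuming the response shape (vectors / nextToken keys) from the boto3 s3vectors API, so treat it as a sketch rather than repo code:

```python
def get_existing_keys(client, bucket: str, index: str) -> set[str]:
    # Page through list_vectors and collect every key already in the index
    keys: set[str] = set()
    kwargs = {"vectorBucketName": bucket, "indexName": index}
    while True:
        response = client.list_vectors(**kwargs)
        keys.update(v["key"] for v in response.get("vectors", []))
        token = response.get("nextToken")
        if not token:
            return keys
        kwargs["nextToken"] = token

# existing = get_existing_keys(s3vectors, "music-cap-vectors", "music-embeddings")
# new_samples = [s for s in musiccaps_ds if s["ytid"] not in existing]
```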
Metadata note: this project stores all metadata values as strings before uploading to S3 Vectors.
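A small helper (hypothetical; not in the repo) makes that string coercion explicit and reusable:

```python
def to_string_metadata(sample: dict) -> dict[str, str]:
    # Coerce every metadata field to str before passing it to put_vectors
    return {
        "ytid": str(sample["ytid"]),
        "caption": str(sample["caption"]),
        "aspect_list": ", ".join(sample["aspect_list"]),
        "start_s": str(sample["start_s"]),
        "end_s": str(sample["end_s"]),
    }
```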
Search for Music with a Text Query
demo/search.py embeds the query with CLAP's text encoder and calls query_vectors.
```python
# demo/search.py
import boto3
from botocore.config import Config

from demo.config import S3_BUCKET_NAME, S3_INDEX_NAME

s3vectors_client = boto3.client(
    "s3vectors",
    "ap-northeast-1",
    config=Config(connect_timeout=10, read_timeout=10),
)

def search(embedding: list[float], top_k: int = 10):
    response = s3vectors_client.query_vectors(
        vectorBucketName=S3_BUCKET_NAME,
        indexName=S3_INDEX_NAME,
        queryVector={"float32": embedding},
        topK=top_k,
        returnMetadata=True,
        returnDistance=True,
    )
    return response["vectors"]
```
Example result. Lower distance means more similar; under the cosine metric, distance = 1 - cosine similarity:
```python
{
    'distance': 0.360,
    'key': '2unse6chkMU',
    'metadata': {
        'caption': 'This is a piece that would be suitable as calming study music...',
        'aspect_list': 'calming piano music, soothing, bedtime music, sleep music, piano, reverb, violin',
        'ytid': '2unse6chkMU',
        ...
    }
}
```
Build the Demo UI with Gradio
demo/app.py uses gr.Blocks to display results in a gr.Dataframe. Clicking a row triggers local audio playback if the file is available.
```python
# demo/app.py
import gradio as gr
import pandas as pd

from demo.feature_extract import get_text_embeddings
from demo.local_data import get_local_audio_by_ytid
from demo.search import search

def do_search(query: str, top_k: int) -> tuple[pd.DataFrame, dict]:
    embedding = get_text_embeddings([query])
    results = search(embedding, top_k=int(top_k))
    rows = [
        {
            "rank": i + 1,
            "ytid": r["key"],
            "distance": round(r["distance"], 4),
            "caption": r.get("metadata", {}).get("caption", ""),
            "aspect_list": r.get("metadata", {}).get("aspect_list", ""),
        }
        for i, r in enumerate(results)
    ]
    return pd.DataFrame(rows), gr.update(value=None, visible=False)

def on_select(df: pd.DataFrame, evt: gr.SelectData) -> dict:
    ytid = str(df.iloc[evt.index[0]]["ytid"])
    audio_path = get_local_audio_by_ytid(ytid)
    return gr.update(value=audio_path, label=f"Preview: {ytid}", visible=True)

with gr.Blocks(title="MusicCap Search") as demo:
    gr.Markdown("# Text -> Audio Search on S3Vectors")
    with gr.Row():
        query_input = gr.Textbox(
            placeholder="ex: calming piano music with soft strings",
            label="Query",
            scale=4,
        )
        top_k_slider = gr.Slider(minimum=1, maximum=30, value=10, step=1, label="Top K", scale=1)
    search_btn = gr.Button("Search", variant="primary")
    results_df = gr.Dataframe(
        headers=["rank", "ytid", "distance", "caption", "aspect_list"],
        label="Results (click a row to play)",
        interactive=False,
        wrap=True,
    )
    audio_player = gr.Audio(label="Preview", visible=False)

    search_btn.click(fn=do_search, inputs=[query_input, top_k_slider], outputs=[results_df, audio_player])
    query_input.submit(fn=do_search, inputs=[query_input, top_k_slider], outputs=[results_df, audio_player])
    results_df.select(fn=on_select, inputs=[results_df], outputs=[audio_player])

demo.launch()
```
```shell
uv run python -m demo.app
```

Evaluation
This is a quick sanity check that the search system works as intended.
All indexed vectors are computed from audio, but the original MusicCaps dataset also provides human-written captions and aspect-list tags. We therefore test whether each sample can be retrieved by querying with its own ground-truth caption or aspect list.
For this demo, I randomly selected 1024 samples from the 5,521-sample dataset, successfully downloaded 960 of them, and built an S3 Vectors index. Queries use the caption text and aspect_list tags.
Example: id=-7B9tPuIP-w
caption
A male voice narrates a monologue to the rhythm of a song in the background. The song is fast tempo with enthusiastic drumming, groovy bass lines,cymbal ride, keyboard accompaniment ,electric guitar and animated vocals. The song plays softly in the background as the narrator speaks and burgeons when he stops. The song is a classic Rock and Roll and the narration is a Documentary.
aspect_list
['r&b', 'soul', 'male vocal', 'melodic singing', 'strings sample', 'strong bass', 'electronic drums', 'sensual', 'groovy', 'urban']
Each query retrieves the top_k=100 nearest vectors by cosine distance, and we compute top-k accuracy: the fraction of queries whose ground-truth sample appears within the top k results.
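Since each query has exactly one relevant item (its own clip), the metric is straightforward to compute; the ranks below are made-up examples:

```python
import numpy as np

def top_k_accuracy(ranks: list[int], k: int) -> float:
    # Fraction of queries whose ground-truth clip appears within the top k results
    return float(np.mean(np.asarray(ranks) <= k))

ranks = [1, 3, 15, 7, 120]       # hypothetical 1-based ranks of the ground-truth clip
acc = top_k_accuracy(ranks, 10)  # 3 of the 5 ranks are <= 10
```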

With the 960-sample index, roughly 40% of queries return their original sample within the top 10.
Reflections and Summary
S3 Vectors really is easy
A few lines of CDK to define the bucket and index, then plain put_vectors / query_vectors boto3 calls — no extra managed service to keep running. The developer experience is remarkably clean.
For prototypes or small-to-medium workloads (up to a few million vectors), S3 Vectors is well worth evaluating before reaching for Pinecone or Weaviate.
CLAP search quality
Using laion/clap-htsat-unfused via HuggingFace Transformers gave good results for queries about instruments, genre, and tempo. Emotional queries ("sad", "happy") were a bit noisier. It is worth noting that some MusicCaps captions may overlap with CLAP's training data, so benchmarking results should be interpreted with care.
Apple Silicon works too
Setting dtype=torch.float32 when torch.backends.mps.is_available() keeps things running on M-series Macs — float16 triggers MPS broadcast errors with this model. The M4 Max ran inference without issues.
References
- Introducing Amazon S3 Vectors - AWS Blog
- Amazon S3 Vectors - Getting Started
- boto3 S3Vectors Reference
- LAION-AI/CLAP - GitHub
- laion/clap-htsat-unfused - Hugging Face
- Large-scale Contrastive Language-Audio Pretraining (arXiv:2211.06687)
- google/MusicCaps - Hugging Face
- MusicCaps overview article - Zenn (tatexh)