Easy Multimodal Music Search with S3 Vectors

Category: Tech Blog
Tags: #AWS #S3Vectors #Python #CLAP #MusicCaps #Gradio #MachineLearning
Published: 2025-12-23

With AWS S3 Vectors, you can do vector-based similarity search using S3 alone — no separate vector DB required.

In this post, I embed audio from the MusicCaps music dataset (including captions) with CLAP, and wire up a multimodal music search system that retrieves songs by text query — built in moments with aws-cdk and uv.

In the demo app built with Gradio, you can search for songs with text queries.

The full source code is in the repository below.

atsukoba/MusicCap-S3VectorsSearch-CLAP


Amazon S3 Vectors is a native vector storage feature of S3, which AWS introduced in July 2025.

Amazon S3 Vectors is the first cloud storage with native vector support at scale. — AWS Blog: Introducing Amazon S3 Vectors

Previously, setting up vector search required dedicated infrastructure — Pinecone, Weaviate, pgvector, etc. S3 Vectors introduces a new "vector bucket" type where you can store and search vectors with the same feel as ordinary object storage.

  • Cost: Up to 90% cost reduction compared to managed vector databases
  • Scale: Up to 10,000 vector indexes per bucket, each holding tens of millions of vectors
  • Serverless: No infrastructure provisioning required, pay-per-use
  • AWS ecosystem: Native integration with Amazon Bedrock Knowledge Bases and SageMaker Unified Studio (not used in this post)

Using boto3

S3 Vectors is operated via a new boto3 client called s3vectors. Inserting vectors uses put_vectors and searching uses query_vectors.

import boto3

s3vectors = boto3.client("s3vectors", "ap-northeast-1")

s3vectors.put_vectors(
    vectorBucketName="music-cap-vectors",
    indexName="music-embeddings",
    vectors=[
        {
            "key": "track_001",
            "data": {"float32": [0.12, -0.34, ...]},
            "metadata": {"caption": "A soft piano melody ...", "ytid": "abc123"},  # arbitrary metadata
        },
    ],
)

response = s3vectors.query_vectors(
    vectorBucketName="music-cap-vectors",
    indexName="music-embeddings",
    queryVector={"float32": [0.05, -0.28, ...]},
    topK=10,
    returnMetadata=True,
    returnDistance=True,
)

MusicCaps is a music captioning dataset published by Google, containing 5,521 music clip–text description pairs.

  • Each clip is a 10-second audio segment sourced from YouTube (via AudioSet timing metadata)
  • Descriptions are written by professional musicians — roughly 4 sentences on average covering genre, instrumentation, mood, tempo, and timbre
  • An aspect list of keywords ("mellow piano melody", "fast-paced drums", etc.) is also included
  • License: CC-BY-SA 4.0

The dataset has columns such as ytid (YouTube ID), start_s, end_s, caption, and aspect_list. The actual audio must be fetched from YouTube with yt-dlp.

Note: Downloading videos from YouTube is prohibited under YouTube's Terms of Service. The methods described in this article are for research and educational purposes only and should be used at your own risk.

from datasets import load_dataset

ds = load_dataset("google/MusicCaps", split="train", token=HF_TOKEN)
# 5521 examples
# Features: ytid, start_s, end_s, audioset_positive_labels, aspect_list, caption, ...

Fetching audio with yt-dlp using --download-sections to clip the exact segment:

yt-dlp --quiet --no-warnings -x --audio-format wav -f bestaudio \
    -o "{ytid}.wav" \
    --download-sections "*{start_s}-{end_s}" \
    "https://www.youtube.com/watch?v={ytid}"

Note: Some clips may not be available. Without HF_TOKEN set in .env, the dataset is limited to 32 samples.
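The yt-dlp invocation above can be assembled programmatically; here is a small sketch (the helper name `build_ytdlp_cmd` is my own, not from the repository):

```python
def build_ytdlp_cmd(ytid: str, start_s: float, end_s: float, out_dir: str = ".") -> list[str]:
    """Build the yt-dlp argument list for clipping one MusicCaps segment."""
    return [
        "yt-dlp", "--quiet", "--no-warnings",
        "-x", "--audio-format", "wav", "-f", "bestaudio",
        "-o", f"{out_dir}/{ytid}.wav",
        "--download-sections", f"*{start_s}-{end_s}",
        f"https://www.youtube.com/watch?v={ytid}",
    ]

# Pass the result to subprocess.run(cmd, check=True) to download one clip.
```

Building the argument list instead of a shell string avoids quoting issues when ytid contains characters like `-` or `_`.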

MusicCaps dataset


CLAP (Contrastive Language-Audio Pretraining) is an open-source multimodal model from LAION — think of it as the audio counterpart of OpenAI's CLIP (images × text).

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation — Wu et al., ICASSP 2023 (arXiv:2211.06687)

Architecture

CLAP has an audio encoder (HTSAT-based) and a text encoder (RoBERTa-based), both trained contrastively so that their embeddings land in the same latent space.

CLAP architecture

Because audio and text embeddings share a common space, computing cosine distance is all it takes to do text → music retrieval.
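As an illustrative numpy sketch (not the project's code): once both modalities live in the same space, ranking reduces to cosine similarity, which for L2-normalized vectors is just a dot product.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; equals the dot product for unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D example: a text embedding compared against two audio embeddings
text = np.array([1.0, 0.0])
audio_match = np.array([0.9, 0.1])
audio_other = np.array([0.0, 1.0])
assert cosine_similarity(text, audio_match) > cosine_similarity(text, audio_other)
```

S3 Vectors reports cosine *distance* (1 − similarity), which is why lower values mean closer matches in the query results later in this post.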

Loading via HuggingFace Transformers

This project uses laion/clap-htsat-unfused loaded via HuggingFace Transformers.

import torch
from transformers import AutoModel, AutoTokenizer, AutoProcessor, ClapModel

MODEL_ID = "laion/clap-htsat-unfused"

# Use float32 on Apple Silicon (float16 causes MPS broadcast errors); float16 elsewhere
dtype = torch.float32 if torch.backends.mps.is_available() else torch.float16
model: ClapModel = AutoModel.from_pretrained(
    MODEL_ID, dtype=dtype, device_map="auto"
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

Text Embeddings

import numpy as np

def get_text_embeddings(texts: list[str]) -> list[float]:
    inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # get_text_features returns the projected, L2-normalized embeddings as a tensor
        features = model.get_text_features(**inputs)
    # Return the first text's embedding as a plain float list for S3 Vectors
    return features[0, :].cpu().numpy().astype(np.float32).tolist()

Audio Embeddings

from datasets import Dataset, Audio

def get_audio_embeddings(audio_path: str) -> list[float]:
    # Decode the file at CLAP's expected 48 kHz via the datasets Audio feature
    waveform = Dataset.from_dict({"audio": [audio_path]}).cast_column(
        "audio", Audio(sampling_rate=48000)
    )[0]["audio"]["array"]
    input_features = processor(
        audio=waveform, sampling_rate=48000, return_tensors="pt"
    )["input_features"].to(model.device)
    with torch.no_grad():
        # get_audio_features returns the projected, L2-normalized embeddings as a tensor
        features = model.get_audio_features(input_features=input_features)
    return features[0, :].cpu().numpy().astype(np.float32).tolist()

Both get_text_features and get_audio_features yield 512-dimensional L2-normalized vectors.
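The normalization CLAP applies is plain L2 scaling; an illustrative numpy check (not the model code):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm, as CLAP does before the contrastive loss."""
    return v / np.linalg.norm(v)

# A random 512-dim vector ends up on the unit sphere after normalization
vec = l2_normalize(np.random.default_rng(0).normal(size=512))
assert vec.shape == (512,)
assert np.isclose(np.linalg.norm(vec), 1.0)
```

Because the stored vectors are unit-length, the cosine metric configured on the S3 Vectors index below compares them directly without any rescaling.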


Download the Dataset and Audio Files

scripts/create_demo_datasets.py loads the MusicCaps metadata from HuggingFace Hub and downloads audio clips with yt-dlp. datasets.map with num_proc handles parallel downloads.

from datasets import Audio, load_dataset

ds = load_dataset(
    "google/MusicCaps", split="train",
    cache_dir=str(data_dir / "MusicCaps"),
    token=HF_TOKEN,
).select(range(samples_to_load))

ds = ds.map(
    lambda example: _process(example, data_dir / "audio"),
    num_proc=4,
).cast_column("audio", Audio(sampling_rate=44100))

Create Embeddings with CLAP

scripts/create_embeddings.py processes each downloaded clip through CLAP's audio encoder and saves the result as a per-ytid .npy file.

from tqdm import tqdm
from torch import Tensor

musiccaps_ds = (
    load_dataset("google/MusicCaps", split="train", ...)
    .map(link_audio_fn(str(data_dir / "audio")), batched=True)
    .filter(lambda d: os.path.exists(d["audio"]))
    .cast_column("audio", Audio(sampling_rate=48000))
    .map(process_audio_fn(processor, sampling_rate=48000))
)

for _data in tqdm(musiccaps_ds):
    _input = Tensor(_data["input_features"]).unsqueeze(0).to(model.device)
    # get_audio_features returns the projected embedding tensor directly
    audio_embed = model.get_audio_features(input_features=_input)
    np.save(
        data_dir / "embeddings" / f"{_data['ytid']}.npy",
        audio_embed[0, :].detach().cpu().numpy(),
    )

Provision S3 Vectors with AWS CDK

The vector bucket and index are provisioned with AWS CDK (TypeScript). aws-cdk-lib/aws-s3vectors provides L1 constructs that map directly to the CloudFormation resource types.

// cdk/lib/cdk-stack.ts
import * as s3vectors from "aws-cdk-lib/aws-s3vectors";

// Vector bucket
const vectorBucket = new s3vectors.CfnVectorBucket(
  this,
  "MusicCapVectorBucket",
  {
    vectorBucketName: "music-cap-vectors",
  },
);

// 512-dim float32 index with cosine similarity
const vectorIndex = new s3vectors.CfnIndex(this, "MusicEmbeddingsIndex", {
  vectorBucketName: vectorBucket.vectorBucketName!,
  indexName: "music-embeddings",
  dataType: "float32",
  dimension: 512,
  distanceMetric: "cosine",
});
vectorIndex.addDependency(vectorBucket);

Deploy with the CDK CLI:

cd cdk/
pnpm install
pnpm cdk bootstrap   # first time only
pnpm cdk deploy

On success you get:

CdkStack.VectorBucketArn = arn:aws:s3vectors:ap-northeast-1:123456789012:bucket/music-cap-vectors
CdkStack.VectorIndexArn  = arn:aws:s3vectors:ap-northeast-1:123456789012:bucket/music-cap-vectors/index/music-embeddings

PUT Vectors with boto3

scripts/upload_vectors.py handles ingestion. It first paginates through list_vectors to fetch existing keys so only new vectors are uploaded (incremental ingestion).

import boto3
import numpy as np

S3_BUCKET_NAME = "music-cap-vectors"
S3_INDEX_NAME  = "music-embeddings"
BATCH_SIZE     = 100

s3vectors = boto3.client("s3vectors", region_name="ap-northeast-1")

vectors_to_put = []

for sample in musiccaps_ds:
    ytid = sample["ytid"]
    embedding = np.load(f"data/embeddings/{ytid}.npy").astype(np.float32)
    vec = embedding.squeeze()

    vectors_to_put.append({
        "key": ytid,
        "data": {"float32": vec.tolist()},
        "metadata": {
            "ytid": ytid,
            "caption": str(sample["caption"]),
            "aspect_list": ", ".join(sample["aspect_list"]),
            "start_s": str(sample["start_s"]),
            "end_s": str(sample["end_s"]),
        },
    })

    if len(vectors_to_put) >= BATCH_SIZE:
        s3vectors.put_vectors(
            vectorBucketName=S3_BUCKET_NAME,
            indexName=S3_INDEX_NAME,
            vectors=vectors_to_put,
        )
        vectors_to_put = []

# Flush the final partial batch
if vectors_to_put:
    s3vectors.put_vectors(
        vectorBucketName=S3_BUCKET_NAME,
        indexName=S3_INDEX_NAME,
        vectors=vectors_to_put,
    )
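The incremental step mentioned above (skipping keys already in the index) can be sketched as a nextToken pagination loop; this is my own sketch assuming the documented list_vectors response shape, not the repository's exact code:

```python
def list_existing_keys(client, bucket: str, index: str) -> set[str]:
    """Collect every vector key already in the index via nextToken pagination."""
    keys: set[str] = set()
    kwargs = {"vectorBucketName": bucket, "indexName": index}
    while True:
        resp = client.list_vectors(**kwargs)
        keys.update(v["key"] for v in resp.get("vectors", []))
        token = resp.get("nextToken")
        if not token:
            return keys
        kwargs["nextToken"] = token

# Usage sketch: skip samples whose ytid is already indexed
# existing = list_existing_keys(s3vectors, S3_BUCKET_NAME, S3_INDEX_NAME)
# samples = [s for s in musiccaps_ds if s["ytid"] not in existing]
```

Taking the client as a parameter also makes the loop easy to unit-test with a stub.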

Metadata type constraint: all metadata values must be strings for S3 Vectors.
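Following that constraint, a small helper (the name `stringify_metadata` is my own, not from the repository) that coerces every value before upload:

```python
def stringify_metadata(sample: dict) -> dict[str, str]:
    """Coerce metadata values to strings, joining lists with ', '."""
    out: dict[str, str] = {}
    for key, value in sample.items():
        if isinstance(value, (list, tuple)):
            out[key] = ", ".join(str(v) for v in value)
        else:
            out[key] = str(value)
    return out
```

This keeps the per-field str() and ", ".join() calls out of the upload loop.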

Search for Music with a Text Query

demo/search.py embeds the query with CLAP's text encoder and calls query_vectors.

# demo/search.py
from botocore.config import Config
import boto3
from demo.config import S3_BUCKET_NAME, S3_INDEX_NAME

s3vectors_client = boto3.client(
    "s3vectors", "ap-northeast-1",
    config=Config(connect_timeout=10, read_timeout=10),
)

def search(embedding: list[float], top_k: int = 10):
    response = s3vectors_client.query_vectors(
        vectorBucketName=S3_BUCKET_NAME,
        indexName=S3_INDEX_NAME,
        queryVector={"float32": embedding},
        topK=top_k,
        returnMetadata=True,
        returnDistance=True,
    )
    return response["vectors"]

Example result (lower distance = more similar under cosine distance):

{
    'distance': 0.360,
    'key': '2unse6chkMU',
    'metadata': {
        'caption': 'This is a piece that would be suitable as calming study music...',
        'aspect_list': 'calming piano music, soothing, bedtime music, sleep music, piano, reverb, violin',
        'ytid': '2unse6chkMU',
        ...
    }
}

Build the Demo UI with Gradio

demo/app.py uses gr.Blocks to display results in a gr.Dataframe. Clicking a row triggers local audio playback if the file is available.

# demo/app.py
import gradio as gr
import pandas as pd
from demo.feature_extract import get_text_embeddings
from demo.local_data import get_local_audio_by_ytid
from demo.search import search

def do_search(query: str, top_k: int) -> tuple[pd.DataFrame, dict]:
    embedding = get_text_embeddings([query])
    results = search(embedding, top_k=int(top_k))
    rows = [
        {
            "rank": i + 1,
            "ytid": r["key"],
            "distance": round(r["distance"], 4),
            "caption": r.get("metadata", {}).get("caption", ""),
            "aspect_list": r.get("metadata", {}).get("aspect_list", ""),
        }
        for i, r in enumerate(results)
    ]
    return pd.DataFrame(rows), gr.update(value=None, visible=False)

def on_select(df: pd.DataFrame, evt: gr.SelectData) -> dict:
    ytid = str(df.iloc[evt.index[0]]["ytid"])
    audio_path = get_local_audio_by_ytid(ytid)
    return gr.update(value=audio_path, label=f"Preview: {ytid}", visible=True)

with gr.Blocks(title="MusicCap Search") as demo:
    gr.Markdown("# Text -> Audio Search on S3Vectors")
    with gr.Row():
        query_input = gr.Textbox(placeholder="ex: calming piano music with soft strings", label="Query", scale=4)
        top_k_slider = gr.Slider(minimum=1, maximum=30, value=10, step=1, label="Top K", scale=1)
    search_btn = gr.Button("Search", variant="primary")
    results_df = gr.Dataframe(
        headers=["rank", "ytid", "distance", "caption", "aspect_list"],
        label="Search results (click a row to play)",
        interactive=False, wrap=True,
    )
    audio_player = gr.Audio(label="Preview", visible=False)
    search_btn.click(fn=do_search, inputs=[query_input, top_k_slider], outputs=[results_df, audio_player])
    query_input.submit(fn=do_search, inputs=[query_input, top_k_slider], outputs=[results_df, audio_player])
    results_df.select(fn=on_select, inputs=[results_df], outputs=[audio_player])

demo.launch()

Launch the app with:

uv run python -m demo.app

Gradio demo UI

This is a simple validation to verify that the search system is working properly.

While all current vector data is computed from audio, the original MusicCaps dataset includes human-written captions and taxonomies. We test whether we can retrieve samples by searching with their ground truth captions and aspect lists.

For this demo, I randomly selected 1024 samples from the 5,521-sample dataset, successfully downloaded 960 of them, and built an S3 Vectors index. Queries use the caption text and aspect_list tags.

Example: id=-7B9tPuIP-w

caption

A male voice narrates a monologue to the rhythm of a song in the background. The song is fast tempo with enthusiastic drumming, groovy bass lines, cymbal ride, keyboard accompaniment, electric guitar and animated vocals. The song plays softly in the background as the narrator speaks and burgeons when he stops. The song is a classic Rock and Roll and the narration is a Documentary.

aspect_list

['r&b', 'soul', 'male vocal', 'melodic singing', 'strings sample', 'strong bass', 'electronic drums', 'sensual', 'groovy', 'urban']

Queries retrieve top_k=100 results ranked by cosine similarity, and we compute top-k retrieval accuracy: the fraction of queries whose ground-truth sample appears within the top k results.
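The metric can be sketched as a simple hit check over the ranked keys (illustrative, not the repository's eval script):

```python
def top_k_accuracy(results: list[list[str]], targets: list[str], k: int) -> float:
    """Fraction of queries whose ground-truth key appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(results, targets) if target in ranked[:k])
    return hits / len(targets)

# Two queries: the first recovers its own clip at rank 2, the second misses
ranked_lists = [["x", "q1", "y"], ["a", "b", "c"]]
print(top_k_accuracy(ranked_lists, ["q1", "q2"], k=10))  # 0.5
```

With one relevant item per query, this is the same number as Recall@k.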

Evaluation results

Results show that when searching with 960 samples, approximately 40% of queries return the original sample in the top 10.


S3 Vectors really is easy

A few lines of CDK to define the bucket and index, then plain put_vectors / query_vectors boto3 calls — no extra managed service to keep running. The developer experience is remarkably clean.

For prototypes or small-to-medium workloads (up to a few million vectors), S3 Vectors is well worth evaluating before reaching for Pinecone or Weaviate.

CLAP search quality

Using laion/clap-htsat-unfused via HuggingFace Transformers gave good results for queries about instruments, genre, and tempo. Emotional queries ("sad", "happy") were a bit noisier. It is worth noting that some MusicCaps captions may overlap with CLAP's training data, so benchmarking results should be interpreted with care.

Apple Silicon works too

Setting dtype=torch.float32 when torch.backends.mps.is_available() keeps things running on M-series Macs — float16 triggers MPS broadcast errors with this model. The M4 Max ran inference without issues.

