How to build an audio corpus for music research
An audio corpus is not a playlist: it is a structured, labelled, and reproducible collection. What it takes to make one genuinely useful for MIR research, and how to get started with accessible tools.
An audio corpus is not a playlist. A playlist is a listening sequence; a corpus is a collection with structure, labels, and documented recording conditions that allow an experiment to be reproduced. That difference matters because a MIR (Music Information Retrieval) system — whether for transcription, key detection, or instrument classification — is only as good as the data you train or evaluate it on.
Here I explain how to put a corpus together from scratch with accessible tools, and which decisions separate a folder of files from a resource someone else can actually reuse.

Figure: The full journey of an audio corpus, from score to structured data. The example, Beethoven’s Sonata No. 14 in a historic public-domain recording (Schnabel, 1932), shows how each recording is paired with metadata that makes it filterable, comparable, and reproducible.
What defines a corpus (and what separates it from a collection)
Three properties distinguish a corpus from a folder full of WAV files:
- Structure: the files are organised according to an explicit criterion. Whether that criterion is instrument, performer, recording context, or analysis task does not matter — what matters is that it is stated and applied consistently.
- Labels: every file is accompanied by metadata describing its contents. Without labels, a corpus cannot serve as training data or as an evaluation reference.
- Reproducibility: someone else should be able to use the corpus and obtain the same results. That requires documenting how each recording was made, with what equipment, and under what conditions.
If any of the three is missing, you have material, not a corpus.
Minimum viable equipment
You do not need a recording studio. You need control over noise.
The minimum that works:
- A mid-range condenser microphone (or, failing that, a recent smartphone recording at 44.1 kHz / 16-bit WAV, uncompressed). Capsule quality matters more than the brand of interface.
- A field recorder — something in the H4n range — if you will be recording outside the studio or in live performance contexts. Recording to SD card avoids USB latency and computer fan noise.
- A controlled quiet space: a room without noticeable reverb, no audible HVAC, no background traffic. Noise that enters the recording cannot be fully removed in post-processing.
The practical rule: record in uncompressed WAV from the start. MP3 discards information you may need for spectral analysis.
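A quick way to enforce that rule is to read each file's header before it enters the corpus. The sketch below uses only Python's standard `wave` module; the function name `check_wav` and the default targets (44.1 kHz, 16-bit, taken from the equipment list above) are my own illustrative choices, not a fixed requirement.

```python
import os
import tempfile
import wave


def check_wav(path, expected_rate=44100, expected_bits=16):
    """Return a list of problems found in a WAV header (empty if none)."""
    problems = []
    # wave.open raises wave.Error if the file is not a PCM WAV,
    # for example an MP3 that was merely renamed to .wav.
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if bits != expected_bits:
        problems.append(f"bit depth is {bits}, expected {expected_bits}")
    return problems


if __name__ == "__main__":
    # Demo on a generated one-second silent file, so the sketch runs
    # without any real recordings on disk.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.wav")
        with wave.open(demo, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)  # 2 bytes = 16-bit
            out.setframerate(44100)
            out.writeframes(b"\x00\x00" * 44100)
        print(check_wav(demo))  # an empty list means the header matches
```

If you record at 24-bit, pass `expected_bits=24`. The point is only that the check is automatic and applied to every file.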
Software: from recording to annotation
Audacity is the natural starting point. It is free, cross-platform, and lets you record, trim, normalise, and export in the formats any MIR pipeline needs. For an initial corpus, it is enough.
When analysis demands become more exacting, Sonic Visualiser enters the picture. It does not record, but it lets you display spectrograms, add temporal annotation layers (onset, pitch, segmentation), and export them as CSV or in its own .svl layer format. It is the tool I use for detailed temporal annotation.
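Once annotations are exported as CSV, they can be read back with a few lines of Python. The sketch below assumes a headerless two-column file (time in seconds, then a label). The real column layout depends on the layer type and the export options you choose, so treat that layout and the function name `read_onsets` as assumptions to adapt.

```python
import csv
import io


def read_onsets(handle):
    """Parse (time_s, label) pairs from a headerless two-column CSV."""
    events = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        time_s = float(row[0])  # raises ValueError on a header or garbled row
        label = row[1].strip() if len(row) > 1 else ""
        # Annotation times should never decrease; a violation usually
        # means two layers were pasted together or a row was mistyped.
        if events and time_s < events[-1][0]:
            raise ValueError(f"line {line_no}: time goes backwards ({time_s})")
        events.append((time_s, label))
    return events


if __name__ == "__main__":
    # Made-up sample values standing in for an exported layer.
    sample = io.StringIO("0.000,onset\n0.512,onset\n1.030,onset\n")
    for time_s, label in read_onsets(sample):
        print(f"{time_s:.3f}s {label}")
```

Failing loudly on out-of-order times is a deliberate choice: an annotation file that loads silently but is wrong does more damage than one that refuses to load.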
For larger corpora with collaborative annotation or dataset version control, tools such as Label Studio or Praat (the latter speech-oriented but useful for pitch analysis) cover needs Audacity cannot reach.
Metadata: what to record to make the corpus useful
Metadata is half the work. Without it, recordings cannot be filtered, reproduced, or compared. At minimum:
| Field | Description |
|---|---|
| instrument | Canonical name of the recorded instrument |
| performer | Performer identifier (can be anonymised) |
| context | Studio / field / live performance |
| date | Recording date (ISO 8601) |
| recorder | Device and microphone used |
| sample_rate | Sampling frequency in Hz |
| bit_depth | Bit depth (16 / 24) |
| duration_s | Duration in seconds |
| annotation | Path to annotation file if one exists |
Further fields depend on the task: a melody-detection corpus needs a reference transcription; an instrument-identification corpus uses the instrument itself as the label.
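Three of the minimum fields (sample_rate, bit_depth, duration_s) can be measured from the file rather than typed by hand, which removes one source of error. The following is a minimal sketch using the standard `wave` and `csv` modules. The helper names `technical_fields`, `metadata_row`, and `write_metadata` are hypothetical, and the column order simply mirrors the field list above.

```python
import csv
import wave

# Column names follow the minimum field list in this section.
FIELDS = ["instrument", "performer", "context", "date", "recorder",
          "sample_rate", "bit_depth", "duration_s", "annotation"]


def technical_fields(wav_path):
    """Read sample_rate, bit_depth and duration_s from the WAV header."""
    with wave.open(str(wav_path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes to bits
            "duration_s": round(wav.getnframes() / rate, 3),
        }


def metadata_row(wav_path, **descriptive):
    """Merge hand-entered fields with the ones measured from the file."""
    # Measured values come last so they override anything typed by hand.
    row = {**descriptive, **technical_fields(wav_path)}
    missing = [name for name in FIELDS if name not in row]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return row


def write_metadata(rows, out):
    """Write rows as CSV with a header, in the fixed FIELDS order."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

The descriptive fields are passed as keyword arguments (`instrument=...`, `performer=...`, and so on), with `annotation=""` when no annotation file exists yet. A row with a missing field is rejected instead of being written with a silent gap.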
File organisation
A flat structure does not scale. A structure that works:
corpus/
    metadata.csv            # master table (one row per recording)
    recordings/
        <id>_<context>.wav  # audio files with a consistent ID scheme
    annotations/
        <id>.csv            # per-file temporal annotations
    README.md               # recording protocol and criteria
The corpus README.md is as important as the data itself: it must explain who recorded, when, with what equipment, and following what protocol. Without that document, the corpus is not reproducible.
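A layout like this is only useful if the master table and the files on disk agree, and that can be checked mechanically. The sketch below is one way to do it, under one assumption that goes beyond the text: metadata.csv carries an `id` column (the field list above does not include one) that, together with `context`, produces the `<id>_<context>.wav` file name. The function name `check_layout` is likewise illustrative.

```python
import csv
from pathlib import Path


def check_layout(root):
    """Return inconsistencies between metadata.csv and the files on disk."""
    root = Path(root)
    problems = []
    # README.md carries the recording protocol, so its absence is reported too.
    for required in ("metadata.csv", "README.md"):
        if not (root / required).is_file():
            problems.append(f"missing {required}")
    if not (root / "metadata.csv").is_file():
        return problems  # nothing further can be checked without the table

    with open(root / "metadata.csv", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # ASSUMPTION: an "id" column exists; it is not in the field table.
            audio = root / "recordings" / f"{row['id']}_{row['context']}.wav"
            if not audio.is_file():
                problems.append(
                    f"no recording for {row['id']}: expected {audio.name}"
                )
            # The annotation path is optional and read relative to the root.
            if row.get("annotation") and not (root / row["annotation"]).is_file():
                problems.append(
                    f"annotation listed but not found: {row['annotation']}"
                )
    return problems
```

Running `check_layout("corpus")` before every release of the corpus gives an empty list when the table and the folders match, and a readable list of gaps when they do not.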
Start small
The most common mistake when building a corpus is chasing exhaustiveness from the outset. You do not need it. A small but well-annotated initial set — with clear criteria for what gets recorded, how, and why — is worth more than thousands of unlabelled files. The question that should guide the design is not “how much audio can I gather” but “what do I want to be able to evaluate with this data”.
That is the intersection between lab tinkering and research rigour: building the data that does not exist in order to ask the question that cannot be answered without it.
References
The references this article draws on, and where to read further:
- Müller, M. (2015). Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer.
- Cannam, C., Landone, C. and Sandler, M. (2010). “Sonic Visualiser: An Open Source Application for Viewing, Analysing, and Annotating Music Audio Files”. In Proceedings of the ACM Multimedia 2010 International Conference.
- Wilkinson, M. D. et al. (2016). “The FAIR Guiding Principles for scientific data management and stewardship”. Scientific Data, 3, 160018.
- Free tools: Audacity · Sonic Visualiser.
Frequently asked questions
- How many recordings does it take to start a useful audio corpus?
  There is no magic number. A functional corpus for a first AMT (automatic music transcription) experiment can start from 20 to 50 recordings, provided they are well labelled and the recording protocol is consistent. What hurts is not size but undocumented heterogeneity: mixing microphones, rooms, and recording conditions without logging it makes the corpus hard to reproduce. The step-by-step construction strategy is in How to build an audio corpus for music research.