---
title: "How to build an audio corpus for music research"
slug: construir-corpus-audio.en
kind: guide
summary: "An audio corpus is not a playlist: it is a structured, labelled, and reproducible collection. What it takes to make one genuinely useful for MIR research, and how to get started with accessible tools."
publishedAt: 2026-06-22
updatedAt: 2026-06-22
---
import { Image } from "astro:assets";
import infografiaCorpus from "../../assets/blog/posts/infografias/infografia-corpus-audio.jpg";

An audio corpus is not a playlist. A playlist is a listening sequence; a corpus
is a collection with structure, labels, and documented recording conditions that
allow an experiment to be reproduced. That difference matters because a **MIR**
(Music Information Retrieval) system — whether for transcription, key detection,
or instrument classification — is only as good as the data you train or evaluate
it on.

Here I explain how to put a corpus together from scratch with accessible tools,
and which decisions separate a folder of files from a resource someone else can
actually reuse.

<figure>
  <Image
    src={infografiaCorpus}
    alt="Infographic of the audio-corpus pipeline: from score to digitisation, to audio (waveform and spectrogram), to metadata annotation, and finally to the structured corpus. The example uses a public-domain Beethoven recording. The figure labels are in Spanish."
    widths={[480, 768, 1200]}
    sizes="(min-width: 760px) 680px, 92vw"
    loading="lazy"
  />
  <figcaption>
    The full journey of an audio corpus, from score to structured data. The
    example, Beethoven's Sonata No. 14 in a historic public-domain recording
    (Schnabel, 1932), shows how each recording is paired with metadata that
    makes it filterable, comparable, and reproducible.
  </figcaption>
</figure>

## What defines a corpus (and what separates it from a collection)

Three properties distinguish a corpus from a folder full of WAV files:

- **Structure**: the files are organised according to an explicit criterion.
  Whether that criterion is instrument, performer, recording context, or analysis
  task does not matter — what matters is that it is stated and applied
  consistently.
- **Labels**: every file is accompanied by metadata describing its contents.
  Without labels, a corpus cannot serve as training data or as an evaluation
  reference.
- **Reproducibility**: someone else should be able to use the corpus and obtain
  the same results. That requires documenting how each recording was made, with
  what equipment, and under what conditions.

If any of the three is missing, you have material, not a corpus.

## Minimum viable equipment

You do not need a recording studio. You need control over noise.

The minimum that works:

- **A mid-range condenser microphone** (or, failing that, a recent smartphone
  recording at 44.1 kHz / 16-bit WAV, uncompressed). Capsule quality matters
  more than the brand of interface.
- **A field recorder**, something in the Zoom H4n class, if you will be
  recording outside the studio or in live performance contexts. Recording to SD
  card avoids USB latency and computer fan noise.
- **A controlled quiet space**: a room without noticeable reverb, no audible
  HVAC, no background traffic. Noise that enters the recording cannot be fully
  removed in post-processing.
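
One way to put a number on "control over noise" is to record a few seconds of the empty room and measure its level. The sketch below is a minimal illustration, not a calibrated measurement: it assumes a 16-bit PCM WAV file, and the function name `rms_dbfs` is hypothetical. It reports the RMS level in dB relative to full scale, so a more negative value means a quieter room.

```python
import array
import math
import sys
import wave


def rms_dbfs(path):
    """RMS level of a 16-bit PCM WAV file, in dB relative to full scale."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        # All channels are read interleaved and measured together.
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("file contains no audio frames")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / 32768 ** 2)


if __name__ == "__main__":
    # Usage: python noise_floor.py room_tone.wav
    if len(sys.argv) > 1:
        print(f"{rms_dbfs(sys.argv[1]):.1f} dBFS")
```

The figure is only comparable between takes made with the same device and gain setting, which is one more reason to write those settings down.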

The practical rule: record in uncompressed WAV from the start. MP3 discards
information you may need for spectral analysis.
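
A quick way to confirm that a file really is what you think it is: Python's standard `wave` module reads the header of an uncompressed PCM WAV file. The sketch below is illustrative; the helper names `inspect_wav` and `meets_minimum` are hypothetical, and the thresholds simply restate the 44.1 kHz / 16-bit floor mentioned above.

```python
import os
import tempfile
import wave


def inspect_wav(path):
    """Return the basic technical properties of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": round(frames / rate, 3),
        }


def meets_minimum(props, min_rate=44100, min_bits=16):
    """Check a file against a minimum sample rate and bit depth."""
    return props["sample_rate"] >= min_rate and props["bit_depth"] >= min_bits


if __name__ == "__main__":
    # Write one second of 16-bit mono silence so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as folder:
        demo = os.path.join(folder, "demo.wav")
        with wave.open(demo, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)  # 2 bytes = 16 bits
            out.setframerate(44100)
            out.writeframes(b"\x00\x00" * 44100)
        props = inspect_wav(demo)
        print(props, meets_minimum(props))
```

The values it returns map directly onto the technical columns of a metadata table, so they can be filled in by script rather than typed by hand.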

## Software: from recording to annotation

**Audacity** is the natural starting point. It is free, cross-platform, and lets
you record, trim, normalise, and export in the formats any MIR pipeline needs.
For an initial corpus, it is enough.

When analysis demands become more exacting, **Sonic Visualiser** enters the
picture. It does not record, but it lets you display spectrograms, add temporal
annotation layers (onset, pitch, segmentation) and export them in standard
formats like CSV or `.svl`. It is the tool I use for detailed temporal annotation.
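
Once annotations are exported as CSV, they are easy to read back in a script. The sketch below assumes a simple two-column layout, time in seconds followed by an optional label. That layout is an assumption: what an export actually contains depends on the layer type and export options, so check one of your own files first. The function names are hypothetical.

```python
import csv
import io


def load_onsets(handle):
    """Read (time_s, label) pairs from an annotation CSV, sorted by time."""
    events = []
    for row in csv.reader(handle):
        if not row or not row[0].strip():
            continue  # skip blank lines
        try:
            time_s = float(row[0])
        except ValueError:
            continue  # skip a header line, if the export wrote one
        label = row[1].strip() if len(row) > 1 else ""
        events.append((time_s, label))
    return sorted(events)


def inter_onset_intervals(events):
    """Time differences between consecutive onsets, in seconds."""
    times = [t for t, _ in events]
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    sample = io.StringIO("0.50,onset\n1.25,onset\n2.00,onset\n")
    events = load_onsets(sample)
    print(events)
    print(inter_onset_intervals(events))
```

Passing an open file handle instead of the `io.StringIO` sample works the same way.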

For larger corpora that need collaborative annotation or dataset version
control, Audacity is no longer enough. Tools such as **Label Studio**
(collaborative labelling) or **Praat** (speech-oriented, but useful for pitch
analysis) cover needs it cannot reach.

## Metadata: what to record to make the corpus useful

Metadata is half the work. Without it, recordings cannot be filtered, reproduced,
or compared. At minimum:

| Field | Description |
|---|---|
| `instrument` | Canonical name of the recorded instrument |
| `performer` | Performer identifier (can be anonymised) |
| `context` | Studio / field / live performance |
| `date` | Recording date (ISO 8601) |
| `recorder` | Device and microphone used |
| `sample_rate` | Sampling frequency in Hz |
| `bit_depth` | Bit depth in bits per sample (16 or 24) |
| `duration_s` | Duration in seconds |
| `annotation` | Path to annotation file if one exists |

Further fields depend on the task: a melody-detection corpus needs a reference
transcription; an instrument-identification corpus uses the instrument itself as
the label.
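
The minimum fields can be written down as a small record type that checks itself before a row reaches the master table. This is a sketch under stated assumptions, not a fixed schema: the `id` column is added here so that rows can be matched to files, the allowed values in `CONTEXTS` are an assumed vocabulary for `context`, and the names `Recording`, `problems`, and `write_metadata` are hypothetical.

```python
import csv
import datetime
import io
from dataclasses import asdict, dataclass, fields

CONTEXTS = ("studio", "field", "live")  # assumed vocabulary for `context`


@dataclass
class Recording:
    """One metadata row, using the minimum fields from the table above."""
    id: str              # assumed extra column linking the row to its file
    instrument: str
    performer: str
    context: str
    date: str            # ISO 8601, e.g. 2026-06-22
    recorder: str
    sample_rate: int
    bit_depth: int
    duration_s: float
    annotation: str = ""  # empty when no annotation file exists

    def problems(self):
        """Return a list of human-readable validation problems."""
        found = []
        if self.context not in CONTEXTS:
            found.append(f"unknown context: {self.context}")
        try:
            datetime.date.fromisoformat(self.date)
        except ValueError:
            found.append(f"date is not ISO 8601: {self.date}")
        if self.bit_depth not in (16, 24):
            found.append(f"unexpected bit depth: {self.bit_depth}")
        if self.sample_rate <= 0 or self.duration_s <= 0:
            found.append("sample_rate and duration_s must be positive")
        return found


def write_metadata(rows, handle):
    """Write rows as CSV with a header, one row per recording."""
    names = [f.name for f in fields(Recording)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


if __name__ == "__main__":
    row = Recording("r001", "cello", "p01", "studio", "2026-06-22",
                    "field recorder, built-in mics", 44100, 16, 12.5)
    print(row.problems())
    buffer = io.StringIO()
    write_metadata([row], buffer)
    print(buffer.getvalue())
```

Task-specific fields, such as a path to a reference transcription, can be added as further attributes with defaults without disturbing existing rows.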

## File organisation

A flat structure does not scale. A structure that works:

```
corpus/
  metadata.csv          # master table (one row per recording)
  recordings/
    <id>_<context>.wav  # audio files with a consistent ID scheme
  annotations/
    <id>.csv            # per-file temporal annotations
  README.md             # recording protocol and criteria
```

The corpus `README.md` is as important as the data itself: it must explain who
recorded, when, with what equipment, and following what protocol. Without that
document, the corpus is not reproducible.
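
A layout like this can be checked by script, which catches the most common drift: audio with no metadata row, and rows with no audio. The sketch below assumes that `metadata.csv` has `id`, `context`, and `annotation` columns, and that `annotation` holds a path relative to the corpus root. Both are assumptions about your table, and `check_layout` is a hypothetical name.

```python
import csv
from pathlib import Path


def check_layout(root):
    """Return a list of inconsistencies between metadata.csv and the files."""
    root = Path(root)
    issues = []
    if not (root / "README.md").is_file():
        issues.append("missing README.md")
    meta = root / "metadata.csv"
    if not meta.is_file():
        return issues + ["missing metadata.csv"]

    expected = set()
    with meta.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # File names follow the <id>_<context>.wav scheme.
            name = f"{row['id']}_{row['context']}.wav"
            expected.add(name)
            if not (root / "recordings" / name).is_file():
                issues.append(f"missing recording: {name}")
            note = row.get("annotation") or ""
            if note and not (root / note).is_file():
                issues.append(f"missing annotation: {note}")

    # Audio on disk that no metadata row describes.
    for wav in sorted((root / "recordings").glob("*.wav")):
        if wav.name not in expected:
            issues.append(f"unlabelled recording: {wav.name}")
    return issues


if __name__ == "__main__":
    for issue in check_layout("corpus"):
        print(issue)
```

An empty list means every row has its recording, every listed annotation exists, and nothing in `recordings/` is unlabelled.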

## Start small

The most common mistake when building a corpus is chasing exhaustiveness from the
outset. You do not need it. A small but well-annotated initial set — with clear
criteria for what gets recorded, how, and why — is worth more than thousands of
unlabelled files. The question that should guide the design is not "how much audio
can I gather" but "what do I want to be able to evaluate with this data".

That is the intersection between lab tinkering and research rigour: building the
data that does not exist in order to ask the question that cannot be answered
without it.

## References

The references this article draws on, and where to read further:

- Müller, M. (2015). [*Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications*](https://doi.org/10.1007/978-3-319-21945-5). Springer.
- Cannam, C., Landone, C. and Sandler, M. (2010). ["Sonic Visualiser: An Application for Viewing and Analysing Music Audio Files"](https://doi.org/10.1145/1873951.1874248). In *Proceedings of the ACM Multimedia International Conference*.
- Wilkinson, M. D. et al. (2016). ["The FAIR Guiding Principles for scientific data management and stewardship"](https://doi.org/10.1038/sdata.2016.18). *Scientific Data*, 3, 160018.
- Free tools: [Audacity](https://www.audacityteam.org) · [Sonic Visualiser](https://www.sonicvisualiser.org).