Tever

How to build an audio corpus for music research

An audio corpus is not a playlist: it is a structured, labelled, and reproducible collection. What it takes to make one genuinely useful for MIR research, and how to get started with accessible tools.

An audio corpus is not a playlist. A playlist is a listening sequence; a corpus is a collection with structure, labels, and documented recording conditions that allow an experiment to be reproduced. That difference matters because a MIR (Music Information Retrieval) system — whether for transcription, key detection, or instrument classification — is only as good as the data you train or evaluate it on.

Here I explain how to put a corpus together from scratch with accessible tools, and which decisions separate a folder of files from a resource someone else can actually reuse.

[Figure: infographic of the audio-corpus pipeline, from score to digitisation, to audio (waveform and spectrogram), to metadata annotation, and finally to the structured corpus. Figure labels are in Spanish.]

The full journey of an audio corpus, from score to structured data. The example, Beethoven’s Sonata No. 14 in a historic public-domain recording (Schnabel, 1932), shows how each recording is paired with metadata that makes it filterable, comparable, and reproducible.

What defines a corpus (and what separates it from a collection)

Three properties distinguish a corpus from a folder full of WAV files:

  • Structure: the files are organised according to an explicit criterion. Whether that criterion is instrument, performer, recording context, or analysis task does not matter — what matters is that it is stated and applied consistently.
  • Labels: every file is accompanied by metadata describing its contents. Without labels, a corpus cannot serve as training data or as an evaluation reference.
  • Reproducibility: someone else should be able to use the corpus and obtain the same results. That requires documenting how each recording was made, with what equipment, and under what conditions.

If any of the three is missing, you have material, not a corpus.

Minimum viable equipment

You do not need a recording studio. You need control over noise.

The minimum that works:

  • A mid-range condenser microphone (or, failing that, a recent smartphone recording at 44.1 kHz / 16-bit WAV, uncompressed). Capsule quality matters more than the brand of interface.
  • A field recorder — something in the H4n range — if you will be recording outside the studio or in live performance contexts. Recording to SD card avoids USB latency and computer fan noise.
  • A controlled quiet space: a room without noticeable reverb, no audible HVAC, no background traffic. Noise that enters the recording cannot be fully removed in post-processing.
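One way to put a number on "control over noise" is to record a few seconds of room tone with nobody playing and measure its level. The following is a minimal sketch, assuming 16-bit PCM WAV input; the function name `rms_dbfs` is illustrative, and what counts as "quiet enough" is a judgment you would have to set for your own room and task.

```python
import math
import struct
import wave


def rms_dbfs(path):
    """Return the RMS level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())

    # Samples from all channels are pooled: enough for a room-tone check.
    count = len(raw) // 2
    if count == 0:
        raise ValueError("file contains no audio frames")
    samples = struct.unpack("<%dh" % count, raw[: count * 2])

    mean_square = sum(s * s for s in samples) / count
    if mean_square == 0:
        return float("-inf")  # digital silence
    # Full scale for signed 16-bit audio is taken as 32768.
    return 10 * math.log10(mean_square / (32768.0 ** 2))
```

Measuring the same room-tone clip before every session, and noting the value in the README, gives later users of the corpus a rough idea of the noise conditions behind each recording.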

The practical rule: record in uncompressed WAV from the start. MP3 discards information you may need for spectral analysis.
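The rule is easy to enforce automatically before a file enters the corpus. Here is a small sketch using Python's standard `wave` module, which reads uncompressed PCM WAV only; the function name `check_wav_format` and the default targets (44.1 kHz, 16-bit) are assumptions taken from the equipment list above, not a requirement of any tool.

```python
import wave


def check_wav_format(path, expected_rate=44100, expected_bits=16):
    """Return a list of problems; an empty list means the file passes."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            bits = wav.getsampwidth() * 8
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # Raised for MP3s renamed to .wav, truncated files, and similar cases.
        return ["not a readable PCM WAV file: %s" % exc]

    problems = []
    if rate != expected_rate:
        problems.append("sample rate is %d Hz, expected %d" % (rate, expected_rate))
    if bits != expected_bits:
        problems.append("bit depth is %d, expected %d" % (bits, expected_bits))
    if frames == 0:
        problems.append("file contains no audio frames")
    return problems
```

Returning a list rather than raising on the first problem makes it practical to run the check over a whole recording session and fix everything in one pass.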

Software: from recording to annotation

Audacity is the natural starting point. It is free, cross-platform, and lets you record, trim, normalise, and export in the formats any MIR pipeline needs. For an initial corpus, it is enough.

When analysis demands become more exacting, Sonic Visualiser enters the picture. It does not record, but it lets you display spectrograms, add temporal annotation layers (onset, pitch, segmentation), and export them as CSV or in its own .svl layer format. It is the tool I use for detailed temporal annotation.
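Once annotations are exported as CSV, reading them back takes a few lines. The sketch below assumes each row is `time_in_seconds,label` with an optional header row; that layout is an assumption, so check what your export actually contains. The function names `read_onsets` and `inter_onset_intervals` are illustrative.

```python
import csv


def read_onsets(lines):
    """Parse rows of `time_in_seconds[,label]` into a time-sorted list of (time, label)."""
    onsets = []
    for index, row in enumerate(csv.reader(lines)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        try:
            time_s = float(row[0])
        except ValueError:
            if index == 0:
                continue  # assume a non-numeric first row is a header
            raise ValueError("row %d has a non-numeric time: %r" % (index + 1, row[0]))
        label = row[1].strip() if len(row) > 1 else ""
        onsets.append((time_s, label))
    onsets.sort(key=lambda pair: pair[0])
    return onsets


def inter_onset_intervals(onsets):
    """Return the gaps in seconds between consecutive onsets."""
    times = [time_s for time_s, _ in onsets]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

`read_onsets` accepts any iterable of text lines, so it works both on an open file (opened with `newline=""`, as the `csv` module recommends) and on a list of strings in a test.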

For larger corpora with collaborative annotation or dataset version control, tools such as Label Studio or Praat (the latter speech-oriented but useful for pitch analysis) cover needs Audacity cannot reach.

Metadata: what to record to make the corpus useful

Metadata is half the work. Without it, recordings cannot be filtered, reproduced, or compared. At minimum:

Field        Description
instrument   Canonical name of the recorded instrument
performer    Performer identifier (can be anonymised)
context      Studio / field / live performance
date         Recording date (ISO 8601)
recorder     Device and microphone used
sample_rate  Sampling frequency in Hz
bit_depth    Bit depth (16 / 24)
duration_s   Duration in seconds
annotation   Path to annotation file if one exists

Further fields depend on the task: a melody-detection corpus needs a reference transcription; an instrument-identification corpus uses the instrument itself as the label.
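A table like this is only useful if every row actually follows it, so it helps to validate rows as they are added. The sketch below checks one row, given as a dict of strings in the form `csv.DictReader` produces. The vocabulary in `ALLOWED_CONTEXTS` (`studio`, `field`, `live`) and the function name `validate_row` are assumptions for illustration; the field names come from the table above.

```python
from datetime import date

REQUIRED_FIELDS = (
    "instrument", "performer", "context", "date", "recorder",
    "sample_rate", "bit_depth", "duration_s", "annotation",
)
ALLOWED_CONTEXTS = {"studio", "field", "live"}  # assumed controlled vocabulary
ALLOWED_BIT_DEPTHS = {"16", "24"}


def validate_row(row):
    """Return a list of problems found in one metadata row (a dict of strings)."""
    problems = ["missing field: %s" % f for f in REQUIRED_FIELDS if f not in row]
    if problems:
        return problems

    for field in ("instrument", "performer", "recorder"):
        if not row[field].strip():
            problems.append("empty value: %s" % field)
    if row["context"] not in ALLOWED_CONTEXTS:
        problems.append("unknown context: %r" % row["context"])
    try:
        date.fromisoformat(row["date"])
    except ValueError:
        problems.append("date is not YYYY-MM-DD: %r" % row["date"])
    if not (row["sample_rate"].isdigit() and int(row["sample_rate"]) > 0):
        problems.append("sample_rate must be a positive integer")
    if row["bit_depth"] not in ALLOWED_BIT_DEPTHS:
        problems.append("bit_depth must be 16 or 24")
    try:
        duration_ok = float(row["duration_s"]) > 0
    except ValueError:
        duration_ok = False
    if not duration_ok:
        problems.append("duration_s must be a positive number")
    # `annotation` may legitimately be empty: not every recording is annotated.
    return problems
```

Keeping the vocabulary in one place (here, two small sets) is what makes later filtering reliable: a corpus where the same context appears as "Studio", "studio", and "estudio" cannot be filtered by a simple equality test.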

File organisation

A flat structure does not scale. A structure that works:

corpus/
  metadata.csv          # master table (one row per recording)
  recordings/
    <id>_<context>.wav  # audio files with a consistent ID scheme
  annotations/
    <id>.csv            # per-file temporal annotations
  README.md             # recording protocol and criteria

The corpus README.md is as important as the data itself: it must explain who recorded, when, with what equipment, and following what protocol. Without that document, the corpus is not reproducible.
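The layout above can be checked mechanically, which catches the most common drift: audio with no metadata row, or rows pointing at files that no longer exist. This is a sketch under two assumptions that go beyond the text: that `metadata.csv` has an `id` column alongside `context`, and that the `annotation` column stores a path relative to the corpus root. The function name `check_layout` is illustrative.

```python
import csv
from pathlib import Path


def check_layout(corpus_dir):
    """Return a sorted list of mismatches between metadata.csv and the files on disk."""
    root = Path(corpus_dir)
    problems = []
    if not (root / "README.md").is_file():
        problems.append("missing README.md")
    metadata_path = root / "metadata.csv"
    if not metadata_path.is_file():
        problems.append("missing metadata.csv")
        return sorted(problems)

    with metadata_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    expected = set()
    for row in rows:
        rec_id = (row.get("id") or "").strip()        # assumed column
        context = (row.get("context") or "").strip()
        name = "%s_%s.wav" % (rec_id, context)         # <id>_<context>.wav
        expected.add(name)
        if not (root / "recordings" / name).is_file():
            problems.append("no audio for row: recordings/%s" % name)
        annotation = (row.get("annotation") or "").strip()
        if annotation and not (root / annotation).is_file():
            problems.append("annotation not found: %s" % annotation)

    recordings_dir = root / "recordings"
    if recordings_dir.is_dir():
        for path in recordings_dir.glob("*.wav"):
            if path.name not in expected:
                problems.append("audio without metadata row: recordings/%s" % path.name)
    return sorted(problems)
```

Running a check like this before each release of the corpus, and fixing whatever it reports, keeps the master table and the folders from drifting apart as recordings are added.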

Start small

The most common mistake when building a corpus is chasing exhaustiveness from the outset. You do not need it. A small but well-annotated initial set — with clear criteria for what gets recorded, how, and why — is worth more than thousands of unlabelled files. The question that should guide the design is not “how much audio can I gather” but “what do I want to be able to evaluate with this data”.

That is the intersection between lab tinkering and research rigour: building the data that does not exist in order to ask the question that cannot be answered without it.
