# Music Information Retrieval tools for Algorithmic Composition

The field of music information retrieval (MIR), while relatively young, already abounds with interesting mathematical techniques. Many of these techniques take as a point of departure a collection of ‘features’ or ‘feature vectors’; e.g., spectral flux, windowed root-mean-square (RMS) energy, or short-time Fourier transform vectors. The hope is to glean some recognition of genre or mood from properties of these feature vectors.

One concept that is occasionally used is the self-similarity matrix (SSM). Assume that we begin with a set of $n$ feature vectors for which we would like to construct the corresponding SSM. By definition, this matrix is the $n \times n$ matrix whose entry at position $(i, j)$ is given by a ‘similarity’ (e.g., the Euclidean distance) between vectors $v_i$ and $v_j$. Here is a graphical example of an SSM of approximately $100$ windows of a spectrogram:

This matrix may remind you of a related concept from probability. Given a set of states, we can define a stochastic matrix, which specifies the probability of transitioning from one state to another. In other words, the entry at position $(i, j)$ of the stochastic matrix is the probability of transitioning from state $i$ to state $j$. Here is an example of a stochastic matrix (note how the rows sum to $1$):
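As a minimal concrete sketch, here is a small hand-written stochastic matrix (the probabilities are invented for illustration), along with a single sampled transition:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3-state stochastic matrix: row i holds the probabilities of
# moving from state i to each of the states, so every row sums to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)

# One transition step: starting from state 0, sample the next state
# according to row 0 of the matrix.
state = 0
next_state = rng.choice(3, p=P[state])
```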

Given an SSM, we can obtain a stochastic matrix by normalizing each row to sum to $1$. A stochastic matrix can then drive an algorithmic process such as a Markov chain. The main question is how to map the different ‘states’ to audio. One approach, assuming the SSM has been computed from a spectrogram, is to map each window to a stochastic frequency cloud, in which the various frequencies occur in proportion to the energy in the corresponding bins. Here’s a basic audio example:
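A rough sketch of this pipeline in NumPy might look like the following. One assumption worth flagging: since an SSM built from Euclidean distances marks *similar* windows with *small* values, the sketch first converts distances to similarities via $e^{-d}$ before normalizing rows, so that similar windows receive higher transition probability. That conversion is one possible choice, not the only one.

```python
import numpy as np

def ssm_to_stochastic(ssm, distances=True):
    """Row-normalize an SSM into a stochastic matrix.

    If the SSM stores distances rather than similarities, convert
    with exp(-d) first (an arbitrary but common choice)."""
    sim = np.exp(-np.asarray(ssm, dtype=float)) if distances \
        else np.asarray(ssm, dtype=float)
    return sim / sim.sum(axis=1, keepdims=True)

def markov_walk(P, start, steps, rng):
    """Generate a sequence of window indices by walking the chain."""
    seq = [start]
    for _ in range(steps):
        seq.append(rng.choice(len(P), p=P[seq[-1]]))
    return seq

def frequency_cloud(magnitudes, bin_freqs, n_partials, rng):
    """Map one spectrogram window to a 'frequency cloud': sample
    partial frequencies in proportion to the energy in each bin."""
    p = magnitudes / magnitudes.sum()
    return rng.choice(bin_freqs, size=n_partials, p=p)

rng = np.random.default_rng(1)

# Toy 5x5 distance SSM (|i - j| stands in for real feature distances)
ssm = np.abs(np.subtract.outer(np.arange(5.0), np.arange(5.0)))
P = ssm_to_stochastic(ssm)
path = markov_walk(P, start=0, steps=10, rng=rng)

# Sample 8 partials for one window with made-up bin energies
freqs = frequency_cloud(np.array([1.0, 4.0, 2.0]),
                        np.array([110.0, 220.0, 440.0]), 8, rng)
```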

The previous example focused mostly on frequencies, but we can use other typical MIR features to map to other musical parameters, such as amplitude envelopes. For instance, we can compute a windowed RMS of an audio sample to get an idea of where energy peaks are located in time. We may then use the RMS signal itself as an envelope, creating a sound whose contours follow the energy of the original sound through time. Here’s an example:
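A sketch of that idea follows; the sample rate, window and hop sizes, and the synthetic decaying input tone are all arbitrary choices for illustration:

```python
import numpy as np

def windowed_rms(signal, win, hop):
    """One RMS energy value per hop-spaced analysis window."""
    n = 1 + (len(signal) - win) // hop
    return np.array([
        np.sqrt(np.mean(signal[i * hop : i * hop + win] ** 2))
        for i in range(n)
    ])

sr = 8000                                           # toy sample rate
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)    # decaying 220 Hz tone

win, hop = 256, 128
rms = windowed_rms(x, win, hop)

# Interpolate the RMS curve back to audio rate and use it as an
# amplitude envelope for a new tone, so the new sound's contour
# follows the energy of the original through time.
env = np.interp(np.arange(len(x)), np.arange(len(rms)) * hop, rms)
y = np.sin(2 * np.pi * 440 * t) * env
```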