The field of music information retrieval, while relatively young, still abounds with a variety of interesting mathematical techniques. Many of these techniques take as a point of departure a collection of ‘features’ or ‘feature vectors’; e.g., spectral flux, windowed root-mean-square energy, short-time Fourier transform vectors, etc. The hope is to glean some sort of recognition of genre or mood based on certain properties of these feature vectors.
One concept that is occasionally used is the self-similarity matrix (SSM). Assume that we begin with a set of \( n \) feature vectors for which we would like to construct the corresponding SSM. By definition, this matrix is the \( n \times n \) matrix whose entry at position \( (i, j) \) is give by a ‘similarity’ (e.g., Euclidean distance) between vectors \( v_i \) and \( v_j \). Here is a graphical example of an SSM of approximately \( 100 \) windows of a spectrogram:
This matrix may remind you of a related concept from probability. Given a set of states, we can define the stochastic matrix, which determines with what probability we will transition from one state to another. In other words, the entry at position \( (i, j) \) of the stochastic matrix is given by the probability of transitioning from state \( i \) to state \( j \). Here is an example of a stochastic matrix (note how the rows sum to \( 1 \)):
0 & 0 & 1/2 & 0 & 1/2\\
0 & 0 & 1 & 0 & 0 \\
1/4 & 1/4 & 0 & 1/4 & 1/4\\
0 & 0 & 1/2 & 0 & 1/2\\
0 & 0 & 0 & 0 & 1
Given an SSM, we can obtain a stochastic matrix by normalizing each row to sum to \( 1 \). A stochastic matrix can then drive an algorithmic process such as a Markov chain. The main question is determining how to map the different ‘states’ into audio. One approach, assuming the SSM has been computed from a spectrogram, is to map each window to a stochastic frequency cloud, where the various frequencies occur in proportion to the energy in the corresponding bin. Here’s a basic audio example:
The previous example focused mostly on frequencies, but we can use other typical MIR features to map to other musical parameters, such as amplitude envelopes. For instance, we can compute a windowed RMS of an audio sample to get an idea of where energy peaks are located in time. We may then use the RMS signal itself as an envelope, creating a sound whose contours follow the energy of the original sound through time. Here’s an example:
Another musical parameter we can control is rhythm. By locating relative extrema (using a simple first-difference approach), we can also approximate onsets. The spacing between those onsets can then be mapped into rhythms.In my example, I randomly choose a spacing with weight corresponding to the intensity of the onset.
Finally, there is the parameter of form. Since music typically evolves through time, it is aesthetically important to devise some system whereby the music will be changing. A purely algorithmic approach, based on self-similarity, is intriguing: starting with an initial SSM, we generate a few seconds of audio using the previously described techniques. That audio is then analyzed to generate a new SSM, which in turn drives new audio. Other data such as RMS can also be sequentially updated. The piece essentially composes itself forever. Of theoretical interest might be the question of whether it converges to some steady state (or perhaps ‘converges’ into a steady loop).
Here is an étude based on combining several of the aforementioned ideas: