Current Approaches to Electronic Music Database Design


When you are sitting in a fairly crowded restaurant, speaking above a small cacophony of dishes clattering, glasses and silverware clinking, mediocre elevator music, and the flurry of other conversations around you, your fellow diner is able to hear you because of a psychoacoustical phenomenon known as the cocktail party effect. This aural occurrence is what allows your mind to “turn down” background noises and focus on a particular source of sound. It’s also what allows many humans to eavesdrop at will. It is important to understand this phenomenon, and how intrinsically it is linked to our brain, because microphones and speakers do not share our psychological advantage. Each is capable only of receiving or producing signal within its programmed and/or physical range, across all or most of the spectrum with fairly equal response. The slight amplifications and attenuations along that frequency range are what define the characteristics of most microphones and speakers; and because so much of the music industry is centered on editing direct microphone signals to make them appealing to the human ear, it should be obvious that those signals capture many unwanted noises.

At the same time that our ears can mentally attenuate noises we wish to disregard, they are especially sensitive to specific frequencies, particularly the range of about 3-4 kHz, the region most responsible for the intelligibility of the average human voice. This is only the typical point of maximum aural sensitivity, however. Other specific frequency regions along the Hertz scale have been identified as well, and are now known, after Zwicker and Fastl, as the “critical frequency bands” of human timbral perception. Persistent noise in any critical band is frequently the irritating culprit that drives people who are trying to relax or work to distraction, be it a lawnmower or a loud video game. Because of these critical bands and the cocktail party effect, the raw output of an analysis device attached to a brainless transducer bears little relevance to human hearing. Audio analysis and classification meant for navigability therefore first requires something known as psychoacoustic preprocessing.
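For the curious, one widely cited closed-form approximation, attributed to Zwicker and Terhardt, maps a frequency in Hertz onto this critical-band (Bark) scale. The short Python sketch below is purely illustrative and is not taken from any of the papers discussed here.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker & Terhardt approximation: frequency in Hz -> critical-band rate in Bark."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

print(round(hz_to_bark(3500.0), 1))  # ~16.5 Bark: the high-sensitivity speech region
```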


Psychoacoustic Preprocessing Applied



Elias Pampalk, Simon Dixon, and Gerhard Widmer describe an excellent example of psychoacoustic preprocessing in their 2004 research paper “Exploring Music Collections by Browsing Different Views,” which discusses a prototype musical database they constructed in miniature. Using Ernst Terhardt’s classic model of human hearing and the ear’s frequency response as the basis for preprocessing, Pampalk, Dixon, and Widmer make use of several different sound-processing tools. First, they elected to convert all audio samples to 11 kHz monophonic recordings. Between stereo and mono, and among formats such as Ogg Vorbis, MP3, and CD audio, the differences in the critically defining elements of a signal (melody, rhythm, and timbre) are negligible, while the more detailed files take far more space to store and time to process. A Short-Time Fourier Transform, which analyzes the signal in very brief, overlapping windows, then yields a spectral representation in 23-msec intervals with 12-msec overlaps across the first 20 critical frequency bands, up to about 6 kHz. This keeps the focus on amplitude at the frequencies where a song is effectively most audible to human perception.
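As a rough illustration of that front end, the following sketch resamples a file to 11 kHz mono, takes short overlapping spectral frames of roughly 23 msec with roughly 12-msec hops, and sums the power into the first 20 critical bands. It assumes the librosa library and a standard table of Bark band edges; it is a simplified stand-in for the authors’ pipeline, not a reproduction of it.

```python
import numpy as np
import librosa

# Standard critical-band (Bark) edge frequencies in Hz for the first 20 bands.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400]

def critical_band_spectrogram(path, sr=11025, n_fft=256, hop=128):
    """Mono 11 kHz signal -> power per critical band per ~23 ms frame (~12 ms hop)."""
    y, sr = librosa.load(path, sr=sr, mono=True)             # downmix and resample
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    bands = np.zeros((len(BARK_EDGES) - 1, spec.shape[1]))
    for b in range(len(BARK_EDGES) - 1):
        lo, hi = BARK_EDGES[b], BARK_EDGES[b + 1]
        mask = (freqs >= lo) & (freqs < hi)
        if mask.any():                                        # top band may exceed Nyquist at 11 kHz
            bands[b] = spec[mask].sum(axis=0)
    return bands                                              # shape: (20, n_frames)
```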

It is important to note that much of Pampalk, Dixon, and Widmer’s modeling and analysis is based on the Hertz scale; however, after applying a general filter based on Ernst Terhardt’s formulas for human hearing, the researchers convert the sampled critical-band information onto a psychoacoustical scale known as the Bark scale, and a process known as spectral masking is applied across each band. Spectral masking models how the energy in one critical band obscures the bands on either side of it, and the effect is distinctly asymmetrical: “The main characteristic is that lower frequencies have a stronger masking influence on higher frequencies than vice versa.” (Exploring Music Collections By Browsing Different Views) Overall loudness is then calculated in sones, and each audio signal is normalized to a level of one sone. In the end, all of this processing is meant solely to emulate the most essential aspects of human hearing prior to any computer analysis.
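A very rough sketch of those two steps might look like the following. The spreading function used here is Schroeder’s classic approximation, and loudness is approximated by treating decibel values as phons before converting to sones; both are assumptions made for illustration rather than the exact filters the paper describes, and the function names are hypothetical.

```python
import numpy as np

def spread_masking(band_db):
    """Spectral masking across Bark bands: band_db has shape (n_bands, n_frames), in dB."""
    n = band_db.shape[0]
    dz = np.arange(n)[:, None] - np.arange(n)[None, :]       # Bark distance: maskee - masker
    # Schroeder spreading function (in dB): falls ~10 dB/Bark toward higher bands and
    # ~25 dB/Bark toward lower bands, so low bands mask high bands more than vice versa.
    spread_db = 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)
    weights = 10.0 ** (spread_db / 10.0)                      # dB -> linear power weights
    masked_power = weights @ (10.0 ** (band_db / 10.0))       # spread each band's energy
    return 10.0 * np.log10(masked_power + 1e-12)

def to_sones(band_db):
    """Crude loudness: treat dB values as phons, then sones = 2^((phon - 40) / 10)."""
    return 2.0 ** ((band_db - 40.0) / 10.0)

def normalise_to_one_sone(sones):
    """Scale so the loudest point of the signal sits at one sone."""
    return sones / (sones.max() + 1e-12)
```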

One of the most interesting aspects of this kind of analysis and attempted categorization is that, for a computer to interpret audio, the audio must first be represented mathematically, visually, or both. Spectrograms, measurements of the “spectral density” of a sound, are currently the dominant graphical representation used in audio research and categorization. Spectral density is simply the presence and amplitude of particular frequencies during a given sample of a signal. Once Pampalk, Dixon, and Widmer obtained their psychoacoustically preprocessed audio, they fed the data into a computer to create visual representations of the most human-centric qualities of each individual song. While this particular database was limited to only 77 pieces, “the limitation in size is mainly induced by our simple HTML interface. Larger collections would require a hierarchical extension that represents each ‘island’ only by the most typical member and allows the user to zoom in and out, for example.” The team’s reference to islands is tied directly to how they used the specialized spectrograms and, in this instance, histograms they obtained.

Before moving on to the islands, I must briefly discuss histograms. Histograms are simple bar-graph representations of the prevalence and incidence of targeted data, and in this specific instance two categories were represented mathematically: the occurrence of specific tempos, based on peaks in the signal over time, and the occurrence of specific frequencies over time. In conjunction with the spectrograms, these two histograms were used to let a Self-Organizing Map create visual representations of the relationships between different pieces and artists. A Self-Organizing Map (SOM) is an artificial neural network that learns, without supervision, from the data it is fed. SOMs are generally used for two-dimensional “mapping” of high-dimensional data, in this case to create manipulable “islands” of association among different pieces of music. These islands could be used for straightforward navigation in the midst of an information-packed digital universe. Additionally, “If a relatively large number of pieces have been added, or if the pieces are very different to the pieces with which the map was originally trained, then it is possible to gradually retrain the maps to give the new pieces more space in the display without reorganizing everything.”
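A minimal self-organizing map, written from scratch, shows the basic mechanics being described: each song becomes a feature vector, each map cell holds a weight vector, and training repeatedly pulls the best-matching cell and its neighbours toward each song so that similar songs cluster into “islands.” This is a generic SOM sketch, not the authors’ implementation; grid size and training parameters are arbitrary.

```python
import numpy as np

def train_som(features, grid=(10, 14), epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """features: (n_songs, dim) array of per-song descriptors. Returns the trained map."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = features.shape[1]
    weights = rng.normal(size=(h, w, dim))
    ys, xs = np.mgrid[0:h, 0:w]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                 # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.5     # shrinking neighbourhood radius
        for vec in rng.permutation(features):
            # Best-matching unit: the cell whose weight vector is closest (Euclidean).
            dists = np.linalg.norm(weights - vec, axis=2)
            by, bx = np.unravel_index(dists.argmin(), dists.shape)
            # Pull the BMU and its neighbours toward this song's feature vector.
            influence = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * influence[..., None] * (vec - weights)
    return weights  # map each song to its best-matching cell afterwards to draw the islands
```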

Still, it is important to understand: “In general, any of the three views we have used in our demonstration can be replaced. Candidates include similarity measures focusing on melody, harmony, as well as other measures which might be more suitable to describe timbre and rhythm.” Though there is some tailoring toward melody in the preprocessing methods discussed above, here the researchers touch on a point that is largely unexplored within the mainstream audio-categorization community: are the dynamics of spectral density and loudness truly what define the “sound” of a piece of music? Or are there aspects of tonality, harmony, and melody that could further allow a quantifiable method for search queries? I have found few studies into an actual method of programming or computation based on these facets of music. The openness that Elias Pampalk, Simon Dixon, and Gerhard Widmer display in their article concerning the limitations of their system makes me inclined to endorse their attempt as a feasible one; it simply must be expanded to include their own prescriptions for success.


Fluctuation Pattern Modeling & MFCCs



The type of analysis and navigation system described above is based on only one method of audio examination, known as Fluctuation Pattern modeling. While this particular version of the method contains some unique facets, fluctuation modeling in general logs variations in amplitude and frequency over time, maps the results, and then compares the resulting spectrograms using Euclidean distance formulas to determine similarity. More recently, research has compared Fluctuation Pattern modeling to a form of modeling based on the non-linear Mel scale of frequency rather than the conventional, linear Hertz scale. The Mel scale, at its core, is a measurement of the human perception of a tone’s pitch rather than of the tone’s actual frequency in Hertz. In 2009, Austrian researchers Arthur Flexer and Dominik Schnitzer conducted a music information retrieval study comparing Hertz-based Fluctuation Pattern modeling to Mel-Frequency Cepstral Coefficient (MFCC) modeling. Mel-frequency cepstral coefficients make up a Mel-frequency cepstrum, which is a representation of a sound obtained by taking a short excerpt of its signal, mapping its spectral powers (roughly, the energy at each frequency) onto the Mel scale, taking the logarithm of each Mel-frequency band’s power, and then decomposing that list of log powers into a series of cosine waves, as if it were itself a signal. The result is the cepstrum, and the MFCCs are the amplitudes of that cepstrum. In other words, the resulting representation maps human-perceived timbre more accurately than a conventional spectrogram does.
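In practice, MFCC extraction is well supported by off-the-shelf tools. The sketch below uses the librosa library (an assumption; any MFCC implementation would do) and collapses each song to the mean of its frame-wise MFCCs before taking a Euclidean distance, a deliberate simplification of the per-song distribution models the study actually compares.

```python
import numpy as np
import librosa

def mfcc_summary(path, sr=22050, n_mfcc=20):
    """Load a song, compute frame-wise MFCCs, and summarise them as their mean vector."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)

def timbre_distance(path_a, path_b):
    """Smaller distance suggests more similar 'sound' or timbre."""
    return float(np.linalg.norm(mfcc_summary(path_a) - mfcc_summary(path_b)))
```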

In their study, Flexer and Schnitzer claim, and provide statistical evidence for, the position that, “whereas Mel Frequency Cepstrum Coefficients (MFCCs) are a quite direct representation of the spectral information of a signal and therefore of the specific ‘sound’ or ‘timbre’ of a song, Fluctuation Patterns (FPs) are a more abstract kind of feature describing the amplitude modulation of the loudness per frequency band. It is our hypothesis, that MFCCs are more prone to pick up production and mastering effects of a single album as well as the specific ‘sound’ of an artist.”

And further:

“In Music Information Retrieval, one of the central goals is to automatically recommend music to users based on a query song or query artist. This can be done using expert knowledge (e.g. pandora.com), social meta-data (e.g. last.fm), collaborative filtering (e.g. amazon.com/mp3) or by extracting information directly from the audio.”

With this central tenet in mind, it is easy to comprehend how an audio-analysis search engine that recognizes similarity in mastering and production would benefit the struggle for a comprehensive, navigable music database. Indeed, for mobile-phone applications like Shazam, where the sole desire is simply to retrieve the metadata for a single song, MFCC modeling is, or at least would be (Shazam does not say), extremely beneficial for interpreting audio information accurately. Yet when the goal is automated recommendation, a query that keeps retrieving the same artist across the extent of their work is more troublesome. Audio research is only possible if an audio search function can retrieve new audio information. In other words, what is the use when, with “the timbre based method… about one third of the first recommendations are from the same album and about another third from other albums from the same artist as the query song”? That leaves only a third of the first results as possibly unfamiliar and possibly relevant. Even if that figure holds only in this individual case study, it is unacceptable for any academic research application.

Flexer and Schnitzer’s study covered 254,398 different 30-second song excerpts by 1,700 different artists across a range of Western genres. Apple’s latest iPod with the largest hard drive claims to store up to 40,000 songs. With recorded music’s 100-plus-year history and millions of songs in global circulation, there seems to be an obvious answer to the obstacle the researchers describe: “With audio based music recommendation maturing to the scale of the web, our work provides important insight into the behavior of music similarity for very large data bases. Even with hundreds of thousands of songs, album and artist filtering remain an issue.” The Austrian researchers state it simply in their own paper: “As the data sets get larger, the probability that songs from other artists are more similar to the query song than songs from the same album or artist, clearly seems to increase.” The number of searchable pieces must simply exceed 250,000, probably several times over.
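The album and artist filtering the researchers mention can be pictured as a simple post-processing step on a nearest-neighbour search: rank every candidate by feature distance to the query, then skip candidates that share the query’s artist or album. The field names and structure below are illustrative only, not drawn from the study.

```python
import numpy as np

def recommend(query, catalogue, k=10, filter_artist=True, filter_album=True):
    """catalogue: list of dicts with 'features' (np.ndarray), 'artist', and 'album' keys."""
    ranked = sorted(
        catalogue,
        key=lambda song: np.linalg.norm(song["features"] - query["features"]),
    )
    results = []
    for song in ranked:
        if song is query:
            continue                      # never recommend the query itself
        if filter_artist and song["artist"] == query["artist"]:
            continue                      # drop same-artist hits
        if filter_album and song["album"] == query["album"]:
            continue                      # drop same-album hits
        results.append(song)
        if len(results) == k:
            break
    return results
```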


Sources:



Pampalk, Elias, Simon Dixon, and Gerhard Widmer. “Exploring Music Collections by Browsing Different Views.” 2004.

Flexer, Arthur, and Dominik Schnitzer. “Album and Artist Effects for Audio Similarity at the Scale of the Web.” 2009.