A Brief History of Technology and the Expressive Voice,
with Emphasis on the Synthesis of Singing
Perry Cook
Computer Science, Princeton University
The voice is arguably the oldest (with the possible exception of simple percussion) and most expressive of musical instruments, with the instrument, builder, and performer being inseparable. As with all instruments, however, the voice has influenced, and been influenced by, technology in ways that might not be so obvious. This talk will trace some highlights of technology/voice interaction, from the first human realizing that cupping their hands in front of their mouth made their voice carry farther, through the intentional castration of young boys to give them divine singing voices, through the invention and proliferation of radio and recording, to the newest vocal effects and synthesizers. Many sound and image examples will be presented throughout the talk.
Summary of the talk
Ong Bee Suan
Background
Dr. Perry Cook is an Associate Professor of Computer Science, with a joint (affiliated) appointment in the Music Department at Princeton University. Dr. Cook currently holds a Guggenheim Fellowship to research a new book on "technology and vocal expression", exploring the interaction and influence of technology on the expressive human voice. The research will look at topics as varied as megaphones, microphones, sound recording, loudspeakers, radio, design of vocal performance spaces (from ancient Roman and Greek theaters to modern opera houses), surgery and drugs, telecommunications, and the computer. Dr. Cook's research interests include physics-based sound synthesis models; physics of random and real-world sound sources; history of technology and music; singing voice synthesis and control; speech and audio compression; audio analysis and feature extraction; real-time devices for computer musical instrument control and human-computer interaction; and applications in audio synthesis and analysis, auditory display, and sound for immersive environments.
Research Seminar
The content of the research seminar is taken from parts of Dr. Cook's book entitled “La Voce Bella Della Macchina” (The Beautiful Voice of the Machine), which covers various topics related to the history of technology and the expressive voice as listed below.
Book Content
Interviews and Supporting Materials:
Appendix A - Researching the Voice
Interviews with researchers from Bell Labs, MIT/Lincoln Labs, Haskins Labs, Stockholm KTH, etc., such as Max Mathews, Jont Allen, Ben Gold, students of Dennis Klatt, Gunnar Fant, Johan Sundberg
Appendix B - Composing the Voice
Interviews with John Chowning, Charles Dodge, Paul Lansky, others.
Appendix C - Performing the Voice
Interviews with Laurie Anderson, Meredith Monk, Bobby McFerrin, others.
Appendix D - Switching On the Voice
Interviews with many voice hackers, from Wendy Carlos to Kraftwerk to Radiohead
Appendix E - Description of materials on DVD-ROM
Technology & Expressive Voice
The speaker defines technology broadly as "any intentionally fashioned tool or technique". In the context of the human voice, examples include amplification, delay, surgery, drugs (for therapy and inspiration), notation (how to notate the voice), law, recording, broadcast (with gadgets such as microphones and amplifiers), depiction (using the voice to depict or annotate something, or to copy somebody's voice), disguise, modeling (mechanical, artificial, synthetic) and much more. The expressive voice can be found in singing, acting, praying, preaching, or any kind of speaking or vocalizing for affect.
Cupped hands, hollow logs and echoes from caves are examples of primitive technology for augmenting the voice, found very early in human history. From the late 16th century, surgery was used to preserve the uniqueness of a voice. Since women were forbidden to sing in church choirs and theaters, it was common practice to castrate young boys to keep their voices from changing and to train them for singing in church. The Farinelli project by Xavier Rodet at IRCAM morphed the voices of a coloratura soprano and a countertenor to create a timbre similar to that of a castrato singer for the movie "Farinelli (Il Castrato)". In voice aesthetics, abnormal voices tend to attract particular attention. Such voices can be caused by an abusive lifestyle, disease, or intentional voice damage. The legal voice concerns intellectual property: compositions, plagiarism, etc. A few famous historical lawsuits in legal voice are:
The evolution of recording devices, from mechanical horns to microphones, has increased both the quantity and quality of music recordings. It has also influenced which types of voice became popular in recording and broadcasting. For example, before the invention of the microphone, a soft, whispering voice (e.g. Bing Crosby's) could not be captured by mechanical horns. The ability of the available recording devices to capture the sound of certain instruments also shaped the typical instrumentation of particular periods. For example, horn, cornet and trumpet were prominent in Dixieland music because they projected well compared with plucked basses.
Instrumental voice is voice that is used as a musical instrument. Vocal tablature (notation) for Indian and African drumming is an example of instrumental voice as a percussion instrument. Notated voice systems use symbols to notate the voice. Examples are Guidonian hand notation/conducting and Kodály hand signs. In Japan, Okinawan shamisen players notate the vocal line as symbols along an image of the shamisen neck.
Unlike ancient mouth "interfaces" such as the didgeridoo, a modern mouth interface, the Mouthesizer, uses a miniature head-mounted CCD camera to track the shadow area inside the mouth using colour and intensity thresholding. Shape parameters extracted from the segmented region are then mapped to MIDI control changes, letting users control audio effects, synthesizer parameters, etc. with movements of the mouth.
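The thresholding-to-MIDI pipeline described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Mouthesizer's actual implementation: the frame is a small grayscale grid, only intensity (not colour) is thresholded, the shape parameters are a simple bounding box, and the threshold, controller numbers and scaling are all hypothetical choices.

```python
# Hypothetical sketch: dark-region thresholding -> shape parameters -> MIDI CC.
# The threshold, the controller numbers (CC 1 and CC 74) and the scaling are
# illustrative assumptions, not values taken from the Mouthesizer itself.

def segment_dark_region(frame, threshold):
    """Return (row, col) coordinates of pixels darker than the threshold."""
    return [(r, c)
            for r, row in enumerate(frame)
            for c, value in enumerate(row)
            if value < threshold]

def shape_parameters(pixels):
    """Bounding-box height and width of the segmented region (0, 0 if empty)."""
    if not pixels:
        return 0, 0
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return max(rows) - min(rows) + 1, max(cols) - min(cols) + 1

def to_midi_cc(value, max_value):
    """Scale a measurement to the 7-bit MIDI controller range 0..127."""
    if max_value <= 0:
        return 0
    return max(0, min(127, round(127 * value / max_value)))

def mouth_to_cc(frame, threshold=60):
    """Map mouth-opening height and width to two control-change messages."""
    height, width = shape_parameters(segment_dark_region(frame, threshold))
    frame_h, frame_w = len(frame), len(frame[0])
    # Each message is (status byte for CC on channel 1, controller, value).
    return [(0xB0, 1, to_midi_cc(height, frame_h)),
            (0xB0, 74, to_midi_cc(width, frame_w))]

# A 6x8 test frame: bright "face" (200) with a dark 2x4 "mouth cavity" (20).
frame = [[200] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3, 4, 5):
        frame[r][c] = 20

messages = mouth_to_cc(frame)
print(messages)
```

Opening the mouth wider enlarges the dark region, which raises the controller values sent to the synthesizer or effect.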
Current Speaking/Singing Machines
The voice source can be characterized as a periodic source corresponding to the oscillating vocal folds, or a non-periodic source corresponding to turbulent noise, or a mixture of these. The voice system is controlled by the shape of the vocal tract. The spectrum of the voice is characterized by resonant peaks called formants. The locations and shapes of formant resonances are strong perceptual cues that we use to identify vowels and consonants. The most successful systems capable of generating, recognizing, or flexibly modifying speech-like sounds have allowed flexible manipulation of the resonant peaks of the spectrum, and of source parameters (voice, pitch, noise level, etc.). Listed below are existing speaking or singing machines in chronological order.
Acoustic Tubes - Mechanical models of the vocal tract for producing singing voice.
Von Kempelen's model (1791) - used hand control of a leather “vocal tract” to vary the sounds produced, with a bellows for lungs, auxiliary holes for nostrils, and reeds and other mechanisms for the voiced and fricative (sibilant) energy required.
Kelly-Lochbaum's model - samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance travelled by a sound wave in one time sample.
Vocoder - The first analysis/synthesis engine for speech transmission. An input voice signal is decomposed using a bank of bandpass filters and a pitch detector. The pitch detector is used to control the fundamental frequency of the excitation and to determine whether the source signal is voiced or unvoiced. Together, the bandpass filters provide an approximation of the overall vocal tract filter, and the energy in each bandpass filter is transmitted as a parameter.
Voder - The Vocoder with the analysis engine replaced by controls for a human operator. The operator, in controlling the excitation type (voiced vs. unvoiced), the fundamental frequency of excitation, and the resonant bandpass filter responses, is able to synthesize speech or singing.
Linear Prediction - involves forming a digital filter that predicts the next sample from a linear combination of a few previous samples. The difference between the predicted and actual signal is an error signal which, if fed back through the time-varying prediction filter, yields exactly the original signal. The error signal can be parametrically coded and resynthesized, or modified before resynthesis.
Formant Wave Functions (FOFs) - Time-domain waveform models of the impulse responses of individual formants. Each of these formant wave functions can be excited at the required fundamental frequency to produce the singing voice.
Sinusoidal Models - An input signal is decomposed into a number of sinusoidal partials, using Fourier analysis to locate and track the individual partials in the voice signal. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis.
Frequency Modulation - Involves modulating the frequency of one oscillator (the carrier) with the output of another (the modulator) to create a spectrum consisting of sidebands surrounding the carrier frequency. By placing a suitably scaled and modulated carrier wave close to each formant of the singing voice, one can approximate the spectrum of a singing voice.
Acoustic Tubes/Physical model - simulates the vocal tract transfer function by solving the one-dimensional wave equation inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4000 Hz.
Template-based Models - based on constructed templates from recorded voice sounds.
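To make the Kelly-Lochbaum entry in the list above concrete, the following sketch computes the segment length implied by "the distance travelled by a sound wave in one time sample", the number of segments that span a tract, and the reflection coefficients at the junctions between segments. The speed of sound, the 17 cm tract length, the sample rate and the area values are illustrative assumptions, and the sign convention of the reflection coefficient is one of several in use.

```python
# Sketch of the Kelly-Lochbaum discretization. SPEED_OF_SOUND, the tract length
# and the area values are illustrative assumptions, not data from the talk.
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)

def segment_length(sample_rate):
    """Distance sound travels in one sample period, in metres."""
    return SPEED_OF_SOUND / sample_rate

def num_segments(tract_length, sample_rate):
    """Number of cylindrical segments needed to span the tract."""
    return round(tract_length / segment_length(sample_rate))

def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent segments.

    Uses k = (A_i - A_next) / (A_i + A_next); the sign depends on the
    chosen convention for wave direction and pressure versus velocity.
    """
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]

length_m = segment_length(44100)
segments = num_segments(0.17, 44100)
ks = reflection_coefficients([1.0, 3.0, 3.0])
print(f"{length_m * 1000:.2f} mm per segment, {segments} segments, k = {ks}")
```

A junction between equal areas gives a coefficient of zero, so the wave passes through unchanged; the larger the area mismatch, the more of the wave is reflected.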
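The linear prediction entry in the list above claims that feeding the error signal back through the prediction filter recovers the original signal. A minimal sketch of that round trip follows. It assumes a single fixed analysis frame (a real coder updates the coefficients frame by frame), the autocorrelation method with the Levinson-Durbin recursion, and a synthetic decaying resonance as the test signal; all function names are illustrative.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the frame x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[0..order-1] for x[n] ~ sum a[k] * x[n-1-k]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]

def predict(history, a, n):
    """Linear combination of the samples preceding index n."""
    return sum(a[k] * history[n - 1 - k]
               for k in range(len(a)) if n - 1 - k >= 0)

def residual(x, a):
    """Error signal: actual sample minus predicted sample."""
    return [x[n] - predict(x, a, n) for n in range(len(x))]

def resynthesize(e, a):
    """Feed the error signal back through the prediction filter."""
    y = []
    for n in range(len(e)):
        y.append(e[n] + predict(y, a, n))
    return y

# Test signal: a decaying resonance, loosely like a single formant ringing.
signal = [0.95 ** n * math.sin(2 * math.pi * 0.05 * n) for n in range(200)]
coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
error = residual(signal, coeffs)
rebuilt = resynthesize(error, coeffs)

max_diff = max(abs(s - r) for s, r in zip(signal, rebuilt))
signal_energy = sum(v * v for v in signal)
error_energy = sum(v * v for v in error)
print(coeffs, error_energy / signal_energy, max_diff)
```

The error signal carries far less energy than the input, which is what makes it cheap to code, while the reconstruction matches the input to within floating-point rounding.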
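The FOF entry in the list above can be illustrated with a simplified sketch: each formant is a short damped sinusoid (its impulse response) with a smooth attack, and one such grain per formant is started at every pitch period. The formant frequencies, bandwidths and amplitudes below are rough illustrative values for an "ah"-like vowel, and the 2 ms attack and 20 ms grain length are assumptions.

```python
import math

def fof_grain(center_hz, bandwidth_hz, amplitude, sample_rate,
              duration=0.02, attack=0.002):
    """One formant wave function: a damped sinusoid with a smooth attack."""
    grain = []
    for n in range(round(duration * sample_rate)):
        t = n / sample_rate
        envelope = math.exp(-math.pi * bandwidth_hz * t)
        if t < attack:
            # Raised-cosine rise avoids a click at the grain onset.
            envelope *= 0.5 * (1.0 - math.cos(math.pi * t / attack))
        grain.append(amplitude * envelope
                     * math.sin(2.0 * math.pi * center_hz * t))
    return grain

def fof_voice(formants, f0, sample_rate, duration):
    """Overlap-add one grain per formant at every pitch period."""
    total = round(duration * sample_rate)
    out = [0.0] * total
    grains = [fof_grain(fc, bw, amp, sample_rate) for fc, bw, amp in formants]
    period = sample_rate / f0  # samples between grain onsets
    onset = 0.0
    while onset < total:
        start = round(onset)
        for grain in grains:
            for i, value in enumerate(grain):
                if start + i < total:
                    out[start + i] += value
        onset += period
    return out

# (centre frequency Hz, bandwidth Hz, amplitude): assumed vowel-like values.
AH_FORMANTS = [(700.0, 80.0, 1.0), (1220.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]
voice = fof_voice(AH_FORMANTS, f0=100.0, sample_rate=16000, duration=0.1)
print(len(voice), max(abs(v) for v in voice))
```

Changing `f0` alters the pitch without moving the formants, because the grains themselves stay the same and only their spacing changes.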
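For the frequency modulation entry in the list above, the sketch below generates one carrier/modulator pair and measures where its energy lands. It assumes a 200 Hz sung pitch used as the modulator frequency, a formant near 730 Hz, a modulation index of 1, and a carrier placed on the harmonic nearest the formant, which is one common way to set up FM voices rather than necessarily the method presented in the talk.

```python
import cmath
import math

def nearest_harmonic(formant_hz, f0):
    """Harmonic of f0 closest to the formant frequency (at least the 1st)."""
    return max(1, round(formant_hz / f0)) * f0

def fm_signal(carrier_hz, modulator_hz, index, sample_rate, n_samples):
    """sin(2*pi*fc*t + index * sin(2*pi*fm*t)) sampled at sample_rate."""
    return [math.sin(2 * math.pi * carrier_hz * n / sample_rate
                     + index * math.sin(2 * math.pi * modulator_hz
                                        * n / sample_rate))
            for n in range(n_samples)]

def amplitude_at(signal, freq_hz, sample_rate):
    """Amplitude of one frequency component via a single-bin DFT."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
              for n, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)

SAMPLE_RATE = 8000   # assumed
F0 = 200.0           # assumed sung pitch, also the modulator frequency
carrier = nearest_harmonic(730.0, F0)   # harmonic nearest the formant
tone = fm_signal(carrier, F0, 1.0, SAMPLE_RATE, 800)

# Energy appears at the carrier and at carrier +/- multiples of the modulator.
for freq in (400.0, 600.0, 700.0, 800.0, 1000.0, 1200.0):
    print(freq, round(amplitude_at(tone, freq, SAMPLE_RATE), 4))
```

Because the carrier and modulator are both harmonics of the sung pitch, every sideband falls on a harmonic, giving a pitched tone whose energy is concentrated around the formant. A full voice would sum several such pairs, one per formant.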
Vocal synthesis can be controlled in real time. However, it is not a natural "fit" because there are too many parameters to control. Existing vocal controller devices include:
SqueezeVox (with Colby Leider 01) - An accordion-like controller for models of the human voice. The right hand controls pitch with the keyboard, vibrato with aftertouch, and fine pitch and vibrato with a linear strip. Breathing is controlled by the bellows, and the left hand controls vowels and consonants via buttons (presets) or continuous controllers such as a touch pad, plungers, or a squeeze interface.
The COWE (Controller, One With Everything) - A device that can be used to control a number of vocal models, including BelCanto singers, crying babies, Tibetan and Tuvan singers.
Reference
More information on projects being carried out by Dr. Perry Cook and the Princeton SoundLab can be found at the websites listed below:
Dr. Perry Cook website
Princeton SoundLab
Other musical controllers