A Brief History of Technology and the Expressive Voice, with Emphasis on the Synthesis of Singing

Perry Cook
Computer Science, Princeton University

The voice is arguably the oldest (with the possible exception of simple percussion) and most expressive of musical instruments, with the instrument, builder, and performer being inseparable. As with all instruments, however, the voice has influenced, and been influenced by, technology in ways that might not be so obvious. This talk will trace some highlights of technology/voice interaction, from the first human realizing that cupping their hands in front of their mouth made their voice carry farther, through the intentional castration of young boys to give them divine singing voices, through the invention and proliferation of radio and recording, to the newest vocal effects and synthesizers. Many sound and image examples will be presented throughout the talk.


Summary of the talk
Ong Bee Suan


Background

Dr. Perry Cook is an Associate Professor of Computer Science, with a joint (affiliated) appointment in the Music Department, at Princeton University. He is currently on a Guggenheim Fellowship, researching a new book on "technology and vocal expression" that explores the interaction and influence of technology on the expressive human voice. The research covers topics as varied as megaphones, microphones, sound recording, loudspeakers, radio, the design of vocal performance spaces (from ancient Roman and Greek theaters to modern opera houses), surgery and drugs, telecommunications, and the computer. Dr. Cook's research interests include physics-based sound synthesis models; the physics of random and real-world sound sources; the history of technology and music; singing voice synthesis and control; speech and audio compression; audio analysis and feature extraction; real-time devices for computer musical instrument control and human-computer interaction; and applications in audio synthesis and analysis, auditory display, and sound for immersive environments.


Research Seminar

The content of the research seminar is taken from parts of Dr. Cook's book entitled “La Voce Bella Della Macchina” (The Beautiful Voice of the Machine), which covers various topics related to the history of technology and the expressive voice as listed below.

Book Content

  1. The Voice: Basic human voice physiology, neurology, and acoustics.
  2. The Articulated Voice: Motions of our tongue, jaw, etc. for linguistic communication. Gesture in production and perception.
  3. The Pitched Voice: Aspects of vocal pitch control and perception, specifically as related to emotion and singing.
  4. The Visual Voice: Emotion in facial expression, lip reading, visualizations of the vocal organ and of vocal sound. Spectrograms, hand signs, other visual speech tools.
  5. The Singing Voice: Singing vs. speech and the differences between them. The basic voice parts. Research and lore about singing.
  6. The Bio-Medical Voice: Surgery as vocal enhancement (castrati singers). The aging voice, unintentional and intentional voice damage. Stuttering and other dysphonias. Artificial larynxes, speech synthesizers, and other therapies and devices to aid the voice.
  7. The Holy Voice: Rosary beads, prayer wheels, prayer bowls, religious chant, notation, religious pop and heavy metal bands, etc.
  8. The Legal Voice: Intellectual property (primarily copyright) related to the voice. Famous historical lawsuits related to the voice. Music downloading. Stealth recording, independent bands and labels.
  9. The Amplified Voice: Megaphones to microphones. Caruso vs. Bing Crosby. Recording history and techniques.
  10. The Recorded Voice: History and vocal implications of sound recording.
  11. The Broadcast Voice: History and vocal implications of broadcasting.
  12. The Gendered Voice: Storytellers taking on other genders. Security and other devices that corrupt gender perception. Artists who use electronics to transform gender.
  13. The Descriptive Voice: Imitation, mimicry, etc. Onomatopoeia in the names of noise-producing objects and in musical instrument names.
  14. The Instrumental Voice: Vocalise (songs without words). The voice as percussion instrument. Vocal tablature (notation) for Indian and African drumming. Voice interaction with other instruments (brass and other winds, didgeridoo).
  15. The Notated Voice: Speech vs. written language. Musical notation systems (example: Okinawan shamisen players notate the vocal line as symbols along the image of the shamisen neck). Guidonian hand notation/conducting, Kodály hand signs.
  16. The Delayed, Delayed Voice: Acoustic echo and delay, electronic echo and delay, extended versions and variants (delay, reverse echoes, reverberation, artificial chorus, flanging, pitch shifting, many more).
  17. The Fictional Robotic Voice: Voices of machines, robots, and computers in movies, television, etc.
  18. The Anonymous Voice: Paging (air, bus, pilot), authority, anonymity, radio DJs, commercial announcing, other voices without faces.
  19. The Synthesized Voice: Historical speaking machines. Overview of voice synthesis, especially singing synthesis.
  20. The Accompanied Voice: Voice with other instruments.
  21. The Silent Voice: Vows of silence, mimes, etc.
  22. The Contemporary Electro-Acoustic Music Voice: Contemporary art-music technology and voice composition.
  23. The Popular Electronic Musical Voice: Popular music/art technology and the voice.

Interviews and Supporting Materials:

Appendix A - Researching the Voice: Interviews with researchers from Bell Labs, MIT/Lincoln Labs, Haskins Labs, Stockholm KTH, etc., such as Max Mathews, Jont Allen, Ben Gold, students of Dennis Klatt, Gunnar Fant, and Johan Sundberg.
Appendix B - Composing the Voice: Interviews with John Chowning, Charles Dodge, Paul Lansky, and others.
Appendix C - Performing the Voice: Interviews with Laurie Anderson, Meredith Monk, Bobby McFerrin, and others.
Appendix D - Switching On the Voice: Interviews with many voice hackers, from Wendy Carlos to Kraftwerk to Radiohead.
Appendix E - Description of materials on the DVD-ROM.


Technology & Expressive Voice

The speaker gives a broad definition of technology: "any intentionally fashioned tool or technique". In the context of the human voice, examples of technology include amplification, delay, surgery, drugs (for therapy and inspiration), notation (how to write the voice down), law, recording, broadcast (with gadgets such as microphones and amplifiers), depiction (using the voice to depict or annotate something, or to copy somebody else's voice), disguise, modeling (mechanical, artificial, synthetic), and much more. The expressive voice can be found in singing, acting, praying, preaching, or any kind of speaking or vocalizing for affect.

Cupped hands, hollow logs, and echoes from caves are examples of primitive technology for augmenting the voice, and they appear very early in human history. From the late 16th century, surgery was used to preserve a distinctive voice. Since women were forbidden to sing in church choirs and theaters, it was common practice to castrate young boys to keep their voices from changing and to train them for singing in the church. The Farinelli project by Xavier Rodet at IRCAM morphed the voices of a coloratura soprano and a countertenor in order to create a timbre similar to that of a castrato singer for the movie "Farinelli (Il Castrato)".

There is a phenomenon in voice aesthetics whereby abnormal voices attract particular attention. These abnormal voices can be caused by an abusive lifestyle, disease, or intentional voice damage.

The legal voice concerns intellectual property: compositions, plagiarism, and so on. A few famous historical lawsuits involving the legal voice are:

  1. The copyright infringement lawsuit against former Beatle George Harrison, whose song "My Sweet Lord" (1970s) had a tune that sounded very much like the Chiffons' song "He's So Fine" (1960s).
  2. The copyright infringement lawsuit against John Fogerty for sounding too much like himself.
  3. Tom Waits' lawsuit against the snack manufacturer Frito-Lay, Inc., and its advertising agency, Tracy-Locke, Inc., for voice misappropriation and false endorsement. It followed the broadcast of a radio commercial for SalsaRio Doritos that featured a vocal performance imitating Waits' raspy singing voice.
  4. Bette Midler's lawsuit against the Ford Motor Company and its advertising agency for deliberately imitating her voice in a performance of one of her songs in a television commercial.

The evolution of recording devices, from mechanical horns to microphones, has increased both the quantity and the quality of music recordings. It has also influenced which types of voices became popular in recording and broadcasting. For example, before the invention of the microphone, a soft, near-whispering voice such as Bing Crosby's could not be captured by mechanical horns. The ability of the available recording devices to capture the sounds of certain instruments has likewise influenced the instrumentation typical of particular periods. For example, horn, cornet, and trumpet were popular in Dixieland music because they projected well compared with plucked basses.

The instrumental voice is the voice used as a musical instrument. Vocal tablature (notation) for Indian and African drumming is an example of the instrumental voice acting as a percussion instrument. Notated voice systems have been invented to write the voice down using symbols. Examples are Guidonian hand notation/conducting and Kodály hand signs. In Japan, Okinawan shamisen players notate the vocal line as symbols along the image of the shamisen neck.

Unlike ancient mouth "interfaces" such as the didgeridoo, a modern mouth interface, the Mouthesizer, uses a miniature head-mounted CCD camera to track the shadow area inside the mouth using colour and intensity thresholding. Shape parameters extracted from the segmented region are then mapped to MIDI control changes, letting users control audio effects, synthesizer parameters, etc. with movements of the mouth.
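
The thresholding-and-mapping pipeline described above can be sketched roughly as follows. This is only an illustration, not the Mouthesizer's actual implementation: the frame is a hypothetical grayscale image given as a list of rows, the threshold value is arbitrary, the choice of bounding-box width and height as the "shape parameters" is an assumption, and all function names are invented for this example.

```python
def segment_dark_region(frame, threshold=60):
    """Return (row, col) coordinates of pixels darker than the threshold."""
    return [(r, c)
            for r, row in enumerate(frame)
            for c, value in enumerate(row)
            if value < threshold]


def shape_parameters(pixels):
    """Bounding-box width and height of the segmented region, in pixels."""
    if not pixels:
        return 0, 0
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return max(cols) - min(cols) + 1, max(rows) - min(rows) + 1


def to_midi_cc(value, max_value):
    """Scale a value in [0, max_value] to a 7-bit MIDI controller value (0-127)."""
    value = min(max(value, 0), max_value)
    return round(127 * value / max_value)


# Hypothetical 6x8 frame: bright face (200) with a dark mouth cavity (20).
frame = [[200] * 8 for _ in range(6)]
for r in (2, 3):
    for c in range(2, 6):
        frame[r][c] = 20

width, height = shape_parameters(segment_dark_region(frame))
cc_width = to_midi_cc(width, max_value=8)    # could drive, say, a filter cutoff
cc_height = to_midi_cc(height, max_value=6)  # could drive, say, an effect depth
```

In a real system the two controller values would be sent as MIDI control change messages on every video frame; which controller numbers they use is left open here.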


Current Speaking/Singing Machines

The voice source can be characterized as a periodic source corresponding to the oscillating vocal folds, or a non-periodic source corresponding to turbulent noise, or a mixture of these. The voice system is controlled by the shape of the vocal tract. The spectrum of the voice is characterized by resonant peaks called formants. The location and shapes of formant resonances are strong perceptual cues that we use to identify vowels and consonants. The most successful systems capable of generating, recognizing, or flexibly modifying speech-like sounds, have allowed flexible manipulation of the resonant peaks of the spectrum, and of source parameters (voice, pitch, noise level, etc.). Listed below are exisiting speaking or singing machines in chronological order.

Acoustic Tubes - Mechanical models of the vocal tract for producing singing voice.

Von Kempelen's model (1791) - used hand control of a leather “vocal tract” to vary the sounds produced, with a bellows for lungs, auxiliary holes for nostrils, and reeds and other mechanisms for the voiced and fricative (sibilant) energy required.
Kelly-Lochbaum's model - samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance travelled by a sound wave in one time sample.
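
As a rough sketch of the discretization just described, the segment length follows from the sampling rate, and the scattering at each junction between cylinders follows from the change in cross-sectional area. The speed of sound, the 17.5 cm tract length, and the area values below are illustrative assumptions, and the sign convention is the one for pressure waves travelling from the first segment into the second.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, moist air


def segment_length(sample_rate):
    """Distance (metres) a sound wave travels in one sample period."""
    return SPEED_OF_SOUND / sample_rate


def reflection_coefficients(areas):
    """Pressure reflection coefficient at each junction between adjacent
    cylindrical segments: k = (A1 - A2) / (A1 + A2)."""
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]


sample_rate = 44100
tract_length = 0.175  # metres, assumed typical adult vocal tract
n_segments = round(tract_length / segment_length(sample_rate))

# Hypothetical cross-sectional areas (cm^2) for four neighbouring segments.
k = reflection_coefficients([2.0, 2.0, 4.0, 1.0])
```

A junction between equal areas gives k = 0 (no reflection), a widening gives a negative coefficient, and a narrowing gives a positive one.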

Vocoders/Voders - Based on the idea of dividing the frequency spectrum into bands and coding the energy in each band.

Vocoder - The first analysis/synthesis engine for speech transmission. An input voice signal is decomposed using a bank of bandpass filters and a pitch detector. The pitch detector is used to control the fundamental frequency of the excitation and to determine whether the source signal is voiced or unvoiced. Together, the bandpass filters provide an approximation of the overall vocal tract filter, and the energy in each bandpass filter is transmitted as a parameter.
Voder - The Vocoder with the analysis engine replaced by controls for a human operator. By controlling the excitation type (voiced vs. unvoiced), the fundamental frequency of the excitation, and the resonant bandpass filter responses, the operator is able to synthesize speech or singing.
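
The band-energy part of the vocoder analysis can be sketched as below. This is a simplification under stated assumptions: instead of a bank of bandpass filters it groups the bins of a naive DFT of a single frame into bands, the band edges are arbitrary, and the pitch detector and voiced/unvoiced decision are left out entirely.

```python
import cmath
import math


def band_energies(frame, sample_rate, band_edges):
    """Energy of one frame in each band [band_edges[i], band_edges[i+1]) Hz,
    computed from DFT bins rather than from real bandpass filters."""
    n = len(frame)
    energies = [0.0] * (len(band_edges) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for b in range(len(energies)):
            if band_edges[b] <= freq < band_edges[b + 1]:
                energies[b] += abs(coeff) ** 2
    return energies


sample_rate = 8000
n = 256
# Test input: a 1000 Hz tone, which should land in the second band.
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
edges = [0, 500, 1500, 2500, 4000]  # hypothetical band edges in Hz
energies = band_energies(frame, sample_rate, edges)
```

The list of band energies, updated frame by frame, plays the role of the transmitted parameters; a synthesizer would use them to scale the same bands of a pulse or noise excitation.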

Filter-Based Models - Filter-based synthesis methods.

Linear Prediction - Involves forming a digital filter that predicts the next sample from a linear combination of a few previous samples. The difference between the prediction and the actual signal is an error signal which, if fed back through the time-varying prediction filter, yields exactly the original signal. The error signal can be parametrically coded and resynthesized, or modified before resynthesis.
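
A minimal sketch of this analysis and resynthesis loop is given below, assuming the autocorrelation method with the Levinson-Durbin recursion (one common way to find the predictor; the source does not say which method is meant) and a single fixed frame rather than a time-varying filter. The test signal and the predictor order are arbitrary.

```python
import math


def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame."""
    return [sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] so that x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]


def prediction_error(signal, coeffs):
    """Residual e[n] = x[n] - prediction (samples before n = 0 taken as zero)."""
    return [x - sum(a * signal[n - k]
                    for k, a in enumerate(coeffs, start=1) if n - k >= 0)
            for n, x in enumerate(signal)]


def resynthesize(residual, coeffs):
    """Feed the residual back through the recursive prediction filter."""
    out = []
    for n, e in enumerate(residual):
        out.append(e + sum(a * out[n - k]
                           for k, a in enumerate(coeffs, start=1) if n - k >= 0))
    return out


signal = [math.sin(0.3 * n) + 0.5 * math.sin(0.8 * n) for n in range(200)]
coeffs = levinson_durbin(autocorrelation(signal, 4), 4)
residual = prediction_error(signal, coeffs)
rebuilt = resynthesize(residual, coeffs)
```

With the unmodified residual, `rebuilt` matches `signal` up to rounding error; coding or altering the residual (or the coefficients) before resynthesis is where compression and voice modification come in.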

Formant Wave Functions (FOFs) - Time-domain waveform models of the impulse responses of individual formants. Each of these formant wave functions can be excited at the required fundamental frequency to produce the singing voice.
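
As a hedged sketch of the idea, each formant can be modelled as a short damped sinusoid with a smooth attack, and one such grain per formant can be overlap-added at every pitch period. The raised-cosine attack, the decay rate set to pi times the bandwidth, and the formant values below are assumptions chosen for illustration; they are not taken from the talk.

```python
import math


def fof_grain(center_hz, bandwidth_hz, attack_s, duration_s, sample_rate):
    """One formant wave function: a damped sinusoid with a raised-cosine attack."""
    alpha = math.pi * bandwidth_hz  # decay rate derived from the bandwidth
    grain = []
    for n in range(round(duration_s * sample_rate)):
        t = n / sample_rate
        attack = (0.5 * (1.0 - math.cos(math.pi * t / attack_s))
                  if t < attack_s else 1.0)
        grain.append(attack * math.exp(-alpha * t)
                     * math.sin(2 * math.pi * center_hz * t))
    return grain


def fof_voice(formants, f0_hz, duration_s, sample_rate):
    """Overlap-add one grain per formant at every pitch period.
    formants: list of (center_hz, bandwidth_hz, amplitude)."""
    total = round(duration_s * sample_rate)
    out = [0.0] * total
    grains = [(amp, fof_grain(fc, bw, 0.003, 0.02, sample_rate))
              for fc, bw, amp in formants]
    start = 0.0
    while start < total:
        offset = round(start)
        for amp, grain in grains:
            for i, g in enumerate(grain):
                if offset + i >= total:
                    break
                out[offset + i] += amp * g
        start += sample_rate / f0_hz  # advance by one pitch period
    return out


# Hypothetical formant set, loosely resembling an open vowel.
formants = [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]
voice = fof_voice(formants, f0_hz=220.0, duration_s=0.1, sample_rate=16000)
```

Changing `f0_hz` changes the pitch without moving the formants, which is the property that makes this family of models attractive for singing.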

Sinusoidal Models - An input signal is decomposed into a number of sinusoidal partials with the use of Fourier analysis to locate and track individual sinusoidal partials in the voice signal. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis.
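
The analysis and resynthesis steps can be sketched for a single stationary frame as follows. The sketch makes several simplifying assumptions: a naive DFT with no window, partials that fall exactly on DFT bins, phases discarded, and no tracking of partials from frame to frame. The threshold and test signal are arbitrary.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Amplitude-scaled DFT magnitudes for bins 0 .. n/2 - 1."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) * 2 / n
            for k in range(n // 2)]


def pick_partials(frame, sample_rate, threshold=0.05):
    """Return (frequency_hz, amplitude) for each spectral peak above threshold."""
    mags = dft_magnitudes(frame)
    n = len(frame)
    return [(k * sample_rate / n, mags[k])
            for k in range(1, len(mags) - 1)
            if mags[k] > threshold
            and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]


def additive_resynthesis(partials, num_samples, sample_rate):
    """Sum one sinusoid per partial (zero initial phase assumed)."""
    return [sum(a * math.sin(2 * math.pi * f * t / sample_rate)
                for f, a in partials)
            for t in range(num_samples)]


sample_rate = 8000
n = 256
frame = [math.sin(2 * math.pi * 500 * t / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 1500 * t / sample_rate)
         for t in range(n)]
partials = pick_partials(frame, sample_rate)
rebuilt = additive_resynthesis(partials, n, sample_rate)
```

Modification happens between the two steps: for instance, multiplying every frequency in `partials` by a constant before resynthesis transposes the sound.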

Frequency Modulation - Involves modulating the frequency of one oscillator (the carrier) with the output of another (the modulator) to create a spectrum consisting of sidebands surrounding the carrier frequency. By placing suitably scaled and modulated carrier waves close to the formants of the singing voice, one can approximate the spectrum of a singing voice.
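
One way to sketch this is to give each formant its own carrier, tuned to the harmonic of the fundamental nearest the formant centre, and to modulate all carriers at the fundamental frequency so that the sidebands fall on harmonics. The code below uses the usual phase-modulation form of FM; the formant centres, amplitudes, and modulation indices are hypothetical values, not settings from the talk.

```python
import math


def nearest_harmonic(f0_hz, formant_hz):
    """Frequency of the harmonic of f0 closest to the formant centre."""
    return max(1, round(formant_hz / f0_hz)) * f0_hz


def fm_voice(f0_hz, formants, duration_s, sample_rate):
    """formants: list of (center_hz, amplitude, modulation_index)."""
    out = []
    for n in range(round(duration_s * sample_rate)):
        t = n / sample_rate
        modulator = math.sin(2 * math.pi * f0_hz * t)
        sample = 0.0
        for center_hz, amp, index in formants:
            carrier_hz = nearest_harmonic(f0_hz, center_hz)
            sample += amp * math.sin(2 * math.pi * carrier_hz * t
                                     + index * modulator)
        out.append(sample)
    return out


# Hypothetical formant settings: (centre in Hz, amplitude, modulation index).
formants = [(650.0, 1.0, 1.0), (1080.0, 0.5, 0.8), (2450.0, 0.2, 0.5)]
voice = fm_voice(200.0, formants, duration_s=0.05, sample_rate=16000)
```

Because every carrier and the modulator are multiples of the same fundamental, the result is harmonic and repeats every pitch period; the modulation index controls how widely energy spreads around each formant.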

Acoustic Tubes/Physical Model - Simulates the vocal tract transfer function by solving the one-dimensional wave equation inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4000 Hz.
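
For intuition, the simplest closed-form case of the one-dimensional model is a uniform tube closed at the glottis and open at the lips, whose longitudinal resonances sit at odd multiples of c / (4L). The sketch below is only this idealization, with an assumed tract length of 17.5 cm and an assumed speed of sound of 350 m/s; a real vocal tract has a varying cross-section and needs the full model.

```python
def uniform_tube_resonances(length_m, speed_of_sound=350.0, max_hz=4000.0):
    """Quarter-wavelength resonances f_n = (2n - 1) * c / (4 * L) below max_hz
    for a uniform tube closed at one end and open at the other."""
    resonances = []
    n = 1
    while True:
        f = (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        if f > max_hz:
            break
        resonances.append(f)
        n += 1
    return resonances


resonances = uniform_tube_resonances(0.175)
```

Under these assumptions the tube has resonances near 500, 1500, 2500, and 3500 Hz, which is the evenly spaced formant pattern usually associated with a neutral vowel.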

Template-Based Models - Based on templates constructed from recorded voice sounds.


Vocal Synthesis Controllers

It is possible to control vocal synthesis in real time. However, existing controllers are not a natural "fit", because there are too many parameters to control. Current vocal controller devices are:

SqueezeVox (with Colby Leider 01) - An accordion-style controller for models of the human voice. With the right hand, pitch is controlled by the keyboard, vibrato by aftertouch, and fine pitch and vibrato by a linear strip. Breathing is controlled by the bellows, and the left hand controls vowels and consonants via buttons (presets) or continuous controllers such as a touch pad, plungers, or a squeeze interface.

The COWE (Controller, One With Everything) - A device that can be used to control a number of vocal models, including bel canto singers, crying babies, and Tibetan and Tuvan singers.


Reference

More information on the projects being carried out by Dr. Perry Cook and the Princeton SoundLab can be found at the websites listed below:

Dr. Perry Cook website
Princeton SoundLab
Other musical controllers