Microphone placement and equalization for speech intelligibility
The DPA 6066 Subminiature Headset Microphone is DPA’s newest headset. Brinck wore one during his presentation, on which this story is based, at Stage|Set|Scenery in Berlin.
Technical Editor’s note: At Stage|Set|Scenery, Bo Brinck presented “Speech Intelligibility: A Theoretical Seminar about Speech Intelligibility and Its Importance for All Performances.” The presentation focused on what parts of the audio spectrum were critical for speech intelligibility, how these parts are attenuated by microphone placement on a performer, and how equalization can be used to help boost the diminished frequencies. It was an interesting and useful presentation, so a version of it is presented here.
A PERFORMER’S SPOKEN AND SUNG words should be intelligible for the audience to understand what is being said or sung.
However, it can be challenging to retain intelligibility when amplifying the voice.
Microphone placement on the performer and what is done with the signal from the microphone can greatly affect how intelligible the amplified voice is. Loudness by itself is not enough for intelligibility. In this article, we present some facts on speech intelligibility and how to retain it.
It is important to understand the voice as a sound source. While language can be something that groups of people have in common, the sound and character of the voice is individual from person to person. At the same time, speech is an acoustic signal, subject to the laws of physics.
Sound level
Vocal effort varies from a subdued whisper to loud shouting. It is hard to assign a fixed number to speech level, as it differs from person to person, but the values in the table below indicate the average A-weighted speech level of an adult.
It is worth noting that the ability to understand speech is optimum when the level of the speech corresponds to the level of normal speech at a distance of 1 m.
In other words, a sound pressure level of approximately 55 to 65 dB referenced to 20 μPa. The table gives the average speech level as a function of listening distance; note that there is nearly a 20 dB difference between normal speech and shouting.
Note that each level presented in the table is an averaged RMS level and not a peak level. Typically, the peaks are 20 to 23 dB above the RMS level. The ratio between the peak level and the RMS level is called the crest factor. This factor is an important parameter when a voice is to be recorded or reproduced in an electroacoustic system. Loud singing, measured at the lips, can reach levels of 130 dB referenced to 20 μPa RMS and peak levels above 150 dB. The audio chain should be capable of handling these sound pressure levels.
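To make the crest factor concrete, here is a minimal sketch, assuming Python with numpy and the soundfile package and a hypothetical recording named vocal_take.wav, that measures RMS level, peak level, and their ratio in dB:

import numpy as np
import soundfile as sf

# Load a hypothetical recording; soundfile returns float samples in the range -1..1.
samples, fs = sf.read("vocal_take.wav")
if samples.ndim > 1:
    samples = samples[:, 0]              # use the first channel only

rms = np.sqrt(np.mean(samples ** 2))     # average (RMS) level
peak = np.max(np.abs(samples))           # highest instantaneous level
crest_factor_db = 20 * np.log10(peak / rms)

print(f"RMS: {20 * np.log10(rms):.1f} dBFS")
print(f"Peak: {20 * np.log10(peak):.1f} dBFS")
print(f"Crest factor: {crest_factor_db:.1f} dB")

For cleanly captured speech, the reported crest factor should land near the 20 to 23 dB mentioned above; a much smaller value suggests limiting or clipping somewhere in the chain.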
The spectrum of speech
The spectrum of speech covers quite a wide portion of the audible frequency spectrum.
In non-tonal languages (most European languages), speech consists primarily of consonant and vowel sounds. The vowel sounds are generated by the vocal cords and filtered by the vocal cavities. A whisper is without voiced sounds, but it is filtered by the same vocal cavities that shape the different vowels, so the characteristics of vowel sounds also occur in a whisper. In general, the fundamental frequency of the complex speech tone—also known as the pitch or f0—lies in the range of 100 to 120 Hz for men, though variations outside this range occur.
The f0 for women is found approximately one octave higher. For children, f0 is around 300 Hz.
The consonants are created by air blockages and noise sounds formed by the passage of air through the throat and mouth, particularly the tongue and lips. In terms of frequency, the consonants lie above 500 Hz.
At a normal vocal intensity, the energy of the vowels usually diminishes rapidly above 1 kHz. Note, however, that the emphasis of the speech spectrum shifts one to two octaves towards higher frequencies when the voice is raised (shouting). Also note that it is not possible to increase the sound level of consonants to the same extent as vowels. In practice, this means that shouting does not make speech more intelligible than normal vocal effort in situations where the background noise is not significant.
If you listen to two people speak or sing the same vowel at the same pitch (f0), the vowels are presumably recognizable as identical in both cases, even though the two voices do not necessarily produce exactly the same spectrum. The formants define the perceived vowel sounds, and they also carry information that differs from speaker to speaker. The formants result from the acoustic filtering of the spectrum generated by the vocal cords: vowels are created by tuning the resonances of the cavities in the vocal tract.
What affects intelligibility?
In tonal languages, such as Chinese and Thai, the speakers use pitch and shifts in pitch to signal meaning. In non-tonal languages, such as English, Spanish, German, and Japanese, words are primarily distinguished by changing a vowel, a consonant, or both. However, of these two, the consonants are the most important for distinguishing one word from another.
In non-tonal (Western) languages, the most important frequency range for intelligibility is the band around 2 kHz.
Most consonants are found here. If this band is lost, intelligibility is lost.
A graph of intelligibility versus high-pass or low-pass filter cut-off frequencies shows the importance of frequencies above 2 kHz.
Using an HP (high-pass) filter (blue line) at 20 Hz (upper left) leaves the speech 100% understandable. Moving the HP filter cutoff to 500 Hz still leaves the speech signal understandable; even though most of the speech energy is cut out, intelligibility is reduced by only 5%. However, raising the cutoff to 2 kHz makes intelligibility drop, and moving further, to cut frequencies below 4 kHz, reduces intelligibility almost to zero. Conversely, applying an LP (low-pass) filter cuts intelligibility severely. When cutting at 1 kHz, eliminating everything above that frequency, intelligibility is less than 40%.
Even with the cut-off at 2 kHz, intelligibility is less than 75%. It can be seen that the frequency range between 1 kHz and 4 kHz is highly important for intelligibility.
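The effect is easy to reproduce by ear. The following sketch, assuming Python with scipy and soundfile and a hypothetical mono recording named speech.wav, writes high-passed and low-passed versions so the loss of intelligibility can be compared directly:

import soundfile as sf
from scipy.signal import butter, sosfiltfilt

speech, fs = sf.read("speech.wav")   # hypothetical mono speech recording

def filtered(signal, cutoff_hz, kind, rate):
    # Steep (8th-order) Butterworth high-pass or low-pass filter.
    sos = butter(8, cutoff_hz, btype=kind, fs=rate, output="sos")
    return sosfiltfilt(sos, signal)

# High-passing at 500 Hz removes most of the energy but little intelligibility;
# low-passing at 1 kHz keeps most of the energy but removes most intelligibility.
sf.write("speech_hp500.wav", filtered(speech, 500, "highpass", fs), fs)
sf.write("speech_lp1k.wav", filtered(speech, 1000, "lowpass", fs), fs)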
The sound field
The sound field around the person speaking is affected not only by the physics of the vocal tract, but also by the person’s head and body. The radiation of the voice is directional: not all frequencies radiate equally in all directions.
If the sound level at 1 m is plotted around a person, the difference between front and back overall is approximately 7 dB. Furthermore, in the vertical plane, the level forward is slightly higher at 30° below the horizontal compared to straight forward, mainly because sound is reflected off the chest. However, the differences in levels are also frequency-dependent, with higher frequencies being more directional.
High frequencies are attenuated more toward the rear than lower frequencies.
Directivity increases from approximately 1 kHz and up. Because the important frequencies for consonants—key for intelligibility—are above 1 kHz, higher intelligibility is obtained when a microphone is positioned in front of a person rather than behind. However, microphone placement directly in front of a performer is difficult if the performer has to move around and the microphone is not handheld. Microphones are usually placed on a performer’s head or chest, and those positions require some consideration of the polar pattern of the voice at different frequencies.
Placing the microphone
If we take the signal from a microphone placed one meter in front of a performer as a reference, we get louder signals with different frequency curves when the microphone is positioned elsewhere.
Positioning a microphone close to the performer—on the chest, forehead, ear, or cheek—results in a signal somewhat louder than what is picked up a meter away, simply because the microphone is closer. Furthermore, if we take what a good microphone picks up at one meter as a neutral response, all the other microphone positions change the sound spectrum, emphasizing some frequencies and attenuating others. There is a general tendency toward a rise around 800 Hz, which must be considered, but the important deviations for speech intelligibility are the attenuations above 1 kHz. A lavalier microphone on the chest is particularly bad because of the severe dip at about 3 kHz, right where we need to hear the difference between consonants. Attenuation of the higher frequencies should be considered both in microphone placement and in applying compensatory equalization to the signal.
Lavalier/chest-worn microphone
The speech spectrum at the typical chest position lacks energy in the essential 3-4 kHz range. If a microphone with a flat frequency response is positioned on a person’s chest, the 3-4 kHz range should be boosted by roughly 5 to 10 dB just to compensate for the loss.
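As an illustration of what such compensation might look like, here is a sketch assuming Python with numpy and scipy; it builds a standard peaking-EQ biquad (Audio EQ Cookbook form) centered at 3.5 kHz. The +6 dB gain and Q of 1.0 are illustrative starting values, not a prescription for any particular microphone:

import numpy as np
from scipy.signal import lfilter

def peaking_eq(fs, f0, gain_db, q):
    # Coefficients for a peaking-EQ biquad (boost or cut around f0).
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

fs = 48000
b, a = peaking_eq(fs, f0=3500, gain_db=6.0, q=1.0)
# chest_signal would be the float sample array from the chest-worn microphone:
# compensated = lfilter(b, a, chest_signal)

In practice the exact gain and center frequency would be tuned by ear or against a reference microphone, as described under Empirical equalization below.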
Headset microphone: cheek or ear
A headset microphone can put the microphone at the cheek or ear position.
The level at a headset microphone on the cheek is approximately 10 dB higher than at a chest position, and the spectrum is flatter, but there is still a significant high-frequency roll-off that should be compensated for.
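Because the cheek position shows a broad roll-off rather than a narrow dip, a high-shelf filter is a natural tool. A companion sketch to the one above, with the same assumptions and again with purely illustrative values, builds a high-shelf biquad (Audio EQ Cookbook form) and checks its response at 4 kHz:

import numpy as np
from scipy.signal import freqz

def high_shelf(fs, f0, gain_db, q=0.707):
    # Coefficients for a high-shelf biquad (boost above roughly f0).
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    cw, alpha = np.cos(w0), np.sin(w0) / (2 * q)
    b = A * np.array([(A + 1) + (A - 1) * cw + 2 * np.sqrt(A) * alpha,
                      -2 * ((A - 1) + (A + 1) * cw),
                      (A + 1) + (A - 1) * cw - 2 * np.sqrt(A) * alpha])
    a = np.array([(A + 1) - (A - 1) * cw + 2 * np.sqrt(A) * alpha,
                  2 * ((A - 1) - (A + 1) * cw),
                  (A + 1) - (A - 1) * cw - 2 * np.sqrt(A) * alpha])
    return b / a[0], a / a[0]

fs = 48000
b, a = high_shelf(fs, f0=2000, gain_db=4.0)
w, h = freqz(b, a, worN=2048, fs=fs)
print(f"Gain at 4 kHz: {20 * np.log10(abs(h[np.argmin(abs(w - 4000))])):.1f} dB")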
Forehead microphone
A microphone positioned on the forehead or at the hairline is often used in film and stage performance, and is relatively neutral regarding speech intelligibility. There is a boost at 800 Hz, but the spectrum critical for consonants above about 1.5 kHz is fairly flat.
Empirical equalization
Some signal compensation is desirable to deal with the high-frequency attenuation and maintain intelligibility. The easiest way to do this is to compare the signal from a reference microphone about a meter in front of the performer to the signal from a microphone placed where it will be used in the performance. Apply equalization to the signal from the performer’s mic until it sounds about the same as the reference microphone. There are several ways you can do this.
(1) If you have two microphones, you can have the performer speak while both are live. Send one signal to one earphone on a studio headset, and the other signal to the other earphone. Adjust a graphic or parametric equalizer on the worn microphone’s signal until it sounds the same as the reference. Alternatively, you can simply switch back and forth—A/B switching—with one speaker and adjust until the differences are not noticeable.
(2) If you have one microphone, make a recording with the microphone in the reference position, and another recording with it in the worn position. Send the two recordings to the two earphones and adjust as above, or simply switch back and forth, listening to one then the other through a speaker.
(3) An alternative to listening and subjectively comparing the two signals is to invert the polarity of one, mix it with the other, and send the result to one speaker. What you hear is the difference between the two signals.
Adjust the EQ until that difference signal is as close to silent as you can get it; the two signals then match closely. Flip the polarity back to normal, and you are done, except for noting the equalizer settings.
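A rough version of this null test can also be run offline. The sketch below assumes Python with numpy and soundfile, plus two time-aligned, level-matched recordings of the same speech: a hypothetical reference_1m.wav from the reference position and worn_mic_eq.wav from the worn microphone after the EQ under test. It inverts the polarity of one, mixes the two, and reports how deep the null is:

import numpy as np
import soundfile as sf

reference, fs = sf.read("reference_1m.wav")   # hypothetical reference recording
candidate, _ = sf.read("worn_mic_eq.wav")     # hypothetical worn-mic recording after EQ
n = min(len(reference), len(candidate))

residual = reference[:n] - candidate[:n]      # polarity-inverted mix
residual_rms = np.sqrt(np.mean(residual ** 2))
reference_rms = np.sqrt(np.mean(reference[:n] ** 2))

print(f"Null depth: {20 * np.log10(residual_rms / reference_rms):.1f} dB "
      "(more negative means a better match)")

A perfect null is unlikely because of small timing and position differences; the goal is simply to drive the residual as low as the equalizer allows.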
Whatever method you use, record the equalizer settings, and you will have a good starting place for setting the EQ whenever a mic is worn in a particular position, whether on the forehead, chest, ear, or cheek. The settings should be fairly consistent, no matter who the performer is or what kind of performance it is.
Sung vowel sounds convey a lot of the beauty of music, but hearing the consonants is important if we are to know if someone’s love or luck is here to stay.
Technical Editor’s final note: Decades ago my girlfriend and I took her parents to see Cavalleria Rusticana and Pagliacci at the San Francisco Opera. At intermission, my girlfriend’s father turned to me and asked, “Why can’t I understand them?” “They are singing in Italian,” I said. “I speak Italian.
Why can’t I understand them?” was his reply.
I had no answer at the time, but now I am thinking it was because the beautiful singing was almost all vowels, no consonants. No consonants gives us pretty music but not much intelligibility.
Bo Brinck has been working in the audio industry since his early teens, first as a musician, later as a recording engineer and producer. He has more than 200 albums in his discography. With DPA Microphones since 2005, Bo now serves as Global Education and Application Manager/Product Specialist. He is a microphone expert and understands how mics work as well as how to apply them in practice.
To increase knowledge about the importance of microphones in the sound chain, Bo also educates on all types of audio-related topics, such as acoustic challenges and correct microphone placement on the body as well as on musical instruments.
. . . speech is an acoustic signal, subject to the laws of physics.