22nd November 2019

Binaural Theory

This is intended to take the reader through binaural recording theory from A to Z. My goal is that anyone who reads and understands this will be able to start making great binaural recordings, even if they have never heard of such a thing.
The approach is technical, but I tried to provide enough commentary and notes so that one could disregard the formulas and still be able to create accurate binaural recordings at the end. That being said, my hope is that people will enjoy the mathematical discussion as well.
The math itself is very simple (although some basic knowledge of signal processing wouldn’t hurt), but given the nature of the subject this is a somewhat notation-heavy discussion.
Figures are provided to help visualize every section of the discussion.

REAL EVENT
Let’s start by considering what happens when we hear a sound coming at us from an arbitrary direction ß, forming an angle ß with the direction representing the listener facing forward.
The sound wave, generally speaking, can come from above or below, front or back, as well as being angled to the left or the right.
We will call this wave Wß.
If we assume that the wave comes from far enough away, its wave fronts, grey in the figure, are planes and can be represented by straight lines. This is not a necessary hypothesis and our discussion is valid even when sounds come from up close and the wave fronts look more like concentric spheres.
At time T0 the wave hasn’t reached the listener’s head yet. At time T1 it has reached the right ear but not yet the left ear. At time T2 the wave has reached the left ear and has already passed the right ear.
So we see that the left ear perceives the same wave with a delay with respect to the right ear. For this wave’s direction, the time delay is T2 – T1.
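To get a feel for the size of T2 – T1, here is a minimal sketch. It uses a simplified straight-path model that is not part of the discussion above: the ears are treated as two points a fixed distance apart, the wave is plane and horizontal, and the bending of the wave around the head is ignored. The function name, the 0.18 m ear spacing and the 343 m/s speed of sound are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 degrees C)
EAR_SPACING = 0.18      # m, assumed distance between the two ears

def interaural_delay(angle_deg, spacing=EAR_SPACING, c=SPEED_OF_SOUND):
    """Delay T2 - T1 in seconds for a plane wave arriving at angle_deg
    from straight ahead (positive angles are toward the right ear).

    Straight-path model: the extra distance travelled to the far ear
    is spacing * sin(angle).
    """
    return spacing * math.sin(math.radians(angle_deg)) / c

for angle in (0, 30, 90):
    delay_us = interaural_delay(angle) * 1e6
    print(f"{angle:3d} degrees -> {delay_us:.0f} microseconds")
```

With these assumed values the delay stays well under a millisecond even for a sound coming from the side. A real head adds diffraction, so measured delays differ somewhat from this estimate.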
Not only that, but the wave reaching the left ear will have to overcome the shadow effect caused by the head obstructing the wave’s path.
In general, the head’s geometry (all of it, up to the eardrums and including the ear canals) causes the pressure sensed by each eardrum to be dependent on the direction from where the sound wave is coming.
The pressure sensed by each eardrum is, therefore, the product (speaking in Fourier transform terms) of the spectrum of the wave itself and that of a filter that represents the time delay and frequency shaping discussed earlier. This filter is different for each ear and depends on the direction of the wave.
These direction dependent filters are called HRTFs (Head Related Transfer Functions).
In the figure, Hß,R is the HRTF associated with direction ß for the right ear. Similarly, Hß,L is the HRTF associated with direction ß for the left ear.
Pß,R is the pressure sensed at the right eardrum caused by the wave Wß, coming from direction ß.
Similarly, Pß,L is the pressure sensed at the left eardrum caused by the wave Wß.

In a nutshell, the whole point of binaural recording is to recreate Pß,R and Pß,L at the listener’s eardrums during playback, so that the recording sounds exactly like the real event.

Based on the discussion and conventions above, we can write the equations for the pressure sensed at each eardrum as:
Pß,R = Wß * Hß,R
Pß,L = Wß * Hß,L
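The "*" in these equations is a product of spectra, which in the time domain corresponds to filtering (convolving) the wave with the HRTF impulse response. The sketch below checks that equivalence on toy data. The sample values and the helper names (dft, convolve, zero_pad) are made up for illustration and are not measured HRTFs.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform (fine for a short illustration)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def convolve(x, h):
    """Linear convolution of two sample lists (time-domain filtering)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def zero_pad(samples, length):
    """Extend a sample list with zeros so the spectra have equal length."""
    return samples + [0.0] * (length - len(samples))

# Hypothetical toy data: a short wave W and a made-up impulse response for
# one ear (a 2-sample delay followed by a weaker, later component).
wave = [1.0, 0.5, -0.25, 0.0]
hrtf_ir = [0.0, 0.0, 0.8, 0.3]

n = len(wave) + len(hrtf_ir) - 1  # length of the filtered signal

# Route 1: filter in the time domain, then take the spectrum.
pressure_time = convolve(wave, hrtf_ir)
spectrum_from_time = dft(pressure_time)

# Route 2: multiply the two spectra bin by bin (P = W * H).
spectrum_product = [w * h for w, h in
                    zip(dft(zero_pad(wave, n)), dft(zero_pad(hrtf_ir, n)))]

max_error = max(abs(a - b)
                for a, b in zip(spectrum_from_time, spectrum_product))
print(f"largest difference between the two routes: {max_error:.2e}")
```

The two routes agree to within floating-point rounding, which is why the rest of the discussion can treat every filter as a simple multiplication.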

RECORDING
Let’s now use a stereo microphone M to record pressures Xß,R and Xß,L caused by the wave Wß and present at the right and left capsules, respectively. Similarly to the HRTFs, we can introduce the MRTFs (Microphone Related Transfer Functions) as the filters that, applied to Wß, produce the pressures Xß,R and Xß,L.
Xß,R = Wß * Mß,R
Xß,L = Wß * Mß,L

These pressures will be equalized by filters ER and EL to produce the right and left signals SR and SL, which in turn will be provided as inputs to the loudspeakers during playback.

Our goal is to find ER and EL so that, no matter the direction ß of the sound being recorded, the pressures at the listener’s eardrums during playback will be exactly Pß,R and Pß,L.

Since at any given moment during the recording sounds are arriving from many directions at once, ER and EL need to be independent of ß for the problem to have a viable solution.
Assuming the microphone is symmetrically built, we can also write:
Mß,R = M-ß,L
Where -ß represents the direction that is the mirror image of direction ß with respect to the microphone’s symmetry plane. That is, the response of one of the microphone’s channels to a certain wave is the same as the response of the other channel to the mirrored wave.

PLAYBACK
We now feed the signals SR and SL to the loudspeakers (L), placed at angles α and -α with the direction representing the listener facing forward.
We call the waves generated by the loudspeakers W1 and W2, coming from directions 1 and 2 respectively.
We assume that we have some means to have W1 only affect the right ear, and W2 only affect the left ear. This can be achieved via crosstalk cancellation for loudspeaker playback (although this is a theoretical situation and crosstalk cancellation only works up to a degree) or, obviously, using headphones.

As we discussed, if we make sure that P1,R is equal to Pß,R and P2,L is equal to Pß,L, for every arbitrary direction ß, we will have recreated an accurate playback of any real event.

Since:
P1,R = W1 * H1,R
Pß,R = Wß * Hß,R
P2,L = W2 * H2,L
Pß,L = Wß * Hß,L
we can write W1 and W2 as functions of Wß.
W1 = Wß * Hß,R * (1/H1,R)
W2 = Wß * Hß,L * (1/H2,L)

Following the path for the right channel, we can write two equations, one for the recording phase and one for the playback phase:
RECORDING:
Wß * Mß,R * ER = SR
PLAYBACK:
SR * L = W1 --> SR = W1 * (1/L) = Wß * Hß,R * (1/H1,R) * (1/L)

--> Wß * Mß,R * ER = Wß * Hß,R * (1/H1,R) * (1/L)
And, finally:
ER = (1/Mß,R) * Hß,R * (1/H1,R) * (1/L)

Similarly, for the left channel, we can write the two equations:
Wß * Mß,L * EL = SL
SL * L = W2

Which, similarly to the right channel, lead to
EL = (1/Mß,L) * Hß,L * (1/H2,L) * (1/L)

Let’s rewrite the equations for ER and EL together and discuss them:
ER = (1/Mß,R) * Hß,R * (1/H1,R) * (1/L)
EL = (1/Mß,L) * Hß,L * (1/H2,L) * (1/L)

These equations give the equalization that must be applied to the signals recorded by the microphone M when a sound wave Wß arrives from direction ß, so that when listening through loudspeakers L, from directions 1 and 2, the pressures sensed by the listener’s eardrums are the same as they would be if the listener were in place of the microphone during the recording session.
Unfortunately, these equalizations, in their general form, depend on the direction ß. For every direction ß, therefore, we would need a separate EQ filter. For events made up of more than one sound source (any real event), we would need to apply many EQs to the same signal at the same time, which is obviously impossible.
The only way to overcome this problem is for the terms ((1/Mß,R) * Hß,R) and ((1/Mß,L) * Hß,L) to be identically equal to 1 for any direction ß. This means that, for any sound wave coming from any direction, the microphone must shape the wave that reaches its capsules in the same way the listener’s head would shape the wave that reaches the eardrums.
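A toy check of this argument at a single frequency is sketched below. All the complex gains are made-up numbers, not measurements, and the names (H_R, M_R_generic, M_R_head, required_eq) are hypothetical. The point is only that the required ER changes with the source direction for a microphone that is unlike a head, and stops changing when the microphone response equals the head response.

```python
# Made-up complex gains at one frequency for two source directions.
# None of these numbers are measurements; they only illustrate the algebra.
H_R = {"front": 1.0 + 0.2j, "right": 1.8 - 0.5j}          # head, right ear
M_R_generic = {"front": 0.9 + 0.0j, "right": 1.1 + 0.1j}  # mic unlike a head
M_R_head = dict(H_R)                                      # mic with M = H

H_1R = 1.5 - 0.3j  # head response to the right loudspeaker's direction
L = 0.95 + 0.05j   # loudspeaker response

def required_eq(M, H, direction):
    """ER = (1/M) * H * (1/H1,R) * (1/L) for one source direction."""
    return (1 / M[direction]) * H[direction] * (1 / H_1R) * (1 / L)

eq_generic = {d: required_eq(M_R_generic, H_R, d) for d in H_R}
eq_head = {d: required_eq(M_R_head, H_R, d) for d in H_R}

for name, eq in (("generic mic", eq_generic), ("head-shaped mic", eq_head)):
    spread = abs(eq["front"] - eq["right"])
    print(f"{name}: ER changes by {spread:.3f} between the two directions")
```

For the head-shaped microphone the two values coincide and reduce to (1/H1,R) * (1/L), so a single fixed EQ serves every direction.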

Therefore, the microphone needs to have the same geometry as the listener’s head, up to the eardrums and including the ear canals. In other words, M = H.
Ideally, a torso replica should also be included, as the torso also contributes to HRTFs, even though in a lesser amount than the head geometry.

Since every listener has their personal HRTFs, the best we can do is to use a microphone that is based on averaged measurements of head, pinnae and canal dimensions.

A mannequin head, lifelike silicone ears and plastic tubes to approximate the ear canals are a great starting point for an accurate binaural microphone.

With such a microphone, the equations for the equalizations become:
ER ~= (1/H1,R) * (1/L) ~= (1/M1,R) * (1/L)
EL ~= (1/H2,L) * (1/L) ~= (1/M2,L) * (1/L)

These EQs depend on three factors:
1. The microphone’s geometry, which we try to keep as close as possible to an average head, therefore approximately a known factor.
2. The direction of the speakers, which can be assumed similar for all listeners, since the typical stereo listening configuration is quite standard (directions 1 and 2 are approximately the same for all listeners), therefore a known factor.
3. The frequency shaping of the speakers, which for well balanced speakers is a known factor, although personal preference might play a role.
In any case, any potential problem associated with different loudspeakers having different frequency responses is not confined to binaural listening, but applies to any type of recording. It goes without saying that if the goal is accurate reproduction of an event, the speakers used should be up to the task.

These equations tell us that if we want to recreate the correct pressure at the listener’s right (left) eardrum, we need to equalize the signal recorded by the microphone’s right (left) channel with the inverse of the microphone's right (left) channel response to a wave coming from the right (left) speaker direction. We also need to undo the frequency shaping resulting from the speaker’s frequency response.
It is also worth repeating that these equations can be written ONLY under the no crosstalk hypothesis.
If the microphone is symmetrical and the capsules are matched, from
Mß,R = M-ß,L
we derive
M1,R = M2,L
and ultimately:
ER = EL

MIC EQUALIZATION
To find M1,R and M2,L we put the microphone at the listening position and feed the loudspeakers (L2) a signal with spectrum equal to 1. An impulse or white noise are such signals. These signals will produce waves W4 and W5, coming from directions 1 and 2 respectively. We use W4 and W5 instead of W1 and W2 for clarity, since W4 and W5 represent a specific subset of W1 and W2 waves, namely the ones that result from loudspeakers L2, from directions 1 and 2, reproducing signals with spectra equal to 1.
We use L2 instead of L to indicate that the loudspeakers used to equalize the microphone are not necessarily the same ones that the recording will be listened to with. This is to forewarn of an issue that may arise when both equalizing the microphone and mixing the recording with the same loudspeakers.

We know that:
Xß,R = Wß * Mß,R
Xß,L = Wß * Mß,L
Therefore:
X4,R = W4 * M1,R
X5,L = W5 * M2,L

These equations are valid if we don’t run both speakers at the same time, and require us to use the right speaker to equalize the right microphone channel and the left speaker to equalize the left microphone channel, at separate times.
We also know that:
W4 = S1 * L2
W5 = S2 * L2
We can write
X4,R = S1 * L2 * M1,R
X5,L = S2 * L2 * M2,L
And since S1 = S2 = 1
X4,R = L2 * M1,R --> M1,R = X4,R * (1/L2)
X5,L = L2 * M2,L --> M2,L = X5,L * (1/L2)

Now we can write:
ER = (1/M1,R) * (1/L) = (1/X4,R) * L2 * (1/L)
EL = (1/M2,L) * (1/L) = (1/X5,L) * L2 * (1/L)

If we equalize the microphone using loudspeakers L2 that are more or less as balanced as the speakers used to listen to the recording, then L2 = L and the EQs simplify to:
ER = (1/X4,R)
EL = (1/X5,L)
These equations represent the way to find the correct equalization for the microphone:

1. Place the microphone at the listening position.
2. Play a white noise (or equivalent) signal with the right (left) speaker and record the right (left) microphone channel. One channel at a time.
3. Invert the spectrum of the recorded signal. That spectrum is the required EQ.
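The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the "recording" is a made-up list of samples standing in for the right microphone channel's response to a single impulse from the right speaker, the helper names (dft, inverse_eq) are hypothetical, and the floor parameter is a safeguard I added that is not part of the procedure. With white noise instead of an impulse, one would average the spectrum over time before inverting it.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform (fine for a short illustration)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def inverse_eq(recorded, floor=1e-3):
    """Return (X, E): the spectrum X of the recording and the EQ E = 1/X.

    floor is an assumed safeguard: bins whose magnitude falls below it
    are clamped so that the EQ gain stays finite.
    """
    spectrum = dft(recorded)
    eq = []
    for value in spectrum:
        magnitude = max(abs(value), floor)
        # 1/X has the reciprocal magnitude and the opposite phase of X.
        eq.append(cmath.rect(1.0 / magnitude, -cmath.phase(value)))
    return spectrum, eq

# Made-up recording of the right microphone channel while the right speaker
# plays a single impulse, so its spectrum stands in for X4,R directly.
recorded_right = [1.0, 0.6, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]

x4_r, e_r = inverse_eq(recorded_right)
equalized = [x * e for x, e in zip(x4_r, e_r)]  # ideally 1 in every bin
worst = max(abs(value - 1.0) for value in equalized)
print(f"largest deviation from a flat spectrum after EQ: {worst:.2e}")
```

The left channel is handled the same way with a separate recording made through the left speaker.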

NOTES
Note 1.
In their general form:
ER = (1/X4,R) * L2 * (1/L)
EL = (1/X5,L) * L2 * (1/L)
these equations let us know something interesting.

If we find out that our recording sounds great when the mic is equalized with L2 and the recording is also mixed with L2, but it doesn’t translate well to other speakers L, then loudspeakers L2 are not ‘voiced’ correctly.

L represents the loudspeaker with which our recordings will be listened to by other people. While no two speakers are the same, we will assume L to be a speaker with a balanced response. Personal taste comes into play here, but there are studies which tell us what a balanced response looks like. “Accurate Sound Reproduction Using DSP” by Mitch Barnett is a great starting point if one wants to look into this topic.
By using the EQ procedure described above, if L2 doesn’t have a balanced response as L does, we are neglecting the (L2 * (1/L)) factor, which is not equal to 1 anymore. Put differently, it is as if we are adding an extra equalization ((1/L2) * L). For example, if L2 = 1 (L2 is equalized flat), we are adding an extra unnecessary filter L. Since it turns out that a well balanced speaker has a response that is tilted downward for the highs,

binaural recordings made with a microphone that has been equalized with flat speakers will sound great on those speakers, but will lack top end on other systems.
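The size of this effect can be sketched with a few numbers in dB, where multiplying responses becomes adding them. The three bands and the response values below are hypothetical (in particular the 3 dB high-frequency tilt is an assumed figure, not a recommendation), and playback_error_db is a made-up helper name.

```python
# Hypothetical speaker responses in dB at three bands; not measured data.
bands_hz = [100, 1000, 10000]
flat_speaker = [0.0, 0.0, 0.0]     # L2: monitor equalized flat
tilted_speaker = [1.0, 0.0, -3.0]  # L: assumed 'balanced' downward tilt

def playback_error_db(eq_speaker, listening_speaker):
    """Extra factor (1/L2) * L expressed in dB per band.

    eq_speaker is the speaker the microphone EQ was measured with (L2);
    listening_speaker is the one the recording is played back on (L).
    """
    return [l - l2 for l2, l in zip(eq_speaker, listening_speaker)]

on_same_speaker = playback_error_db(flat_speaker, flat_speaker)
on_other_speaker = playback_error_db(flat_speaker, tilted_speaker)

for hz, same, other in zip(bands_hz, on_same_speaker, on_other_speaker):
    print(f"{hz:6d} Hz: {same:+.1f} dB on L2, {other:+.1f} dB on L")
```

On the flat monitors the error is zero in every band, while on the tilted speakers the assumed top-end tilt shows up as a loss of highs.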

Note 2.
The MRTFs add a boost to some frequencies of up to 20 or even 30 dB.

Capsules capable of high SPL handling are necessary for anything other than very quiet sources.

The positive thing about this boost is that the frequencies being enhanced are the same that our ears are most sensitive to (the microphone has the same shape of our head, so the frequencies enhanced by the microphone are the same that are enhanced by our own head, which are not surprisingly the ones that we are most sensitive to).
Since the equalizing filter will effectively undo the boost by applying a corresponding dip in the frequency response, which is also applied to the noise floor at those frequencies, this means that we can use capsules with not so stringent noise requirements.

Binaural recordings have an ‘embedded’ substantial increase in S/N ratio.
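The arithmetic behind this can be sketched in dB. The levels below are hypothetical and snr_after_eq is a made-up helper. It only models the frequency band where the boost occurs and only the noise generated by the capsule itself.

```python
def snr_after_eq(signal_db, capsule_noise_db, boost_db):
    """S/N in dB within the boosted band, after the EQ has removed the boost.

    The head-shaped microphone raises the signal at the capsule by boost_db;
    the EQ then lowers signal and capsule noise together by the same amount.
    """
    signal_at_capsule = signal_db + boost_db
    signal_out = signal_at_capsule - boost_db
    noise_out = capsule_noise_db - boost_db
    return signal_out - noise_out

# Hypothetical levels, chosen only for illustration.
plain = snr_after_eq(signal_db=60.0, capsule_noise_db=20.0, boost_db=0.0)
boosted = snr_after_eq(signal_db=60.0, capsule_noise_db=20.0, boost_db=20.0)
print(f"no boost: {plain:.0f} dB S/N, 20 dB boost plus EQ dip: {boosted:.0f} dB S/N")
```

With these assumed numbers the S/N in the boosted band goes from 40 dB to 60 dB, that is, it improves by exactly the amount of the boost.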

Note 3.
The microphone’s geometry is very important. There are microphones that are sold as binaural that do much worse than placing the capsule at the canal’s entrance instead of at the end of it.
Some of those do away with the head in between the ears (a lot of the shadowing effect disappears), and some do away with the pinnae (basically a variation of the Jecklin disk). I personally would not consider any of those microphones for recordings that are intended to be of the purist type, but I also think that making music (which recording and mixing is a part of) is a creative process, where anything goes. The perfect microphone is the one that will get you the sound that you want, which is not necessarily an exact replica of the real event. Therefore even if one doesn’t want to go through the hassle of using ear canal replicas, that’s totally fine. As a matter of fact, some people even dispute that ear canals are necessary for an accurate replica of the real event. While math is hard to contradict with anecdotal evidence and personal preference, at the end of the day personal preference is what matters the most when making a recording.

Start with a spaced pair, add a Jecklin disk, substitute a sphere in place of the disk, add pinnae, add ear canals. These are sequential steps towards a more and more realistic recording. But if at any point in this sequence one feels like stopping because it’s good enough for them, nobody can tell them they are wrong.
On the other hand, it is equally wrong to assume that the binaural status quo of the capsule being located at the canal’s entrance must not be messed with and regard it as the end all be all of binaural technology. Math disproves this.
Lastly, whatever you do, if you build your microphone yourself, the closer you get to a realistic representation of the human head with your DIY microphone, the more you need to equalize it. Microphones that have pinnae NEED to be equalized for sure, no matter if they use the canals or not. They will sound anything from bad to awful if not equalized.

Note 4.
Headphone playback solves the crosstalk problem very simply and 100%.
Unfortunately, headphone listening comes with its own set of problems. First of all, headphones change the acoustic impedance seen by the eardrums compared to when the head is in free space, and they do that in a way that varies a lot from one individual to the next, especially at high frequencies. Since most of the binaural cues (all of the ones associated with frequency shaping) live in the high frequencies, this means that with headphones we add another element, on top of the inevitable differences in everybody’s HRTFs, that will interfere with the task of trying to reproduce the correct pressure at the listener’s eardrums.
This is why I personally prefer to listen to binaural recordings with loudspeakers, even though the crosstalk diminishes the three dimensional sound stage representation. I find that a lot of the soundstage is still maintained, as the brain is still able to extract the binaural cues. This is due to the fact that there is a delay associated with the sound from one loudspeaker reaching the opposite ear, and the head shadowing helps as well with a 5 to 10 dB reduction. It might not seem like much, but it is enough for the brain to pick up the binaural cues. There are also studies on crosstalk cancellation for loudspeakers that agree with this very personal consideration of mine, but go a step further and try to enhance the cancellation while making sure that no excessive coloration is added to the balance of the recording.

Lastly, on the myth that binaural recordings can only be listened to via headphones, I think that while the brain might have to work harder to extract the binaural cues from loudspeaker listening, there is no reason to think that, even if all cues were to be lost, we would be left with anything less than a very well balanced sounding stereo recording, at the very least!