Photos from one of the measurement sessions in the Lausanne cathedral. A huge set of audio-visual measurements is available here. Among other things, these measurements were used to hear the shape of a room.
The mother of all tools for doing acoustics is of course the linear acoustic wave equation. In room acoustics, sound levels are low enough that the acoustics are indeed linear, so we can use this convenient model. One thing we know about linear systems is that they are completely described by their impulse response. Since the wave equation depends on both space and time (it's a partial differential equation), rather than using the term impulse response, we can talk about its Green's function. That's basically the same thing, except that the Green's function explicitly depends on the location of the sound source and on the location of the receiver (microphone).
A room impulse response (RIR) is then exactly what the phrase says. For a source (S) and a receiver (R) at fixed locations inside the room, the RIR describes the acoustic (linear, time-shift invariant) channel between S and R. So it's a Green's function for a particular choice of the source and receiver locations. Put simply, an RIR is what we would record at R if we fired a pulse of sound at S.
The set of all RIRs in a room (for all possible S and R) is a complete description of the room's acoustics. Actually, we don't need all of them. For example, the RIR from S to R is the same as the RIR from R to S, thanks to the property called acoustic reciprocity.
Let's load some RIRs to see how they look and hear how they sound! The first RIR is from an emptied classroom at EPFL, the infamous BC329 room used for LCAV seminars.
%pylab inline
from scipy import io
mat = io.loadmat('BC-329.mat')
fs = int(mat.get('fs'))     # Sampling frequency in [Hz] (loadmat returns an array, so convert)
h1 = array(mat.get('h'))    # Room impulse response
n1 = arange(0, fs)          # Sample indices for the first second
nEarly = arange(0, fs//15)  # Sample indices for roughly the first 70 ms
# Plot the first second of the impulse response
subplot(2, 1, 1)
plot(n1/double(fs)*1e3, h1[n1])
ylabel('$h(t)$')
# Zoom in on the early part (first ~70 ms)
subplot(2, 1, 2)
plot(nEarly/double(fs)*1e3, h1[nEarly])
xlabel('Time in [ms]')
ylabel('$h(t)$')
Here we plotted the first second of the RIR, and we also zoomed in to the first 70 milliseconds. In this zoomed-in figure we can see some distinct peaks that are in fact echoes. More on this in the subsection on the Structure of RIRs (further down).
Now we'll load and plot a room impulse response from the Lausanne cathedral!
mat = io.loadmat('Cathedral.mat')
fs = int(mat.get('fs'))     # Sampling frequency in [Hz]
h2 = array(mat.get('h'))    # Room impulse response
n5 = arange(0, 5*fs)        # Sample indices for the first five seconds
# Plot the first five seconds of the impulse response
subplot(2, 1, 1)
plot(n5/double(fs)*1e3, h2[n5])
ylabel('$h(t)$')
# Plot the early part of the impulse response
subplot(2, 1, 2)
plot(nEarly/double(fs)*1e3, h2[nEarly])
xlabel('Time in [ms]')
ylabel('$h(t)$')
Let's see how the spectrograms of these two RIRs look. A spectrogram represents the distribution of spectral energy over time; it tells us which frequencies are amplified and which are attenuated by different portions of an RIR. In the figures below, notice that high frequencies are much more pronounced in BC329, perhaps owing to its flat, very reflective surfaces (glass walls, for example).
figure()
subplot(2, 1, 1)
specgram(h1[n1,0], NFFT=1024, noverlap=4);   # BC329, first second
subplot(2, 1, 2)
specgram(h2[n1,0], NFFT=1024, noverlap=4);   # Cathedral, first second
It is a very good idea to listen to room impulse responses. By doing that, we can really appreciate the difference between various acoustic spaces.
from IPython.display import Audio   # Audio playback in the notebook
Audio(h1[0:3*fs, 0], rate=fs) # Play the BC329 impulse response (3 seconds)
Audio(h2[0:5*fs, 0], rate=fs) # Play the cathedral impulse response (5 seconds)
Another good exercise is to slow these responses down by stretching them in time. This is equivalent to playing them back at a lower sampling rate... of course, this will also pitch-shift them, so they'll sound deeper.
R = 14                          # Slow-down factor
Audio(h1[0:3*fs, 0], rate=fs/R) # Play the BC329 impulse response at a lower sampling rate
This seems quite neat. What do you observe? What can you hear? Compare this with the cathedral's RIR slowed down.
Audio(h2[0:5*fs, 0], rate=fs/R) # Play the cathedral impulse response at a lower sampling rate
Please note that some audio players may have problems with the value we divide the sampling rate by (the variable R above). In those cases, you may change R for the playback to work.
How is an RIR constructed? Well, an impulse fired somewhere in the room travels to the microphone. But together with this direct-path impulse, echoes from the walls also arrive at the microphone. Echoes generate new echoes from other walls, which we call higher-generation echoes, and this goes on until the energy is absorbed. Every time an echo bounces off a wall, the wall absorbs a fraction of the sound energy (it is converted to heat in the walls). Additionally, different materials absorb energy differently in different frequency bands, so the walls also filter the impulse (if we look at a wall as a filter, its impulse response is not a Dirac---it has some finite, non-infinitesimal duration). This means that with every reflection, what once was a perfect impulse becomes more and more spread out in time.
The next important effect is diffuse reflection. For most frequencies of interest, and most common materials, the walls are essentially mirrors. That is, in order to hear a reflection of a point source of sound from a particular wall, we must be listening in exactly the right spot, so that the incident angle of the sound waves with respect to the wall's normal equals the outgoing angle. Check out the following figure that explains the image-source model. The image-source model says that these echoes (which, by the way, we call specular, again in analogy to light) can be modeled as virtual image sources mirroring the original source across the walls.
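To make the image-source picture concrete, here is a minimal sketch (not the measurement code; all numbers are made up) that mirrors a source across the four walls of a 2-D rectangular room and prints when and how loudly each first-order echo arrives, assuming 1/distance spherical spreading and a fixed wall reflection coefficient.

c = 343.                          # Speed of sound in [m/s]
Lx, Ly = 8., 6.                   # Room dimensions in [m] (made up)
src = array([2., 3.])             # Source position in [m] (made up)
mic = array([6., 2.])             # Receiver position in [m] (made up)

# First-order image sources: mirror the source across each of the four walls
images = [array([-src[0], src[1]]),          # Wall at x = 0
          array([2*Lx - src[0], src[1]]),    # Wall at x = Lx
          array([src[0], -src[1]]),          # Wall at y = 0
          array([src[0], 2*Ly - src[1]])]    # Wall at y = Ly

alpha = 0.9                       # Wall reflection coefficient (made up)
d = sqrt(sum((mic - src)**2))     # Direct path length
print('Direct sound: %5.2f ms, relative amplitude %.3f' % (d/c*1e3, 1/d))
for img in images:
    d = sqrt(sum((mic - img)**2)) # Distance from image source to receiver
    print('Echo:        %5.2f ms, relative amplitude %.3f' % (d/c*1e3, alpha/d))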
But also similarly to light (though typically in a much smaller proportion), some sound energy is scattered around this ideal direction. Even though this energy is very small, and difficult to even measure for a single reflection, it is quite important for the perception of sound in a room.
Another important effect not modeled as an echo is diffraction. Diffraction is essentially the bending of sound around edges and corners. It's the reason we can hear someone talking from around the corner, even if there are no walls around to reflect the sound. Again, these contributions are really important for the listening experience.
All these effects together are reflected in the structure of an RIR. Take another look at the RIRs we plotted earlier. The early part of a room impulse response contains more or less isolated, well-pronounced peaks that correspond to individual echoes. But frequency-dependent wall absorption, the diffusivity of reflections, and diffraction all combine in the later part of an RIR to create a noise-like exponential decay (an equally important contribution comes from the fact that the echoes get denser and denser as time passes). In fact, that's how modern auralization software models the tail of a room impulse response: as noise multiplied by an exponential decay. The rate of the decay varies wildly between rooms and is captured by the reverberation time. T60 is defined as the time it takes the reflections of a direct signal to decay by 60 dB. It varies from around 1 second in speech-only lecture halls, to around 2 seconds in general-purpose auditoria, to more than 8 seconds in the Notre-Dame cathedral in Paris, and then some, as in the world's longest echo.
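To illustrate what such a model produces, here is a minimal sketch of a synthetic tail, assuming a made-up T60: white Gaussian noise multiplied by an exponential envelope that drops by 60 dB at t = T60. (A real auralization engine would additionally shape the decay per frequency band.)

T60 = 2.0                             # Reverberation time in [s] (made up)
t = arange(0, int(1.5*T60*fs)) / double(fs)
envelope = 10.0**(-3.0*t/T60)         # Amplitude envelope: -60 dB at t = T60
h_tail = randn(len(t)) * envelope     # Noise-like, exponentially decaying tail
plot(t, h_tail)
xlabel('Time in [s]')
ylabel('Synthetic tail $h(t)$');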
An important practical question is how to measure RIRs. As they are "responses to an impulse," the first idea is to simply use an impulsive source of sound and record the room response with a microphone. This simple idea has numerous drawbacks: impulsive sources are hard to reproduce consistently, their spectra are far from flat, and it is difficult to pack enough energy into a single pulse to get a reasonable SNR.
A very good way to measure RIRs is to use a loudspeaker. Again, we could try to reproduce a sound impulse through the loudspeaker, but the energy we would require in order to get reasonable SNRs would destroy it. That is why, in practice, we spread this energy over time: we play a known sound through the loudspeaker and then deconvolve in the spectral domain. Denote the room impulse response by $h(t)$ and the sound that we play by $x(t)$. Then the microphone measures their convolution,
$$y(t) = (x * h)(t).$$

In the frequency domain, convolution becomes multiplication,

$$Y(\omega) = X(\omega) H(\omega),$$

and we can estimate the RIR $h(t)$ by spectral division as

$$h(t) = \mathcal{F}^{-1}\left\{ \frac{Y(\omega)}{X(\omega)} \right\}.$$

This formula suggests that we want to pick an $X(\omega)$ that is flat across frequencies. Two common choices are maximum length sequences (MLS) and sine sweeps. We will look a bit more closely into using sine sweeps for RIR measurements.
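Before working with real measurements, we can sanity-check the spectral division on toy signals. The sketch below fabricates a tiny "room" (a direct path plus one echo), simulates the recording by convolution, and recovers the impulse response by dividing the spectra. The small constant eps is a common regularization against near-zero bins of $X(\omega)$; the plain formula above is the eps = 0 case.

x = randn(8192)                                   # Known excitation (toy stand-in for a sweep)
h_true = concatenate(([1.], zeros(99), [0.5]))    # Toy "room": direct path + one echo
y = convolve(x, h_true)                           # What the microphone would record

X = fft.fft(x, len(y))                            # Zero-pad x to the length of y
Y = fft.fft(y)
eps = 1e-12                                       # Regularization constant (assumption)
h_est = real(fft.ifft(Y*conj(X) / (abs(X)**2 + eps)))

plot(h_est[:120])
xlabel('$n$ [samples]')
ylabel('Estimated $h[n]$');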
We always measure the RIR in a certain range of frequencies. That is, even if in the ideal case we would fire a true Dirac delta function with infinite bandwidth, we can't do that in practice, because loudspeakers and microphones have finite bandwidths. The Dirac is one extreme; the other extreme is a sinusoid $x_0(t) = A\sin(\omega_0 t) = A\sin(2\pi f_0 t)$, which contains only a single frequency (in this case $f_0$).
The idea is then to cover the desired frequency range $[f_1, f_2]$ by sweeping the sinusoid's frequency through it. The test signal has the general form

$$x(t) = A\sin(2\pi f(t)).$$

We call $f(t)$ the instantaneous phase, and its derivative $df(t)/dt$ the instantaneous frequency. We can choose how to vary the instantaneous frequency from $f_1$ to $f_2$. The acoustician Angelo Farina suggests using exponential sweeps, where

$$x(t) = \sin\left[\frac{\omega_1 T}{\ln(\omega_2/\omega_1)}\left(e^{\frac{t}{T}\ln(\omega_2/\omega_1)} - 1\right)\right].$$

Here, $\omega_1$ is the start frequency, $\omega_2$ is the end frequency, and $T$ is the total duration of the sweep.
Sure enough, we need a discrete version of this signal, as we want to generate it using a computer. Denote the sampling frequency by $f_s$ and the corresponding sampling period by $T_s = 1/f_s$. Then the duration of the pulse in samples is $N = \lfloor T/T_s \rfloor$, and we obtain the following formula for the discrete $x$,

$$x[n] = \sin\left[\frac{\omega_1 N T_s}{\ln(\omega_2/\omega_1)}\left(e^{\frac{n}{N}\ln(\omega_2/\omega_1)} - 1\right)\right].$$

f1 = 50. # Start frequency in [Hz]
f2 = 10000. # End frequency in [Hz]
T = 5. # Pulse duration in [s]
fs = 44100. # Sampling frequency in [Hz]
Ts = 1./fs # Sampling period in [s]
N = floor(T / Ts) # Pulse duration in samples
n = arange(0, N) # Sample index
om1 = 2*pi*f1
om2 = 2*pi*f2
x_exp = sin(om1*N*Ts / log(om2/om1) * (exp(n/N*log(om2/om1)) - 1))
subplot(2, 1, 1)
plot(n[0:int(fs)], x_exp[0:int(fs)])
xlabel('$n$ [samples]')
ylabel('$x_{exp}[n]$')
subplot(2, 1, 2)
plot(abs(fft.fftshift(fft.fft(x_exp))))
xlabel('Frequency bin $k$')
ylabel('$X_{exp}[k]$');
specgram(x_exp, NFFT=1024, noverlap=4);
ylabel('Normalized frequency (1 is $f_s/2$)')
xlabel('n [samples]')
Let's compare this spectrogram with the spectrogram of a linear sweep, and let's actually do the computation, since it's simpler in this case. In a linear sweep, we vary the instantaneous frequency linearly from $\omega_1$ to $\omega_2$ over the time $T$. We can write
$$f'(t) = \omega_1 + \frac{\omega_2 - \omega_1}{T} t.$$

Integrating this equation, we obtain

$$f(t) = \int_0^t f'(\tau)\, d\tau = \omega_1 t + \frac{\omega_2 - \omega_1}{2T} t^2,$$

and the linear sine sweep signal is given as $x_{\mathrm{lin}}(t) = \sin(f(t))$. Note that our first idea for generating a linear sweep might be to write $x(t) = \sin(f(t)\,t)$ and vary $f(t)$ from $\omega_1$ to $\omega_2$. But this would give a wrong result, as we would miss the division by 2 in the denominator above (due to the integration).
f1 = 50. # Start frequency in [Hz]
f2 = 10000. # End frequency in [Hz]
T = 5. # Pulse duration in [s]
fs = 44100. # Sampling frequency in [Hz]
Ts = 1./fs # Sampling period in [s]
N = floor(T / Ts)   # Pulse duration in samples
n = arange(0, N)    # Sample index
om1 = 2*pi*f1
om2 = 2*pi*f2
x_lin = sin(om1*(n*Ts) + (om2-om1)/2/T * (n*Ts)**2)
subplot(2, 1, 1)
plot(n[0:int(fs)], x_lin[0:int(fs)])
xlabel('$n$ [samples]')
ylabel('$x_{lin}[n]$')
subplot(2, 1, 2)
plot(abs(fft.fftshift(fft.fft(x_lin))))
xlabel('Frequency bin $k$')
ylabel('$X_{lin}[k]$');
specgram(x_lin, NFFT=1024, noverlap=4);
ylabel('Normalized frequency (1 is $f_s/2$)')
xlabel('n [samples]')
Note that in both spectrograms, sweeps end at the same frequency (as they should, of course). Try to generate the linear sweep the naive way described above (without dividing by two), and see what happens in the spectrogram.
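If you want a quick check, here is one way to generate the naive sweep with the variables defined above. Its instantaneous frequency works out to $\omega_1 + 2(\omega_2 - \omega_1)t/T$, so the spectrogram should show it ending near $2f_2 - f_1$ instead of $f_2$.

# The naive (wrong) linear sweep: varying f(t) directly inside sin(f(t)*t)
x_naive = sin((om1 + (om2 - om1)*(n*Ts)/T) * (n*Ts))
specgram(x_naive, NFFT=1024, noverlap=4);
ylabel('Normalized frequency (1 is $f_s/2$)')
xlabel('n [samples]')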
Now for the next thing: it is not a bad idea to hear how these sweeps sound, so let's play them.
Audio(x_exp,rate=fs) # Playback of the exponential sweep
Audio(x_lin,rate=fs) # Playback of the linear sweep
Fascinating, huh? This gives us a lot of insight into how human hearing works. The exponential sweep is actually perceived as more linear than the linear sweep; the linear sweep seems to linger too long at the higher frequencies. As an exercise, read up a bit and explain this!
Measured or synthetic RIRs can be used to recreate the acoustic impression of the corresponding space. If we convolve the RIR of a cathedral with a dry sound sample (an anechoic sample, recorded in an anechoic chamber), the resulting signal will sound as if it had been recorded in the cathedral. This is used, for example, in virtual reality, to create realistic-sounding virtual spaces. Check out the fantastic research by our colleague Dirk Schröder. Another application example is music production.
To provide a truly immersive experience, we should do more than just single-channel (monaural) auralization. We should either reproduce the complete original sound field using appropriate algorithms and sound systems, or provide binaural auralization, where we aim to recreate the impression of an acoustic space by presenting the sound through headphones.
For simplicity, in this Signal of the Day example we will consider only monaural auralization. As you will hear, even this simple convolution can evoke correct mental imagery about the space we're auralizing. Consider the following dry sound sample (a question "Can one hear the shape of a room?" recorded in an anechoic chamber).
mat = io.loadmat('Dry.mat')
fs = int(mat.get('fs'))    # Sampling frequency in [Hz]
y = array(mat.get('y')) # Anechoic sound
Audio(y[:,0], rate=fs)     # Play the dry recording
Sounds dry, huh? Now let's auralize! First, let's hear what happens when we convolve this dry sound with the impulse response of the classroom BC329.
from scipy import signal # To get fftconvolve which seems much faster than numpy's convolve
y_room = signal.fftconvolve(y[:,0], h1[:,0])
Audio(y_room,rate=fs)
OK, now let's do the auralization with the cathedral's RIR. What do you expect to hear?
y_cathedral = signal.fftconvolve(y[:,0], h2[:,0])
Audio(y_cathedral,rate=fs)
Recall what you read about the structure of RIRs. It might seem that the early echoes are the most important part of the room impulse response. Indeed, for some tasks, mostly related to localization, they are essential. But we now want to show that for the impression of the space, the reverberation tail is very important. To this end, try convolving the dry sound with only the early part of the cathedral impulse response (say, the first 100 ms), and then playing it.
Next, try convolving the dry sound with only the tail of the RIR (from 100 ms until the end), and listen to it. What do you hear? It sounds very hollow. The reason is that the early part of the RIR also contains the direct sound path! Without it, we are really listening only to the reverberation, without the direct sound. A sketch of the split is given below.
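As a starting point for both experiments, here is a sketch of the 100 ms split:

split = int(0.1*fs)                                 # 100 ms in samples
y_early = signal.fftconvolve(y[:,0], h2[:split,0])  # Early part (includes direct sound)
y_tail = signal.fftconvolve(y[:,0], h2[split:,0])   # Tail only (no direct sound)
Audio(y_early, rate=fs)                             # Then listen to y_tail the same way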
Find a way to add the direct sound back in (but without the echoes!) and listen to the result.