ELEC 484 Project: 3D Stereo with Two Loudspeakers

In this project I have attempted to simulate the projection of sound sources from any direction in 3D space using two speakers located in front of the listener. This is accomplished by using measured head-related transfer function data to simulate the projection of the sound and then cancel out the spatial effect of the speakers.

Source code for this project may be found at: hrtf.py

Introduction

Directional Cues

There are two ways that the human ear can distinguish between sound sources on the horizontal plane: the Interaural Intensity Difference (IID), and the Interaural Time Difference (ITD).

The IID refers to the difference in sound level between the ears, which is caused by the head attenuating sound on its way to the far ear. This cue is most useful for high-frequency sound (above roughly 1000 Hz), which is attenuated more because the head acts as a low-pass filter. The most traditional, simple, and well-known method of generating stereo sound, panning, uses this cue.
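
As a small illustration of panning, the sketch below computes left and right gains for a mono source. The constant-power (sine/cosine) law is one common choice and is an assumption here, not something taken from this project's code; `pan_gains` is a hypothetical helper name.

```python
import math

def pan_gains(position):
    """Constant-power pan: position 0.0 is hard left, 1.0 is hard right."""
    angle = position * math.pi / 2.0
    # cos^2 + sin^2 = 1, so the total power stays constant as the source moves.
    return math.cos(angle), math.sin(angle)

left, right = pan_gains(0.5)   # a centred source gets equal gain in both channels
```

Only the level differs between the two channels here, so panning of this kind carries no ITD or elevation information.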

The ITD refers to the difference in time between a sound reaching the ear closest to its source and reaching the other ear. This cue is more useful for sounds under 1000 Hz, whose wavelengths are longer than the width of the head, allowing the phase difference to be perceived.
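
To give a feel for the size of the ITD, the sketch below uses Woodworth's spherical-head approximation, ITD = (a/c)(θ + sin θ). This model is not part of the project's code; the head radius of 8.75 cm and the speed of sound of 343 m/s are assumed typical values, and both function names are hypothetical.

```python
import math

def itd_seconds(azimuth_deg, head_radius=0.0875, c=343.0):
    """Approximate ITD for a distant source using Woodworth's spherical-head model.

    azimuth_deg is measured from straight ahead, between 0 and 90 degrees.
    head_radius (metres) and c (metres per second) are assumed typical values.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))

def wavelength_m(freq_hz, c=343.0):
    """Wavelength in air, in metres, of a tone at freq_hz."""
    return c / freq_hz
```

Under these assumptions a source directly to one side gives an ITD of roughly 0.66 ms, and a 1000 Hz tone has a wavelength of about 34 cm.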

Detection of Elevation

The position of a sound source above, below, or behind the listener can also be detected. This is made possible by the pinna, the outer part of the ear. The pinna is roughly spiral in shape, and therefore affects sounds arriving from different directions on a vertical plane differently. This, combined with the IID and ITD cues, allows the detection, and therefore the projection, of sounds from any angle.

The HRTF

The Head-Related Transfer Function (HRTF) is a filter imitating the frequency-damping characteristics of sound passing through the head, echoing from the shoulder, and passing through the pinna of the ear. There is no consistent way to calculate these factors, since they all vary from person to person. Most existing models are taken from empirical measurements. The HRTFs used in this project were taken from impulse response measurements of a KEMAR dummy head with dummy ears, made at MIT in 1995 and available at http://sound.media.mit.edu/KEMAR.html. These functions are stored in the time domain as .wav files, each file corresponding to a measurement of a sound source at a particular azimuth (horizontal angle) and elevation (vertical angle).

These HRTFs each supply a left and right-channel impulse response, which I will denote as HL and HR respectively.

Speaker Crosstalk Cancellation

Using the HRTF as outlined above, sounds can be made to appear to come from any direction, but only when heard through headphones. When two speakers play the stereo sound instead, the directional effect of the speakers' placement adds its own influence. To cancel this out, the text recommends the use of a head-shadowing filter in combination with some gain, which will cancel the head's influence on the sound as it passes through it.

Implementation

Choosing a transfer function


def setangles(elev, azimuth):
	elev = int(elev)
	azimuth = int(azimuth)
	
	#bring to the nearest multiple of ten
	elev = int(round(elev / 10.0)) * 10

	if elev > 90:
		elev = 90
	if elev < -40:
		elev = -40

	#Set increment of azimuth based on elevation
	...

	return elev, azimuth, flip

		

setangles is a Python function created for this project that accepts an arbitrary azimuth and elevation and returns the closest elevation and azimuth for which a file exists. The measurements cover elevations from -40 to 90 degrees in steps of ten, each with azimuths from 0 to 180 degrees at intervals that vary with elevation.

Additionally, this function returns a boolean variable 'flip', which is set to True if the sound source is on the left (as determined from the azimuth input). Transfer functions for sound sources on the left are simply those from the mirrored position on the right with the left and right channels swapped.
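
The snapping and mirroring described above can be sketched as follows. This is not the project's setangles: `snap_angles` is a hypothetical stand-in, and because the real azimuth increments depend on elevation and are not listed here, the step is left as a caller-supplied parameter with an assumed default of 5 degrees. Negative azimuths are taken to mean sources on the left.

```python
def snap_angles(elev, azimuth, azimuth_step=5):
    """Hypothetical stand-in for setangles with a caller-supplied azimuth step."""
    # Nearest multiple of ten, clamped to the measured elevation range.
    elev = int(round(elev / 10.0)) * 10
    elev = max(-40, min(90, elev))

    # Wrap the azimuth into (-180, 180]; negative values are sources on the left.
    azimuth = ((azimuth + 180.0) % 360.0) - 180.0
    if azimuth == -180.0:
        azimuth = 180.0
    flip = azimuth < 0

    # Mirror left-hand sources onto the measured right-hand side, then snap.
    azimuth = int(round(abs(azimuth) / float(azimuth_step))) * azimuth_step
    return elev, min(azimuth, 180), flip
```

For example, a request for elevation 23 and azimuth -32 maps to the measurement at elevation 20 and azimuth 30, with flip set so that the caller swaps the two channels.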

Loading the File


def read(elev, azimuth, N=128):
	""" Accepts elev and azimuth in degrees, and returns the left and right impulse
responses closest to that combination from compact KEMAR HRTF measurements"""

	elev, azimuth, flip = setangles(elev, azimuth)
	

	filename = "compact/elev"+str(elev)+"/H"+str(elev)+"e"+str(azimuth)+"a.wav"
	fs, h_t = wav.open(filename)
	print elev,azimuth
	#column 0 holds the left channel, column 1 the right
	h_t_l = h_t[:, 0]
	h_t_r = h_t[:, 1]
	if flip:
		return h_t_r, h_t_l
	return h_t_l, h_t_r
		

This function will accept an arbitrary elevation and azimuth, call the function setangles to obtain close valid angles dictating which wav file to open, and open it. It will then split the stereo wav file into left and right channels and return them separately.

Projecting a Source Sound


def project(sig, elev, azimuth):
	h_t_l, h_t_r = read(elev, azimuth)

	Hw_l = fft(h_t_l, len(sig))
	Hw_r = fft(h_t_r, len(sig))

	f_diner = fft(sig)
	f_diner_l = Hw_l*f_diner
	f_diner_r = Hw_r*f_diner
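	#real inputs give conjugate-symmetric spectra, so the imaginary part is only rounding error
	t_diner_l = ifft(f_diner_l, len(sig)).real
	t_diner_r = ifft(f_diner_r, len(sig)).real
	return t_diner_l, t_diner_r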
		

This function accepts a mono signal, an elevation, and an azimuth, and retrieves the matching impulse responses using read(). A fast Fourier transform (FFT) takes the signal and the impulse responses to the frequency domain, where left and right signals are created by multiplying the signal spectrum by the left and right transfer functions. An inverse FFT then returns the left and right time-domain signals, which will now appear to come from the elevation and azimuth specified.
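
Multiplying spectra of the same length as the signal gives a circular convolution, so the tail of the filtered signal wraps around to its start. The sketch below shows one common alternative, zero-padding both FFTs to the full linear-convolution length. It assumes NumPy; `fft_filter` is a hypothetical helper, and the random arrays only stand in for a signal and one HRTF channel.

```python
import numpy as np

def fft_filter(sig, h):
    """Linear convolution of sig with impulse response h via the FFT."""
    n = len(sig) + len(h) - 1            # length of the full linear convolution
    spectrum = np.fft.rfft(sig, n) * np.fft.rfft(h, n)
    return np.fft.irfft(spectrum, n)     # real output of length n

rng = np.random.RandomState(0)
sig = rng.randn(256)                     # stand-in for a block of audio
h = rng.randn(128)                       # stand-in for one HRTF channel
out = fft_filter(sig, h)
```

The output is longer than the input by len(h) - 1 samples, so a block-based caller would need to add that tail into the following block rather than discard it.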

Dynamically changing Sound Source


def path(t_sig,start, end, duration=0, window_size=1024, fs=44100):
	""" Moves a sound from start to end positions over duration (Seconds)"""
	M = (fs/2.) / window_size
	w = r_[:fs/2.:M]
	N = len(w)

	window = hamming_window(N)(r_[:window_size])

	i = 1
	elev = start[0]
	elev_end = end[0]

	azimuth = start[1]
	azimuth_end = end[1]

	if duration == 0:
		duration = len(t_sig)/fs
	
	N_steps = int(len(t_sig) * 2 / window_size)
	elev_delta = float((elev_end - elev) / float(N_steps)) #deg/half-window
	azimuth_delta = float((azimuth_end - azimuth) / float(N_steps))

	output_l = zeros( len(t_sig) )
	output_r = zeros( len(t_sig) )

	while i*(window_size) < len(t_sig):
		ind_min = int((i-1.)*window_size)  #slice indices must be integers
		ind_max = int(i*window_size)
		t_sig_w = t_sig[ind_min:ind_max] * window
		t_output_l, t_output_r = project(t_sig_w, elev, azimuth)
			
		output_l[ind_min:ind_max] += t_output_l
		output_r[ind_min:ind_max] += t_output_r

		elev = elev + elev_delta
		azimuth = azimuth + azimuth_delta
		
		i = i+0.5

	return output_l, output_r
		

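This code steps from the starting position to the ending one in half-window increments and applies the matching transfer function to each overlapping window. A Hamming window was used because I had it handy. The result is the illusion of the sound source moving from the starting position to the ending one. Note that the motion always spans the whole signal; the duration argument is not currently used to pace it.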
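
The overlapping-window scheme can be sketched in isolation as follows. This is a simplified illustration assuming NumPy, not the project's path(): `overlap_add` and its `process` argument are hypothetical, and the periodic form of the Hamming window is chosen here because half-overlapped copies of it sum to exactly 1.08.

```python
import numpy as np

def overlap_add(sig, window_size=1024, process=lambda block: block):
    """Window sig in half-overlapping blocks, process each, and sum the results."""
    hop = window_size // 2
    # Periodic Hamming window: two copies offset by half a window sum to 1.08.
    window = np.hamming(window_size + 1)[:-1]
    out = np.zeros(len(sig))
    for start in range(0, len(sig) - window_size + 1, hop):
        block = sig[start:start + window_size] * window
        out[start:start + window_size] += process(block)
    return out

sig = np.ones(8 * 1024)
out = overlap_add(sig)        # identity processing, to expose the window gain
```

With identity processing, every sample away from the two ends comes out scaled by 1.08, so dividing the output by that constant would restore the original level. A per-block filter would be passed in as process.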

The effect of using this function to spin the sound source 360° about the listener's head (when headphones are used) can be heard in this sound file: diner_360_headphone.wav

Speaker Cancellation

Rather than use the method given in the text, I have devised a way to use HRTFs for this purpose, since I am already employing them and they account for pinna and shoulder effects in addition to head shadowing.

Since we can supply HRTFs for sources located at the positions of the left and right speakers, we can derive equations for the signals received at the ears, and then solve them for the speaker signals that cancel the spatial effects of using speakers instead of headphones.

First, we let HLL refer to the left-ear transfer function of the left speaker's positional HRTF; in other words, the filter we would apply to the left signal if we were attempting to make a sound come from the position of the left-hand speaker. This will come from the HRTF with azimuth -θL and elevation 0. HLR will refer to the left-ear transfer function of the right speaker, and HRL and HRR will be the right-ear transfer functions for the positions of the left and right speakers respectively.

If the separation between the ears is insignificant compared to the distance d from the speakers' center, the signals at the left and right ears, EL and ER, can be found using the following equations:

Combining these equations into a single equation, we have:

Now, we take the inverse of the HRTF matrix
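
Written out under the definitions above, these three steps take the following form. This is a reconstruction from those definitions and from the arithmetic in speaker_transform, and $S_L$ and $S_R$ are assumed names for the signals fed to the left and right speakers.

```latex
% Ear signals as sums of the two speaker contributions
E_L = H_{LL} S_L + H_{LR} S_R, \qquad E_R = H_{RL} S_L + H_{RR} S_R

% The same pair of equations in matrix form
\begin{bmatrix} E_L \\ E_R \end{bmatrix}
  = \begin{bmatrix} H_{LL} & H_{LR} \\ H_{RL} & H_{RR} \end{bmatrix}
    \begin{bmatrix} S_L \\ S_R \end{bmatrix}

% Inverting the 2x2 HRTF matrix to solve for the speaker signals
\begin{bmatrix} S_L \\ S_R \end{bmatrix}
  = \frac{1}{H_{LL} H_{RR} - H_{LR} H_{RL}}
    \begin{bmatrix} H_{RR} & -H_{LR} \\ -H_{RL} & H_{LL} \end{bmatrix}
    \begin{bmatrix} E_L \\ E_R \end{bmatrix}
```

All quantities are functions of frequency, so the inversion is carried out independently for each frequency bin.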

Using this equation, we can now apply this matrix to any signal designed for headphones, [EL ER]^T, to derive a signal for two speakers. Note that, unlike the transaural stereo method detailed in the text, this method does not require θL and θR to be the same. Notice that the HRTF implicitly accounts for the gain and delay factors present in the transaural equation below.

It should be noted that neither this method nor the one depicted in the text makes any allowance for reverberation, which will cause any two-speaker implementation of a two-channel surround signal to be inferior to the headphone version. Ideal conditions for this setup would be a large soft-walled room with speakers far enough from the listener to render the ear separation distance negligible.


def speaker_transform(sig_l, sig_r):
	theta_l = -30
	theta_r = 30

	ht_l_l, ht_l_r = read(0, theta_l)
	ht_r_l, ht_r_r = read(0, theta_r)

	H_l_l = fft(ht_l_l, len(sig_l))
	H_l_r = fft(ht_l_r, len(sig_l))
	H_r_l = fft(ht_r_l, len(sig_l))
	H_r_r = fft(ht_r_r, len(sig_l))

	f_sig_l = fft(sig_l, len(H_l_l))
	f_sig_r = fft(sig_r, len(H_l_l))

	C = ((H_l_l*H_r_r - H_r_l * H_l_r)**-1)

	
	#C (one over the determinant) must scale both terms of each output
	f_output_l = C*(H_r_r*f_sig_l - H_r_l*f_sig_r)
	f_output_r = C*(H_l_l*f_sig_r - H_l_r*f_sig_l)

	#keep the real part; the imaginary part is only rounding error for real inputs
	t_output_l = ifft(f_output_l, len(sig_l)).real
	t_output_r = ifft(f_output_r, len(sig_r)).real

	return t_output_l, t_output_r
		

This code retrieves the appropriate HRTFs for the speaker positions, and implements the above equation. This will eliminate the effect of the speakers, as discussed above. An example of this technique, calibrated for speakers at +/- 30 degrees to the listener, can be found in this sound file: diner_360_speaker.wav
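
One way to check the cancellation is to pass the computed speaker signals back through the same four transfer functions and confirm that the intended ear signals come out. The sketch below does this with short synthetic impulse responses standing in for measured HRTFs, assuming NumPy. `crosstalk_cancel` is a hypothetical helper whose arguments are named ear first, then speaker, as in the text (speaker_transform names them speaker first).

```python
import numpy as np

def crosstalk_cancel(E_l, E_r, H_ll, H_lr, H_rl, H_rr):
    """Solve the 2x2 system in each frequency bin for the speaker spectra.

    H_xy is the transfer function to ear x from speaker y.
    """
    det = H_ll * H_rr - H_lr * H_rl
    S_l = (H_rr * E_l - H_lr * E_r) / det
    S_r = (H_ll * E_r - H_rl * E_l) / det
    return S_l, S_r

n = 512
# Synthetic stand-ins: strong direct paths, weaker and later cross paths.
H_ll = np.fft.fft([1.0, 0.2], n)
H_rr = np.fft.fft([1.0, 0.1], n)
H_lr = np.fft.fft([0.0, 0.0, 0.4, 0.1], n)
H_rl = np.fft.fft([0.0, 0.0, 0.3, 0.2], n)

rng = np.random.RandomState(1)
E_l = np.fft.fft(rng.randn(n))           # desired left-ear (headphone) signal
E_r = np.fft.fft(rng.randn(n))           # desired right-ear (headphone) signal

S_l, S_r = crosstalk_cancel(E_l, E_r, H_ll, H_lr, H_rl, H_rr)

# Simulate the speakers playing S_l and S_r, and see what reaches each ear.
ear_l = H_ll * S_l + H_lr * S_r
ear_r = H_rl * S_l + H_rr * S_r
```

The synthetic responses were chosen so that the determinant stays well away from zero in every bin. With measured HRTFs that is not guaranteed, and some form of regularization may be needed at frequencies where the determinant becomes very small.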

Conclusion

This project has successfully implemented a scheme for 3D surround sound using only two speakers. Possible applications of this technique include: