AURACLE: A VOICE-CONTROLLED, NETWORKED SOUND INSTRUMENT*

Jason Freeman (Columbia University Department of Music), Kristjan Varnik (Akademie Schloss Solitude), Sekhar Ramakrishnan (Zentrum für Kunst und Medientechnologie), Max Neuhaus, Phil Burk (Softsynth), and David Birchfield (Arizona State University, Arts, Media, and Engineering Program)

 

ABSTRACT

Auracle is a voice-controlled, networked sound instrument. It enables users to control a synthesized instrument with their voice and to interact with each other in real time over the Internet. This paper discusses the historical background of the project, beginning with Neuhaus’ ‘virtual aural spaces’ in the 1960s and relating them to Barbosa’s conception of ‘shared sonic environments’. The architecture of the system is described in detail, including the multi-level analysis of vocal input, the communication of that analysis data across the network, and the mapping of that data onto a software synthesizer.

Not only is Auracle itself a collaborative, networked instrument, but it was developed through a collaborative, networked process. The project’s development mechanisms are examined, including the leveraging of existing tools for distributed development, the creation of specialized development applications, the adoption of extreme programming practices, and the use of Auracle itself as a means for communication and collaboration.

 

1. INTRODUCTION

Auracle is a voice-controlled, networked sound instrument conceived by Max Neuhaus and realized collaboratively by the authors. Users interact with each other in real time over the Internet, playing synthesized instruments together in a group ‘jam’. Each instrument is entirely controlled by a user's voice, taking advantage of the sophisticated vocal control which people naturally develop as they learn to speak.

The project was designed to facilitate the kinds of communal sound dialogue which are rare in contemporary society:

 

Anthropologists in looking at societies which have not yet had contact with modern man have often found whole communities making music together. Not one small group making music for the others to listen to, but music as a sound dialogue between all the members of the community […Auracle and my earlier broadcast works are] proposing to reinstate a kind of music which we have forgotten about and which is perhaps the original of the impulse for music in man.  Not making a musical product to be listened to, but forming a dialogue, a dialogue without language, a sound dialogue. 

These pieces then are about taking ordinary people and somehow putting them in a situation where they can start this nonverbal dialogue. They have the innate skills as our ability with language demonstrates. The real problem then is finding a way to let them escape from their preconceptions of what music is. We now think of music as an aesthetic product. When you propose to a lay public that they make music together, they all try to imitate professional musicians making a musical product, badly. It only gets interesting when they lose their self-consciousness and become themselves. (Neuhaus 1994)

 

Auracle was made to give the lay public this opportunity, and it was designed to be accessible to people without musical training or technical expertise. We strove to create an open-ended architecture rather than a musical composition: a system which, as much as possible, responds to but does not direct the activities of its users. We also sought to build a highly transparent system, in which users can easily identify their own contributions within the ensemble while remaining engaged over extended periods of time.

2. HISTORICAL BACKGROUND

2.1 SHARED SONIC ENVIRONMENTS

Auracle is inspired by a class of analogue shared sonic environments, in which the network is constructed not with the Internet but with the public telephone network combined with radio networks and broadcasts. In works such as Public Supply (1966, 1973), Max Neuhaus mixed together audio from callers during live radio broadcasts and also controlled sound synthesis with callers’ audio. In Radio Net (1977), he created a cross-country sound transformation system (Neuhaus 1990 and 1994). He describes Radio Net:

 

 

In those days radio programs on NPR were distributed by what they called a Round Robin, telephone lines connecting all two hundred stations into a large loop stretching across the country. Any station in the system could broadcast a program on all the others by opening the loop and feeding the program to the loop. 

            I saw that it was possible to make the loop itself into a sound transformation circuit … I decided to configure it into five loops, one for each call-in city, all entering and leaving the NPR studios in Washington. Instead of being open loops as usual during a broadcast  though, I wanted to close them and insert a frequency shifter in each so that the sounds would circulate.  It created a sound transformation 'box' that was literally fifteen hundred miles wide by three thousand miles long with five ins and five outs emerging in Washington.  

            We had a 'dress rehearsal' the day before the broadcast so I could get a feel for things.  It is touchy when you put a wire that long in a loop. Even if you do have a frequency shifter and gain control, each loop was in a sense a living thing; they could get out of hand very quickly. During the broadcast  I was on a conference call with the five engineers and could listen to each loop and ask for changes in shift and gain at any time. My role was holding the balance of this big five-looped animal with as little motion as possible.

            During the broadcast the sounds phoned into each city passed through its self-mixer and started looping. With each cross-country pass, each sound made another layer, overlapping itself at different pitches until it gradually died away. It was quite a beautiful Sunday afternoon - two hours over which ten thousand people made sounds.

 

 

Barbosa defines shared sonic environments as ‘a new class of emerging applications that explore the Internet’s distributed and shared nature [and] are addressed to broad audiences’ (Barbosa 2003: 58). As examples, he cites WebDrum (Burk 1999), where on-line participants collaboratively alter settings on a drum machine; MP3Q (Tanaka 2000), where users collectively manipulate MP3 files with a 3D interface; and Public Sound Objects (Barbosa and Kaltenbrunner 2002), an open-ended architecture for the creation of shared sonic environments.

We consider Auracle to be closely related to Barbosa’s concept of the shared sonic environment, to the examples he cites, and to other recent projects such as DaisyPhone (Bryan-Kinns and Healey 2004), in which Internet or mobile-phone users collaboratively modify a looping MIDI sequence, each user colouring circles to change the pitches and rhythms of his or her instrument. In Eternal Music (Brown 2003), each user drags a ball around a window to control a drone generated by modulated sine waves. And components of both the Cathedral Project (Duckworth 2000) and the Brain Opera (Machover 1996) have invited Internet users to control sounds during live physical performances, collaborating not only with other Internet users but also with live performers onstage in a concert hall.

More recently, radio shows by Negativland (Joyce 2005) and Press the Button (Radio Show Calling Tips 2005), among others, have invited callers to join improvising musicians in the broadcast studio. And Silophone (The User 2000) operates in both the analogue and digital domains; it joins together sounds made by telephone callers and sound files uploaded by Internet participants, playing them in a giant grain silo in Montreal and broadcasting their acoustic transformations back over the phone and Internet to the participants.

 

2.2 VOICE-CONTROLLED INSTRUMENTS

Unlike the analogue shared sonic environments constructed with radio and telephone networks, Auracle uses analysis data from the voice to control a synthesis engine rather than directly processing an audio stream. This approach was partially motivated by practical considerations: it reduced network bandwidth and latency while maintaining high-quality audio output. It is of course also a much more flexible way of mapping input to output.

A number of recent software projects and interactive musical works also use this technique. For example, the Kantos software plugin (Antares 2004) maps pitch, rhythmic, and formant analyses of a monophonic audio input signal onto its synthesizer; parameters of the mappings and the synthesis algorithm itself are specified by the user through a graphical interface. In the Singing Tree (Oliver 1997), a component of the interactive ‘Mind Forest’ in Tod Machover’s Brain Opera (Machover 1996), users are asked to sing a steady pitch into a microphone; as they hold it steadier and longer, a MIDI harmonization becomes richer, and images on a screen begin to change. And the Universal Whistling Machine (Böhlen and Rinker 2004) analyzes the pitch and amplitude envelopes of a user’s whistling and synthesizes responses in which the tempo, contour, and direction of the analysis data are transformed.

3. ARCHITECTURE

[Figure 1. Auracle System Architecture.]

Users launch Auracle from the project’s web site, opening a graphical user interface through which they can ‘jam’ with other users logged in from around the world. To control their instrument, users input vocal gestures into a microphone. Their gestures are analyzed, reduced into control data, and sent to a central server. The server broadcasts that data back to all participating users within their ensemble. Each client computer receives the data and uses it to control a software synthesizer.

The client software is implemented as a Java applet incorporating the JSyn plugin (Burk 1998), and real-time collaboration is handled by a server running TransJam (Burk 2000). Data logging for debugging, usage analysis, and long-term system adaptation is handled by an HTTP post (from Java) on the client side and PHP/MySQL scripts on the server side.

The following subsections describe each architectural component in detail.

3.1 LOW-LEVEL ANALYSIS[1]

The initial low-level analysis of the voice computes basic features of the audio signal over an analysis window. The incoming sound is analyzed for voicedness / unvoicedness, fundamental frequency, the first two formant frequencies with their respective formant bandwidths, and root mean square (RMS) amplitude.[2] JSyn is used to capture the input from a microphone, but it cannot extract the vocal parameters we need, so we built this functionality ourselves. We limited our own DSP implementation to pure Java to avoid packaging and deploying JNI libraries for each targeted platform. We considered techniques based on linear prediction (LP), cepstrum (used in Oliver 1997), FFT, and zero-crossing counts. We chose linear prediction, feeling it would be the easiest to implement in pure Java with acceptable performance and accuracy.

Raw sample data from the microphone is brought from JSyn into Java. Once in Java, the data is determined to be voiced or unvoiced based on the zero-crossing count. Following Rabiner and Schafer (1978), the data is downsampled to 8192 Hz and broken into 40 ms blocks, which are analyzed by LP for the following characteristics: fundamental frequency, the first and second formant frequencies, and the bandwidth of each formant. RMS amplitude values are also calculated for each block of input. The values for each block of analysis are fed into a median smoothing filter (Rabiner and Schafer 1978: 158-161) to produce the low-level feature value for that analysis frame.
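
As an illustration of this per-block processing, the following is a minimal pure-Java sketch of the simpler computations. The class and method names are ours, the voicedness threshold is illustrative, and the LP-based estimation of fundamental and formant frequencies is omitted.

    public final class BlockFeatures {
        static final int SAMPLE_RATE = 8192;                     // Hz, after downsampling
        static final int BLOCK_SIZE = SAMPLE_RATE * 40 / 1000;   // 40 ms blocks

        /** Root-mean-square amplitude of one block. */
        static float rms(float[] block) {
            double sum = 0.0;
            for (int i = 0; i < block.length; i++) sum += block[i] * block[i];
            return (float) Math.sqrt(sum / block.length);
        }

        /** Zero-crossing count; a high rate suggests noisy, unvoiced input. */
        static int zeroCrossings(float[] block) {
            int count = 0;
            for (int i = 1; i < block.length; i++)
                if ((block[i - 1] >= 0f) != (block[i] >= 0f)) count++;
            return count;
        }

        /** Voiced/unvoiced decision based on the zero-crossing rate. */
        static boolean isVoiced(float[] block) {
            return zeroCrossings(block) < block.length / 8;      // threshold is illustrative
        }

        /** Three-point median, used to smooth successive per-block feature values. */
        static float median3(float a, float b, float c) {
            return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
        }
    }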

Performance of the LP code was a major concern of ours. So, in this case, we violated Knuth’s maxim and prematurely optimized. The LP code is implemented in a slightly peculiar, non-object-oriented style. The goal was to minimize virtual and interface method lookup, and more importantly, to minimize object creation. Though such issues are often disregarded when writing Java, it should not be surprising that removing memory allocations in time-critical loops proved crucial to tuning this code. In the end, we were able to implement the signal analysis in pure Java with satisfactory performance.

3.2 MID-LEVEL ANALYSIS

The mid-level analysis parses the incoming low-level analysis data into gestures. Since users are asked to hold down a play button while they are making a sound, it was trivial to parse vocal input into gestures based on the button’s press and release.[3] If the button is held down continuously for several seconds, the system will eventually create a gesture boundary to prevent any single gesture from becoming too long. And if there are periods of silence while the button remains down, the system will create additional gesture boundaries at those points.
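
A sketch of this gesture-boundary logic is shown below; the class name and the two timing thresholds are illustrative rather than Auracle’s production values.

    public class GestureSegmenter {
        private static final long MAX_GESTURE_MS = 8000;   // force a boundary on very long gestures
        private static final long MAX_SILENCE_MS = 500;    // force a boundary after sustained silence
        private static final float SILENCE_RMS = 0.01f;

        private boolean buttonDown = false;
        private long gestureStart = 0;
        private long silenceStart = -1;

        /** Called once per analysis frame; returns true when the current gesture should end. */
        public boolean isBoundary(boolean buttonPressed, float rms, long nowMs) {
            if (buttonPressed && !buttonDown) {             // button press opens a gesture
                buttonDown = true;
                gestureStart = nowMs;
                silenceStart = -1;
                return false;
            }
            if (!buttonPressed && buttonDown) {             // button release closes it
                buttonDown = false;
                return true;
            }
            if (!buttonDown) return false;
            if (nowMs - gestureStart > MAX_GESTURE_MS) {    // cap the length of a single gesture
                gestureStart = nowMs;
                return true;
            }
            if (rms < SILENCE_RMS) {                        // boundary during sustained silence
                if (silenceStart < 0) silenceStart = nowMs;
                if (nowMs - silenceStart > MAX_SILENCE_MS) {
                    gestureStart = nowMs;
                    silenceStart = -1;
                    return true;
                }
            } else {
                silenceStart = -1;
            }
            return false;
        }
    }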

Once a gesture is identified, a feature vector of statistical parameters is created to describe the entire gesture. The choice of features is based largely on studies of vocal signal analysis for emotion classification by Banse and Scherer (1996), Yacoub, Simske, Lin, and Burns (2003), and Cowie, Douglas-Cowie, Tsapatsoulis, Votsis, Kollias, Fellenz, and Taylor (2001). While we are not focused solely on emotion, we found this research a useful starting point. Studies of timbre, most of which extend Grey’s (1977) multidimensional scaling studies, were also informative, but their focus on steady instrumental tones was less directly applicable to the variety of vocal gestures expected from Auracle users.

While many emotion classification studies try to separate linguistically determined features from emotionally determined features (Cowie et al. 2001), this is not necessary in Auracle. Our system responds to features of user input whether they are linguistically determined, emotionally determined, or consciously manipulated by users to control the instrument in specific ways.

Our mid-level feature vector includes 43 features: the mean, minimum, maximum, and standard deviation of f0, f1, f2, and RMS amplitude, as well as of their derivatives; the mean, minimum, maximum, and standard deviation of the durations of individual silent and non-silent segments within the gesture; and the ratio of silent to non-silent frames, voiced to unvoiced frames, and mean silent to mean non-silent segment duration.
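
The statistics themselves are straightforward; a sketch of the per-envelope computation (our own helper, not Auracle’s actual code) follows. The full vector applies it to each low-level envelope and its derivative and appends the duration statistics and ratios described above.

    public final class FeatureStats {
        /** Returns { mean, minimum, maximum, standard deviation } of one envelope. */
        static float[] summarize(float[] envelope) {
            float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
            double sum = 0.0, sumOfSquares = 0.0;
            for (float v : envelope) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
                sumOfSquares += v * v;
            }
            int n = envelope.length;
            float mean = (float) (sum / n);
            float std = (float) Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));
            return new float[] { mean, min, max, std };
        }
    }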

3.3 HIGH-LEVEL ANALYSIS[4]

It would be possible in principle to transmit each 43-element mid-level feature vector across the network and to map it directly onto synthesis control parameters, but in practice we found this amount of data impractical to address directly.

Instead, we perform a high-level analysis which projects the 43-dimensional mid-level feature space onto three dimensions. In choosing those dimensions, we did not wish to merely select a subset of the mid-level features, nor did we wish to manually create projection functions: these approaches would have driven users to interact according to our own preconceptions, and in doing so would have contradicted the goals of the project.

We were attracted to the use of Principal Components Analysis (PCA) to generate this projection, because it preserves the greatest possible amount of variance in the original data set. In other words, the mid-level features which users themselves vary the most take on the greatest importance in the PCA projection. It facilitates a self-organizing, user-driven approach.

But PCA creates a static projection; for Auracle, we wanted a dynamic approach which could perform both short-term adaptation — by changing over the course of a single user session to focus on the mid-level features varied most by that user — and long-term adaptation, in which the classifier’s initial state for each session slowly changes to concentrate on the mid-level features most varied by the entire Auracle user base.

An adaptive classifier does sacrifice a degree of transparency in its classifications: it is more difficult for users to relate their vocal gestures to sound output when the high-level feature classifications, and thus the mappings, are constantly changing. And it is impossible to interpret the meaning of high-level features during the design of mapping procedures, since their semantics change with adaptation. For us, though, transparency in this component of Auracle was less important than adaptability.

Our adaptive PCA implementation does not use classical PCA methods, in which the principal components of a set of feature vectors are the eigenvectors of the covariance matrix of the set with the greatest eigenvalues. This strategy is awkward to adapt to a continuously-expanding input set and computationally expensive to perform in real time in Java.

[Figure 2. APEX neural network.]

Instead, we implement the Adaptive Principal Component EXtraction (APEX) model (Diamantaras and Kung 1996 and Kung, Diamantaras, and Taur 1994), which improves upon earlier neural networks proposed by Oja (1982), Sanger (1989), Rubner and Tavan (1989), and others. APEX efficiently implements an adaptive version of PCA as a feed-forward Hebbian network (with modifications to maintain stability) and a lateral, asymmetrical anti-Hebbian network. The Hebbian portion of the network discovers the principal components, while the anti-Hebbian portion rotates those components. The learning rate of the algorithm is automatically varied in proportion to the magnitude of the outputs and a ‘forgetting’ factor which controls the algorithm’s memory of past inputs (Kung, Diamantaras, and Taur 1994).
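
A sketch of one APEX training step follows, using a common formulation of the algorithm (Kung, Diamantaras, and Taur 1994). It is simplified: the learning rate is fixed here, whereas Auracle varies it with output magnitude and a forgetting factor, and the stability modifications mentioned above are omitted.

    public class ApexSketch {
        final int inputs, outputs;      // e.g. 43 mid-level features projected onto 3 high-level features
        final double[][] w;             // feed-forward (Hebbian) weights   [outputs][inputs]
        final double[][] c;             // lateral (anti-Hebbian) weights   [outputs][outputs], lower-triangular
        double beta = 0.001;            // learning rate, fixed here for simplicity

        ApexSketch(int inputs, int outputs) {
            this.inputs = inputs;
            this.outputs = outputs;
            w = new double[outputs][inputs];
            c = new double[outputs][outputs];
            java.util.Random random = new java.util.Random(0);
            for (int j = 0; j < outputs; j++)
                for (int i = 0; i < inputs; i++) w[j][i] = 0.01 * random.nextGaussian();
        }

        /** Projects one mid-level feature vector and adapts the weights; returns the high-level features. */
        double[] trainStep(double[] x) {
            double[] y = new double[outputs];
            for (int j = 0; j < outputs; j++) {
                double yj = 0.0;
                for (int i = 0; i < inputs; i++) yj += w[j][i] * x[i];   // Hebbian feed-forward input
                for (int k = 0; k < j; k++) yj += c[j][k] * y[k];        // lateral input from earlier outputs
                y[j] = yj;
            }
            for (int j = 0; j < outputs; j++) {
                double yj = y[j], yjSquared = yj * yj;
                for (int i = 0; i < inputs; i++)
                    w[j][i] += beta * (yj * x[i] - yjSquared * w[j][i]);  // normalized Hebbian update
                for (int k = 0; k < j; k++)
                    c[j][k] -= beta * (yj * y[k] + yjSquared * c[j][k]);  // anti-Hebbian decorrelation
            }
            return y;
        }
    }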

When a user launches Auracle, the client’s neural network is initialized with weights downloaded from the server.[5] The client-side neural network quickly adapts to the vocal gestures created by the local user, updating its internal weights accordingly. Then, when the user logs out of Auracle, the client’s internal weights are transmitted back to the server, which merges them with its previous weight matrix to facilitate long-term adaptation.

Unlike many other neural networks, it is easy to monitor how APEX adapts; each feed-forward weight represents the importance of a particular mid-level feature in the computation of a particular high-level feature. This transparency was critical in developing, debugging, and evaluating the high-level analysis system within Auracle.

3.4 NETWORK

Each gesture’s low-level analysis envelopes, along with the high-level feature values, are sent to a central server running TransJam (Burk 2000), a Java server for distributed music applications. The TransJam server provides a mechanism to create shared objects, acquire locks on those objects, and distribute notifications of changes to those objects. Each client sends its gesture data as a modification to a data object which it has locked, and the server then transmits the updated object information to all clients in the ensemble. In this manner, all client machines maintain all players’ analysis data in sync.

By sending only control data, Auracle maintains low latency and high audio quality using a fraction of the bandwidth required for audio streaming. However, TransJam’s XML text-based protocol is expensive when transmitting floating-point numbers, since each digit is sent as a separate ASCII character. So we compress those numbers using a fixed lookup table to dramatically improve bandwidth utilization.[6]
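
As an illustration of the idea (not the production encoding), a float normalized to [0, 1] can be quantized and sent as a single character drawn from a fixed table, with the receiver using the same table to decode it.

    public final class FloatCodec {
        static final char[] TABLE =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

        /** Quantize a value in [0, 1] to one of 64 levels and encode it as one character. */
        static char encode(float value) {
            float clamped = Math.min(1f, Math.max(0f, value));
            return TABLE[Math.round(clamped * (TABLE.length - 1))];
        }

        /** Decode a character back to an approximate value in [0, 1]. */
        static float decode(char ch) {
            for (int i = 0; i < TABLE.length; i++)
                if (TABLE[i] == ch) return i / (float) (TABLE.length - 1);
            return 0f;
        }
    }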

Java security restrictions and practical networking issues made direct peer-to-peer communication impossible, necessitating a central server. However, to reduce the risk of a performance bottleneck, Auracle’s architecture is designed to minimize the work done by the server: the server is merely a conduit for data and does no processing itself. Mapping and synthesis operations are duplicated by all clients, but we preferred this solution over adding load on the server. Our benchmarking shows that we can support 100 simultaneous users, each sending one gesture per second, with an average CPU load of only 35% on our Apple Xserve (G4 1.33 GHz, 512 MB RAM).

3.4.1 NETWORK LATENCY

The analysis data is transmitted to the server only once a complete gesture has been detected. This reduces network traffic and generally uses the network more efficiently. Data is only mapped onto synthesis control parameters when it arrives from the server, even when the data was created by the local client. This creates a short delay between the vocal input and synthesized response; we have found that this latency is not a disadvantage but rather facilitates a conversational style of interaction which works quite well.

3.4.2 EVENT DISTRIBUTION

The onsets of gestures from different players are occasionally shifted in order to minimize their overlap. We delay gesture onsets by a small amount (beyond network latency) when doing so reduces overlap, and in dense textures we also scale gestures so that each one is slightly shorter. As with our approach to network latency, our goal is to facilitate a conversational style of interaction, in which players respond to past events rather than trying to synchronize with future events. As an additional benefit, users can more easily hear their contribution to the ensemble when gestures overlap less, increasing the system’s transparency.
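
A sketch of the onset-shifting idea, with illustrative names and limits, is given below: an incoming gesture is delayed by a small additional amount only if that delay reduces its overlap with gestures already scheduled.

    import java.util.List;

    public final class OnsetScheduler {
        private static final long MAX_EXTRA_DELAY_MS = 400;   // never push a gesture back further than this
        private static final long STEP_MS = 50;

        /** Returns the start time that minimizes overlap, searched in small steps. */
        static long scheduleOnset(long requestedStart, long durationMs, List<long[]> scheduled) {
            long bestStart = requestedStart;
            long bestOverlap = overlap(requestedStart, durationMs, scheduled);
            for (long delay = STEP_MS; delay <= MAX_EXTRA_DELAY_MS; delay += STEP_MS) {
                long candidate = requestedStart + delay;
                long o = overlap(candidate, durationMs, scheduled);
                if (o < bestOverlap) {
                    bestOverlap = o;
                    bestStart = candidate;
                }
            }
            return bestStart;
        }

        /** Total overlap, in milliseconds, between a candidate gesture and the scheduled ones. */
        static long overlap(long start, long durationMs, List<long[]> scheduled) {
            long end = start + durationMs, total = 0;
            for (long[] gesture : scheduled)     // each entry is a { start, end } pair
                total += Math.max(0, Math.min(end, gesture[1]) - Math.max(start, gesture[0]));
            return total;
        }
    }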

3.5 MAPPING AND SYNTHESIS

Each client receives the data from the server and passes it to a mapper. The mapper generates envelopes and control parameters for a software synthesizer. The mapper manages an entire ensemble of synthesis instruments, each of which is controlled by the vocal gestures of a single player. The mapper allocates and de-allocates synthesizer objects as users enter and exit the ensemble. It also temporarily halts inactive synthesizer objects when a client's CPU usage climbs above a threshold.

The synthesis algorithm, implemented entirely using the JSyn API (Burk 1998), is a hybrid of several techniques, designed to enable the mapping of player data onto a wide range of timbres. The synthesizer is composed of three separate sections: an excitation source, a resonator, and a filter bank. The initial excitation is composed of two sources, a pulse oscillator and a frequency-modulated sine oscillator. These are mixed and sent through an extended comb filter with an averaging lowpass filter and probabilistic signal inverter included in the feedback loop. The result is sent through a bank of bandpass filters and mixed with the unfiltered sound to generate the final output.

[Figure 3. Synthesizer schematic.]
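
The feedback path is the least conventional part of this chain. The following sketch expresses it as a plain per-sample loop rather than as JSyn unit generators, with a two-point averaging lowpass and a probabilistic inverter inside a comb (delay-line) filter; the class and parameter names are ours.

    public class CombFeedbackSketch {
        private final float[] delayLine;
        private int writeIndex = 0;
        private float previousDelayed = 0f;
        private final java.util.Random random = new java.util.Random();

        public CombFeedbackSketch(int delaySamples) {
            delayLine = new float[delaySamples];     // delay length is set from the fundamental frequency
        }

        /**
         * @param input              one sample of the excitation mix (pulse + FM sine)
         * @param feedbackGain       comb feedback amount, 0..1
         * @param invertProbability  chance of flipping the sign of the fed-back sample
         */
        public float process(float input, float feedbackGain, float invertProbability) {
            float delayed = delayLine[writeIndex];
            float lowpassed = 0.5f * (delayed + previousDelayed);    // averaging lowpass in the loop
            previousDelayed = delayed;
            if (random.nextFloat() < invertProbability)              // probabilistic signal inverter
                lowpassed = -lowpassed;
            float output = input + feedbackGain * lowpassed;
            delayLine[writeIndex] = output;
            writeIndex = (writeIndex + 1) % delayLine.length;
            return output;
        }
    }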

Much of the low-level analysis data is mapped onto the synthesis algorithm in straightforward ways. The fundamental frequency envelope controls the frequency of the excitation sources and the length of the feedback delay line. The amplitude envelope controls the amplitude of the excitation source, the overall amplitude of the synthesizer, and the depth of frequency modulation.[7] The first and second formant envelopes set the centre frequencies of the bandpass filters, and the Q of each filter is inversely proportional to the corresponding formant bandwidth envelope. The voicedness / unvoicedness envelope of a gesture modulates the probabilistic signal inverter between noisier and purer timbres.

High-level feature data, on the other hand, is used to control timbral aspects of the synthesis which evolve from one gesture to the next but do not change within a single gesture: the ratio of pulse to sine generators in the excitation source, the probability of inverting the feedback signal, and the filter Q values. In the latter two cases, a high-level feature value defines a range within which the parameter can vary over the course of the gesture. Then low-level envelopes — voicedness/unvoicedness level and formant bandwidths, respectively — control continuous, subtle variations within that range over the course of the gesture.
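
A sketch of this two-level mapping (our own helper, with an arbitrary window width) makes the division of labour concrete: the high-level feature places a window within the parameter’s overall range, and the low-level envelope moves the parameter within that window during the gesture.

    public final class TwoLevelMapping {
        /** Maps a high-level feature and a low-level envelope value onto one synthesis parameter. */
        static float map(float highLevelFeature,   // normalized 0..1, fixed for the whole gesture
                         float envelopeValue,      // normalized 0..1, varies within the gesture
                         float paramMin, float paramMax) {
            float windowWidth = 0.25f * (paramMax - paramMin);                        // illustrative width
            float windowLow = paramMin + highLevelFeature * (paramMax - paramMin - windowWidth);
            return windowLow + envelopeValue * windowWidth;
        }
    }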

We did not want sound output to stop completely when users were not making vocal gestures. So when a player’s synthesizer is finished playing a gesture, it continues sounding a quiet ‘after ring’ until the next user gesture is received. The relationship of the vocal gesture to this after ring is less transparent than with the gesture itself; it is designed simply to be a quiet sound which constantly but subtly changes. It is based on the formant envelopes of the previous gesture, slowed down dramatically and played out of phase with each other.

To help users more easily distinguish among the sounds controlled by different players in the ensemble, we make small modifications to timbre and panning for each synthesis instrument. A player’s own synthesizer is always panned to the centre of the stereo mix; other players in the ensemble are panned to the left or right to varying degrees. And the frequency modulation ratio for each synthesizer is randomly initialized to a different position in a lookup table, which defines an ordered set of ratios moving gradually from whole numbers to increasingly complex fractions. Over the course of a session, a synthesizer’s frequency modulation ratio subtly moves through the lookup table (in response to changes in the standard deviation of a gesture’s fundamental frequency), but each user’s synthesizer begins at a unique position in that table.

3.6 GRAPHICAL USER INTERFACE

[Figure 4. Auracle Graphical User Interface.]

The focus of Auracle is on aural interaction, so the software’s graphical user interface is deliberately sparse. The main display area shows information about all users in the active ensemble of players: their usernames, their approximate locations on a world map (computed with an IP-to-location service), and a running view of the gestures they make (displayed as a series of coloured squiggles corresponding to their amplitude, fundamental frequency, and formant envelopes).

Users push and hold a large play button when they want to make a vocal gesture. Additional controls allow them to move to another ensemble, create a new ensemble, and monitor and adjust audio levels. A text chat among players within the ensemble is available in a separate popup window.

4. DISTRIBUTED DEVELOPMENT PROCESSES

Not only is Auracle itself a collaborative, networked instrument, but it was developed through a collaborative, networked process. The six members of the project team were based in Germany, Italy, California, and Arizona. During the year-long development process, the team met for three intensive week-long meetings in Germany to discuss key aesthetic and architectural issues. But the deployment of standard collaboration and communication tools was essential in coordinating the efforts of team members throughout the year and in making the project a reality.

4.1 PROJECT-SPECIFIC DEVELOPMENT TOOLS[8]

4.1.1 DYNAMIC SYSTEM CONFIGURATION

We designed Auracle as a component-based architecture because we wanted to experiment with a variety of approaches, particularly with regards to mapping and synthesis techniques. The final release of Auracle only uses a small fraction of the hundreds of components we created.

Auracle’s architecture uses Java interfaces, reflection, and the observer pattern, combined with an avoidance of direct cross references, so that components can be mixed and matched to form a complete system. During startup, the application reads a text file specifying the particular components to be used and instantiates the corresponding configuration.
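
A minimal sketch of this kind of configuration-driven instantiation is shown below; the file format and property key are illustrative, not Auracle’s actual configuration syntax.

    import java.io.FileInputStream;
    import java.util.Properties;

    public final class ComponentLoader {
        /** Instantiates the component class named under the given key in a configuration file. */
        static Object loadComponent(String configPath, String key) throws Exception {
            Properties config = new Properties();
            config.load(new FileInputStream(configPath));
            String className = config.getProperty(key);      // e.g. mapper=auracle.mapper.DefaultMapper
            return Class.forName(className).newInstance();   // the caller casts to the expected interface
        }
    }

A mapper, for example, might be obtained with a call such as (Mapper) loadComponent("auracle.config", "mapper"), where both names are hypothetical; because components refer to one another only through interfaces and observer registration, any class named in the file can be substituted without recompiling the rest of the system.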

Reconfiguration of Auracle does not require the source code to be recompiled, but it does require the configuration file to be edited and the program to be restarted. Rapid comparisons between configurations are not possible. And small tweaks to synthesizer parameters require changes to the source code; they cannot be specified in the configuration file. As the number of experimental components grew, tracking and comparing components and configurations became increasingly difficult, and manually distributing configurations to colleagues became tedious.

[Figure 5. Auracle TestBed graphical user interface.]

To address these limitations, we created the Auracle TestBed, a separate application used only in the development process and not included in the public release. Popup menus in the TestBed’s GUI select analyzer, mapper, synthesizer, and effects unit components, and sliders adjust internal synthesizer parameters for fine-tuning control.

The TestBed saves configurations of analyzers, mappers, synthesizers, and effects units as patches. Developers annotate patches through name and description fields to add comments or help explain them to other team members. The patches are saved as text files and also displayed as buttons in the graphical user interface. A single button press switches to a different system configuration, enabling rapid comparisons between patches. The change in Auracle's configuration is immediate; no text files need to be edited and the application does not need to be restarted.

From within the TestBed, developers can also easily upload patches to the group development server to share them with other team members, who can use them in a group ‘jam session’ or download them to their local machine.

4.1.2 INTEGRATION WITH EXISTING TOOLS

The Auracle TestBed can also send analysis data to any application which supports the Open Sound Control (OSC) protocol (Wright and Freed 1997). We used this feature to send Auracle data in real time to SuperCollider (McCartney 1996), Max/MSP (Cycling ’74 2004), and Wire (Burk 2004). By combining Auracle with external sound development tools, we were able to quickly prototype new ideas using existing synthesis libraries and user-friendly environments which permitted runtime modifications to synthesis algorithms.
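
As an example of what such an OSC bridge involves, the sketch below sends one set of low-level values using the JavaOSC library; the choice of library, the OSC address, the port, and the argument layout are all assumptions made for illustration.

    import com.illposed.osc.OSCMessage;
    import com.illposed.osc.OSCPortOut;
    import java.net.InetAddress;

    public class OscExample {
        public static void main(String[] args) throws Exception {
            OSCPortOut port = new OSCPortOut(InetAddress.getByName("localhost"), 57120);
            OSCMessage message = new OSCMessage("/auracle/lowlevel");   // hypothetical address
            message.addArgument(new Float(220.0f));   // fundamental frequency in Hz
            message.addArgument(new Float(0.3f));     // RMS amplitude
            port.send(message);
            port.close();
        }
    }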

We exported synthesis patches developed in Wire as Java source code and directly integrated them into Auracle’s Java source tree. For synthesis algorithms designed in the other applications, we manually ported the most successful algorithms to Java, which was straightforward.

4.2 REMOTE COLLABORATION TOOLS

We integrated a variety of online collaboration tools into our workflow to ensure that our vision of the project remained in sync, our work schedules were coordinated, and our priorities were clear. These included both structured collaboration tools — a bug tracking database and group task and calendar software — and unstructured environments — a Wiki for collaborative development of project documents and a mailing list (with searchable archives) for free-form discussion.

Equally important, Auracle itself became a platform for our own collaboration on the project. We quickly developed prototypes for all the components in the architecture, along with text-based chat functionality, and began holding twice-weekly ‘jam sessions’ on our development builds. These jams, which were usually followed by Internet-based audio conference calls, were critical opportunities to track our progress and identify technical and aesthetic issues. They also helped us to regularly experience Auracle as users rather than as developers.

4.3 EXTREME PROGRAMMING PRACTICES

Networked software development necessitated the use of good programming and development habits to keep our code clear, integrated, and synchronized. We followed many of the development practices encouraged by the Extreme Programming (XP) paradigm (Beck 1999), including nightly automated builds and unit testing on our development server, and frequent developer collaboration and task rollover from one developer to another.

4.3.1 AUTOMATING USER INPUT

Since Auracle is a voice-controlled instrument, we needed to constantly create vocal sounds in both manual and automated testing situations. Our TestBed application enables us to quickly select and loop through audio files which replace microphone input into Auracle, and we maintained a large vocal gesture sample database to use in this regard. A second, smaller collection of sound files documented gestures which caused problems such as inaccurate analyses, overloaded synthesis filters, or even crashes. We used these files to consistently reproduce problems as we were trying to fix them. Sound file playback was also incorporated into our automated unit testing architecture.

Auracle is designed for use by an ensemble of participants, so it was important to test it in group situations throughout the development process. Mapping and synthesis components sounded dramatically different when used individually than when used in a group ‘jam session’. Many bugs only occurred in group situations. And we also needed to test the server under heavy loads to benchmark performance and determine capacity.

To address these needs, we developed a Headless Client to simulate the activity of a single user. In order to reduce CPU usage, the Headless Client pre-analyzes audio files and stores data in a form ready to transmit to the server. It references this pre-processed data when ‘jamming’ on Auracle. And it does not perform any mapping or synthesis on data received back from the server.

A command-line application launches several Headless Clients simultaneously to simulate one or more ensembles of participants. A developer can simulate dozens of users from a single machine and then launch a single instance of the complete applet to ‘jam’ with them interactively.

4.3.2 DOCUMENTATION

We used Javadoc functionality to create self-documenting code; Javadoc web pages were updated nightly as part of our build process. We complemented these Javadocs with higher-level component architecture descriptions, which were posted and updated manually on our Wiki.

4.3.3 DEBUGGING AND TRACKING MECHANISMS

Once we released a beta version of Auracle to the public, we wanted to monitor user activity to identify the problems users encountered. A combination of several different logging mechanisms was developed to track this information.

Web server logs provide basic information about site visitors, and the TransJam server tracks some rudimentary information about user sessions. But this data is of little help when trying to find the source of reported problems or to track unreported issues.

So the Auracle applet complements this data by uploading more detailed information to a server-side database, tracking each client's operating system, web browser, Java implementation, and any client-side error messages and Java stack traces generated during the session. The database is searchable via a web interface, and daily e-mail summaries are sent to our mailing list.

This logging data helps us more easily track and fix bugs. When users send us problem reports, we can quickly locate their session in the database and find information about their system configuration and any errors which Auracle logged; they do not need to figure out these details themselves. We can also look directly in the database to find errors which were never reported by users at all. Often, a stack trace in the log points us to a specific line of source code and an easy solution.
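
A sketch of the client-side reporting follows; the script URL and parameter names are illustrative, and the server side is simply a PHP script that inserts the fields into MySQL.

    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public final class LogReporter {
        /** Posts one log entry describing the client environment and an error message. */
        static void report(String sessionId, String message) throws Exception {
            String body = "session=" + URLEncoder.encode(sessionId, "UTF-8")
                        + "&os=" + URLEncoder.encode(System.getProperty("os.name"), "UTF-8")
                        + "&java=" + URLEncoder.encode(System.getProperty("java.version"), "UTF-8")
                        + "&message=" + URLEncoder.encode(message, "UTF-8");
            URL url = new URL("http://www.auracle.org/log.php");           // hypothetical script URL
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("POST");
            connection.setDoOutput(true);
            OutputStreamWriter out = new OutputStreamWriter(connection.getOutputStream());
            out.write(body);
            out.close();
            connection.getResponseCode();   // forces the request to complete
            connection.disconnect();
        }
    }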

5. DISCUSSION

Auracle was officially launched to the public in October 2004 — on the Internet, at the Donaueschinger Musiktage in Germany, and during a live radio event on SWR. Since then, we have received feedback from numerous Internet-based Auracle users, and we have watched people interact with Auracle and discussed their experiences with them at several events where Auracle kiosks have been installed.

We have been thrilled to see how Auracle engages people ranging from non-musicians to trained singers, of many different ages and cultural backgrounds. Many users are drawn into extended interactions with the system, and it is always surprising to hear the variety of vocal sounds they create and the variety of sounds the system creates in response.

We are also pleased with the long-term adaptation of the high-level analysis. We often return to the system ourselves after a week or two and feel noticeable changes in its timbral response to our voices. We would still like to find additional ways to make the system adapt to user activities over time and learn from what they do.

Our mapping and synthesis components, which were selected for production use from dozens of experimental versions created over the course of Auracle’s development, succeed in enabling users to explore a wide variety of timbres. We expect to continue to develop new mapper and synthesis components, adding alternative implementations to the system from time to time.

5.1 USER BASE

We are encouraged by the fact that Auracle users are engaged with the system for extended periods of time, and that so many of them return to participate again. During the four months beginning October 15, 2004, there were 1097 user sessions on Auracle, with 590 usernames connecting from 520 distinct hosts. The average session length was 18 minutes.

The majority of users play Auracle alone. While Auracle is still engaging when played in this manner, it is most interesting when users are online at the same time and can ‘jam’ together.[9]

We have experimented with a variety of strategies to help users find each other online, including scheduling regular online events and encouraging users to schedule Auracle meetings with friends, but these techniques have had limited success. Our most successful Auracle events, ironically, have taken place in the physical world, with several computers set up as kiosks on which people can try Auracle. We are continuing to present Auracle in this format, and we are also exploring the possibility of permanent kiosks in museums and other public spaces.

In the long term, we hope to draw enough users to Auracle so that there are always ensembles of players online. In this regard, we are focusing not only on drawing more users to the site, but also on getting more of them to log in and participate once they arrive. We designed Auracle with easy setup in mind, and we tested extensively for compatibility on a wide variety of platforms and configurations. But during the four-month period beginning October 15, 2004, 6,052 distinct hosts visited the Auracle web site, yet only 520 of those hosts actually launched and logged in to Auracle. And over half of the users who did log in never actually input a sound into Auracle at all.

Some users are likely perplexed by the user interface, and we are working to improve site documentation and help them more easily test and configure their audio system. Others are too shy to contribute, preferring to lurk. But our informal polling indicates that the majority of these users simply lack computer microphones. While we do not expect users to buy an external microphone or headset just to use Auracle, we are encouraged by the growing popularity of online audio chat and telephony applications, and we hope that computer microphones will soon be ubiquitous even on desktop machines.

5.2 OPENING AURACLE TO THE COMPUTER MUSIC COMMUNITY

Auracle was designed for a lay public without formal musical or technical training, and those users have been the focus of our efforts to date. Now, we want to make the project more accessible to members of the computer music community. 

We are preparing some of the Java source code for release under an open-source license, so that others may leverage our development work in their own projects. 

The Auracle team is also interested in allowing third-party JSyn developers to contribute custom meta-instruments that would function as plugins within Auracle. Each plugin would receive voice analysis data, including envelopes for the fundamental frequency (f0), the first two formants (f1 and f2), and voicedness, and would output audio.

We are considering two options. In the first, a Java programmer develops a class that implements the AuraclePlugin API. This allows complex mappings between the input analysis parameters and the final sound, but it requires Java programming skills. In the second, a user creates a JSyn Wire patch with input ports for the vocal analysis parameters and an audio output port, graphically connecting synthesis modules together, and sends the patch to the Auracle team. Non-programmers could contribute in this way, but because the mapping of input parameters to audio would take place at the level of audio synthesis, it would be less efficient and less flexible than a higher-level mapping.

In both cases, a JAR file containing an Auracle test environment would be provided, allowing developers to build their instruments and test them in stand-alone mode using voice input.
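
To make the proposal concrete, a sketch of what such a plugin interface might look like follows; the actual AuraclePlugin API has not been published, so the names and signatures here are purely illustrative.

    /** A purely illustrative sketch of a possible plugin interface; not the actual AuraclePlugin API. */
    public interface AuraclePlugin {
        /** Called once, when the plugin should construct its synthesis circuit. */
        void start();

        /**
         * Called for each received gesture with its analysis envelopes: fundamental
         * frequency (f0), first and second formants (f1, f2), voicedness, and
         * amplitude, each as an array of per-frame values. The plugin responds by
         * producing audio output.
         */
        void playGesture(float[] f0, float[] f1, float[] f2,
                         float[] voicedness, float[] amplitude);

        /** Called when the plugin should release its synthesis resources. */
        void stop();
    }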

By opening Auracle development to new contributors, we hope that the project will evolve in new ways and new directions which we could not have envisioned ourselves. Those interested in participating can contact us through http://www.auracle.org/contact.html.

BIBLIOGRAPHICAL REFERENCES

Antares audio technologies. 2004. Antares kantos. http://www.antarestech.com/products/kantos.html.

Banse, R., and K. Scherer. 1996. Acoustic Profiles in Vocal Emotion Expression. Journal of Personality and Social Psychology, 70 (3): 614-636.

Barbosa, A. 2003. Displaced Soundscapes: A Survey of Networked Systems for Music and Sonic Art Creation. Leonardo Music Journal 13: 53-59.

Barbosa, A. and M. Kaltenbrunner. 2002. Public Sound Objects: A Shared Musical Space on the Web. Proceedings of the International Conference on Web Delivering of Music 2002. Darmstadt: IEEE Computer Society Press, 9-15.

Beck, K. 1999. Extreme Programming Explained: Embrace Change. Reading, MA: Addison-Wesley.

Böhlen, M., and J. Rinker. 2004. When Code is Context: Experiments with a Whistling Machine. Proceedings of the 12th ACM International Conference on Multimedia. New York: ACM, 983-984.

Brown, C. 2003. Eternal Network Music. http://crossfade.walkerart.org/brownbischoff2/.

Bryan-Kinns, N. and P. Healey. 2004. DaisyPhone: Support for Remote Music Collaboration. Proceedings of the 2004 Conference on New Interfaces for Musical Expression. Hamamatsu: ACM, 29-30.

Burk, P. 1998. JSyn – A Real-time Synthesis API for Java. Proceedings of the 1998 International Computer Music Conference. Ann Arbor, MI: ICMA, 252-255.

Burk, P. 1999. WebDrum. http://www.transjam.com/webdrum/.

Burk, P. 2000. Jammin' on the Web – A New Client/Server Architecture for Multi-User Performance. Proceedings of the 2000 International Computer Music Conference. Berlin: ICMA, 117-120.

Burk, P. 2004. Wire: A Graphical Editor for JSyn. http://www.softsynth.com/wire/.

Cycling ’74. 2004. Max/MSP. http://www.cycling74.com/products/maxmsp.html.

Cowie, R., E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor. 2001. Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine, January 2001, 32-80.

Diamantaras, K. and S. Y. Kung. 1996. Principal Component Neural Networks. New York: John Wiley and Sons, Inc.

Duckworth, W. 2000. The Cathedral Project. http://cathedral.monroestreet.com/.

Freeman, J., C. Ramakrishnan, K. Varnik, M. Neuhaus, P. Burk, and D. Birchfield. 2004. Adaptive High-level Classification of Vocal Gestures Within a Networked Sound Environment. Proceedings of the 2004 International Computer Music Conference. Miami, FL: ICMA, 668-671.

Grey, J. 1977. Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61 (5): 1270-1277.

Joyce, D. 2005. Get Your Own Show. http://www.negativland.com/nmol/ote/text/getoshow.html.

Kung, S., K. Diamantaras, and J. Taur. 1994. Adaptive Principal Component EXtraction (APEX) and Applications. IEEE Transactions on Signal Processing, 42 (5), 1202-1217.

Machover, T. 1996. The Brain Opera. http://brainop.media.mit.edu.

McCartney, J. 1996. SuperCollider: A New Real-time Synthesis Language. Proceedings of the 1996 International Computer Music Conference. Hong Kong: ICMA, 257-258.

Neuhaus, M. 1990. Audium, Projekt für eine Welt als Hör-Raum. In Decker, E. and P. Weibel (eds.) Vom Verschwinden der Ferne: Telekommunikation und Kunst. Cologne: Du Mont, 1990.

Neuhaus, M. 1994. The Broadcast Works and Audium. In Zeitgleich. Vienna: Triton, 1994.

Oja, E. 1982. A Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology, 15: 267-273.

Oliver, W. 1997. The Singing Tree, A Novel Interactive Musical Interface. M.S. thesis, EECS Department, Massachusetts Institute of Technology.

Rabiner, L. and R. Schafer. 1978. Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall.

Radio Show Calling Tips. 2005. http://www.pressthebutton.com/calling.htm.

Ramakrishnan, C., J. Freeman, K. Varnik, D. Birchfield, P. Burk, and M. Neuhaus. 2004. The Architecture of Auracle: A Real-Time, Distributed, Collaborative Instrument. Proceedings of the 2004 Conference on New Interfaces for Musical Expression. Hamamatsu: ACM, 100-103.

Rubner, J., and P. Tavan. 1989. A Self-Organizing Network for Principal-Components Analysis. Europhysics Letters, 10(7): 693-698.

Sanger, T. 1989. An Optimality Principle for Unsupervised Learning. In D. Touretzky (ed.) Advances in Neural Information Processing Systems. San Mateo, CA: Morgan Kaufmann.

Tanaka, A. 2000. MP3Q. http://fals.ch/Dx/atau/mp3q/.

The User. 2000. Silophone. http://www.silophone.net.

Varnik, K., J. Freeman, C. Ramakrishnan, P. Burk, D. Birchfield, and M. Neuhaus. 2004. Tools Used While Developing Auracle: A Voice-Controlled, Networked Instrument. Proceedings of the 12th ACM International Conference on Multimedia. New York: ACM, 528-531.

Wright, M. and A. Freed. 1997. Open Sound Control: A New Protocol for Communicating with Sound Synthesizers. Proceedings of the International Computer Music Conference. Thessaloniki, Hellas: ICMA, 101-104.

Yacoub, S., S. Simske, X. Lin, and J. Burns. 2003. Recognition of Emotions in Interactive Voice Response Systems. http://www.hpl.hp.com/techreports/2003/HPL-2003-136.html.

CAPTIONS

Figure 1. Auracle system architecture.

Figure 2. The APEX neural network as used within Auracle. X nodes represent mid-level features (input) and Y nodes represent high-level features (output). W weights are feed-forward, C weights are lateral.

Figure 3. Synthesizer schematic.

Figure 4. Auracle graphical user interface.

Figure 5. Auracle TestBed graphical user interface.

NOTES

* The Auracle project is a production of Max Neuhaus and Akademie Schloss Solitude (art, science, and business program) with financial support from the Landesstiftung Baden-Württemberg. We express our gratitude for their generous support. Auracle is available at http://auracle.org.

[1] For an extended discussion, see Ramakrishnan, Freeman, Varnik, Birchfield, Burk, and Neuhaus 2004. 

[2] This analysis is predicated on the assumption that the incoming sound is vocal. We are not guaranteed that the user is making vocal sounds, but we treat all input as if it were vocal.

[3] Our primary motivation for this interface design was to reduce feedback in the system, in which audio output was re-input through the microphone as a new vocal gesture.

[4] For an extended discussion, see Freeman, Ramakrishnan, Varnik, Neuhaus, Burk, and Birchfield 2004.

[5] The server-side weights are initialized through training on a database of 230 recorded vocal gestures created by ten participants.

[6] Since we know all data is numerical, it is trivial to devise such a lookup table.

[7] The amplitude envelope is used to control frequency modulation depth in order to tightly couple amplitude and timbre changes, making the amplitude envelope more salient and the user’s contribution to the synthesized sounds more transparent.

[8] For an extended discussion, see Varnik, Freeman, Ramakrishnan, Burk, Birchfield, Neuhaus 2004.

[9] We have created perpetual, virtual ensembles on Auracle where users can interact with robots if they wish.