Low Delay Audio Streaming for a 3D Audio Recording System

Marzena Malczewska, Tomasz Żernicki and Piotr Szczechowiak
Zylia Sp. z o. o.
Umultowska 85
Poznań, Poland,
{marzena.malczewska, tomasz.zernicki, piotr.szczechowiak}@zylia.pl

Abstract

This paper presents a prototype of a 3D audio recording system named AudioSense which uses Wireless Acoustic Sensors to capture spatial audio. The sound is recorded in real time by microphones embedded in each sensor device and streamed to a Processing Unit for 3D audio compression. One of the key problems in systems which stream audio data is end-to-end latency. This paper focuses on analysing a set of chosen parameters of the Opus codec in order to obtain the minimal delay. Experimental results on the prototype system have shown that it is possible to achieve an end-to-end audio delay below 10 ms with the use of the Opus codec.

Keywords

audio streaming, opus codec, low latency streaming, wireless sensor network, spatial audio

1 Introduction

The area of 3D audio and object-based audio is currently a hot research topic, as evidenced by a large number of research papers and new emerging standards such as MPEG-H 3D Audio. The majority of the research efforts in this area are concentrated on the audio processing and rendering side. The problem of 3D audio recording is getting less attention in the literature.

Channel-based methods of spatial sound production assume that the number of microphones used during the recording is directly proportional to the number of loudspeakers used during sound rendering. This can lead to large numbers of microphones in the case of multichannel recordings. In addition, proper setting and tuning of microphones in the field can be a tedious task which requires many resources in terms of time and manpower. Current distributed recording systems use wired microphones, which makes it difficult to deploy and use the system in certain environments.
To overcome the limitations of current spatial audio recording systems, the AudioSense system is being developed. It introduces object-based sound representation and wireless audio streaming. The system can be described as a Wireless Acoustic Sensor Network (WASN) [Bertrand, 2011; Akyildiz et al., 2007]. In this system individual sound sources (objects) are extracted from a sound mixture using sound source separation techniques (e.g. Independent Component Analysis [Comon, 1994]). Object-based audio representation gives high flexibility in terms of sound rendering (easy rendering for headphones, stereo, 5.1 and 7.1 systems) and enables interactive manipulation of individual sound objects during playback.

The proposed AudioSense system has many possible applications and can be used for both indoor and outdoor audio recordings. The system can be used in teleconference applications to add the possibility of identifying speakers by speech direction. It can also be used for wildlife monitoring and live TV broadcasts from the field. The AudioSense technology has applications in surveillance systems where it can identify and track objects based on sound processing. The system can also be applied in the entertainment industry for movies, games and virtual reality applications that require immersive 3D sound.

Realisation of the AudioSense system requires overcoming major challenges in such areas as audio streaming, audio coding, sound source separation and audio synchronisation. This paper focuses on designing a low delay audio streaming mechanism that meets the strict requirements of the AudioSense system. The AudioSense system consists of battery operated devices with low processing capabilities. Therefore audio recording, coding and streaming have to be performed with energy efficiency in mind.
Live applications of the system also require low latency audio streaming that is reliable and allows simultaneous streaming of data from multiple devices over the wireless medium. In order to minimise the end-to-end delay it is necessary to optimise the audio recording process and to use a low latency audio codec and streaming method. This paper shows the design of a system that tries to achieve this goal. It presents the technologies and design choices made to implement a low delay audio streaming system on off-the-shelf embedded devices. The results achieved during the performance evaluation of the system show that it is possible to achieve an end-to-end delay below 10 ms for wireless audio streaming within the AudioSense system.

The rest of the paper is organised as follows. Section 2 presents the related work in the area of spatial audio recording systems. Section 3 describes the architecture of the AudioSense system together with the hardware and software components implemented to build the first prototype of the system. The results achieved during the performance evaluation phase are presented in Section 4. Finally, Section 5 concludes the paper and describes the lessons learnt from implementing a low delay audio streaming system.

2 Related work

Spatial audio recording systems are gaining in popularity with the introduction of 3D audio systems and technologies that can reproduce truly immersive sound. The majority of existing systems for spatial audio recording use wired microphones [Gallo et al., 2007] to capture the virtual sound stage. This drastically limits the mobility of such systems and significantly increases their deployment time. In the literature one can also find wireless systems for distributed audio recording, such as [Taysi et al., 2010] and [Pham et al., 2014].
The main problem with such systems is that the wireless sensor network devices are equipped with low quality microphones, amplifiers and A/D converters in order to keep the system low-cost and energy-efficient. Sounds recorded with such systems have insufficient quality for many audio applications. One of the systems that tries to overcome the problems of low cost WASNs is WiLMA [Schörkhuber et al., 2014]. The system introduces a wireless microphone array that offers high quality audio recording and processing. WiLMA enables connection of up to 4 professional microphones to each sensor module and provides wireless synchronisation for audio recordings. A similar approach to system design is also presented in [Mennill et al., 2012], where a distributed microphone array system is used for environmental monitoring and animal recording. This system is also based on battery operated sensors and uses GPS for accurate synchronisation of the recordings. One of the limitations of the system presented in [Mennill et al., 2012] is that it does not offer continuous real-time wireless audio streaming. All the recordings are stored on the local flash memory of the sensor devices.

The AudioSense system takes the next step in spatial audio recording systems by providing low delay wireless streaming capabilities and audio representation in the object-based format. These features open the door to a whole new range of audio applications that can be realised with the use of the AudioSense system.

3 System overview

The proposed architecture of the AudioSense system is presented in Figure 1. From the functional side the system can be divided into two parts. The first part consists of Acoustic Sensors that form a wireless network responsible for audio recording. The second part includes an embedded device which performs 3D audio processing.
Each device in the wireless sensor network has one or several microphones and an A/D converter, and performs initial audio compression. Compressed audio signals are transmitted through the Gateway to the Processing Unit. The Gateway serves as an interface between the wireless and the wired part of the system. After reception of the audio signals the Processing Unit performs aggregation of the individual streams. Each stream is decoded and the streams are synchronised with each other. In the next step sound source separation is performed to generate individual audio objects [Salaün et al., 2014], [Ozerov et al., 2012]. These objects are then used in the process of 3D audio coding (e.g. MPEG-H 3D Audio [ISO/IEC WD 23008-3, 2014]). Finally the encoded audio is transmitted over the Internet to the client side where the sound rendering is performed.

Figure 1: Architecture of the AudioSense system.

3.1 Hardware components

From the hardware perspective the Acoustic Sensor prototype consists of a Beaglebone Black [Coley, 2014] with an Audio Cape board [BeagleBoard, 2012] and a dedicated microphone board designed in-house. Beaglebone Black is a low-power single board computer based on a 1 GHz ARM Cortex-A8 CPU. The board provides only one 12-bit analog-to-digital converter, which is not sufficient for any professional audio applications. Hence the usage of an Audio Cape (6 channels of up to 96 kHz sampling at 24 bit) is required to improve the quality of the recorded sound. For the microphone board a pair of Monacor MCE-4000 electret omnidirectional microphones is selected due to their high signal-to-noise ratio and very good sensitivity. Each microphone is connected to a low noise operational amplifier (MCP6021). The amplified acoustic signal is passed on to the Audio Cape where analog-to-digital conversion is executed. Next, the digital data in one of the available formats (e.g. S16LE) is sent to the Beaglebone Black.
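To give a sense of the data volume produced by the capture stage, the following minimal sketch computes the raw PCM rate for the S16LE format. It assumes the 48 kHz sampling rate and 128 kbit/s Opus bitrate reported for the streaming tests in Section 4.2, and it assumes one channel per microphone of the pair (CHANNELS = 2); the function names are illustrative and not part of the AudioSense software.

```python
# Raw PCM data rate for the S16LE capture format, compared with the
# Opus bitrate used in the streaming tests.

SAMPLE_RATE_HZ = 48_000      # sampling rate used in the streaming tests
BYTES_PER_SAMPLE = 2         # S16LE: signed 16-bit little-endian
CHANNELS = 2                 # assumption: one channel per microphone of the pair
OPUS_BITRATE_BPS = 128_000   # bitrate used in the streaming tests


def pcm_bitrate_bps(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed interleaved PCM."""
    return sample_rate_hz * bytes_per_sample * 8 * channels


def pcm_bytes_per_buffer(frame_ms: float, sample_rate_hz: int,
                         bytes_per_sample: int, channels: int) -> int:
    """Bytes of PCM covering one buffer of frame_ms milliseconds."""
    samples = round(sample_rate_hz * frame_ms / 1000)
    return samples * bytes_per_sample * channels


raw_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
compression_ratio = raw_bps / OPUS_BITRATE_BPS
buffer_bytes_10ms = pcm_bytes_per_buffer(10, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)

print(f"raw PCM: {raw_bps / 1000:.0f} kbit/s")
print(f"ratio to 128 kbit/s Opus: {compression_ratio:.0f}x")
print(f"10 ms buffer: {buffer_bytes_10ms} bytes")
```

Under these assumptions the uncompressed stream is an order of magnitude larger than the encoded one, which is why the initial compression is performed on the sensor before wireless transmission.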
Each Acoustic Sensor is also equipped with a wireless interface compatible with the IEEE 802.11a/b/g/n standards. The first version of the Acoustic Sensor prototype is presented in Figure 2.

Figure 2: First version of the Acoustic Sensor prototype.

3.2 Software components

The Acoustic Sensor prototype runs Debian Jessie Linux with kernel version 3.8.13. In order to implement audio processing on the device, the GStreamer framework is utilised. Sound capturing, coding and streaming are all implemented within a single GStreamer v1.4 pipeline. Figure 3 illustrates the GStreamer pipelines implemented on both the Acoustic Sensor and the 3D Audio Processing Unit. The pipeline on the Acoustic Sensor side is responsible for capturing audio samples using the ALSA plugin and encoding them with the Opus [Valin et al., 2013] encoder. Next, every encoded packet is encapsulated in an RTP packet and sent via UDP to the Processing Unit.

On the Processing Unit side each received packet is processed by the depayloader and the Opus decoder. This processing is performed for each stream independently. The next step is the separation plugin, which takes n streams, performs sound source separation and generates m audio objects. The processed data is passed on to the multiqueue and then interleaved to form a multichannel wave file. For this purpose a new GStreamer module called WavNChEnc is implemented. The stream generated by the module is then passed to the MPEG-H 3D Audio codec which generates a single .mp4 file.

To measure the encoding time, measurement points were set right before and after the Opus encoder. Correspondingly, on the Processing Unit the points were set before and after the Opus decoder. Streaming time was measured with the measurement points set just before the UDP transceiver in the Acoustic Sensor and right after the RTP depayloader in the Processing Unit.

One of the key aspects in networked audio systems is audio synchronisation.
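The capture, encode and stream stages described in Section 3.2 can be sketched as pipeline descriptions of the kind accepted by gst-launch-1.0. This is a hedged illustration, not the AudioSense pipeline itself: the host, port, payload type, output file name and the exact element chain are hypothetical, the receiver handles a single stream only, and the separation, interleaving and MPEG-H stages are omitted. The caps string assumes a recent GStreamer release; older releases may advertise a different encoding-name for Opus over RTP.

```python
# Builds gst-launch-1.0 style descriptions for a sensor-side and a
# receiver-side pipeline. All addresses, ports and file names are hypothetical.

VALID_FRAME_SIZES_MS = ("2.5", "5", "10", "20", "40", "60")


def sensor_pipeline(host: str, port: int, bitrate_bps: int = 128_000,
                    complexity: int = 0, frame_size_ms: str = "10") -> str:
    """Capture with ALSA, encode with Opus, packetise as RTP, send over UDP."""
    if frame_size_ms not in VALID_FRAME_SIZES_MS:
        raise ValueError("Opus frame size must be 2.5, 5, 10, 20, 40 or 60 ms")
    if not 0 <= complexity <= 10:
        raise ValueError("Opus complexity must be between 0 and 10")
    return " ! ".join([
        "alsasrc",
        "audioconvert",
        f"opusenc bitrate={bitrate_bps} complexity={complexity} "
        f"frame-size={frame_size_ms}",
        "rtpopuspay",
        f"udpsink host={host} port={port}",
    ])


def receiver_pipeline(port: int, out_file: str = "stream.wav") -> str:
    """Receive RTP over UDP, depayload, decode Opus, store one stream as WAV."""
    # Assumed caps; payload type 96 is an arbitrary dynamic payload number.
    caps = ("application/x-rtp,media=audio,clock-rate=48000,"
            "encoding-name=OPUS,payload=96")
    return " ! ".join([
        f'udpsrc port={port} caps="{caps}"',
        "rtpopusdepay",
        "opusdec",
        "audioconvert",
        "wavenc",
        f"filesink location={out_file}",
    ])


sensor_description = sensor_pipeline("192.168.1.10", 5004)
receiver_description = receiver_pipeline(5004)
print(sensor_description)
print(receiver_description)
```

The default arguments mirror the codec settings used in the streaming tests (128 kbit/s, complexity 0, 10 ms frames). Each printed string could be passed to gst-launch-1.0 on the respective device for a quick experiment.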
In order to provide synchronisation of the recorded audio streams, a separate synchronisation module is implemented. The synchronisation method applied is a hybrid approach based on reference broadcast that uses the ideas presented in [Elson et al., 2002] and [Budnikov et al., 2004]. Using this hybrid synchronisation method it is possible to achieve a synchronisation error of around 200 µs.

Figure 3: GStreamer pipelines implemented on the Acoustic Sensor and the 3D Audio Processing Unit side.

4 Performance evaluation

4.1 Test scenarios

The performance of the designed 3D audio recording system was evaluated using several test scenarios. The main goal of the experiments was to adjust and optimise hardware and software components of the system to achieve minimal streaming delay for different audio bitrates and network setups. Audio streaming latency was measured in an end-to-end manner, which includes:

• Audio encoding time - time needed by the Opus encoder to encode one whole buffer of data. Such measurements were executed for the codec parameters which have the highest impact on the encoding time (e.g. bitrate, complexity, frame size).
• Audio decoding time - measurement of decoding time for the same set of parameters as in the case of audio encoding.
• Audio transmission time - packet latency measurement when streaming wirelessly over Wi-Fi (IEEE 802.11n).

For the audio streaming tests the Opus parameters were constant while the network setup was different in each experiment. The system was tested with several Acoustic Sensors in the network. In each case the distance between the Acoustic Sensors and the 3D Audio Processing Unit was different, to test the system in different working conditions. In addition to latency tests the experiments also included CPU usage measurements for Opus encoding and decoding. The impact of the selected Opus parameters on sound quality was not the subject of our tests; several such tests were performed in the past and are well described in [Hoene et al., 2011].

4.2 Results

This subsection presents experimental results achieved by measuring the end-to-end audio streaming delay in the AudioSense system. The first set of tests was performed to measure the encoding delay of the Opus codec in order to find the optimal codec parameters that enable minimal processing latency. Three parameters of the codec were identified as possible candidates for processing delay optimisation:

• Complexity - defined as a trade-off between processing complexity and quality/bitrate. This parameter is selected using an integer from 0 to 10, where 0 is the lowest complexity and 10 is the highest. In the experiments fixed values of 0, 3, 6 and 10 were used to check the influence of complexity on the processing delay.
• Frame size - Opus supports fixed frame durations of 2.5, 5, 10, 20, 40 and 60 ms. Longer frames improve coding efficiency, but the gain becomes small for frame sizes above 20 ms.
• Bitrate - Opus supports bitrates in the range between 6 kbit/s and 510 kbit/s. Higher bitrate results in higher quality audio and lower latency in packet delivery at the cost of increased bandwidth.

Figure 4 shows the Opus encoding delay for different complexity and frame size settings (different colours in the figure correspond to different frame sizes). For all measurements the bitrate remained constant at 128 kbit/s. It is clearly visible that the average encoding latency increases with higher complexity values. The difference is especially visible for the longer frame durations of 40 and 60 ms, where the latency is 3 to 6 times higher than in the case of smaller frame sizes. Therefore the best values of frame size in the case of the AudioSense system are below 20 ms, where the encoding delay is smaller than 10 ms. Surprisingly, a frame size of 2.5 ms gives an encoding delay similar to that of a 20 ms frame. In terms of complexity the optimal value is 3 with a frame size of 10 ms.

Figure 4: Opus encoding time for different values of complexity and frame-size. (Plot "Time of coding", bitrate 128 kbps; x-axis: Complexity (0, 3, 6, 10); y-axis: Time [ms]; one series per frame size: 2.5, 5, 10, 20, 40, 60 ms.)

The influence of different audio bitrates on the Opus encoding delay is illustrated in Figure 5. The complexity parameter in all cases is fixed at 0. The experiments are performed for five audio bitrates (64, 96, 128, 256 and 320 kbit/s) and the same frame size values as in the previous test. The graph shows that a significant increase in processing time is visible for larger values of frame size (40 and 60 ms). It is evident that the bitrate change has a much smaller effect on encoding time than the change of the complexity parameter. For frame size values below 20 ms the change of bitrate has a very small effect on the encoding delay: only a 1 ms increase when changing the bitrate from 64 kbit/s to 320 kbit/s. This experiment shows once again that the frame size of 10 ms provides the optimal setting in terms of Opus encoding latency.

Figure 5: Opus encoding time for different values of bitrate and frame-size. (Plot "Time of coding", complexity 0; x-axis: Bitrate [kbps] (64, 96, 128, 256, 320); y-axis: Time [ms]; one series per frame size: 2.5, 5, 10, 20, 40, 60 ms.)

Figure 6: Opus decoding time for different values of complexity and frame-size. (Plot "Time of decoding", bitrate 128 kbps; x-axis: Complexity (0, 3, 6, 10); y-axis: Time [ms]; one series per frame size: 2.5, 5, 10, 20, 40, 60 ms.)

Opus decoding latency is tested in a similar manner as in the case of encoding. Figure 6 shows the impact of the complexity parameter on the decoding time for different frame duration values. The audio bitrate is fixed at 128 kbit/s.
As can be seen in Figure 6 the complexity parameter has a small impact on the overall audio decoding time. For smaller values of frame size (10 ms and below) the decoding time remains the same for different complexity values. The difference is visible only in the case of larger frame size values (20, 40 and 60 ms), where the decoding delay can increase or decrease by around 1 ms with the change of codec complexity.

Figure 7 presents Opus decoding times with respect to different audio bitrates. In all cases the complexity parameter is set to 0. It is clearly visible that the audio bitrate has a very small impact on the decoding time. The parameter that has the biggest influence on decoding time is frame duration. From the point of view of Opus decoding, the best performance in terms of execution time can be achieved for the smallest possible values of frame size: 2.5 and 5 ms.

Figure 7: Opus decoding time for different values of bitrate and frame-size. (Plot "Time of decoding", complexity 0; x-axis: Bitrate [kbps] (64, 96, 128, 256, 320); y-axis: Time [ms]; one series per frame size: 2.5, 5, 10, 20, 40, 60 ms.)

Figure 8: CPU usage for different values of bitrate and frame-size. (Plot "CPU usage"; x-axis: Bitrate [kbps] (64, 96, 128, 256, 320); y-axis: CPU [%]; one series per frame size: 2.5, 5, 10, 20, 40, 60 ms.)

The AudioSense system consists of battery operated sensor devices; therefore the power consumption during codec operation is an important parameter that can limit the total operation time of the system. Figure 8 presents the CPU usage on the Acoustic Sensor while performing coding and decoding using the Opus codec. The measurements are performed for five different audio bitrates and six frame durations. In all cases the CPU usage remains between 35% and 57%. The highest CPU usage is recorded for the smallest frame size value (2.5 ms). For all frame sizes between 10 ms and 60 ms the CPU usage stays at the same level.
The influence of audio bitrate on CPU processing is not significant, as changing the bitrate from 64 kbit/s to 320 kbit/s increases the CPU usage by 7% on average. The complexity parameter of the Opus codec has a stronger influence on the CPU processing than the audio bitrate change. Changing the complexity from 0 to 3 increases the CPU usage by 10% on average. Switching from 3 to 6 adds another 10% of CPU processing. It is recommended to set the complexity to 3 or lower in order to keep the CPU usage below 50%. From the energy efficiency point of view the optimal frame size is 10 ms.

Figure 9: Streaming delay measurements with a distance of 1m between devices. (Plot "Latency of streaming - distance ca. 1 m"; x-axis: Measurement no. (0 to 2000); y-axis: Time [ms] (0 to 3.5); series: 1 x AS, 2 x AS, 3 x AS.)

Figure 10: Streaming delay measurements with a distance of 6m between devices. (Plot "Latency of streaming - distance ca. 6 m"; same axes and series as Figure 9.)

Figure 11: Streaming delay measurements with a distance of 10m between devices. (Plot "Latency of streaming - distance ca. 10 m"; same axes and series as Figure 9.)

Audio encoding and decoding add a significant delay in the audio processing pipeline of the AudioSense system. The third factor that adds an additional delay is audio streaming over the wireless channel. In order to measure streaming latency over Wi-Fi several experiments are performed using the prototype AudioSense system. All the tests are made with the same parameters of the Opus codec: sampling rate 48 kHz, bitrate 128 kbit/s, complexity 0, frame size 10 ms. Figure 9 presents the first set of experiments where the network consists of one, two or three Acoustic Sensors (AS).
In all cases the distance between the 3D Audio Processing Unit and the Acoustic Sensors is equal to 1 m. Each test comprises 2000 measurements. The streaming is performed under Line of Sight (LOS) conditions using the 802.11n mode. It is clearly visible in Figure 9 that the streaming delay is the lowest (around 70 µs) when there is only one Acoustic Sensor in the network. Adding a second sensor that simultaneously sends audio data to the Processing Unit significantly increases the overall packet delivery time, to around 1 ms on average. The network with three Acoustic Sensors increases the delay even further, to around 1.6 ms.

Figures 10 and 11 show the results of the same experiment as above but under different network conditions. The distance between the devices is increased to 6 m and 10 m respectively. The streaming is performed under Non Line of Sight (NLOS) conditions. The results demonstrate that the increase in distance between devices has a small influence on the average audio streaming delay. The average delay for packet reception remains at the same level in all three sets of tests. The main difference can be noticed in the jitter levels, which are much higher when using the system in NLOS conditions.

5 Conclusions

This paper presents the architecture of the AudioSense system which is designed to record and process spatial audio. All the hardware and software components of the prototype implementation of the system are described in detail. The result of the work is a wireless acoustic sensor network capable of distributed sound recording in an object-based audio format. The paper also focuses on the development of a low delay audio streaming technique which meets the strict requirements of the AudioSense system. For this purpose the GStreamer framework is utilised together with the Opus codec.
The optimal working parameters for the codec are selected through experimental evaluation and the end-to-end delay is measured for different setups of the wireless network. The results demonstrate that it is possible to achieve an average delay below 10 ms for coding, transmission and decoding of the audio signal in a wireless system of several Acoustic Sensors.

For future work it would be interesting to test the system on a larger scale with parallel transmissions from many Acoustic Sensors. The capacity of the system and the transmission delay can be further optimised by utilising wireless streaming in the 802.11ac standard. For the needs of sound source separation it will be beneficial to apply a hardware based synchronisation method which would limit the synchronisation error to several µs.

6 Acknowledgements

This work was supported by the National Centre for Research and Development (NCBiR), Poland, "Leader" programme.

References

Adapteva, 2014. Parallella Reference Manual 13.11.25.

Ian F Akyildiz, Tommaso Melodia, and Kaushik R Chowdhury. 2007. Wireless multimedia sensor networks: A survey. Wireless Communications, IEEE, 14(6):32–39.

BeagleBoard, 2012. BeagleBone Audio Cape Revision A1 System Reference Manual, October.

Alexander Bertrand. 2011. Applications and trends in wireless acoustic sensor networks: a signal processing perspective. In Communications and Vehicular Technology in the Benelux (SCVT), 2011 18th IEEE Symposium on, pages 1–6. IEEE.

D. Budnikov, I. Chikalov, S. Egorychev, I. Kozintsev, and R. Lienhart. 2004. Providing common i/o clock for wireless distributed platforms. In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, volume 3, pages iii–909–12 vol.3, May.

Gerald Coley, 2014. BeagleBone Black System Reference Manual - Revision B, January.

Pierre Comon. 1994. Independent component analysis, a new concept? Signal processing, 36(3):287–314.

Jeremy Elson, Lewis Girod, and Deborah Estrin. 2002. Fine-grained network time synchronization using reference broadcasts. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, OSDI '02, pages 147–163, New York, NY, USA. ACM.

Emmanuel Gallo, Nicolas Tsingos, and Guillaume Lemaitre. 2007. 3d-audio matting, post-editing and re-rendering from field recordings. EURASIP: Journal on Advances in Signal Processing. Special issue on Spatial Sound and Virtual Acoustics.

Christian Hoene, Jean-Marc Valin, Koen Vos, and Jan Skoglund. 2011. Summary of opus listening test results. Technical report, IETF.

ISO/IEC WD 23008-3. 2014. Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio. Technical report, International Organization for Standardization / International Electrotechnical Commission, International Telecommunications Union - Telecommunication, January.

Daniel J Mennill, Matthew Battiston, David R Wilson, Jennifer R Foote, and Stephanie M Doucet. 2012. Field test of an affordable, portable, wireless microphone array for spatial monitoring of animal ecology and behaviour. Methods in Ecology and Evolution, 3(4):704–712.

Alexey Ozerov, Emmanuel Vincent, and Frédéric Bimbot. 2012. A General Flexible Framework for the Handling of Prior Information in Audio Source Separation. IEEE Transactions on Audio, Speech and Language Processing, 20(4):1118–1133, May.

Congduc Pham, Philippe Cousin, and Arnaud Carer. 2014. Real-time on-demand multi-hop audio streaming with low-resource sensor motes. In Local Computer Networks Workshops (LCN Workshops), 2014 IEEE 39th Conference on, pages 539–543. IEEE.

Yann Salaün, Emmanuel Vincent, Nancy Bertin, Nathan Souviraà-Labastie, Xabier Jaureguiberry, Dung T. Tran, and Frédéric Bimbot. 2014. The Flexible Audio Source Separation Toolbox Version 2.0. May.

Christian Schörkhuber, Markus Zaunschirm, and IOhannes Zmölnig. 2014. WiLMA - wireless large-scale microphone array. In Linux Audio Conference, volume 2014.

Z. Cihan Taysi, M. Amac Guvensan, and Tommaso Melodia. 2010. Tinyears: Spying on house appliances with audio sensor nodes. In Proceedings of the second ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, BuildSys '10, pages 31–36, New York, NY, USA. Association for Computing Machinery.

Jean-Marc Valin, Gregory Maxwell, Timothy B Terriberry, and Koen Vos. 2013. High-quality, low-delay music coding in the opus codec. In Audio Engineering Society (AES) Convention no. 135. Audio Engineering Society.