DEVELOPMENT OF THE UC AUDITORY-VISUAL MATRIX SENTENCE TEST

A thesis submitted in partial fulfilment of the requirements for the Degree of Master of Audiology at the University of Canterbury

by Ronald Harris Trounson

University of Canterbury 2012

Acknowledgements

The author wishes to acknowledge his primary supervisor, Dr Greg O'Beirne, for his unwavering support, enthusiasm and vision for the development of an auditory-visual matrix sentence test, something which we believe is the first of its kind in the world. As with any groundbreaking research, it would not be possible without the assistance of a truly dedicated team. The author wishes to acknowledge the help of his secondary supervisor, Dr Margaret Maclagan, and linguist Ruth Hope for their expertise and guidance on the subtleties of New Zealand English. The author wishes to acknowledge Professor Nancy Tye-Murray for her inspirational research into auditory-visual enhancement. The author wishes to acknowledge Emma Parnell of the New Zealand Institute of Language, Brain and Behaviour, along with Rob Stowell from the University of Canterbury (UC) College of Education and John Chrisstoffels from the UC School of Fine Arts, for their video equipment and cinematography expertise. The author wishes to acknowledge the speaker, actress Emma Johnston, for her incredible patience and self-control throughout the recording sessions. Last but not least, the author wishes to acknowledge Dr Emily Lin from the UC Department of Communication Disorders for her expertise and guidance on statistical methods.

Abstract

Matrix Sentence Tests consist of syntactically fixed but semantically unpredictable sentences, each composed of 5 words (name, verb, quantity, adjective, object). Test sentences are generated by choosing 1 of 10 alternatives for each word to form sentences such as "Amy has nine green shoes". Up to 100,000 unique sentences are possible. Rather than recording these sentences individually, the sentences are synthesized from 400 recorded audio fragments that preserve coarticulations and provide a natural prosody for the synthesized sentence. Originally developed by Hagerman for the Swedish language in 1982, Matrix Sentence Tests are now available in German, Danish, British English, Polish and Spanish. The Matrix Sentence Test has become the standard speech audiometry measure in much of Europe, and was selected by the HearCom consortium as a means of standardising speech audiometry across the different European regions and languages. Existing Matrix Sentence Tests function in auditory-only mode. We describe the development of a New Zealand English Matrix Sentence Test in which we have made the important step of adding an auditory-visual mode, using recorded fragments of video rather than simply audio. The addition of video stimuli not only increases the face-validity of the test, but allows different presentation modes to be compared, thereby allowing the contribution of visual cues to be assessed.

Table of Contents Acknowledgements ................................................................................................ iii Abstract .................................................................................................................... v List of Figures ...................................................................................................... viii List of Tables ...........................................................................................................
x Abbreviations ..........................................................................................................xi 1 Introduction ..................................................................................................... 13 1.1 Speech Audiometry in New Zealand ..................................................... 13 1.2 Disadvantages of Fixed-Level Monosyllabic Words in Quiet............... 14 1.3 Speech-in-Noise ..................................................................................... 14 1.4 Masking Noise ....................................................................................... 15 1.5 Adaptive Testing Procedures ................................................................. 16 1.6 Sentence Tests........................................................................................ 17 1.7 Advantages of Matrix Sentence Tests ................................................... 19 1.8 New Zealand English ............................................................................. 20 1.9 Auditory-visual Enhancement ............................................................... 20 1.10 Statement of the Problem ....................................................................... 23 2 3 vi Methodology ................................................................................................... 24 2.1 Composition of Base Matrix .................................................................. 24 2.2 Sentence Generation .............................................................................. 27 2.3 Sentence Recording ............................................................................... 28 2.4 Vowel Recording and Accent Analysis ................................................. 31 2.5 Sentence Segmentation .......................................................................... 34 2.6 User Interface Development .................................................................. 48 2.7 Video Transition Analysis ..................................................................... 53 Discussion ....................................................................................................... 56 3.1 Control of Head Position ....................................................................... 56 3.2 Video Recording Procedures ................................................................. 60 3.3 Video Editing Procedures ...................................................................... 63 3.4 Video Transition Analysis ..................................................................... 66 3.5 Clinical Applications ............................................................................. 68 4 3.6 Research Applications ........................................................................... 68 3.7 Limitations ............................................................................................. 69 3.8 Future Development .............................................................................. 70 Conclusions ..................................................................................................... 80 Appendix 1 – New Zealand Matrix Sentence Recording List ............................... 81 Appendix 2 – Phonemic Distribution Analysis ..................................................... 82 Appendix 3 – Vowel Formant Frequency Analysis............................................... 84 Appendix 4 – Audio-visual Segmentation Points .................................................. 
91 Appendix 5 – FFmpeg Command Syntax............................................................ 102 Appendix 6 – MS-DOS Command Syntax .......................................................... 104 References ............................................................................................................ 105 vii List of Figures Figure 1 - Schematised framework of auditory-visual speech recognition ........... 21 Figure 2 - Phonemic distribution of UC Auditory-visual matrix vs NZHINT ...... 26 Figure 3 - Matrix sentence generation pattern ....................................................... 27 Figure 4 - UC Auditory-visual Matrix Sentence Test recording set-up ................ 28 Figure 5 - UC Auditory-visual Matrix Sentence Test autocue set-up .................. 29 Figure 6 - Formant frequency analysis of the word "Had" .................................... 32 Figure 7 - Speaker’s vowel formant frequencies vs normative NZ data ............... 32 Figure 8 - Auditory-visual sentence segmentation process ................................... 34 Figure 9 - Auditory-visual segmentation between sentences ................................ 35 Figure 10 - Auditory-visual segmentation at start of sentence .............................. 35 Figure 11 - Auditory-visual sentence segmentation rules ..................................... 36 Figure 12 - Auditory-visual segmentation of waveforms starting at zero amplitude ............................................................................................................................... 37 Figure 13 - Auditory-visual segmentation of waveforms ending with "s" ............ 38 Figure 14 - Auditory-visual segmentation of waveforms beginning with "s" ...... 39 Figure 15 - Auditory-visual segmentation of waveforms containing zero amplitude ............................................................................................................... 40 Figure 16 - Auditory-visual segmentation of waveforms ending at zero amplitude ............................................................................................................................... 41 Figure 17 - Auditory-visual segmentation of waveforms containing consistent amplitude ............................................................................................................... 42 Figure 18 - Post recording adjustment of video output ......................................... 44 Figure 19 - UC Auditory-visual Matrix Sentence Test encoding process ............. 45 Figure 20 - UC Auditory-visual Matrix Sentence Test user interface ................... 48 Figure 21 - UC Auditory-visual Matrix Sentence Test mixing software flow chart ............................................................................................................................... 49 Figure 22 - UC Auditory-visual Matrix Sentence Test playback software flow chart ....................................................................................................................... 51 Figure 23 - Audio-visual word pair video transitions ............................................ 53 Figure 24 - Normalised smoothness ranking of video transitions ......................... 54 viii Figure 25 - Percentage of word pairs excluded vs number of available matrix sentences ................................................................................................................ 55 Figure 26 - UC Auditory-visual Matrix Sentence Test head support system ........ 
57 Figure 27 - UC Auditory-visual Matrix Sentence Test recording setup errors...... 61 Figure 28 - Alpha channel mask ............................................................................ 63 Figure 29 - Head position stabilisation algorithm ................................................. 64 Figure 30 - Head alignment clamp (Swosho, 2012) .............................................. 71 Figure 31 - Halo head brace (Bremer Medical Incorp, Jacksonville, FL, USA) ... 72 Figure 32 - University of Canterbury Adaptive Speech Test user interface .......... 74 Figure 33 - Word pair vs sound file contents......................................................... 75 Figure 34 - Scoring of "Amy bought two big bikes" ............................................. 76 Figure 35 - Scoring of "William wins those small toys" ....................................... 77 Figure 36 - Word specific intelligibility function .................................................. 78 Figure 37 - Participants' word intelligibility vs sentence intelligibility (hypothetical example) .......................................................................................... 79 ix List of Tables Table 1 - British English word matrix ................................................................... 19 Table 2 - New Zealand English word matrix ......................................................... 24 Table 3 - Vowel notation ....................................................................................... 31 Table 4 - 720p50 audio and video format settings................................................. 34 Table 5 - UC Auditory-visual Matrix Sentence Test video output settings........... 46 Table 6 - UC Auditory-visual Matrix Sentence Test audio output settings........... 47 x Abbreviations AAE Adobe After Effects AM Amplitude Modulation AME Adobe Media Encoder APP Adobe Premiere Pro avi Audio video interleave file format BKB-SIN Bamford-Kowal-Bench Speech-in-Noise Test CST Connected Sentence Test CVC Consonant-Vowel-Consonant dB Decibel F1 First formant F2 Second formant fps Frames per second GB Gigabyte HD High Definition HINT Hearing-in-Noise Test Hz Hertz IAC Industrial Acoustics Company Ltd jpeg Joint Photographic Experts Group image file format NZ New Zealand NZDTT New Zealand Digit Triplet Test NZHINT New Zealand Hearing-in-Noise Test m4a mpeg4 audio format mp4 mpeg4 video format mpeg4 Moving Picture Experts Group, standard 4 file format mpg Moving Picture Experts Group, standard 1 file format MB Megabyte MS-DOS Microsoft-Disk Operating System PAL Phase Alternating Line PC Personal Computer PCM Pulse-code Modulation xi PI Performance-Intensity QuickSIN Quick Speech-in-Noise RAM Random Access Memory RGB Red, Green, Blue SNR Signal-to-Noise Ratio SRT Speech Reception Threshold UC University of Canterbury UCAST University of Canterbury Adaptive Speech Test USB Universal Serial Bus VI Virtual Instrument wav Waveform audio file format WIN Words-in-Noise xii 1 Introduction Hearing and understanding speech have unique importance in our lives. For children, the ability to hear and understand speech is fundamental to the development of oral language. For adults, difficulty in detecting and understanding speech limits the ability to participate in the communication interactions that are the foundation of numerous activities of daily living. 
In 1951, the father of audiology, Raymond Carhart, defined speech audiometry as "a technique wherein standardized samples of a language are presented through a calibrated system to measure some aspect of hearing ability" (Carhart, 1951). Today, speech audiometry is an integral part of the audiological test battery. It is a key measure of overall auditory perception skills, providing an indication of an individual's ability to identify and discriminate phonetic segments, words, sentences and connected discourse (Mendel, 2008). Scores on speech tests are often used as a crosscheck of the validity of pure-tone thresholds (McArdle & Hnath-Chisolm, 2009).

1.1 Speech Audiometry in New Zealand

The materials used in speech audiometry in New Zealand are generally monosyllabic word lists presented in quiet, such as the Meaningful CVC (Consonant-Vowel-Consonant) Words (Boothroyd & Nittrouer, 1988). Items are presented in lists, often after a carrier phrase, such as "say (the word) ____". Words are presented in isolation, without context, so that patients must repeat what they hear without relying on contextual clues. The aim is to attempt to isolate the problem of audibility from other confounding factors such as working memory and use of context (Wilson, McArdle, & Smith, 2007). Performance is scored by word or by phoneme repeated correctly to arrive at a percentage correct score. A number of word lists are presented at two or more different intensity levels in order to describe a performance-intensity (PI) function, from which the speech reception threshold (SRT), or 50% correct point, can be estimated. The conditions under which speech audiometry is performed in the clinic are optimal compared to those encountered in the real world. Speech materials are presented through headphones, in a soundproof room, with maximum concentration from the patient and minimum external distraction.

1.2 Disadvantages of Fixed-Level Monosyllabic Words in Quiet

Speech recognition testing in quiet does not address the main problem experienced by the majority of hearing impaired listeners, which is difficulty understanding speech in noise. Listeners with identical word recognition abilities in quiet can have significantly different word recognition abilities in background noise (Beattie, Barr, & Roup, 1997). The assessment of receptive communication abilities ideally should involve speech materials and listening conditions that are likely to be encountered in the real world. Speech tests in quiet of the kind that are currently used in audiological practice in New Zealand fall into the category of non-adaptive tests. These methods are susceptible to floor and ceiling effects (where a number of participants obtain scores of, or close to, 0% or 100%). Once scores close to 100% are attained, no further improvement can be recognised, as the testing materials are not of sufficient difficulty to challenge the patient's abilities. These effects can distort results and make it difficult to reveal significant differences in speech recognition ability (Gifford, Shallop, & Peterson, 2008). There is less redundant information in single monosyllabic words than there is in sentences, which yield multiple contextual clues involving syntax and semantics. Single word recognition tests are not representative of spoken language, and the validity of these word lists for predicting the social adequacy of one's hearing has been questioned (Orchik, Krygier, & Cutts, 1979; Beattie, 1989).
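As a concrete illustration of how the SRT is read off the performance-intensity function described in section 1.1, the following sketch fits a logistic function to percent-correct scores measured at several fixed presentation levels and reports the 50% correct point. It is a minimal sketch only: the levels, scores and parameter names are hypothetical and are not taken from any of the tests discussed in this thesis.

```python
# Minimal sketch: estimating the SRT (50% correct point) from a
# performance-intensity (PI) function. All data values are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def logistic(level_db, srt_db, slope):
    """Logistic psychometric function returning % correct as a function of level."""
    return 100.0 / (1.0 + np.exp(-slope * (level_db - srt_db)))

# Hypothetical word-recognition scores (% correct) at fixed presentation levels
levels_db = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
percent_correct = np.array([8.0, 30.0, 55.0, 82.0, 96.0])

# Fit the PI function; the SRT is the level at which the fitted curve crosses 50%
(srt_db, slope), _ = curve_fit(logistic, levels_db, percent_correct, p0=[40.0, 0.1])
print(f"Estimated SRT = {srt_db:.1f} dB (slope parameter = {slope:.2f})")
```

With the method of constant stimuli, scores near the floor or ceiling contribute little to this fit, which is one reason adaptive procedures (section 1.5) concentrate trials near the 50% point instead.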
1.3 Speech-in-Noise

As there is no correlation between self-reported measures of difficulty understanding speech-in-noise and objective measurements of this ability (Rowland, Dirks, Dubno, & Bell, 1985), efficient, reliable objective tests should be part of the audiological test battery. Speech-in-noise tests have long been recognised as an important addition to the audiological test battery, although they are only just starting to be introduced clinically (Carhart & Tillman, 1970; Dirks, Morgan, & Dubno, 1982; Strom, 2006). Speech-in-noise testing enables the clinician to test hearing impaired listeners in the kind of 'real-world' situations in which they report having the greatest difficulty. It can have benefits for hearing aid selection and counselling, giving a more realistic assessment of the likely benefit the patient will receive from hearing aids (Beattie et al., 1997). Some of the most common speech-in-noise tests are the Connected Sentence Test (CST; Cox, Alexander, & Gilmore, 1987), the Hearing in Noise Test (HINT; Nilsson, Soli, & Sullivan, 1994), the Quick Speech-in-Noise Test (QuickSIN; Killion, Niquette, Gudmundsen, Revit, & Banerjee, 2004), the Bamford-Kowal-Bench Speech-in-Noise Test (BKB-SIN; Niquette et al., 2003; Etymotic Research, 2005), the Words-in-Noise test (WIN; Wilson, 2003; Wilson & Burks, 2005) and the digit triplet test (Smits, Kapteyn, & Houtgast, 2004; Ozimek, Warzybok, & Kutzner, 2010; Zokoll, Wagener, Brand, Buschermöhle, & Kollmeier, 2012). The CST, HINT, QuickSIN and BKB-SIN use sentence level materials as the target stimuli; the WIN uses monosyllabic words, and the digit triplet test uses a sequence of 3 digits.

1.4 Masking Noise

The speech-in-noise tests listed above use multi-talker babble as the masking noise, with the exception of the HINT, which uses speech-spectrum noise. Wilson and colleagues (2007) compared the effectiveness of the HINT, QuickSIN, BKB-SIN, and WIN tests in differentiating between speech recognition performance by listeners with normal hearing and performance by listeners with hearing loss. The separation between groups was least with the BKB-SIN and HINT (4–6 dB) and most with the QuickSIN and WIN (8–10 dB). While differences in semantic context contribute to the performance on each test, background masking noise also has an effect. Speech-spectrum noise waveforms like those used in the HINT exhibit little amplitude modulation (AM), whereas, depending on the number of talkers, multi-talker babble usually has a larger AM characteristic. The importance of an AM characteristic is that during the low point in the waveform fluctuation the signal-to-noise ratio (SNR) is increased, thereby offering the listener a glimpse of a portion of the target speech signal (Miller & Licklider, 1950; Dirks & Bower, 1970; Howard-Jones & Rosen, 1993). Hearing impaired listeners have greater difficulty than normal hearing listeners taking advantage of the momentary improvement in SNR due to their poorer temporal resolution abilities (Stuart & Phillips, 1996, 1998). One class of masking noise that may be particularly useful in the assessment of speech-in-noise abilities is interrupted noise (Miller, 1947; Miller & Licklider, 1950; Pollack, 1954, 1955; Carhart, Tillman, & Johnson, 1966; Wilson & Carhart, 1969). Interrupted noise is usually a continuous noise that has been multiplied by a square wave, producing alternating intervals of noise and silence.
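To make the construction of interrupted noise concrete, the sketch below gates continuous noise with a square wave, producing the alternating intervals of noise and silence just described. The sample rate, interruption rate and duty cycle are arbitrary illustrative choices, not values taken from any of the cited studies.

```python
# Minimal sketch: interrupted noise = continuous noise multiplied by a square wave.
# Sample rate, interruption rate and duty cycle are illustrative values only.
import numpy as np

fs = 44100               # samples per second
duration_s = 2.0         # length of the noise token in seconds
interrupt_rate_hz = 10   # number of on/off cycles per second
duty_cycle = 0.5         # proportion of each cycle during which the noise is on

t = np.arange(int(fs * duration_s)) / fs
noise = np.random.normal(0.0, 0.3, size=t.shape)    # continuous Gaussian noise

# Square-wave gate: 1 during the "on" part of each cycle, 0 during the silent part
gate = ((t * interrupt_rate_hz) % 1.0 < duty_cycle).astype(float)

interrupted_noise = noise * gate    # alternating intervals of noise and silence
```

During the silent intervals the momentary SNR is effectively unlimited, which is the "glimpsing" opportunity that listeners with normal hearing exploit more successfully than listeners with hearing loss.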
Wilson and Carhart (1969) found that spondaic word thresholds for listeners with normal hearing were 28 dB lower in an interrupted noise than in a continuous noise, whereas listeners with hearing loss experienced only an 11 dB difference. This is a wider separation of recognition performance than is provided by either the QuickSIN or the WIN. The use of amplitude modulated, interrupted noise as the masker may provide a more sensitive measure of hearing impairment than the multi-talker babble currently used in clinically available speech-in-noise tests.

1.5 Adaptive Testing Procedures

Speech tests in quiet of the kind that are currently used in audiological practice in New Zealand fall into the category of non-adaptive tests. A non-adaptive testing procedure, where the distribution of trials is pre-determined at different fixed intensities, is called the method of constant stimuli. The BKB-SIN, QuickSIN and WIN tests use a modified method of constant stimuli in a descending presentation level paradigm, which is a pseudo-adaptive procedure involving the presentation of a set of target stimuli at a fixed SNR followed by further sets of target stimuli at decreasing levels. The number of target stimuli and decibel step sizes can be varied, but all are administered in a systematic fashion. The Spearman-Kärber equation (Finney, 1952) is used to calculate the SRT at the 50% correct point. The HINT uses a truly adaptive procedure (Levitt, 1971) whereby the stimulus level on any one trial is determined by the response to the preceding stimulus. Threshold is defined as the stimulus intensity at which the listener can identify the stimulus correctly for 50% of trials. By measuring the SRT directly, rather than eliciting percentage correct scores, floor and ceiling effects are avoided. Furthermore, by homing in more quickly and efficiently on the region of interest where the individual's threshold is likely to fall, adaptive tests can be more effective than tests that use the method of constant stimuli (Levitt, 1978) while still preserving accuracy and reliability (Buss, Hall, Grose, & Dev, 2001; Leek, 2001). This has important ramifications for clinicians and time management, while making the task less onerous for the patient.

1.6 Sentence Tests

Sentences are far more representative of everyday communication than isolated monosyllabic words or digit triplets since they include natural intensity fluctuations, intonation, contextual cues, and temporal elements that are associated with conversational speech (Nilsson et al., 1994). Conversational speech is highly redundant, as knowledge of the subject in question, and visual cues from lip-reading and body language, can assist the listener in deciphering the signal. From a measurement point of view, the psychometric functions of sentences are steeper than those of words and digits (McArdle, Wilson, & Burks, 2005), making sentences particularly suitable for accurate estimation of the SRT. Two types of sentence tests can be distinguished, namely those based on everyday utterances of unified grammatical structure, and those comprising semantically unpredictable sentences of a fixed grammatical structure. The first type of sentence test was originally proposed by Plomp and Mimpen (1979) and the test materials are called Plomp-type sentences.
Plomp-type sentence tests have been developed for the Dutch (Plomp & Mimpen, 1979; Versfeld, Daalder, Festen, & Houtgast, 2000), German (Kollmeier & Wesselkamp, 1997), American English (Nilsson et al., 1994), Swedish (Hallgren, Larsby, & Arlinger, 2006), French (Luts, Boon, Wable, & Wouters, 2008) and Polish (Ozimek, Kutzner, Sek, & Wicher, 2009) languages. In general, these tests are composed of phonemically and statistically equivalent lists made up of different sentences, where differences in the phonemic distribution and list-specific SRTs across lists are statistically insignificant. The disadvantage of Plomp-type sentences is that the test lists usually cannot be used twice with the same subject within a certain time interval (i.e., shorter than half a year). The meaningful sentences can easily be memorized, or particular words can be guessed from the context, which would affect the SRT result. As the number of test lists is limited, Plomp-type sentence tests are not suitable when many speech intelligibility measurements have to be performed with the same listener, e.g., during hearing instrument fitting or in research.

The second type of sentence test is the so-called matrix sentence test, first proposed by Hagerman (1982) for the Swedish language. These are syntactically fixed, but semantically unpredictable sentences, each consisting of 5 words (name, verb, quantity, adjective, object). There is a base list consisting of 10 sentences with 5 words each. The test sentences are generated by choosing one of the 10 alternatives for each word group in a pseudo-random way that uses each word of the base list exactly once (e.g. Lucy sold twelve cheap shoes). Wagener improved on the idea of Hagerman by taking co-articulation into consideration in order to provide a natural prosody of the synthesized sentences for the German (Wagener, Kühnel, & Kollmeier, 1999a; Wagener, Brand, & Kollmeier, 1999b, 1999c) and Danish (Wagener, Josvassen, & Ardenkjaer, 2003) versions of the matrix sentence test. More recently, British English (Hall, 2006; Hewitt, 2007), Polish (Ozimek et al., 2010) and Spanish (Hochmuth et al., 2012) versions have been developed. The matrix sentence test has become the standard speech audiometry measure in much of Europe, and was selected by the HearCom (www.hearcom.eu) consortium as a means of standardising speech audiometry across the different European regions and languages.

1.7 Advantages of Matrix Sentence Tests

Matrix sentence tests have advantages over currently available speech-in-noise tests. A 5 x 10 matrix yields 10⁵, or 100,000, different sentence combinations, resulting in a practically unlimited amount of speech material in comparison to Plomp-type sentences. Matrix sentences are useful for hearing aid evaluation and other applications where repeated testing is required. They are also suitable for severely hearing impaired and cochlear implant users because they are spoken relatively slowly and consist of only 50 well known words. The limited vocabulary also makes matrix sentences suitable for testing children. Unlike Plomp-type sentences, the fixed format of the matrix sentences has the advantage of being very similar across different languages. Measurement and scoring procedures are more uniform, making cross-country comparisons of performance much easier. However, differences do exist between speakers and languages. The British English matrix sentence test (Table 1) would not be appropriate for New Zealand speakers in its original form.
Table 1 - British English word matrix

1.8 New Zealand English

New Zealand English differs from British English in a number of ways, most noticeably in the vowel system. New Zealand English vowels have a very different formant structure and place in the vowel space (Maclagan & Hay, 2007). The front vowels in FLEECE, DRESS and TRAP within the lexical sets introduced by Wells (1982) have raised and fronted in New Zealand English, causing them to be pronounced much higher in the mouth, similar to Australian and South African English. DRESS is sometimes pronounced so high in the vowel space by some speakers that it can overlap with FLEECE. There is further neutralisation of front vowels before /l/, such as in the word pairs celery/salary, and doll, dole and dull. The vowel in KIT has centralised and lowered even further than when Wells (1982) described it. NEAR and SQUARE are completely merged for many speakers, so that cheer and chair, beer and bare are pronounced identically. The GOOSE vowel is also very central, even fronted in some cases, except before /l/. Given that speech perception materials will be presented to hearing impaired individuals under challenging listening conditions, these differences in phonology could have an impact on the performance of New Zealand English speakers on the British English matrix sentence test. For example, "desks" could be confused with "disks" by a New Zealand listener, and hence some substitutions of the words in the British English matrix (Table 1) would be required.

1.9 Auditory-visual Enhancement

A further criticism of speech audiometry in New Zealand is that recorded test material is presented in the auditory-alone condition, which fails to account for the influence of visual cues on speech intelligibility. Evidence suggests that speech intelligibility improves when listeners can both see and hear a talker, compared with listening alone (Sumby & Pollack, 1954; Grant, Walden, & Seitz, 1998). Watching the face of a talker while listening in the presence of background noise can yield an effective improvement of up to 15 dB in the signal-to-noise ratio relative to auditory-alone (Sumby & Pollack, 1954). Often the advantage of supplementing listening with watching is more than additive (Sommers, Tye-Murray, & Spehar, 2005). One reason for this superadditive effect is the complementary nature of the auditory and visual speech signals (Grant et al., 1998). For example, cues about nasality and voicing are typically conveyed very well by the auditory signal, even in adverse listening situations, whereas the visual signal does not convey them at all. On the other hand, cues about place of articulation are conveyed by the visual signal but not very well by a degraded auditory signal, as when listening with a hearing loss or listening in the presence of background noise (Tye-Murray, Sommers, & Spehar, 2007).

Figure 1 - Schematised framework of auditory-visual speech recognition

Grant et al. (1998) proposed a conceptual framework (Figure 1) for understanding the improved performance for auditory-visual presentations, in which both peripheral and central mechanisms contribute to an individual's ability to benefit from combining auditory and visual speech information. In the initial step of the model, peripheral sensory systems (audition and vision) are responsible for extracting signal-related segmental and suprasegmental phonetic cues independently from the auditory and visual speech signals.
These cues are then integrated and serve as input to more central mechanisms that incorporate semantic and syntactic information to arrive at phonetic and lexical decisions. Grant et al. (1998) reported great variability among adults with hearing impairment in their ability to integrate auditory and visual information during the process of speech perception. Tye-Murray et al. (2008) found that older adults are less able to use visual cues than younger people. Grant et al. (1998) suggested that poor audio-visual perception of speech may result from difficulties in different areas such as auditory perception, visual perception, integration ability of the sensory information, ability to use contextual and language information, and working memory. Speech audiometry that presents monosyllabic words in quiet in the auditory-alone modality fails to account for many of these factors. The presentation of sentences in noise in the auditory-visual modality may provide a better measure of real world speech perception.

1.10 Statement of the Problem

Speech audiometry in New Zealand generally utilises monosyllabic words presented in quiet. Speech recognition testing in quiet does not address the main problem experienced by the majority of hearing impaired listeners, which is difficulty understanding speech-in-noise. Furthermore, the method of constant stimuli currently used to measure SRTs is susceptible to floor and ceiling effects. Matrix sentence tests have a number of advantages over currently available New Zealand speech tests. The sentence material provides a better representation of everyday spoken language than single words. The potential for 100,000 different sentence combinations gives a practically unlimited set of sentence material, which is useful for repeated testing applications such as hearing aid evaluations. The relatively slow speaking rate and simple vocabulary make matrix sentences suitable for testing the severely hearing impaired and children. The matrix sentence test utilises masking noise to provide a better assessment of listening abilities in background noise. The use of amplitude modulated, interrupted noise may provide better separation between normal hearing and hearing impaired listeners than the multi-talker babble and speech-spectrum noise currently used in clinically available speech-in-noise tests. Matrix sentence tests are compatible with adaptive SRT-seeking procedures, which avoid floor and ceiling effects, and are more efficient than the method of constant stimuli. A British English version of the matrix sentence test has already been developed. However, differences in phonology between British and New Zealand English may compromise the validity and reliability of the test when used for New Zealand English speakers. Thus, a New Zealand English matrix sentence test needs to be developed. While the method for producing matrix sentence tests in an auditory-alone modality is well documented, the author is unaware of any versions available in an auditory-visual format. A new procedure will therefore be developed to allow the New Zealand English matrix sentence test to operate in auditory-alone, visual-alone or auditory-visual modes. The addition of visual cues to speech audiometry testing may provide a more accurate evaluation of the difficulties experienced by hearing impaired listeners in the real world.
2 Methodology

2.1 Composition of Base Matrix

The British English word matrix (Table 1) was used as the basis for development of the New Zealand English word matrix (Table 2).

Table 2 - New Zealand English word matrix

The base list consists of 10 different five-word sentences with the same syntactical structure (name, verb, quantity, adjective, object), which is consistent with the format used by other language versions of the matrix sentence test. The composition of the word matrix aims to achieve a balanced number of syllables within word groups, semantic neutrality and grammatical correctness, and to match the language-specific phoneme distribution (Hochmuth et al., 2012). The boxes in Table 2 highlight the words that differ from the British matrix. Substitution of some of the words in the British matrix was necessary in order to remove potential vowel confusions during open-set testing for speakers of New Zealand English. Substitution of other words was necessary in order to best match the phonemic distribution of New Zealand English.

The substitutions between the British and New Zealand matrices were:
"Amy" replaces "Alan" for gender and phonemic balance
"David" replaces "Barry" for phonemic balance
"Oscar" replaces "Lucy" for gender and phonemic balance
"Sophie" replaces "Steven" for gender and phonemic balance
"William" replaces "Nina" for gender and phonemic balance
"those" replaces "five", which has the same vowel as "nine"
"good" replaces "pink", which may be confused with "punk"
"new" replaces "thin" for phonemic balance
"bikes" replaces "beds", which may be confused with "bids"
"books" replaces "chairs", which may be confused with "cheers"
"coats" replaces "desks", which may be confused with "disks"
"hats" replaces "rings", which may be confused with "rungs"
"skirts" replaces "tins", which may be confused with "tens"

While there is no "gold standard" for the distribution of phonemes in New Zealand English, the phonemic distribution of the New Zealand Hearing in Noise Test (NZHINT; Hope, 2010) provided a basis for comparison. The NZHINT is based on five hundred sentences of 5-7 syllables collected from New Zealand children's books and recorded by a native New Zealand English speaker. The phonemic distribution of the UC Auditory-visual matrix (New Zealand English) was compared against the NZHINT (Figure 2).

Figure 2 - Phonemic distribution of UC Auditory-visual matrix vs NZHINT

There was a significant positive relationship between the phonemic distributions [r = 0.6, n = 42, p < 0.001]. In comparison to the NZHINT, the UC Auditory-visual matrix has an underrepresentation of "dth" phonemes as it does not contain any articles such as "the". The UC Auditory-visual matrix has an overrepresentation of "s" phonemes as all nouns are plural.

2.2 Sentence Generation

A list of 100 sentences (Appendix 1) was selected from the matrix such that each of the 400 unique word pair combinations (e.g. David-bought, bought-three, three-big, big-books) was included in the sentence recording list. The sentences in the recording list were created by following a pattern (Figure 3) of pairing the name, quantity and object in each row with the verbs and adjectives in the adjacent columns.

Figure 3 - Matrix sentence generation pattern

In this way, all sentences beginning with the name "Amy" also contained the quantity "two" and the object "bikes". All sentences beginning with the name "David" also contained the quantity "three" and the object "books".
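A minimal sketch of this pairing pattern is given below, assuming (consistent with the example sentences quoted in the text) that each name keeps a fixed quantity and object while the verb and adjective columns step through their ten alternatives together. Placeholder word lists are used rather than the actual Table 2 entries, so the output illustrates the structure of the recording list rather than its exact wording.

```python
# Minimal sketch of the recording-list pattern described above (an assumption
# based on the example sentences in the text, not the exact Figure 3 layout).
# Placeholder word lists stand in for the real words from Table 2.
names      = [f"name{i}"      for i in range(10)]   # e.g. Amy, David, Hannah, ...
verbs      = [f"verb{i}"      for i in range(10)]   # e.g. bought, gives, ...
quantities = [f"quantity{i}"  for i in range(10)]   # e.g. two, three, four, ...
adjectives = [f"adjective{i}" for i in range(10)]   # e.g. big, cheap, ...
objects    = [f"object{i}"    for i in range(10)]   # e.g. bikes, books, coats, ...

recording_list = []
for k in range(10):          # verb/adjective column
    for i in range(10):      # name/quantity/object row
        recording_list.append(
            (names[i], verbs[k], quantities[i], adjectives[k], objects[i]))

# Check: the 100 sentences cover each of the 400 adjacent word pairs exactly once
pairs = set()
for name, verb, qty, adj, obj in recording_list:
    pairs.update([(name, verb), (verb, qty), (qty, adj), (adj, obj)])
assert len(recording_list) == 100 and len(pairs) == 400
```

Because only adjacent word pairs (name-verb, verb-quantity, quantity-adjective, adjective-object) need to be recorded, 100 sentences are sufficient to capture all 400 fragments.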
Thus, the first three sentences in the recording list were: "Amy bought two big bikes", "David bought three big books" and "Hannah bought four big coats".

2.3 Sentence Recording

A single walled IAC soundproof booth (Industrial Acoustics Company Ltd) was used as a recording studio. The layout of the recording equipment inside the booth is shown in Figure 4.

Figure 4 - UC Auditory-visual Matrix Sentence Test recording set-up (showing the microphone, green screen, speaker, autocue and video camera)

An autocue was constructed from a cardboard box, a piece of glass and an iPhone (Figure 5).

Figure 5 - UC Auditory-visual Matrix Sentence Test autocue set-up (showing the camera lens, the sentence displayed on glass angled at 45°, and the iPhone at the base)

The autocue was constructed in a simpler manner than many commercially available autocue systems. A software program was written in LabVIEW (version 9.0.1, National Instruments, TX, USA) that presented the sentence text in a timed manner. Video of this text was captured using Camtasia 6 Studio (TechSmith Corp., MI, USA), which was then inverted and rotated using VirtualDub (www.virtualdub.org). The video was downloaded onto an iPhone which sat in the base of the autocue and projected the text upwards onto a piece of glass angled at 45 degrees. The ghosted image of the text was then able to be read off the front of the glass, while the text was invisible to the camera behind the glass due to the angle of reflection. The autocue setup was enshrouded in a cardboard box lined with black plastic to keep the light out and limit reflections from outside the box.

The speaker was a New Zealand born, 32 year old female actress from the School of Fine Arts at the University of Canterbury. The speaker was seated with her back against the wall and her head cradled by a support in order to maintain a stable head position throughout the recording. The head support was covered by a green screen, allowing it to be later edited out of the recording. The speaker read the 100 sentences (Appendix 1) aloud from the autocue. The sentences were delivered every 3.3 seconds, with a 0.9 second gap between each sentence to allow the speaker to return to the mouth shut position. This rate enabled all 100 sentences to be recorded in seven minutes. The recording was captured by a Sony PMW-EX3 video camera and AKG C 568 EB condenser microphone. The video was captured in HD format at a frame rate of 50 fps, pixel resolution of 1280 x 720, and pixel aspect ratio of 1.0 using a progressive scan. The audio was captured in PCM format at a rate of 48,000 Hz. Three recordings of the list of 100 sentences were made in consecutive order, with only a 20–30 second gap between each recording. The recordings were saved in mp4 format, and were later transferred to a laptop via USB cable.

2.4 Vowel Recording and Accent Analysis

A sample of the speaker's vowels was taken by recording the speaker voicing the 11 H_D words listed in Table 3. Three sets of these H_D words were recorded on an iPhone and stored in m4a (mpeg4 audio) format.

Lexical Set (Wells, 1982)   H_D Frame   Phoneme (Mitchell, 1946)
Trap                        Had         æ
Start                       Hard        a
Dress                       Head        e
Nurse                       Heard       ɜ
Fleece                      Heed        i
Kit                         Hid         ɪ
Thought                     Hoard       ɔ
Lot                         Hod         ɒ
Foot                        Hood        ʊ
Strut                       Hud         ʌ
Goose                       Who'd       u
Table 3 - Vowel notation

The first formant (F1) and second formant (F2) frequencies of the speaker's vowels were analysed using Praat (5.1.32) software. The formant frequencies were averaged over the three recordings (Figure 6).
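The averaging step is simple arithmetic; the sketch below reproduces it for the three "Had" tokens whose F1 and F2 values are reported in Figure 6. The formant tracking itself was performed in Praat, not in this code, and only the "Had" values are shown here.

```python
# Minimal sketch: averaging F1 and F2 across the three recordings of one H_D
# word. The values are the "Had" measurements reported in Figure 6; the
# formant measurement itself was done in Praat.
import numpy as np

f1_had_hz = np.array([622.0, 596.0, 587.0])     # F1 of "Had", recordings 1-3
f2_had_hz = np.array([2227.0, 2202.0, 2290.0])  # F2 of "Had", recordings 1-3

print(f"Had: mean F1 = {f1_had_hz.mean():.0f} Hz, mean F2 = {f2_had_hz.mean():.0f} Hz")
# -> mean F1 ≈ 602 Hz, mean F2 ≈ 2240 Hz, matching the averages in Figure 6
```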
Figure 6 - Formant frequency analysis of the word "Had" (three recordings: F1 = 622, 596 and 587 Hz; F2 = 2227, 2202 and 2290 Hz; averages F1 = 602 Hz, F2 = 2240 Hz)

The formant frequencies of the speaker's vowels were compared against normative data from Maclagan and Hay (2007) for typical speakers of New Zealand English of her approximate age (Figure 7). A strong positive correlation was found for F1 [r = 0.965, n = 11, p < 0.001] and F2 [r = 0.969, n = 11, p < 0.001].

Figure 7 - Speaker's vowel formant frequencies vs normative NZ data

The vowels of New Zealand English are continuously evolving, and there is a wide variety of pronunciations found geographically across the country, as well as between generations. The New Zealand accent has become more and more Kiwi1 with every subsequent generation. UC Associate Professor Margaret Maclagan, an expert New Zealand linguist, judged the speaker's voice to be a typical New Zealand English accent for someone her age. The subjective judgement by an expert listener, in combination with the objective analysis of formant frequencies, confirmed that the speaker had a typical New Zealand accent.

1 Kiwi is the colloquial name used to describe someone from New Zealand. The name derives from the kiwi, a flightless bird, which is native to, and the national symbol of, New Zealand.

2.5 Sentence Segmentation

The overall sentence segmentation process is shown below in Figure 8.

Figure 8 - Auditory-visual sentence segmentation process (raw footage → edit 100 sentences → edit 400 word pairs → adjust video output → encode)

Raw Footage

The raw mp4 video file was transferred from the video camera to a Compaq Presario C700 Notebook PC with dual 1.46 GHz Intel Pentium processors, 2 GB RAM, running a 32 bit Windows Vista operating system and Adobe Premiere Pro CS4 (V4.2.1) video editing software (abbreviated here as APP). The raw video file was imported into APP in 720p50 format (Table 4).

Audio format: Pulse code modulated (PCM)
Audio sample rate: 48,000 samples/second
Video frame size: 1280 horizontal x 720 vertical
Video frame rate: 50 frames/second
Pixel aspect ratio: 1.0
Colour depth: 32 bit
Fields: No fields (progressive scan)
Table 4 - 720p50 audio and video format settings

Edit 100 Sentences

The raw video footage contained all three sentence recording sessions. The final of the three recordings of sentences was selected for segmentation as this was the most consistent in terms of speech and body position (see 3.1 Control of head position). To divide the recording into 100 separate sentences, a cutting point was made at the midpoint of the silence between each sentence (Figure 9).

Figure 9 - Auditory-visual segmentation between sentences (cutting point at the midpoint of the silence between "Amy bought two big bikes" and "Amy gives two cheap bikes")

The start and end point of the individual sentences was then adjusted. The video was monitored frame by frame to find the point at which the mouth begins to open to form the first word of the sentence (Figure 10). The video was then spooled backwards by 25 frames and a cutting point was made to define the beginning of the sentence.

Figure 10 - Auditory-visual segmentation at start of sentence (cutting point placed 25 frames before the mouth begins to open)

Similarly, the video was monitored to find the frame where the mouth just closes at the end of the sentence. The video was then advanced another 25 frames and a cutting point was made to define the end of the sentence.
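At a frame rate of 50 fps, the 25-frame offset corresponds to 0.5 seconds of lead in and lead out. A minimal sketch of this boundary rule is shown below; the frame numbers used in the example are hypothetical.

```python
# Minimal sketch of the sentence-boundary rule described above: cut 25 frames
# before the frame where the mouth starts to open, and 25 frames after the
# frame where the mouth has just closed. Frame numbers here are hypothetical.
FRAME_RATE_FPS = 50
LEAD_FRAMES = 25                 # 25 frames at 50 fps = 0.5 s lead in / lead out

def sentence_cut_points(mouth_open_frame: int, mouth_closed_frame: int):
    """Return (start_frame, end_frame) cutting points for one sentence."""
    return mouth_open_frame - LEAD_FRAMES, mouth_closed_frame + LEAD_FRAMES

start, end = sentence_cut_points(mouth_open_frame=140, mouth_closed_frame=290)
print(start / FRAME_RATE_FPS, end / FRAME_RATE_FPS)   # cut times in seconds
```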
This gave consistency to the segmentation procedure, such that each sentence contains the same lead in time (0.5 seconds) to the mouth open position, and the same lead out time from the mouth closed position.

Edit 400 Word Pairs

Having cut the first and last words in the sentence according to a fixed lead in / lead out time, the second, third and fourth words in each sentence were cut according to a set of editing rules (Figure 11); the rules were created by trial and error with the objective of finding the smoothest audio and video transition point.

Figure 11 - Auditory-visual sentence segmentation rules (applied in order until a rule matches, with each cut also requiring a consistent facial expression: waveforms start at zero amplitude → cut at the start at zero amplitude; waveforms2 end with "s" → cut before the "s" at zero amplitude; waveforms2 begin with "s" → cut after the "s" at zero amplitude; waveforms contain zero amplitude → cut at the point of zero amplitude; waveforms end at zero amplitude → cut at the end at zero amplitude; waveforms contain consistent amplitude → cut at the point of consistent amplitude)

2 In these cases waveform refers to the words or syllables that begin or end with "s".

The 100 sentences contained 10 instances of each word in the matrix. APP was used to group together each of the 10 instances and inspect the waveforms and video frames for potential segmentation points. The same editing rule was applied to each of the 10 instances.

Waveforms starting at zero amplitude

Appendix 4 shows that all sentences containing the word "bought" have a clearly defined point of zero amplitude at the start of the word. The first two instances are illustrated in Figure 12.

Figure 12 - Auditory-visual segmentation of waveforms starting at zero amplitude ("Amy bought two big bikes" and "David bought three big books", with the point of zero amplitude marked at the start of "bought")

The point of zero amplitude on a waveform represents a silent space between words, which was found to be the most ideal segmentation point for the audio. The video frame at the segmentation point was also inspected to ensure there was a uniform facial expression across all instances of the word. The mouth closed position was found to be the most ideal segmentation point for the video as it provides a smoother transition between video frames compared to the mouth open position.

Waveforms ending with "s"

It was found that the point of zero amplitude at the start of a word is not always the most ideal segmentation point. Appendix 4 illustrates that words ending with "s" have a point of zero amplitude at the beginning of the "s" waveform. The first two instances of the word "gives" are illustrated in Figure 13.

Figure 13 - Auditory-visual segmentation of waveforms ending with "s" ("Amy gives two cheap bikes" and "David gives three cheap books", with points of zero amplitude marked at the start of the word and before the "s")

While a point of zero amplitude could be found at the start of each instance of the word "gives", the open mouth position in the video frames was not ideal. Instead, the segmentation point was made before the "s", where the corresponding video frames had a more consistent facial expression and a closed mouth position.

Waveforms beginning with "s"

Appendix 4 illustrates that words beginning with "s" often do not have a point of zero amplitude at the start of the word. Two instances containing the word "some" are illustrated in Figure 14.
Figure 14 - Auditory-visual segmentation of waveforms beginning with "s" ("Thomas wants some red spoons" and "Thomas wins some small spoons", with the point of zero amplitude marked after the "s")

The "s" waveforms at the end of "wants" and "wins" blend together with the "s" at the start of "some", eliminating the point of zero amplitude between the words. The most ideal3 cutting point for the audio was found to be the point of zero amplitude at the end of the "s" waveform. While the corresponding mouth position in many cases was not closed, the selection of an ideal cutting point for the audio took precedence over the video.

3 Most ideal refers to the subjective smoothness of audio and video transition points.

Waveforms containing a point of zero amplitude

Words that did not start or end with "s" were inspected for a point of zero amplitude as a potential cutting point. Two instances containing the word "large" are illustrated in Figure 15.

Figure 15 - Auditory-visual segmentation of waveforms containing zero amplitude ("Amy likes two large bikes" and "Sophie likes twelve large shoes", with the point of zero amplitude marked within "large")

It was not possible to make a cut at the start or the end of the word "large" as some instances (e.g. "two large", "large shoes") were blended with the preceding or following waveform, which eliminated the point of zero amplitude between the words. All instances of the word "large" were found to have a point of zero amplitude in the middle of the waveform, which was an ideal cutting point for the audio. With the mouth partially open, it was not the most ideal cutting point for the video; however, the audio cutting point took precedence.

Waveforms ending at zero amplitude

Words that did not contain a point of zero amplitude at the start or within the word were inspected for a point of zero amplitude at the end of the word as a potential cutting point. Two instances containing the word "red" are illustrated in Figure 16.

Figure 16 - Auditory-visual segmentation of waveforms ending at zero amplitude ("Amy wants two red bikes" and "Oscar wants eight red mugs", with the point of zero amplitude marked at the end of "red")

It was not possible to make a cut at the start of the word "red" as some instances (e.g. "two red") were blended with the preceding waveform, which eliminated the point of zero amplitude between the words. The word "red" does not contain a point of zero amplitude within the word. All instances of the word "red" were found to have a point of zero amplitude at the end of the waveform. While the end of a waveform is not usually the most ideal place to make a cut, as it may contain the co-articulation to the next word, the point of zero amplitude was still found to be the best cutting point in terms of the subjective smoothness of the audio and video transition points.

Waveforms containing a point of consistent amplitude

Finally, a point of zero amplitude may not exist at the beginning, in the middle, or at the end of a word. In this case, the waveform was inspected for a point of consistent amplitude4. Two instances containing the word "new" are illustrated in Figure 17.

Figure 17 - Auditory-visual segmentation of waveforms containing consistent amplitude ("Amy sees two new bikes" and "Oscar sees eight new mugs", with the point of consistent amplitude marked in the middle of "new")

It was not possible to make a cut at the start of the word "new" as some instances (e.g. "two new") were blended with the preceding waveform. It was not possible to make a cut at the end of the word "new" as some instances (e.g. "new mugs") were blended with the following waveform.
The word "new" does not contain a point of zero amplitude within the word. While a point of non-zero amplitude is not usually the most ideal place to make a cut, as it may create an audible "transient", cutting at a point of consistent amplitude was found to be an acceptable compromise. As there were numerous options for selecting a point of consistent amplitude, the point where the mouth position was closed was chosen in order to provide the best video cutting point.

4 Consistent amplitude refers to a point where the amplitude of two separate waveforms is approximately equal.

Adjust Video Output

The video output was adjusted (Figure 18) in order to center the viewer's attention on the speaker's face. The brightness and contrast of the video were adjusted to better illuminate the speaker's facial features. The head support system (see 3.1 Control of head position), which was used to maintain the speaker's head in a constant position throughout the recording, created a noticeable pattern of creases on the green screen. In order to remove this, a chroma key was applied to the video output, which replaced the green background with a black background. The chroma key relies on being able to distinguish green from the other colours in the video. The chroma key was unable to remove all of the dark green shadow on the speaker's right side, leaving a patch of green fuzz next to the speaker's neck. In order to remove this, a garbage matte was applied, which blocks anything outside of the matte from appearing in the final video output. The video adjustment procedure was applied to all 400 word pair segments.

Figure 18 - Post recording adjustment of video output (original video → adjust colour and brightness → apply chroma key → apply garbage matte → final video)

Encode

Encoding is the process of writing audio and video files to a format suitable for presentation to the end user. The overall encoding process is illustrated in Figure 19.

Figure 19 - UC Auditory-visual Matrix Sentence Test encoding process (segmented word pairs → resize video output → encode video → convert to concatenateable format; encode audio)

Segmented word pairs

Adobe Media Encoder CS4 Version 4.2.0.2006 (abbreviated here as AME) was used to extract the 400 segmented word pairs from APP and separate them into their audio and video components.

Resize video output

Throughout the segmentation process the video was maintained in its original high definition 720p50 format (Table 4). Starting from high definition, the video can be compacted to suit many user interface displays, such as computer and television screens, without having to modify the sentence segmentation points. The user interface for the project was a computer screen with a small video player window measuring 640 horizontal x 480 vertical pixels. AME was used to separate the audio and video portions of the word pair segments into individual audio and video files (e.g. amy_bought.wav, amy_bought.avi). The video output was resized to fit the user interface using the following parameters:

Video format: Uncompressed Microsoft avi
Video frame size: 640 horizontal x 480 vertical
Video frame rate: 50 frames/second
Pixel aspect ratio: 1.0
Colour depth: 32 bit
Fields: No fields (progressive scan)
Table 5 - UC Auditory-visual Matrix Sentence Test video output settings

The video files were maintained in an uncompressed format in order to preserve their high definition.
The video files were output to a 250 GB external hard drive (TEAC HD3U S/N 7201188) for storage, as each of the 400 uncompressed video files was approximately 200 MB in size. The video files were named according to the word pair they represent, e.g. amy_bought.avi.

Encode video

FFmpeg (www.ffmpeg.org) is a free, open source software tool used to record, convert and stream audio and video. FFmpeg contains a library of encoding and decoding software (codecs) for converting audio and video files between formats, which is widely used in the multimedia industry (Cheng, Liu, Zhu, Zhao, & Li, 2011). FFmpeg version SVN-r18709 was used to encode the word pair video files into Microsoft mpeg4 (Moving Picture Experts Group, standard 4) file format, such that they could be played by the standard version of Windows Media Player installed on every Windows-based computer. As FFmpeg operates from an MS-DOS (Microsoft Disk Operating System) command line, a batch file was created to automate the process of encoding the 400 word pair video files into mpeg4 format (see Appendix 5 for syntax).

Convert to concatenateable format

The word pair segments needed to be able to be joined together in series to form a sentence (i.e. to be concatenateable). As the mpeg4 video format does not allow concatenation, the word pair video files were converted to mpg (Moving Picture Experts Group, standard 1) format, which does support concatenation. FFmpeg was used to convert from mpeg4 format to an mpg file of equal video quality. A batch file was used to automate the process of converting the 400 word pair video files into mpg format (see Appendix 5 for syntax).

Encode audio

AME was used to extract the audio portion of the segmented word pairs from APP and encode them with the following parameters:

Audio format: Windows waveform wav
Audio codec: Pulse code modulated (PCM)
Sample rate: 44,100 Hz
Sample type: 16 bit
Channels: Mono
Table 6 - UC Auditory-visual Matrix Sentence Test audio output settings

The audio files were named according to the word pair they represent, e.g. amy_bought.wav.

2.6 User Interface Development

A user interface (Figure 20) was developed using the LabVIEW (version 9.0.1) development environment. A virtual instrument (VI)5 was written that mixes the word pair segments together on the fly to produce a seamless sentence in less than 1.5 seconds. Sentences can be presented in auditory-alone, visual-alone or auditory-visual mode. Auditory-alone plays sound without video; visual-alone plays video without sound; while auditory-visual plays sound and video together.

5 A virtual instrument, or VI, is the name given to any piece of software written using LabVIEW.

Figure 20 - UC Auditory-visual Matrix Sentence Test user interface

The software code behind the user interface performs three essential functions. The first is to concatenate the audio and video files together to form a sentence. The second is to play the sentence to the user in auditory-alone, visual-alone or auditory-visual mode. The third is to score the response and store the data for analysis. It is essential that each function is performed sequentially.

Concatenate audio and video files

The software design for performing the concatenation process, which joins the audio and video portions of the word pairs together, is illustrated in Figure 21.
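As a concrete illustration of this mixing pipeline, which is summarised in Figure 21 and described step by step in the subsections that follow, the sketch below binary-concatenates the word pair mpg files, re-encodes the result, and muxes it with the concatenated audio track. It is a minimal sketch under stated assumptions: the file names follow the "Amy bought two big bikes" example in the text, the work is shown here in Python rather than in the LabVIEW VI, and the FFmpeg arguments are illustrative only (the exact command syntax used in the thesis is given in Appendices 5 and 6).

```python
# Minimal sketch of the mixing pipeline (illustrative only; see Appendices 5
# and 6 for the command syntax actually used by the LabVIEW VI).
import subprocess

word_pairs = ["amy_bought", "bought_two", "two_big", "big_bikes"]

# 1. Concatenate the word-pair mpg files byte-for-byte. This has the same
#    effect as the MS-DOS "copy /b" concatenation described below. The audio
#    track (Auditory alone.wav) is assumed to have been concatenated already,
#    as arrays of sample values, by the LabVIEW code described in the text.
with open("Visual alone.mpg", "wb") as out:
    for pair in word_pairs:
        with open(f"{pair}.mpg", "rb") as part:
            out.write(part.read())

# 2. Re-encode the concatenated mpg so that file attributes such as duration
#    are rebuilt, giving a file Windows Media Player can play back.
subprocess.run(["ffmpeg", "-y", "-i", "Visual alone.mpg",
                "-an", "-vcodec", "mpeg4", "Visual alone.avi"], check=True)

# 3. Mux the concatenated audio track with the re-encoded video track.
subprocess.run(["ffmpeg", "-y", "-i", "Visual alone.avi",
                "-i", "Auditory alone.wav",
                "-vcodec", "copy", "-acodec", "pcm_s16le",
                "Auditory-visual.avi"], check=True)
```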
Generate sentence Locate video files Concatenate video files Write file Visual alone.mpg Locate audio files Concatenate audio files Write file Auditory alone.wav Convert to .avi Auditory alone.wav + Visual alone.avi = Auditory-visual.avi Figure 21 - UC Auditory-visual Matrix Sentence Test mixing software flow chart Generate sentence For testing purposes, the developer was able to use the response interface to manually select a sentence to be played by pressing the button of each word. The selected sentence was highlighted in green. In the fully developed version (see 3.8 Future Development), sentences will be generated randomly, or pulled from predetermined lists. Locate audio and video files Based on sentence selected, the software identified the audio and video files that were required to generate the sentence. For example, "Amy bought two big bikes" required the following files: Audio Video amy_bought.wav amy_bought.mpg bought_two.wav bought_two.mpg two_big.wav two_big.mpg big_bikes.wav big_bikes.mpg 49 Concatenate audio files Different methods were required for concatenating the audio files and video files together. LabVIEW was able to treat audio files as arrays of sample values (44,100 Hz, 16 bit, mono) which could then be concatenated to form a longer audio file. Write audio files The four word pair audio files making up a sentence were concatenated to form a single wav file. In the case of "Amy bought two big bikes" the word pairs were: amy_bought.wav + bought_two.wav + two_big.wav + big_bikes.wav = Auditory alone.wav The Auditory alone.wav file is a temporarily stored file. It always maintains the same name; however, its content is overwritten each time a new sentence is generated. Concatenate video files The standard version of LabVIEW does not include any pre-programmed code for concatenating video files; however, it does allow for an MS-DOS command line to be executed. The mpg video files were joined together by utilising the concatenate function of MS-DOS (see Appendix 6 for syntax) Write video files The four word pair video files making up a sentence were concatenated together to form a single mpg file. In the case of "Amy bought two big bikes" the word pairs were: amy_bought.mpg + bought_two.mpg + two_big.mpg + big_bikes.mpg = Visual alone.mpg Visual alone.mpg is also a temporarily stored file, with its content overwritten each time a new sentence is generated. Some attributes critical for playback, such as the length of the file, are not preserved during the concatenation process. 50 Windows Media Player requires such information in order to playback the file correctly. Visual alone.mpg was re-encoded into mpeg4 format to enable playback by Windows Media Player. The conversion from Visual alone.mpg to Visual alone.avi was made using FFmpeg. Finally, the Auditory alone.wav and Visual alone.avi files were joined together using FFmpeg to form Auditory-visual.avi (see Appendix 5 for syntax). Playback sentence The software design for performing sentence playback is illustrated in Figure 22. Before the playback code can execute, the mixing code must have completed writing the audio and video files Auditory alone.wav, Visual alone.avi and Auditory-visual.avi. Playback mode sentence Open file in WMP Set WMP controls Play file Close WMP Figure 22 - UC Auditory-visual Matrix Sentence Test playback software flow chart Playback mode The user selects the playback mode via the Auditory-alone, Visual-alone and Auditory-visual selection menu in the user interface (Figure 20). 
Based on that choice, the software selects the corresponding file for playback. Open file in Windows Media Player The LabVIEW development environment is able to utilise the functionality of Microsoft Windows Media Player via an Active X container. Active X allows software manufacturers to embed each others code in their applications. The software creates an Active X container for Windows Media Player and then loads either Auditory alone.wav, Visual alone.avi or Auditory-visual.avi. Set Windows Media Player controls 51 Windows Media Player controls were set to play the audio and video files in a 640 horizontal x 480 vertical pixel window in the user interface. The standard Windows Media Player controls such as stop, start and volume were hidden from the user so they could not be modified. Control of these variables was instead performed by the software. Play file then close Windows Media Player The selected audio and video files play to the end and then close, which allows for the next file to be played. 52 2.7 Video Transition Analysis The key feature of matrix sentences is the ability to generate 100,000 unique sentences from just 100 recorded sentences edited into 400 word pairs. However, the large number of possible word pair combinations can result in the pairing of video frames that do not match up sufficiently well enough to produce a smooth looking transition between frames. A method for objectively evaluating the smoothness of the video transitions was developed. A VI was written to extract the first and last frame of each word pair video and store them as jpeg (Joint Photographic Experts Group) images (Figure 23). nine good (last frame) good ships (first frame) sold eight (last frame) eight big (first frame) Figure 23 - Audio-visual word pair video transitions Another VI was written to measure the absolute difference in RGB (Red, Green, Blue) colour channels between images. The maximum possible difference is 78643200 (640 horizontal pixels x 480 vertical pixels x 256 colours). The smaller the absolute difference between images, the smoother the transition. The best transition was "nine-good ships" with a difference of 268393 or 0.34% (268393/78643200 x 100%), while the worst transition was "sold-eight big" with a difference of 1313803 or 1.67%. An analysis was also made of the difference 53 between images with the mouth region excluded (range = 0.23% - 0.96%). The 400 word pair segments provided 3000 unique transitions. The smoothness of the transitions was normalised by ranking the absolute percentage difference between frames from 1 to 3000 (Figure 24). Figure 24 - Normalised smoothness ranking of video transitions 54 The 3000 video transitions were assigned a quality rating between 0% (worst transition) and 100% (best transition). The lower quality video transitions (e.g. bottom 10%) can be excluded from the matrix sentences; however, the more word pairs that are excluded, the less unique matrix sentences are available. A spreadsheet macro was written to list all 100,000 unique sentences. A lookup function was then used to mark sentences containing any of the unacceptable word pair transitions. The unacceptable sentences were filtered out of the list and the remaining acceptable sentences were counted (Figure 25). Including or excluding the mouth region from the analysis gives essentially the same function. 
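The transition analysis and the sentence-count exercise were implemented as LabVIEW VIs and a spreadsheet macro. The sketch below, written in Python with Pillow and NumPy and using hypothetical frame file names, illustrates the same two calculations: the percentage RGB difference across a transition, normalised by the 78,643,200 maximum quoted above, and the count of sentences that survive once a set of unacceptable transitions (identified by their word triples) is excluded.

# Sketch of the calculations described above, not the LabVIEW/spreadsheet
# implementation actually used. Frame images are assumed to have been exported
# as 640 x 480 jpegs of the last/first frames of adjoining word pairs.
import itertools
import numpy as np
from PIL import Image

MAX_DIFF = 640 * 480 * 256   # normalising constant used in section 2.7 (78,643,200)

def transition_difference(last_frame_jpeg, first_frame_jpeg):
    """Percentage RGB difference between the adjoining frames of two word pairs."""
    a = np.asarray(Image.open(last_frame_jpeg).convert("RGB"), dtype=np.int64)
    b = np.asarray(Image.open(first_frame_jpeg).convert("RGB"), dtype=np.int64)
    return 100.0 * float(np.abs(a - b).sum()) / MAX_DIFF

def count_available_sentences(names, verbs, numbers, adjectives, objects, bad_triples):
    """Count 5-word sentences containing none of the excluded transitions.

    A transition is identified by the word triple it spans, e.g. ("sold", "eight", "big").
    """
    available = 0
    for sentence in itertools.product(names, verbs, numbers, adjectives, objects):
        if not any(sentence[i:i + 3] in bad_triples for i in range(3)):
            available += 1
    return available

Ranking the 3000 transition scores and passing the worst-ranked triples to count_available_sentences gives the kind of availability function plotted in Figure 25.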
Sentence Availability Function Figure 25 - Percentage of word pairs excluded vs number of available matrix sentences 55 3 Discussion 3.1 Control of Head Position A stable head position was found to be essential for the visual component of the UC Auditory-visual Matrix Sentence Test. A number of different schemes were trialled before arriving at the method described in section 2.3. Without any head support A trial recording was made of the author voicing the 100 sentences listed in Appendix A without any head support. The sentences were pre-recorded and then played back with a gap between sentences such that the author could listen to the sentence, voice the sentence during the gap, and then return to the mouth closed position in anticipation of the next sentence. Upon editing the recorded sentences into word pairs and then recombining them into new sentences, a number of noticeable jumps were observed in the video output. The jumps were occurring at the cutting points between word pairs. Slight changes in head position throughout the recording were causing a mismatch between video frames when the word pairs were recombined to form new sentences. While the transition of the audio component had a natural sound, the rapid jumps in head position made the video component look unnatural. Support behind the head A basic head support system (Head support 1, Figure 26) was constructed behind the green screen of the recording set-up (Figure 4). A piece of Styrofoam was used to provide a basic cradle for the back of the head while the foam and towel provided cushioning. The head support was fixed to the wall with tape in order to hold it in a constant position. 56 Head support 1 Head support 2 Figure 26 - UC Auditory-visual Matrix Sentence Test head support system The author made another recording of the 100 sentences (Appendix A). While the head position was more stable than without the head support, the jumps between video editing points was still very noticeable. Changes in facial expression were also noticeable. The first few sentences were made with a lot of enthusiasm and expressiveness; however, as the recording session progressed, tiredness set in and the last few sentences did not have the same energy and facial expression as the first. One of the main factors contributing to the tiredness was trying to maintain the body in a still, upright position for an extended period of time. Lying on the floor A trial recording was made with the author lying on the floor with supports on both sides of the head. The lying position reduced the stress on the muscles and required very little effort to maintain position. The resulting word pair video transition points were very stable; however, the affect of gravity pushing straight down on the face produced a distorted, unnatural looking facial expression. 57 Self correction of head position A trial recording was made using a visual feedback system. The output of the video camera (Figure 4) was fed into a screen located above the video camera. A sheet of transparent plastic was placed over the screen and the silhouette of the author’s head was traced with a marker pen. This allowed the author to monitor his own head position on the screen. Unfortunately, this method of self correction did not prove to be successful. The combination of voicing the sentences, monitoring head position on screen, and making subtle adjustments simultaneously was too much. 
The feedback screen provided a mirror image, which meant that if the head was out of position to one side, the author needed to move in a counter intuitive direction to correct it. Also, with the feedback screen mounted above the camera lens, the author appeared to be gazing upwards in the recorded video. Multiple recordings with support behind the head Experimentation into different head support systems had been undertaken by the author in order to find the best solution before a recording was made by the official speaker. With the support behind the head proving most successful, the method was refined to provide to most stable head position for the speaker. A new head support system (Head support 2, Figure 26) was made from pieces of cardboard in order to provide a more rigid support system with more pressure points in contact with the back of the head. The support was adjusted to the speaker’s height while allowing her to sit comfortably in a chair with good back support. The autocue system allowed the speaker to look directly at the camera lens, which avoided previous issues with gaze direction. The autocue system also allowed for a faster presentation of sentences. Reading sentences from an autocue was faster than the auditory presentation system used previously whereby the speaker had to wait for each sentence to be played before repeating the sentence. A faster presentation of sentences reduced the recording time and hence the speaker was not as tired by the end of the recording session. There was still a noticeable difference between the first few sentences and the last few sentences in 58 the recording. The speaker’s facial expression tended to become more sombre and her position would slump further down as the recording progressed. For this reason, three recordings were made in succession without giving the speaker a break in between. In this way, the first recording showed a difference in position and facial expression; while during the second recording the speaker fell into a consistent rhythm, and by the third recording the speech, position and facial expression were quite uniform from start to finish. While the method developed certainly proves the concept of an auditory-visual matrix sentence test can work, there is still room for improvement, especially in the development of better head support systems (see 3.8 Future Development). 59 3.2 Video Recording Procedures A number of different video recording techniques were trialled before arriving at the method described in section 2.3. Video frame rate Video recordings were initially made using the PAL (Phase Alternating Line) television format, which has been used in New Zealand and around the world for decades as a standard format for analogue television transmission (Arnold, Frater, & Pickering, 2007). The PAL format is captured at a rate of 25 video frames per second. When applied to the UC Auditory-visual Matrix Sentence Test, a frame rate of 25 fps was found to be too slow to capture the changes in mouth position between word pairs. It was found that the mouth position could pass from open to closed within 1 frame, which sometimes caused a misalignment of video frames between adjoining word pairs. This resulted in a noticeable jump in mouth position when sentences were played back. The 720p50 format (Table 4) was used in the final recording of the UC Auditoryvisual Matrix Sentence Test as it has a frame rate of 50 fps, which is double that of the PAL format. 
The higher frame rate provided better definition to the mouth position, thus making for a smoother transition between word pairs when the sentences were played back. A frame rate greater than 50 fps may provide even better performance; however, there is the issue of video playback compatibility to consider. The 720p50 format is a high definition digital television format that is supported by many standard computer media players such as Microsoft Windows Media Player. Frame rates greater than 50 fps may not be supported by standard media players. 60 Video recording setup Section 2.5 describes a number of adjustments that were made to the video output post recording. These adjustments included changes to brightness and contrast and the keying out of the green background. Some of these adjustments were quite time consuming and could have been avoided if more attention had been paid to the recording setup. Figure 27 illustrates a frame from the original recording from which potential improvements can be identified. See through hairline Shadow Uneven shoulder alignment Figure 27 - UC Auditory-visual Matrix Sentence Test recording setup errors The lighting used was perhaps not bright enough. This resulted in the need to adjust the brightness and contrast of the final video. The lighting was angled such that it created a shadow on the speaker’s right side. This shadow, coupled with the crinkles in the green screen from the underlying head support, created a background colour with varying shades of green. The keying process used to remove the green background relies on the difference in colour between the background and foreground. The dark green shadow contained elements of colour similar to the darkness of the speaker’s hair, eyes and shirt, which made the separation of background from foreground very difficult. This necessitated the use of more advanced and time consuming filtering techniques such as the application of a garbage matte filter in order to completely remove the green background. More attention paid to good lighting, a uniform green background colour, and 61 clear contrast between foreground and background would simplify and hasten the video editing process. Figure 27 also illustrates an uneven shoulder alignment, with the speaker’s right shoulder being higher than the left. As the speaker was wearing a black shirt, the use of a black background in the final video helped to mask out the difference in shoulder alignment. The black background also helped to mask out the residue left behind when the green background was keyed out. A grey or transparent background colour in the final video would have been more appealing, but would only be possible with a consistent shoulder position and more attention paid to background and lighting as described above. 62 3.3 Video Editing Procedures A number of video editing techniques for stabilising the video transition between word pairs were investigated. Alpha channel masking A video is essentially a series of picture frames with each picture being made up of a number of pixels. The 720p50 video format (Table 4) uses 32 bit colour depth. The first 24 bits (3 channels) are used to define the RGB colour of each pixel, while the last 8 bits, which is known as the alpha channel (Porter & Duff, 1984), defines the transparency. When two picture layers are combined, the alpha channel acts as a mask by defining how much of the underlying picture shows through the overlying picture. 
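The effect of the alpha channel can be expressed as a simple per-pixel weighted sum (the "over" operation of Porter & Duff, 1984). The NumPy sketch below illustrates that arithmetic only; it is not the Adobe After Effects masking workflow described below.

# Per-pixel illustration of alpha compositing: the alpha channel weights how much
# of the foreground layer shows over the background layer.
import numpy as np

def composite_over(foreground_rgba, background_rgb):
    """Blend an RGBA foreground frame over an RGB background frame."""
    fg = foreground_rgba[..., :3].astype(np.float32)              # RGB channels
    alpha = foreground_rgba[..., 3:4].astype(np.float32) / 255.0  # 0 = transparent, 1 = opaque
    bg = background_rgb.astype(np.float32)
    blended = alpha * fg + (1.0 - alpha) * bg                     # weighted per-pixel mix
    return blended.astype(np.uint8)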
Figure 28 - Alpha channel mask In order to stabilise the transitions between word pair video frames, an alpha channel mask was applied to the eyes, nose and mouth region of the face. Adobe After Effects CS4 (abbreviated here as AAE) was used to apply the mask to each video frame. In this way, the essential facial features (eyes, nose, mouth) associated with speech showed through the mask, while the outline of the face (hair, ears, neck) remained still. Unfortunately, the jumps in facial position 63 between video frames against the static outline of the head created unnatural looking facial movements in the final video output. Head stabilisation algorithms An attempt was made to smooth out the jumps between word pairs using video image stabilisation algorithms. AAE contains a number of built-in algorithms for video image stabilisation. Options are available to stabilise position, rotation and scale. Each stabilisation algorithm functions in a similar way; a point or series of points on the video image are nominated and then the software adjusts the video frames in order to maintain the chosen point(s) in a fixed position, rotation or scale. Figure 29 - Head position stabilisation algorithm Figure 29 illustrates an attempt at head position stabilisation by fixing a point on each eyebrow. AAE adjusted each frame in the video in order to keep "Track Point 1" and "Track Point 2" in a fixed position in the final video output. Various combinations of tracking points were trialled including the eyebrows, eyes, ears, nose, mouth and neck. The results were similar in each case; the tracking points remained in a solidly fixed position while the rest of the face jumped up and down and from side to side in an unnatural looking manner. 64 It became clear that physical control of the head position was the most important factor in determining smooth transitions between word pair video frames. If the physical movements of the head position are too great, the image stabilisation algorithms will not be able to compensate for it. While the attempts at image stabilisation were not entirely successful, they did show promise for the fine tuning of the final video output. Further investigation of video image stabilisation techniques is recommended (see 3.8 Future Development). 65 3.4 Video Transition Analysis With 400 word pairs that can be used in combination to form 100,000 unique sentences, it is inevitable that some sentences will sound more naturally spoken than others. A natural sounding sentence requires smooth transitions between adjacent word pairs. Some previous matrix sentence tests have excluded unnatural sounding sentences from their final test lists. The developers of the British version (Hall, 2006; Hewitt, 2007) randomly generated lists of test sentences, which were then evaluated to ensure they sounded naturally spoken and did not contain clicks or other audible unwanted inclusions. Hall (2006) found that their female speaker had difficulty pronouncing the word pair "cheap chairs", and this error was carried through to the final sentences. Hall (2006) therefore removed any sentences with "cheap chairs" and generated new sentences in their place. The developers of the Spanish matrix sentence test (Hochmuth et al., 2012) evaluated different concatenation overlap durations between successive word pairs. Hochmuth et al. (2012) found that some sentences had a better sound quality using an overlap of 15 ms, while for other sentences, fewer artefacts were noticed when no overlap was used. 
Only those sentences with the least perceivable artefacts were used for the further development of the test. Adding a video component to the UC Auditory-visual Matrix Sentence Test creates more complexity as both the audio and video components need to sound and look naturally spoken. Unlike previous matrix sentence tests, no ramping or overlapping of the sound files was used. While these methods were trialled, a more natural sounding audio transition was achieved through careful selection of the word pair cutting points. Furthermore, the absence of ramps and overlaps allowed the audio and video components to remain synchronized. Very few sentences were found in which the video component looked natural, while the audio component sounded unnatural. However, sentences containing the word "red" were found to be one such example. The "red" waveform does not contain a sustained interval of near-zero amplitude and must therefore be cut in the middle of the waveform. When combined with other word pairs, the mismatch in 66 waveform amplitude can create an audible click. Sentences in which the audio component sounds natural while the video component looks unnatural were much more common. A method for eliminating the worst video transitions by comparing the difference in RGB colour is described in section 2.7. "Nine-good ships" was found to be the best video transition. "Peter has nine good ships" was part of the 100 originally recorded sentences and therefore "Nine-good ships" is naturally spoken and not made up of word pairs from separate sentences. The worst transitions were found to be those containing the word pair "eight-big". Appendix A shows the "eight" waveform in the originally recorded "Oscar bought eight big mugs" has the greatest amplitude of all of the "eight" waveforms. It was the fifth of one hundred sentences and perhaps the burst of energy coincided with the speaker realising she was at the start of the sentence list for the third and final time. While the reason for the extra effort is open for debate, the greater waveform amplitude and open mouth position created the largest difference between adjoining word pairs, and therefore the worst transition. Two sets of video transition analysis were made; the first including the mouth position and the second excluding. The mouth region contains the most movement and therefore by excluding it from the analysis the absolute difference between video frames is reduced (Figure 24). When the mouth region is excluded the difference between video frames can be mainly attributed to changes in head position. As more word pairs are excluded, the number of available sentences is reduced. Although this function has been modelled (Figure 25), care needs to be taken with the interpretation. One cannot simply exclude 50% of the word pairs and think the remaining 10,000 sentences will be more than enough sentence material. Although there may be many unique sentences remaining, they will be very repetitive as the same word pairs are used over and over again. Also, by excluding 67 word pairs the phonemic distribution of the matrix becomes increasingly unbalanced. Straight subtraction of RGB colour channels may not be the best way to evaluate the smoothness of the video transitions. The comparison of motion vectors between video frames may be more appropriate and further investigation into video image analysis techniques is recommended. 
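One way the suggested motion-based comparison could be prototyped is with a dense optical-flow estimate between the adjoining frames, taking the mean flow magnitude as the smoothness score. The sketch below uses OpenCV's Farneback estimator; it is a starting point only, and the parameter values are illustrative rather than tuned for this material.

# Hypothetical alternative to the RGB subtraction metric: mean optical-flow
# magnitude across a word-pair transition (smaller = smoother).
import cv2
import numpy as np

def motion_score(last_frame_jpeg, first_frame_jpeg):
    """Mean motion magnitude (in pixels) between the adjoining frames of two word pairs."""
    prev_grey = cv2.cvtColor(cv2.imread(last_frame_jpeg), cv2.COLOR_BGR2GRAY)
    next_grey = cv2.cvtColor(cv2.imread(first_frame_jpeg), cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_grey, next_grey, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    return float(magnitude.mean())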
3.5 Clinical Applications

Matrix sentence tests in the auditory-alone modality have found clinical application where repeated measures of speech audiometry on the same subject are required. In contrast to other sentence tests, such as everyday sentences, Plomp sentences and HINT sentences, the matrix sentences are unlikely to be memorised. This makes matrix sentences very useful for hearing aid evaluations. The simple 50-word structure also makes matrix sentences useful for evaluating cochlear implant users and for testing children.

In addition to the applications described above, the UC Auditory-visual Matrix Sentence Test will allow the evaluation of lip-reading and auditory-visual integration abilities. As described in section 1.9, the auditory-visual integration abilities of the hearing impaired vary greatly between individuals. The measurement of these abilities will help to provide counselling on realistic expectations as to the likely benefits of sensory aids, and a better understanding of the listening environments where such aids are likely to be effective. Ultimately, the assessment of auditory-visual integration abilities can be used to develop individualised aural rehabilitation strategies.

3.6 Research Applications

The clinical uses of the UC Auditory-visual Matrix Sentence Test also extend to the research environment, where studies involving repeated measures are more common. The practically endless supply of sentences makes the UC Auditory-visual Matrix Sentence Test a powerful tool for researchers needing repeated measures of speech audiometry.

While conventional speech-in-noise tests have used multi-talker babble and speech-spectrum masking noise, the body of evidence in the literature (see 1.4 Masking Noise) suggests that amplitude-modulated noise and/or noise with spectral gaps may provide better separation of normal hearing and hearing impaired listeners. Four different masking noise options will be made available in the UC Auditory-visual Matrix Sentence Test:

1. Continuous octave band noise
2. Octave band noise with spectral gaps
3. Amplitude modulated noise
4. Amplitude modulated noise with spectral gaps

The specific characteristics and implementation of the masking noise options will require further investigation. However, once implemented, these options will allow research to be conducted in the auditory-alone, visual-alone and auditory-visual modalities in order to find the masking parameters that provide the best sensitivity and specificity for separating normal hearing and hearing impaired groups of listeners.

3.7 Limitations

Matrix sentence tests in the auditory-alone modality tend to lack the real-world context that is provided by everyday sentences, Plomp sentences and HINT sentences. In contrast to monosyllabic speech tests, matrix sentence tests require good working memory to store and recall the five-word sentences. The same limitations apply to the UC Auditory-visual Matrix Sentence Test. In addition, the visual stimuli are more susceptible to looking unnatural than the auditory stimuli are to sounding unnatural. The human visual system is very sensitive to the kind of rapid movements created when there is a mismatch between video frames (Cropper & Derrington, 1996).

3.8 Future Development

The future development of the UC Auditory-visual Matrix Sentence Test needs to progress along three pathways.
Firstly, there needs to be further refinement of the procedures for stabilising the head position, which is essential in order that the sentences in the final video output look naturally spoken. Secondly, the core software application which mixes the audio and video files together to form sentences needs to be integrated with the University of Canterbury Adaptive Speech (UCAST) test platform. The UCAST already contains an adaptive procedure for detecting speech reception threshold, the ability to add masking noise, and scoring procedures. Thirdly, the auditory and visual speech stimuli need to be normalised. Stabilisation of head position The speaker’s head needs to be held in a constant position throughout the entire sentence recording session. If this can be achieved, the transition between word pair video frames will appear smooth and naturally spoken in the final video output. The head support system illustrated in Figure 26 was very basic in order to prove the concept. More advanced head support systems need to be investigated. The ideal head support system would provide support behind the neck and shoulder region such that the speaker is not able to slump down as the recording session proceeds. The ideal head support system would also prevent the speaker from straining to maintain an upright body position, which would help to maintain a more constant facial expression. 70 One possibility is to use a head alignment clamp that has been developed previously for eye tracking studies (Figure 30). Figure 30 - Head alignment clamp (Swosho, 2012) The advantage of a support behind the head is that it can be hidden from view or easily edited out with a green screen. However, the disadvantage is that it may not provide enough support to prevent the subtle head movements that can ruin a video recording session. 71 Another possibility is to use a head brace that has been developed previously for the treatment of neck and spinal injuries. Figure 31 - Halo head brace (Bremer Medical Incorp, Jacksonville, FL, USA) The advantage of a halo head brace is that the head is supported at all angles, which should result in a very stable head position. The disadvantage is the effort required to edit the brace out of the final video output. However, the alpha channel masking techniques described in section 3.3 would be suitable in this case. An image of the speaker without the head brace could be used to mask out the brace during the post recording editing process. While the bulk of the head movement needs to be controlled with a physical support, there is the potential for fine tuning of the final video output with image stabilisation algorithms. Experiments using algorithms that attempt to stabilise one or two reference points in the video have not been successful (see 3.3 Head stabilisation algorithms). Algorithms that compare entire video frames and attempt to find the best average between them should be investigated. A lot of computing power is required for editing video footage, especially in high 72 definition at frame rates of 50 fps. It is recommended that the highest specification computer processor and graphic card that resources will allow be used for the editing of future versions. A 64 bit Windows operating system is also recommended as the latest versions of video editing software suites such as Adobe no longer support the 32 bit version of Windows. 
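As a concrete starting point for the whole-frame approach suggested above, the sketch below estimates the global drift of a frame relative to a reference frame by phase correlation and shifts the frame to cancel it. This is a hypothetical illustration using OpenCV, not a method applied in the present version of the test.

# Hypothetical whole-frame stabilisation: estimate the translation between a
# reference frame and the current frame, then shift the current frame back.
import cv2
import numpy as np

def stabilise_to_reference(reference_bgr, frame_bgr):
    """Return frame_bgr shifted so that it aligns with reference_bgr."""
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    cur = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    (dx, dy), _response = cv2.phaseCorrelate(ref, cur)        # estimated drift of the frame
    height, width = frame_bgr.shape[:2]
    correction = np.float32([[1, 0, -dx], [0, 1, -dy]])       # translation cancelling the drift
    return cv2.warpAffine(frame_bgr, correction, (width, height))

Because the correction is computed from the whole frame rather than from one or two tracking points, it should avoid the effect seen in section 3.3 where the tracked points stayed fixed while the rest of the face moved.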
Once the techniques for head stabilisation have been finalised, a subjective rating exercise to assess the "naturalness" of the UC Auditory-visual matrix sentences will be undertaken with a group of 20 - 30 normal hearing listeners. Listeners will be presented with 50 actual sentences (i.e. original sentences voiced by the speaker), and 50 synthesised sentences, all in randomised order. The listeners will subjectively rate the sentences on a scale of 1 – 10 (1 = very unnatural, 10 = very natural). The procedure will be repeated in auditory-alone, visual-alone, and auditory-visual modes. The results will be compared against the objective measures of video image stability described in section 2.7. 73 Integration with University of Canterbury Adaptive Speech Test The University of Canterbury Adaptive Speech Test (UCAST; O’Beirne, McGaffin, & Rickard, 2012) is based on the Monosyllabic Adaptive Speech Test of Mackie and Dermody (1986). The UCAST was developed as an adaptive, lowpass filtered speech test whereby the user selects one of four choices from a touch screen. Figure 32 - University of Canterbury Adaptive Speech Test user interface The UCAST was developed in the LabVIEW environment and includes programming for the addition of masking noise, adaptive threshold seeking procedures, touch screen functionality and scoring. The UCAST is being developed into suite of audiological tests which will include the NZHINT (Hope, 2010) and New Zealand Digit Triplet Test (NZDTT; King, 2011). The NZDTT is a hearing screening tool that uses spoken numbers presented in background noise to estimate speech recognition thresholds. The UC Auditory-visual Matrix Sentence Test will also be integrated with, and include the functionality of, the UCAST platform. 74 Normalisation The normalisation of the UC Auditory-visual matrix sentence stimuli presents specific challenges because of the way the sentence material is edited into fragments. Normalisation is aimed at ensuring that each audio or video fragment is equally difficult, and this in turn ensures that sentences made from these fragments are also equally difficult. Because scoring a response correct or incorrect is done at the word level, the score must be mapped onto the audio or video files that contain that that particular material. The sentence "Amy bought two big bikes" is formed from four files, named amy_bought, bought_two, two_big, and big_bikes. Similarly, "William wins those small toys" is made from william_wins, wins_those, those_small, and small_toys. However, due to the complicated editing process, the way the written words correspond to the sounds in the files is quite different in each case, as shown below: Figure 33 - Word pair vs sound file contents The scoring procedures for the UCAST have been coded in the LabVIEW environment based on binary number operations. In order to integrate the UC Auditory-visual Matrix Sentence Test with the existing binary functions of the UCAST, the following scoring procedure is proposed: 75 Example 1: "Amy bought two big bikes" Figure 34 - Scoring of "Amy bought two big bikes" The word pairs are divided into parts A and B in order to track their location with each sound file. Figure 34 shows that Amy_bought (Part A_part B) is represented by "A" + "my" = 1A + 1A = 2A, while "bought" is represented by "bou" + "ght" = 0B + 0B = 0B. The zero indicates that the "bought" portion of the sound file is not actually contained within Amy_bought. 
Instead, the "bought" portion is located in part A of the bought_two sound file. When a user makes an incorrect response, e.g. "Amy bought those big shirts" instead of "Amy bought two big bikes", the scores for the incorrect words ("those" and "shirts”) are set to zero within the binary matrix. The user’s response is then compared to the actual sentence presented in order to calculate the score e.g. ["shirts" (incorrect) = 0B] / ["bikes" (correct) = 2B] = 0% for part B of big_bikes 76 Example 2: "William wins those small toys" Figure 35 - Scoring of "William wins those small toys" Figure 35 shows that the sound "wins" is divided across the two files William_wins and wins_those. The "win" portion is contained in part B of William_wins while the "s" portion is contained in part A of wins_those. If the user selects "wins" as a response, the score is represented by 1B in the William_wins file and by 1A in the wins_those file. Any scores that create a division by zero error (e.g. 0B / 0B those) are reassigned to 0%, which indicates an incorrect response. 77 A pilot study will be undertaken in order to determine the approximate signal-tonoise ratio (SNR) corresponding to 30, 50 and 70% intelligibility of the matrix sentences. The resulting three fixed SNR values will be tested with a group of normal hearing listeners to find the word specific speech intelligibility functions (Figure 36). Figure 36 - Word specific intelligibility function Each participant will be presented with 20 practice sentences in order to become familiar with the testing procedure and user interface. The participants will then be presented with 100 randomly generated sentences containing all 400 word pair combinations. A spreadsheet macro has been prepared to randomly generate lists of 100 sentences while ensuring all 400 word pairs are used once. Masking noise will be presented at a constant level of 65 dB SPL while the level of the sentence presentation is varied in order to achieve a SNR corresponding to 30, 50 or 70% intelligibility. Participants’ responses will be scored using the procedure described above and stored by the user interface for analysis. Ten data points will be recorded for each SNR for each word pair combination. A total of 12,000 data points will be recorded (100 sentences x 4 word pairs per sentence x 3 SNR). The average SNR corresponding to 50% intelligibility of each word pair combination will be compared to the average SNR corresponding to 50% intelligibility of the sentences (Figure 37). 78 Figure 37 - Participants' word intelligibility vs sentence intelligibility (hypothetical example) The levels of each word pair will be adjusted up or down in order to match the average intelligibility of the sentences. From the example given in Figure 37, the level of "William" needs to be decreased while the level of "wins" needs to be increased in order to match the sentence intelligibility. Level adjustments will be limited to ±4 dB as previous studies (Wagener et al., 2003; Ozimek et al., 2010) have found this to be the maximum allowable level adjustment that can be made without causing the sentences to sound unnatural. Equalising the intelligibility of each word in the matrix should result in sentences of equal intelligibility. Another group of normal hearing listeners will be required to confirm this hypothesis. Studies comparing performance in auditory-alone, visual-alone and auditoryvisual modes should also be undertaken. 
4 Conclusions

• The UC Auditory-visual Matrix Sentence Test holds promise as a multimodal test of speech perception.
• Presentation in auditory-alone, visual-alone, and auditory-visual modes allows assessment of auditory-visual integration abilities.
• A stable head position and consistent facial expression are essential for a natural-looking synthesized sentence.
• More research is required into both physical and mathematical methods for stabilising head position.

Appendix 1 – New Zealand Matrix Sentence Recording List

Appendix 2 – Phonemic Distribution Analysis

Appendix 3 – Vowel Formant Frequency Analysis

First (F1) and second (F2) formant frequencies measured from three tokens of each /hVd/ word, with averages:

Word     F1 (Hz)          F1 average   F2 (Hz)              F2 average
Had      622, 596, 587    602          2227, 2202, 2290     2240
Hard     841, 818, 838    832          1550, 1555, 1632     1579
Head     405, 407, 401    404          2652, 2417, 2430     2500
Heard    516, 498, 481    498          1963, 1917, 1974     1951
Heed     366, 372, 392    377          2645, 2752, 2779     2725
Hid      519, 529, 529    526          2166, 2064, 2087     2106
Hoard    503, 493, 508    501          781, 775, 788        781
Hod      652, 660, 639    650          1068, 1042, 1044     1051
Hood     549, 550, 527    542          1104, 1271, 1101     1159
Hud      841, 821, 872    845          1524, 1442, 1638     1535
Who'd    438, 426, 460    441          1690, 1644, 1647     1660

Appendix 4 – Audio-visual Segmentation Points

Figures marking the audio-visual segmentation points within each of the 100 recorded sentences. The recorded sentences pair each name with a fixed number (Amy two, David three, Hannah four, Kathy six, Oscar eight, Peter nine, Rachel ten, Sophie twelve, Thomas some, William those) and each verb with a fixed adjective (bought big, gives cheap, got dark, has good, likes large, kept green, sees new, sold old, wants red, wins small), with the ten objects (bikes, books, coats, hats, mugs, ships, shirts, shoes, spoons, toys) cycling through each verb block.

Appendix 5 – FFmpeg Command Syntax

Definitions:
-ab 192k = set the audio bit rate to 192 kbits/sec
-an = no audio
-b 1000k = set the maximum total bit rate to 1000 kbits/sec
-i = in file
-s 640x480 = set the picture size to 640 horizontal x 480 vertical pixels
-acodec copy = copy existing audio codec
-newaudio = create a new audio file
-newvideo = create a new video file
-sameq = conversion from mpeg4 to mpg of the same video quality
-vcodec copy = copy existing video codec
-vcodec msmpeg4 = encode video in Microsoft mpeg4 file format
-vn = no video
-y = overwrite without prompting
amy_bought.avi = word pair video file in mpeg4 format
amy_bought_1.avi = uncompressed word pair video file
amy_bought.mpg = word pair video file in mpg format
folder 1 = location of FFmpeg executable
folder 2 = folder containing video files
folder 3 = folder containing audio files

Conversion from uncompressed video to mpeg4 format (in avi container):
"C:\folder 1\ffmpeg.exe" -i "C:\folder 2\amy_bought_1.avi" -y -an -vcodec msmpeg4 -ab 192k -b 1000k -s 640x480 "C:\folder 2\amy_bought.avi"

Conversion from mpeg4 (in avi container) to concatenateable mpg format:
"C:\folder 1\ffmpeg.exe" -i "C:\folder 2\amy_bought.avi" -y -sameq "C:\folder 2\amy_bought.mpg"

Conversion from mpg to mpeg4 format (in avi container):
"C:\folder 1\ffmpeg.exe" -i "C:\folder 2\Visual alone.mpg" -y -an -vcodec msmpeg4 -ab 192k -b 1000k -s 640x480 "C:\folder 2\Visual alone.avi"

Combination of audio and video (in avi container):
"C:\folder 1\ffmpeg.exe" -i "C:\folder 2\Visual alone.avi" -i "C:\folder 3\Auditory alone.wav" -y -vcodec copy -an -vn -acodec copy "C:\folder 2\Auditory-visual.avi" -newvideo -newaudio

Appendix 6 – MS-DOS Command Syntax

Definitions:
cmd = call MS-DOS command
/b = indicates a binary file
/c = run command and then terminate
amy_bought.mpg = first word pair
bought_two.mpg = second word pair
two_big.mpg = third word pair
big_bikes.mpg = fourth word pair
Visual alone.mpg = concatenated sentence video file

Concatenation of mpg video format:
cmd /c copy /b "amy_bought.mpg"+"bought_two.mpg"+"two_big.mpg"+"big_bikes.mpg" "Visual alone.mpg"

References

Arnold, J. F., Frater, M. R., & Pickering, M. R. (2007). Digital Television: Technology and Standards.
New Jersey: John Wiley & Sons, Inc. Beattie, R. C. (1989). Word recognition functions for the CID W-22 test in multitalker noise for normally hearing and hearing-impaired subjects. The Journal of Speech and Hearing Disorders, 54(1), 20-32. Beattie, R. C., Barr, T., & Roup, C. (1997). Normal and hearing-impaired word recognition scores for monosyllabic words in quiet and noise. British Journal of Audiology, 31(3), 153-164. Boothroyd, A., & Nittrouer, S. (1988). Mathematical treatment of context effects in phoneme and word recognition. The Journal of the Acoustical Society of America, 84(1), 101-114. Buss, E., Hall, J. W., Grose, J. H., & Dev, M. B. (2001). A comparison of threshold estimation methods in children 6-11 years of age. The Journal of the Acoustical Society of America, 109(2), 727-731. Carhart, R. (1951). Basic principles of speech audiometry. Acta Otolaryngologica, 40(1-2), 62-71. Carhart, R., & Tillman, T. W. (1970). Interaction of competing speech signals with hearing losses. Archives of Otolaryngology, 91(3), 273-279. Carhart, R., Tillman, T. W., & Johnson, K. R. (1966). Binaural masking of speech by periodically modulated noise. Journal of the Acoustical Society of America, 39, 1037–1050. Cheng, Y., Liu, Q., Zhu, X., Zhao, C., & Li, S. (2011). Research on Digital Content Protection Technology for Video and Audio Based on FFmpeg. International Journal of Advancements in Computing Technology, 3(8), 917. Cox, R. M., Alexander, G. C., & Gilmore, C. (1987). Development of the Connected Speech Test (CST). Ear and Hearing, 8(5 Suppl), 119S-126S. Cropper, S. J., & Derrington, A. M. (1996). Rapid colour-specific detection of motion in human vision. Nature, 379(6560), 72-74. Dirks, D. D., & Bower, D. (1970). Effect of forward and backward masking on speech intelligibility. Journal of the Acoustic Society of America, 47, 1003-1008. Dirks, D. D., Morgan, D. E., & Dubno, J. R. (1982). A procedure for quantifying the effects of noise on speech recognition. The Journal of Speech and Hearing Disorders, 47(2), 114-123. Etymotic Research. (2005). Bamford-Kowal-Bench Speech-in-Noise Test (Version 1.03) [Audio CD]: Elk Grove Village, IL: Author. Finney, D. J. (1952). Statistical Methods in Biological Assay. London: Griffin. Gifford, R. H., Shallop, J. K., & Peterson, A. M. (2008). Speech recognition materials and ceiling effects: considerations for cochlear implant programs. Audiology and Neuro-Otology, 13(3), 193-205. Grant, K. W., Walden, B. E., & Seitz, P. F. (1998). Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence 105 recognition, and auditory-visual integration. The Journal of the Acoustical Society of America, 103(5 Pt 1), 2677-2690. Hagerman, B. (1982). Sentences for testing speech intelligibility in noise. Scandinavian Audiology, 11(2), 79-87. Hall, S. J. (2006). The Development of a New English Sentence in Noise Test and an English Number Recognition Test. MSc, University of Southampton. Hallgren, M., Larsby, B., & Arlinger, S. (2006). A Swedish version of the Hearing In Noise Test (HINT) for measurement of speech recognition. International Journal of Audiology, 45(4), 227-237. Hewitt, D. R. (2007). Evaluation Of An English Speech-In-Noise Audiometry Test. MSc, University of Southampton. Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., & Benson, R. W. (1952). Development of materials for speech audiometry. The Journal of Speech and Hearing Disorders, 17(3), 321-337. Hochmuth, S., Brand, T., Zokoll, M. A., Castro, F. 
Z., Wardenga, N., & Kollmeier, B. (2012). A Spanish matrix sentence test for assessing speech reception thresholds in noise. International Journal of Audiology, 51(7), 536-544. Hope, R. V. (2010). Towards the Development of the New Zealand Hearing in Noise Test (NZHINT). MAud, University of Canterbury, Christchurch. Howard-Jones, P. A., & Rosen, S. (1993). The perception of speech in fluctuating noise. Acustica, 78, 258-272. Killion, M. C., Niquette, P. A., Gudmundsen, G. I., Revit, L. J., & Banerjee, S. (2004). Development of a quick speech-in-noise test for measuring signalto-noise ratio loss in normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 116(4 Pt 1), 2395-2405. King, S. M. (2011). Development and Evaluation of a New Zealand Digit Triplet Test for Auditory Screening. MAud, University of Canterbury, Christchurch. Kollmeier, B., & Wesselkamp, M. (1997). Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment. The Journal of the Acoustical Society of America, 102(4), 2412-2421. Leek, M. R. (2001). Adaptive procedures in psychophysical research. Perception and Psychophysics, 63(8), 1279-1292. Levitt, H. (1971). Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 49(2), 467-477. Levitt, H. (1978). Adaptive testing in audiology. Scandinavian Audiology(6), 241291. Luts, H., Boon, E., Wable, J., & Wouters, J. (2008). FIST: a French sentence test for speech intelligibility in noise. International Journal of Audiology, 47(6), 373-374. Maclagan, M., & Hay, J. (2007). Getting fed up with our feet: Contrast maintenance and the New Zealand English front vowel shift. Language Variation and Change, 19(01), 1-25. 106 McArdle, R., & Hnath-Chislom, T. (2009). Speech Audiometry. In J. Katz (Ed.), Handbook of Clinical Audiology (6th ed., pp. 64-79). Baltimore: Lippincott Williams & Wilkins. McArdle, R. A., Wilson, R. H., & Burks, C. A. (2005). Speech recognition in multitalker babble using digits, words, and sentences. Journal of the American Academy of Audiology, 16(9), 726-739. Mendel, L. L. (2008). Current considerations in pediatric speech audiometry. International Journal of Audiology, 47(9), 546-553. Miller, G. A. (1947). The masking of speech. Psychological Bulletin, 44, 105-129. Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech. Journal of the Acoustic Society of America, 22, 167-173. Mitchell, A. G. (1946). The Pronunciation of English in Australia. Sydney: Angus & Robertson. Nilsson, M., Soli, S. D., & Sullivan, J. A. (1994). Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. The Journal of the Acoustical Society of America, 95(2), 1085-1099. Niquette, P., Arcaroli, J., Revit, L., Parkinson, A., Staller, S., Skinner, M. (2003). Development of the BKB-SIN Test. Paper presented at the annual meeting of the American Auditory Society, Scottsdale, AZ. O’Beirne, G. A., McGaffin, A. J., & Rickard, N. A. (2012). Development of an adaptive low-pass filtered speech test for the identification of auditory processing disorders. International Journal of Pediatric Otorhinolaryngology, 76(6), 777-782. Orchik, D. J., Krygier, K. M., & Cutts, B. P. (1979). A comparison of the NU-6 and W-22 speech discrimination tests for assessing sensorineural hearing loss. The Journal of Speech and Hearing Disorders, 44(4), 522-527. 
Ozimek, E., Kutzner, D., Sek, A., & Wicher, A. (2009). Polish sentence tests for measuring the intelligibility of speech in interfering noise. International Journal of Audiology, 48(7), 433-443. Ozimek, E., Warzybok, A., & Kutzner, D. (2010). Polish sentence matrix test for speech intelligibility measurement in noise. International Journal of Audiology, 49(6), 444-454. Plomp, R., & Mimpen, A. M. (1979). Improving the reliability of testing the speech reception threshold for sentences. Audiology, 18(1), 43-52. Pollack, I. (1954). Masking of speech by repeated bursts of noise. Journal of the Acoustical Society of America, 26, 1053–1055. Pollack, I. (1955). Masking by periodically interrupted noise. Journal of the Acoustical Society of America, 27, 353–355. Porter, T., & Duff, T. (1984). Compositing digital images. Computer Graphics, 18(3), 253-259. Rowland, J. P., Dirks, D. D., Dubno, J. R., & Bell, T. S. (1985). Comparison of speech recognition-in-noise and subjective communication assessment. Ear and Hearing, 6(6), 291-296. 107 Smits, C., Kapteyn, T. S., & Houtgast, T. (2004). Development and validation of an automatic speech-in-noise screening test by telephone. International Journal of Audiology, 43(1), 15-28. Sommers, M. S., Tye-Murray, N., & Spehar, B. (2005). Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults. Ear and Hearing, 26(3), 263-275. Strom, K. E. (2006). The HR 2006 dispenser survey. The Hearing Review, 13(6), 16–39. Stuart, A., & Phillips, D. P. (1996). Word recognition in continuous and interrupted broadband noise by young normal-hearing, older normalhearing, and presbycusic listeners. Ear and Hearing Research, 17, 478– 489. Stuart, A., & Phillips, D. P. (1998). Deficits in auditory temporal resolution revealed by a comparison of word recognition under interrupted and continuous noise masking. Seminars in Hearing, 19, 333–344. Sumby, W. H., & Pollack, I. (1954). Visual contributions to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212-215. Swosho. (2012). Head alignment clamp Retrieved 18 August, 2012, from http://www.thingiverse.com/thing:22852 Tye-Murray, N., Sommers, M., & Spehar, B. (2007). Auditory and visual lexical neighborhoods in audiovisual speech perception. Trends in Amplification, 11(4), 233-241. Tye-Murray, N., Sommers, M., Spehar, B., Myerson, J., Hale, S., & Rose, N. S. (2008). Auditory-visual discourse comprehension by older and young adults in favorable and unfavorable conditions. International Journal of Audiology, 47(s2), S31-S37. Versfeld, N. J., Daalder, L., Festen, J. M., & Houtgast, T. (2000). Method for the selection of sentence materials for efficient measurement of the speech reception threshold. The Journal of the Acoustical Society of America, 107(3), 1671-1684. Wagener, K., Brand, T., & Kollmeier, B. (1999b). [Development and evaluation of a German sentence test II: Optimization of the Oldenburg sentence test]. Zeitschrift für Audiologie, 38, 44-56. Wagener, K., Brand, T., & Kollmeier, B. (1999c). [Development and evaluation of a German sentence test III: Evaluation of the Oldenburg sentence test]. Zeitschrift für Audiologie, 38, 86-95. Wagener, K., Josvassen, J. L., & Ardenkjaer, R. (2003). Design, optimization and evaluation of a Danish sentence test in noise. International Journal of Audiology, 42(1), 10-17. Wagener, K., Kühnel, V., & Kollmeier, B. (1999a). 
[Development and evaluation of a German sentence test I: Design of the Oldenburg sentence test]. Zeitschrift für Audiologie, 38, 4-15. Wells, J. C. (1982). Accents of English (Vol. 1). Cambridge: Cambridge University Press. 108 Wilson, R. H., & Carhart, R. (1969). Influence of pulsed masking on the threshold for spondees. The Journal of the Acoustical Society of America, 46(4), 998-1010. Wilson, R. H., McArdle, R. A., & Smith, S. L. (2007). An Evaluation of the BKBSIN, HINT, QuickSIN, and WIN Materials on Listeners With Normal Hearing and Listeners With Hearing Loss. Journal of Speech, Language, and Hearing Research, 50(4), 844-856. Zokoll, M. A., Wagener, K. C., Brand, T., Buschermöhle, M., & Kollmeier, B. (2012). Internationally comparable screening tests for listening in noise in several European languages: The German digit triplet test as an optimization prototype. International Journal of Audiology, 51(9), 697707. 109