Speech-Interface Prompt Design: Lessons From the Field

Jerome White
New York University
Abu Dhabi, UAE
[email protected]

Mayuri Duggirala
Tata Consultancy Services
Pune, India
[email protected]
ABSTRACT
Designers of IVR systems often shy away from using speech
prompts, preferring, where they can, to use keypad input.
Part of the reason is that speech processing is expensive
and often error prone. This work attempts to address this
problem by offering guidelines for prompt design based on
field experiments. It is shown, specifically, that accuracy can
be influenced by prompt examples, depending on the nature
of the information requested.
Categories and Subject Descriptors
[Human-centred computing]: Empirical studies in interaction design
General Terms
Human Factors
Keywords
Spoken Web, interactive voice response, prompt design
1. INTRODUCTION
Voice-based systems, and in particular those based on interactive voice response (IVR), have played an influential
role in technology geared for human development. Within
this setting, voice-based systems have been broadly used
for information dissemination [11], community building [8],
real-time monitoring [4], and data collection [9]. They have
been applied, specifically, for purposes as diverse as health
care [10], journalism [7], agriculture [8], and education [6].
In a recent international conference focusing on ICTD [12],
in fact, 20 per cent of accepted work centred around some
application of the voice modality.
Mobile interfaces driven by speech are generally restricted
by inaccuracy: recognition errors hinder overall task performance, which in turn degrades users' task
completion and experience. Although such errors are inherent to
speech, quality can be improved by using professional voice
recognition software. The expense of such software, however,
makes the solution unrealistic for small organisations operating on limited resources. For organisations that can afford
such solutions, local languages within developing regions are
often unsupported. Thus, for speech to be a realistic option
within the IVR space, and in turn IVR to be a more viable option within the development community, affordable
improvements to recognition technology are paramount.
One such approach is prompt design that encourages recognisable input: providing enough instructional dialogue that a user's utterance is likely to fall within the
recognised vocabulary. Studies of such approaches exist [2, 3, 5], but none has examined them in a large-scale, live
deployment within a developing region. This work fills that
gap. Specifically, it evaluates differences in spoken user input as a function of a prompt’s instructional dialogue. The
results suggest that, depending on the nature of the requested information, a prompt can influence user input, and
in doing so ultimately improve accuracy.
2. BACKGROUND

2.1 The Employment Service
This work is based on interaction data from a voice-based
employment platform. The service allowed candidates to input their resume information, employers to input their job details, and the two parties to be matched
where appropriate. Candidates were able to apply for jobs
of interest, and employers were able to obtain contact information for candidates they deemed appropriate. The entire
interaction—leaving information, editing information, and
searching for matching parties—took place over the phone.
Users were taken through a series of keypad prompts, known as dual-tone multi-frequency (DTMF), and speech prompts to acquire their information and navigate through the system. Interaction took place in either English or Kannada,
the local language of the deployment region. Language was
chosen at the start of the call, and interaction took place
in that language for the remainder of the call. The application was built using the Spoken Web platform, a spoken
dialogue system that provides Internet-like features on top
of IVR [1]. Although the service was designed for use on
phones in general, it was intended for use on low-end mobile
phones in particular.
2.2 System Specifics and Definitions
When a candidate called for the first time, they were required to register with the system. The registration process consisted of a series of question-answer dialogues that resulted in the creation of a user resume. For purposes of this paper, an entire question-answer dialogue is referred to as a section, where the "question" portion of a section is known as a prompt.

Table 1: Prompt transcription.

Location
  Original:         "Speak the name of the district where you live. For example, you may say Mysore, Mandya, Bijapur, Dharwad, et cetera."
  Altered examples: "Speak the name of the district where you live. For example, you may say Kolar, Hassan, Gulbarga, Belgaum, et cetera."
  No examples:      "Speak the name of the district where you live."

Skill
  Original:         "Now, using a single phrase, inform your skills. For example you may say data entry operator, DTP, plumber, electrician, welder, et cetera."
  Altered examples: "Now, using a single phrase, inform your skills. For example you may say secretary, waiter, driver, mechanic, technician, et cetera."
  No examples:      "Now, using a single phrase, inform your skills."

Qualification
  Original:         "Speak your highest educational qualification. For example you may say ITI, diploma, Below SSLC, PUC, BA, et cetera."
  Altered examples: "Speak your highest educational qualification. For example you may say B-com, SSLC, 12th standard, B-tech, PG-diploma, et cetera."
  No examples:      "Speak your highest educational qualification."
The entire registration process consisted of seven sections:
two sections—a user’s age and work experience—required
DTMF input, while the remaining five sections required spoken input. Of the five spoken sections, two, name and free speech (during which the user was instructed to speak freely about themselves for thirty seconds), were saved for presentation to job providers; no
speech recognition was attempted in these cases. For the remaining three sections—qualification, skill, and location—
an attempt was made to recognise what the user said. In
an ideal case, the speech was recognised, converted to text,
and stored in a database. For cases in which the speech was
not recognised, a null value was stored, and the user’s input was recorded and saved for offline analysis. A user was
given two attempts to speak an utterance that the system
recognised; the offline recording was made from the third
and final attempt.
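This per-section flow can be summarised in code. The sketch below is a minimal illustration of the two-attempt behaviour described above; all names (Section, play_prompt, recognise_speech) are hypothetical stand-ins, not the Spoken Web platform's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Section:
    name: str
    prompt: str
    vocabulary: set  # phrases the recogniser can convert to text

# Hypothetical stand-ins for platform services.
def play_prompt(prompt: str) -> None:
    print(f"[IVR] {prompt}")

def recognise_speech(vocabulary: set) -> Optional[str]:
    utterance = input("caller> ").strip().lower()
    return utterance if utterance in vocabulary else None

def capture_section(section: Section, resume: dict) -> Optional[str]:
    """Two recognition attempts; on the third, store null and keep audio."""
    for _ in range(2):
        play_prompt(section.prompt)
        text = recognise_speech(section.vocabulary)
        if text is not None:
            resume[section.name] = text  # recognised: stored as text
            return text
    play_prompt(section.prompt)          # third and final attempt
    resume[section.name] = None          # unrecognised: null value stored
    # ...the raw audio of this attempt would be saved for offline analysis
    return None
```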
2.3 User Demographics
The system was deployed throughout the state of Karnataka for a total of 11 months. In that time, candidate
registration was attempted 29,152 times; 8,625 of those attempts completed all sections of the registration process.
Most callers were in their early twenties with one to two
years of work experience. Approximately 21 per cent specified that they were from Bangalore, the largest city in Karnataka; over 20 other districts across the state were each represented by more than 1 per cent of callers. The educational split was more uniform, with
most callers possessing at least 10 years of education.
3. SETUP
As previously mentioned, three sections of the registration process attempted to recognise what the user said and then convert that value to text. The prompt for these three
sections instructed the caller not only on the purpose of the
section, but on appropriate words that they should speak.
Specifically, the user was presented with the section's purpose, along with examples of valid input; see Table 1 for details. The basis for the examples was twofold: first, to give
the user an idea of the type of input that was expected. That
is, both to reinforce what was meant by “skill” or “qualification,” for example, and to convey that they should condense
their answer to one or two words. The second reason for
examples was to increase the chances that what a user said
was in the speech recogniser's vocabulary. Thus, examples
were provided that were expected to be recognised.
The examples for the original prompts were selected
based on field studies [13] and discussions with government
employment specialists; thus, they were chosen for their relatability to the target audience and designed in conjunction
with the speech recognition dictionaries. The examples for
the experimental prompts were chosen based on what was
already in the recogniser's vocabulary. They were chosen at random, subject only to the conditions that the system was
capable of recognising them if spoken and that they were not
already present in the original prompt. Equivalent prompts
were run for both Kannada and English.
The original prompts ran in the system, exclusively, for 10
weeks. From this, a reference distribution of user input was
established. After this period, users received one of the three
prompts—original, altered examples, or no examples—from
a random uniform distribution. Original prompt statistics
are reported from the start of the experimental time period,
not from system inception.
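The assignment scheme just described amounts to a uniform draw over the three conditions for each caller. A minimal sketch follows; the condition names mirror Table 1, and the function is illustrative rather than the platform's own code.

```python
import random

CONDITIONS = ("original", "altered_examples", "no_examples")

def assign_condition(rng=random) -> str:
    """Assign a caller to one of the three prompt conditions uniformly."""
    return rng.choice(CONDITIONS)

# Sanity check: over many callers, each condition receives roughly a third.
counts = {c: 0 for c in CONDITIONS}
for _ in range(9000):
    counts[assign_condition()] += 1
print(counts)  # e.g. {'original': 2987, 'altered_examples': 3021, ...}
```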
4. RESULTS
Results from the experimental study are outlined in Table 2. Three aspects of the data are of particular interest:
the amount of time required to go through each section,
the accuracy of speech recognition, and the distribution of
recognised answers. Reported times are exclusively user interaction time—the amount of time spent listening to the
prompt itself is removed.
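Read literally, the reported figure is the section's wall-clock duration with the prompt playback subtracted; a trivial sketch of that accounting, with made-up numbers:

```python
def interaction_time(section_duration_s: float, prompt_duration_s: float) -> float:
    """User interaction time: total section time minus prompt playback."""
    return section_duration_s - prompt_duration_s

print(interaction_time(40.0, 10.0))  # 30.0 seconds of actual user time
```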
Table 2: Prompt effects on user input. N is the number of samples taken for a given prompt. Time refers to the amount of time users spent in a given prompt; stars (*) denote values that are significantly different from "original" (two-sided paired t-test, p < 0.05). Accuracy refers to the amount of user input that was successfully recognised. Ex. Dist. ("example distribution") is the cumulative distribution of the examples mentioned in the prompts. Top-five inputs are the most common system-recognised inputs given by users, presented in descending order; daggers (†) denote input values that were also used as examples in the prompt.

Section        Prompt            N     Avg (s)  SD (s)  Accuracy  Ex. Dist.  Top-Five Inputs
Location       Original          1597  33.3     22.1    73.6%     0.179      Bangalore, Bijapur†, Bellary, Dharwad†, Bagalkot
               Altered Examples  1550  32.4*    19.8    75.8%     0.112      Bangalore, Bijapur, Bellary, Belgaum†, Dharwad
               No Examples       1506  29.9*    24.0    69.7%     n/a        Bangalore, Bijapur, Bellary, Bagalkot, Hassan
Skill          Original          1915  51.7     29.6    38.3%     0.155      Data entry operator†, Electrician†, DTP†, Attendant, Fitter
               Altered Examples  1953  45.1     33.6    50.5%     0.281      Technician†, Secretary†, Teacher, Driver†, Mechanic†
               No Examples       1938  51.0*    30.9    18.7%     n/a        Computer Science, Electrician, Cashier, Finance
Qualification  Original          2140  43.8     24.7    64.2%     0.331      PUC†, BA†, ITI†, BEd, BCom†
               Altered Examples  2101  45.9*    26.7    62.3%     0.128      BA, ITI, BCom†, BEd, 12th standard†
               No Examples       2174  38.0*    26.3    58.8%     n/a        PUC, BA, ITI, BCom, BEd
4.1 Input Bias
Example distribution (Table 2, "Ex. Dist.") and top-five input are telling indicators of whether users are merely repeating the examples they hear. Example distribution is the
fraction of users exposed to a given prompt who responded
with a value that was also present in the examples. Based
on values observed across all prompts, it is unlikely that the
results are dominated by repeats. If this were the case, observed values would have been closer to one—in no section,
in fact, was the cumulative distribution of the prompt examples greater than 35 per cent. Further, in no prompt did the
top-five values observed coincide completely with the values presented in the prompt examples. These findings suggest that the values extracted during this experiment were genuinely representative of the population.
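As a concrete reading of these two indicators, example distribution is the fraction of recognised responses that repeat one of the prompt's examples, and the top-five list is the most frequent recognised values. The sketch below uses made-up response data, not the study's records:

```python
from collections import Counter

def example_distribution(responses, examples):
    """Fraction of recognised responses that repeat a prompt example."""
    recognised = [r for r in responses if r is not None]
    if not recognised:
        return 0.0
    return sum(r in examples for r in recognised) / len(recognised)

def top_inputs(responses, k=5):
    """The k most common recognised inputs, in descending order."""
    counts = Counter(r for r in responses if r is not None)
    return [value for value, _ in counts.most_common(k)]

# Made-up log: mostly non-example values, as observed in the study.
log = ["bangalore", "bangalore", "bijapur", "bellary", None, "hassan"]
print(example_distribution(log, {"mysore", "mandya", "bijapur", "dharwad"}))
# 0.2: only one of the five recognised responses repeats an example
```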
4.2 Accuracy
In all cases, irrespective of section, the accuracy of the
speech recogniser was lowest when no examples were presented to the user. The gain in accuracy, however, was not
constant across sections: the size of the gain observed when examples were presented depended on the section, and ultimately on the nature of the information extracted from the
user. In the case of location and qualification, increases of 4 and 2.8 percentage points, respectively, were observed in the worst case, where "worst case" is the difference between the no-example accuracy and the lower of the original and altered-example accuracies. In the case of skill, however, accuracy more than doubled with the introduction of examples. Yet the best-case accuracy of any skill example section was almost 10 percentage points lower than the no-example accuracies of location and qualification, which happened to be the worst-case observed accuracies in those sections.
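For concreteness, the "worst case" comparison used above can be computed directly from the rounded accuracies in Table 2; a sketch (the rounded inputs may differ slightly from the unrounded figures quoted in the text):

```python
# Accuracies from Table 2, as fractions.
ACCURACY = {
    "location":      {"original": 0.736, "altered": 0.758, "none": 0.697},
    "skill":         {"original": 0.383, "altered": 0.505, "none": 0.187},
    "qualification": {"original": 0.642, "altered": 0.623, "none": 0.588},
}

def worst_case_gain(section: str) -> float:
    """Lower of the two example accuracies minus the no-example accuracy."""
    acc = ACCURACY[section]
    return min(acc["original"], acc["altered"]) - acc["none"]

for section in ACCURACY:
    print(f"{section}: {100 * worst_case_gain(section):+.1f} percentage points")
```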
It is likely that skill accuracy was poor, and that qualification was slightly worse than location, because of the nature of the information. When asked to describe their skill,
many people will either list several concise topics, or speak
freely, in a conversation-like manner, about a single ability. Moreover, what constitutes a skill, and how to describe
it, often vary across a populace: the same expertise
may be described several different ways depending on who
is speaking. These factors combine to make skill a difficult concept
to classify, and in turn to recognise automatically.
Within the location section, regardless of whether examples were presented, the user was asked to speak the name
of their "district," an administrative division in India that
is well-defined and widely understood. There are approximately 700 districts within the country, 30 of which are
contained in Karnataka. This not only simplifies the task of the speech recognition engine, it also reduces the
values considered by the caller to a set that is finite and universal. While there may be several ways to describe skill,
and slight variations in degrees offered across syllabi, districts are relatively well known and stable.
Qualification accuracy, overall, was better than the accuracies observed for skill, but worse than those observed for
location. Qualification benefits from many of the same advantages that were seen in location: a finite, well-established
set of values, which was further reduced given that the platform had a target audience. However, qualification—like
skill, but to a lesser extent—is plagued by a lack of standardised structure, along with various methods of expression.
India has various boards that govern primary, secondary,
and trade-specific tertiary education. As such, equivalent
degrees may go by different names depending on the board
governing a particular student’s education. Complicating
matters further, some people may express their most recent educational instance as a combination of that instance and its result: as opposed to "PUC," for example, a person may use the more accurate description "PUC-fail."
They may also combine their last successful degree and the
degree that they are currently pursuing, making the speech
interaction more conversational and harder to automatically
discern.
4.3 Time
With the exception of skill, the amount of time required to
move through a section was smallest when no examples were
presented. The difference was slight, about five seconds in the worst case, but observable. This increase in section time,
however, generally led to accuracy improvements. Skill was
an exception: the duration of the no-example prompt was
longer, and the accuracy was much lower, when compared
to the original and altered example prompts.
5. CONCLUSION
What examples to use within a prompt, and even whether
to use examples at all, depends on the nature of the data
and the acceptable trade-off between time and accuracy. In
all cases, there was an improvement in accuracy when examples were introduced. There was also—with the exception
of skill—an increase in time spent to move through the section. However, whether the increased time was worth the
improved accuracy was not always clear. This is most notable in the case of location: examples, in the best case,
improved accuracy by over 6 percentage points. Inclusion
of these examples, however, cost the caller, and the system,
an additional 2.2 seconds of connection time, on average.
Moreover, irrespective of which examples were used, or if
examples were used at all, there was little difference between
the observed top-five inputs. Again, this suggests that location is relatively standard in the minds of callers, and that
creating prompts that optimise time would not significantly
hinder the quality of the data collection.
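One illustrative way to weigh this time-versus-accuracy trade-off, an analysis the paper does not itself perform, is to fold the two into a single figure: expected interaction time per successfully recognised answer. A sketch using the location averages from Table 2:

```python
# Location figures from Table 2: (average interaction time in seconds, accuracy).
LOCATION = {
    "original":    (33.3, 0.736),
    "altered":     (32.4, 0.758),
    "no_examples": (29.9, 0.697),
}

# Expected seconds of interaction per successfully recognised answer.
# Illustrative only: it ignores retry structure and the downstream cost
# of storing a null value.
for name, (avg_time, accuracy) in LOCATION.items():
    print(f"{name}: {avg_time / accuracy:.1f} s per recognised answer")
```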
Qualification benefits from examples, but the time required for those examples could likely be reduced; presenting fewer examples than our system did would be one way of achieving this. Again, from the
overlap in top-five inputs across example schemas, what a
qualification is seems clear to our user group; examples likely
serve to educate them on what format—standard one-word
abbreviations—is expected.
The overwhelming conclusion for skill is that examples
are not just helpful, but necessary. As noted in Table 2,
the discrepancy between the best case and worst case skill
prompts, altered examples and no examples, respectively,
was the largest across all prompts and sections in which
the experiment was conducted. Further, that users spent as
much or more time in the no-example case as they did in the
original and altered example cases suggests that either their
speech was too long or that it required numerous retries to move through the prompt (and even then, "success," in the sense of the system recognising their input, was not guaranteed). Particularly interesting was that in the case where examples were chosen at random ("altered examples"), the accuracy was higher
than in the case where the examples were chosen based on an
understanding of the potential user base (“original”). This
suggests that for relatively open-ended data collection, as is the nature of skill, systems and researchers should continually observe and adapt the prompt to maximise success.
6. ACKNOWLEDGEMENTS
The authors would like to thank the Karnataka Vocational
Training and Skill Development Corporation for their assistance in making this work possible; Vaijayanthi Desai and
Kundan Shrivastava for their infrastructural assistance; and
Brian DeRenzi, whose meaningful discussions played a key
role in the formation of this work. Appreciation is also extended to the reviewers for their insightful comments and
suggestions.
References
[1] S. Agarwal et al. The spoken web: a web for the underprivileged. SIGWEB Newsletter, (Summer):1:1–1:9,
June 2010.
[2] C. Baber et al. Factors affecting users’ choice of words
in speech-based interaction with public technology. Int.
J. Speech Tech., 2(1):45–59, 1997.
[3] K. Baker et al. Constraining user response via multimodal dialog interface. Int. J. Speech Tech., 7(4):251–
258, 2004.
[4] W. Curioso et al. Design and implementation of CellPREVEN: a real-time surveillance system for adverse
events using cell phones in Peru. In AMIA Annual Symposium, 2005.
[5] L. Karsenty. Shifting the design philosophy of spoken
natural language dialogue: From invisible to transparent systems. Int. J. Speech Tech., 5(2):147–157, 2002.
[6] M. Larson et al. I want to be Sachin Tendulkar!: A spoken English cricket game for rural students. In CSCW,
2013.
[7] P. Mudliar et al. Emergent practices around CGNet
Swara, voice forum for citizen journalism in rural India.
In ICTD, 2012.
[8] N. Patel et al. Avaaj Otalo: a field study of an interactive voice forum for small farmers in rural India. In
CHI, 2010.
[9] S. Patnaik et al. Evaluating the accuracy of data collection on mobile phones: A study of forms, SMS, and
voice. In ICTD, 2009.
[10] J. Sherwani et al. Healthline: Speech-based access to
health information by low-literate users. In ICTD, 2007.
[11] J. Sherwani et al. Speech vs. touch-tone: Telephony
interfaces for information access by low literate users.
In ICTD, 2009.
[12] B. Thies and A. Nanavati, editors. Proceedings of the
3rd ACM DEV, 2013.
[13] J. White et al. Designing a voice-based employment
exchange for rural India. In ICTD, 2012.