2018 is the year in which audio and voice interfaces are finally emerging and taking prominence as viable alternatives and complements to screen-based interactions. In this experiential session you will be inspired to consider voice interface design from a fresh and expanded perspective.
As we all become more accustomed to voice interactions, our expectations for more complex activities and more personalised interfaces will inevitably grow. Drawing on learnings from information-rich voice applications designed for, and with input from, blind users, Tim will share insights and his unique understanding of elegant and efficient voice experience design.
A solid grounding in speech and voice output is crucial for great voice application design. Tim will explore and demonstrate the power of the human voice and how it can be harnessed to create an increased sense of connection and inclusion with users.
Tim will highlight the fundamental differences between screen-based and voice-first application design. He will conclude the session with his top suggestions for creating intuitive, natural-sounding voice interfaces and applications that speak for themselves.
Drawing from two advanced voice application case studies which Tim headed up, and based on 25 years' experience in designing and implementing voice applications, this presentation provides broad insights and learnings that aren't well covered in the current voice literature.
When you are blind, listening is never optional.
One thing most blind people have in common is that they have had to become high functioning listeners, skilled in efficiently processing and retaining auditory information.
Blind users provided extensive input and feedback to both case studies covered here.
The overarching idea of this presentation is that we can learn from non-visual users' needs, preferences and strategies as we design and enhance modern voice experiences.
The other high-level theme recurring throughout this presentation is that voice (has the potential to be) so much more than a string of words to be automatically converted into sound.
At the moment, modern voice assistants are largely single-turn, call-and-response based. However, as people become more accustomed to voice interactions, their desires and expectations for more complex transactions, richer conversational sessions and expanded functionality will inevitably grow.
I consider this nascent field of voice assistants to be at only around the version 0.1 level, so we have immense opportunities for progress in the coming years.
A key challenge for advanced voice applications is the transformation, navigation and presentation of complex or voluminous data for the user. In differing ways, the voice services behind both of the following case studies devised new approaches and refined existing techniques to address this challenge.
This session is all about voice and sound. It's actually free of visual slides, but the recorded session contains various audio samples (sound slides).
So for the next 40 minutes I invite you to close your eyes and simply listen. In addition to giving your visual centres a rest, closing your eyes also helps to bring you into a more open and expansive listening state (listening position).
Just a little about me, so you know where I’m coming from:
I’ve worked in voice UX design since the early 90s.
I describe myself as a ‘Professional Listener’.
This presentation draws on most of my core interests:
Voice & Sound,
Listening & Speaking,
People, Technology & Design
Voice UI is often broader service design
Voice UI Design, or Voice Experience Design as I prefer to think of it, is a totally different paradigm from screen UX.
Voice UX is also different from systems that render screen-based information as sound – they are called screen-readers.
Whereas voice UIs optimise the sound output to maximise the naturalness, clarity and intuitiveness of the service, the job of screen-readers is to convey — through speech or braille — all relevant visual elements of the application and operating system.
A screen-reader doesn’t have knowledge of the content; it is a translator into a different modality.
Visual Interfaces are naturally spatial – our eyes do most of the scanning, navigating, focusing etc.
Sound interfaces – in contrast – work in the dimension of time. There is no cursor or moveable pointer. How we sequence the words and the messages, completely determines how the listener experiences the service.
Our main focus today is on the voice output side of voice user interfaces and how voices are perceived by users.
iVote had two interfaces: a telephone voting service (iVote by Phone) and a web-based voting interface.
To support users through a long and complex voting session, and to ensure no undue emphasis was given to any of hundreds of candidates, meticulous voice selection and direction was paramount to the design process for iVote by Phone. Other than for early prototyping, we consciously used no synthetic speech in this application.
We used two distinct voices (voice fonts), one (female) for the telephone service itself and another (male) for speaking all candidate names.
@TimNoonan #professionalistener thanks for sharing your important work today. Voice fonts! Who knew. Also great to finally understand why the voices on phone systems sound the way they do. Dry and monotone to not influence a voter for example. @UXAustralia
— dionw (@dionw) August 31, 2018
While visual designers obsess about typeface, font, colours and iconography in visual apps, there is an obvious gap in the literature and the collective consciousness about voice properties and their importance in good design.
Through various audio samples, I demonstrated some of the strategies we employed to enable users to independently navigate a complex ballot paper using a telephone keypad, including voting Below-The-Line.
In particular, we reflected the ballot paper layout in the telephone service: the keypad acted like a cursor cross for navigating to groups and candidates.
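As a purely hypothetical illustration (the actual iVote key assignments aren't documented here), a 'cursor cross' mapping might look something like the sketch below, with horizontal keys moving between groups and vertical keys moving between candidates:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch only: one way a telephone keypad can act as a
# 'cursor cross' over a ballot paper organised into groups (columns)
# and candidates (rows). The real iVote key assignments may differ.
my %key_action = (
    '4' => 'previous group',      # move left across the ballot paper
    '6' => 'next group',          # move right
    '2' => 'previous candidate',  # move up within the current group
    '8' => 'next candidate',      # move down
    '5' => 'select or unselect the current candidate',
);

# A DTMF handler would look up each key press and act on the ballot model.
for my $key (sort keys %key_action) {
    print "Key $key: $key_action{$key}\n";
}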
Because users would only need to use the service once, it was paramount that usability and discoverability were central to the design.
We also provided a practice service so users could try and learn the system as many times as they wished, ahead of casting their vote. User confusion or misunderstandings about how they were completing their ballot were obviously unacceptable.
“The fully automated iVote system used in NSW is superior to any other we have seen in an Australian election so far, and voters have clearly endorsed the system by using it in greater numbers than ever before. We will be working towards encouraging this system as the gold standard for future elections.” — Marianne Diamond, Vision Australia
Conceptual design and comprehensive scripting of wording for iVote by Phone;
Preparation of functional software design specifications;
Development of automation systems and processes for processing audio, automating text-to-speech and maximising audio quality;
Researching recording studio options and recommending a studio, Twenty5Eight, with whom we worked for the extensive and highly time-critical voice recording requirements of the iVote project;
Preparation of a Vocal Branding profile for each of the key ‘voices’ required for iVote which we used to cast voice talent;
Meticulous vocal direction of voice talent for the recording of hundreds of prompts and nearly 1000 candidate names;
Providing in-house voiceover and audio production services for iVote promos;
Consulting to the Commission on promotional strategy to maximise the uptake of iVote by people with disabilities;
Web Accessibility and usability services including design recommendations, access consulting, conducting observational usability studies and ensuring web accessibility compliance in conjunction with Scenario Seven Pty Ltd;
Designing and conducting observational usability design studies for iVote by Phone;
Publishing an updated Standard after completion of the iVote project, to document UX recommendations.
Download the updated version of the Australian Telephone Voting Standard (PDF)
Developed in-house from 1997 onwards, Today's News Now (TNN) was a sophisticated, information-rich text-to-speech voice application for browsing, reading and reviewing newspaper articles over the phone.
The voice UI approaches we formulated were based on standards as well as real-time input from dedicated blind and vision-impaired users of the service.
We used DTMF (touch-tone) input strategies for searching, navigating, skipping, reading and reviewing rich content from the service.
Even today, it would be difficult to create a reliable voice input approach for power users of TNN and the iVote system.
This is an area needing more work as we move into a ‘Voice First’ era.
Whereas iVote was centred around carefully scripted and directed human voices, Today's News Now was entirely automated and utilised the DECtalk speech synthesiser.
We devised Perl and regular-expression-based techniques to fully automate the transformation of print information into a spoken-word format that reflected spoken conventions, using pattern-matching to better render phone numbers, times, opening hours, the pronunciation of proper names and so on.
A key design challenge here was not to discard the original text, as users also wanted to review content to check the spelling of names and the like.
For example, although we wanted the system to pronounce 'Grand Prix' as 'Graun Pre', it was important that users could check how the event was actually spelt in the print edition of the newspaper.
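As a minimal sketch of the kind of pattern-matching involved (the production rule library was far more extensive, and the actual rules aren't reproduced here), a single substitution might expand a phone number into digit-by-digit words while keeping the original text untouched for spelling review:

#!/usr/bin/perl
use strict;
use warnings;

# Minimal illustrative sketch of one substitution rule; the production
# system applied a large library of such Perl regular expressions.
my %digit_word = (
    0 => 'zero', 1 => 'one', 2 => 'two',   3 => 'three', 4 => 'four',
    5 => 'five', 6 => 'six', 7 => 'seven', 8 => 'eight', 9 => 'nine',
);

# Expand an Australian-style phone number (e.g. "02 9999 1234") into
# digit-by-digit words, with commas so the synthesiser pauses between groups.
sub speak_phone_number {
    my ($digits) = @_;
    return join ', ',
           map { join ' ', map { $digit_word{$_} } split //, $_ }
           split /\s+/, $digits;
}

my $original = 'Call the box office on 02 9999 1234 before the Grand Prix.';
(my $spoken = $original) =~ s/\b(\d{2}(?:\s\d{4}){2})\b/speak_phone_number($1)/e;

print "Spoken form:   $spoken\n";
print "Original text: $original\n";   # retained so callers can review exact spelling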
Because there were few developer tools available for TTS and IVRs in the 90s, we developed a high-level scripting language called PhoneScript, optimised for the rapid prototyping and development of powerful telephone applications that could present a range of rich information sources to callers through synthetic speech.
The three applications we developed in the PhoneScript environment were:
JobPhone for presenting structured access to job vacancy advertisements from the mycareer.com.au website;
LibTel for browsing Royal Blind Society’s braille and talking book catalogue and allowing online ordering; and
Today's News Now, for structured access to the full text of Fairfax and News Ltd newspapers.
Key capabilities of the PhoneScript environment and the applications built on it included:
A development environment optimised for text-to-speech IVR services (most existing platforms are recorded-message based);
Automatic processing of text through extensive Perl regular expressions, so as to dramatically improve pronunciations and to convert written conventions into their spoken equivalents. Examples are Australian place names, names of politicians, British pronunciations, reading out complex currency values, intelligent reading of dates, times and date ranges, identification and clear rendition of telephone numbers, identification of compound words and more;
A sophisticated acronym-processing module able to identify (based on a large number of context rules) whether upper-case words should be spoken or spelt out; most speech synthesisers don't employ enough context information to do this job very well (a simplified sketch of this kind of rule appears after this list);
Full ‘review’ mode, allowing a caller to navigate a document by paragraph, sentence or word. Any word can be spelt out;
A set of menus for adjustment and personalisation of speech parameters including speed, volume, pitch and personality.
Three separate sets of voice parameters, one for menus/system messages, one for article/document reading and one for help messages. This provides increased navigation context and can increase comprehension of reading (listening);
Based around a very high-level scripting environment which hides the complexity of preparing text-to-speech buffers, telephony controls etc. This means that limited programming skills are required to tailor or fine-tune application user interface elements. Examples of some scripting commands are "hangup", "say", "spell", "saysubst", "title", "SayArticle" etc.;
The scripting language facilitates automatic compliance with the Australian and New Zealand standard, with respect to standard key assignments and timeouts, but these can be easily overridden as required;
An intuitive 'talking keypad' approach to alphabetic entry, which complies with Appendix B of the standard;
Centred around a database-driven approach to data access, allowing a clean separation of back-end processing of source information and front-end presentation of information to callers.
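To illustrate the flavour of the acronym context rules mentioned above (this is a simplified sketch with hypothetical word lists, not the production module), a decision between spelling out and speaking an upper-case word might combine exception lists with fallback heuristics:

#!/usr/bin/perl
use strict;
use warnings;

# Simplified sketch of acronym handling; the production module used far
# more context rules and larger exception lists. The word lists here are
# hypothetical examples only.
my %always_spell = map { $_ => 1 } qw(NSW ABC GST);        # initialisms, spell out
my %always_speak = map { $_ => 1 } qw(ANZAC QANTAS NASA);  # pronounceable acronyms

sub render_uppercase_word {
    my ($word) = @_;
    my $upper = uc $word;
    return join ' ', split //, $upper if $always_spell{$upper};
    return lc $upper                  if $always_speak{$upper};
    # Fallback heuristic: short or vowel-less words are usually initialisms,
    # so spell them letter by letter; otherwise let the synthesiser speak them.
    return join ' ', split //, $upper if length($upper) <= 3 || $upper !~ /[AEIOU]/;
    return lc $upper;
}

for my $w (qw(NSW ANZAC RBS UNICEF)) {
    printf "%-7s -> %s\n", $w, render_uppercase_word($w);
}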
Read the full Article on TNN, LibTel and JobPhone Features and Capabilities
Language understanding coupled with human-like expression is starting to blur the line between human speech and computer generated speech.
Siri voices and Google's work are the most obvious examples, but this is clearly the direction of future text-to-speech research and application.
For example, IBM's Watson recently took part in a debate, seeking to influence and persuade listeners of its case against a human speaker.
Recent research by Pablo Arias, a final-year PhD student in perception and cognitive science at the audio research lab IRCAM in Paris, has identified the main articulatory factors that are audible when a person does or does not smile. He has developed an algorithm that can desmile or ensmile any human utterance.
Pablo also discusses research finding that as we listen to a voice, our brain waves adjust in response to what we hear.
We are just starting out on the journey of truly intelligent assistants and the next 5-10 years will be very interesting indeed!
Today, the Voice Assistant race is mostly about features and functionality, but the personality, trustability and relatability of voice assistants and interfaces will be just as important in the longer term.
As technology continues to better understand voice, language and emotion, let's work together to ensure that future use of voice technology is always respectful of users and their emotions, and that it serves to constructively assist and support all walks of humanity.
Voice technology is starting to blend with humanity – it’s harder to tell if we’re listening to a human or a machine. Our challenge is we need to be respectful to users when building these services and act ethically. @TimNoonan #UXA18
— Allison Ravenhall (@RavenAlly) August 31, 2018
Today’s session draws on my own experiences and those of blind beta testers and users. I hope our insights, learnings and experiences can inform and improve voice services now and into the future.
The main focus of my presentation was the two case studies I’ve described here. I mentioned some of the following concepts in passing, but they are listed here as additional background information.
'Voice First' is the Amazon catch-cry, but anything beyond the low-hanging fruit is shunted to a screen-based app. Accessibility and usability of the Amazon Alexa app or the Google Home app for installation or configuration is arduous and in no way mirrors the voice simplicity of the device itself.
Now that Siri has launched on the HomePod, which has no screen, Apple's multi-modal, touch-first model starts to show its biases. Calendars weren't available at launch because they had been designed on the assumption that calendar details were being presented on screen, in addition to minimal voice feedback.
The take-away is that even if you are multi-modal, there are many situations where users will be constrained to audio only, so this needs to be factored into designs.
Note that depending on the platform, these are usually issues beyond the control of a voice application designer, but longer-term these biases and factors need to be considered and addressed.
Is your choice of output voice aligned with your user base? Gender, ethnicity, accent, age, and personality? Do they feel included or separate?
Is your app’s purpose and audience suited to its voice? As a hypothetical, Is a female voice (on all the leading assistants) going to work for a voice-oriented gay male dating app akin to Grindr?
Is your speech recognition engine able to handle speech impediments, stutters, nervous speech, shaky and broken speech? The Mozilla Speech Recognition Corpus Project may be an opportunity to include folks with different speech profiles, speech impediments etc.
Are your timeouts sufficient for people who speak slowly or take more time to formulate their requests/responses?
Does your service understand and respond to colloquial, informal terms and phrases from your users? This also has a bearing on how comfortable and accepted your users feel.
Though not immediately apparent to everyone, eye-hand coordination should not be a design requirement for voice assistants, as they principally work in the auditory (non visual) domain.
Google Home Mini and HomePod both employ touch-sensitive controls for volume adjustment, pause/play etc.
Alexa devices, in contrast, have physical buttons with nominally tactually differentiated surfaces, so they can be operated by feel, in the dark. Though better design overall, physical buttons could be more problematic for people with physical disabilities to operate.
Blind from birth, Tim Noonan is a voice experience designer, inclusive design consultant and an expert in voice & spoken communication.
Building on his formal background in cognitive psychology, linguistics and education, Tim has been designing and crafting advanced voice interfaces since the early 90s and was one of the principal authors of the Australian and New Zealand standard on interactive voice response systems, AS/NZS 4263.
Tim is the principal author of several other standards relating to automated voice systems, including automated telephone-based voting, telephone-based speech recognition and four industry standards on the accessibility of electronic banking channels and inclusive authentication solutions.
Tim has also been a pioneer in the accessibility field for more than three decades. He particularly loves working with emerging and future technologies to come up with innovative ways to make them more inclusive, effective and comfortable to use.
A career highlight for Tim was working as the lead Inclusive User Experience designer for iVote – a fully automated telephone-based and web-based voting system for the NSW Electoral Commission. iVote received Vision Australia's Making A Difference Award and was recommended as the 'Gold Standard' for accessible voting.
For the last 25 years Tim has been leading the way in teaching, conceptualising and designing technologies that communicate with users through voice and sound – both for accessibility and mainstream users.