Amazon’s Echo Into the Void

11 Dec 2022

From Business Insider:

The vast majority of Worldwide Digital's losses were tied to Amazon's Alexa and other devices, a person familiar with the division told Insider. The loss was by far the largest among all of Amazon's business units and slightly more than double the losses from its still nascent physical stores and grocery business.

I recall back in the summer of 2017 meeting with some senior marketing executives who worked for a multinational fashion and beauty company. We were there to talk about AI. Voice was central to the discussion. To most people in the room it seemed inevitable that, within a few years, voice would be an important “touch point” for consumers wanting to interact with their brands. I felt slightly more cautious, though it was easy to get wrapped up in the hype. I always imagine the simple task of ordering a meal in a restaurant. It’s far easier to peruse a menu with your eyes than to have the waiter read out everything available while you try to remember it all and make a decision in a reasonable amount of time. In general I subscribe to the view that AI is an accelerator for human-like skills and interactions. It can speed up and automate tasks that humans do, but if those tasks don’t already make for a great experience, then AI by itself won’t make them better, unless speed and accuracy are the cause of the poor experience.

Alexa suffers from this “restaurant problem”. While modern Natural Language Understanding capabilities are very good, they haven’t progressed at the rate it seemed they would back in 2017. This makes Alexa great for simple commands like setting timers and playing music, but useless for anything more substantial. A common misconception about systems such as Alexa and Apple’s Siri is that they generate their answers using AI. They don’t. Generative AI systems do (see GPT-3 and ChatGPT), but they cannot be trusted to provide accurate answers, and because they are trained by crawling the internet, they are unable to generate answers that require knowledge of recent or future events. ChatGPT won’t be able to tell you the weather tomorrow, and it won’t be able to tell you what time your local supermarket opens.

Instead, systems like Alexa and Siri use a form of text classification. After the sound waves from your voice are converted into symbols (letters and numbers), and those symbols are converted into words, the system takes the sentence you uttered and classifies it into one or more intents. The intent that scores the highest probability from the machine learning model is the one Alexa presumes was your actual intent. That’s why when I recently asked Alexa “At what temperature should I hang washing outside?” it thought I was asking for a weather forecast. Someone at Amazon has to have created that intent and fed the ML model with example utterances for it to be able to detect it. These systems cannot understand intents they haven’t been trained on. Once the assistant knows your intent, the next task is to extract any parameters from your utterance. Examples would be the date and location in the phrase “_Will it rain in Newport next week?_”.

Once your voice assistant knows your intent and any parameters, it performs some kind of logic based on that intent. This is where the AI and machine learning typically stops. If the intent was asking for the weather, the next step would be to query a weather API. If it was to send a message to someone, it would be to start whichever process your device uses to send messages. Of course the weather API itself may use AI or machine learning to predict the weather, but that is totally separate and no different to a weather presenter telling you the same forecast on the TV. This approach is extraordinarily useful for many things: most chatbots and voice assistants work like this. For people who can’t see, or who find it difficult to use a touchscreen or mouse, they provide invaluable ways to interact with computing devices.
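
To make that pipeline concrete, here is a minimal sketch in Python of the classify-then-dispatch approach described above. It is not Alexa’s or Siri’s actual code: the intents, training utterances, regular expressions and handler names are all invented for illustration, and scikit-learn stands in for whatever production-grade classifier a real assistant would use.

```python
# Illustrative sketch only: invented intents, utterances and handlers.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# 1. Someone has to define the intents and feed the model example utterances.
TRAINING_DATA = [
    ("what's the weather like tomorrow", "GetWeather"),
    ("will it rain in newport next week", "GetWeather"),
    ("is it going to be sunny on saturday", "GetWeather"),
    ("set a timer for ten minutes", "SetTimer"),
    ("start a five minute timer", "SetTimer"),
    ("play some jazz", "PlayMusic"),
    ("put on my workout playlist", "PlayMusic"),
]
texts, labels = zip(*TRAINING_DATA)

# 2. A text classifier scores the utterance against every known intent.
intent_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
intent_model.fit(texts, labels)


def classify(utterance):
    """Return the highest-scoring intent and its probability."""
    probs = intent_model.predict_proba([utterance])[0]
    best = probs.argmax()
    return intent_model.named_steps["clf"].classes_[best], probs[best]


# 3. Parameter ("slot") extraction -- here just crude pattern matching.
def extract_slots(utterance):
    slots = {}
    date = re.search(r"\b(today|tomorrow|next week|on \w+day)\b", utterance)
    place = re.search(r"\bin ([A-Z][a-z]+)\b", utterance)
    if date:
        slots["date"] = date.group(1)
    if place:
        slots["location"] = place.group(1)
    return slots


# 4. Once the intent and slots are known, the machine learning stops and
#    plain code takes over, e.g. calling a weather API. This handler is a stub.
def handle_get_weather(slots):
    return (f"Querying weather API for {slots.get('location', 'your location')} "
            f"({slots.get('date', 'today')})...")


HANDLERS = {"GetWeather": handle_get_weather}

utterance = "Will it rain in Newport next week?"
intent, score = classify(utterance.lower())
slots = extract_slots(utterance)
print(intent, round(float(score), 2), slots)
if intent in HANDLERS:
    print(HANDLERS[intent](slots))
```

Run against “Will it rain in Newport next week?”, the sketch picks the GetWeather intent, pulls out “Newport” and “next week”, and hands them to a stubbed handler. Everything after the classification step is ordinary code, not AI.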

I use Siri all the time to set reminders, timers and to control my lights. What Alexa and Siri are not so good at is deep and meaningful conversation. This is where Amazon’s hope that Alexa might one day be a shopping destination seems to fall short. When you have a device centred around a conversational user experience, it will hit a wall due to current technical limitations and the fact that, for many people, speaking is less efficient than using a smartphone when they need to both receive and provide information to complete a task. The fact that Amazon seemingly has no way to monetise Alexa means the experience has been gradually getting worse. Now when I ask it for the weather, it responds with the forecast - great - but then immediately starts telling me I can order groceries from it as well. Ads like this are infuriating and a sign of desperation from Amazon.

So were we foolish to think the future of human-computer interaction would be voice? No. In the long term, when devices are advanced enough to provide human-level, meaningful conversation, there is no doubt in my mind that voice will be one of the primary user interfaces we use, for some tasks at least. When I can ask Alexa to order the precise groceries I want, with the confidence that it will work and that the device is capable of asking me to confirm anything it’s unsure about, then maybe I can see it working. But I still can’t help thinking that humans like to see as well as hear things, especially when it comes to making choices. Voice is great for issuing commands and receiving quick updates, but if your voice assistant starts talking for more than about 20 seconds, it’s usually quicker to glance down at a screen and see a text or graphical representation.

I think the future is bright for voice assistants like Siri because they complement alternative user interfaces and are part of a deep ecosystem, and so can integrate with health, home automation, contacts and other information users have provided. Voice-based AI is also making large strides in call centres. Unless Amazon changes tack, however, the Amazon Echo, with its limited ecosystem, will remain a glorified clock radio for a while longer.