I want to tell a story about how a frustrating 15-minute struggle to procure a burrito tells me something about the design of so-called “digital assistants”. This happened in late 2014, and I intended to write about it back then, but somehow it fell by the wayside until Dag Kittlaus unveiled Viv at the TechCrunch Disrupt NY conference.
If you have no idea who Dag Kittlaus is, he was one of the co-creators of Siri, which we all know as the digital assistant software inside iPhones. Apple is far from alone in this space: Google has its Google Now platform as part of Android and, soon, other physical devices. Even Amazon is getting in on the act, with its Echo speaker. No doubt they imagine a future where we’re standing in our kitchen, and have emptied our box of corn flakes into a bowl, and are able to say right then and there “Amazon, add a box of corn flakes to my weekly delivery on Friday.”
That’s all well and good, but for anyone who has actually used Siri, its limitations are laid bare pretty quickly. From the outside, I would wager that the hardest part about developing Siri was decoding naturally spoken language, first and foremost, enough to glean the user’s intent. Once they managed that, Siri appears to be essentially a database lookup service with some functionality tacked on (for example integration with an API to make restaurant reservations). So you can say: “What’s the score in the Blue Jays game?” and it talks to the MLB or whomever and fires the result back to you. This kind of thing is helpful, particularly in an environment where being hands-free is important, like driving a car, or while holding a baby in one arm and a package of diapers in the other.
Unfortunately, our needs can’t always be expressed in one sentence like that, because sometimes elements in the world create a context that limits or adds additional complexity to the actions we can perform. It would be helpful if digital assistants understood some of these real-world limitations and automatically worked within them.
Which brings me back to what turned into the seemingly Sisyphean task of getting my grubby paws on a burrito.
Above is a map of central Massachusetts, in the USA, just to the northeast of Worcester. After handling some business south of Worcester, I had driven north through Worcester, east across 290, and had just merged onto 495 north, about where the town of Hudson appears in this map.
You’ll notice there are some Chipotle Mexican Grills highlighted. That is not a coincidence; we are on a burrito quest, after all. Since I was driving, and didn’t relish the idea of my head smashing violently into the windshield as my car careened into a tree after I lost control because of staring at my phone instead of the road, I opted to ask Siri “where’s the nearest Chipotle?”. She dutifully responded “I found a Chipotle nearby” and showed me that it was in Marlborough. I’m somewhat familiar with this area; I’ve noticed on previous drives that there’s a mall around there. On the map above, that location is the one directly south of Hudson, on route 20. It is indeed the closest Chipotle, and I had just driven past it.
I’m traveling on 495 north, more like next to the “H” in Hudson on the map. I don’t particularly want to get off the highway, do a U-turn, head south until route 20, and then turn onto that and drive a mile+, all before circling back north to eventually be where I already am. I try another approach: “Where is the second nearest Chipotle?” Siri spins her wheels and says “I found a Chipotle nearby”, and shows me the same one again. I try a few other queries, like “Is there a Chipotle between here and Lowell?” (a city just off the northeast corner of this map that I would be driving through). She spins her wheels and says “I found a Chipotle nearby”. The point is, no matter what I do, it feels like Siri works for the Marlborough branch’s franchisee. Siri really, really wants me to go to that Chipotle. As a developer, this tells me that Siri basically pays attention to the word “Chipotle”, does a geo-search around my location, and spits out the #1 result.
For those hanging on every suspenseful thread of this riveting tale, yes, I did eventually get my burrito, up there north of Littleton. I sat outside on a beautiful sunny day and ate it while I talked to my girlfriend, who was on the other side of the world. Technology can be great. How did I find the Chipotle near Littleton? By pulling off the highway, launching Google Maps and typing in a search query. So technology, still great (a map in our pocket!), but unfortunately in this instance it made me temporarily cease my northward progress to fiddle with it. This experience can be better.
When you’re in a car, the world is structured according to roads, and some of those roads (like 495) are limited-access highways, so making a quick U-turn is difficult or at least inconvenient. I have a destination to the north, and I would like to get there at a reasonable time. The iPhone has a wealth of information about me in that moment (scarily so, actually). It knows that it is traveling at a reasonable rate of speed. It knows that it is traveling along a particular highway. It knows the direction of the car along the road.
So hopefully developers could create a virtual assistant that gives us not just one option, but a couple. It would be great if she could say “There’s a Chipotle about ten minutes to the south, but there’s also one twenty-five minutes away in the direction you’re going. Would you like one of those, or another option?” Further to that, these assistants should allow us to restrict the search ourselves from the get-go: “Siri, is there a Chipotle near 495 to the north of here?”
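To make that idea concrete, here is a minimal sketch in Python of a search that returns two suggestions instead of one: the nearest match overall, and the nearest match that lies roughly in the direction of travel. Everything here is an assumption for illustration: the `Place` type, the `suggest` function, the 60-degree tolerance and the coordinates are all made up (they are not real Chipotle locations), and a real assistant would want road-network routing rather than straight-line bearings.

```python
import math
from dataclasses import dataclass


@dataclass
class Place:
    name: str
    lat: float
    lon: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2 (0 = north, 90 = east)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlam)
    return math.degrees(math.atan2(y, x)) % 360


def is_ahead(heading, bearing, tolerance=60):
    """True if the bearing is within `tolerance` degrees of the heading."""
    diff = abs((bearing - heading + 180) % 360 - 180)
    return diff <= tolerance


def suggest(lat, lon, heading, places):
    """Return (nearest place overall, nearest place in the direction of travel)."""
    ranked = sorted(places, key=lambda p: distance_km(lat, lon, p.lat, p.lon))
    nearest = ranked[0] if ranked else None
    ahead = next(
        (p for p in ranked if is_ahead(heading, bearing_deg(lat, lon, p.lat, p.lon))),
        None,
    )
    return nearest, ahead


# Made-up coordinates: one match just behind the car, one farther up the road.
places = [
    Place("south branch", 42.35, -71.58),
    Place("north branch", 42.54, -71.48),
]
# Car heading due north (heading = 0 degrees).
nearest, ahead = suggest(42.39, -71.57, 0, places)
print(nearest.name, "|", ahead.name)
```

With both results in hand, the assistant could offer the choice itself rather than insisting on whichever match happens to be closest.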
There are other situations where these assistants could benefit from knowing about the constraints the real world places on its users. Imagine you’re in Hong Kong International Airport. Siri says there’s a Starbucks here. Great! But…is the Starbucks outside of security, or inside, beyond the security checkpoint? If I’m just here at the airport picking up a friend, I’m not permitted to go to the Starbucks beyond security. Conversely, if I’ve ventured beyond security already, boarding pass in hand, then I’m probably not going to want to exit through immigration and customs and so on just to get a latte. Another example: If it can tell from elevation, speed and location that I’m in the Seoul subway riding on Line 4, rather than telling me about the Tous Les Jours locations closest to my location, it should tell me about Tous Les Jours shops that are close to subway stops along this line in the direction I’m going. You get the point.
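The subway case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop names, the shop names and the `shops_ahead` function are hypothetical placeholders, and the sketch assumes the assistant has already worked out which line I'm on, which stop I'm at, and which way the train is going.

```python
def shops_ahead(line, current, direction, shops_by_stop):
    """List (stop, shop) pairs for stops still ahead on this line.

    line          -- stops in order along the line
    current       -- the stop the rider is at now
    direction     -- +1 if travelling toward the end of `line`, -1 otherwise
    shops_by_stop -- mapping of stop name to shops near that stop
    """
    i = line.index(current)
    # Stops ahead of the rider, ordered from soonest to latest.
    upcoming = line[i + 1:] if direction > 0 else line[:i][::-1]
    return [
        (stop, shop)
        for stop in upcoming
        for shop in shops_by_stop.get(stop, [])
    ]


# Placeholder data, not real stations or shops.
line = ["Stop A", "Stop B", "Stop C", "Stop D", "Stop E"]
shops = {
    "Stop A": ["Bakery 1"],
    "Stop D": ["Bakery 2"],
    "Stop E": ["Bakery 3"],
}

print(shops_ahead(line, "Stop C", +1, shops))
```

The airport case has the same shape: tag each shop with the zone it sits in (before or after security) and only show the ones in the zone I can actually reach.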
All of this brings the discussion back to Viv. It’s clear that the team wants to build a ubiquitous, platform-agnostic virtual assistant (so that they can have a bajillion users to monetize), and it’s also clear that they have spent time on handling more complex use cases. In the talk below, Kittlaus describes how Viv is able to decode natural language, and then dynamically compose a short program to represent the logic it finds in the instruction. Maybe then we could have queries like “Viv, show me some Mexican places around here, but only if they have at least four star ratings.” Would it have helped me do a fuzzier search for Chipotle while heading north? I am not sure yet, but if these assistants are to be worthwhile beyond simple queries, we’re going to need to imbue them with data that guides search algorithms with an understanding that the real world isn’t as flexible as the virtual one.
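I don't know how Viv actually builds its programs, but the general idea of composing a query out of small pieces can be sketched like this in Python. The `Restaurant` type, the `intent` dictionary and the helper names are all my own assumptions, and the sketch skips the hard part entirely by pretending the sentence has already been parsed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Restaurant:
    name: str
    cuisine: str
    rating: float


Predicate = Callable[[Restaurant], bool]


def cuisine_is(cuisine: str) -> Predicate:
    """Build a check for a given cuisine."""
    return lambda r: r.cuisine == cuisine


def min_rating(stars: float) -> Predicate:
    """Build a check for a minimum star rating."""
    return lambda r: r.rating >= stars


def compose(intent: dict) -> Predicate:
    """Assemble one predicate from whichever constraints the intent contains."""
    steps: List[Predicate] = []
    if "cuisine" in intent:
        steps.append(cuisine_is(intent["cuisine"]))
    if "min_rating" in intent:
        steps.append(min_rating(intent["min_rating"]))
    return lambda r: all(step(r) for step in steps)


# Hypothetical parse of: "show me some Mexican places around here,
# but only if they have at least four star ratings"
intent = {"cuisine": "mexican", "min_rating": 4.0}
query = compose(intent)

# Made-up sample data.
restaurants = [
    Restaurant("Taqueria Uno", "mexican", 4.5),
    Restaurant("Burrito Dos", "mexican", 3.5),
    Restaurant("Noodle Tres", "thai", 4.8),
]
matches = [r.name for r in restaurants if query(r)]
print(matches)
```

A real-world constraint like “only places ahead of me on this highway” would then be just one more predicate to add to the list.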