AI can do a passable job transcribing what one person says. Add multiple voices and tangents, and things get a lot murkier.
Imagine holding a meeting about a new product release, after which AI analyses the discussion and creates a personalised list of action items for each participant. Or talking with your doctor about a diagnosis and then having an algorithm deliver a summary of your treatment plan based on the conversation. Tools like these can be a big boost given that people typically recall less than 20% of the ideas presented in a conversation just five minutes later. In healthcare, for instance, research shows that patients forget between 40% and 80% of what their doctors tell them very shortly after a visit.
You might think that AI is ready to serve as secretary for your next important meeting. After all, Alexa, Siri, and other voice assistants can already schedule meetings, respond to requests, and set up reminders. Impressive as today’s voice assistants and speech recognition software might be, however, developing AI that can track discussions between multiple people and understand their content and meaning presents a whole new level of challenge.
Free-flowing conversations involving multiple people are much messier than a command from a single person spoken directly to a voice assistant. In a conversation with Alexa, there is usually only one speaker for the AI to track and it receives instant feedback when it interprets something incorrectly. In natural human conversations, different accents, interruptions, overlapping speech, false starts, and filler words like “umm” and “okay” all make it harder for an algorithm to track the discussion correctly. These human speech habits and our tendency to bounce from topic to topic also make it significantly more difficult for an AI to understand the conversation and summarise it appropriately.
Say a meeting progresses from discussing a product launch to debating project roles, with an interlude about the meeting snacks provided by a restaurant that recently opened nearby. An AI must follow the wide-ranging conversation, accurately segment it into different topics, pick out the speech that’s relevant to each of those topics, and understand what it all means. Otherwise, “Visit the restaurant next door” might be the first item on your post-meeting to-do list.
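To make the topic-segmentation step concrete, here is a deliberately simplified sketch, not how any production system (or Abridge) actually works: it compares adjacent windows of utterances using TF-IDF cosine similarity and flags a possible topic boundary wherever the similarity drops. The utterances, window size, and threshold are all invented for illustration.

```python
# A minimal topic-segmentation sketch (illustrative only): compare adjacent
# windows of utterances with TF-IDF cosine similarity and mark a topic
# boundary wherever the similarity dips below a threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    "Let's finalise the launch date for the new product.",
    "Marketing still needs two weeks for the campaign assets.",
    "These snacks are great, the new restaurant next door made them.",
    "We should order from them again next time.",
    "Back to roles: who owns the release checklist?",
    "I can own the checklist if someone else handles QA sign-off.",
]

WINDOW = 2          # number of utterances compared on each side of a candidate boundary
THRESHOLD = 0.05    # similarity below this suggests a topic shift

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(utterances)

boundaries = []
for i in range(WINDOW, len(utterances) - WINDOW + 1):
    left = np.asarray(tfidf[i - WINDOW:i].mean(axis=0))
    right = np.asarray(tfidf[i:i + WINDOW].mean(axis=0))
    similarity = cosine_similarity(left, right)[0, 0]
    if similarity < THRESHOLD:
        boundaries.append(i)   # a new topic likely starts at utterance i

print("Possible topic boundaries before utterances:", boundaries)
```

Real systems lean on far richer signals than word overlap, but even this toy version shows why the snack tangent can be separated from the launch discussion before any summarisation happens.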
Another challenge is that even the best AI we currently have isn’t particularly good at handling jargon, industry-speak, or context-specific terminology. At Abridge, a company I cofounded that uses AI to help patients follow through on conversations with their doctors, we’ve seen out-of-the-box speech-to-text algorithms make transcription mistakes such as substituting the word “tastemaker” for “pacemaker” or “Asian populations” for “atrial fibrillation.” We found that providing the AI with information about a conversation’s topic and context can help. When transcribing conversations with a cardiologist, for example, the system can be primed to expect cardiology terms like “pacemaker” rather than acoustically similar but out-of-context alternatives.
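In practice, this priming usually happens inside the speech recogniser itself, through custom vocabularies or phrase hints. As a toy illustration of the underlying idea only, and emphatically not Abridge’s method, the sketch below snaps words in a transcript to close matches from a made-up cardiology lexicon using simple string similarity from the Python standard library.

```python
# Toy illustration of domain-aware correction (not a production approach):
# snap transcript words to close matches from a cardiology vocabulary
# using fuzzy string matching.
import difflib

CARDIOLOGY_TERMS = ["pacemaker", "atrial fibrillation", "stent", "arrhythmia"]

def correct_transcript(words, lexicon, cutoff=0.7):
    corrected = []
    for word in words:
        # get_close_matches returns lexicon entries whose similarity to
        # `word` exceeds `cutoff`, best match first.
        matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected

asr_output = "the tastemaker battery looks fine".split()
print(" ".join(correct_transcript(asr_output, CARDIOLOGY_TERMS)))
# e.g. "the pacemaker battery looks fine"
```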
The structure of a conversation is also influenced by the relationship between participants. In a doctor-patient interaction, the discussion usually follows a specific template: The doctor asks questions, the patient shares their symptoms, then the doctor issues a diagnosis and treatment plan. Similarly, a customer service chat or a job interview follows a common structure and involves speakers with very different roles in the conversation. We’ve found that providing an algorithm with information about the speakers’ roles and the typical trajectory of a conversation can help it better extract information from the discussion.
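Here is a minimal sketch of what role-aware extraction might look like, assuming a transcript already labelled with speaker roles; the cue lists and dialogue are invented, and a real system would use learned models rather than keyword matching. The point is simply that knowing who is speaking, and where the conversation usually goes next, tells the algorithm what to look for in each turn.

```python
# Toy sketch of role-aware extraction (illustrative only): route each turn to
# a different extractor depending on the speaker's role, mirroring the usual
# doctor-patient template (patient reports symptoms, doctor states the plan).

transcript = [
    ("doctor", "What brings you in today?"),
    ("patient", "I've had chest pain and some shortness of breath."),
    ("doctor", "We'll start you on a beta blocker and schedule a stress test."),
]

SYMPTOM_CUES = ["pain", "shortness of breath", "dizzy", "fatigue"]
PLAN_CUES = ["start you on", "schedule", "follow up", "prescribe"]

def extract(transcript):
    summary = {"symptoms": [], "plan": []}
    for role, text in transcript:
        lowered = text.lower()
        if role == "patient":
            # Patient turns are scanned for reported symptoms.
            summary["symptoms"] += [cue for cue in SYMPTOM_CUES if cue in lowered]
        elif role == "doctor":
            # Doctor turns that sound like instructions go into the plan.
            if any(cue in lowered for cue in PLAN_CUES):
                summary["plan"].append(text)
    return summary

print(extract(transcript))
# e.g. {'symptoms': ['pain', 'shortness of breath'], 'plan': [...]}
```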
Finally, it’s critical that any AI designed to understand human conversations represents the speakers fairly, especially given that the participants may have their own implicit biases. In the workplace, for instance, AI must account for the fact that there are often power imbalances between the speakers in a conversation that fall along lines of gender and race. At Abridge, we evaluated one of our AI systems across different sociodemographic groups and discovered that the system’s performance depends heavily on the language used in the conversations, which varies across groups.
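The basic shape of such an evaluation is straightforward: score the system separately for each group and compare. The sketch below, with entirely made-up data and no connection to Abridge’s actual pipeline, computes transcription word error rate per group to surface exactly the kind of gap described above.

```python
# Minimal sketch of a per-group evaluation (toy data): compute word error
# rate (WER) separately for each sociodemographic group to surface
# performance gaps across groups.
from collections import defaultdict

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / max(len(ref), 1)

# (group, reference transcript, ASR hypothesis) -- toy examples only
samples = [
    ("group_a", "the patient has atrial fibrillation", "the patient has atrial fibrillation"),
    ("group_b", "the patient has atrial fibrillation", "the patient has asian populations"),
]

errors = defaultdict(list)
for group, ref, hyp in samples:
    errors[group].append(word_error_rate(ref, hyp))

for group, rates in errors.items():
    print(group, round(sum(rates) / len(rates), 2))
```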
While today’s AI is still learning to understand human conversations, there are several companies working on this problem. At Abridge, we are currently building AI that can transcribe, analyse, and summarise discussions between doctors and patients to help patients better manage their health and ultimately improve health outcomes. Microsoft recently made a big bet in this space by acquiring Nuance, a company that uses AI to help doctors transcribe medical notes, for $16 billion. Google and Amazon have also been building tools for medical conversation transcription and analysis, suggesting that this market is going to see more activity in the near future.
Giving AI a seat at the table in meetings and customer interactions could dramatically improve productivity at companies around the world. Otter.ai is using AI’s language capabilities to transcribe and annotate meetings, something that will be increasingly valuable as remote work continues to grow. Chorus is building algorithms that can analyse how conversations with customers and clients drive companies’ performance and make recommendations for improving those interactions.
Looking to the future, AI that can understand human conversations could lay the groundwork for applications with enormous societal benefits. Real-time, accurate transcription and summarisation of ideas could make global companies more productive. At an individual level, AI that can serve as your own personal secretary could help each of us focus on being present for the conversations we’re having without worrying about note taking or something important slipping through the cracks. Down the line, AI that can not only document human conversations but also engage in them could revolutionise education, elder care, retail, and a host of other services.
The ability to fully understand human conversations lies just beyond the bounds of today’s AI, even though most humans are able to more or less master it before middle school. However, the technology is progressing rapidly and algorithms are increasingly able to transcribe, analyse, and even summarise our discussions. It won’t be long before you find a voice assistant at your next business meeting or doctor’s appointment ready to share a summary of what was discussed and a list of next steps as soon as you walk out the door.
________________________________________________________________________
Author: Sandeep Konam. Sandeep Konam is a machine learning expert who trained in robotics at Carnegie Mellon University and has worked on numerous projects at the intersection of AI and healthcare. He is the cofounder and CTO of Abridge, a company that uses AI to help patients stay on top of their health.