Microsoft achieves ‘human parity’ in speech recognition

19 Oct 2016

With machine learning making rapid strides in the last few years, researchers at Microsoft claim that their speech recognition system has reached 'human parity', marking a new milestone in artificial intelligence (AI).

This effectively means that the speech recognition system can recognise words from a conversation as accurately as a human would.

'Human parity' doesn't mean error-free recognition, as even professional transcriptionists make errors and don't recognise every word perfectly from conversations. Rather, the rate of errors made by the recognition system is now on par with the error rate of humans performing the same task.

During their tests, researchers at Microsoft reported a word error rate of 5.9 per cent by the speech recognition system. "The 5.9 per cent error rate is about equal to that of people who were asked to transcribe the same conversation, and it's the lowest ever recorded against the industry standard Switchboard speech recognition task," the Redmond-based company said in its official blog post.
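The 5.9 per cent figure is a word error rate (WER), the standard metric for speech recognition: the word-level edit distance (substitutions, insertions and deletions) between the system's transcript and a reference transcript, divided by the number of words in the reference. A minimal sketch of the computation (the example sentences are hypothetical, not from the Switchboard task):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six in the reference gives a WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 5.9 per cent thus means roughly one word in seventeen is transcribed incorrectly, counting every substitution, insertion and deletion.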

"We've reached human parity," the company's chief speech scientist Xuedong Huang was quoted as saying in the blog.

This advance in speech recognition can have wide implications for both business-based products as well as consumer products. It can improve the user experience on a simple entertainment console like Xbox or can be used in speech-to-text transcription software. Microsoft's digital assistant Cortana is also likely to benefit from the improvement in speech recognition.

"This will make Cortana more powerful, making a truly intelligent assistant possible," company's head of artificial intelligence Harry Shum said.

In order to improve speech recognition to this level, the researchers made use of neural language models, which group similar words together for better recognition.
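The "grouping similar words together" works by representing each word as a vector of numbers, so that words used in similar contexts end up close to each other in vector space. A toy sketch of the idea, using made-up three-dimensional vectors purely for illustration (real models learn hundreds of dimensions from large text corpora):

```python
import math

# Hypothetical word vectors, invented for illustration only.
# In a trained neural language model these values are learned from data.
vectors = {
    "recognise": [0.9, 0.1, 0.3],
    "identify":  [0.8, 0.2, 0.3],
    "banana":    [0.1, 0.9, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words score higher than unrelated ones
print(cosine_similarity(vectors["recognise"], vectors["identify"]))
print(cosine_similarity(vectors["recognise"], vectors["banana"]))
```

Because related words sit close together, the model can prefer a plausible word over an acoustically similar but nonsensical one when transcribing speech.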

Even though Microsoft's speech recognition system has achieved a huge milestone by reducing the error rate significantly, the researchers still have a long way to go to make the technology usable in real-life situations.

This will include situations where there is background noise or multiple people talking, and enabling devices like our phones to better understand the context of a conversation, as humans can.