Scientists develop new AI system for faster extraction of data from the internet
15 Nov 2016
Scientists have developed a new artificial intelligence system that can more effectively extract data from the vast wealth of information present on the internet, PTI reported.
The data needed to answer all kinds of questions is available online in the form of plain text. However, extracting that data from plain text and organising it for quantitative analysis can be prohibitively time-consuming.
Researchers from the Massachusetts Institute of Technology (MIT) in the US have developed a new approach to information extraction.
Machine learning mostly happens by combing through training examples and identifying patterns corresponding to classifications provided by human annotators.
For example, humans might label parts of speech in a set of texts, and the machine-learning system would then try to identify patterns that resolve ambiguities, such as when "her" is a direct object and when it is an adjective.
Typically, computer scientists tried to supply their machine-learning systems with as much training data as possible, thus increasing the chances of the system being able to handle difficult problems.
The new research has scientists training their system on scanty data.
"In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article," said Regina Barzilay, professor at MIT.
In their new paper, the MIT researchers trained their system on scanty data -- because in the scenario they're investigating, that's usually all that's available, IANS reported. But then they found the limited information an easy problem to solve.
"That's very different from what you or I would do. When you are reading an article that you cannot understand, you are going to go on the web and find one that you can understand," Barzilay, who is also a senior author of the paper, added.
A machine-learning system assigned each of its classifications a confidence score, a measure of the statistical likelihood that the classification was correct, given the patterns detected in the training data.
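To make the idea of a confidence score concrete, here is a minimal sketch in Python. It assumes the classifier produces a raw score per label and turns those scores into probabilities with a softmax, which is one common choice and not necessarily what the MIT system does. The function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw label scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def classify_with_confidence(scores):
    """Return the most probable label and its probability as the confidence."""
    probs = softmax(scores)
    label = max(probs, key=probs.get)
    return label, probs[label]

# Hypothetical raw scores for the word "her" in one sentence.
label, confidence = classify_with_confidence(
    {"direct_object": 2.0, "adjective": 0.5}
)
```

A confidence close to 1 means the model strongly prefers one label; a value near an even split across the labels signals an ambiguous case.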
With the new system developed by the researchers, if the confidence score was too low, the system automatically did a web search to pull up texts likely to contain the data it was trying to extract.
It then attempted to pull the relevant data from one of the new texts and reconcile the results with those of its initial extraction.
If the confidence score continued to be low, it then moved on to the next text pulled up by the search string, and so on.
The system eventually learned to generate search queries, gauge the likelihood that a new text was relevant to its extraction task, and determine the best strategy for combining the results of multiple attempts at extraction.
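The control flow described above can be sketched in Python as follows. This is a simplified illustration, not the researchers' implementation: the article says the system learned its search queries and its strategy for combining results, whereas this sketch assumes a fixed confidence threshold and a hard-coded rule that keeps the higher-confidence extraction. The `extract` and `search` callables, the threshold, and the toy data are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Tuple

Extraction = Tuple[str, float]  # (extracted value, confidence in [0, 1])

def reconcile(current: Extraction, new: Extraction) -> Extraction:
    """Placeholder rule: keep whichever extraction has the higher confidence."""
    return new if new[1] > current[1] else current

def extract_with_fallback(
    article: str,
    extract: Callable[[str], Extraction],
    search: Callable[[str], Iterable[str]],
    threshold: float = 0.8,  # assumed cut-off, not from the article
    max_texts: int = 5,      # assumed limit on how many search results to try
) -> Extraction:
    """Extract from the article; if confidence is low, try texts found by search."""
    best = extract(article)
    if best[1] >= threshold:
        return best
    for i, text in enumerate(search(article)):
        if i >= max_texts:
            break
        best = reconcile(best, extract(text))
        if best[1] >= threshold:
            break  # confident enough, stop reading further texts
    return best

def toy_extract(text: str) -> Extraction:
    # Hypothetical extractor: confident only when the count is stated plainly.
    if "3 people" in text:
        return ("3", 0.9)
    return ("unknown", 0.2)

def toy_search(query: str) -> Iterable[str]:
    # Hypothetical search results, in ranked order.
    return ["no numbers here", "officials said 3 people were hurt", "never reached"]

result = extract_with_fallback("several people were hurt", toy_extract, toy_search)
```

In this toy run the original article yields a low-confidence extraction, the first search result does not help, and the second supplies a confident answer, so the loop stops before reading the third text.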