Language for machines
Updated: Aug 1
Machines are usually not familiar with natural language. Personal assistants such as Alexa or Siri try to mimic language understanding and are getting increasingly good at it. Setting timers, asking for the weather, and running contextual searches all work consistently well. For a few weeks now, however, a small problem has come up in our flat. When we say "Hey Siri, start Robbie" (Robbie being our robot vacuum), we do not hear the familiar vacuuming sound; instead, the voice of Robbie Williams comes from Siri's speakers. Nice feature, but not the intended behavior.
Interestingly, this is not only entertaining but also shows the limitations of Siri, Alexa et al. It seems that Siri's most recent training data contained more requests using this command to listen to Robbie Williams than to start a robot vacuum called Robbie.
This makes sense, since Robbie is listed as only the 10th most common robot vacuum name (https://www.ctrl.blog/entry/popular-roomba-names.html). In addition, we assume that the overlap between people who own a robot vacuum and use Siri is much smaller than the overlap between people who use Siri and listen to Robbie Williams.
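To make the intuition concrete, here is a toy sketch in Python. It is not how Siri works internally, which we do not know; it only illustrates how a purely frequency-based resolver would behave. The intent labels, the function name resolve_intent, and all counts are invented for illustration.

```python
from collections import Counter

# Hypothetical training counts: how often each intent was the correct
# answer for a given command. All numbers are made up for illustration.
TRAINING_COUNTS = {
    "start robbie": Counter({
        "play_music:Robbie Williams": 950,
        "start_vacuum:Robbie": 50,
    }),
}


def resolve_intent(utterance, counts=TRAINING_COUNTS):
    """Return the intent seen most often for this utterance, or None."""
    observed = counts.get(utterance.lower().strip())
    if not observed:
        return None
    intent, _ = observed.most_common(1)[0]
    return intent


# The majority of the (invented) training examples wins.
print(resolve_intent("Start Robbie"))
```

With these assumed counts, the music intent wins simply because it appears more often in the data. The vacuum intent would only win if the data contained more vacuum examples for the same command.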
This shows how important training data is for robust and reliable natural language processing.
This is especially true for scientific and legal language. So far, there is no well-labeled dataset that enables reliable training for the analysis of scientific papers and patents. In addition, Alexa and Siri currently rely on statistical text analysis, which assigns machine-readable word representations more arbitrarily than systematically.
For complex texts, this is not very elegant and quickly leads to undesired results. We are therefore using labeled patent data to train our model so that it understands technical phrases based on their technical meaning. You are likely to see some of the results in the upcoming weeks. Stay tuned…