Hacking Together a Google Conversational Search or What I Did Over the Long Berlin Weekend

Untitled (Jean-Marie Hullot / CC BY 3.0)
May 21st, 2013

At Google I/O, Johanna Wright gave an in-browser demo of "conversational search"; Amit Singhal describes it in "A multi-screen and conversational search experience". Motivated, I decided to throw together a weekend project emulating aspects of "conversational search" using the backend I wrote for forty.to.

Here it is in action:



NOTE: As I am not using https for the demo (buying a certificate just for this demo was out of the question), Chrome asks again and again whether the domain forty.to should be given permission to use the microphone. This is just a small annoyance; with a proper certificate, the question would be asked only once.

The first step in putting all of this together was finding out how to use Google's in-browser speech-to-text engine. A great resource for this is the article Voice Driven Web Apps: Introduction to the Web Speech API by Glen Shires, a Google engineer. He covers the basics of using the speech-to-text engine and provides source code.

So, to follow along at home, get his sources from the webplatform-samples git repository, then take a look at the main file, webspeechdemo.html, where all of the action occurs.

On line 225 the browser is checked to determine whether the webkitSpeechRecognition object exists. If it does, a webkitSpeechRecognition instance is newed up and configured to recognize continuous speech and produce interim results, and various callbacks are registered to handle speech starting, results arriving, errors, and speech ending.

if (!('webkitSpeechRecognition' in window)) {
  upgrade();
} else {
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;

  recognition.onstart = function() { ... }
  recognition.onresult = function(event) { ... }
  recognition.onerror = function(event) { ... }
  recognition.onend = function() { ... }
  ...

So far so good.

Now webspeechdemo.html requires the user to click the start_button to start the speech-to-text process. However, to emulate conversational search, I want the system to simply start answering questions without the user having to click anything.

Doing this is relatively easy; I only needed to make a few changes:

  1. Remove the onclick handler from the start_button
  2. Introduce a <body> tag along with its closing tag
  3. Place the original handler as an onload handler on <body>

In addition, I needed to remove the handler's dependence on the event argument, but this is easily done. With these changes, the HTML has the form:

<body onload="startButton()">
...
<button id="start_button">
  <img id="start_img" alt="Start" src="mic.gif"/>
</button>
...
</body>

and the JavaScript has the form:

...
function startButton() {
  if (recognizing) {
    recognition.stop();
    return;
  }
  final_transcript = '';
  recognition.lang = 'en-US';
  recognition.start();
  ignore_onend = false;
  final_span.innerHTML = '';
  interim_span.innerHTML = '';
  start_img.src = 'mic-slash.gif';
  showInfo('info_allow');
}
...

Also note that I hard-coded the language. (Hey, this is a weekend hack.)
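If you wanted to avoid the hard-coding, one option (just a sketch, not something the demo actually does) would be to fall back on the language the browser reports:

// Sketch only: use the browser's reported language, keeping 'en-US' as a fallback.
recognition.lang = navigator.language || 'en-US';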

Other than this change, I had to add a form to submit the question:

...
<form id="form" method="get" action="http://127.0.0.1/cgi-bin/demo.fcgi">
  <input id="question" type="hidden" name="question">
</form>
...

and add a JavaScript callback which submits the question when the user has finished talking:

...
  recognition.onresult = function(event) {
    var interim_transcript = '';
    // Collect final and interim pieces of the transcript separately.
    for (var i = event.resultIndex; i < event.results.length; ++i) {
      if (event.results[i].isFinal) {
        final_transcript += event.results[i][0].transcript;
      } else {
        interim_transcript += event.results[i][0].transcript;
      }
    }
    final_transcript = capitalize(final_transcript);
    final_span.innerHTML = linebreak(final_transcript);
    interim_span.innerHTML = linebreak(interim_transcript);
    // An empty interim transcript means the user has stopped talking:
    // turn the transcript into a question and submit it to the backend.
    if ('' == interim_transcript) {
      final_transcript = questionize(final_transcript);
      final_span.innerHTML = linebreak(final_transcript);
      var formObject = document.forms['form'];
      formObject.elements["question"].value = linebreak(final_transcript);
      recognizing = false;
      recognition.stop();
      formObject.submit();
    }
  };
...
function questionize(s) {
  return s + '?';
}

Beyond that, the changes are all relatively trivial: removing unneeded code and cosmetic tweaks. For the full details you can look at the sources on GitHub.

As for the backend which actually answers the questions, a full description will have to wait; it would take many posts to describe. In broad strokes, though, the backend does the following (a rough code sketch follows the list):

  1. Produces a grammatical analysis of the question.
  2. Uses the grammatical analysis as features for ML components which guess what the question is asking for.
  3. Does a web search for documents which could answer the question.
  4. Does an analysis of these documents, extracting answer hypotheses with associated confidences.
  5. Merges equivalent hypotheses, boosting the confidence of the resulting hypothesis.
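
To make the flow concrete, here is a rough JavaScript sketch of that pipeline. Every helper name in it (parseQuestion, classifyAnswerType, searchDocuments, extractHypotheses, mergeHypotheses) is a hypothetical stand-in for a backend component, not code from the actual system:

// Hypothetical sketch of the backend pipeline; the helpers below do not
// exist in the demo sources and only mirror the five steps above.
function answerQuestion(question) {
  var analysis   = parseQuestion(question);                  // 1. grammatical analysis
  var answerType = classifyAnswerType(analysis);             // 2. ML guess at what is being asked for
  var documents  = searchDocuments(question);                // 3. web search for candidate documents
  var hypotheses = extractHypotheses(documents, answerType); // 4. answer hypotheses with confidences
  var merged     = mergeHypotheses(hypotheses);              // 5. merge equivalents, boost confidence
  return merged[0];                                          // highest-confidence answer
}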

So, that's it!

If you have any questions, feel free to comment or drop me (Kelly) an email, and don't forget to sign up to get a pre-release version of our app at forty.to.
