The Voice-Controlled, Face Recognizing, Drone Journey – Part 8


Introduction to the Drone Journey

Speech to Text

This is the eighth post documenting the steps I went through on my journey to build an autonomous, voice-controlled, face recognizing drone. There are 7 other posts building up to this one, which you can find at the end of this post.

Focus of this post

We have come a long way from when we first started with a drone controlled from the computer. In the last post we spent time understanding how to use the Bing Speech API to convert supplied text to speech. In this post we will:

  • Show how you can use the Bing Speech API to derive text from speech.
  • Integrate that approach into our DroneWebServer.js web application and front end HTML so that we can control the drone via speech.

The Steps

To succeed in this task I knew there were a number of steps I needed to follow:

  1. I needed a way to capture the spoken text as a WAV file which I could send to the Microsoft Cognitive Services Bing Speech API, and it needed to work from the browser.
  2. I needed to grab the output coming back from the Microsoft Cognitive Services Bing Speech API and use that within my DroneWebServer.js to replace the hard coded name of the person we were looking for.
  3. I needed to string together a set of currently separate calls so that the drone would operate autonomously once it received the audio command.

How hard can that be?


It turns out that this part of the task caused me quite some headaches.

I started out, having watched the video of Lukas, wanting to replicate the great visualization he had, and his approach, given I have not really done this sort of thing before. I looked around but could not find anything similar despite a lot of searching. Eventually I again reached out to Lukas over Twitter and he kindly pointed me to this web site and to this GitHub entry.

Using that information I was able to quickly integrate the capabilities of capturing a WAV file, with a neat visualization, and creating a download link into the application. It was really a matter of just moving over the relevant JavaScript files, and then using a saved version of the above mentioned website to move things over into my own DroneWebServer.js. I managed to do that very quickly and was suitably impressed.

At this point I could click the microphone icon to start recording, having ok’d it in the browser, and capture the WAV file which would then be nicely visualized. I could also click on the “download” icon to have the browser prompt me to save the WAV file. Saving the WAV worked fine and I could listen to it.

The issue I ran into was how to get that WAV file posted back to the express server. I could see that the underlying, copied, JavaScript was creating a blob which was the file. I could also see that it was creating a link to the created file and substituting that back in the browser. Having tried what felt like 100 ways to get that same file into a blob I could send back to the server, I eventually got frustrated and stopped. I will go back to it once everything is working, but I needed to find another way.

Then the light hit

In searching around for an answer to how to do this I did something I should have perhaps done first. I went to the documentation page for the relevant part of the API. This led me to this GitHub location and a sample application.

Taking a closer look at the application it seemed pretty simple. This was because a lot of the really clever stuff was hidden in a JavaScript helper file you just needed to download and consume.

The other thing that was neat about this was that the approach could be triggered from the web browser and the result would be a text string which I knew I could get very easily back into my application. I was sold. I would lose the neat visualization but I would have a working solution. The rest of this blog is concentrated on the steps I took.

Adding in the speech recognition

In this section we will be working in multiple files and creating a new file. We need to make updates to the front-end and the back-end code.

Grab the JavaScript Helper File

The very first thing you need to do is download the JavaScript Helper file by clicking on this link. Once the link opens you need to copy the entire contents and save them into a file, which you need to create, called speech.1.0.0.js in your c:\Development\Drone\public directory. You will not need to open or understand that file.

Edit the index.html file

As the majority of the work will be driven from the client we need to update our index.html file. We need to add into the file some new JavaScript functions in the <head> section. The code you need is below.

Here is what each part is doing:

  1. function callme – This is going to issue a GET request to the assigned URI and include with it a parameter called inputspeech. It sets the value of the inputspeech parameter based on what is passed into the function as a string. We will be passing the value returned by the Bing Speech API, after some post processing, to this function to send the information to the DroneWebServer web application listening on port 3000.
  2. function clearText – As part of this process I wanted to ensure that we could show a little of the output of the speech recognition. To do that we need two functions. This is the first one. It simply sets the value of an element called output in our HTML body to the text you see.
  3. function setText – This is the second one. It sets the same element in our HTML (the one clearText resets to a standard phrase) to a new phrase, depending on what is passed to it.
  4. function capitalizeEachWord – Takes a string and makes every word start with a capital letter. I did this as it makes matching easier later in my backend code.
  5. function start – This is doing the heavy lifting. The first variable sets the mode of the speech recognition that you will use. The next variable contains your API key for the Bing Speech API and the last one allows you to set the language of speech recognition. Then you have a long call which uses the config information you just provided, and the previously copied helper JavaScript file, to set up your connection to the Bing Speech API service by creating an object you can leverage called speechclient. Using that new object we can turn on the microphone, which is what the next line does. We turn the microphone on for 5000ms (or 5 secs) so that you can capture the speech you want. At the end of that period we turn off the microphone again, which also causes the submission to the Bing Speech API in the background. Then we wait for the response. Once we get it we parse out of it one thing called display (which is the text the service has decoded from your speech) and turn that into a string using the JSON.stringify function. We then capitalize the first letter of every word. The next line removes the quotation marks which the service includes with the response, and finally we chop the string, which assumes the primary command is “Find”, to remove everything except the person's name (the returned string also comes with a full stop at the end). The very last steps are to call the setText function, so we can see on the screen of our web application what we are sending to the server, and then to call the callme function, passing in that text so that it will be sent to the server as the parameter inputspeech.
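<script type="text/javascript">
  var speechclient;

  // Send the recognised name to the DroneWebServer as the inputspeech parameter.
  function callme(text) {
     var xhr1 = new XMLHttpRequest();
     xhr1.open('GET', 'http://localhost:3000/audio?inputspeech=' + text, true);
     xhr1.send();
  }

  // Show the default phrase in the output element.
  function clearText() {
     document.getElementById("output").value = "Not Currently Looking For Someone";
  }

  // Show a new phrase in the output element.
  function setText(text) {
     document.getElementById("output").value = text;
  }

  // Give every word a leading capital letter.
  function capitalizeEachWord(str) {
    return str.replace(/\w\S*/g, function(txt) {
        return txt.charAt(0).toUpperCase() + txt.substr(1).toLowerCase();
    });
  }

  function start() {
    var mode = Microsoft.CognitiveServices.SpeechRecognition.SpeechRecognitionMode.shortPhrase;
    var apiKey = "YOUR_BING_SPEECH_API_KEY"; // placeholder: put your own key here
    var language = "en-GB";
    speechclient = Microsoft.CognitiveServices.SpeechRecognition.SpeechRecognitionServiceFactory.createMicrophoneClient(mode,language,apiKey);
    speechclient.startMicAndRecognition();    // turn the microphone on
    setTimeout(function () {
      speechclient.endMicAndRecognition();    // turn it off after 5 seconds and submit
    }, 5000);

    speechclient.onFinalResponseReceived = function (response) {
        var text = JSON.stringify(response[0].display);
        text = capitalizeEachWord(text);
        text = text.replace(/"/gi,'');
        text = text.slice(5,text.length-1);   // drop "Find " and the trailing full stop
        setText("Searching For " + text + " Now!!");
        callme(text);
    };
  }
</script>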
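
To make the string handling in onFinalResponseReceived easier to follow, here is a minimal sketch that pulls the same steps into one standalone function you can run in Node.js. The name extractName is hypothetical (it is not part of the helper file or of my page), and the sketch assumes, as the code above does, that the phrase starts with the word "Find" and ends with a full stop.

```javascript
// Same helper as in index.html: capitalise the first letter of each word.
function capitalizeEachWord(str) {
  return str.replace(/\w\S*/g, function (txt) {
    return txt.charAt(0).toUpperCase() + txt.substr(1).toLowerCase();
  });
}

// Hypothetical helper that mirrors the post-processing steps in start().
// Assumes the display text looks like "Find <name>."
function extractName(display) {
  var text = JSON.stringify(display);     // adds surrounding quotation marks
  text = capitalizeEachWord(text);
  text = text.replace(/"/g, '');          // remove the quotation marks again
  text = text.slice(5, text.length - 1);  // drop "Find " and the trailing full stop
  return text;
}

console.log(extractName("Find mark torr."));  // prints "Mark Torr"
```

If the spoken phrase does not start with "Find", the slice will cut off the wrong characters, so this only suits the single command used in this post.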

At this stage we now need to add the helper file URL so that the relevant functions are available. To do that add this line into the <head> section of your index.html file.

<script language='javascript' src="speech.1.0.0.js"></script>

Since we want to set the value of the element identified by output (which we will add in a moment) when the page loads we need to find the html <body> tag in our index.html file and edit it so it looks as shown below. This will cause the function clearText, explained previously, to be executed when the web page first loads.

 <body onLoad="clearText();">

So far everything we have done is the plumbing to make this work. Now we need to add the visual elements into our index.html file.

We first need to add the HTML element we have constantly referenced as output. We will use this to display some text dynamically. Find the line where you have defined the iFrame and add the text area line shown below directly above it (the iFrame line is included so you can see where it goes).

<textarea align="center" rows="1" id="output" style='width:400px;'></textarea>
<iframe src="http://localhost:3000/DroneImage.html" height="400" width="660" name="iframe_DroneImage"></iframe>

Next up we need to add a way to start the process of recording our input. To do that I got a microphone image, saved it into my c:\Development\Drone\public directory and called it microphone.png. I could then simply add in an additional image on top of all the others we have making calls. The code I added is shown below and I inserted it directly after the <table cellpadding="4"> html tag.

 <td align="center">
   <a onclick="start();"><img height="70" width="70" src="microphone.png"></a>
 </td>

Phew. That is all you needed to do in the index.html file. In fact this is pretty much all you need to do in order to use the Microsoft Cognitive Services Bing Speech API. I was impressed :).

If you start up your DroneWebServer by executing node DroneWebServer.js from the command window in the c:\Development\Drone directory, you should then be able to head over to the web browser and go to http://localhost:3000. If all is working well you should see a page looking as shown below. Do not click anything yet!

GUI With Microphone

Work on the back-end

Next up we need to edit our DroneWebServer.js in a few places so we can capture the information we are sending from the web browser.

The first thing we need is a global variable to be defined where we can store the name of the person we want to find.

At the top of the DroneWebServer.js file you have a lot of variables being defined now. Simply add this line:

var nameToFind ="";

The next thing we need is a new router function in our file. Below is the code to add in a new one that will be called when someone uses the audio path.  So you can see EXACTLY what is sent to the back-end there is a console.log line which prints the value of the parameter inputspeech which is passed in with the request.  The code is below.

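 app.get('/audio', function(req, res) {
   console.log(req.query.inputspeech);   // show exactly what the browser sent
   nameToFind = req.query.inputspeech;   // remember who we are looking for
   res.end();                            // reply so the browser request completes
 });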
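
As a rough illustration of what travels between callme and this route, the sketch below builds the same kind of URL and reads the inputspeech parameter back out using the standard URL class. It runs on its own in Node.js. The names buildAudioUrl and readInputSpeech are hypothetical and used only for this example; in the real application Express does the parsing for you and exposes the value as req.query.inputspeech.

```javascript
// Hypothetical helper: build the kind of URL that callme() requests.
function buildAudioUrl(name) {
  var url = new URL('http://localhost:3000/audio');
  url.searchParams.set('inputspeech', name);  // encodes spaces and other special characters
  return url.toString();
}

// Hypothetical helper: read the parameter back out of a request URL.
function readInputSpeech(requestUrl) {
  return new URL(requestUrl).searchParams.get('inputspeech');
}

var requestUrl = buildAudioUrl('Mark Torr');
console.log(requestUrl);                   // the encoded URL
console.log(readInputSpeech(requestUrl));  // prints "Mark Torr"
```

The point is simply that a name containing a space survives the trip: it is encoded in the URL and decoded again on the receiving side.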

Finally we need to replace the hard coded instances of names in our DroneWebServer.js file.  The hard coded names are being used in the identifyPATH function.

I made two adjustments. Firstly I replaced the line:

        if (name == "Mark Torr") {

with:

         if (name == nameToFind) {

The next thing is I adjusted the speech synthesis line, changing it from:

 tts.Synthesize("Found You Mark Torr.  I Hope You Are Having A Great Day!");

to:

 tts.Synthesize("Found You " + nameToFind + ". I Hope You Are Having A Great Day!");

That was all I did at first. Now you can save the file.

Next start up your DroneWebServer by executing node DroneWebServer.js from the command window in the c:\Development\Drone directory. You should then be able to head over to the web browser and go to http://localhost:3000.

If all is working well you should now be able to click on the microphone, approve localhost using it, and then say “Find XXX”, replacing XXX with the name of someone in your trained PersonGroup. You get 5 seconds to say what you want, counted from when you approve localhost's access the first time or from when you click on the microphone on subsequent occasions.

All clicking is doing is setting the back-end variable at this point and writing it out to the console. You will see the text you spoke (after find) being written to the console as well as into the browser as shown below.

If you now click the takeoff icon and turn on photos by clicking that then your drone will look for the specific person you have requested.

More autonomous

Of course, clicking a whole series of buttons is not what we want to be doing if this is to be truly autonomous. What you need to do is make the microphone input trigger several actions. To do that we edit the new audio router we put into the DroneWebServer.js file, as shown below, and string together several code fragments we were previously driving individually. Essentially we:

  1. Write the inputspeech value to the console.
  2. Grab the incoming inputspeech value and assign it to the nameToFind variable.
  3. Tell the drone to take off (we could include a check here to see if it is a valid name).
  4. Start the camera capturing images, which in turn triggers the face identification. If the face we want is found, that in turn triggers the speech synthesis.
 app.get('/audio', function(req, res) {
   console.log(req.query.inputspeech);
   nameToFind = req.query.inputspeech;   // who we are looking for
   console.log("Drone Taking Off");
   client.takeoff();
   console.log("Drone Taking Pictures");
    var pngStream = client.getPngStream();
    var period = 5000; // Save a frame every 5000 ms.
    var lastFrameTime = 0;
    pngStream
     .on('error', console.log)
     .on('data', function(pngBuffer) {
        var now = (new Date()).getTime();
        if (now - lastFrameTime > period && found == false) {
           lastFrameTime = now;
           fs.writeFile(__dirname + '/Public/DroneImage.png', pngBuffer, function(err) {
             if (err) {
               console.log("Error saving PNG: " + err);
             } else {
               console.log("Saved Frame");
               identifyPATH(__dirname + '/Public/DroneImage.png', myGroup);
             }
           });
        }
     });
   res.end();   // reply so the browser request completes
 });
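
Step 3 mentions that a validity check could go before takeoff. A minimal sketch of such a check follows; it is not part of my code. The function isKnownName and the knownNames array are hypothetical, and the sketch assumes you keep a list of the names in your trained PersonGroup somewhere in DroneWebServer.js.

```javascript
// Hypothetical list of the people in the trained PersonGroup (an assumption for this sketch).
var knownNames = ['Mark Torr', 'Another Person'];

// Hypothetical helper: true only when the spoken name matches a known person.
function isKnownName(name, names) {
  if (typeof name !== 'string') {
    return false;                          // parameter missing from the request
  }
  var wanted = name.trim().toLowerCase();  // ignore case and stray spaces
  return names.some(function (candidate) {
    return candidate.toLowerCase() === wanted;
  });
}

console.log(isKnownName('Mark Torr', knownNames));  // true
console.log(isKnownName('Nobody', knownNames));     // false
```

Inside the audio route you could wrap the takeoff and picture code in if (isKnownName(req.query.inputspeech, knownNames)) { ... } so that the drone stays on the ground when the phrase is not a name it has been trained on.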

If you save this then next up you need to restart the web application DroneWebServer.js and refresh your browser. Click on the Microphone and say “Find XXX”. Be ready as the drone will now take off and start to look for you.  You should be seeing it end to end as shown in the video below.

CONGRATULATIONS. You now have an Autonomous, Voice-Controlled, Face Recognizing Drone and we have replicated the excellent project that Lukas pulled together in his original post.

Where are we and next steps

We have now made it most of the way through what we will do with Microsoft Cognitive Services. The next few posts will focus on how we can extend what Lukas did so that we can capture some of the sensor data from the drone.

We will grab that data and get it flowing to the Microsoft Azure IoT Hub where we will extract it using Azure Stream Analytics and visualize it in PowerBI.

Previous Posts in the Series

Although I work for Microsoft these thoughts are 100% my own & may not reflect the views of my employer.

