The Voice-Controlled, Face Recognizing, Drone Journey – Part 7


Text to Speech

This post is the eighth in the series documenting the steps I went through on my journey to build an autonomous, voice-controlled, face-recognizing drone. There are 7 other posts building up to this one, which you can find listed at the end of this post.

Focus of this post

This post is going to make use of another of the APIs offered by Microsoft Cognitive Services – the Bing Speech API. Back in post 4, which seems a long time ago now, I explained how to sign up for the various services and get the API key.

You will need to go back now and make sure you have the Bing Speech API key copied, as we will use it in our next steps.

By the end of this post we will have the drone speaking to us when it lands.

Some history and thanks!

In trying to work out how to make use of the Bing Speech API I spent a lot of time on the documentation page. I also found my way to the samples hosted here (since updated from when I first went there).

I used the sample code that was originally there to successfully make a call to the Bing Speech API service. I was getting data back, but no matter what I tried I simply could not get the returned file to play in the node.js world.

At that time I again reached out to my fellow Microsoft colleagues, who took a look at what might be going on. I want to give a shout out here especially to Song Li, and also to Sheng Yhao and Xin Dong (plus some other people who helped me reach them – you know who you are). Song Li jumped in and helped out, ultimately adjusting the online node.js sample based on my experience.

There were a few things I tried before I contacted Song Li:

  1. In the original sample the WAV file was being returned but there was no playback code. I tried writing the file to disk and playing it separately in a media player, which showed me things were not working, before also trying to send it directly to the speaker (once I had worked out how to get that package to actually install – more on that later).
  2. With that not working, I also played around with the output format encoding to see if that helped. Adjusting the encoding made no difference.

It turns out the issue with my node code was a relatively simple one. Basically, the encoding option on the post method needed setting to null so that the response body comes back as a raw binary Buffer rather than being decoded as a utf8 string (which mangles the WAV data). I would never have found that but for the help of Song Li. If you do not understand then don’t worry. The point is that a subtle change makes all the difference to what comes back, but I wanted to share my pain here :).
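
To make the fix concrete, below is a minimal sketch of the post call using the request package. The endpoint and ssml variables here are placeholders standing in for values we will set up later in this post.

var request = require('request');

// encoding: null tells request to hand the response body back as a raw
// Buffer. Left at the default, the body is decoded as a utf8 string and
// the binary WAV data is mangled beyond playing.
request.post({
    url: endpoint,      // placeholder: the synthesize endpoint (see later)
    body: ssml,         // placeholder: the SSML payload (see later)
    encoding: null
}, function (err, response, wavBuffer) {
    // wavBuffer is now a Buffer containing playable WAV bytes
});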

Based on that exchange, Song Li has now updated the node.js samples you find here not only to include the encoding option but also to use the speaker package (once you finally manage to install it) to play back the returned WAV file, bringing it to parity with what I was doing. Had I had that to start with, it would have been super quick.

Installing Packages

To use the code we are going to copy you need to install some new packages. We will be making use of the request package, the xmlbuilder package, the wav package and the speaker package.

To use them you must first install them. This means going into the command console, changing to the C:\Development\Drone directory and then executing these commands. We will look at speaker separately as I had a lot of problems getting it to install.

npm install request --save
npm install xmlbuilder
npm install wav
Looking specifically at the speaker package

Installing the speaker package will be troublesome for many people, because the package is compiled on your computer at install time. I used at least half a day on this issue.

This means that for it to work you need Python and a C++ development environment. A royal pain.

If you go ahead and run npm install speaker and you lack Python, the first message will look as shown below – ERRORS everywhere.

No Python

So what! We simply have to go and install Python. Ok – so navigate now to the python.org downloads page. You will see it has multiple options for Windows as shown below.

Python Web
Super IMPORTANT: Make sure you take version 2.7.13. I took the latest one first and it simply did not work for me. Cue uninstall and try again. Save yourself some pain and make sure you click the button to download Python 2.7.13 from the start.

Once it has downloaded you can simply run the resulting .msi file to start the install of Python.  The next important point is that when installing you need to ensure that the option “Add python.exe to the Path” is selected in the “Customize Python” dialogue you will get.

It is all the way at the bottom, with a red X by default, as shown below. You need to click on it and select the option “Entire feature will be installed on the local hard drive”. If you do not do this you will have to manually add python.exe to your path. If you have another version of Python installed you might want to decide how to tackle this yourself. Needless to say, it is important that when we install the speaker package the 2.7.13 python.exe is on the path.

Customize Python

Once installed you need to close your command window, open a new one and navigate back to the C:\Development\Drone directory.

From there, try running the command npm install speaker again. This time, unless you have something like Visual Studio installed, you will get a different error such as the one shown below. The issue now is that, with the Python side satisfied, the speaker install wants to perform some compilation. Without the right version of a Visual C++ build environment it just does not work. Cue frantic searching for an answer.

Speaker Error

That’s right, you have guessed it: we now also need to install a development kit/environment. I tried the various SDKs that are alluded to here, thinking they would be more lightweight. They simply do not play well on Windows 10, so do not even try (I did... more wasted effort). You have to go download and install Visual Studio 2015 Community Edition.

I am going to strongly advise that you do this when you have some time available.  The download itself takes time but then a whole series of updates happen and it seems to go on forever. I work for Microsoft so I am allowed to complain about that 🙂

The other important thing is that BEFORE you START the install you MUST select custom, when asked, and then expand the “Programming Languages” option. There you need to select “Microsoft Foundation Classes for C++” on top of the Common Tools option that is selected by default as shown below. Then you can hit next, take defaults for the rest and go grab a coffee 🙂

Visual Studio Options

With any luck, once the install is finished, and you have restarted your machine as well as drunk your coffee, you will be able to open a new command window, run the npm install speaker command and see the screen below, meaning you have won :). Take that, speaker package!

Speaker Winner

This part of the configuration took me a long time to work through so I hope you avoid my frustration 🙂

Creating TextToSpeechService.js

I am going to keep this really simple. The code to do this is available here on GitHub. Take that code and save it in a file called TextToSpeechService.js. I will not paste it below as it takes a lot of space.

In that file you need to find the line:

var apiKey = "Your api key goes here";

and replace the text between the two quotation marks with the API key you got when signing up for the Bing Speech API in part 4. It is NOT the same key we have used so far for the Face API! If you use the wrong key you will get status code 400 returned.

In addition, you might need to MODIFY the endpoint being used in that same file. During my creation of this demo it changed, and this caught me out. If you look through the file you will find a post request to https://speech.platform.bing.com/recognize, and this needs to be changed to https://speech.platform.bing.com/synthesize. If you do this then all will work well. If not, you will get errors.

The file you have just created is a little different from the ones we have used before. If you run it by calling node TextToSpeechService.js then nothing will happen. That is because this file defines a function which can be called from other node applications; on its own it does nothing.
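
To see why, here is a toy illustration of the same pattern (the file names greeter.js and useGreeter.js are just made up for this example):

// greeter.js – defines and exports a function but never calls it,
// so "node greeter.js" runs to completion without any output.
exports.Greet = function Greet() {
    console.log('hello from greeter');
};

// useGreeter.js – requires the module and calls the exported function.
var greeter = require('./greeter.js');
greeter.Greet();   // prints: hello from greeter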

Let's look at some key parts (a condensed sketch of the whole flow follows the list):

  1. exports.Synthesize specifically defines that the function made available to code requiring this file will be Synthesize. In the version we have right now there is no parameter. We will adjust that shortly.
  2. The next thing you see is the use of xmlbuilder to create the SSML (XML) document we will pass in our post request. You can see the language being set, the gender of the speaker we want, and you can also see the selection of voice (en-US, ZiraRUS). This is then turned into a single string on the next line.
  3. Then we are making a request to the token service, passing our API key. The token we get back is then used to authorize our access to the actual Bing Speech API service.
  4. Once we have the token we can then go ahead and post all the information to the Bing Speech service. You will see in doing that we are passing in the previously created SSML as well as stating what output encoding we want. You will also see the important line “encoding: null” which was originally missing, causing garbled output.
  5. Once we get a response (next part, with error handling) we then create a new WAV reader object (this object parses the WAV headers etc.). Using that object we read the returned stream and pipe it to the speaker object, which causes it to play on our computer.

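Below is a condensed sketch of what TextToSpeechService.js is doing, using the request, xmlbuilder, wav and speaker packages we installed above. The token URL and header values shown are the ones the Cognitive Services documentation used at the time; check them against the real file on GitHub, which also contains far more error handling.

var request = require('request');
var xmlbuilder = require('xmlbuilder');
var wav = require('wav');
var Speaker = require('speaker');

var apiKey = "Your api key goes here";

exports.Synthesize = function Synthesize() {
    // 1. Build the SSML document: language, gender and voice selection,
    //    plus the text to speak, all turned into a single string.
    var ssml = xmlbuilder.create('speak')
        .att('version', '1.0')
        .att('xml:lang', 'en-us')
        .ele('voice')
        .att('xml:lang', 'en-us')
        .att('xml:gender', 'Female')
        .att('name', 'Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)')
        .txt('This is a demo to call Microsoft text to speech service.')
        .end();

    // 2. Exchange the API key for a short-lived access token.
    request.post({
        url: 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken',
        headers: { 'Ocp-Apim-Subscription-Key': apiKey }
    }, function (err, resp, token) {
        if (err) { return console.error(err); }

        // 3. Post the SSML to the synthesize endpoint. encoding: null keeps
        //    the WAV bytes as a raw Buffer instead of a utf8 string.
        request.post({
            url: 'https://speech.platform.bing.com/synthesize',
            body: ssml,
            encoding: null,
            headers: {
                'Authorization': 'Bearer ' + token,
                'Content-Type': 'application/ssml+xml',
                'X-Microsoft-OutputFormat': 'riff-16khz-16bit-mono-pcm'
            }
        }, function (err, resp, wavBuffer) {
            if (err) { return console.error(err); }

            // 4. Parse the WAV headers and pipe the PCM data to the speaker.
            var reader = new wav.Reader();
            reader.on('format', function (format) {
                reader.pipe(new Speaker(format));
            });
            reader.end(wavBuffer);
        });
    });
};
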
The real file contains a lot of error handling on top of this, but I hope the above gives you the basic idea. Now create a file called testSynth.js and put the following two lines in it:

var tts = require('./TextToSpeechService.js');  
tts.Synthesize();

Now go ahead and run node testSynth.js. You should hear (make sure your speakers are on... I forgot that once) the service reading back the text that is hard-coded in the file: “This is a demo to call Microsoft text to speech service.”

The last step is now to edit the Synthesize function in TextToSpeechService.js so that you can pass in the text you want the service to speak. To do that you only need to edit two lines.

exports.Synthesize = function Synthesize(){ needs to be updated to exports.Synthesize = function Synthesize(InputSpeech){, which is basically defining that this function will now accept a parameter.

Then we need to find the line .txt('This is a demo to call Microsoft text to speech service.') where we are building the XML document and replace the fixed text with the new parameter: .txt(InputSpeech).
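
Put together, the changed parts of the function look roughly like this (everything not shown stays exactly as it was):

exports.Synthesize = function Synthesize(InputSpeech) {   // was: Synthesize()
    var ssml = xmlbuilder.create('speak')
        // ...voice attributes unchanged...
        .txt(InputSpeech)   // was: .txt('This is a demo to call...')
        .end();
    // ...token request and post unchanged...
};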

You can test this by making a change to your testSynth.js file so you include some text you want to play. Do not make it too long. There are limits on the length of a returned sound clip. Below is my example.

var tts = require('./TextToSpeechService.js');  
tts.Synthesize("Barnsley are the worlds best football team");

Running node testSynth.js should now get you the text you have provided playing. Congratulations – you now have the Bing Speech API of Microsoft Cognitive Services in action. You can play around with voices and more if you want to.

Integrating this into our DroneWebServer.js

Of course, the reason for spending time on this was so that the drone could speak to a named person when they were found, before it landed. It turns out this is pretty easy, because we put the code in a separate file with an exported function, provided everything from before was working.

All you need to do is go into your DroneWebServer.js file and locate the section of code which looks as follows:

if (name == "Mark Torr") {   client.land();   found = true();}

and update it to call the service with what we want said, similar to what is shown below:

if (name == "Mark Torr") {   
    var tts = require('./TextToSpeechService.js');  
 tts.Synthesize("Found You" + Name + ". I Hope You Are Having A Great Day!");
    client.land();   
    found = true();
}
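
One small side note: Node caches modules, so the require inside the if block works, but it is tidier to pull it up to the top of DroneWebServer.js once, next to your other require calls:

// At the top of DroneWebServer.js, alongside the existing requires.
var tts = require('./TextToSpeechService.js');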

If all goes well you should be able to run node DroneWebServer.js and see something similar happening to the video below. In this video I am making the drone take off using the app we built on the computer and telling it to start taking images. The rest happens automagically because of our code.

Listen for the speech at the end! The Speech API has some latency, which is why the drone lands before the speech arrives. I also modified the call to pass just my first name, as it sounded better in the speech.

Congratulations. You now have an autonomous face-recognizing and talking drone 🙂

Where are we and next steps

We have made great progress now. The next step is to make it so we can ask our drone to find a specific person and to then autonomously take off and look for them.

This means we need to again make use of the Bing Speech API which is a part of the Microsoft Cognitive Services.

In the next blog I focus on how to get that working. See you next time!

Previous Posts in the Series

Although I work for Microsoft these thoughts are 100% my own & may not reflect the views of my employer.
