
Hey guys, we have been using Watson Speech-to-Text. Although the platform is stable and gives us good transcripts, I’m wondering if anyone can suggest something better? Just looking for opinions. Thanks

We use IBM/MS/Google regularly. For VM, I give a slight edge to Google, but all three are very comparable and the preference is largely subjective.

Google’s formatting (capitalization, punctuation, dates, phone numbers, etc.) generally wins IMO. It’s a small thing, but it makes a difference in the “at a glance” readability of the message.

Again - very subjective - it’s not like I have done any analysis of word error rates or such. Just my opinion having compared routine messages run through all three products.

How would you go about installing something like this? Can it be used to dial out?
i.e. lift the handset, say ‘call Sue’, STT is performed, a lookup is done and Sue is dialled?

This is very far from polished code, but a world ago I was working on TTS and STT with the Google Cloud resources.

Here are a couple of functioning Go programs I kicked together to perhaps get someone going.

(You will of course need a working gcloud account.)

Speech to text (expects an Asterisk-like file in …/out.wav):

// Record a test file with:
// arecord  -d 5 -r 8 -c1 -f S16_LE ../out.wav
// Sample that uses the Google Cloud Speech API to transcribe audio.
package main

import (
        "context"
        "fmt"
        "io/ioutil"
        "log"

        speech "cloud.google.com/go/speech/apiv1"
        speechpb "google.golang.org/genproto/googleapis/cloud/speech/v1"
)

func main() {
        ctx := context.Background()

        // Creates a client.
        client, err := speech.NewClient(ctx)
        if err != nil {
                log.Fatalf("Failed to create client: %v", err)
        }
        defer client.Close()

        // Sets the name of the audio file to transcribe.
        filename := "../out.wav"

        // Reads the audio file into memory.
        data, err := ioutil.ReadFile(filename)
        if err != nil {
                log.Fatalf("Failed to read file: %v", err)
        }

        // Detects speech in the audio file.
        resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
                Config: &speechpb.RecognitionConfig{
                        Encoding:                   speechpb.RecognitionConfig_LINEAR16,
                        SampleRateHertz:            8000,
                        LanguageCode:               "en-US",
                        EnableAutomaticPunctuation: true,
                        Model:                      "phone_call",
                        // Model: "video",
                        UseEnhanced: true,
                        // SpeechContexts: []*speechpb.SpeechContext{{Phrases: []string{}}},
                },
                Audio: &speechpb.RecognitionAudio{
                        AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
                },
        })
        if err != nil {
                log.Fatalf("failed to recognize: %v", err)
        }

        // Prints the results.
        for _, result := range resp.Results {
                for _, alt := range result.Alternatives {
                        fmt.Printf("\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
                }
        }
}
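
The request above hard-codes LINEAR16 at 8000 Hz, so it can be worth checking that the recording really matches before sending it off. Here is a rough sketch of such a check, not part of the original programs: it reads the "fmt " chunk of a WAV file and reports whether it is 8 kHz, mono, 16-bit PCM. The names wavFormat, parseWavFormat and isTelephonyPCM are made up for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// wavFormat holds the fields of a WAV "fmt " chunk that matter here.
// (Hypothetical helper type, not from the original post.)
type wavFormat struct {
	AudioFormat   uint16 // 1 means uncompressed PCM
	Channels      uint16
	SampleRate    uint32
	BitsPerSample uint16
}

// parseWavFormat walks the RIFF chunks in data and decodes the "fmt " chunk.
func parseWavFormat(data []byte) (wavFormat, error) {
	if len(data) < 12 || string(data[0:4]) != "RIFF" || string(data[8:12]) != "WAVE" {
		return wavFormat{}, errors.New("not a RIFF/WAVE file")
	}
	le := binary.LittleEndian
	for pos := 12; pos+8 <= len(data); {
		id := string(data[pos : pos+4])
		size := int(le.Uint32(data[pos+4 : pos+8]))
		body := pos + 8
		if id == "fmt " {
			if size < 16 || body+16 > len(data) {
				return wavFormat{}, errors.New("truncated fmt chunk")
			}
			return wavFormat{
				AudioFormat:   le.Uint16(data[body : body+2]),
				Channels:      le.Uint16(data[body+2 : body+4]),
				SampleRate:    le.Uint32(data[body+4 : body+8]),
				BitsPerSample: le.Uint16(data[body+14 : body+16]),
			}, nil
		}
		if size < 0 || size > len(data)-body {
			break // chunk claims more bytes than the file holds
		}
		pos = body + size + size%2 // chunks are padded to an even length
	}
	return wavFormat{}, errors.New("no fmt chunk found")
}

// isTelephonyPCM reports whether f is 8 kHz, mono, 16-bit PCM,
// i.e. what the RecognitionConfig in the STT program assumes.
func isTelephonyPCM(f wavFormat) bool {
	return f.AudioFormat == 1 && f.Channels == 1 &&
		f.SampleRate == 8000 && f.BitsPerSample == 16
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: wavcheck FILE")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	f, err := parseWavFormat(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v telephony-ready=%v\n", f, isTelephonyPCM(f))
}
```

Run it against ../out.wav before the STT program; if it does not report telephony-ready, either re-record or adjust SampleRateHertz to match the file.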

This one generates a grammatically formatted .wav file given a phone number and a country code. Call it with the phone number as argument 1 and the country code as argument 2, e.g. US for the US, FR for France. The output is obviously in US English, but of course you can use any of gcloud’s supported languages/voices.

package main

import (
        "context"
        "fmt"
        "io/ioutil"
        "log"
        "os"

        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        // Assumed import: the post does not show where phonenumbers comes from.
        "github.com/nyaruka/phonenumbers"
        texttospeechpb "google.golang.org/genproto/googleapis/cloud/texttospeech/v1"
)

func main() {
        if len(os.Args) < 3 {
                log.Fatalf("Usage: %s <phone number> <country code>", os.Args[0])
        }
        ctx := context.Background()

        // Parses the number for the given region, e.g. US or FR.
        num, err := phonenumbers.Parse(os.Args[1], os.Args[2])
        if err != nil {
                log.Fatalf("Failed to parse phone number: %v", err)
        }
        e164 := phonenumbers.Format(num, phonenumbers.E164)
        natnum := phonenumbers.Format(num, phonenumbers.NATIONAL)
        message := "Area Code " + natnum

        // Instantiates a client.
        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                log.Fatal(err)
        }
        defer client.Close()

        req := texttospeechpb.SynthesizeSpeechRequest{
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Text{Text: message},
                },
                Voice: &texttospeechpb.VoiceSelectionParams{
                        LanguageCode: "en-US",
                        Name:         "en-US-Wavenet-D",
                        SsmlGender:   texttospeechpb.SsmlVoiceGender_MALE,
                },
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding:   texttospeechpb.AudioEncoding_LINEAR16,
                        SampleRateHertz: 8000,
                },
        }

        resp, err := client.SynthesizeSpeech(ctx, &req)
        if err != nil {
                log.Fatal(err)
        }

        // The resp's AudioContent is binary.
        filename := "../out.wav"
        err = ioutil.WriteFile(filename, resp.AudioContent, 0644)
        if err != nil {
                log.Fatal(err)
        }

        fmt.Printf("Audio content for %v written to file: %v\n", e164, filename)
}

Hopefully I’ll get back to it and finish it off, but here you can see the core functionality of both TTS and STT. Splicing these two functions into the Asterisk dialplan is IMHO the easiest part.
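
For the ‘call Sue’ idea raised earlier in the thread, the step between the transcript and the dial might look roughly like this. It is only a sketch under assumptions: the directory table, the extensions in it, and the helper names parseDialCommand and lookupNumber are all invented for illustration, and it only understands transcripts of the form "call <name>".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// directory is a hypothetical name-to-extension table; a real setup
// would load this from a database or the PBX configuration.
var directory = map[string]string{
	"sue": "1001",
	"bob": "1002",
}

// parseDialCommand pulls the callee name out of a transcript such as
// "Call Sue." It lower-cases the text and drops punctuation first,
// since the STT result may arrive capitalized and punctuated.
func parseDialCommand(transcript string) (string, bool) {
	words := strings.FieldsFunc(strings.ToLower(transcript), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	if len(words) < 2 || words[0] != "call" {
		return "", false
	}
	return strings.Join(words[1:], " "), true
}

// lookupNumber maps a transcript to the number to dial.
func lookupNumber(transcript string, dir map[string]string) (string, error) {
	name, ok := parseDialCommand(transcript)
	if !ok {
		return "", errors.New("not a call command")
	}
	number, found := dir[name]
	if !found {
		return "", fmt.Errorf("no directory entry for %q", name)
	}
	return number, nil
}

func main() {
	for _, t := range []string{"Call Sue.", "call bob", "what time is it"} {
		number, err := lookupNumber(t, directory)
		if err != nil {
			fmt.Printf("%q: %v\n", t, err)
			continue
		}
		fmt.Printf("%q: dial %s\n", t, number)
	}
}
```

In practice the transcript would come from alt.Transcript in the STT program, and the resulting number would be handed back to the dialplan, for instance from an AGI script.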

