Speech-to-Text

Hey guys, we have been using Watson Speech-to-Text. Although the platform is stable and gives us good transcripts, I’m wondering if anyone can suggest something better. Just looking for opinions. Thanks.

We use IBM/MS/Google regularly. For VM, I give a slight edge to Google, but all three are very comparable and the preference is largely subjective.

Google’s formatting (capitalization, punctuation, dates, phone numbers, etc.) generally wins IMO. It’s a small thing, but it makes a difference in the “at a glance” readability of the message.

Again - very subjective - it’s not like I have done any analysis of word error rates or such. Just my opinion having compared routine messages run through all three products.

Thanks

How would you go about installing something like this? Can it be used to dial out?
i.e. lift the handset and say ‘call Sue’: STT is performed, a lookup is done and Sue is dialled?

This is very far from polished code, but a world ago I was working on TTS and STT with the Google Cloud resources.

Here are a couple of functioning Go programs I kicked together to perhaps get someone going.

(You will of course need a working gcloud account.)

Speech to text (expects an Asterisk-style audio file in …/out.wav):

// Sample program that transcribes a short audio file.
// Record a 5-second, 8 kHz, mono, 16-bit test file with:
//   arecord -d 5 -r 8000 -c 1 -f S16_LE ../out.wav
package main

import (
        "context"
        "fmt"
        "io/ioutil"
        "log"

        speech "cloud.google.com/go/speech/apiv1"
        speechpb "google.golang.org/genproto/googleapis/cloud/speech/v1"
)

func main() {
        ctx := context.Background()

        // Creates a client using Application Default Credentials.
        client, err := speech.NewClient(ctx)
        if err != nil {
                log.Fatalf("Failed to create client: %v", err)
        }
        defer client.Close()

        // Sets the name of the audio file to transcribe.
        filename := "../out.wav"

        // Reads the audio file into memory.
        data, err := ioutil.ReadFile(filename)
        if err != nil {
                log.Fatalf("Failed to read file: %v", err)
        }

        // Detects speech in the audio file.
        resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
                Config: &speechpb.RecognitionConfig{
                        Encoding:                   speechpb.RecognitionConfig_LINEAR16,
                        SampleRateHertz:            8000,
                        LanguageCode:               "en-US",
                        EnableAutomaticPunctuation: true,
                        Model:                      "phone_call",
                        // Model: "video",
                        UseEnhanced: true,
                        // SpeechContexts: []*speechpb.SpeechContext{{Phrases: []string{}}},
                },
                Audio: &speechpb.RecognitionAudio{
                        AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
                },
        })
        if err != nil {
                log.Fatalf("failed to recognize: %v", err)
        }

        // Prints the results.
        for _, result := range resp.Results {
                for _, alt := range result.Alternatives {
                        fmt.Printf("\"%v\" (confidence=%.3f)\n", alt.Transcript, alt.Confidence)
                }
        }
}
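
Since the config above hard-codes LINEAR16 at 8000 Hz, it can help to check the recording before sending it off. What follows is only a rough sketch using the standard library: it assumes a canonical 44-byte WAV header with the "fmt " chunk straight after the RIFF header (files with extra chunks in front will be rejected), and the names `parseWAVHeader`, `checkTelephony` and `sampleHeader` are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// wavFormat holds the header fields that matter for the recognition config.
type wavFormat struct {
	AudioFormat   uint16 // 1 means uncompressed PCM
	Channels      uint16
	SampleRate    uint32
	BitsPerSample uint16
}

// parseWAVHeader reads the format fields from the start of a WAV file.
// Assumption: the "fmt " chunk directly follows the RIFF header.
func parseWAVHeader(b []byte) (wavFormat, error) {
	if len(b) < 44 {
		return wavFormat{}, errors.New("too short for a WAV header")
	}
	if string(b[0:4]) != "RIFF" || string(b[8:12]) != "WAVE" || string(b[12:16]) != "fmt " {
		return wavFormat{}, errors.New("not a canonical RIFF/WAVE header")
	}
	return wavFormat{
		AudioFormat:   binary.LittleEndian.Uint16(b[20:22]),
		Channels:      binary.LittleEndian.Uint16(b[22:24]),
		SampleRate:    binary.LittleEndian.Uint32(b[24:28]),
		BitsPerSample: binary.LittleEndian.Uint16(b[34:36]),
	}, nil
}

// checkTelephony returns an error unless the audio is 8 kHz mono 16-bit PCM.
func checkTelephony(f wavFormat) error {
	if f.AudioFormat != 1 || f.Channels != 1 || f.SampleRate != 8000 || f.BitsPerSample != 16 {
		return fmt.Errorf("unexpected format: %+v", f)
	}
	return nil
}

// sampleHeader builds a synthetic mono 16-bit PCM header for demonstration.
// Byte rate and block align are left at zero because nothing here reads them.
func sampleHeader(rate uint32) []byte {
	h := make([]byte, 44)
	copy(h[0:], "RIFF")
	copy(h[8:], "WAVE")
	copy(h[12:], "fmt ")
	binary.LittleEndian.PutUint32(h[16:], 16) // size of the fmt chunk
	binary.LittleEndian.PutUint16(h[20:], 1)  // PCM
	binary.LittleEndian.PutUint16(h[22:], 1)  // mono
	binary.LittleEndian.PutUint32(h[24:], rate)
	binary.LittleEndian.PutUint16(h[34:], 16) // bits per sample
	copy(h[36:], "data")
	return h
}

func main() {
	for _, rate := range []uint32{8000, 44100} {
		f, err := parseWAVHeader(sampleHeader(rate))
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%d Hz: %v\n", rate, checkTelephony(f))
	}
}
```

In the transcription program you would pass the bytes already read from the file to `parseWAVHeader` and bail out if `checkTelephony` complains.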

This one generates a .wav file that reads out a phone number in its national format. Call it with the phone number as argument 1 and the country code as argument 2, e.g. US for the United States or FR for France. The output is in US English, but of course you can use any of gcloud’s supported languages/voices.

package main

import (
        "context"
        "fmt"
        "io/ioutil"
        "log"
        "os"

        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        "github.com/nyaruka/phonenumbers"
        texttospeechpb "google.golang.org/genproto/googleapis/cloud/texttospeech/v1"
)

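func main() {
        if len(os.Args) < 3 {
                log.Fatalf("usage: %s <phone-number> <country-code>", os.Args[0])
        }

        ctx := context.Background()

        // Parses the number, using the country code as the default region.
        num, err := phonenumbers.Parse(os.Args[1], os.Args[2])
        if err != nil {
                log.Fatalf("Failed to parse phone number: %v", err)
        }
        fmt.Println(num)

        e164 := phonenumbers.Format(num, phonenumbers.E164)
        fmt.Println(e164)

        natnum := phonenumbers.Format(num, phonenumbers.NATIONAL)
        fmt.Println(natnum)

        message := "Area Code " + natnum
        fmt.Println(message)

        // Instantiates a client.
        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                log.Fatal(err)
        }
        defer client.Close()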
        req := texttospeechpb.SynthesizeSpeechRequest{
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Text{Text: message},
                },
                Voice: &texttospeechpb.VoiceSelectionParams{
                        LanguageCode: "en-US",
                        Name:         "en-US-Wavenet-D",
                        SsmlGender:   texttospeechpb.SsmlVoiceGender_MALE,
                },
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding:   texttospeechpb.AudioEncoding_LINEAR16,
                        SampleRateHertz: 8000,
                },
        }

        resp, err := client.SynthesizeSpeech(ctx, &req)
        if err != nil {
                log.Fatal(err)
        }

        // The resp's AudioContent is binary.
        filename := "../out.wav"
        err = ioutil.WriteFile(filename, resp.AudioContent, 0644)
        if err != nil {
                log.Fatal(err)
        }

        fmt.Printf("Audio content written to file: %v\n", filename)
}
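
One possible refinement, offered as an untested idea rather than something the program above does: if the voice reads digit groups as whole numbers, the message could be sent as SSML with a `say-as` telephone hint instead of plain text. The helper name `telephoneSSML` is made up here, and I am assuming the request would then carry the string via `texttospeechpb.SynthesisInput_Ssml` in place of `SynthesisInput_Text`.

```go
package main

import (
	"fmt"
	"html"
)

// telephoneSSML wraps a phone number in an SSML say-as element so the
// synthesizer is hinted to read it as a telephone number.
// The number is escaped so stray characters cannot break the markup.
func telephoneSSML(number string) string {
	return `<speak>Area Code <say-as interpret-as="telephone">` +
		html.EscapeString(number) +
		`</say-as></speak>`
}

func main() {
	// The national format of a made-up US number.
	fmt.Println(telephoneSSML("(415) 555-2671"))
}
```

How much this improves the reading depends on the voice and language, so it is worth comparing both outputs by ear.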

Hopefully I’ll get back to it and finish it off, but here you can see the core functionality of both TTS and STT. Splicing these two functions into an Asterisk dialplan is IMHO the easiest part.
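
For the "call Sue" idea asked about earlier in the thread, the step between the transcript and the dial could look roughly like this. It is a minimal sketch only: the `contacts` map, the extensions in it and the function name `parseDialCommand` are all hypothetical, and a real setup would pull the directory from wherever your contacts live and hand the result to the dialplan.

```go
package main

import (
	"fmt"
	"strings"
)

// contacts is a hypothetical directory mapping spoken names to extensions.
var contacts = map[string]string{
	"sue": "1001",
	"bob": "1002",
}

// parseDialCommand takes an STT transcript such as "Call Sue." and returns
// the extension to dial. It reports false when the transcript does not
// start with "call" or the name is not in the directory.
func parseDialCommand(transcript string) (string, bool) {
	words := strings.Fields(strings.ToLower(transcript))
	if len(words) < 2 || words[0] != "call" {
		return "", false
	}
	// Strip the punctuation that automatic punctuation may have added.
	name := strings.Trim(strings.Join(words[1:], " "), ".,!?")
	ext, ok := contacts[name]
	return ext, ok
}

func main() {
	for _, t := range []string{"Call Sue.", "call bob", "hello there"} {
		ext, ok := parseDialCommand(t)
		fmt.Printf("%q -> extension %q (matched: %v)\n", t, ext, ok)
	}
}
```

Exact matching like this is brittle with real transcripts, so some fuzzy matching or phrase hints on the STT side would probably be needed in practice.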
