Any known Text to Speech API that supports SSML and doesn't need an Internet connection?

Started by
2 comments, last by Zouflain 4 years, 4 months ago

I note that text to speech has come a LONG way. It's not perfect, but it's better than Microsoft Sam.

That said, I use a LOT of procedural text generators in my games, and it would be trivial to add markup to the generator files that silently produces the corresponding SSML in the background. The next logical leap is to use a Text to Speech engine like Amazon Polly to generate voices for in-game characters (and put a large portion of the amateur voice acting crowd out of business - sorry!), but there are two issues. One, it's financial suicide to pay per character. And more pertinently, two, it makes no sense to connect to a remote server. What, the user loses audio if they have no Internet connection? Madness.
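To make the markup idea concrete, here is a rough C++ sketch, assuming a hypothetical generator that hands back a list of fragments, each with its text plus an optional emphasis flag and pause length. `Fragment`, `xmlEscape` and `toSsml` are invented for this example; only the element names (`speak`, `emphasis`, `break`) come from the W3C SSML specification, and how faithfully a particular engine honours them varies.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical output unit of a procedural text generator.
struct Fragment {
    std::string text;         // plain text; the generator supplies its own spacing
    bool emphasised = false;  // wrap in <emphasis level="strong">
    int pauseMs = 0;          // if > 0, emit <break time="...ms"/> after the text
};

// Escape the five XML special characters so generated text cannot break the markup.
std::string xmlEscape(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default: out += c;
        }
    }
    return out;
}

// Assemble one SSML document from the generated fragments, in order.
std::string toSsml(const std::vector<Fragment>& fragments) {
    std::string ssml =
        "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
    for (const Fragment& f : fragments) {
        assert(f.pauseMs >= 0 && "pause lengths must not be negative");
        if (f.emphasised) {
            ssml += "<emphasis level=\"strong\">" + xmlEscape(f.text) + "</emphasis>";
        } else {
            ssml += xmlEscape(f.text);
        }
        if (f.pauseMs > 0) {
            ssml += "<break time=\"" + std::to_string(f.pauseMs) + "ms\"/>";
        }
    }
    ssml += "</speak>";
    return ssml;
}
```

Escaping matters here: generated text containing `&` or `<` would otherwise produce SSML that an engine may reject outright.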

So my question is: does anyone know of a Text to Speech API (for C++), either free for commercial use or with a reasonable licensing structure, that DOES NOT connect online? That is, everything required to produce voice from an SSML string is on the end user's machine.

I tried to get SSML to work with the Microsoft SAPI API previously for a non-game project and didn't get anywhere. While the APIs exist (SPF_PARSE_SSML etc.), it doesn't seem like any voices with that capability actually ship with Windows, and I always got an error return. Apparently there are a lot of custom voices around. If you manage to make that work, I would certainly be interested.

Another offline one I have used is CMU Flite, but people here seem to like its quality a lot less than the Amazon/Google/Microsoft solutions. I think adding better voices than the default ones might give a much nicer result, and if you figure out how to do that, you could possibly get enough variety for game purposes. It also has some SSML support, but I never tested it, as it wasn't a priority while other TTS providers were more desired. I would be interested to know, as it's the only free cross-platform one I'm using right now.

There is also Festival/Festvox; I have no experience with it, however. I believe it's more focused on the research side than on being an end-product TTS engine. I believe you can convert these voices to be Flite-compatible.

I also saw Cepstral come up a few times, but no customer has asked for it and it's non-free, so I haven't investigated it. I have no idea what their pricing model is, or whether you can get a reasonable licence to embed it into a product rather than a server licence.

Other than those, Amazon, Google and Microsoft each have their own cloud offering. The pricing didn't seem that bad to me, especially with caching, and the APIs were easy to deal with (I forget which, but some had C/C++ SDKs, and for some I used the HTTP API directly with my own client code). It depends on just how many hours of TTS you are thinking of (practically reading a novel?), and of course on the online requirement.
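The caching point can be sketched as a small in-memory memo table keyed on the exact SSML string, so an identical line is synthesised (and, with a cloud service, billed) only once. The `TtsCache` class and its synthesizer callback are hypothetical stand-ins for whatever SDK, HTTP client or local engine is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Raw 16-bit PCM returned by whatever TTS backend is plugged in.
using Pcm = std::vector<std::int16_t>;

// Memoises synthesis by the exact SSML string.
class TtsCache {
public:
    using Synthesizer = std::function<Pcm(const std::string& ssml)>;

    explicit TtsCache(Synthesizer synth) : synth_(std::move(synth)) {
        assert(synth_ && "a synthesizer callback is required");
    }

    // Returns cached audio, calling the backend only on a miss.
    // References stay valid: unordered_map rehashing does not move elements.
    const Pcm& get(const std::string& ssml) {
        auto it = cache_.find(ssml);
        if (it == cache_.end()) {
            it = cache_.emplace(ssml, synth_(ssml)).first;
            ++misses_;
        }
        return it->second;
    }

    // Number of times the backend was actually called.
    std::size_t misses() const { return misses_; }

private:
    Synthesizer synth_;
    std::unordered_map<std::string, Pcm> cache_;
    std::size_t misses_ = 0;
};
```

This only pays off when lines repeat verbatim; with heavily procedural dialogue the hit rate may be low, and a real setup might also persist the cache to disk.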

Zouflain said:
procedural text generators

How procedural? If you are mixing pre-made parts together, you can get some OK results by having corresponding pre-recorded TTS clips and joining them together (which you need to do with semi-dynamic voice-acted lines as well).
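Joining pre-rendered clips is mostly buffer concatenation. A minimal sketch, assuming each clip has already been rendered to mono 16-bit PCM at one shared sample rate (`Clip` and `joinClips` are hypothetical names, and the default rate is arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A pre-rendered TTS clip: mono 16-bit PCM at a fixed sample rate.
struct Clip {
    int sampleRate = 22050;
    std::vector<std::int16_t> samples;
};

// Concatenate clips in order, inserting gapMs of silence between neighbours.
// All clips must share one sample rate; no resampling or crossfading is done.
Clip joinClips(const std::vector<Clip>& parts, int gapMs) {
    assert(gapMs >= 0 && "gap must not be negative");
    Clip out;
    if (parts.empty()) return out;
    out.sampleRate = parts.front().sampleRate;
    const std::size_t gapSamples =
        static_cast<std::size_t>(out.sampleRate) * static_cast<std::size_t>(gapMs) / 1000;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        assert(parts[i].sampleRate == out.sampleRate && "resample clips before joining");
        if (i > 0) out.samples.insert(out.samples.end(), gapSamples, std::int16_t{0});
        out.samples.insert(out.samples.end(), parts[i].samples.begin(), parts[i].samples.end());
    }
    return out;
}
```

Since nothing smooths the joins, audible seams between clips are likely; hiding those is probably where most of the real effort would go.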

This was something I was actually thinking of doing in a game context, to get something of a hopefully passable standard cheaply. The pricing in a do-once context is really cheap; it's more a question of quality, and of whether it's actually better than just having the player read the text in the end.

SyncViews said:

How procedural? If you are mixing pre-made parts together, you can get some OK results by having corresponding pre-recorded TTS clips and joining them together (which you need to do with semi-dynamic voice-acted lines as well).

Extremely procedural. Was it Ken Levine who did the Narrative Legos talk? Well, whoever it was, think of a generator for that sort of game. Virtually all the dialogue in the game would be generated on the fly at runtime, so there's very little prebaking I could do, if any. Also, a lot of the SSML would change (procedural characters sounds like procedural tonality to me!), so pay-as-you-go is out of the question too.
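One way to read "procedural tonality" is to derive each character's voice settings from a per-character seed and emit them as an SSML `<prosody>` wrapper. The sketch below assumes such a numeric seed exists; it and `VoiceProsody`, `prosodyForCharacter` and `wrapInProsody` are all hypothetical. `pitch` and `rate` are standard SSML `prosody` attributes, but support for them differs between engines and voices.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string>

// Per-character voice settings expressed as SSML <prosody> attributes.
struct VoiceProsody {
    int pitchPercent;  // relative pitch change, e.g. -8 -> pitch="-8%"
    int ratePercent;   // speaking rate as a percentage of normal, e.g. 95 -> rate="95%"
};

// Derive stable prosody from a character's seed: the same seed always gives
// the same voice. Raw std::mt19937 output is fully specified by the standard,
// so results match across compilers (unlike std::uniform_int_distribution).
VoiceProsody prosodyForCharacter(std::uint32_t seed, int spreadPercent = 15) {
    assert(spreadPercent >= 0 && spreadPercent <= 50);
    std::mt19937 rng(seed);
    const std::uint32_t span = static_cast<std::uint32_t>(2 * spreadPercent + 1);
    const int pitch = static_cast<int>(rng() % span) - spreadPercent;       // [-spread, +spread]
    const int rate = 100 + static_cast<int>(rng() % span) - spreadPercent;  // [100-spread, 100+spread]
    return {pitch, rate};
}

// Wrap one generated line (already XML-escaped) in that character's prosody.
std::string wrapInProsody(const std::string& escapedLine, const VoiceProsody& p) {
    const std::string sign = p.pitchPercent >= 0 ? "+" : "";
    return "<prosody pitch=\"" + sign + std::to_string(p.pitchPercent) + "%\" rate=\"" +
           std::to_string(p.ratePercent) + "%\">" + escapedLine + "</prosody>";
}
```

Because the settings come from the seed rather than being stored, a character keeps the same voice across sessions without saving anything beyond the seed itself.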

It may still be a pipe dream. The technology has improved by leaps and bounds over the past few years, but it's no small ask. Thank you for the suggestions; I'll definitely be poring over them and hoping for a hit!
