Chapter 14. Speech Synthesis

Kevin A. Lenzo

Talking computers are ubiquitous in science fiction; machines speak so often in the movies that we think nothing of it. From an alien robot poised to destroy the Earth unless given a speech key (“Klaatu barada nikto” in The Day the Earth Stood Still and echoed in Army of Darkness) to the terrifyingly calm HAL 9000 in 2001: A Space Odyssey, machines have communicated with people through speech. The computer in “Star Trek” has spoken and understood speech since the earliest episodes. Speech sounds easy, because it’s natural to us—but it’s not natural for computers.

Let’s ignore the problem of getting computers to think of things worth saying, and consider only turning word sequences into speech. Let’s ignore prosody, too. Prosody—the intonation, rhythm, and timing of speech—is important to how we interpret what’s said, as well as how we feel about it, but it can’t be given proper care in the span of this article, partly because it’s an almost completely unsolved problem. The input to our system is stark and minimal, as is the output—there are no lingering pauses or dynamics, no irony or sarcasm. (Not on purpose, anyway.)

If we are given plain text as input, how should it be spoken? How do we make a transducer that accepts text and outputs audio? In this article, we’ll walk through a series of Perl speech synthesizers. Many of the ideas here apply to both natural and artificial languages, and are recurrent themes in the synthesis work at Carnegie Mellon ...