
EUROSPEECH 2003 - GENEVA

Unit Size in Unit Selection Speech Synthesis

S P Kishore and Alan W Black

Language Technologies Research Center

International Institute of Information Technology, Hyderabad

and ISRI, Carnegie Mellon University

kishore@iiit.net

Language Technologies Institute, Carnegie Mellon University

awb@cs.cmu.edu

Abstract

In this paper, we address the issue of choice of unit size in unit selection speech synthesis. We discuss the development of a Hindi speech synthesizer and our experiments with different choices of units: syllable, diphone, phone and half phone. Perceptual tests conducted to evaluate the quality of the synthesizers with different unit sizes indicate that the syllable synthesizer performs better than the phone, diphone and half phone synthesizers, and the half phone synthesizer performs better than the diphone and phone synthesizers.

1. Background

Most of the information in the digital world is accessible only to the few who can read or understand a particular language. Language technologies can provide solutions in the form of natural interfaces so that digital content can reach the masses and facilitate the exchange of information among people speaking different languages.

These technologies play a crucial role in multilingual societies such as India, which has about 1652 dialects/native languages. While Hindi, written in Devanagari script, is the official language, the other 17 languages recognized by the constitution of India are: 1) Assamese 2) Tamil 3) Malayalam 4) Gujarati 5) Telugu 6) Oriya 7) Urdu 8) Bengali 9) Sanskrit 10) Kashmiri 11) Sindhi 12) Punjabi 13) Konkani 14) Marathi 15) Manipuri 16) Kannada and 17) Nepali. Seamless integration of speech recognition, machine translation and speech synthesis systems could facilitate the exchange of information between two people speaking two different languages. Our overall goal is to develop speech recognition and speech synthesis systems for most of these languages.

In this paper we discuss the details of the development of a Hindi speech synthesizer using unit selection techniques and in particular address the issue of choice of unit size in unit selection synthesis.

2. Synthesis Framework

This work is done within the FestVox voice building framework [1], which offers general tools for building unit selection synthesizers in new languages. The unit selection paradigm is a cluster based technique where units of the same type (phones, diphones, syllables and so on) are clustered based on their acoustic differences [2]. The clusters are then indexed based on high level features such as phonetic and prosodic context. Voices generated by this system may be run in the Festival Speech Synthesis System [3].

FestVox offers a language independent method for building synthetic voices, offering mechanisms to abstractly describe phonetic and syllabic structure in the language. It is this flexibility in the language building process that we will exploit in this paper.

3. Hindi Synthesis

The basic units of the writing system in Indian languages are characters, which are an orthographic representation of speech sounds. A character in Indian language scripts is close to a syllable and can typically be of the following forms: C, V, CV, VC, CCV and CVC, where C is a consonant and V is a vowel. All Indian language scripts have a common phonetic base, and a universal phoneset consists of about 35 consonants and about 18 vowels. In Hindi, there are five vowels, five long vowels, two diphthongs, four semivowels, and 31 consonants. There are a few more vowels and consonants in Hindi, but we did not consider them as they are rarely used at present.

3.1. Letter to Sound Rules

The scripts of Indian languages are phonetic in nature. There is a more or less one to one correspondence between what is written and what is spoken. However, in Hindi the inherent vowel (short /a/) associated with a consonant is, depending on the context, not always pronounced. This is referred to as Inherent Vowel Suppression (IVS) or schwa deletion. For example, the word kamala [lotus] is mapped to the sequence of consonant and vowel sounds /k/ /a/ /m/ /a/ /l/, ignoring the vowel associated with /l/.

A set of heuristic rules to detect IVS of a consonant character is noted below. These rules have been derived by observing a few hundred Hindi words, and the rule set may not be a complete description of the phenomenon.

1 No two successive characters undergo IVS.

2 Characters in the first position of a word never undergo IVS. IVS occurs only for characters in middle and final positions.

3 For characters in final position, the inherent vowel (/a/) is always suppressed.

4 For characters in word middle position, IVS occurs if the next character in the word is not the last character or the next character has a vowel other than /a/.
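The four heuristic rules above can be illustrated with a minimal Python sketch. It is not the implementation used in this work, and several choices are assumptions made only for illustration: each character is represented as a hypothetical (consonant, vowel) pair with "a" marking the inherent vowel; only characters carrying the inherent /a/ are candidates for IVS; rule 1 is enforced by scanning left to right; rule 2 takes priority for one-character words; and rule 4 is read literally as a logical "or". The names `ivs_flags` and `to_phones` are likewise hypothetical.

```python
from typing import List, Tuple

INHERENT = "a"  # the inherent short vowel /a/

# Hypothetical representation: one script character = (consonant, vowel).
# A bare vowel character has an empty consonant string.
Char = Tuple[str, str]


def ivs_flags(word: List[Char]) -> List[bool]:
    """Return, per character, whether its inherent vowel is suppressed."""
    n = len(word)
    flags = [False] * n
    for i, (cons, vowel) in enumerate(word):
        # Assumption: only a consonant carrying the inherent /a/ can undergo IVS.
        if not cons or vowel != INHERENT:
            continue
        if i == 0:
            continue  # rule 2: first position never undergoes IVS
        if flags[i - 1]:
            continue  # rule 1: no two successive characters (left-to-right scan assumed)
        if i == n - 1:
            flags[i] = True  # rule 3: final position is always suppressed
        else:
            # Rule 4, read literally: the next character is not the last one,
            # or the next character has a vowel other than /a/.
            next_is_last = (i + 1 == n - 1)
            next_vowel = word[i + 1][1]
            flags[i] = (not next_is_last) or (next_vowel != INHERENT)
    return flags


def to_phones(word: List[Char]) -> List[str]:
    """Map a character sequence to phones, dropping suppressed inherent vowels."""
    phones: List[str] = []
    for (cons, vowel), suppressed in zip(word, ivs_flags(word)):
        if cons:
            phones.append(cons)
        if vowel and not suppressed:
            phones.append(vowel)
    return phones


kamala = [("k", "a"), ("m", "a"), ("l", "a")]
print(to_phones(kamala))  # ['k', 'a', 'm', 'a', 'l']
```

On the kamala example this reproduces the mapping /k/ /a/ /m/ /a/ /l/ given above. The rules do not state how conflicts between rule 1 and rule 4 are resolved in longer words, so the left-to-right scan is only one possible reading.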
