Unit Size In Unit Selection Speech Synthesis

EUROSPEECH 2003 - GENEVA

3.2. Syllabification Rules

In Hindi, words can be composed of basic characters (for example samay [time]) as well as complex clusters of C*VC* (for example san’sthaa [organization]). For the latter cases, there is a need to come up with rules to break the word into syllables. We derived certain simplistic rules for syllabification, i.e. rules for grouping clusters of C*VC*, based on heuristic analysis of several words in the Telugu and Hindi languages.

• When nasals such as /n’/ (a half pronounced /m/ or /n/ sound; refer to Figure 1, where Hindi characters are represented in ITRANS-3, a transliteration scheme) immediately follow a vowel, they are treated as a part of the vowel and of the same syllable. For example, /n’/ in san’sthaa will be a part of the syllable containing /sa/.

• When there are three or more consonants between two consecutive vowels, the first consonant is a part of the coda of the previous syllable, while the remaining consonants are the onset of the next syllable. Applying these rules to san’skrit [sanskrit], the obtained syllable sequence is /san’s/ /krit/.

• When there are exactly two consonants between two vowels, the first consonant is part of the coda of the previous syllable and the second is the onset of the next syllable. For example, dharti [earth] is split as /dhar/ /ti/. The exceptions to this rule are the following cases.

  – When the second consonant is a member of the set { /r/ /s/ /sh/ /shh/ }, both consonants are a part of the onset of the next syllable. For example, yaatra [tour] is split as /yaa/ /tra/.

3.3. Hindi Speech Database

To build a unit selection speech synthesizer in Hindi, our first task was to define the phoneme set, and then to construct a set of prompts that best covers the language. We generated a prompt list covering most of the high frequency syllables in Hindi. A syllable is said to be a high frequency syllable if its frequency (occurrence) count in a given text corpus is relatively high. We used the large text corpus available with frequency counts of the syllables in Indian languages [4]. This text corpus contains text collected from various subjects ranging from philosophy to short stories. We selected a sentence from this text corpus if it contained at least one instance of a high frequency syllable not present in the previously selected sentences. These sentences were examined by a linguist, primarily to break the longer sentences into smaller ones and to make these smaller sentences meaningful and easy to utter. The selected sentences were recorded by a female speaker, and a speech corpus of about 96 minutes was generated. The recording was done in a quiet room with a noise canceling microphone, using the recording facilities of a typical multimedia computer system. The speech database was labeled at the phone level and the label boundaries were hand-corrected.

The duration of the speech data used in this study is about 90 minutes, and it has 620 utterances with 2344 syllables (22960 realizations), 1414 diphones (51282 realizations) and 48 phones (51282 realizations).

4. Unit Size

Earlier work on Indian languages [5] and preliminary experiments with this Hindi database [6] suggested that a syllable based approach to synthesis could lead to more reliable quality.

There have been various suggestions on unit size for unit selection systems. [7] and other HMM-based techniques typically use sub-phonetic units: two or three per phoneme. AT&T’s NextGen [8] uses half phones. FestVox’s default method uses a phone based technique. However, because FestVox supports a method of optimal coupling [9], the join points may be moved within the preceding unit; thus, with phone-sized units, something more like diphones is actually selected.

Larger units are also possible, from demi-syllables to syllables and larger. [10] tie the phones to words for domain synthesis; although this is not the same as having word-sized units, it is a step in that direction. The choice of unit size is an optimization problem: the larger the units, the fewer the discontinuities in synthesis, but the harder it is to ensure general coverage. Smaller units make it easier to cover the space of acoustic units, but at the cost of more joins.

The choice of unit size is also related to the language itself. Languages with a well defined and small set of syllables may benefit from a syllable sized unit. As Hindi has a much more regular syllable structure than English, we wanted to experiment to find the optimal sized unit for Hindi synthesis.

5. Experiment

In order to investigate the optimal unit size we built synthesizers under four different conditions: syllable, diphone, phone and half phone.

The phone synthesizer, the base case, was built with the phone set, letter to sound rules and syllabification rules defined for Indian languages.

To build the diphone synthesizer we tagged each phone with its preceding phone; thus units were still actually one phone in length, but they were sub-typed based on their previous phone.

For the syllable based synthesizer, we treated the 2344 distinct syllables in the database as "phones" and listed them in our phoneset. These syllable-sized phones were assigned phonetic features based on their combined consonant and vowel parts, with the consonant in the onset given preference over the consonant in the coda. Thus the units in the inventory became full syllables rather than traditional phonemes. The lexicon parser was modified accordingly to generate these syllable-based phones rather than traditional phone names.

In implementing the half phone synthesizer, each vowel was represented by two half phones, while the consonants were full phones. Two phone symbols were defined for each vowel in the phoneset; for example, the vowel /a/ was represented by /a 1/ and /a 2/. Labels at the half phone level were derived by equally dividing the vowel segment into two half phones. The lexicon parser was also modified accordingly, to generate appropriate phone strings.

For perceptual evaluation of these synthesizers, we selected a set of 24 sentences from a Hindi news bulletin. The content of this bulletin was mostly about the political affairs of the world in the middle of March 2003. The syllables and diphones present in these 24 sentences were covered in the corresponding synthesizers. These sentences were synthesized by the phone, diphone, syllable and half phone synthesizers and were subjected to a perceptual test by native Hindi speakers. The people who participated in these perceptual tests were working persons and graduate students, and none of them had any experience in speech synthesis. Each listener was subjected to an AB-test, i.e. the same sentence synthesized by two different
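As an illustration of the syllabification rules in Section 3.2, the following Python sketch applies them to transliterated words. It is not the implementation used in this work. The phone inventory is a small hypothetical subset, the nasal is written with an ASCII apostrophe (n'), and two behaviours the rules do not state are assumptions: a single consonant between vowels is taken as the onset of the next syllable, and word-final consonants join the last syllable.

```python
# Hypothetical, simplified phone inventory (an assumption, not the paper's phone set).
VOWELS = ["aa", "ii", "uu", "ai", "au", "a", "i", "u", "e", "o"]
NASALS = ["n'"]  # half pronounced nasal, written with an ASCII apostrophe here
CONSONANTS = ["shh", "sh", "kh", "gh", "ch", "jh", "th", "dh", "ph", "bh",
              "k", "g", "j", "t", "d", "n", "p", "b", "m", "y", "r", "l",
              "v", "s", "h"]
# Second consonants that pull a two-consonant cluster into the next onset.
ONSET_PULLERS = {"r", "s", "sh", "shh"}

# Longest symbols first, so that "shh" wins over "sh" and "sh" over "s".
SYMBOLS = sorted(VOWELS + NASALS + CONSONANTS, key=len, reverse=True)


def tokenize(word):
    """Split a transliterated word into phone symbols by longest match."""
    phones, i = [], 0
    while i < len(word):
        for sym in SYMBOLS:
            if word.startswith(sym, i):
                phones.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"unknown symbol at position {i} in {word!r}")
    return phones


def split_cluster(cluster):
    """Return (coda, onset) for the consonants between two vowels."""
    coda = []
    # Rule 1: a nasal right after the vowel stays with that vowel.
    if cluster and cluster[0] in NASALS:
        coda, cluster = [cluster[0]], cluster[1:]
    # Assumption: zero or one remaining consonant goes to the next onset.
    if len(cluster) <= 1:
        return coda, cluster
    # Exception to the two-consonant rule: both consonants form the onset.
    if len(cluster) == 2 and cluster[1] in ONSET_PULLERS:
        return coda, cluster
    # Two or more consonants: the first is coda, the rest are onset.
    return coda + cluster[:1], cluster[1:]


def syllabify(word):
    """Return the list of syllables of a transliterated word."""
    phones = tokenize(word)
    vowel_pos = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_pos:
        return ["".join(phones)]
    syllables, start = [], 0
    for cur, nxt in zip(vowel_pos, vowel_pos[1:]):
        coda, _onset = split_cluster(phones[cur + 1:nxt])
        end = cur + 1 + len(coda)
        syllables.append("".join(phones[start:end]))
        start = end
    syllables.append("".join(phones[start:]))  # trailing consonants join the last syllable
    return syllables
```

With this sketch, syllabify("dharti"), syllabify("yaatra") and syllabify("san'skrit") reproduce the splits given in Section 3.2: /dhar/ /ti/, /yaa/ /tra/ and /san's/ /krit/.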

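The sentence selection described in Section 3.3 can be read as a single greedy pass over the text corpus. The sketch below is one possible reading, not the procedure actually used: it assumes each sentence has already been syllabified, and the function name, the input layout and the example syllables are all hypothetical.

```python
def select_prompts(sentences, high_freq_syllables):
    """Greedy prompt selection.

    sentences: iterable of (text, syllables) pairs, in corpus order.
    high_freq_syllables: set of syllables regarded as high frequency.
    Returns the selected texts and the set of high frequency syllables covered.
    """
    covered = set()
    selected = []
    for text, syllables in sentences:
        # High frequency syllables in this sentence that are not yet covered.
        new = (set(syllables) & high_freq_syllables) - covered
        if new:
            selected.append(text)
            covered |= new
    return selected, covered


# Usage with made-up data: the second sentence adds nothing new and is skipped.
corpus = [
    ("s1", ["sa", "may"]),
    ("s2", ["sa"]),
    ("s3", ["dhar", "ti", "xx"]),
]
prompts, covered = select_prompts(corpus, {"sa", "may", "ti"})
```

A single pass in corpus order is the simplest reading of the description; a selection that ranks sentences by the number of new syllables they add would give a shorter prompt list, but the text does not say that this was done.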

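The diphone and half phone conditions of Section 5 both amount to relabeling a phone-level segmentation. The following sketch shows that relabeling under stated assumptions: the "prev-cur" naming, the "pau" boundary symbol, the "_1"/"_2" suffixes and the vowel set are illustrative choices, not the names used in the actual voices.

```python
# Hypothetical vowel set; the real phone set is not reproduced here.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o", "ai", "au"}


def to_diphone_types(phones, boundary="pau"):
    """Tag each phone with its preceding phone.

    Units remain one phone long; only their type name changes.
    """
    previous = [boundary] + list(phones[:-1])
    return [f"{prev}-{cur}" for prev, cur in zip(previous, phones)]


def to_half_phones(labels):
    """Split every vowel segment into two equal halves.

    labels: list of (phone, start_sec, end_sec) tuples.
    Consonants are kept as full phones.
    """
    out = []
    for phone, start, end in labels:
        if phone in VOWELS:
            mid = (start + end) / 2.0
            out.append((f"{phone}_1", start, mid))
            out.append((f"{phone}_2", mid, end))
        else:
            out.append((phone, start, end))
    return out
```

For the phone sequence s, a, m, to_diphone_types yields pau-s, s-a, a-m, so the inventory still holds one unit per phone but distinguishes units by left context.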