Unit Size In Unit Selection Speech Synthesis

EUROSPEECH 2003 - GENEVA
3.2. Syllabification Rules

In Hindi, words can be composed of basic characters (for example, samay [time]) as well as complex clusters of the form C*VC* (for example, san’sthaa [organization]). For the latter cases, rules are needed to break the word into syllables. We derived a set of simple syllabification rules, i.e. rules for grouping clusters of C*VC*, based on a heuristic analysis of several words in Telugu and Hindi.

• When a nasal such as /n’/ (a half-pronounced /m/ or /n/ sound; refer to Figure 1, where Hindi characters are represented in ITRANS-3, a transliteration scheme) immediately follows a vowel, it is treated as part of the vowel and therefore of the same syllable. For example, /n’/ in san’sthaa is part of the syllable containing /sa/.

• When there are three or more consonants between two consecutive vowels, the first consonant becomes part of the coda of the previous syllable, while the remaining consonants form the onset of the next syllable. Applying these rules to san’skrit [sanskrit], the resulting syllable sequence is /san’s/ /krit/.

• When there are exactly two consonants between two vowels, the first consonant becomes part of the coda of the previous syllable and the second forms the onset of the next syllable. For example, dharti [earth] is split as /dhar/ /ti/. The exceptions to this rule are the following cases.

– When the second consonant is a member of the set { /r/ /s/ /sh/ /shh/ }, both consonants become part of the onset of the next syllable. For example, yaatra [tour] is split as /yaa/ /tra/.

3.3. Hindi Speech Database

To build a unit selection speech synthesizer for Hindi, our first task was to define the phoneme set and then to construct a set of prompts that best covers the language. We generated a prompt list covering most of the high frequency syllables in Hindi. A syllable is said to be a high frequency syllable if its frequency (occurrence) count in a given text corpus is relatively high. We used a large text corpus for which frequency counts of the syllables in Indian languages are available [4]. This text corpus contains text collected from various subjects ranging from philosophy to short stories. We selected a sentence from this corpus if it contained at least one instance of a high frequency syllable not present in the previously selected sentences. These sentences were examined by a linguist, primarily to break the longer sentences into smaller ones and to make these smaller sentences meaningful and easy to utter. The selected sentences were recorded by a female speaker, and a speech corpus of about 96 minutes was generated. The recording was done in a quiet room with a noise canceling microphone using the recording facilities of a typical multimedia computer system. The speech database was labeled at the phone level and the label boundaries were hand-corrected.

The duration of the speech data used in this study is about 90 minutes, and it has 620 utterances with 2344 syllables (22960 realizations), 1414 diphones (51282 realizations) and 48 phones (51282 realizations).

4. Unit Size

Earlier work on Indian languages [5] and preliminary experiments with this Hindi database [6] suggested that a syllable based approach to synthesis could lead to more reliable quality.

There have been various suggestions on unit size for unit selection systems. [7] and other HMM-based techniques typically use sub-phonetic units: two or three per phoneme. AT&T’s NextGen [8] uses half phones. FestVox’s default method uses a phone based technique. However, because FestVox supports a method of optimal coupling [9], the join points may be moved within the preceding unit; thus, with phone-sized units, something more like diphones is actually selected.

Larger units are also possible, from demi-syllables to syllables and larger. [10] tie the phones to words for domain synthesis; although this is not the same as having word-sized units, it is in that direction. The choice of unit size is an optimization problem: the larger the units, the fewer the discontinuities in synthesis, but the harder it is to ensure general coverage. Smaller units make it easier to cover the space of acoustic units, but at the cost of more joins.

The choice of unit size is also related to the language itself. Languages with a well defined and small set of syllables may benefit from a syllable sized unit. As Hindi has a much more regular syllable structure than English, we wanted to experiment to find the optimal sized unit for Hindi synthesis.

5. Experiment

In order to investigate the optimal unit size we built synthesizers under four different conditions: syllable, diphone, phone and half phone.

The phone synthesizer, the base case, was built with the phone set, letter to sound rules and syllabification rules defined for Indian languages.

To build the diphone synthesizer we tagged each phone with its preceding phone; thus units were still actually one phone in length, but they were sub-typed based on their previous phone.

For the syllable based synthesizer, we treated the 2344 distinct syllables in the database as "phones" and listed them in our phoneset. These syllable-sized phones were assigned phonetic features based on their combined consonant and vowel parts, with the consonant in the onset given preference over the consonant in the coda. Thus the units in the inventory became full syllables rather than traditional phonemes. The lexicon parser was modified to generate these syllable-based phones rather than traditional phone names.

In implementing the half phone synthesizer, each vowel was represented by two half phones, while the consonants were full phones. Two phone symbols were defined for each vowel in the phoneset; for example, the vowel /a/ was represented by /a 1/ and /a 2/. Labels at the half phone level were derived by dividing the vowel segment equally into two half phones. The lexicon parser was also modified accordingly to generate appropriate phone strings.

For perceptual evaluation of these synthesizers, we selected a set of 24 sentences from a Hindi news bulletin. The content of this bulletin was mostly about the political affairs of the world in the middle of March 2003. The syllables and diphones present in these 24 sentences were covered in the corresponding synthesizers. These sentences were synthesized by the phone, diphone, syllable and half phone synthesizers and were subjected to a perceptual test by native Hindi speakers. The people who participated in these perceptual tests were working persons and graduate students, and none of them had any experience in speech synthesis. Each listener was subjected to an AB-test, i.e. the same sentence synthesized by two different
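
The syllabification rules of Section 3.2 can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes the word has already been split into phone symbols, the vowel inventory `VOWELS` is a hypothetical subset, the ASCII string `"n'"` stands in for the nasal written /n’/ in the text, and the handling of a single consonant between vowels (sent to the next onset) is an assumption, since the rules above do not cover that case.

```python
# Illustrative sketch of the Section 3.2 rules; names and inventories are assumptions.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}  # assumed subset
NASAL = "n'"                          # stands in for the nasal /n'/ of the text
ONSET_PULLERS = {"r", "s", "sh", "shh"}  # second-consonant exception set


def syllabify(phones):
    """Group a list of phone symbols into syllables (lists of phones)."""
    # Rule 1: a nasal right after a vowel is merged into that vowel's nucleus.
    units = []  # each entry: (is_nucleus, [phones])
    i = 0
    while i < len(phones):
        if phones[i] in VOWELS:
            nucleus = [phones[i]]
            if i + 1 < len(phones) and phones[i + 1] == NASAL:
                nucleus.append(NASAL)
                i += 1
            units.append((True, nucleus))
        else:
            units.append((False, [phones[i]]))
        i += 1

    nuclei = [k for k, (is_nucleus, _) in enumerate(units) if is_nucleus]
    if not nuclei:
        return [list(phones)] if phones else []

    syllables, start = [], 0
    for a, b in zip(nuclei, nuclei[1:]):
        cluster = b - a - 1  # consonants between two consecutive nuclei
        if cluster <= 1:
            coda = 0         # assumption: a lone consonant starts the next syllable
        elif cluster == 2 and units[b - 1][1][0] in ONSET_PULLERS:
            coda = 0         # exception: both consonants go to the next onset
        else:
            coda = 1         # two, or three and more: first consonant is the coda
        end = a + 1 + coda
        syllables.append([p for _, ps in units[start:end] for p in ps])
        start = end
    syllables.append([p for _, ps in units[start:] for p in ps])
    return syllables
```

With these assumptions, the phone sequences for san’skrit, dharti and yaatra come out grouped the same way as the worked examples in Section 3.2.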
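
The diphone and half phone conditions of Section 5 both amount to relabeling phone-level units, which the following Python sketch tries to make concrete. Everything named here is hypothetical: the `prev_phone` naming convention, the `"pau"` symbol used before the first phone, the `_1`/`_2` suffixes (the text writes /a 1/ and /a 2/), the vowel set, and the `(start, end, phone)` label tuples. The actual label formats used with the database are not described in the text.

```python
# Illustrative relabeling for the diphone and half phone conditions.
# All names, symbols and label formats below are assumptions.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}  # assumed subset


def tag_with_previous(phones, boundary="pau"):
    """Sub-type each phone by its preceding phone.

    Units stay one phone long; only their type name changes.
    """
    tagged, prev = [], boundary
    for phone in phones:
        tagged.append(f"{prev}_{phone}")
        prev = phone
    return tagged


def split_vowels(labels, vowels=VOWELS):
    """Split each vowel segment at its midpoint into two half phones.

    `labels` is a list of (start_sec, end_sec, phone) tuples.
    Consonant segments are passed through unchanged as full phones.
    """
    out = []
    for start, end, phone in labels:
        if phone in vowels:
            mid = (start + end) / 2.0
            out.append((start, mid, phone + "_1"))
            out.append((mid, end, phone + "_2"))
        else:
            out.append((start, end, phone))
    return out
```

For instance, `tag_with_previous(["dh", "a", "r"])` yields three units, one per phone, each carrying its left context in its name, which mirrors the statement that the diphone units are still one phone in length.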