Cantonese is one of the most widely spoken languages in Hong Kong. It is written in traditional Chinese characters, much like those used in Taiwan. With Google search by voice already available in a number of languages, adding Cantonese was one of the tech giant's most awaited moves.

However, the development of Google voice search in Cantonese involved many challenges.

A few of them are listed below:

  • Data Collection: Building a new recognition system requires both audio and text data. Google's DataHound collection technique helped gather audio data by using smartphones to record and upload audio samples in Cantonese. For text data, anonymized search query logs were sampled to provide the data needed for training language models.
  • Chinese Word Boundaries: Written Chinese does not put spaces between words, so Google voice search in Cantonese uses characters, rather than words, as its basic units. This limits the size of the vocabulary the speech recognizer has to handle and simplifies lexicon development.
  • Tone: Cantonese has roughly six to ten tones, depending on how they are counted. The application models each tone-plus-vowel combination as a single unit. It also merges rarely used tone-vowel combinations into single models to make the resulting model less complex.
  • Accents and Noisy Environment: People using Google voice search speak with different accents and in different styles. To cope with this, the application incorporates data collected from many volunteers with different accents.
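
The character-based units and the merging of rare tone-vowel combinations described above can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not Google's implementation: the function names, the Jyutping-style labels such as "aa1" (vowel plus tone number), the counts, and the threshold are all hypothetical, and the rule that rare combinations of the same vowel share one model is an assumption, since the article does not say how the merging is done.

```python
from collections import Counter


def to_character_units(query: str) -> list[str]:
    """Split a query into single-character units, ignoring any whitespace.

    Using characters instead of words avoids having to find word boundaries.
    """
    return [ch for ch in query if not ch.isspace()]


def merge_rare_units(unit_counts: Counter, min_count: int) -> dict[str, str]:
    """Map each tone-plus-vowel unit to the model that should represent it.

    Units seen at least `min_count` times keep their own model. Rarer units
    are merged into one shared model per vowel (an assumed merging rule).
    """
    mapping = {}
    for unit, count in unit_counts.items():
        if count >= min_count:
            mapping[unit] = unit
        else:
            vowel = unit.rstrip("0123456789")  # drop the trailing tone number
            mapping[unit] = vowel + "_rare"
    return mapping


if __name__ == "__main__":
    # Hypothetical query and hypothetical unit counts, for illustration only.
    print(to_character_units("香港 天氣"))
    counts = Counter({"aa1": 500, "aa2": 3, "aa5": 2, "i1": 400})
    print(merge_rare_units(counts, min_count=10))
```

In this toy example the four tone-vowel units collapse into three models, because the two rare "aa" units share one. A real system would pick the threshold and the merging rule from its training data.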

