Ethiopian Speech

About

What is the Ethiospeech Project?

Almost all Ethiopian languages are low-resourced and technologically disadvantaged, with no speech technology available for them. Even research on automatic speech recognition (ASR) for Ethiopian languages is limited to a few languages, although more than 80 languages are spoken in the country. No Ethiopian language has a speech corpus larger than 50 hours. To the best of our knowledge, even Amharic, which is the most researched of the Ethiopian languages, has only 44 hours of speech corpus. This is an extremely small amount compared with the data currently used to develop ASR technologies for languages such as English. The corpus looks smaller still when it comes to data-hungry algorithms, especially deep learning algorithms. Most of the other Ethiopian languages have no transcribed speech corpus at all. In short, Ethiopian languages either have no speech corpus at all or have only a small amount of speech data.

We have been working for about two years, with the support of the Lacuna Fund, on the preparation of speech corpora for six Ethiopian languages. We chose these languages because they have been declared official languages of the Ethiopian regional states. According to Wikipedia, an official language is a language given a special status in a particular country. In 2020, Ethiopia declared four more official languages in addition to Amharic. The newly added official languages are Tigrinya, Oromo, Somali and Afar. In this project, we develop speech corpora for these five official languages (Amharic, Afar, Oromo, Somali and Tigrinya) and for the Sidamo language, which recently became a regional language of the newly established Sidama region.

These languages are spoken by a significant number of speakers in the country and in different parts of the world, especially in Eastern Africa. Outside of Ethiopia, Somali is spoken in Somalia as well as Djibouti, and Afar is spoken in Djibouti and Eritrea. Moreover, these languages play important social, political, economic and cultural roles.

Besides being official languages of the country, Amharic and Tigrinya were selected because they are spoken by a considerable number of people at the federal as well as regional levels in Ethiopia. Tigrinya is spoken in Eritrea as well. Both Amharic and Tigrinya are also spoken in other parts of the world, including the US, Europe and Israel.

Although there are existing resources for Amharic and Tigrinya, we have augmented the corpora because the existing ones are not large enough for state-of-the-art machine learning research or for the development of usable speech applications. For Amharic, two separately prepared speech corpora (Abate et al., 2005; Abate et al., 2020), referred to as AMH2005 and AMH2020, are available. AMH2005 consists of 20 hours of training speech, while AMH2020 consists of 24 hours of training speech. That means a total of 44 hours of training speech is available for the language. Although these resources have facilitated research in the area, the amount is still insignificant compared to the amount of data used in the research community, especially with data-hungry state-of-the-art machine learning algorithms such as deep neural networks. We therefore augmented the Amharic speech data. For Tigrinya, there exist two speech corpora: about 18 hours of speech developed by Abera and H/Mariam (2018) and 22 hours of training speech developed by Abate et al. (2020). Again, a total of 40 hours of training speech is insignificant for machine learning.

We also worked on the creation and/or augmentation of data sets for Oromo, Somali and Afar. Oromo is spoken by a considerable number of people in Ethiopia. Although a speech corpus consisting of 22.8 hours of training speech exists for it (Abate et al., 2020), this data set is clearly small. We therefore augmented the speech corpus.

Somali and Afar are spoken not only in Ethiopia but also in other countries of the Horn of Africa. Somali is spoken in Ethiopia, Djibouti and Somalia, while Afar is spoken in Ethiopia, Eritrea and Djibouti. Thus the development of speech corpora for these languages will facilitate speech processing research and application development in the Horn of Africa. The people of these countries will therefore benefit from the resulting applications.

The Sidamo (Sidamu Afoo) language is spoken in the southern part of Ethiopia, especially in the Sidama Regional State, the 10th regional state of Ethiopia, established in June 2020. We are not aware of any usable data sets for this language. The development of data sets for this language therefore facilitates research and the development of speech applications, which will contribute to the development of the language as well as the region.

About the Project

What is the project all about?

The Ethio Speech Corpora is a data set of six read speech corpora developed for the six official languages of the Ethiopian regional states. The languages are: Amharic, Tigrinya, Oromo, Somali, Afar and Sidama. The corpora contain orthographic transcriptions and about 391 hours of corresponding speech. The corpora were developed primarily for building Automatic Speech Recognition Systems (ASRSs) for the corresponding languages in a monolingual setup. They can, however, also be used to develop multilingual or cross-lingual ASRSs for other related languages.

Collected by

The corpora were developed by a project financed by the Lacuna Fund. The core team members of the project are six experienced Ethiopian researchers in the area of speech processing. These corpora are available for strictly non-commercial purposes through this official website. They are distributed under the CC-BY 4.0 International license as per the policy of the Lacuna Fund.

Supported by

Thanks to the financial support of the Lacuna Fund and to the highly motivated and dedicated team members of the project, as well as the assistants in the recording process, the project resulted in six well-organized, high-quality speech corpora for six Ethiopian languages.