YAD Model: Data Prep Engine

Technologies used: Python
GitHub URL: here

Create new Menyo-20k datasets (train and test) based on specific rules
Create Menyo train datasets without processing
Create Menyo train and dev datasets without accents and/or underdots
Prepare the JW300 training dataset
Prepare mixed training data
Create Global Voices training datasets with accents
Create Global Voices datasets with only accents
Create and merge Yoruba greetings datasets with and without accents
Prepare Microsoft (MSFT) data
Combine Yoruba Bible texts into one file
Create new Yoruba datasets based on specific rules
Create Yoruba datasets by removing accents and underdots
Split combined Bible data
Remove accents and underdots from text
Remove accents and underdots specifically for YAD test
Combine different datasets into a unified format for training and testing
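
Several of the steps above remove accents (tone marks) and/or underdots from Yoruba text. The following is a minimal sketch of how such a step could work, not the project's actual code. The function name `strip_diacritics`, its two flags, and the exact sets of combining marks are assumptions made for illustration; it relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Assumed sets of combining marks (illustrative, not taken from the project):
# tone marks: grave (U+0300), acute (U+0301), macron (U+0304)
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}
# underdots: dot below (U+0323), vertical line below (U+0329)
UNDERDOTS = {"\u0323", "\u0329"}


def strip_diacritics(text, remove_accents=True, remove_underdots=True):
    """Return text with the selected Yoruba diacritics removed.

    Hypothetical helper: decomposes the text (NFD) so each diacritic
    becomes a separate combining character, drops the selected ones,
    then recomposes (NFC) whatever remains.
    """
    drop = set()
    if remove_accents:
        drop |= TONE_MARKS
    if remove_underdots:
        drop |= UNDERDOTS

    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in drop)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    greeting = "\u1eb8 k\u00e1\u00e0\u00e1r\u1ecd\u0300"  # "Ẹ káàárọ̀"
    print(strip_diacritics(greeting))  # E kaaaro
    # Keep the underdots, remove only the tone marks.
    print(strip_diacritics(greeting, remove_underdots=False))
    # Keep the tone marks, remove only the underdots.
    print(strip_diacritics(greeting, remove_accents=False))
```

Decomposing before filtering means the same code handles both precomposed characters (such as U+1EB8) and sequences that already use separate combining marks. The two flags mirror the "accents and/or underdots" variants listed above.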