YAD Model: Data Prep Engine
- Technologies used: Python
- GitHub URL: here
- Create new Menyo-20k datasets (train and test) based on specific rules
- Create Menyo train datasets without processing
- Create Menyo train and dev datasets without accents and/or underdots
- Prepare the JW300 training dataset
- Prepare mixed training data
- Create Global Voices training datasets with accents
- Create Global Voices datasets with only accents
- Create and merge Yoruba greetings datasets with and without accents
- Prepare Microsoft (MSFT) data
- Combine Yoruba Bible texts into one file
- Create new Yoruba datasets based on specific rules
- Create Yoruba datasets by removing accents and underdots
- Split combined Bible data
- Remove accents and underdots from text
- Remove accents and underdots specifically for the YAD test data
- Combine different datasets into a unified format for training and testing
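
Several of the steps above involve removing accents and/or underdots from Yoruba text. The repository's actual implementation is not shown here, so the following is only a minimal sketch of how such a step might look using Python's standard `unicodedata` module. The function name `strip_diacritics`, its two flags, and the choice of combining marks treated as "accents" and "underdots" are assumptions made for illustration.

```python
import unicodedata

# Assumption: tone accents are the combining grave, acute, and macron marks.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}
# Assumption: underdots are the combining dot below, plus the vertical line
# below that some Yoruba texts use in its place.
UNDERDOTS = {"\u0323", "\u0329"}


def strip_diacritics(text, remove_accents=True, remove_underdots=True):
    """Return text with the selected combining marks removed (hypothetical helper)."""
    drop = set()
    if remove_accents:
        drop |= TONE_MARKS
    if remove_underdots:
        drop |= UNDERDOTS
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in drop)
    # NFC recomposes whatever marks remain into precomposed letters.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    print(strip_diacritics("Ẹ káàárọ̀"))  # prints "E kaaaro"
```

Keeping the two flags separate mirrors the "accents and/or underdots" variants listed above: the same function could produce a dataset with no diacritics at all, one that keeps only underdots, or one that keeps only tone accents.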