YAD Model: Data Prep Engine

  • Technologies used: Python
  • GitHub URL: here
  • Create new Menyo-20k datasets (train and test) based on specific rules
  • Create Menyo train datasets without processing
  • Create Menyo train and dev datasets without accents and/or underdots
  • Prepare the JW300 training dataset
  • Prepare mixed training data
  • Create Global Voices training datasets with accents
  • Create Global Voices datasets with only accents
  • Create and merge Yoruba greetings datasets with and without accents
  • Prepare Microsoft (MSFT) data
  • Combine Yoruba Bible texts into one file
  • Create new Yoruba datasets based on specific rules
  • Create Yoruba datasets by removing accents and underdots
  • Split combined Bible data
  • Remove accents and underdots from text
  • Remove accents and underdots specifically for YAD test
  • Combine different datasets into a unified format for training and testing
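
The accent and underdot removal steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name strip_marks, its parameters, and the choice of which combining marks count as "accents" (grave, acute, macron) are assumptions made for the example.

```python
import unicodedata

# Assumption: "accents" means the combining grave, acute, and macron marks
# used for Yoruba tone; "underdot" means the combining dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}
UNDERDOT = "\u0323"


def strip_marks(text, remove_accents=True, remove_underdots=True):
    """Return text with tone marks and/or underdots removed (hypothetical helper)."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if remove_accents and ch in TONE_MARKS:
            continue
        if remove_underdots and ch == UNDERDOT:
            continue
        kept.append(ch)
    # NFC recomposes whatever marks remain into precomposed letters where possible.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    sample = "Ẹ káàárọ̀"
    print(strip_marks(sample))                          # no accents, no underdots
    print(strip_marks(sample, remove_underdots=False))  # underdots kept
    print(strip_marks(sample, remove_accents=False))    # accents kept
```

Keeping the two flags separate mirrors the "accents and/or underdots" variants in the list, so one helper could produce each dataset variant from the same source text.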