Children's Phonetic Speech Recognition

Machine Learning

CER: 0.561

PER: 0.581

Project Overview

Children’s speech recognition remains substantially more difficult than adult ASR due to acoustic and phonetic differences such as higher pitch, shorter vocal tracts, and developing pronunciation patterns. In this work, we developed a machine learning system for predicting IPA phonetic sequences directly from children’s speech audio. Potential applications include AI literacy tutors, pediatric speech therapy support tools, and more inclusive voice assistants, helping address the large accuracy gap between adult and child ASR systems.

We formulated the task as unsegmented phonetic speech recognition trained with Connectionist Temporal Classification (CTC), which learns the alignment between audio frames and IPA symbols without frame-level labels. Our final architecture was a compact ASR-native CLDNN (convolutional, LSTM, deep neural network) model that operates on log-Mel spectrograms and models temporal context with bidirectional LSTMs (BiLSTMs). The model preserves temporal acoustic structure while remaining lightweight enough to train on low-resource children’s speech data.
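To make the CTC formulation concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring symbol in each frame, merge consecutive repeats, then drop blanks. The toy vocabulary, the blank index of 0, and the helper names are assumptions made for illustration. The project description does not say which decoder was actually used.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol
VOCAB = ["<blank>", "k", "æ", "t", "s"]  # hypothetical toy IPA vocabulary


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path decoding: per-frame argmax, merge repeats, remove blanks."""
    # Index of the highest-scoring symbol in every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]

    decoded: List[int] = []
    prev = None
    for idx in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def ids_to_ipa(ids: Sequence[int]) -> str:
    """Map symbol indices back to an IPA string using the toy vocabulary."""
    return "".join(VOCAB[i] for i in ids)


def one_hot(index: int, size: int = len(VOCAB)) -> List[float]:
    """Stand-in for a model's per-frame output scores."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Eight frames whose argmax path is: k k <blank> æ æ <blank> t t
frames = [one_hot(i) for i in [1, 1, 0, 2, 2, 0, 3, 3]]
print(ids_to_ipa(ctc_greedy_decode(frames)))  # kæt
```

A blank between two identical symbols keeps them distinct, which is how CTC can represent genuinely repeated phones: the path t, blank, t decodes to two symbols, while t, t decodes to one.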

The best-performing configuration, CLDNN h384, achieved a test character error rate (CER) of 0.561 and a test phoneme error rate (PER) of 0.581. These results indicate that smaller, sequence-oriented acoustic architectures can model children’s IPA recognition effectively in low-resource settings.
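For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. CER counts individual characters, while PER counts phoneme tokens, so a multi-character phoneme such as "tʃ" weighs differently under each. The function names, the pooling of edits over the whole test set, and the example transcriptions are assumptions. The project description does not specify its exact scoring procedure.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence], hyps: List[Sequence]) -> float:
    """Total edits divided by total reference length (assumed pooling)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("references must contain at least one symbol")
    return total_edits / total_len


# Hypothetical example: the model drops the stop portion of the affricate.
ref_phonemes = ["tʃ", "ɪ", "p"]
hyp_phonemes = ["ʃ", "ɪ", "p"]

per = error_rate([ref_phonemes], [hyp_phonemes])                    # tokens
cer = error_rate(["".join(ref_phonemes)], ["".join(hyp_phonemes)])  # chars
print(f"PER = {per:.3f}, CER = {cer:.3f}")  # PER = 0.333, CER = 0.250
```

In this example the same mistake counts as one substitution out of three phonemes for PER but one deletion out of four characters for CER, which is why the two reported scores need not match.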

Collaborators

Shreya Khanna

Researcher / Collaborator

Rohan Gupta

Researcher / Collaborator

Tech Stack

Python, TensorFlow, BiLSTM, CTC