This image shows a simple graphical interface from a Python application built with tkinter. On a blue background, a header bar at the top shows the current score, "15/21," in yellow text, followed by four buttons labeled "previous," "audio repeat," "submit," and "next." Below the header is a collection of word tiles, including "was," "get," "very," "tired," "Alice," "sister," and "having." The tiles can be clicked or dragged into a row of empty boxes beneath them to form a sentence, and a button labeled "effacer" (French for "erase") clears the arrangement. At the bottom, the sentence “Alice was beginning to get very tired of sitting by her sister [by] the bank, and [having] [nothing] [to] [do:]” is displayed; it appears to be the target sentence that the user reconstructs from the word tiles above.
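
The exercise logic behind a screen like this can be sketched without any GUI code. The following is an illustrative sketch under stated assumptions, not the code from the app: the names `make_exercise` and `score` are hypothetical, and it assumes that some words of the target sentence are blanked out and offered as tiles together with a few distractor words taken from the same sentence.

```python
import random


def make_exercise(sentence, n_blanks, n_distractors=0, seed=None):
    """Blank out n_blanks words and return (positions, answers, tiles).

    Hypothetical helper: the real app may choose blanks differently.
    """
    rng = random.Random(seed)
    words = sentence.split()
    n_blanks = min(n_blanks, len(words))

    # Pick which word positions become empty boxes, kept in reading order.
    blank_positions = sorted(rng.sample(range(len(words)), n_blanks))
    answers = [words[i] for i in blank_positions]

    # Distractors are drawn from the words that stay visible (an assumption).
    others = [w for i, w in enumerate(words) if i not in blank_positions]
    distractors = rng.sample(others, min(n_distractors, len(others)))

    # The tiles shown to the user: correct words plus distractors, shuffled.
    tiles = answers + distractors
    rng.shuffle(tiles)
    return blank_positions, answers, tiles


def score(answers, placed):
    """Count the boxes that hold the expected word, position by position."""
    return sum(1 for a, p in zip(answers, placed) if a == p)
```

In a tkinter front end, each entry of `tiles` would become a clickable widget and "submit" would call something like `score` on the words currently placed in the boxes.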

The Python source code is available in my GitHub repo.

As a demo, I’ve processed the first chapter of Alice in Wonderland into 174 audio fragments with aligned transcripts. The app is written in Python with tkinter, and audio processing is handled by librosa and TorchAudio’s wav2vec model. The audio comes from LibriVox and the text from Project Gutenberg.
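
To make "audio fragments with aligned transcripts" concrete, here is a minimal sketch of one way such fragments could be cut once word-level timestamps are available from an alignment step. It is an assumption-laden illustration, not the pipeline in the repo: the `Word` class, the `split_into_fragments` function, and the `min_pause` and `max_words` thresholds are all hypothetical, and the timestamps are taken as given rather than computed with librosa or TorchAudio.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def split_into_fragments(words, min_pause=0.4, max_words=25):
    """Group aligned words into fragments, cutting at long pauses.

    Returns a list of (transcript, start, end) tuples.
    Both thresholds are illustrative defaults, not values from the app.
    """
    fragments = []
    current = []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            # Start a new fragment after a long silence or a long run of words.
            if pause >= min_pause or len(current) >= max_words:
                fragments.append(current)
                current = []
        current.append(word)
    if current:
        fragments.append(current)

    return [
        (" ".join(w.text for w in frag), frag[0].start, frag[-1].end)
        for frag in fragments
    ]
```

Each resulting tuple pairs a transcript with the time span to cut from the recording, which is the shape of data an exercise screen needs: one clip to play and one sentence to reconstruct.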