Austrian Dialect in ASR: A Pilot Study – Carinthian-Dominant Dataset Creation, Fine-Tuning and Evaluation of Whisper and Parakeet

Elena Falle, BSc
Master Digital Healthcare, St. Pölten University of Applied Sciences 2026

Aim and Research Question(s)

The aim of this exploratory pilot study is to evaluate the impact of Carinthian-dominated Austrian dialectal variation on automatic speech recognition (ASR) performance in simulated emergency medical communication. Three research questions guided the study. RQ1 examines how dialect variation affects ASR performance and which clinically critical error patterns emerge in medical terminology. RQ2 investigates whether fine-tuning with domain-specific data can improve transcription accuracy and reduce critical errors. RQ3 explores whether augmenting the dataset with synthetically generated medical speech further improves ASR performance.

Background

Despite growing interest in ASR for clinical documentation, existing research has focused predominantly on standard language varieties. Because ASR models are trained mostly on standard-variety speech, their performance can decrease significantly under linguistic variation such as accents and regional dialects [1]. In emergency medical settings, accurate transcription is especially critical, as errors in medication names, dosages, or clinical findings can carry direct patient safety implications [2].

Methods

Four datasets were used: a self-created dialect dataset (1,765 recordings, Carinthian-dominant, emergency medical scenarios), the MultiMed reference dataset (3,103 files, standard German medical speech), and two synthetic text-to-speech (TTS) datasets generated with ElevenLabs and Voxtral (2,632 files each, standard German). Audio quality was assessed via signal-to-noise ratio (SNR > 10 dB) and root-mean-square (RMS) filtering, followed by baseline evaluation and fine-tuning (LoRA for Whisper, LayerNorm for Parakeet). The training data was then augmented with the synthetic sets, and the resulting models were evaluated using word error rate (WER) and BERTScore-based error clustering.
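The quality filter described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the frame-based SNR estimate (the quietest frames stand in for the noise floor), the frame length, the noise fraction, and the RMS threshold of 0.01 are all assumptions made for illustration. Only the 10 dB SNR threshold comes from the text.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate (assumed method, not the study's).

    The signal is cut into non-overlapping frames; the mean power of the
    quietest frames is taken as the noise floor and compared with the
    mean power over all frames.
    """
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if len(frames) < 2:
        raise ValueError("need at least two full frames")
    powers = sorted(sum(s * s for s in f) / frame_len for f in frames)
    n_noise = max(1, int(len(powers) * noise_fraction))
    noise_power = sum(powers[:n_noise]) / n_noise
    signal_power = sum(powers) / len(powers)
    if noise_power == 0.0:
        return math.inf  # digitally silent noise floor
    return 10.0 * math.log10(signal_power / noise_power)


def keep_recording(samples, min_snr_db=10.0, min_rms=0.01):
    """Keep a recording only if it passes both the SNR and the RMS check.

    min_snr_db follows the 10 dB threshold in the text; min_rms is a
    hypothetical value chosen for this sketch.
    """
    return estimate_snr_db(samples) > min_snr_db and rms(samples) >= min_rms
```

In this sketch the RMS check catches recordings that are nearly silent overall, which an SNR ratio alone would not flag.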

Results and Discussion

The evaluation was conducted on two datasets: the self-created dialect dataset (with and without synthetic augmentation via Voxtral and ElevenLabs) and the MultiMed reference dataset. On the dialect data, Parakeet improved dramatically from a baseline WER of 30.08% to 6.47% after fine-tuning and further to 4.68% with augmentation, while Whisper remained largely unchanged across all conditions. On MultiMed, fine-tuning reduced WER for both models, from 10.85% to 9.65% (Whisper) and from 11.56% to 7.71% (Parakeet). Error clustering of the best-performing configuration (Parakeet fine-tuned on the augmented dialect dataset) showed that 87.6% of outputs were acceptable, with medical errors accounting for 9.4% and massive errors for only 3.0%. Dialect variation increased WER for both models compared to the reference dataset. Fine-tuned Parakeet (LayerNorm) substantially outperformed fine-tuned Whisper (LoRA), and synthetic data augmentation improved WER primarily through increased exposure to domain-specific medical vocabulary.
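To make the reported figures easier to compare, the sketch below shows how WER is conventionally computed (word-level edit distance divided by the number of reference words) and how the relative WER reduction follows from the values above. It is an illustration, not the evaluation code used in the study; the function names and the toy sentence pair are hypothetical, and any text normalisation applied before scoring is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy pair (hypothetical): one substituted dosage word out of five.
    print(word_error_rate("give five milligrams of morphine",
                          "give fifty milligrams of morphine"))

    # WER values in percent, taken from the results reported above.
    reported = {
        "Parakeet, dialect, fine-tuned": (30.08, 6.47),
        "Parakeet, dialect, augmented": (30.08, 4.68),
        "Whisper, MultiMed, fine-tuned": (10.85, 9.65),
        "Parakeet, MultiMed, fine-tuned": (11.56, 7.71),
    }
    for name, (before, after) in reported.items():
        print(f"{name}: {relative_reduction(before, after):.1f}% relative reduction")
```

The toy pair also illustrates why WER alone understates clinical risk: a single substituted dosage word yields a modest WER but a critical error, which motivates the additional error clustering.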

Conclusion

Fine-tuned Parakeet TDT 0.6B v3 proved the most suitable model for Carinthian-dominated Austrian dialect, achieving a WER of 4.68% and 87.6% clinically acceptable transcriptions. However, human verification remains essential before clinical deployment.

References

[1] M. G. Elfeky, P. Moreno, and V. Soto, "Multi-Dialectical Languages Effect on Speech Recognition," Procedia Computer Science, vol. 128, pp. 1–8, 2018. doi: 10.1016/j.procs.2018.03.001
[2] J. Shor, R. A. Bi, S. Venugopalan, S. Ibara, R. Goldenberg, and E. Rivlin, "Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings," in Proc. 5th Clinical Natural Language Processing Workshop, pp. 1–7, 2023. doi: 10.18653/v1/2023.clinicalnlp-1.1