Fine-tune a speech recognition model for your voice

Last updated

2/18/2025

Speech

Document

This blueprint lets you create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything can run locally, even on your laptop, so your data stays private. You can fine-tune a model on your own data or on the Common Voice dataset, a community-led project from Mozilla that supports a wide range of languages. For the full list of supported languages, visit the Common Voice website.

Mozilla.ai

Time

25 min

Complexity

Medium

Status

Stable

Contributors

Tags

Speech-to-Text

Local AI

Finetuning

Automatic Speech Recognition

License

Apache 2.0

Hosted demo

Step by step walkthrough

Tools used to create

Trusted open source tools used for this Blueprint

Hugging Face Transformers

Use HF Transformers to fine-tune the ASR model, and HF Hub to load Common Voice.

Common Voice

Use Common Voice to select a dataset tailored to your language or dialect.

Gradio

Gradio is used to build an interactive app for both voice data collection and ASR inference.
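
As a rough illustration of how these tools could fit together for inference, the sketch below wires a Gradio microphone input to a Transformers speech recognition pipeline. It is not the Blueprint's actual code. The checkpoint name `openai/whisper-tiny` and the helper names `to_float_mono` and `build_app` are assumptions chosen for the example, and `gr.Audio(sources=[...])` assumes a recent Gradio release (older releases used a different argument name). The pipeline is expected to resample the recording to the model's sampling rate, which may require `torchaudio` to be installed.

```python
import numpy as np

# Assumption: the smallest Whisper checkpoint on the Hugging Face Hub.
MODEL_ID = "openai/whisper-tiny"


def to_float_mono(data: np.ndarray) -> np.ndarray:
    """Convert recorded samples to mono float32, scaled to about [-1.0, 1.0]."""
    is_integer = np.issubdtype(data.dtype, np.integer)
    # Signed integer PCM (e.g. int16) is scaled by its full range, 32768 for int16.
    scale = float(np.iinfo(data.dtype).max) + 1.0 if is_integer else 1.0
    samples = data.astype(np.float32)
    if samples.ndim == 2:  # shape (samples, channels): average the channels
        samples = samples.mean(axis=1)
    return samples / scale if is_integer else samples


def build_app():
    """Build the demo app. Needs gradio, transformers and torch installed."""
    import gradio as gr
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

    def transcribe(audio):
        if audio is None:  # nothing has been recorded yet
            return ""
        sample_rate, data = audio  # type="numpy" yields (rate, samples)
        result = asr({"sampling_rate": sample_rate, "raw": to_float_mono(data)})
        return result["text"]

    return gr.Interface(
        fn=transcribe,
        inputs=gr.Audio(sources=["microphone"], type="numpy"),
        outputs="text",
    )
```

Calling `build_app().launch()` would start the app locally. The heavy imports sit inside `build_app` so the conversion helper can be reused or tested without downloading a model.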

Choices

Insights into our motivations and key technical decisions throughout the development process.

Each entry below lists the focus area, the decision, its rationale, the alternatives considered, and the trade-offs.

Overall Motivation

Decision: Build a local-focused workflow for fine-tuning Speech-to-Text models using your own data or the Common Voice dataset.

Rationale: Lets users fine-tune an STT model for their own needs or for low-resource languages while keeping their data private. Also lets users run the model as a local, private STT service.

Alternatives Considered: Models fine-tuned on low-resource languages already exist on Hugging Face. Users could download and try these, or use another STT service or tool, instead of fine-tuning on their own data or Common Voice.

Trade-offs: Existing fine-tuned models may be trained on bigger, more diverse datasets, so they may perform better across different environments and use cases. However, not all languages have a fine-tuned model, and the ones that exist might not perform as well. Fine-tuning a model on your own voice data produces a more personalized, use-case-specific model that might perform better.

Model Selection

Decision: openai/Whisper

Rationale: Open-source under the MIT license and easy to implement. Large community and support around it. Top 5 on the HF ASR leaderboard as of Feb 2025. Multiple sizes available, making it easy to switch depending on available hardware.

Alternatives Considered: facebook/w2v-bert-2.0, meta/mms

Trade-offs: Whisper models, especially the larger ones, require considerable computational resources and might not run efficiently on all local setups. However, Whisper-tiny and Whisper-small are low-compute friendly.

Voice Dataset for Fine-tuning

Decision: Common Voice

Rationale: Open-source, diverse collection of voice samples in multiple languages. One of the best STT datasets available for low-resource languages.

Alternatives Considered: None.

Trade-offs: n/a

Fine-tuning Framework

Decision: Hugging Face Transformers

Rationale: Hugging Face's Transformers library provides well-documented, actively maintained fine-tuning support and works with most open-source pre-trained models.

Alternatives Considered: SpeechBrain; NeMo by Nvidia

Trade-offs: SpeechBrain and NeMo both provide instructions for fine-tuning on Common Voice. However, they are not as actively maintained as Transformers, have a steeper learning curve for beginners, and might not support as broad a family of models.

User Interface

Decision: Gradio

Rationale: Good support for voice recording and for integration with HF Spaces.

Alternatives Considered: Streamlit

Trade-offs: Streamlit is another option, but its built-in voice recording feature doesn't work out of the box with the HF Transformers library: the audio input needs a specific transformation before it can be fed to the STT model.
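
Several of the trade-offs above hinge on whether one model "might perform better" than another. The source does not name a metric, so the following is only a sketch under an assumption: word error rate (WER), a common measure for speech recognition, computed here in plain Python. The function name `word_error_rate` and the example transcripts are hypothetical and do not come from any real model output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (initially none) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical transcripts of the same recording from two models.
reference = "turn on the kitchen lights"
base_output = "turn on the chicken light"
fine_tuned_output = "turn on the kitchen light"

base_wer = word_error_rate(reference, base_output)              # 2 of 5 words wrong
fine_tuned_wer = word_error_rate(reference, fine_tuned_output)  # 1 of 5 words wrong
```

A lower value means fewer word-level mistakes. Scoring both a downloaded model and your own fine-tuned model on the same held-out recordings gives a like-for-like comparison.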

Ready? Try it yourself!