Last updated
2/18/2025

Fine-tune a speech recognition model for your voice

This blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything can run locally, even on your laptop, so your data stays private. You can fine-tune a model on your own data or use the Common Voice dataset, a community-led project from Mozilla that supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.

Mozilla.ai

Trusted open source tools used for this Blueprint

HuggingFace Transformers

Use HF Transformers to fine-tune the ASR model, and HF Hub to load Common Voice.

Common Voice

Use Common Voice to select a dataset tailored to your language or dialect.

Gradio

Gradio is used to build an interactive app for both voice data collection and ASR inference.
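The voice data collection step can be pictured as saving each recorded clip next to the sentence that was read aloud. The following is a minimal sketch using only the Python standard library, not the Blueprint's actual implementation: the function name `save_clip`, the `index.csv` file, and its `path`/`sentence` columns are hypothetical, and the recording is assumed to arrive as mono 16-bit PCM bytes (for example, from a Gradio audio component after conversion).

```python
import csv
import wave
from pathlib import Path


def save_clip(dataset_dir, clip_id, sentence, pcm16, sample_rate=16_000):
    """Write one mono 16-bit recording and append it to a CSV index.

    The directory layout and column names are illustrative assumptions.
    """
    root = Path(dataset_dir)
    root.mkdir(parents=True, exist_ok=True)

    # Store the raw PCM samples as a standard WAV file.
    wav_path = root / f"{clip_id}.wav"
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)

    # Append (file name, transcript) so the pairs can be loaded for training.
    index_path = root / "index.csv"
    is_new = not index_path.exists()
    with index_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["path", "sentence"])
        writer.writerow([wav_path.name, sentence])

    return wav_path
```

Keeping the audio files and a flat text index side by side means the whole dataset stays in one local folder, which fits the goal of keeping voice data private.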

Choices

Insights into our motivations and key technical decisions throughout the development process.

Focus: Overall Motivation
Decision: Build a local-first workflow for fine-tuning Speech-to-Text (STT) models using your own data or the Common Voice dataset.
Rationale: Enables users to fine-tune an STT model for their own needs or for low-resource languages while keeping their data private. Also enables users to run the model as a local, private STT service.
Alternatives Considered: Models fine-tuned on low-resource languages already exist on Hugging Face. Users could download and try these, or use another STT service or tool, instead of fine-tuning on their own data or Common Voice.
Trade-offs: Existing fine-tuned models might be trained on bigger, more diverse datasets, so they might perform better across different environments and use cases. However, not all languages have a fine-tuned model, and the models that exist might not perform as well. Fine-tuning a model on your own voice data produces a more personalized, use-case-specific model that might perform better.

Focus: Model Selection
Decision: openai/Whisper
Rationale: Open source under the MIT license and easy to implement. Large community and support around it. Top 5 in the HF ASR leaderboard as of Feb 2025. Multiple sizes available, making it easy to switch depending on available hardware.
Alternatives Considered: facebook/w2v-bert-2.0, meta/mms
Trade-offs: Whisper models, especially the larger ones, require considerable computational resources and might not run efficiently on all local setups. However, Whisper-tiny and Whisper-small are low-compute friendly.

Focus: Voice Dataset for Fine-tuning
Decision: Common Voice
Rationale: Open-source, diverse collection of voice samples in multiple languages. One of the best STT datasets available for low-resource languages.
Alternatives Considered: None considered.
Trade-offs: n/a

Focus: Fine-tuning Framework
Decision: Hugging Face Transformers
Rationale: Hugging Face’s transformers library provides well-documented fine-tuning support, is actively maintained, and supports most open-source pre-trained models.
Alternatives Considered: SpeechBrain; NeMo by Nvidia
Trade-offs: SpeechBrain and NeMo both have instructions for fine-tuning on Common Voice. However, they are not as actively maintained as Transformers, have a steeper learning curve for beginners, and might not support as broad a family of models.

Focus: User Interface
Decision: Gradio
Rationale: Good option for voice recording and for integration with HF Spaces.
Alternatives Considered: Streamlit
Trade-offs: Streamlit is another option, but its built-in voice recording feature doesn’t work out of the box with the HF Transformers library: the audio input needs a specific transformation before being fed to the STT model.
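The Model Selection entry notes that Whisper comes in multiple sizes so you can switch depending on available hardware. As a rough illustration of that trade-off, the sketch below picks the largest checkpoint whose training state fits a memory budget. The parameter counts are the approximate figures published with Whisper; everything else is an assumption made for this example: the helper names, the 16 bytes per parameter rule of thumb (fp32 weights, gradients, and two Adam moments, ignoring activations), and the 50% headroom factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperSize:
    checkpoint: str        # Hugging Face Hub model id
    params_millions: int   # approximate parameter count


# Ordered from smallest to largest.
SIZES = [
    WhisperSize("openai/whisper-tiny", 39),
    WhisperSize("openai/whisper-base", 74),
    WhisperSize("openai/whisper-small", 244),
    WhisperSize("openai/whisper-medium", 769),
    WhisperSize("openai/whisper-large-v3", 1550),
]

# Assumption: fp32 weights (4) + gradients (4) + Adam moments (8) per parameter.
# Activations and batch size are NOT included, so this is a lower bound.
BYTES_PER_PARAM = 16


def training_state_gb(size: WhisperSize) -> float:
    """Estimate memory (GB) for weights, gradients, and optimizer state."""
    return size.params_millions * 1e6 * BYTES_PER_PARAM / 1e9


def pick_checkpoint(memory_gb: float, headroom: float = 0.5) -> str:
    """Return the largest checkpoint whose estimate fits memory_gb * headroom.

    Falls back to the smallest checkpoint when nothing fits.
    """
    budget = memory_gb * headroom
    fitting = [s for s in SIZES if training_state_gb(s) <= budget]
    return (fitting[-1] if fitting else SIZES[0]).checkpoint
```

Under these assumptions, `pick_checkpoint(8)` selects `openai/whisper-small`. Real memory use depends on batch size, precision, and audio length, so treat the result as a starting point to try rather than a guarantee.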