Motivations and Technical Decisions
Motivations
Turning documents into podcasts has gained attention recently, driven by tools like Google’s NotebookLM. However, many options are closed-source, rely on proprietary models, and cater more to consumers than developers looking to build their own apps. Our Blueprint focuses on:
- Fully local setup: Standalone functionality with no third-party APIs, ensuring privacy and data control.
- Low compute requirements: Models optimized for local setups, even in GitHub Codespaces, without needing high-end GPUs or API keys.
- Customization: Flexible and extensible, with guidance on building basic apps using its components.
Technical Decisions
- Document Pre-processing: Chose Python’s re library for text cleaning, optimizing token usage for smaller models with limited context windows. However, we are exploring alternatives such as MarkItDown.
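As a minimal sketch of this kind of regex-based cleaning, the snippet below strips markup remnants and collapses whitespace before text is sent to the model; the specific patterns here are illustrative assumptions, not the Blueprint’s exact rules.

```python
import re

def clean_text(text: str) -> str:
    """Normalize raw document text before prompting a small model.

    The patterns below are illustrative, not the Blueprint's exact rules.
    """
    # Strip HTML-style tags left over from web or PDF extraction.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove bracketed reference markers such as [1] or [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of whitespace so tokens aren't wasted on filler.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_text("<p>Transformers  [1] are neural networks.</p>"))
# → Transformers are neural networks.
```

Cleaning like this matters most for small context windows: every tag or duplicated space that survives pre-processing is a token the model cannot spend on actual content.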
- Podcast Script Generation:
  - Used llama_cpp for local, CPU-friendly inference; it supports the GGUF binary format, which enables quantized models that run efficiently on modest hardware.
  - Selected Qwen2.5-3B-Instruct-GGUF as the default model for its balance of output quality and resource usage.
  - Improved script consistency by constraining outputs to JSON via the response_format argument in llama_cpp.
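A sketch of how the JSON-constrained generation step might look with llama_cpp is below. The model path, the system prompt, and the `{"script": [{"speaker": ..., "text": ...}]}` schema are assumptions for illustration, not the Blueprint’s exact prompt or schema; `response_format={"type": "json_object"}` is llama_cpp’s OpenAI-compatible way of forcing valid JSON output.

```python
import json

def parse_script(raw: str) -> list:
    """Turn the model's JSON output into (speaker, line) pairs.

    Assumes a shape like {"script": [{"speaker": ..., "text": ...}]};
    this schema is a hypothetical example, not the Blueprint's own.
    """
    data = json.loads(raw)
    return [(turn["speaker"], turn["text"]) for turn in data["script"]]

def generate_script(document: str, model_path: str) -> list:
    """Run local CPU inference with llama_cpp, constraining output to JSON."""
    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path=model_path, n_ctx=4096)  # path to a GGUF file
    result = llm.create_chat_completion(
        messages=[
            # Hypothetical prompt; the real Blueprint prompt differs.
            {"role": "system", "content": (
                'Rewrite the document as a two-host podcast script. '
                'Respond only with JSON of the form '
                '{"script": [{"speaker": "...", "text": "..."}]}'
            )},
            {"role": "user", "content": document},
        ],
        # Constrain the model to emit syntactically valid JSON.
        response_format={"type": "json_object"},
    )
    return parse_script(result["choices"][0]["message"]["content"])
```

Parsing into structured (speaker, text) pairs is what makes the downstream audio step reliable: each turn can be routed to the right voice without fragile free-text parsing.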
- Audio Generation: Adopted OuteTTS for its consistent pronunciation of domain-specific words.
- Deployment options:
  - GitHub Codespaces for quick, remote usage with no local setup.
  - Local CLI and demo app for on-device usage.
  - GPU-enabled Colab notebooks for easy experimentation without setup.
  - HF Spaces for the hosted demo in the Blueprints Hub.