Motivations and Technical Decisions

Motivations

Turning documents into podcasts has gained attention recently, driven by tools like Google’s NotebookLM. However, many of these options are closed-source, rely on proprietary models, and cater more to consumers than to developers who want to build their own apps. Our Blueprint focuses on:

  • Fully local setup: Standalone functionality with no third-party APIs, ensuring privacy and data control.
  • Low compute requirements: Models optimized for local setups, even in GitHub Codespaces, without needing high-end GPUs or API keys.
  • Customization: Flexible and extensible, with guidance on building basic apps using its components.

Technical Decisions

  • Document Pre-processing: Chose Python’s re library for text cleaning, optimizing token usage for smaller models with limited context windows. However, we are exploring alternatives such as MarkItDown.
  • Podcast Script Generation:
    • Used llama_cpp for local, CPU-friendly inference, supporting the GGUF binary format for optimized model performance.
    • Selected Qwen2.5-3B-Instruct-GGUF as the default model for balanced performance.
    • Improved script consistency by formatting outputs as JSON using the response_format argument in llama_cpp.
  • Audio Generation: Adopted OuteTTS for its consistent pronunciation of domain-specific words.
  • Deployment options:
    • GitHub Codespaces for quick, remote usage and no local setup.
    • Local CLI and Demo app for on-device usage.
    • GPU-enabled Colab notebooks for easy experimentation without setup.
    • HF Spaces for the hosted demo in the Blueprints Hub.
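The pre-processing step above can be sketched with a few `re` substitutions. This is a minimal illustration rather than the Blueprint's actual cleaning pipeline; the specific patterns and the `clean_text` helper are assumptions for the example.

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative cleanup: strip leftover markup and collapse whitespace
    so fewer tokens are wasted in a small model's limited context window."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop stray HTML tags
    text = re.sub(r"\[[0-9]+\]", "", text)    # drop footnote markers like [12]
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(clean_text("Hello   <b>world</b> [1]\n\nbye"))  # -> Hello world bye
```

Keeping this step as plain regular-expression substitutions (rather than a heavier parsing library) means the cleaning is fast, dependency-free, and easy to extend with additional patterns.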
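To show how JSON-formatted output keeps scripts consistent, the sketch below builds a `response_format` payload of the kind llama_cpp's `create_chat_completion` accepts, then parses a sample model reply constrained to that shape. The field names (`script`, `speaker`, `text`) are illustrative assumptions, not the Blueprint's actual schema.

```python
import json

# Illustrative JSON schema for a two-speaker podcast script; the real
# Blueprint schema may differ. It would be passed to llama_cpp roughly as:
#   llm.create_chat_completion(messages=..., response_format=response_format)
response_format = {
    "type": "json_object",
    "schema": {
        "type": "object",
        "properties": {
            "script": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "speaker": {"type": "string"},
                        "text": {"type": "string"},
                    },
                    "required": ["speaker", "text"],
                },
            }
        },
        "required": ["script"],
    },
}

# A reply constrained to that shape parses deterministically, so the audio
# step can iterate over turns without brittle string parsing:
sample_reply = '{"script": [{"speaker": "Host", "text": "Welcome!"}]}'
script = json.loads(sample_reply)["script"]
print(script[0]["speaker"])  # -> Host
```

Constraining generation to a schema is what makes the downstream text-to-speech step reliable: every turn arrives as a (speaker, text) pair instead of free-form prose that would need to be re-parsed.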