Voxtral Documentation

Everything you need to integrate Voxtral speech understanding models into your applications, from quick-start guides to the full API reference.

Quick Start

Get up and running with Voxtral in minutes. Choose your preferred method:

API (Recommended)

Use our hosted API for the fastest setup:

curl -X POST https://api.gamealpaca.world/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "[email protected]"

Local Deployment

Run Voxtral Mini locally for privacy and control:

pip install voxtral
voxtral transcribe audio.mp3

Hugging Face

Download and use models directly:

from transformers import AutoModelForSpeechSeq2Seq
model = AutoModelForSpeechSeq2Seq.from_pretrained("mistralai/Voxtral-Mini")

Installation

Install Voxtral using your preferred package manager:

Python (pip)

pip install voxtral

Node.js (npm)

npm install @voxtral/ai

Docker

docker pull voxtral/voxtral-mini:latest

Basic Usage

Here's how to use Voxtral for common tasks:

Transcription

import voxtral

# Transcribe audio file
result = voxtral.transcribe("audio.mp3")
print(result.text)

# Transcribe with options
result = voxtral.transcribe(
    "audio.mp3",
    language="en",
    task="transcribe",
    timestamp_granularities=["word"]
)
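Timestamps are useful for building captions. The sketch below is illustrative only: it assumes segments follow the `{start, end, text}` shape shown in the API reference's example response, and `to_srt` is a hypothetical helper, not part of the SDK.

```python
def to_srt(segments):
    """Render transcription segments as SRT subtitle blocks."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks)

# Example segment matching the response schema in the API reference:
print(to_srt([{"start": 0.0, "end": 5.2, "text": "Segment text..."}]))
# 1
# 00:00:00,000 --> 00:00:05,200
# Segment text...
```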

Audio Understanding

# Ask questions about audio content
result = voxtral.understand(
    "audio.mp3",
    question="What is the main topic discussed?"
)

# Generate summary
summary = voxtral.summarize("audio.mp3")

Multilingual Support

# Automatic language detection
result = voxtral.transcribe("spanish_audio.mp3")

# Force specific language
result = voxtral.transcribe(
    "audio.mp3",
    language="es"
)

API Reference

Authentication

All API requests require authentication using your API key:

Authorization: Bearer YOUR_API_KEY

Get your API key from the Voxtral Dashboard.

Transcription API

Endpoint: POST /v1/transcribe

Request Parameters

Parameter   Type     Required   Description
---------   ------   --------   -----------
file        File     Yes        Audio file (MP3, WAV, M4A, etc.)
language    String   No         Language code (auto-detected if not provided)
task        String   No         "transcribe" or "translate"

Response

{
  "text": "Transcribed text content...",
  "language": "en",
  "duration": 120.5,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "text": "Segment text..."
    }
  ]
}
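The response is plain JSON, so it can be handled with the standard library alone. This sketch parses the example response above; field names follow that example exactly.

```python
import json

# The example response from the API reference, verbatim.
response = json.loads("""{
  "text": "Transcribed text content...",
  "language": "en",
  "duration": 120.5,
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "Segment text..."}
  ]
}""")

print(response["language"])  # "en"

# How much of the audio is covered by timestamped segments:
coverage = sum(s["end"] - s["start"] for s in response["segments"])
print(f"{coverage:.1f}s of {response['duration']}s segmented")
```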

Audio Understanding API

Endpoint: POST /v1/understand

Request Parameters

Parameter   Type     Required   Description
---------   ------   --------   -----------
file        File     Yes        Audio file
question    String   Yes        Question about the audio content

Deployment

Local Deployment

Deploy Voxtral Mini locally for privacy and control:

System Requirements

  • GPU: NVIDIA GPU with 10GB+ VRAM (RTX 3080 or better)
  • RAM: 16GB+ system memory
  • Storage: 20GB+ free space
  • OS: Linux, macOS, or Windows

Installation Steps

# Clone repository
git clone https://github.com/mistralai/voxtral.git
cd voxtral

# Install dependencies
pip install -r requirements.txt

# Download model
python -c "from voxtral import VoxtralMini; VoxtralMini.download()"

# Start server
python -m voxtral.server --port 8000
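Once the server is running, you can send it audio over HTTP. The sketch below assumes the local server mirrors the hosted `/v1/transcribe` route; the multipart body is assembled with the standard library only, and `multipart_body` is a hypothetical helper for illustration.

```python
import io
import urllib.request
import uuid

def multipart_body(field, filename, data):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
    )
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

body, content_type = multipart_body("file", "audio.mp3", b"...raw mp3 bytes...")
req = urllib.request.Request(
    "http://localhost:8000/v1/transcribe",
    data=body,
    headers={"Content-Type": content_type},
)
# urllib.request.urlopen(req) sends the request once the server is up.
```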

Docker Deployment

Use Docker for easy deployment:

# Pull image
docker pull voxtral/voxtral-mini:latest

# Run container
docker run -d \
  --name voxtral \
  --gpus all \
  -p 8000:8000 \
  voxtral/voxtral-mini:latest

Code Examples

Python Examples

Basic Transcription

import voxtral

# Initialize client
client = voxtral.Client(api_key="your-api-key")

# Transcribe file
with open("audio.mp3", "rb") as f:
    result = client.transcribe(f)
    print(result.text)

Streaming Transcription

# Real-time transcription
for chunk in client.transcribe_stream(audio_stream):
    print(chunk.text, end="", flush=True)
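To show what consuming the stream looks like end to end, here is an illustrative sketch: `Chunk` is a stand-in for whatever object the SDK actually yields, and the input list simulates a live stream.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Stand-in for a streamed transcription chunk."""
    text: str

def collect(chunks):
    """Join the text of streamed chunks into one transcript."""
    return "".join(c.text for c in chunks)

transcript = collect([Chunk("Hello, "), Chunk("world.")])
print(transcript)  # Hello, world.
```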

JavaScript Examples

Browser Usage

import { Voxtral } from '@voxtral/ai';

const client = new Voxtral('your-api-key');

// Transcribe audio file
const file = document.getElementById('audio-file').files[0];
const result = await client.transcribe(file);
console.log(result.text);

Ready to Get Started?

Join thousands of developers building with Voxtral. Get your API key and start integrating speech AI today.