
gemini-voice-assistant

Voice-to-voice AI assistant using Gemini Live API. Speak to the AI and get spoken responses. Use when you want to have natural voice conversations with an AI assistant powered by Google's Gemini models.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 3,030
Hot score: 99
Updated: March 20, 2026
Overall rating: C (4.0)
Composite score: 4.0
Best-practice grade: A (88.4)

Install command

npx @skill-hub/cli install openclaw-skills-gemini-voice-assistant

Repository

openclaw/skills

Skill path: skills/alimostafaradwan/gemini-voice-assistant

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Backend, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: openclaw.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install gemini-voice-assistant into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/openclaw/skills before adding gemini-voice-assistant to shared team environments
  • Hold text or voice conversations with Gemini models from your development workflow

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: gemini-voice-assistant
description: Voice-to-voice AI assistant using Gemini Live API. Speak to the AI and get spoken responses. Use when you want to have natural voice conversations with an AI assistant powered by Google's Gemini models.
metadata:
  openclaw:
    emoji: "🎙️"
---

# Gemini Voice Assistant

A voice-to-voice AI assistant powered by Google's Gemini Live API. Speak to the AI and it responds with natural-sounding voice.

## Usage

### Text Mode

```bash
cd ~/.openclaw/agents/kashif/skills/gemini-assistant && python3 handler.py "Your question or message"
```

### Voice Mode

```bash
cd ~/.openclaw/agents/kashif/skills/gemini-assistant && python3 handler.py --audio /path/to/audio.ogg "optional context"
```

## Response Format

The handler returns a JSON response:

```json
{
  "message": "[[audio_as_voice]]\nMEDIA:/tmp/gemini_voice_xxx.ogg",
  "text": "Text response from Gemini"
}
```
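The `MEDIA:` line inside `message` carries the path of the generated OGG file. A minimal sketch of extracting it from a response (the JSON payload here is illustrative, matching the shape above):

```python
import json

# Illustrative handler output, matching the Response Format shape above
raw = '{"message": "[[audio_as_voice]]\\nMEDIA:/tmp/gemini_voice_cli.ogg", "text": "Hello!"}'
resp = json.loads(raw)

# Pull the OGG path out of the MEDIA: line
media_path = None
for line in resp.get("message", "").splitlines():
    if line.startswith("MEDIA:"):
        media_path = line[len("MEDIA:"):]

print(media_path)   # /tmp/gemini_voice_cli.ogg
print(resp["text"]) # Hello!
```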

## Configuration

Set your Gemini API key:

```bash
export GEMINI_API_KEY="your-api-key-here"
```

Or create a `.env` file in the skill directory:

```
GEMINI_API_KEY=your-api-key-here
```

## Model Options

The default model is `gemini-2.5-flash-native-audio-preview-12-2025` for audio support.

To use a different model, edit `handler.py`:

```python
MODEL = "gemini-2.0-flash-exp"  # For text-only
```

## Requirements

- `google-genai>=1.0.0`
- `numpy>=1.24.0`
- `soundfile>=0.12.0`
- `librosa>=0.10.0` (for audio input)
- FFmpeg (for audio conversion)
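A quick way to confirm these dependencies are present before running the handler (a stdlib-only sketch; package names come from the list above, and FFmpeg is checked separately since it is a system binary, not a Python package):

```python
import importlib.metadata as md
import shutil

# Python packages from the Requirements list
required = ["google-genai", "numpy", "soundfile", "librosa"]

found = {}
for pkg in required:
    try:
        found[pkg] = md.version(pkg)
    except md.PackageNotFoundError:
        found[pkg] = None

for pkg, version in found.items():
    print(pkg, version or "MISSING")

# FFmpeg availability on PATH
print("ffmpeg:", shutil.which("ffmpeg") or "MISSING")
```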

## Features

- 🎙️ Voice input/output support
- 💬 Text conversations
- 🔧 Configurable system instructions
- ⚡ Fast responses with Gemini Flash


---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### _meta.json

```json
{
  "owner": "alimostafaradwan",
  "slug": "gemini-voice-assistant",
  "displayName": "Gemini Voice Assistant",
  "latest": {
    "version": "1.0.0",
    "publishedAt": 1771753000088,
    "commit": "https://github.com/openclaw/skills/commit/df66f86f89d9483c15b0556dbce43f9e38973872"
  },
  "history": []
}

```

### scripts/handler.py

```python
#!/usr/bin/env python3
"""
Gemini Assistant - General purpose AI assistant using Gemini API
Supports text and voice interactions
"""

import asyncio
import os
import subprocess
import tempfile
import json
import argparse
from pathlib import Path
from datetime import datetime

import numpy as np
import soundfile as sf
from google import genai
from google.genai import types

# Load .env file manually if present
env_path = Path(__file__).parent / ".env"
if env_path.exists():
    with open(env_path) as f:
        for line in f:
            if line.strip() and not line.startswith('#') and '=' in line:
                key, val = line.strip().split('=', 1)
                if not os.environ.get(key):
                    os.environ[key] = val.strip().strip("'").strip('"')

# Configuration
MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"
SYSTEM_INSTRUCTION = """You are a helpful, friendly AI assistant. You can:
- Answer questions on any topic
- Help with explanations and clarifications
- Assist with general tasks and problem-solving
- Have natural conversations

Keep responses concise but informative. If the user speaks in Arabic, respond in Arabic. If English, respond in English."""

SAMPLE_RATE_IN = 16000
SAMPLE_RATE_OUT = 24000
FFMPEG = "/usr/bin/ffmpeg"


async def _process_with_gemini(audio_path: str | None = None, text_input: str | None = None, system_instruction: str | None = None) -> dict:
    """
    Process input using Gemini Live API.
    Returns both audio and text response.
    """
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY environment variable not set")

    client = genai.Client(api_key=api_key)

    config = {
        "response_modalities": ["AUDIO"],
        "system_instruction": system_instruction or SYSTEM_INSTRUCTION,
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {"voice_name": "Puck"}
            }
        }
    }

    chunks = []

    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Send input
        if text_input:
            await session.send_client_content(turns={"parts": [{"text": text_input}]})
        elif audio_path:
            # Convert audio to PCM and send
            import librosa

            # librosa returns float32 samples in [-1, 1]; the Live API expects
            # 16-bit PCM, so scale and convert before sending
            y, sr = librosa.load(audio_path, sr=SAMPLE_RATE_IN)
            pcm16 = (y * 32767).astype(np.int16)

            # Send as realtime audio input
            await session.send_realtime_input(
                audio=types.Blob(
                    data=pcm16.tobytes(),
                    mime_type=f"audio/pcm;rate={SAMPLE_RATE_IN}"
                )
            )
        else:
            raise ValueError("Either text_input or audio_path must be provided")

        # Receive responses
        async for response in session.receive():
            if response.data is not None:
                chunks.append(response.data)
            elif response.server_content and response.server_content.turn_complete:
                break

    raw_pcm = b"".join(chunks)

    return {
        "raw_pcm": raw_pcm
    }


def _pcm_to_ogg_opus(raw_pcm: bytes, output_path: str) -> str:
    """Convert raw PCM to OGG Opus."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        wav_path = tmp.name

    audio_np = np.frombuffer(raw_pcm, dtype=np.int16)
    sf.write(wav_path, audio_np, SAMPLE_RATE_OUT, format="WAV", subtype="PCM_16")

    try:
        env = os.environ.copy()
        env["LD_LIBRARY_PATH"] = "/usr/lib/x86_64-linux-gnu"
        result = subprocess.run(
            [
                FFMPEG, "-i", wav_path,
                "-c:a", "libopus",
                "-b:a", "32k",
                "-ar", "48000",
                "-ac", "1",
                output_path, "-y"
            ],
            capture_output=True,
            timeout=30,
            env=env,
        )
        if result.returncode != 0:
            raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()[-300:]}")
    finally:
        if os.path.exists(wav_path):
            os.unlink(wav_path)

    size = os.path.getsize(output_path)
    if size == 0:
        raise RuntimeError("ffmpeg produced an empty OGG file")

    print(f"[gemini-assistant] OGG Opus written: {output_path} ({size} bytes)")
    return output_path


def handle_request(request_data: dict) -> dict:
    """Main entry point for Gemini Assistant."""
    chat_id = request_data.get("chat_id", "unknown")
    text_input = request_data.get("text")
    audio_path = request_data.get("audio_path")
    system_instruction = request_data.get("system_instruction")
    
    safe_id = str(chat_id).replace("@", "_").replace("+", "").replace(".", "_")
    voice_output_path = f"/tmp/gemini_voice_{safe_id}.ogg"
    
    try:
        # Process with Gemini
        result = asyncio.run(_process_with_gemini(
            audio_path=audio_path,
            text_input=text_input,
            system_instruction=system_instruction
        ))
        
        raw_pcm = result.get("raw_pcm", b"")
        
        response = {}
        
        # Convert voice to OGG if we have audio
        if raw_pcm:
            _pcm_to_ogg_opus(raw_pcm, voice_output_path)
            response["message"] = f"[[audio_as_voice]]\nMEDIA:{voice_output_path}"
        
        return response
        
    except Exception as e:
        error_msg = str(e)
        print(f"[gemini-assistant] Error: {error_msg}")
        import traceback
        traceback.print_exc()

        return {
            "message": f"Sorry, an error occurred: {error_msg}"
        }


def main():
    """CLI entry point."""
    parser = argparse.ArgumentParser(description="Gemini Assistant")
    parser.add_argument("input_text", nargs="?", help="Text input to send to Gemini")
    parser.add_argument("--audio", "-a", help="Path to audio file for voice input")
    parser.add_argument("--system", "-s", help="Custom system instruction")
    
    args = parser.parse_args()
    
    request_data = {
        "chat_id": "cli",
        "text": args.input_text,
        "audio_path": args.audio,
        "system_instruction": args.system
    }
    
    result = handle_request(request_data)
    print(json.dumps(result, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

```
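The `/tmp/gemini_voice_xxx.ogg` filenames shown in the Response Format come from the `safe_id` sanitization inside `handle_request`. Mirroring that logic in isolation (the chat IDs here are illustrative):

```python
def voice_output_path(chat_id: str) -> str:
    # Same sanitization as handle_request: strip characters that are awkward
    # in filenames before building the per-chat output path
    safe_id = str(chat_id).replace("@", "_").replace("+", "").replace(".", "_")
    return f"/tmp/gemini_voice_{safe_id}.ogg"

print(voice_output_path("user@example.com"))  # /tmp/gemini_voice_user_example_com.ogg
print(voice_output_path("cli"))               # /tmp/gemini_voice_cli.ogg
```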
