gemini-vision
Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install mrgoonie-xxxnaper-gemini-vision
Repository
Skill path: .claude/skills/gemini-vision
Best for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack, Backend, Designer, Testing, Integration.
Target audience: everyone.
License: MIT.
Original source
Catalog source: SkillHub Club.
Repository owner: mrgoonie.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install gemini-vision into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/mrgoonie/xxxnaper before adding gemini-vision to shared team environments
- Use gemini-vision for development workflows
Original source / Raw SKILL.md
---
name: gemini-vision
description: Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
license: MIT
allowed-tools:
- Bash
- Read
- Write
- Edit
---
# Gemini Vision API Skill
This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.
## Quick Start
### Prerequisites
1. **Get API Key**: Obtain from [Google AI Studio](https://aistudio.google.com/apikey)
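2. **Install SDK**: `pip install google-genai` (Python 3.9+). Analyzing images from URLs also requires `requests` (`pip install requests`), which `analyze-image.py` imports for downloads.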
### API Key Configuration
The skill checks for `GEMINI_API_KEY` in this order:
1. **Process environment variable** (recommended)
```bash
export GEMINI_API_KEY="your-api-key"
```
2. **Skill directory**: `.claude/skills/gemini-vision/.env`
```
GEMINI_API_KEY=your-api-key
```
3. **Project directory**: `.env` or `.gemini_api_key` in project root
**Security**: Never commit API keys to version control. Add `.env` to `.gitignore`.
## Core Capabilities
### Image Analysis
- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify image content
- **Visual QA**: Answer questions about image content
- **Multi-image**: Compare and analyze up to 3,600 images
### Advanced Features (Model-Specific)
- **Object Detection**: Identify and locate objects with bounding boxes (Gemini 2.0+)
- **Segmentation**: Create pixel-level masks for objects (Gemini 2.5+)
- **Document Understanding**: Process PDFs with vision (up to 1,000 pages)
## Supported Formats
- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size Limits**:
  - Inline: 20MB max total request size
  - File API: For larger files
  - Max images: 3,600 per request
## Available Models
- **gemini-2.5-pro**: Most capable, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, segmentation + detection
- **gemini-2.5-flash-lite**: Lightweight, segmentation + detection
- **gemini-2.0-flash**: Object detection support
- **gemini-1.5-pro/flash**: Previous generation
## Usage Examples
### Basic Image Analysis
```bash
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
```
### Object Detection (2.0+)
```bash
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
```
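Detection results usually need converting before they can be drawn on the image. The official image-understanding documentation describes boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range. The sketch below rescales them to pixels. It assumes the prompt asked for a JSON list whose items carry `label` and `box_2d` keys; the key names and the helper names `strip_code_fence` and `to_pixel_boxes` are illustrative and not part of this skill's scripts.

```python
import json
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return text


def to_pixel_boxes(response_text: str, width: int, height: int) -> List[Dict]:
    """Rescale normalized [ymin, xmin, ymax, xmax] boxes (0-1000) to pixels."""
    detections = json.loads(strip_code_fence(response_text))
    boxes = []
    for det in detections:
        ymin, xmin, ymax, xmax = det["box_2d"]
        boxes.append({
            "label": det.get("label", ""),  # assumed key; depends on the prompt
            "left": round(xmin / 1000 * width),
            "top": round(ymin / 1000 * height),
            "right": round(xmax / 1000 * width),
            "bottom": round(ymax / 1000 * height),
        })
    return boxes
```

Pass the original image's pixel width and height, not the size of any resized copy, so the boxes line up with the file you draw on.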
### Multi-Image Comparison
```bash
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
```
### File Upload (for large files or reuse)
```bash
# Upload file
python scripts/upload-file.py path/to/large-image.jpg
# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"
```
### File Management
```bash
# List uploaded files
python scripts/manage-files.py list
# Get file info
python scripts/manage-files.py get file-id
# Delete file
python scripts/manage-files.py delete file-id
```
## Token Costs
Images consume tokens based on size:
- **Small** (≤384px both dimensions): 258 tokens
- **Large**: Tiled into 768×768 chunks, 258 tokens each
**Token Formula**:
```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```
**Example**: 960×540 image: crop_unit = 360, tiles = ceil(2.67) × ceil(1.5) = 3 × 2 = 6, total = 6 × 258 = 1,548 tokens
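The rule above can be turned into a small estimator for budgeting requests. This is a sketch under the assumption that each per-dimension tile count is rounded up, which is what reproduces the 6-tile figure for a 960×540 image. The function name `estimate_image_tokens` is illustrative, and the token count reported by the API itself remains the authoritative number.

```python
import math


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate input tokens for one image from its pixel dimensions."""
    # Small images cost a flat 258 tokens.
    if width <= 384 and height <= 384:
        return 258
    # Larger images are split into tiles of roughly min(w, h) / 1.5 pixels.
    # max(1, ...) guards against a zero crop unit for extremely thin images.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * 258


if __name__ == "__main__":
    print(estimate_image_tokens(960, 540))
```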
## Rate Limits
Limits vary by tier (Free, Tier 1, 2, 3):
- Measured in RPM (requests/min), TPM (tokens/min), RPD (requests/day)
- Applied per project, not per API key
- RPD resets at midnight Pacific
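When a request hits the RPM or TPM limit, retrying with exponential backoff is a common way to recover. The sketch below is generic: `RateLimited` is a hypothetical stand-in exception, and `call_with_backoff` is an illustrative helper, not part of this skill's scripts. In real use you would translate the SDK's 429 error into `RateLimited` (or catch the SDK error directly) after checking how your installed `google-genai` version reports status codes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error from the API client."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus up to base_delay seconds of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Backoff helps with per-minute limits only. Once the daily quota (RPD) is exhausted, retries keep failing until the reset at midnight Pacific.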
## Best Practices
### Image Quality
- Use clear, non-blurry images
- Verify correct image rotation
- Consider token costs when sizing
### Prompting
- Be specific in instructions
- Place text after image for single-image prompts
- Use few-shot examples for better accuracy
- Specify output format (JSON, markdown, etc.)
### File Management
- Use File API for files >20MB
- Use File API when the same file is sent in several requests (avoids re-uploading the bytes each time)
- Files auto-delete after 48 hours
- Clean up manually when done
### Security
- Never expose API keys in code
- Use environment variables
- Add API key restrictions in Google Cloud Console
- Monitor usage regularly
- Rotate keys periodically
## Error Handling
Common errors:
- **401**: Invalid API key
- **429**: Rate limit exceeded
- **400**: Invalid request (check file size, format)
- **403**: Permission denied (check API key restrictions)
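The status codes above can be mapped to actionable messages in one place. This sketch assumes the caught exception exposes an integer `code` attribute, as the `google-genai` SDK's API error classes are understood to do; confirm that against your installed version. `ERROR_HINTS` and `explain_error` are illustrative names, not part of this skill's scripts.

```python
ERROR_HINTS = {
    400: "Invalid request: check file size, format, and MIME type.",
    401: "Invalid API key: check the GEMINI_API_KEY value.",
    403: "Permission denied: check API key restrictions.",
    429: "Rate limit exceeded: wait and retry, or lower request volume.",
}


def explain_error(exc: Exception) -> str:
    """Return a short hint for an API exception, based on its status code."""
    code = getattr(exc, "code", None)  # assumed attribute; absent on other errors
    if isinstance(code, int) and code in ERROR_HINTS:
        return f"{code}: {ERROR_HINTS[code]}"
    return f"Unexpected error: {exc}"
```

A script could call `explain_error(e)` inside its existing `except Exception as e:` handler and print the result to stderr before exiting.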
## Additional Resources
See the `references/` directory for:
- **api-reference.md**: Detailed API methods and endpoints
- **examples.md**: Comprehensive code examples
- **best-practices.md**: Advanced tips and optimization strategies
## Implementation Guide
When implementing Gemini vision features:
1. **Check API key availability** using the 3-step lookup
2. **Choose appropriate model** based on requirements:
   - Need segmentation? Use 2.5+ models
   - Need detection? Use 2.0+ models
   - Need speed? Use Flash variants
   - Need quality? Use Pro variants
3. **Validate inputs**:
   - Check file format (PNG, JPEG, WEBP, HEIC, HEIF, PDF)
   - Verify file size (<20MB total for inline requests; use the File API above that)
   - Count images (max 3,600)
4. **Handle responses** appropriately:
   - Parse structured output if requested
   - Extract bounding boxes for object detection
   - Process segmentation masks if applicable
5. **Manage files** efficiently:
   - Upload large files via File API
   - Reuse uploaded files when possible
   - Clean up after use
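The input-validation step can be sketched as a pre-flight check that runs before any API call. The names `plan_inputs`, `SUPPORTED_EXTENSIONS`, `INLINE_LIMIT_BYTES`, and `MAX_IMAGES` are illustrative and do not exist in the skill's scripts. The 20MB threshold is treated as approximate here, because the inline limit applies to the whole request (prompt and encoding overhead included), not only to the raw file bytes.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif", ".pdf"}
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # approximate inline request budget
MAX_IMAGES = 3600


def plan_inputs(paths: List[str]) -> str:
    """Validate local files and return 'inline' or 'file_api'."""
    if not paths:
        raise ValueError("at least one file is required")
    if len(paths) > MAX_IMAGES:
        raise ValueError(f"too many images: {len(paths)} > {MAX_IMAGES}")
    total_bytes = 0
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"unsupported format: {raw}")
        if not path.is_file():
            raise ValueError(f"file not found: {raw}")
        total_bytes += path.stat().st_size
    # Stay under the inline budget, otherwise route through the File API.
    return "inline" if total_bytes < INLINE_LIMIT_BYTES else "file_api"
```

A caller that gets `"file_api"` back would upload with `upload-file.py` first and then pass the returned ID as a `file://` reference.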
## Scripts Overview
All scripts support the 3-step API key lookup:
- **analyze-image.py**: Main script for image analysis, supports inline and File API
- **upload-file.py**: Upload files to Gemini File API
- **manage-files.py**: List, get metadata, and delete uploaded files
Run any script with `--help` for detailed usage instructions.
---
**Official Documentation**: https://ai.google.dev/gemini-api/docs/image-understanding
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/analyze-image.py
```python
#!/usr/bin/env python3
"""
Gemini Vision API - Image Analysis Script
This script analyzes images using Google's Gemini API with support for:
- Single or multiple images
- Inline data or File API uploads
- Object detection and segmentation
- Custom prompts and models
API Key Lookup Order:
1. Process environment variable (GEMINI_API_KEY)
2. Skill directory (.claude/skills/gemini-vision/.env)
3. Project directory (.env or .gemini_api_key)
"""
import argparse
import os
import sys
from pathlib import Path
from typing import List, Optional
def find_api_key() -> Optional[str]:
    """
    Find GEMINI_API_KEY using 3-step lookup:
    1. Process environment variable
    2. Skill directory (.env)
    3. Project directory (.env or .gemini_api_key)
    """
    # Step 1: Check process environment
    api_key = os.environ.get('GEMINI_API_KEY')
    if api_key:
        return api_key

    # Step 2: Check skill directory
    skill_dir = Path(__file__).parent.parent  # .claude/skills/gemini-vision/
    skill_env = skill_dir / '.env'
    if skill_env.exists():
        with open(skill_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Step 3: Check project directory
    # Try to find project root (go up from skill dir)
    project_dir = skill_dir.parent.parent.parent  # Back to project root

    # Check .env in project root
    project_env = project_dir / '.env'
    if project_env.exists():
        with open(project_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Check .gemini_api_key in project root
    api_key_file = project_dir / '.gemini_api_key'
    if api_key_file.exists():
        with open(api_key_file, 'r') as f:
            return f.read().strip()

    return None
# Extension-to-MIME mapping shared by the URL and local-file branches
MIME_TYPES = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.webp': 'image/webp',
    '.heic': 'image/heic',
    '.heif': 'image/heif',
    '.pdf': 'application/pdf'
}


def analyze_image(
    image_paths: List[str],
    prompt: str,
    model: str = "gemini-2.5-flash",
    output_format: Optional[str] = None
) -> str:
    """
    Analyze one or more images with Gemini API.

    Args:
        image_paths: List of image file paths, URLs, or file:// references
        prompt: Question or instruction for the model
        model: Gemini model to use
        output_format: Optional output format (json, markdown, etc.);
            main() already folds the matching instruction into the prompt

    Returns:
        Model response text
    """
    try:
        from google import genai
        from google.genai import types
    except ImportError:
        print("Error: google-genai package not installed.", file=sys.stderr)
        print("Install with: pip install google-genai", file=sys.stderr)
        sys.exit(1)

    # Find API key
    api_key = find_api_key()
    if not api_key:
        print("Error: GEMINI_API_KEY not found.", file=sys.stderr)
        print("Set it using one of these methods:", file=sys.stderr)
        print("  1. export GEMINI_API_KEY='your-key'", file=sys.stderr)
        print("  2. Create .claude/skills/gemini-vision/.env", file=sys.stderr)
        print("  3. Create .env or .gemini_api_key in project root", file=sys.stderr)
        sys.exit(1)

    # Initialize client
    client = genai.Client(api_key=api_key)

    # Prepare content parts: image parts first, prompt last
    contents = []

    for path in image_paths:
        if path.startswith('file://'):
            # File API reference: fetch the stored metadata so the part
            # carries the URI and MIME type the SDK needs
            file_id = path[7:]  # Remove 'file://' prefix
            contents.append(client.files.get(name=file_id))
        elif path.startswith(('http://', 'https://')):
            # URL - download and convert to bytes
            import requests
            response = requests.get(path, timeout=30)
            response.raise_for_status()

            # Detect MIME type from content-type header or extension
            content_type = response.headers.get('content-type', '').split(';')[0].strip()
            if not content_type or content_type == 'application/octet-stream':
                # Fallback to extension
                ext = Path(path).suffix.lower()
                content_type = MIME_TYPES.get(ext, 'image/jpeg')

            contents.append(types.Part.from_bytes(
                data=response.content,
                mime_type=content_type
            ))
        else:
            # Local file
            path_obj = Path(path)
            if not path_obj.exists():
                print(f"Error: File not found: {path}", file=sys.stderr)
                sys.exit(1)

            # Read file
            with open(path, 'rb') as f:
                image_bytes = f.read()

            # Detect MIME type from extension
            mime_type = MIME_TYPES.get(path_obj.suffix.lower(), 'image/jpeg')

            contents.append(types.Part.from_bytes(
                data=image_bytes,
                mime_type=mime_type
            ))

    # Place the prompt after the image parts (see Best Practices > Prompting)
    contents.append(prompt)

    # Generate response
    try:
        response = client.models.generate_content(
            model=model,
            contents=contents
        )
        return response.text
    except Exception as e:
        print(f"Error calling Gemini API: {e}", file=sys.stderr)
        sys.exit(1)
def main():
    parser = argparse.ArgumentParser(
        description='Analyze images using Google Gemini API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze single image
  %(prog)s image.jpg "What's in this image?"

  # Multiple images
  %(prog)s img1.jpg img2.jpg "What's different?"

  # From URL
  %(prog)s https://example.com/img.jpg "Describe this"

  # Use uploaded file
  %(prog)s file://file-id "Caption this"

  # Specify model
  %(prog)s image.jpg "Detect objects" --model gemini-2.0-flash

  # Request JSON output
  %(prog)s image.jpg "List objects as JSON" --format json
"""
    )
    parser.add_argument(
        'images',
        nargs='+',
        help='Image file paths, URLs, or file:// references'
    )
    parser.add_argument(
        'prompt',
        help='Question or instruction for the model'
    )
    parser.add_argument(
        '--model',
        default='gemini-2.5-flash',
        choices=[
            'gemini-2.5-pro',
            'gemini-2.5-flash',
            'gemini-2.5-flash-lite',
            'gemini-2.0-flash',
            'gemini-2.0-flash-lite',
            'gemini-1.5-pro',
            'gemini-1.5-flash'
        ],
        help='Gemini model to use (default: gemini-2.5-flash)'
    )
    parser.add_argument(
        '--format',
        choices=['json', 'markdown', 'plain'],
        help='Preferred output format'
    )
    args = parser.parse_args()

    # Enhance prompt with format request if specified
    prompt = args.prompt
    if args.format:
        format_instructions = {
            'json': ' Return the response as valid JSON.',
            'markdown': ' Return the response as markdown.',
            'plain': ' Return the response as plain text.'
        }
        prompt += format_instructions.get(args.format, '')

    # Analyze images
    result = analyze_image(
        image_paths=args.images,
        prompt=prompt,
        model=args.model,
        output_format=args.format
    )
    print(result)


if __name__ == '__main__':
    main()
```
### scripts/upload-file.py
```python
#!/usr/bin/env python3
"""
Gemini Vision API - File Upload Script
Upload files to the Gemini File API for reuse across multiple requests.
Files uploaded via the API are automatically deleted after 48 hours.
API Key Lookup Order:
1. Process environment variable (GEMINI_API_KEY)
2. Skill directory (.claude/skills/gemini-vision/.env)
3. Project directory (.env or .gemini_api_key)
"""
import argparse
import os
import sys
from pathlib import Path
from typing import Optional
def find_api_key() -> Optional[str]:
    """
    Find GEMINI_API_KEY using 3-step lookup:
    1. Process environment variable
    2. Skill directory (.env)
    3. Project directory (.env or .gemini_api_key)
    """
    # Step 1: Check process environment
    api_key = os.environ.get('GEMINI_API_KEY')
    if api_key:
        return api_key

    # Step 2: Check skill directory
    skill_dir = Path(__file__).parent.parent
    skill_env = skill_dir / '.env'
    if skill_env.exists():
        with open(skill_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Step 3: Check project directory
    project_dir = skill_dir.parent.parent.parent

    # Check .env in project root
    project_env = project_dir / '.env'
    if project_env.exists():
        with open(project_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Check .gemini_api_key in project root
    api_key_file = project_dir / '.gemini_api_key'
    if api_key_file.exists():
        with open(api_key_file, 'r') as f:
            return f.read().strip()

    return None
def upload_file(file_path: str, display_name: Optional[str] = None) -> dict:
    """
    Upload a file to Gemini File API.

    Args:
        file_path: Path to the file to upload
        display_name: Optional display name for the file

    Returns:
        Dictionary with file metadata
    """
    try:
        from google import genai
    except ImportError:
        print("Error: google-genai package not installed.", file=sys.stderr)
        print("Install with: pip install google-genai", file=sys.stderr)
        sys.exit(1)

    # Find API key
    api_key = find_api_key()
    if not api_key:
        print("Error: GEMINI_API_KEY not found.", file=sys.stderr)
        print("Set it using one of these methods:", file=sys.stderr)
        print("  1. export GEMINI_API_KEY='your-key'", file=sys.stderr)
        print("  2. Create .claude/skills/gemini-vision/.env", file=sys.stderr)
        print("  3. Create .env or .gemini_api_key in project root", file=sys.stderr)
        sys.exit(1)

    # Check file exists
    path = Path(file_path)
    if not path.exists():
        print(f"Error: File not found: {file_path}", file=sys.stderr)
        sys.exit(1)

    # Initialize client
    client = genai.Client(api_key=api_key)

    try:
        # Upload file; the display name is passed through the upload config
        # because files.upload() takes no separate name keyword
        print(f"Uploading {file_path}...", file=sys.stderr)
        uploaded_file = client.files.upload(
            file=file_path,
            config={'display_name': display_name} if display_name else None
        )

        # Return file metadata
        return {
            'name': uploaded_file.name,
            'display_name': uploaded_file.display_name,
            'mime_type': uploaded_file.mime_type,
            'size_bytes': uploaded_file.size_bytes,
            'uri': uploaded_file.uri,
            'state': uploaded_file.state,
        }
    except Exception as e:
        print(f"Error uploading file: {e}", file=sys.stderr)
        sys.exit(1)
def main():
    parser = argparse.ArgumentParser(
        description='Upload files to Gemini File API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Upload a file
  %(prog)s image.jpg

  # Upload with custom display name
  %(prog)s image.jpg --name "My Image"

  # Upload PDF
  %(prog)s document.pdf --name "Report"

Notes:
  - Files are automatically deleted after 48 hours
  - Use the returned file ID with analyze-image.py: file://file-id
  - The Files API documents a 2 GB per-file limit and 20 GB of storage per project
"""
    )
    parser.add_argument(
        'file',
        help='Path to file to upload'
    )
    parser.add_argument(
        '--name',
        help='Display name for the uploaded file'
    )
    parser.add_argument(
        '--json',
        action='store_true',
        help='Output as JSON instead of human-readable format'
    )
    args = parser.parse_args()

    # Upload file
    file_info = upload_file(args.file, args.name)

    if args.json:
        import json
        print(json.dumps(file_info, indent=2))
    else:
        print("\nFile uploaded successfully!", file=sys.stderr)
        print(f"File ID: {file_info['name']}")
        print(f"Display Name: {file_info['display_name']}")
        print(f"MIME Type: {file_info['mime_type']}")
        print(f"Size: {file_info['size_bytes']} bytes")
        print(f"State: {file_info['state']}")
        print("\nUse with analyze-image.py:")
        print(f"  python scripts/analyze-image.py file://{file_info['name']} \"Your prompt\"")


if __name__ == '__main__':
    main()
```
### scripts/manage-files.py
```python
#!/usr/bin/env python3
"""
Gemini Vision API - File Management Script
Manage files uploaded to the Gemini File API:
- List all uploaded files
- Get file metadata
- Delete files
API Key Lookup Order:
1. Process environment variable (GEMINI_API_KEY)
2. Skill directory (.claude/skills/gemini-vision/.env)
3. Project directory (.env or .gemini_api_key)
"""
import argparse
import os
import sys
from pathlib import Path
from typing import Optional
def find_api_key() -> Optional[str]:
    """
    Find GEMINI_API_KEY using 3-step lookup:
    1. Process environment variable
    2. Skill directory (.env)
    3. Project directory (.env or .gemini_api_key)
    """
    # Step 1: Check process environment
    api_key = os.environ.get('GEMINI_API_KEY')
    if api_key:
        return api_key

    # Step 2: Check skill directory
    skill_dir = Path(__file__).parent.parent
    skill_env = skill_dir / '.env'
    if skill_env.exists():
        with open(skill_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Step 3: Check project directory
    project_dir = skill_dir.parent.parent.parent

    # Check .env in project root
    project_env = project_dir / '.env'
    if project_env.exists():
        with open(project_env, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"\'')

    # Check .gemini_api_key in project root
    api_key_file = project_dir / '.gemini_api_key'
    if api_key_file.exists():
        with open(api_key_file, 'r') as f:
            return f.read().strip()

    return None
def get_client():
    """Get authenticated Gemini client."""
    try:
        from google import genai
    except ImportError:
        print("Error: google-genai package not installed.", file=sys.stderr)
        print("Install with: pip install google-genai", file=sys.stderr)
        sys.exit(1)

    # Find API key
    api_key = find_api_key()
    if not api_key:
        print("Error: GEMINI_API_KEY not found.", file=sys.stderr)
        print("Set it using one of these methods:", file=sys.stderr)
        print("  1. export GEMINI_API_KEY='your-key'", file=sys.stderr)
        print("  2. Create .claude/skills/gemini-vision/.env", file=sys.stderr)
        print("  3. Create .env or .gemini_api_key in project root", file=sys.stderr)
        sys.exit(1)

    return genai.Client(api_key=api_key)
def list_files(json_output: bool = False):
    """List all uploaded files."""
    client = get_client()
    try:
        files = client.files.list()
        if json_output:
            import json
            file_list = []
            for file in files:
                file_list.append({
                    'name': file.name,
                    'display_name': file.display_name,
                    'mime_type': file.mime_type,
                    'size_bytes': file.size_bytes,
                    'state': file.state,
                })
            print(json.dumps(file_list, indent=2))
        else:
            file_count = 0
            for file in files:
                file_count += 1
                print(f"\n{file_count}. {file.display_name or file.name}")
                print(f"   ID: {file.name}")
                print(f"   MIME: {file.mime_type}")
                print(f"   Size: {file.size_bytes} bytes")
                print(f"   State: {file.state}")
            if file_count == 0:
                print("No files found.")
            else:
                print(f"\nTotal files: {file_count}")
    except Exception as e:
        print(f"Error listing files: {e}", file=sys.stderr)
        sys.exit(1)
def get_file(file_id: str, json_output: bool = False):
    """Get file metadata."""
    client = get_client()
    try:
        file = client.files.get(name=file_id)
        if json_output:
            import json
            file_info = {
                'name': file.name,
                'display_name': file.display_name,
                'mime_type': file.mime_type,
                'size_bytes': file.size_bytes,
                'uri': file.uri,
                'state': file.state,
            }
            print(json.dumps(file_info, indent=2))
        else:
            print(f"File: {file.display_name or file.name}")
            print(f"ID: {file.name}")
            print(f"MIME Type: {file.mime_type}")
            print(f"Size: {file.size_bytes} bytes")
            print(f"URI: {file.uri}")
            print(f"State: {file.state}")
    except Exception as e:
        print(f"Error getting file: {e}", file=sys.stderr)
        sys.exit(1)
def delete_file(file_id: str):
    """Delete a file."""
    client = get_client()
    try:
        client.files.delete(name=file_id)
        print(f"File deleted successfully: {file_id}")
    except Exception as e:
        print(f"Error deleting file: {e}", file=sys.stderr)
        sys.exit(1)
def main():
    parser = argparse.ArgumentParser(
        description='Manage files in Gemini File API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # List all files
  %(prog)s list

  # Get file metadata
  %(prog)s get files/abc123

  # Delete a file
  %(prog)s delete files/abc123

  # Output as JSON
  %(prog)s list --json
  %(prog)s get files/abc123 --json
"""
    )
    parser.add_argument(
        'action',
        choices=['list', 'get', 'delete'],
        help='Action to perform'
    )
    parser.add_argument(
        'file_id',
        nargs='?',
        help='File ID (required for get and delete)'
    )
    parser.add_argument(
        '--json',
        action='store_true',
        help='Output as JSON'
    )
    args = parser.parse_args()

    # Validate file_id for get and delete
    if args.action in ['get', 'delete'] and not args.file_id:
        parser.error(f"file_id is required for {args.action} action")

    # Execute action
    if args.action == 'list':
        list_files(args.json)
    elif args.action == 'get':
        get_file(args.file_id, args.json)
    elif args.action == 'delete':
        delete_file(args.file_id)


if __name__ == '__main__':
    main()
```