Back to skills
SkillHub ClubWrite Technical DocsFull StackBackendData / AI

ai-multimodal

Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
0
Hot score
74
Updated
March 20, 2026
Overall rating
C3.1
Composite score
3.1
Best-practice grade
S96.0

Install command

npx @skill-hub/cli install namdfang-claudfang-engineer-ai-multimodal

Repository

namdfang/claudfang-engineer

Skill path: .claude/skills/ai-multimodal

Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Backend, Data / AI, Tech Writer, Designer.

Target audience: Development teams looking for install-ready agent workflows..

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: namdfang.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install ai-multimodal into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/namdfang/claudfang-engineer before adding ai-multimodal to shared team environments
  • Use ai-multimodal for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

ai-multimodal | SkillHub