
pr-walkthrough

Create a narrated video walkthrough of a pull request with code slides and audio narration. Use when asked to create a PR walkthrough, PR video, or walkthrough video.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source lives below.

Stars: 45,919.

Hot score: 99.

Updated: March 20, 2026.

Overall rating: 4.0 (grade C).

Composite score: 4.0.

Best-practice grade: B (77.6).

Install command

npx @skill-hub/cli install tldraw-tldraw-pr-walkthrough

Repository

tldraw/tldraw

Skill path: .claude/skills/pr-walkthrough

Create a narrated video walkthrough of a pull request with code slides and audio narration. Use when asked to create a PR walkthrough, PR video, or walkthrough video.

Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: tldraw.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install pr-walkthrough into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/tldraw/tldraw before adding pr-walkthrough to shared team environments
  • Use pr-walkthrough for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: pr-walkthrough
description: Create a narrated video walkthrough of a pull request with code slides and audio narration. Use when asked to create a PR walkthrough, PR video, or walkthrough video.
argument-hint: <pr-url>
disable-model-invocation: true
---

# PR walkthrough video

Create a narrated walkthrough video for a pull request. This is designed to be an internal artifact, providing the same benefit as a Loom video recorded by the pull request's author: walking through the code changes and explaining what was done and why, so that anyone watching can understand the PR quickly.

**Input:** A GitHub pull request URL (e.g., `https://github.com/tldraw/tldraw/pull/7924`). If given just a PR number or other description, assume that the PR is on the tldraw/tldraw repository.

**Output:** An MP4 video at 1600x900 with audio narration and standardized intro / outro slides, saved to `.claude/skills/pr-walkthrough/out/pr-<number>-walkthrough.mp4`.

All intermediate files (audio, manifest, scripts) go in `.claude/skills/pr-walkthrough/tmp/pr-<number>/`. This directory is gitignored. Only the final `.mp4` lives at `.claude/skills/pr-walkthrough/out/`.

## Philosophy

**This is a walkthrough from the author's perspective.** The goal is the same as if the PR author sat down with someone and walked them through the changes — showing specific code, explaining what changed and why, in an order that builds understanding. The viewer should come away understanding both _what the code does_ and _how to think about the changes_.

This means:

- **The narration drives everything.** Write the walkthrough narration first, as a continuous explanation of the PR. Then figure out what should be on screen at each moment to support what's being said.
- **Show the code.** The default visual is a code diff or source file. Text slides are the exception (intro, brief transitions, outro), not the rule. When the narration talks about a function, the viewer should be looking at that function.
- **Walk through changes in a logical order**, not necessarily file order or commit order — but always anchored to concrete code, not abstract descriptions.
- **Explain the "why", not just the "what".** The code on screen shows what changed. The narration adds the reasoning — why this approach, what problem it solves, what edge cases it handles.

## Workflow

### Step 1: Understand the PR

Read the PR commits, diff, and description. Understand the narrative arc:

- What problem does this solve?
- What's the approach?
- What are the key mechanisms?

```bash
gh pr view <number> --json title,body,commits
git log main..HEAD --oneline
git diff main..HEAD --stat
```

### Step 2: Write the narration

Write the narration as continuous text, broken into logical segments. Each segment is a beat of the walkthrough — a concept, a change, or a group of related changes. Save this as `.claude/skills/pr-walkthrough/tmp/pr-<number>/SCRIPT.md`.

The narration should read like the author explaining the PR to a colleague: "So here's what we're doing... The core problem was X... The approach I took was Y... If you look at this function here..."

Structure: intro → context/problem → code walkthrough → summary. See **Script structure** below.

If the commits are simple and well organized (often on a branch with `-clean` in its name), you can let their commit messages and descriptions guide your narration. Otherwise, examine the code and create your own narrative, introducing concepts in an order that builds on what came before.

Avoid redundancy, especially between intro and first content segment.

### Step 3: Generate audio and timestamps

Generate all narration as a **single audio file**, then split it into per-segment clips. This produces consistent voice, volume, and pacing across the entire walkthrough.

Write a `narration.json` file, then run the `generate-audio.sh` CLI tool:

```bash
.claude/skills/pr-walkthrough/scripts/generate-audio.sh narration.json .claude/skills/pr-walkthrough/tmp/pr-<number>/
```

**API key:** Sourced automatically from the repo `.env` file (`GEMINI_API_KEY`).

#### Narration JSON format

```json
{
	"style": "Read the following walkthrough narration in a calm, steady, professional tone. Speak at a measured pace as if the author of a pull request were walking a colleague through the code changes. Between each numbered section, leave a brief pause — no more than one second of silence.",
	"voice": "Iapetus",
	"slides": [
		"This pull request adds group-aware binding resolution to the arrow tool...",
		"The core problem was that arrow bindings broke when the target shape...",
		"If you look at the getBindingTarget method in ArrowBindingUtil.ts..."
	]
}
```

- **`style`** — Voice persona and pacing instructions. Keep it short and specific.
- **`voice`** — Gemini voice name (default: `Iapetus`).
- **`slides`** — Array of narration text, one entry per segment. The script adds `[1]`, `[2]` section markers automatically.
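
Since the skill targets 5-7 minutes of narration (roughly 750-1050 words at the ~150 wpm rate the audio script assumes), it helps to check segment lengths before generating audio. A minimal sketch; the helper is hypothetical, not part of the skill's tooling, and assumes only the `slides` field above:

```python
# check_narration.py: sanity-check a narration.json before generating audio.
# Hypothetical helper, not part of the skill's tooling; it assumes only the
# "slides" field documented above.
import json
import sys

with open(sys.argv[1]) as f:
    data = json.load(f)

words = [len(text.split()) for text in data["slides"]]

for i, w in enumerate(words):
    # ~150 wpm (2.5 words/sec) is the same rate generate-audio.sh assumes
    print(f"segment {i}: {w} words (~{w / 2.5:.0f}s)")

print(f"total: {sum(words)} words (~{sum(words) / 150:.1f} min)")
```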

#### How it works

1. The script builds a single prompt: style preamble + numbered sections with all segment narrations.
2. One API call to `gemini-2.5-pro-preview-tts` generates the full narration as a single WAV. The 32k-token context window is plenty for 5-7 minutes.
3. The WAV is uploaded to the Gemini Files API, then a `gemini-2.5-flash` call listens to the audio alongside the segment texts and returns the start timestamp (in seconds) of each segment. The script splits at those boundaries.

**Output:** Per-segment audio clips (`audio-00.wav`, ...) and a `durations.json` file mapping each audio filename to its duration in seconds.

**Dependencies:** ffmpeg / ffprobe. No Python packages required beyond the standard library.

**Do NOT use** `[pause long]` or `[pause medium]` markup tags in the narration text — the model may read them aloud literally.

**TTS truncation:** If `generate-audio.sh` fails because the TTS output was truncated (zero-length clips at the end), **do not shorten the narration**. Instead, reduce `MAX_WORDS_PER_CHUNK` in the script (e.g., from 600 to 400) so the narration is split across more TTS API calls. The script already supports multi-chunk generation — it generates each chunk separately and concatenates the results. The fix is always to split into more chunks, never to cut content from the script.
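
After rerunning, you can confirm the fix by inspecting `durations.json` directly. A minimal sketch, assuming only the `durations.json` format described above (the check itself is hypothetical):

```python
# Hypothetical check, not part of the skill's tooling: confirm no clip in
# durations.json ended up zero-length after rerunning generate-audio.sh.
import json
import sys

with open(sys.argv[1]) as f:
    durations = json.load(f)

empty = [name for name, secs in durations.items() if secs <= 0.01]
if empty:
    sys.exit(f"truncated clips: {empty}; lower MAX_WORDS_PER_CHUNK and rerun")
print(f"ok: {len(durations)} clips, {sum(durations.values()):.1f}s total")
```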

### Step 4: Write the manifest

The manifest is a JSON file that describes every slide in the video. It bridges the narration/audio step and the Remotion renderer.

Read the `durations.json` from step 3 to get the duration (in seconds) for each audio clip. Then write a `manifest.json` alongside the audio files:

```json
{
	"pr": 7865,
	"slides": [
		{
			"type": "intro",
			"title": "Fix canvas-in-front z-index layering #7865",
			"date": "February 14, 2026",
			"audio": "audio-00.wav",
			"durationInSeconds": 3.2
		},
		{
			"type": "diff",
			"filename": "packages/editor/editor.css",
			"language": "css",
			"diff": "@@ -12,7 +12,7 @@\n   --tl-z-canvas: 100;\n-  --tl-z-canvas-in-front: 600;\n+  --tl-z-canvas-in-front: 250;\n   --tl-z-shapes: 300;",
			"audio": "audio-01.wav",
			"durationInSeconds": 25.8
		},
		{
			"type": "code",
			"filename": "packages/editor/src/lib/Editor.ts",
			"language": "typescript",
			"code": "function getZIndex() {\n  return 250\n}",
			"audio": "audio-02.wav",
			"durationInSeconds": 13.5
		},
		{
			"type": "text",
			"title": "Summary",
			"subtitle": "Moved canvas-in-front from z-index 600 to 250.",
			"audio": "audio-07.wav",
			"durationInSeconds": 7.4
		},
		{
			"type": "list",
			"title": "Key changes",
			"items": ["Lowered z-index", "Updated tests", "Added migration"],
			"audio": "audio-06.wav",
			"durationInSeconds": 10.2
		},
		{
			"type": "outro",
			"durationInSeconds": 3
		}
	]
}
```
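
Filling each `durationInSeconds` by hand from `durations.json` is mechanical and easy to get wrong. A minimal sketch of that step, as a hypothetical helper that assumes only the two file formats shown here (`segment` and `outro` slides have no audio and keep their fixed durations):

```python
# fill_durations.py: copy durations.json values into a manifest's
# durationInSeconds fields. Hypothetical helper, not part of the skill.
import json
import sys

manifest_path, durations_path = sys.argv[1], sys.argv[2]

with open(durations_path) as f:
    durations = json.load(f)
with open(manifest_path) as f:
    manifest = json.load(f)

for slide in manifest["slides"]:
    if "audio" in slide:  # segment/outro slides have no audio clip
        slide["durationInSeconds"] = durations[slide["audio"]]

with open(manifest_path, "w") as f:
    json.dump(manifest, f, indent="\t")
```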

#### Slide types

| Type      | Required fields                                              | Description                        |
| --------- | ------------------------------------------------------------ | ---------------------------------- |
| `intro`   | `title`, `date`, `audio`, `durationInSeconds`                | Logo + title + date                |
| `diff`    | `filename`, `language`, `diff`, `audio`, `durationInSeconds` | Syntax-highlighted unified diff    |
| `code`    | `filename`, `language`, `code`, `audio`, `durationInSeconds` | Syntax-highlighted source code     |
| `text`    | `title`, `audio`, `durationInSeconds`                        | Title + optional `subtitle`        |
| `list`    | `title`, `items`, `audio`, `durationInSeconds`               | Title + numbered items             |
| `image`   | `src`, `audio`, `durationInSeconds`                          | Pre-rendered image (fallback)      |
| `segment` | `title`, `durationInSeconds`                                 | Silent title card between segments |
| `outro`   | `durationInSeconds`                                          | Logo only, no audio                |

#### Animated scroll with `focus`

For longer diffs or code (more than ~30 lines), the renderer keeps the font at a readable 16px and uses an animated viewport that scrolls between focus points. Add a `focus` array to `diff` or `code` slides:

```json
{
	"type": "diff",
	"filename": "packages/editor/src/lib/Editor.ts",
	"language": "typescript",
	"diff": "... 60-line diff ...",
	"focus": [
		{ "line": 3, "at": 0 },
		{ "line": 25, "at": 0.4 },
		{ "line": 50, "at": 0.8 }
	],
	"audio": "audio-03.wav",
	"durationInSeconds": 30
}
```

- **`line`** — The line number (0-indexed into the parsed diff/code lines) to center on screen.
- **`at`** — When to arrive at this position, as a fraction of the slide's duration (0 = start, 1 = end).

The viewport smoothly eases between focus points. Before the first point, it holds at the first position; after the last, it holds there.

**When to use focus:** Any diff or code slide with more than ~30 lines. Without focus, long content starts at the top and stays static — the viewer can't see the bottom. With focus, you guide the viewer's eye to the code being discussed at each moment.

**When to omit focus:** Short diffs (≤30 lines) fit on screen at 16px and don't need scrolling.
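
For intuition, the scrolling can be sketched as interpolation between focus points. This is only an illustration of the behavior described above, not the renderer's implementation (which lives in `video/src/`); the smoothstep easing is an assumption:

```python
# Illustration only: which line is centered at time t, given focus points.
# Holds at the first point before its "at", eases between points, and holds
# at the last point after its "at". The easing curve is an assumption.
def focused_line(focus: list[dict], t: float) -> float:
    if t <= focus[0]["at"]:
        return focus[0]["line"]
    for a, b in zip(focus, focus[1:]):
        if t <= b["at"]:
            p = (t - a["at"]) / (b["at"] - a["at"])
            p = p * p * (3 - 2 * p)  # smoothstep
            return a["line"] + p * (b["line"] - a["line"])
    return focus[-1]["line"]

focus = [{"line": 3, "at": 0}, {"line": 25, "at": 0.4}, {"line": 50, "at": 0.8}]
print(focused_line(focus, 0.2))  # 14.0: halfway (eased) between lines 3 and 25
```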

#### Writing diff fields

For `diff` slides, paste the **unified diff** for the relevant hunk(s). This is the output of `git diff` for that section of the file — including the `@@` hunk header and `+`/`-`/` ` line prefixes. The renderer parses these prefixes to apply green/red backgrounds and syntax highlighting.

To get a diff for a specific file:

```bash
git diff main..HEAD -- path/to/file.ts
```

Include only the relevant hunks, not the entire file diff. Strip the `diff --git` and `---`/`+++` header lines — start from the `@@` hunk header.

For `code` slides, paste the relevant source code (a function, a class, a section). No diff prefixes needed.
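
Because the manifest is JSON, diff (and code) text must be embedded as a single escaped string, with newlines as `\n`. A minimal sketch of that conversion for a diff field; the helper is hypothetical, and you still trim the output to the relevant hunks by hand:

```python
# diff_to_json.py: print a git diff for one file as a JSON string literal,
# starting from the first @@ hunk header (the diff/---/+++ headers above it
# are stripped). Hypothetical helper, not part of the skill's tooling.
import json
import subprocess
import sys

path = sys.argv[1]  # e.g. packages/editor/editor.css
out = subprocess.run(
    ["git", "diff", "main..HEAD", "--", path],
    capture_output=True, text=True, check=True,
).stdout

lines = out.splitlines()
start = next(i for i, line in enumerate(lines) if line.startswith("@@"))
print(json.dumps("\n".join(lines[start:])))
```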

#### Segment title slides

Insert a **`segment` slide** before each content segment to introduce it, except before the intro and context/overview segments; code walkthrough segments and the summary/conclusion each get one. Each segment slide is **3 seconds of silence** with the segment title centered on screen.

```json
{
	"type": "segment",
	"title": "Zoom state machine",
	"durationInSeconds": 3
}
```

These provide clear visual breaks between sections and give the viewer a moment to orient before each new topic.

#### Segment title labels on code/diff slides

Add a `title` field to `code` and `diff` slides to show a small label in the top-left corner identifying which segment the viewer is in. Use the same title as the preceding `segment` slide. This helps orient viewers, especially when a segment spans multiple slides.

```json
{
	"type": "diff",
	"title": "Zoom state machine",
	"filename": "packages/editor/src/lib/ZoomTool.ts",
	...
}
```

### Step 5: Render the video

Run the `render.sh` script:

```bash
.claude/skills/pr-walkthrough/video/render.sh \
  .claude/skills/pr-walkthrough/tmp/pr-<number>/manifest.json \
  .claude/skills/pr-walkthrough/out/pr-<number>-walkthrough.mp4
```

The script copies manifest + audio files into the Remotion project's `public/` directory, installs npm dependencies if needed, and renders the video.

**Dependencies:** Node.js 18+, ffmpeg (for final encoding). The first run installs Remotion (~50MB).
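
To verify the rendered file matches the expected 1600x900 output, ffprobe (already a dependency) can report the resolution and duration. A minimal sketch; the check itself is hypothetical, not part of the skill's tooling:

```python
# verify_output.py: check the rendered MP4's resolution and duration with
# ffprobe. Hypothetical check, not part of the skill's tooling.
import subprocess
import sys

out = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "stream=width,height:format=duration",
     "-of", "csv=p=0", sys.argv[1]],
    capture_output=True, text=True, check=True,
).stdout.split()

width, height = out[0].split(",")  # stream line: "1600,900"
print(f"{width}x{height}, {float(out[1]):.1f}s")
if (width, height) != ("1600", "900"):
    sys.exit("unexpected resolution")
```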

## File organization

Final output lives in `.claude/skills/pr-walkthrough/out/`. All intermediate files go in `.claude/skills/pr-walkthrough/tmp/` (both gitignored):

```
.claude/skills/pr-walkthrough/
├── SKILL.md                    # This file
├── scripts/                    # CLI tools (checked in)
│   └── generate-audio.sh       # narration.json → per-slide WAVs + durations.json
├── video/                      # Remotion project (checked in)
│   ├── package.json
│   ├── tsconfig.json
│   ├── remotion.config.ts
│   ├── render.sh               # manifest.json → MP4
│   ├── public/                 # Auto-populated at render time
│   └── src/                    # React components for each slide type
├── out/                        # Final outputs (gitignored)
│   └── pr-XXXX-walkthrough.mp4
└── tmp/                        # Intermediate files (gitignored)
    └── pr-XXXX/
        ├── SCRIPT.md           # Narration script
        ├── narration.json      # Input to generate-audio.sh
        ├── full-narration.wav  # Full TTS output before splitting
        ├── durations.json      # Audio filename → duration in seconds
        ├── manifest.json       # Input to render.sh
        └── audio-XX.wav        # Per-segment audio clips
```

## API configuration

- **Gemini API key:** Stored as `GEMINI_API_KEY` in the project root `.env` file. Used for TTS and audio alignment.
- **TTS model:** `gemini-2.5-pro-preview-tts`
- **TTS voice:** `Iapetus` (always)

## Script structure

The walkthrough follows a consistent narrative arc. Not every section needs its own segment — combine or skip sections based on the PR's complexity. The goal is 8-12 segments total, with the vast majority showing code.

### Intro (1 segment)

The intro card: tldraw logo + PR title + date. The narration should be a single sentence that frames what this PR does at a high level. Don't go into detail yet.

Manifest slide type: `intro`.

### Context (0-1 segments)

Brief orientation before diving into code. What was the situation before this PR? What problem or need motivated the work? Keep this short — just enough framing that the code walkthrough makes sense.

- Be concrete: "Arrow bindings broke when the target shape was inside a group" not "There were issues with bindings"
- Name the area of the codebase affected

If the context can be explained while showing the first piece of relevant code, skip the standalone context segment and fold it into the first code segment.

Manifest slide type: `text` or `diff` (if showing the problematic code).

### Code walkthrough (6-10 segments)

The bulk of the video. Walk through the actual code changes, showing specific diffs and files while explaining what was done and why.

**Every segment should show code.** Use `diff` slides for changes and `code` slides for unchanged reference code.

Guidelines:

- **Name files and functions.** Every narrated segment should reference at least one specific file or function.
- **Show the diff.** The visual for each segment should be the actual diff being discussed. Use `git diff main..HEAD -- path/to/file` to get the diff, then extract the relevant hunks.
- **Order by understanding, not by file.** Present changes in the order that builds comprehension. If a new type is defined in one file and consumed in another, show the definition first.
- **Explain the "why", not just the "what".** The diff shows _what_ changed — the narration adds the reasoning, the edge cases it handles, the alternatives that were considered.
- **Skip boilerplate, but mention it.** Don't dedicate a segment to every import change or type export, but do mention in passing: "There are also some type exports added in `index.ts` — those are just re-exports of the new types we'll see next."
- **Group related small changes.** If three files all got the same one-line fix, one segment can cover all three. Mention each file by name.

### Summary (1 segment)

Briefly recap what the PR accomplished. This is a short wrap-up — a sentence or two summarizing the overall change, mentioning any known limitations or follow-up work if relevant.

Manifest slide type: `text`.

### Outro (1 segment, silent)

The tldraw logo, 3 seconds of silence. Always include this as the final slide.

Manifest slide type: `outro` with `durationInSeconds: 3`.

## Narration writing tips

- **Be specific about code.** Say "In `BindingUtil.ts`, the `onAfterChange` handler now checks for group ancestors" — not "The binding system was updated." Name files and functions so the viewer can connect the narration to what's on screen.
- **Each segment = one change or closely related group of changes.** If you can't point to a specific diff for the segment, it's probably too abstract.
- **Write as the author.** The tone should be explanatory and natural — like walking someone through your work. "So the main thing here is..." or "The tricky part was..." are fine.
- **Avoid redundancy** between intro and first content segment.
- **Mention files that aren't shown.** If a PR touches 15 files but only 6 are interesting, briefly acknowledge the others: "The remaining changes are type exports and test fixtures."
- Aim for **5-7 minutes** total narration.

## Checklist

- [ ] Read all PR commits and understand the full diff
- [ ] Write narration in SCRIPT.md (8-12 segments)
- [ ] Generate per-segment audio (Iapetus voice)
- [ ] Read durations.json to get per-segment durations
- [ ] Write manifest.json with slide types, diffs/code, and audio references
- [ ] Render video with render.sh
- [ ] Verify final output: 1600x900, audio synced, outro present


---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### scripts/generate-audio.sh

```bash
#!/bin/bash
# generate-audio.sh — Generate walkthrough narration audio from a JSON script.
#
# Generates all narration as a single TTS call for consistent voice, then
# splits into per-slide clips using Gemini audio understanding for alignment.
#
# Usage:
#   ./generate-audio.sh <script.json> [output-dir]
#
# Input JSON format:
#   {
#     "style": "Read in a calm, steady, professional tone...",
#     "voice": "Iapetus",           (optional, default: Iapetus)
#     "slides": [
#       "Intro narration text...",
#       "Problem slide narration...",
#       "Approach narration...",
#       ...
#     ]
#   }
#
# Output:
#   <output-dir>/audio-00.wav, audio-01.wav, ...
#   <output-dir>/full-narration.wav (kept for debugging)
#
# Dependencies:
#   ffmpeg / ffprobe
#
# Environment:
#   GEMINI_API_KEY — required. Auto-sourced from .env if not set.
#
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# --- Args ---
SCRIPT_JSON="${1:?Usage: generate-audio.sh <script.json> [output-dir]}"
OUTPUT_DIR="${2:-.}"

# Resolve relative paths
[[ "$SCRIPT_JSON" != /* ]] && SCRIPT_JSON="$(pwd)/$SCRIPT_JSON"
[[ "$OUTPUT_DIR" != /* ]] && OUTPUT_DIR="$(pwd)/$OUTPUT_DIR"

if [ ! -f "$SCRIPT_JSON" ]; then
  echo "Error: ${SCRIPT_JSON} not found"
  exit 1
fi

mkdir -p "$OUTPUT_DIR"

PYTHON="python3"

# --- API key ---
REPO_ROOT=$(git rev-parse --show-toplevel 2>/dev/null || echo ".")

if [ -z "${GEMINI_API_KEY:-}" ]; then
  if [ -f "${REPO_ROOT}/.env" ]; then
    export $(grep '^GEMINI_API_KEY=' "${REPO_ROOT}/.env" | xargs) 2>/dev/null || true
  fi
fi
GEMINI_API_KEY="${GEMINI_API_KEY:?Set GEMINI_API_KEY environment variable or add it to .env}"

# --- Config ---
TTS_MODEL="gemini-2.5-pro-preview-tts"
TTS_ENDPOINT="https://generativelanguage.googleapis.com/v1beta/models/${TTS_MODEL}:generateContent"
ALIGN_MODEL="gemini-2.5-flash"
ALIGN_ENDPOINT="https://generativelanguage.googleapis.com/v1beta/models/${ALIGN_MODEL}:generateContent"
UPLOAD_ENDPOINT="https://generativelanguage.googleapis.com/upload/v1beta/files"
FILES_ENDPOINT="https://generativelanguage.googleapis.com/v1beta/files"
SPEED=1.2  # Speed up narration (1.0 = no change)

# --- Run everything in Python for reliability ---
"$PYTHON" - "$SCRIPT_JSON" "$OUTPUT_DIR" "$GEMINI_API_KEY" "$TTS_MODEL" "$TTS_ENDPOINT" "$SPEED" "$ALIGN_ENDPOINT" "$UPLOAD_ENDPOINT" "$FILES_ENDPOINT" <<'PYTHON_SCRIPT'
import json, sys, os, subprocess, base64, time, urllib.request, urllib.error, re, atexit

script_json = sys.argv[1]
output_dir = sys.argv[2]
api_key = sys.argv[3]
tts_model = sys.argv[4]
tts_endpoint = sys.argv[5]
speed = float(sys.argv[6])
align_endpoint = sys.argv[7]
upload_endpoint = sys.argv[8]
files_endpoint = sys.argv[9]

def api_call(endpoint, body_dict, method="POST"):
    body = json.dumps(body_dict).encode()
    req = urllib.request.Request(
        f"{endpoint}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# --- Load narration ---
with open(script_json) as f:
    data = json.load(f)

voice = data.get("voice", "Iapetus")
slides = data["slides"]
style = data.get("style",
    "Read the following in a calm, steady, professional tone. "
    "Speak at a measured pace. Between each numbered section, pause briefly.")

word_count = sum(len(s.split()) for s in slides)
print(f"=== Generating narration audio ===")
print(f"  Voice: {voice}")
print(f"  Slides: {len(slides)}")
print(f"  Words: {word_count}")

# --- Chunking ---
# The Gemini TTS model has a max output audio duration. At ~150 wpm, 800 words
# produces ~5.3 min of speech which is safely within the limit. For longer
# narrations, we split into chunks, generate audio per chunk, then concatenate.
MAX_WORDS_PER_CHUNK = 600

def build_prompt(segment_texts, start_index):
    """Build a TTS prompt for a subset of segments."""
    parts = [style, ""]
    for j, text in enumerate(segment_texts):
        parts.append(f"[{start_index + j + 1}]")
        parts.append(text)
        parts.append("")
    return "\n".join(parts)

def call_tts(prompt_text, label=""):
    """Make a single TTS API call and return raw PCM bytes."""
    print(f"  [tts] Calling {tts_model}{label}...")
    try:
        response = api_call(tts_endpoint, {
            "contents": [{"parts": [{"text": prompt_text}]}],
            "generationConfig": {
                "responseModalities": ["AUDIO"],
                "speechConfig": {
                    "voiceConfig": {
                        "prebuiltVoiceConfig": {
                            "voiceName": voice
                        }
                    }
                }
            }
        })
    except urllib.error.HTTPError as e:
        error_body = e.read().decode()
        print(f"  [error] TTS failed ({e.code}): {error_body[:500]}")
        sys.exit(1)

    error_msg = response.get("error", {}).get("message", "")
    if error_msg:
        print(f"  [error] TTS failed: {error_msg}")
        sys.exit(1)

    return base64.b64decode(response["candidates"][0]["content"]["parts"][0]["inlineData"]["data"])

def pcm_to_wav(pcm_bytes, out_wav):
    """Convert raw 24kHz PCM to 48kHz WAV with speed adjustment."""
    pcm_tmp = out_wav + ".pcm"
    with open(pcm_tmp, "wb") as f:
        f.write(pcm_bytes)
    subprocess.run([
        "ffmpeg", "-y", "-f", "s16le", "-ar", "24000", "-ac", "1",
        "-i", pcm_tmp, "-af", f"atempo={speed}", "-ar", "48000", out_wav
    ], capture_output=True, check=True)
    os.remove(pcm_tmp)

# --- Split slides into chunks by word count ---
chunks = []  # list of (start_index, [segment_texts])
current_chunk = []
current_words = 0
current_start = 0

for i, text in enumerate(slides):
    wc = len(text.split())
    if current_chunk and current_words + wc > MAX_WORDS_PER_CHUNK:
        chunks.append((current_start, current_chunk))
        current_start = i
        current_chunk = [text]
        current_words = wc
    else:
        current_chunk.append(text)
        current_words += wc
if current_chunk:
    chunks.append((current_start, current_chunk))

wav_path = os.path.join(output_dir, "full-narration.wav")

if len(chunks) == 1:
    # Single chunk — same as before
    prompt = build_prompt(slides, 0)
    pcm_data = call_tts(prompt)
    pcm_to_wav(pcm_data, wav_path)
else:
    # Multiple chunks — generate each, then concatenate
    print(f"  [tts] Narration is {word_count} words — splitting into {len(chunks)} chunks")
    chunk_wavs = []
    for ci, (start_idx, chunk_slides) in enumerate(chunks):
        chunk_wc = sum(len(s.split()) for s in chunk_slides)
        label = f" (chunk {ci + 1}/{len(chunks)}, segments {start_idx}-{start_idx + len(chunk_slides) - 1}, {chunk_wc} words)"
        prompt = build_prompt(chunk_slides, start_idx)
        pcm_data = call_tts(prompt, label)
        chunk_wav = os.path.join(output_dir, f"chunk-{ci:02d}.wav")
        pcm_to_wav(pcm_data, chunk_wav)
        chunk_wavs.append(chunk_wav)

    # Concatenate chunk WAVs
    concat_path = os.path.join(output_dir, "chunk-concat.txt")
    with open(concat_path, "w") as f:
        for cw in chunk_wavs:
            f.write(f"file '{cw}'\n")
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", concat_path, "-c", "copy", wav_path
    ], capture_output=True, check=True)

    # Clean up chunk files
    os.remove(concat_path)
    for cw in chunk_wavs:
        os.remove(cw)

# Get duration
dur_result = subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "csv=p=0", wav_path],
    capture_output=True, text=True
)
total_dur = float(dur_result.stdout.strip())
print(f"  [done] full-narration.wav ({total_dur:.1f}s, {speed}x speed)")

# --- Duration sanity check ---
expected_dur = word_count / 2.5  # ~150 wpm
if total_dur > expected_dur * 3:
    print(f"  [warn] Audio is {total_dur:.0f}s but expected ~{expected_dur:.0f}s for {word_count} words")
    print(f"  [warn] TTS may have added excessive pauses or extra content")

# --- Alignment via Gemini ---
print()
print("=== Aligning segments via Gemini ===")

# Upload the WAV file to Gemini Files API
print(f"  [upload] Uploading full-narration.wav...")
wav_size = os.path.getsize(wav_path)
with open(wav_path, "rb") as f:
    wav_bytes = f.read()

upload_req = urllib.request.Request(
    f"{upload_endpoint}?key={api_key}",
    data=wav_bytes,
    headers={
        "Content-Type": "audio/wav",
        "Content-Length": str(wav_size),
        "X-Goog-Upload-Protocol": "raw",
        "X-Goog-Upload-Command": "upload, finalize",
        "X-Goog-Upload-Header-Content-Type": "audio/wav",
    },
    method="POST",
)
with urllib.request.urlopen(upload_req) as resp:
    upload_response = json.loads(resp.read())

file_uri = upload_response["file"]["uri"]
file_name = upload_response["file"]["name"]  # e.g. "files/abc123"
# Extract just the ID for URL construction
file_id = file_name.split("/")[-1] if "/" in file_name else file_name
print(f"  [upload] Done: {file_name}")

# Register cleanup so the remote file is deleted even on early exit
def _cleanup_uploaded_file():
    try:
        req = urllib.request.Request(
            f"{files_endpoint}/{file_id}?key={api_key}",
            method="DELETE",
        )
        urllib.request.urlopen(req)
    except Exception:
        pass
atexit.register(_cleanup_uploaded_file)

# Wait for file to be processed
for attempt in range(30):
    status_req = urllib.request.Request(
        f"{files_endpoint}/{file_id}?key={api_key}",
        method="GET",
    )
    with urllib.request.urlopen(status_req) as resp:
        file_status = json.loads(resp.read())
    state = file_status.get("state", "")
    if state == "ACTIVE":
        break
    print(f"  [upload] Waiting for processing... ({state})")
    time.sleep(2)
else:
    print(f"  [error] File not ready after 60s")
    sys.exit(1)

# Ask Gemini to find segment start times
segments_list = "\n".join(f"  Segment {i}: {text[:80]}{'...' if len(text) > 80 else ''}"
                          for i, text in enumerate(slides))

align_prompt = f"""Listen to this audio narration and identify where each of the following segments begins, and where the last segment ends.

The audio contains {len(slides)} segments read sequentially. For each segment, tell me the start time in seconds. Also include one final entry for the end time of the last segment (where speech stops, ignoring any trailing silence).

{segments_list}

Return ONLY a JSON array of numbers representing timestamps in seconds. The array must have exactly {len(slides) + 1} entries: one start time per segment, plus the end time of the final segment. For example: [0.0, 12.5, 28.3, ..., 45.0]

The first entry should be 0.0 or close to it. The last entry should be where the narration ends, not the total audio length."""

print(f"  [align] Asking Gemini to find segment boundaries...")
try:
    align_response = api_call(align_endpoint, {
        "contents": [{
            "parts": [
                {"fileData": {"mimeType": "audio/wav", "fileUri": file_uri}},
                {"text": align_prompt},
            ]
        }],
        "generationConfig": {
            "temperature": 0,
            "responseMimeType": "application/json",
        }
    })
except urllib.error.HTTPError as e:
    error_body = e.read().decode()
    print(f"  [error] Alignment failed ({e.code}): {error_body[:500]}")
    sys.exit(1)

# Parse response
align_text = align_response["candidates"][0]["content"]["parts"][0]["text"]
print(f"  [align] Raw response: {align_text.strip()}")

parsed = json.loads(align_text)

# Accept a bare list or an object with a list value (e.g. {"timestamps": [...]})
if isinstance(parsed, list):
    timestamps = parsed
elif isinstance(parsed, dict):
    # Use the first list-valued field
    timestamps = next((v for v in parsed.values() if isinstance(v, list)), None)
    if timestamps is None:
        print(f"  [error] Alignment returned object with no list field: {align_text[:200]}")
        sys.exit(1)
else:
    print(f"  [error] Unexpected alignment response type: {type(parsed).__name__}")
    sys.exit(1)

# We expect N+1 entries (N start times + 1 end time for the last segment).
# If we got exactly N, the model omitted the end time — use total_dur.
# If we got N+1, the last entry is the end time.
expected = len(slides) + 1
if len(timestamps) == len(slides):
    print(f"  [info] Got {len(timestamps)} timestamps (no end time), using audio duration")
    timestamps.append(total_dur)
elif len(timestamps) != expected:
    print(f"  [warn] Expected {expected} timestamps, got {len(timestamps)}")
    while len(timestamps) < expected:
        timestamps.append(timestamps[-1] if timestamps else 0.0)
    timestamps = timestamps[:expected]

# Ensure timestamps are sorted and within bounds
timestamps = [max(0.0, min(float(t), total_dur)) for t in timestamps]
for i in range(1, len(timestamps)):
    if timestamps[i] < timestamps[i - 1]:
        timestamps[i] = timestamps[i - 1]

cut_points = timestamps[:len(slides)]  # start times
# Use full audio duration for the last segment's end — the alignment's end-time
# estimate often clips the final word.
end_time = total_dur

print(f"  [align] Segment starts: {[f'{t:.1f}s' for t in cut_points]}")
print(f"  [align] Narration ends: {end_time:.1f}s (audio total: {total_dur:.1f}s)")

# --- Split and collect durations ---
print()
print("=== Splitting into per-slide audio ===")
durations = {}

SPLIT_PAD = 0.4  # seconds to start before detected boundary to avoid clipping speech onset

for i in range(len(slides)):
    num = f"{i:02d}"
    raw_start = cut_points[i]
    # Pad earlier to avoid clipping the first syllable (alignment timestamps
    # can be slightly late). Don't pad before the previous segment's *padded* start,
    # and shorten the previous segment's end to match so clips never overlap.
    prev_start = cut_points[i - 1] if i > 0 else 0.0
    start = max(prev_start, raw_start - SPLIT_PAD) if i > 0 else raw_start
    raw_end = cut_points[i + 1] if i + 1 < len(cut_points) else end_time
    # Pull end back by SPLIT_PAD so the next segment's padded start doesn't
    # overlap with this segment's end.
    if i + 1 < len(cut_points):
        end = max(start, raw_end - SPLIT_PAD)
    else:
        end = raw_end
    dur = end - start

    out_path = os.path.join(output_dir, f"audio-{num}.wav")
    print(f"  audio-{num}.wav: {start:.1f}s -> {end:.1f}s ({dur:.1f}s)")

    subprocess.run([
        "ffmpeg", "-y", "-i", wav_path,
        "-ss", str(start), "-to", str(end), "-c", "copy", out_path
    ], capture_output=True)

    durations[f"audio-{num}.wav"] = round(dur, 2)

# --- Check for zero-length clips (TTS truncation) ---
empty_clips = [k for k, v in durations.items() if v <= 0.01]
if empty_clips:
    print(f"\n  [warn] TTS truncated audio — {len(empty_clips)} segment(s) have no audio: {empty_clips}")
    print(f"  [warn] Reduce narration length or number of segments and retry.")
    sys.exit(1)

# --- Trim silence from each clip ---
MAX_SILENCE = 0.15  # strip nearly all silence — segment title slides handle inter-slide pauses
SILENCE_THRESHOLD = "-40dB"
print()
print("=== Trimming silence ===")

for i in range(len(slides)):
    num = f"{i:02d}"
    clip_path = os.path.join(output_dir, f"audio-{num}.wav")

    # Detect silence at start and end using silencedetect
    detect = subprocess.run([
        "ffmpeg", "-i", clip_path, "-af",
        f"silencedetect=noise={SILENCE_THRESHOLD}:d=0.1",
        "-f", "null", "-"
    ], capture_output=True, text=True)
    stderr = detect.stderr

    # Get clip duration
    dur_res = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "csv=p=0", clip_path],
        capture_output=True, text=True
    )
    clip_dur = float(dur_res.stdout.strip())

    # Parse silence periods
    silence_starts = re.findall(r'silence_start: ([\d.]+)', stderr)
    silence_ends = re.findall(r'silence_end: ([\d.]+)', stderr)

    # Find leading silence: silence that starts at ~0
    trim_start = 0.0
    if silence_starts and float(silence_starts[0]) < 0.05:
        if silence_ends:
            leading_silence = float(silence_ends[0])
            if leading_silence > MAX_SILENCE:
                trim_start = leading_silence - MAX_SILENCE

    # Find trailing silence: silence that starts near the end and extends to clip end.
    # Skip trailing trim on last segment — it's followed by the outro and
    # the silence detector often clips the final word.
    trim_end = clip_dur
    is_last = (i == len(slides) - 1)
    if not is_last and silence_starts:
        last_silence_start = float(silence_starts[-1])
        # Verify this silence actually extends to the clip end (not an internal pause).
        # If there's a corresponding silence_end after last_silence_start that is before
        # clip_dur, this is an internal pause — skip it.
        last_silence_is_trailing = True
        for se in silence_ends:
            se_val = float(se)
            if se_val > last_silence_start and se_val < clip_dur - 0.05:
                last_silence_is_trailing = False
                break
        if last_silence_is_trailing and last_silence_start > 0.05:  # not the leading silence
            trailing_silence = clip_dur - last_silence_start
            if trailing_silence > MAX_SILENCE:
                trim_end = last_silence_start + MAX_SILENCE

    if trim_start > 0 or trim_end < clip_dur:
        trimmed_path = clip_path + ".tmp.wav"
        subprocess.run([
            "ffmpeg", "-y", "-i", clip_path,
            "-ss", str(trim_start), "-to", str(trim_end),
            "-c", "copy", trimmed_path
        ], capture_output=True)
        os.replace(trimmed_path, clip_path)
        new_dur = trim_end - trim_start
        durations[f"audio-{num}.wav"] = round(new_dur, 2)
        print(f"  audio-{num}.wav: {clip_dur:.1f}s -> {new_dur:.1f}s (trimmed {clip_dur - new_dur:.1f}s)")
    else:
        print(f"  audio-{num}.wav: {clip_dur:.1f}s (no trim needed)")

# --- Write durations.json ---
durations_path = os.path.join(output_dir, "durations.json")
with open(durations_path, "w") as f:
    json.dump(durations, f, indent=2)
print(f"\n  Wrote durations.json ({len(durations)} entries)")

print()
print("=== Done ===")
PYTHON_SCRIPT

echo ""
echo "Output:"
ls -la "${OUTPUT_DIR}"/audio-*.wav 2>/dev/null || echo "  (no files generated)"

```

### scripts/make-video.sh

```bash
#!/bin/bash
# make-video.sh — Assemble walkthrough slides + audio into a final MP4.
#
# Usage:
#   ./make-video.sh <slide-dir> <output.mp4> [outro-duration]
#
# Expects in <slide-dir>:
#   slide-00.png, slide-01.png, ...  (one per segment, including outro)
#   audio-00.wav, audio-01.wav, ...  (one per narrated segment)
#
# The last slide PNG without a matching audio WAV is the silent outro.
#
set -euo pipefail

SLIDE_DIR="${1:?Usage: make-video.sh <slide-dir> <output.mp4> [outro-duration]}"
OUTPUT="${2:?Usage: make-video.sh <slide-dir> <output.mp4> [outro-duration]}"
OUTRO_DUR="${3:-3}"

# Resolve relative paths
[[ "$SLIDE_DIR" != /* ]] && SLIDE_DIR="$(pwd)/$SLIDE_DIR"
[[ "$OUTPUT" != /* ]] && OUTPUT="$(pwd)/$OUTPUT"

mkdir -p "$(dirname "$OUTPUT")"

TMPDIR_WORK=$(mktemp -d)
trap "rm -rf $TMPDIR_WORK" EXIT

echo "=== Assembling video ==="
echo "  Slides: $SLIDE_DIR"
echo "  Output: $OUTPUT"

# Count slides and audio
SLIDE_COUNT=$(ls "$SLIDE_DIR"/slide-*.png 2>/dev/null | wc -l | tr -d ' ')
AUDIO_COUNT=$(ls "$SLIDE_DIR"/audio-*.wav 2>/dev/null | wc -l | tr -d ' ')

echo "  Found $SLIDE_COUNT slides, $AUDIO_COUNT audio clips"
echo "  Last slide (no audio) = outro (${OUTRO_DUR}s)"

# Create per-segment videos
CONCAT_LIST="$TMPDIR_WORK/concat.txt"
> "$CONCAT_LIST"

for i in $(seq 0 $((SLIDE_COUNT - 1))); do
  NUM=$(printf "%02d" $i)
  SLIDE="$SLIDE_DIR/slide-${NUM}.png"
  AUDIO="$SLIDE_DIR/audio-${NUM}.wav"
  SEGMENT="$TMPDIR_WORK/segment-${NUM}.mp4"

  if [ -f "$AUDIO" ]; then
    # Get audio duration
    DUR=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$AUDIO")
    echo "  segment-${NUM}: slide + audio (${DUR}s)"

    # Create video segment: static image for duration of audio, with audio
    ffmpeg -y -loop 1 -i "$SLIDE" -i "$AUDIO" \
      -c:v libx264 -tune stillimage -pix_fmt yuv420p \
      -vf "scale=1600:900:force_original_aspect_ratio=decrease,pad=1600:900:(ow-iw)/2:(oh-ih)/2" \
      -c:a aac -b:a 192k -ar 48000 \
      -shortest -movflags +faststart \
      "$SEGMENT" 2>/dev/null
  else
    # Silent outro slide
    echo "  segment-${NUM}: silent outro (${OUTRO_DUR}s)"
    ffmpeg -y -loop 1 -i "$SLIDE" -f lavfi -i anullsrc=r=48000:cl=mono \
      -c:v libx264 -tune stillimage -pix_fmt yuv420p \
      -vf "scale=1600:900:force_original_aspect_ratio=decrease,pad=1600:900:(ow-iw)/2:(oh-ih)/2" \
      -c:a aac -b:a 192k -ar 48000 \
      -t "$OUTRO_DUR" -movflags +faststart \
      "$SEGMENT" 2>/dev/null
  fi

  echo "file '$SEGMENT'" >> "$CONCAT_LIST"
done

# Concatenate all segments
echo ""
echo "  Concatenating ${SLIDE_COUNT} segments..."
ffmpeg -y -f concat -safe 0 -i "$CONCAT_LIST" \
  -c copy -movflags +faststart \
  "$OUTPUT" 2>/dev/null

# Report
FINAL_DUR=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$OUTPUT")
FINAL_SIZE=$(ls -lh "$OUTPUT" | awk '{print $5}')
echo ""
echo "=== Done ==="
echo "  Output: $OUTPUT"
echo "  Duration: ${FINAL_DUR}s"
echo "  Size: $FINAL_SIZE"

```
