content-ingestion
Ingest web content into Kurt. Map sitemaps to discover URLs, then fetch content selectively.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install boringdata-kurt-demo-ingest-content-skill
Repository
Skill path: _archived/ingest-content-skill
Best for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, Tech Writer.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: boringdata.
This is a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install content-ingestion into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/boringdata/kurt-demo before adding content-ingestion to shared team environments
- Use content-ingestion for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: content-ingestion
description: Ingest web content into Kurt. Map sitemaps to discover URLs, then fetch content selectively.
---
# Content Ingestion
## Overview
This skill enables efficient web content ingestion with a map-then-fetch workflow. Discover URLs from sitemaps first, review what was found, then selectively download only the content you need. Supports single document fetching and fast parallel batch downloads.
Content is extracted as markdown with metadata (title, author, date, categories) and stored in the `sources/` directory.
## Quick Start
```bash
# 1. Discover URLs from sitemap (fast, no downloads)
kurt content map https://www.anthropic.com

# 1a. OR discover with publish dates (slower, extracts dates from blogrolls)
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates

# 2. Review what was found
kurt content list --status NOT_FETCHED

# 3. Fetch content (parallel batch)
kurt content fetch --url-prefix https://www.anthropic.com/
```
## Map-Then-Fetch Workflow
**Why two steps?**
- Sitemaps often contain hundreds or thousands of URLs
- Map step is fast (no downloads) - lets you review before committing
- Fetch step is slow (downloads + extraction) - run selectively
- Saves time, bandwidth, and storage
**Three-step process:**
1. **Map**: Discover URLs and create `NOT_FETCHED` records
2. **Review**: Examine discovered URLs using document management commands
3. **Fetch**: Download content selectively (single or batch)
### Integration with Iterative Source Gathering
**When invoked from `/create-project` or `/resume-project`**, use map-then-fetch to provide preview:
1. **Map first** - Show user what URLs were discovered
```bash
kurt content map https://docs.example.com
echo "Discovered X URLs. Preview:"
kurt content list --url-prefix https://docs.example.com --status NOT_FETCHED | head -10
```
2. **Get approval** - Ask if user wants to fetch all or selective
```
Found 150 URLs from docs.example.com
Preview (first 10):
1. https://docs.example.com/quickstart
2. https://docs.example.com/api/authentication
...
Fetch all 150 pages? Or selective? (all/selective/cancel)
```
3. **Fetch approved content**
```bash
# If all:
kurt content fetch --url-prefix https://docs.example.com

# If selective (user specifies path pattern):
kurt content fetch --url-prefix https://docs.example.com/api/
```
This provides **Checkpoint 1** (preview) for the iterative source gathering pattern.
## Core Operations
### Map Sitemap URLs
Discover URLs from sitemaps without downloading content.
```bash
# Discover all URLs from sitemap
kurt content map https://www.anthropic.com

# Discover with publish dates from blogrolls (recommended for blogs/docs)
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates

# Limit discovery (useful for testing)
kurt content map https://example.com --limit 10

# Map and fetch immediately
kurt content map https://example.com --fetch

# Discover dates with custom blogroll limit
kurt content map https://example.com --discover-dates --max-blogrolls 5

# JSON output for scripts
kurt content map https://example.com --output json
```
**What happens:**
- Automatically finds sitemap URLs (checks `/sitemap.xml`, `robots.txt`, etc.)
- Creates database records with `NOT_FETCHED` status
- Skips duplicate URLs gracefully
- Returns list of discovered documents
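That discovery step can be approximated as follows. This is a sketch, not Kurt's implementation; the candidate paths mirror the troubleshooting tiers later in this document.

```python
from urllib.parse import urljoin

# Standard sitemap locations tried by most crawlers.
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap/sitemap.xml",
    "/sitemaps/sitemap.xml",
]

def sitemap_candidates(base_url: str, robots_txt: str = "") -> list[str]:
    """Sitemap URLs to try: robots.txt declarations first, then common paths."""
    candidates = []
    # robots.txt declares sitemaps via "Sitemap: <url>" lines.
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    candidates += [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]
    return list(dict.fromkeys(candidates))  # dedupe, keep order
```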
**Example output:**
```
✓ Found 317 URLs from sitemap
  Created: 317 new documents

✓ https://www.anthropic.com
  ID: 6203468a | Status: NOT_FETCHED
✓ https://www.anthropic.com/news/claude-3-7-sonnet
  ID: bc2bcf48 | Status: NOT_FETCHED
```
### Discovering Publish Dates from Blogrolls
**When to use `--discover-dates`:**
- Content freshness is critical (tutorials, documentation, blog posts)
- You need to identify outdated content for updates
- Building a chronological content inventory
- Working with blogs, changelogs, or release notes
**How it works:**
1. Maps sitemap normally (all URLs)
2. Uses LLM to identify blog indexes, changelogs, release notes pages
3. Scrapes those pages to extract individual post URLs + dates
4. Creates/updates document records with `published_date` populated
5. Marks discovered posts with `is_chronological=True` and `discovery_method="blogroll"`
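As a rough illustration of steps 3-4, dated post links can sometimes be pulled from a listing page with a regex. Real blogrolls vary widely, which is exactly why Kurt delegates this identification to an LLM; `POST_WITH_DATE` below is a hypothetical pattern, not Kurt's code.

```python
import re

# Hypothetical pattern: a link whose ISO-format publish date appears right
# after it, as on many blog index pages.
POST_WITH_DATE = re.compile(
    r'href="(?P<url>[^"]+)"[^<]*</a>\s*(?:<[^>]+>\s*)*(?P<date>\d{4}-\d{2}-\d{2})'
)

def extract_dated_posts(html: str) -> list[tuple[str, str]]:
    """Return (url, published_date) pairs found in a blogroll page."""
    return [(m["url"], m["date"]) for m in POST_WITH_DATE.finditer(html)]
```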
**Example:**
```bash
# Discover docs.getdbt.com with date extraction
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates
# Output shows date discovery:
# ✓ Found 500 URLs from sitemap
# --- Discovering blogroll/changelog pages ---
# Found 8 potential blogroll/changelog pages
#
# Scraping https://docs.getdbt.com/blog...
# Found 45 posts with dates
# Scraping https://docs.getdbt.com/docs/dbt-versions/...
# Found 23 posts with dates
#
# ✓ Total documents discovered from blogrolls: 68
# New: 12 (not in sitemap)
# Existing: 56 (enriched with dates)
```
**Performance:**
- Regular mapping: ~2-5 seconds (just sitemap parsing)
- With `--discover-dates`: ~5-15 minutes (includes LLM analysis + page scraping)
- Controlled with `--max-blogrolls` (default: 10 pages)
**Benefits:**
- Dates captured upfront (before fetch)
- Discovers additional posts not in sitemap
- Enriches existing records with publish dates
- Enables date-based filtering and relevance tracking
### Troubleshooting: Sitemap Discovery Failures
If automatic discovery fails with "No sitemap found," use this fallback workflow:
#### Tier 1: Try Common Sitemap Paths Directly
Most sites use standard paths. Try these **directly** (not base URL):
```bash
# Standard path (most common)
kurt content map https://docs.getdbt.com/sitemap.xml

# Alternative paths
kurt content map https://docs.getdbt.com/sitemap_index.xml
kurt content map https://docs.getdbt.com/sitemap/sitemap.xml
kurt content map https://docs.getdbt.com/sitemaps/sitemap.xml
```
**Success if:** Kurt finds and processes the sitemap
**Next if fails:** Try Tier 2
#### Tier 2: Check robots.txt
Use WebFetch to find sitemap URL in robots.txt:
```
WebFetch URL: https://docs.getdbt.com/robots.txt
Prompt: "Extract all Sitemap URLs from this robots.txt file"
```
Then try the discovered sitemap URLs:
```bash
kurt content map <sitemap-url-from-robots>
```
**Success if:** Sitemap URL found in robots.txt works
**Next if fails:** Try Tier 3
#### Tier 3: Search for Sitemap
Use WebSearch to discover sitemap location:
```
WebSearch: "site:docs.getdbt.com sitemap"
WebSearch: "docs.getdbt.com sitemap.xml"
```
Or check common documentation pages for sitemap links.
Then try the discovered URLs with `kurt content map`.
**Success if:** Sitemap found via search
**Next if fails:** Try Tier 4
#### Tier 4: Manual URL Collection
If no sitemap exists or is inaccessible, manually collect URLs:
**Option A: Use WebSearch to find pages**
```
WebSearch: "site:docs.getdbt.com tutorial"
WebSearch: "site:docs.getdbt.com guide"
```
**Option B: Crawl from homepage**
- Use WebFetch on homepage
- Extract navigation links
- Add each URL manually
**Option C: User provides URL list**
- Ask user for key URLs to ingest
- Import from CSV or list
Then add URLs manually:
```bash
kurt content add https://docs.getdbt.com/page1
kurt content add https://docs.getdbt.com/page2
kurt content add https://docs.getdbt.com/page3
```
#### Real Example: docs.getdbt.com
```bash
# ❌ Automatic discovery fails
kurt content map https://docs.getdbt.com
# Error: No sitemap found

# ✅ Direct sitemap URL works!
kurt content map https://docs.getdbt.com/sitemap.xml --limit 100
# Success: Found 100 URLs from sitemap
```
#### Quick Diagnostic Commands
**Test if sitemap exists:**
```
# Use WebFetch to check
WebFetch URL: https://example.com/sitemap.xml
Prompt: "Does this sitemap exist? If yes, describe its structure."
```
**Why automatic discovery fails:**
- Anti-bot protection (site blocks trafilatura but not WebFetch)
- Sitemap not in robots.txt
- Non-standard sitemap location
- Dynamic/JavaScript-rendered sitemaps
**When to use each tier:**
- **Tier 1**: Always try first (5 seconds to test)
- **Tier 2**: Standard sites with robots.txt (1 minute)
- **Tier 3**: Unusual configurations (2-3 minutes)
- **Tier 4**: No sitemap or heavily protected sites (ongoing)
### Fetch Single Document
Download content for a specific document.
```bash
# Fetch by document ID
kurt content fetch 6203468a-e3dc-48f2-8e1f-6e1da34dab05

# Fetch by URL (creates document if needed)
kurt content fetch https://www.anthropic.com/company
```
**What happens:**
- Downloads HTML content
- Extracts markdown with trafilatura
- Saves to `sources/{domain}/{path}.md`
- Updates database: `FETCHED` status, content metadata
- Returns document details
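The URL-to-file mapping can be sketched like this, mirroring the logic in the `advanced_fetch_custom_extraction.py` script included below (a sketch, not Kurt's exact code):

```python
from pathlib import Path
from urllib.parse import urlparse

def url_to_source_path(url: str, root: str = "sources") -> Path:
    """Map a fetched URL to its markdown file path under sources/."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index"   # bare homepage -> index.md
    if not path.endswith(".md"):
        path += ".md"
    return Path(root) / parsed.netloc / path
```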
### Batch Fetch Documents
Download multiple documents in parallel (5-10x faster than sequential).
```bash
# Fetch all from domain
kurt content fetch --url-prefix https://www.anthropic.com/

# Fetch all blog posts
kurt content fetch --url-contains /blog/

# Fetch everything NOT_FETCHED
kurt content fetch --all

# Increase parallelism (default: 5)
kurt content fetch --url-prefix https://example.com/ --max-concurrent 10

# Retry failed documents
kurt content fetch --status ERROR --url-prefix https://example.com/
```
**What happens:**
- Fetches documents concurrently (default: 5 parallel)
- Uses async httpx for fast downloads
- Extracts metadata: title, author, date, categories, language
- Stores content fingerprint for deduplication
- Updates all document records in batch
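The concurrency pattern can be sketched with a semaphore-bounded `asyncio.gather`. This is a self-contained sketch: `download` is a placeholder for the real httpx-plus-trafilatura pipeline, not Kurt's actual function.

```python
import asyncio

async def fetch_batch(urls, download, max_concurrent=5):
    """Fetch URLs concurrently, bounded by a semaphore (like --max-concurrent)."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:
            try:
                return url, await download(url)
            except Exception as exc:
                return url, exc  # caller would mark these records ERROR

    # gather preserves input order, so results line up with urls
    return await asyncio.gather(*(one(u) for u in urls))
```

Bounding concurrency this way is also the lever for the rate-limiting advice in the troubleshooting table: lower `max_concurrent` to be gentler on the target site.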
**Performance:**
- Sequential: ~2-3 seconds per document
- Parallel (5 concurrent): ~0.4-0.6 seconds per document
- Example: 82 documents in ~10 seconds vs ~3 minutes
**File structure after fetch:**
```
sources/
└── www.anthropic.com/
    ├── news/
    │   └── claude-3-7-sonnet.md
    └── company.md
```
## Alternative: Manual URL Addition
When sitemap discovery fails or you want to add specific URLs.
### Add Single URLs
```bash
# Add URL without fetching
kurt content add https://example.com/page1
kurt content add https://example.com/page2

# Then fetch when ready
kurt content fetch https://example.com/page1
```
### Direct Fetch (Add + Fetch)
Create document record and fetch content in one step.
```bash
# Creates document if doesn't exist, then fetches
kurt content fetch https://example.com/specific-page
```
## WebFetch Fallback (When Kurt Fetch Fails)
If `kurt content fetch` fails due to anti-bot protection or other issues, use WebFetch as a fallback with automatic import.
### Workflow
1. **Attempt Kurt fetch first** (creates ERROR record if it fails)
2. **Use WebFetch to retrieve content with metadata**
3. **Save with YAML frontmatter**
4. **Auto-import hook handles the rest**
### WebFetch with Metadata Extraction
When using WebFetch, extract FULL metadata to preserve in the file:
```
Use WebFetch with this prompt:
"Extract ALL metadata from this page including:
- Full page title
- Description / meta description
- Author(s)
- Published date or last modified date
- Any structured data or meta tags
- The complete page content as markdown
Format the response to clearly show:
1. All metadata fields found
2. The markdown content"
```
### Save Content with Frontmatter
Save the fetched content with YAML frontmatter:
```markdown
---
title: "Full Page Title from WebFetch"
url: https://example.com/page
description: "Meta description or summary"
author: "Author Name or Organization"
published_date: "2025-10-22"
last_modified: "2025-10-22"
fetched_via: WebFetch
fetched_at: "2025-10-23"
---
# Page Content
[Markdown content from WebFetch...]
```
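The hook's frontmatter parsing can be approximated with a minimal flat `key: value` parser (a sketch; real YAML frontmatter with nesting or lists needs a proper YAML library such as PyYAML):

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a markdown file into (frontmatter dict, body).

    Handles only flat `key: value` pairs, which covers the frontmatter
    shown above.
    """
    if not text.startswith("---\n"):
        return {}, text
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:  # no closing delimiter; treat the whole text as body
        return {}, text
    meta = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            meta[key.strip()] = value.strip().strip('"')
    return meta, body.lstrip("\n")
```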
### Auto-Import Process
Once the file is saved to `/sources/`, the PostToolUse hook automatically:
1. Detects the new .md file
2. Finds the ERROR record in Kurt DB
3. Updates status to FETCHED
4. **Parses frontmatter and populates metadata fields**
5. Links the file to the database record
6. Runs metadata extraction (kurt index)
7. Shows confirmation message
**Result:** File is fully indexed in Kurt with proper title, author, dates, and description!
### Example: Complete WebFetch Fallback
```bash
# 1. Kurt fetch fails
kurt content fetch https://docs.example.com/guide
# → Creates ERROR record
# 2. Use WebFetch (Claude does this automatically)
# Extracts metadata + content
# 3. Save with frontmatter to sources/
# File: sources/docs.example.com/guide.md
# 4. Auto-import hook triggers
# → Updates ERROR record to FETCHED
# → Populates title, author, date from frontmatter
# → Links file to database
# 5. Verify
kurt content get-metadata <doc-id>
# Shows proper title, metadata, FETCHED status
```
### Benefits of WebFetch with Frontmatter
- ✅ Preserves all page metadata (title, author, dates)
- ✅ Automatic import via hook
- ✅ No manual database updates needed
- ✅ Content is queryable and searchable
- ✅ Metadata extraction works same as native fetch
- ✅ Transparent fallback - just works!
## Advanced Usage
For custom extraction behavior beyond the CLI, use trafilatura Python library directly.
### Custom Crawling
Control crawl depth, URL patterns, language filters, and more.
```python
# See scripts/advanced_crawl_and_import.py
from trafilatura.spider import focused_crawler

to_visit, known_links = focused_crawler(
    homepage,
    max_seen_urls=100,
    max_known_urls=50,
)
```
[Trafilatura Crawls Documentation](https://trafilatura.readthedocs.io/en/latest/crawls.html)
### Custom Extraction Settings
Fine-tune extraction: precision vs recall, include comments, handle tables.
```python
# See scripts/advanced_fetch_custom_extraction.py
from trafilatura import extract

content = extract(
    html,
    include_comments=False,
    include_tables=True,
    favor_precision=True,
)
```
[Trafilatura Core Functions](https://trafilatura.readthedocs.io/en/latest/corefunctions.html)
### Custom Extraction Config
Configure timeouts, minimum text size, date extraction.
```python
# See scripts/custom_extraction_config.py
from trafilatura.settings import use_config
config = use_config()
config.set('DEFAULT', 'MIN_EXTRACTED_SIZE', '500')
```
[Trafilatura Settings](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extraction-settings)
## Quick Reference
| Task | Command | Performance |
|------|---------|-------------|
| Map sitemap | `kurt content map <url>` | Fast (no downloads) |
| Map with dates | `kurt content map <url> --discover-dates` | ~5-15 min (LLM scraping) |
| Fetch single | `kurt content fetch <id\|url>` | ~2-3s per doc |
| Batch fetch | `kurt content fetch --url-prefix <url>` | ~0.4-0.6s per doc |
| Add URL | `kurt content add <url>` | Instant |
| Review discovered | `kurt content list --status NOT_FETCHED` | Instant |
| Retry failures | `kurt content fetch --status ERROR` | Varies |
## Python API
```python
# URL Discovery
from kurt.ingest_map import (
    map_sitemap,                    # Discover URLs from sitemap
    map_blogrolls,                  # Discover from blogroll/changelog pages
    identify_blogroll_candidates,   # Find potential blogroll pages
    extract_chronological_content,  # Extract posts with dates
)

# Content Fetching
from kurt.ingest_fetch import (
    add_document,           # Add single URL
    fetch_document,         # Fetch single document
    fetch_documents_batch,  # Batch fetch (async parallel)
)

# Map sitemap
docs = map_sitemap("https://example.com", limit=100)

# Add document
doc = add_document("https://example.com/page")

# Fetch single
result = fetch_document(document_id="abc-123")

# Batch fetch
results = fetch_documents_batch(
    document_ids=["abc-123", "def-456"],
    max_concurrent=10,
)
```
See:
- [ingest_map.py](https://github.com/boringdata/kurt-core/blob/main/src/kurt/ingest_map.py) - URL discovery
- [ingest_fetch.py](https://github.com/boringdata/kurt-core/blob/main/src/kurt/ingest_fetch.py) - Content fetching
## Troubleshooting
| Issue | Solution |
|-------|----------|
| "No sitemap found" | **See detailed guide above**: Try direct sitemap URLs (`.../sitemap.xml`), check robots.txt, or use WebSearch. Full 4-tier fallback documented in "Troubleshooting: Sitemap Discovery Failures" section. |
| Slow batch fetch | Increase `--max-concurrent` (default: 5, try 10) |
| Extraction quality low | See advanced extraction scripts for custom settings |
| Duplicate content | Kurt automatically deduplicates using content hashes |
| Rate limiting | Reduce `--max-concurrent` or add delays |
| Content fetch fails | Use WebFetch fallback with frontmatter (see "WebFetch Fallback" section) |
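The "Duplicate content" row above refers to content fingerprinting. A minimal sketch (the exact hashing and normalization scheme is an assumption, not Kurt's documented behavior):

```python
import hashlib

def fingerprint(markdown: str) -> str:
    """Stable content hash for duplicate detection."""
    # Normalize trailing whitespace so cosmetic differences don't defeat dedup.
    normalized = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_duplicate(markdown: str, seen: set) -> bool:
    """Check a document against previously seen fingerprints, recording it."""
    fp = fingerprint(markdown)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```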
## Next Steps
- For document management, see **document-management-skill**
- For custom extraction, see [scripts/](scripts/) directory
- For trafilatura details, see [Trafilatura Documentation](https://trafilatura.readthedocs.io/)
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/advanced_crawl_and_import.py
```python
"""
Advanced URL Discovery using Trafilatura Focused Crawler
Use when you need fine-grained control over crawling:
- Filter by URL patterns
- Depth limits
- Language filtering
- Custom crawl parameters
Trafilatura Documentation:
- Crawls: https://trafilatura.readthedocs.io/en/latest/crawls.html
- Focused Crawler API: https://trafilatura.readthedocs.io/en/latest/crawls.html#focused-crawler
- Language-aware crawling: https://trafilatura.readthedocs.io/en/latest/crawls.html#language-aware-crawling
"""
from trafilatura.spider import focused_crawler
from kurt.source import add_document

# 1. Discover URLs with custom parameters
to_visit, known_links = focused_crawler(
    homepage='https://example.com',
    max_seen_urls=200,      # Max URLs to discover
    max_known_urls=10000,   # Max URLs to track
    # language='en',        # Optional: language filter
)

# 2. Save to temporary file
with open('/tmp/discovered_urls.txt', 'w') as f:
    for url in to_visit:
        f.write(f"{url}\n")

print(f"Saved {len(to_visit)} URLs to /tmp/discovered_urls.txt")

# 3. Import into kurt database
with open('/tmp/discovered_urls.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if url:
            doc_id = add_document(url)
            print(f"✓ {url}")
```
### scripts/advanced_fetch_custom_extraction.py
```python
"""
Advanced Content Fetching with Custom Trafilatura Settings
Use when you need custom extraction behavior:
- Include/exclude comments
- Favor precision vs recall
- Custom output formats
- Extraction timeouts
Trafilatura Documentation:
- Core Functions: https://trafilatura.readthedocs.io/en/latest/corefunctions.html
- Extract Function: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract
- Extraction Settings: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extraction-settings
- Metadata Extraction: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract-metadata
"""
import trafilatura
from pathlib import Path
import sqlite3
from urllib.parse import urlparse

# 1. Get NOT_FETCHED documents from database
conn = sqlite3.connect('.kurt/kurt.sqlite')
cursor = conn.cursor()
cursor.execute("""
    SELECT id, source_url
    FROM documents
    WHERE ingestion_status = 'NOT_FETCHED'
""")
docs = cursor.fetchall()

# 2. Custom trafilatura extraction
for doc_id, url in docs:
    try:
        # Download
        downloaded = trafilatura.fetch_url(url)

        # Extract with custom settings
        content = trafilatura.extract(
            downloaded,
            output_format='markdown',
            include_comments=False,  # Exclude comments
            include_tables=True,     # Include tables
            include_links=True,      # Preserve links
            favor_precision=True,    # Favor precision over recall
            favor_recall=False,
        )

        if not content:
            raise ValueError("No content extracted")

        # Get metadata
        metadata = trafilatura.extract_metadata(downloaded)
        title = metadata.title if metadata else url.split('/')[-1]

        # 3. Save to filesystem (same structure as kurt)
        parsed = urlparse(url)
        domain = parsed.netloc
        path = parsed.path.strip('/') or 'index'
        if not path.endswith('.md'):
            path = path + '.md'

        filepath = Path(f"sources/{domain}/{path}")
        filepath.parent.mkdir(parents=True, exist_ok=True)
        filepath.write_text(content)

        # 4. Update database
        relative_path = f"{domain}/{path}"
        cursor.execute("""
            UPDATE documents
            SET title = ?,
                content_path = ?,
                ingestion_status = 'FETCHED',
                updated_at = CURRENT_TIMESTAMP
            WHERE id = ?
        """, (title, relative_path, doc_id))
        conn.commit()
        print(f"✓ {url} ({len(content)} chars)")

    except Exception as e:
        # Mark as ERROR
        cursor.execute("""
            UPDATE documents
            SET ingestion_status = 'ERROR',
                updated_at = CURRENT_TIMESTAMP
            WHERE id = ?
        """, (doc_id,))
        conn.commit()
        print(f"✗ {url}: {e}")

conn.close()
```
### scripts/custom_extraction_config.py
```python
"""
Custom Extraction Configuration
Use custom trafilatura settings for extraction timeout, minimum size, etc.
Trafilatura Documentation:
- Extraction Settings: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extraction-settings
- Configuration Options: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#options
- Settings Module: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#settings
"""
import trafilatura
from trafilatura.settings import use_config
# Configure custom settings
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "30")
config.set("DEFAULT", "MIN_EXTRACTED_SIZE", "200")
# Download and extract with custom config
url = "https://example.com/page"
downloaded = trafilatura.fetch_url(url)
content = trafilatura.extract(
    downloaded,
    config=config,
    output_format='markdown',
)
print(content)
```