
Qwen-Ollama

Using Qwen 2.5 models via Ollama for local LLM inference, text analysis, and AI-powered automation

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.

Stars: 6
Hot score: 82
Updated: March 20, 2026
Overall rating: C (3.7)
Composite score: 3.7
Best-practice grade: D (47.9)

Install command

npx @skill-hub/cli install lawless-m-claude-skills-qwen-ollama

Repository

lawless-m/claude-skills

Skill path: .claude/skills/Qwen-Ollama


Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: lawless-m.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install Qwen-Ollama into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/lawless-m/claude-skills before adding Qwen-Ollama to shared team environments
  • Use Qwen-Ollama for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: Qwen-Ollama
description: Using Qwen 2.5 models via Ollama for local LLM inference, text analysis, and AI-powered automation
---

# Qwen via Ollama

Local LLM inference using Qwen 2.5 models through Ollama. Enables text analysis, summarization, code generation, and structured data analysis without cloud dependencies.

## Instructions

When helping users with Ollama and Qwen models, follow these guidelines:

1. **Installation First**: Always verify Ollama is installed and the desired model is pulled before attempting API calls
2. **Use Appropriate Model Size**: Recommend qwen2.5:7b for balanced performance (4.7 GB), or smaller/larger based on available resources
3. **Set Proper Timeouts**: Default 120s timeout for analysis tasks, longer for complex generation
4. **Handle Streaming**: Use `"stream": false` for simple cases, streaming for real-time feedback
5. **System Prompts**: Define personality and role in system message for consistent behavior
6. **Validate Responses**: Always check the `done` field and handle partial responses appropriately

## Examples

### Example 1: Basic Installation and Setup
```
User: I want to use Qwen locally for text analysis

Claude: I'll help you set up Ollama with Qwen 2.5:

1. Install Ollama service
2. Pull the qwen2.5:7b model (4.7 GB)
3. Verify it's working

[Provides installation commands from Quick Setup pattern below]
```

### Example 2: Rust Integration for Analysis
```
User: I need to analyze server logs with an LLM in my Rust application

Claude: I'll create an Ollama client using reqwest to send logs to Qwen 2.5:
- Async client with timeout
- JSON serialization for requests/responses
- Error handling for network issues

[Provides implementation using Rust Client pattern below]
```

### Example 3: System Prompt for Personality
```
User: How do I make the LLM respond with a specific personality?

Claude: Use the chat API with a system message to define behavior:
- System role sets the personality/instructions
- User role contains the actual query
- The model maintains this context throughout

[Provides chat completion example from System Prompts pattern below]
```

---

# Reference Implementation Details

The sections below contain proven working patterns from production systems.

**Reference Files in This Folder**:
- `OllamaClient.rs` - Complete Rust client implementation from the Marvinous project

## Quick Setup

**Installation:**
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen 2.5 model (7B variant, 4.7 GB)
ollama pull qwen2.5:7b

# Verify installation
ollama list
systemctl status ollama
```

**Model Variants:**
- `qwen2.5:0.5b` - Tiny (500 MB) for testing
- `qwen2.5:7b` - Balanced (4.7 GB) **recommended**
- `qwen2.5:14b` - Better quality (8.7 GB)
- `qwen2.5:32b` - Highest quality (19 GB)

## Basic API Patterns

### Generate Completion (Simple Text)

**Endpoint:** `POST http://localhost:11434/api/generate`

```bash
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "prompt": "Explain RAID levels in servers",
    "stream": false
  }'
```

**Response:**
```json
{
  "model": "qwen2.5:7b",
  "response": "RAID (Redundant Array of Independent Disks) provides...",
  "done": true
}
```

### Chat Completion (With Context)

**Endpoint:** `POST http://localhost:11434/api/chat`

```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [
      {"role": "system", "content": "You are a server monitoring expert."},
      {"role": "user", "content": "What does IPMI provide?"}
    ],
    "stream": false
  }'
```

## Rust Client Pattern

**Location:** `Marvinous/src/llm/client.rs`
**Purpose:** Async Ollama client with timeout and error handling

### Dependencies

```toml
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["full"] }
```

### Client Implementation

```rust
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::time::Duration;

#[derive(Serialize)]
struct GenerateRequest {
    model: String,
    prompt: String,
    stream: bool,
}

#[derive(Deserialize)]
struct GenerateResponse {
    response: String,
    done: bool,
}

pub struct OllamaClient {
    client: Client,
    endpoint: String,
    model: String,
}

impl OllamaClient {
    pub fn new(endpoint: &str, model: &str, timeout_secs: u64) -> Self {
        let client = Client::builder()
            .timeout(Duration::from_secs(timeout_secs))
            .build()
            .expect("Failed to create HTTP client");

        Self {
            client,
            endpoint: endpoint.to_string(),
            model: model.to_string(),
        }
    }

    pub async fn generate(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        let request = GenerateRequest {
            model: self.model.clone(),
            prompt: prompt.to_string(),
            stream: false,
        };

        let response = self.client
            .post(format!("{}/api/generate", self.endpoint))
            .json(&request)
            .send()
            .await?
            .json::<GenerateResponse>()
            .await?;

        Ok(response.response)
    }
}
```

**Usage:**
```rust
#[tokio::main]
async fn main() {
    let client = OllamaClient::new("http://localhost:11434", "qwen2.5:7b", 120);
    let result = client.generate("Analyze this system log").await.unwrap();
    println!("{}", result);
}
```

**Key Points:**
- Timeout prevents hanging on long generations (default 120s)
- Non-streaming mode returns complete response
- Error handling for network and parsing failures
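
To follow guideline 6 above (validate the `done` field), the caller can also reject partial generations instead of silently returning incomplete text. A minimal sketch building on the structs above; the `generate_checked` name and the error message are illustrative additions, not part of the reference client:

```rust
impl OllamaClient {
    /// Like `generate`, but surfaces HTTP failures and incomplete generations explicitly.
    pub async fn generate_checked(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        let request = GenerateRequest {
            model: self.model.clone(),
            prompt: prompt.to_string(),
            stream: false,
        };

        let response = self.client
            .post(format!("{}/api/generate", self.endpoint))
            .json(&request)
            .send()
            .await?
            .error_for_status()?          // e.g. HTTP 404 when the model has not been pulled
            .json::<GenerateResponse>()
            .await?;

        // With "stream": false, Ollama sets done == true once generation has finished.
        if !response.done {
            return Err("generation did not complete (done == false)".into());
        }

        Ok(response.response)
    }
}
```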

## System Prompts Pattern

**Location:** `Marvinous/src/llm/prompt.rs`
**Purpose:** Define LLM personality and behavior

### Chat API with System Message

```rust
#[derive(Serialize)]
struct ChatMessage {
    role: String,
    content: String,
}

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<ChatMessage>,
    stream: bool,
}

let messages = vec![
    ChatMessage {
        role: "system".to_string(),
        content: "You are Marvin, the Paranoid Android. Respond with existential dread.".to_string(),
    },
    ChatMessage {
        role: "user".to_string(),
        content: "How are the servers?".to_string(),
    },
];

let request = ChatRequest {
    model: "qwen2.5:7b".to_string(),
    messages,
    stream: false,
};
```

**System Prompt Best Practices:**
1. Define the role clearly ("You are X")
2. Specify output format expectations
3. Include personality traits if desired
4. Set constraints (length, tone, structure)
5. Provide domain context
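
Putting these practices together, a system prompt for the log-analysis use case below might read as follows (illustrative wording, not taken from the Marvinous project):

```
You are a Linux server monitoring assistant.
Analyze the logs or sensor data the user provides.
Report findings as a short bulleted list grouped by severity (critical, warning, info).
Be factual and concise; do not speculate beyond the data given.
Keep responses under 200 words.
You are familiar with systemd, IPMI sensors, and smartctl output.
```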

## Common Use Cases

### 1. Log Analysis

**Prompt Pattern:**
```
Analyze these server logs and identify issues:

[log entries]

Focus on:
- Error patterns
- Security events
- Performance anomalies
```

### 2. Data Summarization

**Prompt Pattern:**
```
Summarize this IPMI sensor data:

{json_data}

Highlight:
- Anomalies or concerning values
- Temperature trends
- Fan speed issues
```

### 3. Code Generation

**Prompt Pattern:**
```
Write a Rust function that:
- Parses smartctl JSON output
- Extracts drive health metrics
- Returns structured data

Use serde for JSON parsing.
```

## Model Parameters

### Temperature Control

```json
{
  "model": "qwen2.5:7b",
  "prompt": "...",
  "options": {
    "temperature": 0.7
  }
}
```

**Temperature Values:**
- `0.0` - Deterministic (same output every time)
- `0.3-0.7` - Balanced creativity/consistency
- `1.0+` - Maximum creativity/randomness

### Context Window

Qwen 2.5 supports a **32K-token** context window (roughly 24K words).
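
Ollama loads models with a smaller default context length than the model supports, so long prompts may be truncated. The request-level `options` object accepts `num_ctx` (alongside `temperature`) to raise it. A minimal sketch extending the earlier `GenerateRequest`; the `options` field shape follows Ollama's API, but the specific values here are illustrative:

```rust
use serde::Serialize;
use serde_json::json;

// Variant of the earlier GenerateRequest with an optional "options" object.
#[derive(Serialize)]
struct GenerateRequest {
    model: String,
    prompt: String,
    stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    options: Option<serde_json::Value>,
}

let request = GenerateRequest {
    model: "qwen2.5:7b".to_string(),
    prompt: "Summarize the attached report".to_string(),
    stream: false,
    options: Some(json!({
        "num_ctx": 16384,       // raise the per-request context window
        "temperature": 0.3      // keep analytical output consistent
    })),
};
```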

## Troubleshooting

### "Connection Refused" Error

**Cause:** Ollama service not running

**Solution:**
```bash
sudo systemctl start ollama
sudo systemctl status ollama
```

### "Model Not Found" Error

**Cause:** Model not pulled locally

**Solution:**
```bash
ollama list           # Check available models
ollama pull qwen2.5:7b  # Pull the model
```

### Timeout Errors

**Cause:** Generation taking longer than client timeout

**Solution:**
```rust
// Increase timeout for complex tasks
let client = Client::builder()
    .timeout(Duration::from_secs(300))  // 5 minutes
    .build()?;
```

### Out of Memory

**Cause:** Model too large for available RAM/VRAM

**Solution:**
```bash
# Use smaller model
ollama pull qwen2.5:0.5b

# Or check memory usage
free -h
nvidia-smi  # For GPU memory
```

## VRAM Management

### Model Auto-Unloading

Ollama automatically unloads models from VRAM after inactivity to free GPU memory:

**Check Current Status:**
```bash
ollama ps                  # List loaded models
nvidia-smi                 # Check VRAM usage
```

**Model Lifecycle:**
```
1. Request arrives → Model loads to VRAM (~4.7 GB for qwen2.5:7b)
2. Inference runs → GPU processes prompt (5-10 seconds)
3. Response sent → Model stays in VRAM (configured timeout)
4. Timeout expires → Model automatically unloaded (VRAM freed)
```

**Configure Auto-Unload Timeout:**

Edit `/etc/systemd/system/ollama.service.d/models-location.conf`:

```ini
[Service]
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=30s"
```

**Timeout Options:**
- `OLLAMA_KEEP_ALIVE=0` - Unload immediately after each request
- `OLLAMA_KEEP_ALIVE=30s` - Keep for 30 seconds (recommended for shared GPU)
- `OLLAMA_KEEP_ALIVE=5m` - Keep for 5 minutes (default)
- `OLLAMA_KEEP_ALIVE=-1` - Keep loaded indefinitely

After changes:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

### Manual VRAM Control

**Immediately Unload Model:**
```bash
ollama stop qwen2.5:7b
# Frees VRAM instantly for other GPU work
```

**Pre-Load Model (Warm Start):**
```bash
echo "test" | ollama run qwen2.5:7b >/dev/null 2>&1
# Loads model into VRAM before batch jobs
```

**VRAM Management Script:**

Create `~/scripts/ollama-vram.sh`:
```bash
#!/bin/bash
case "$1" in
    status)
        nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
        echo ""
        ollama ps
        ;;
    unload)
        # Parse loaded model names from the first column of `ollama ps` output
        ollama ps | awk 'NR>1 {print $1}' | \
        while read -r model; do ollama stop "$model"; done
        echo "VRAM freed"
        ;;
    load)
        echo "test" | ollama run qwen2.5:7b >/dev/null 2>&1
        ollama ps
        ;;
esac
```

**Usage:**
```bash
chmod +x ~/scripts/ollama-vram.sh

# Check VRAM status
~/scripts/ollama-vram.sh status

# Free VRAM before CUDA work
~/scripts/ollama-vram.sh unload
python train_model.py  # Your GPU work here

# Pre-warm for batch inference
~/scripts/ollama-vram.sh load
```

### Performance Optimization

**GPU Acceleration:**

Ollama automatically uses NVIDIA GPUs if available:

```bash
# Monitor GPU during generation
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1
```

**Concurrent Requests:**

```bash
# Increase max loaded models for high concurrency
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```

**Force CPU-Only:**

```bash
# Disable GPU (use CPU inference)
CUDA_VISIBLE_DEVICES="" ollama serve
```

## Production Deployment

### Systemd Service

Ollama installs as a systemd service automatically:

```bash
# Check status
systemctl status ollama

# View logs
journalctl -u ollama -f

# Restart service
sudo systemctl restart ollama
```

### Configuration

Edit `/etc/systemd/system/ollama.service` to set environment variables:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
```

Then reload:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

## Best Practices Summary

1. **Validate Installation**: Always check `ollama list` before assuming model availability
2. **Set Appropriate Timeouts**: 120s for analysis, 300s for complex generation
3. **Use System Prompts**: Define behavior in system message for consistency
4. **Handle Errors**: Network issues, timeouts, and parsing failures are common
5. **Monitor Resources**: Watch GPU/CPU memory during sustained workloads
6. **Cache Results**: Store frequently used completions to reduce inference time (a minimal cache is sketched below)
7. **Keep Prompts Focused**: Clear, specific instructions produce better results
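
For point 6, a minimal in-memory cache wrapped around the client from the Rust Client Pattern could look like this (the `CachedClient` type, keyed on exact prompt text, is illustrative and not part of the reference implementation):

```rust
use std::collections::HashMap;

/// Illustrative wrapper: memoizes completions by exact prompt text.
pub struct CachedClient {
    inner: OllamaClient,              // the client defined earlier in this document
    cache: HashMap<String, String>,   // prompt -> completion
}

impl CachedClient {
    pub fn new(inner: OllamaClient) -> Self {
        Self { inner, cache: HashMap::new() }
    }

    pub async fn generate(&mut self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        if let Some(hit) = self.cache.get(prompt) {
            return Ok(hit.clone());   // cache hit: skip inference entirely
        }
        let completion = self.inner.generate(prompt).await?;
        self.cache.insert(prompt.to_string(), completion.clone());
        Ok(completion)
    }
}
```

Exact-match caching pays off mainly for repeated prompts at low temperature; at higher temperatures the point of repeated calls is usually fresh output.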

## Reference Implementation

See **Marvinous** project for complete production example:
- `/home/matt/Marvinous/src/llm/client.rs` - Ollama API client
- `/home/matt/Marvinous/src/llm/prompt.rs` - Prompt building
- `/etc/marvinous/system-prompt.txt` - System prompt with Marvin's personality
- `/etc/marvinous/marvinous.toml` - Configuration with model/endpoint settings