# Model Comparison Tool

A standalone CLI tool to compare the tagging performance of AI models using your existing Karakeep bookmarks.

## Features

- **Two comparison modes:**
  - **Model vs Model**: Compare two AI models against each other
  - **Model vs Existing**: Compare a new model against existing AI-generated tags on your bookmarks
- Fetches existing bookmarks from your Karakeep instance
- Runs tagging inference with AI models
- **Random shuffling**: Models/tags are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias
- Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B")
- Interactive voting interface
- Shows the final results and the winner

## Setup

### Environment Variables

Required environment variables:

```bash
# Karakeep API configuration
KARAKEEP_API_KEY=your_api_key_here
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com

# Comparison mode (default: model-vs-model)
# - "model-vs-model": Compare two models against each other
# - "model-vs-existing": Compare a model against existing AI tags
COMPARISON_MODE=model-vs-model

# Models to compare
# MODEL1_NAME: The new model to test (always required)
# MODEL2_NAME: The second model to compare against (required only for model-vs-model mode)
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

# OpenAI/OpenRouter API configuration (for running inference)
OPENAI_API_KEY=your_openai_or_openrouter_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1  # Optional, defaults to OpenAI

# Optional: Number of bookmarks to test (default: 10)
COMPARE_LIMIT=10
```
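
As a rough illustration of how these variables fit together, the sketch below validates them and applies the documented defaults. The `CompareConfig` type and the `loadConfig` and `parseMode` functions are hypothetical names used only for this example; the tool's actual implementation may differ.

```typescript
// Hypothetical config loader: type and function names are illustrative only.
type ComparisonMode = "model-vs-model" | "model-vs-existing";
type Env = Record<string, string | undefined>;

interface CompareConfig {
  karakeepApiKey: string;
  karakeepServerAddr: string;
  openaiApiKey: string;
  openaiBaseUrl: string | undefined; // undefined means "use the OpenAI default"
  mode: ComparisonMode;
  model1: string;
  model2: string | undefined; // only required in model-vs-model mode
  limit: number;
}

function parseMode(raw: string | undefined): ComparisonMode {
  const value = raw ?? "model-vs-model"; // documented default
  if (value === "model-vs-model" || value === "model-vs-existing") return value;
  throw new Error(`Invalid COMPARISON_MODE: ${value}`);
}

function loadConfig(env: Env): CompareConfig {
  const mode = parseMode(env.COMPARISON_MODE);

  // Collect every missing variable so they can all be reported at once.
  const requiredNames = [
    "KARAKEEP_API_KEY",
    "KARAKEEP_SERVER_ADDR",
    "OPENAI_API_KEY",
    "MODEL1_NAME",
  ];
  if (mode === "model-vs-model") requiredNames.push("MODEL2_NAME");
  const missing = requiredNames.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const limit = Number.parseInt(env.COMPARE_LIMIT ?? "10", 10); // documented default: 10
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error(`COMPARE_LIMIT must be a positive integer, got: ${env.COMPARE_LIMIT}`);
  }

  return {
    karakeepApiKey: env.KARAKEEP_API_KEY as string,
    karakeepServerAddr: env.KARAKEEP_SERVER_ADDR as string,
    openaiApiKey: env.OPENAI_API_KEY as string,
    openaiBaseUrl: env.OPENAI_BASE_URL,
    mode,
    model1: env.MODEL1_NAME as string,
    model2: env.MODEL2_NAME,
    limit,
  };
}
```

In a Node.js entry point, a loader like this would typically be called as `loadConfig(process.env)`, with any thrown error printed before exiting with a non-zero status.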

### Using OpenRouter

For OpenRouter, set:
```bash
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key
```

### Using OpenAI Directly

For OpenAI directly:
```bash
OPENAI_API_KEY=your_openai_key
# OPENAI_BASE_URL can be omitted for direct OpenAI
```

## Usage

### Run with pnpm (Recommended)

```bash
cd tools/compare-models
pnpm install
pnpm run
```

### Run with environment file

Create a `.env` file:

```env
KARAKEEP_API_KEY=your_api_key
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
OPENAI_API_KEY=your_openai_key
COMPARE_LIMIT=10
```

Then run:
```bash
pnpm run
```

### Run directly with Node

If you prefer to run the compiled JavaScript directly:

```bash
pnpm build
export KARAKEEP_API_KEY=your_api_key
export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
export MODEL1_NAME=gpt-4o-mini
export MODEL2_NAME=claude-3-5-sonnet
export OPENAI_API_KEY=your_openai_key
node dist/index.js
```

## Comparison Modes

### Model vs Model Mode

Compare two different AI models against each other:

```bash
COMPARISON_MODE=model-vs-model
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
```

This mode runs inference with both models on each bookmark and lets you choose which tags are better.

### Model vs Existing Mode

Compare a new model against existing AI-generated tags on your bookmarks:

```bash
COMPARISON_MODE=model-vs-existing
MODEL1_NAME=gpt-4o-mini
# MODEL2_NAME is not required in this mode
```

This mode is useful for:
- Testing if a new model produces better tags than your current model
- Evaluating whether to switch from one model to another
- Quality assurance on existing AI tags

**Note:** This mode only compares bookmarks that already have AI-generated tags (tags with `attachedBy: "ai"`). Bookmarks without AI tags are automatically filtered out.
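
The filter described in the note can be sketched roughly as follows. The `Tag` and `Bookmark` shapes here are simplified assumptions for illustration (only the `attachedBy: "ai"` marker comes from the text above), and `existingAiTags` and `withAiTags` are hypothetical helper names.

```typescript
// Assumed minimal shapes; the real Karakeep SDK types carry more fields.
interface Tag {
  name: string;
  attachedBy: "ai" | "human";
}

interface Bookmark {
  id: string;
  tags: Tag[];
}

// Names of the tags that were attached by AI: the "existing" side of the comparison.
function existingAiTags(bookmark: Bookmark): string[] {
  return bookmark.tags.filter((tag) => tag.attachedBy === "ai").map((tag) => tag.name);
}

// Keep only bookmarks that have at least one AI-generated tag.
function withAiTags(bookmarks: Bookmark[]): Bookmark[] {
  return bookmarks.filter((bookmark) => existingAiTags(bookmark).length > 0);
}
```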

## Usage Flow

1. The tool fetches your latest link bookmarks from Karakeep
   - In **model-vs-existing** mode, only bookmarks with existing AI tags are included
2. For each bookmark, it runs tagging (in **model-vs-existing** mode, the existing AI tags serve as one side) and randomly assigns the two tag sets to "Model A" and "Model B"
3. You'll see a side-by-side comparison (randomly shuffled each time):
   ```
   === Bookmark 1/10 ===
   How to Build Better AI Systems
   https://example.com/article
   This article explores modern approaches to...

   ─────────────────────────────────────

   Model A (blind):
     • ai
     • machine-learning
     • engineering

   Model B (blind):
     • artificial-intelligence
     • ML
     • software-development

   ─────────────────────────────────────

   Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] >
   ```

4. Choose your preference:
   - `1` - Vote for Model A
   - `2` - Vote for Model B
   - `s` or `skip` - Skip this comparison
   - `q` or `quit` - Exit early and show current results

5. After completing all comparisons (or quitting early), results are displayed:
   ```
   ───────────────────────────────────────
   === FINAL RESULTS ===
   ───────────────────────────────────────
   gpt-4o-mini: 6 votes
   claude-3-5-sonnet: 3 votes
   Skipped: 1
   Errors: 0
   ───────────────────────────────────────
   Total bookmarks tested: 10

   🏆 WINNER: gpt-4o-mini
   ───────────────────────────────────────
   ```

6. The actual model names are shown only in the final results; during voting you see only "Model A" and "Model B"
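
The voting and tallying steps above can be sketched as follows. This is an illustrative sketch under assumptions, not the tool's actual code: `parseChoice`, `Outcome`, and `summarize` are hypothetical names, and reporting a tie as "no winner" is an assumption, since the flow above does not say how ties are handled.

```typescript
type Choice = "A" | "B" | "skip" | "quit";

// Map raw prompt input to a choice; returns null for unrecognized input.
function parseChoice(input: string): Choice | null {
  switch (input.trim()) {
    case "1":
      return "A";
    case "2":
      return "B";
    case "s":
    case "skip":
      return "skip";
    case "q":
    case "quit":
      return "quit";
    default:
      return null;
  }
}

// One outcome per bookmark that was processed.
type Outcome = { kind: "vote"; model: string } | { kind: "skip" } | { kind: "error" };

interface Summary {
  votes: Record<string, number>;
  skipped: number;
  errors: number;
  total: number;
  winner: string | null; // null on a tie (assumed behavior)
}

function summarize(models: [string, string], outcomes: Outcome[]): Summary {
  const [first, second] = models;
  const votes: Record<string, number> = { [first]: 0, [second]: 0 };
  let skipped = 0;
  let errors = 0;

  for (const outcome of outcomes) {
    if (outcome.kind === "vote") {
      votes[outcome.model] = (votes[outcome.model] ?? 0) + 1;
    } else if (outcome.kind === "skip") {
      skipped += 1;
    } else {
      errors += 1;
    }
  }

  const a = votes[first] ?? 0;
  const b = votes[second] ?? 0;
  const winner = a === b ? null : a > b ? first : second;
  return { votes, skipped, errors, total: outcomes.length, winner };
}
```

Votes are recorded against the real model name rather than the blind label, which is why the names can be revealed in the final results even though the labels are reshuffled for every bookmark.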

## Bookmark Filtering

The tool currently tests only:
- **Link-type bookmarks** (not text notes or assets)
- **Non-archived** bookmarks
- **Latest N bookmarks** (where N is COMPARE_LIMIT)
- **In model-vs-existing mode**: Only bookmarks with existing AI tags (tags with `attachedBy: "ai"`)
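
The selection rules above could be expressed as a single client-side filter, sketched below. The `Bookmark` shape (including the `archived`, `createdAt`, and `content.type` fields) and the `selectBookmarks` function are assumptions made for this example. The real tool may apply some of these filters through API query parameters instead, and whether the limit is applied before or after filtering is not stated above; this sketch filters first.

```typescript
// Assumed bookmark shape for illustration; the real Karakeep SDK types may differ.
interface Tag {
  name: string;
  attachedBy: "ai" | "human";
}

interface Bookmark {
  id: string;
  createdAt: string; // ISO 8601 timestamp
  archived: boolean;
  content: { type: "link" | "text" | "asset" };
  tags: Tag[];
}

function selectBookmarks(
  bookmarks: Bookmark[],
  limit: number,
  requireAiTags: boolean, // true in model-vs-existing mode
): Bookmark[] {
  return bookmarks
    .filter((b) => b.content.type === "link") // link-type only
    .filter((b) => !b.archived) // non-archived only
    .filter((b) => !requireAiTags || b.tags.some((t) => t.attachedBy === "ai"))
    .sort((x, y) => Date.parse(y.createdAt) - Date.parse(x.createdAt)) // newest first
    .slice(0, limit); // latest N
}
```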

## Architecture

This tool leverages Karakeep's shared infrastructure:
- **API Client**: Uses `@karakeep/sdk` for type-safe API interactions with proper authentication
- **Inference**: Reuses `@karakeep/shared/inference` for OpenAI client with structured output support
- **Prompts**: Uses `@karakeep/shared/prompts` for consistent tagging prompt generation with token management
- No code duplication - all core functionality is shared with the main Karakeep application


## Error Handling

- If a model fails to generate tags for a bookmark, an error is shown and comparison continues
- Errors are counted separately in final results
- Missing required environment variables will cause the tool to exit with a clear error message
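
A minimal sketch of the "record the error and continue" behavior is shown below. The `Tagger` signature and the `tagSequentially` function are hypothetical stand-ins for the real inference call, used only to illustrate the control flow.

```typescript
// Hypothetical tagger: resolves to a list of tags, or rejects when inference fails.
type Tagger = (bookmarkId: string) => Promise<string[]>;

interface TaggingRun {
  results: Array<{ bookmarkId: string; tags: string[] }>;
  errors: Array<{ bookmarkId: string; message: string }>;
}

// Tags bookmarks one at a time; a failure is recorded and the loop moves on.
async function tagSequentially(bookmarkIds: string[], tagger: Tagger): Promise<TaggingRun> {
  const run: TaggingRun = { results: [], errors: [] };
  for (const bookmarkId of bookmarkIds) {
    try {
      const tags = await tagger(bookmarkId);
      run.results.push({ bookmarkId, tags });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      run.errors.push({ bookmarkId, message });
    }
  }
  return run;
}
```

The length of `errors` corresponds to the separate error count reported in the final results.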

## Build

To compile the tool to JavaScript:

```bash
pnpm build
```

The compiled output is written to `dist/index.js` and can be run with `node dist/index.js`.

## Notes

- The tool is designed for manual, human-in-the-loop evaluation
- Results are not persisted; they are only printed to the console
- Content is fetched from the Karakeep API with `includeContent=true`
- Uses Karakeep SDK (`@karakeep/sdk`) for type-safe API interactions
- Inference runs sequentially to keep state management simple
- Running via `pnpm run` is recommended for the best experience (it uses tsx for development)
- **Random shuffling**: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.
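
The random assignment can be sketched as a fair coin flip per bookmark. The names below (`Candidate`, `BlindPair`, `blindAssign`, `resolveVote`) are hypothetical and chosen for this example only; the injectable `random` parameter is likewise an assumption that makes the behavior easy to test.

```typescript
interface Candidate {
  model: string; // real model name, hidden from the voter
  tags: string[];
}

interface BlindPair {
  a: Candidate; // displayed as "Model A"
  b: Candidate; // displayed as "Model B"
}

// Randomly decide which candidate is shown as "Model A" for this bookmark.
function blindAssign(
  first: Candidate,
  second: Candidate,
  random: () => number = Math.random,
): BlindPair {
  return random() < 0.5 ? { a: first, b: second } : { a: second, b: first };
}

// Translate a blind vote ("1" = Model A, "2" = Model B) back to the real model name.
function resolveVote(pair: BlindPair, choice: "1" | "2"): string {
  return choice === "1" ? pair.a.model : pair.b.model;
}
```

Because the assignment is redrawn for every bookmark, a voter who tends to favor one position ends up spreading that bias roughly evenly across both models.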