# Model Comparison Tool

A standalone CLI tool to compare the tagging performance of two AI models using your existing Karakeep bookmarks.

## Features

- Fetches existing bookmarks from your Karakeep instance
- Runs tagging inference on each bookmark with two different models
- **Random shuffling**: Models are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias
- Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B")
- Interactive voting interface
- Shows final results with winner

## Setup

### Environment Variables

Required environment variables:

```bash
# Karakeep API configuration
KARAKEEP_API_KEY=your_api_key_here
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com

# Models to compare
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

# OpenAI/OpenRouter API configuration (for running inference)
OPENAI_API_KEY=your_openai_or_openrouter_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1  # Optional, defaults to OpenAI

# Optional: Number of bookmarks to test (default: 10)
COMPARE_LIMIT=10
```
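
As a rough illustration of how these variables might be read and validated, here is a minimal sketch. The `CompareConfig` shape and the `loadConfig` helper are hypothetical names chosen for this example, not the tool's actual code; the variable names and the default of 10 come from the list above.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface CompareConfig {
  apiKey: string;
  serverAddr: string;
  model1: string;
  model2: string;
  openaiApiKey: string;
  openaiBaseUrl?: string; // undefined means "use the OpenAI default"
  limit: number;
}

const REQUIRED = [
  "KARAKEEP_API_KEY",
  "KARAKEEP_SERVER_ADDR",
  "MODEL1_NAME",
  "MODEL2_NAME",
  "OPENAI_API_KEY",
] as const;

// Takes the environment as a plain object so it can be tested without
// touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): CompareConfig {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  // COMPARE_LIMIT is optional and defaults to 10.
  const limit = Number.parseInt(env["COMPARE_LIMIT"] ?? "10", 10);
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error("COMPARE_LIMIT must be a positive integer");
  }
  return {
    apiKey: env["KARAKEEP_API_KEY"]!,
    serverAddr: env["KARAKEEP_SERVER_ADDR"]!,
    model1: env["MODEL1_NAME"]!,
    model2: env["MODEL2_NAME"]!,
    openaiApiKey: env["OPENAI_API_KEY"]!,
    openaiBaseUrl: env["OPENAI_BASE_URL"],
    limit,
  };
}
```

In a Node.js script this would typically be called as `loadConfig(process.env)`.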

### Using OpenRouter

For OpenRouter, set:
```bash
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key
```

### Using OpenAI Directly

For OpenAI directly:
```bash
OPENAI_API_KEY=your_openai_key
# OPENAI_BASE_URL can be omitted for direct OpenAI
```

## Usage

### Run with pnpm (Recommended)

```bash
cd tools/compare-models
pnpm install
pnpm run
```

### Run with environment file

Create a `.env` file:

```env
KARAKEEP_API_KEY=your_api_key
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
OPENAI_API_KEY=your_openai_key
COMPARE_LIMIT=10
```

Then run:
```bash
pnpm run
```

### Using directly with node

If you prefer to run the compiled JavaScript directly:

```bash
pnpm build
export KARAKEEP_API_KEY=your_api_key
export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
export MODEL1_NAME=gpt-4o-mini
export MODEL2_NAME=claude-3-5-sonnet
export OPENAI_API_KEY=your_openai_key
node dist/index.js
```

## Usage Flow

1. The tool fetches your latest link bookmarks from Karakeep
2. For each bookmark, it randomly assigns your two models to "Model A" or "Model B" and runs tagging with both
3. You'll see a side-by-side comparison (models are randomly shuffled each time):
   ```
   === Bookmark 1/10 ===
   How to Build Better AI Systems
   https://example.com/article
   This article explores modern approaches to...

   ─────────────────────────────────────

   Model A (blind):
     • ai
     • machine-learning
     • engineering

   Model B (blind):
     • artificial-intelligence
     • ML
     • software-development

   ─────────────────────────────────────

   Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] >
   ```

4. Choose your preference:
   - `1` - Vote for Model A
   - `2` - Vote for Model B
   - `s` or `skip` - Skip this comparison
   - `q` or `quit` - Exit early and show current results

5. After completing all comparisons (or quitting early), results are displayed:
   ```
   ───────────────────────────────────────
   === FINAL RESULTS ===
   ───────────────────────────────────────
   gpt-4o-mini: 6 votes
   claude-3-5-sonnet: 3 votes
   Skipped: 1
   Errors: 0
   ───────────────────────────────────────
   Total bookmarks tested: 10

   🏆 WINNER: gpt-4o-mini
   ───────────────────────────────────────
   ```

6. The actual model names are revealed only in the final results. During voting you see only "Model A" and "Model B".
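
The blind assignment and vote tally in the steps above could be sketched roughly as follows. All names here (`assignSlots`, `parseVote`, `recordVote`, `winner`) are hypothetical, and returning `null` on a tie is an assumption: this README does not say how the tool handles ties.

```typescript
type Slot = "A" | "B";
type Vote = "1" | "2" | "skip" | "quit";

// Randomly decide which real model hides behind "Model A" for one bookmark.
// `rand` is injectable so the behaviour can be tested deterministically.
function assignSlots(
  model1: string,
  model2: string,
  rand: () => number = Math.random,
): Record<Slot, string> {
  return rand() < 0.5 ? { A: model1, B: model2 } : { A: model2, B: model1 };
}

// Normalise raw prompt input into a vote, or null if it is not recognised.
function parseVote(input: string): Vote | null {
  switch (input.trim().toLowerCase()) {
    case "1": return "1";
    case "2": return "2";
    case "s": case "skip": return "skip";
    case "q": case "quit": return "quit";
    default: return null;
  }
}

interface Tally {
  votes: Record<string, number>; // keyed by real model name
  skipped: number;
}

// Map the blind vote back to the real model name before counting it.
function recordVote(tally: Tally, slots: Record<Slot, string>, vote: "1" | "2" | "skip"): void {
  if (vote === "skip") {
    tally.skipped += 1;
    return;
  }
  const model = vote === "1" ? slots.A : slots.B;
  tally.votes[model] = (tally.votes[model] ?? 0) + 1;
}

// Model with the most votes; null when nobody voted or the top two are tied
// (tie handling is an assumption of this sketch).
function winner(tally: Tally): string | null {
  const [first, second] = Object.entries(tally.votes).sort((a, b) => b[1] - a[1]);
  if (!first) return null;
  if (second && second[1] === first[1]) return null;
  return first[0];
}
```

Because votes are stored under the real model name, the per-bookmark shuffle never has to be undone when the results are printed.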

## Bookmark Filtering

The tool currently tests only:
- **Link-type bookmarks** (not text notes or assets)
- **Non-archived** bookmarks
- **Latest N bookmarks** (where N is `COMPARE_LIMIT`)
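
To make the three criteria concrete, here is a minimal sketch of such a filter. `BookmarkLike` is a simplified, hypothetical shape assumed for this example; the real SDK types carry more fields, and the tool may apply some of these filters server-side instead.

```typescript
// Simplified, hypothetical bookmark shape used only for this example.
interface BookmarkLike {
  id: string;
  archived: boolean;
  createdAt: string; // ISO 8601 timestamp
  content: { type: "link" | "text" | "asset" };
}

// Keep non-archived link bookmarks, newest first, capped at `limit`.
function selectBookmarks(all: BookmarkLike[], limit: number): BookmarkLike[] {
  return all
    .filter((b) => b.content.type === "link" && !b.archived)
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt))
    .slice(0, limit);
}
```

Since `filter` returns a new array, the in-place `sort` does not reorder the caller's list.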

## SDK Usage

This tool uses the Karakeep SDK for all API interactions:
- Type-safe requests using `@karakeep/sdk`
- Proper authentication handling via Bearer token
- Pagination support for fetching multiple bookmarks
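
The pagination point can be illustrated with a generic cursor loop. This sketch deliberately does not call `@karakeep/sdk`: `fetchPage` is a hypothetical callback standing in for whatever SDK call returns one page, and the `items`/`nextCursor` field names are assumptions.

```typescript
// Hypothetical page shape; the real API response fields may differ.
interface Page<T> {
  items: T[];
  nextCursor: string | null; // null when there are no more pages
}

// Stand-in for an SDK call that fetches one page starting at `cursor`.
type FetchPage<T> = (cursor: string | null) => Promise<Page<T>>;

// Follow cursors until `limit` items are collected or the pages run out.
async function fetchUpTo<T>(fetchPage: FetchPage<T>, limit: number): Promise<T[]> {
  const out: T[] = [];
  let cursor: string | null = null;
  do {
    const page: Page<T> = await fetchPage(cursor);
    out.push(...page.items);
    cursor = page.nextCursor;
  } while (cursor !== null && out.length < limit);
  // The last page may overshoot the limit, so trim the result.
  return out.slice(0, limit);
}
```

Stopping as soon as enough items are collected avoids requesting pages that would be discarded anyway.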


## Error Handling

- If a model fails to generate tags for a bookmark, an error is shown and comparison continues
- Errors are counted separately in final results
- Missing required environment variables will cause the tool to exit with a clear error message
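
The "show the error and continue" behaviour for a failed model could look roughly like the sketch below. The `Tagger` signature and the `tagWithBoth` helper are hypothetical; the actual inference call and error reporting in the tool may differ.

```typescript
// Hypothetical tagger: resolves to a list of tags, or rejects on failure.
type Tagger = (model: string, content: string) => Promise<string[]>;

interface Outcome {
  tags: Record<string, string[]>; // tags keyed by model name
  failed: string[];               // models whose inference call threw
}

async function tagWithBoth(
  tagger: Tagger,
  models: [string, string],
  content: string,
): Promise<Outcome> {
  const outcome: Outcome = { tags: {}, failed: [] };
  // Sequential on purpose: one inference call at a time.
  for (const model of models) {
    try {
      outcome.tags[model] = await tagger(model, content);
    } catch {
      // Record the failure instead of aborting the whole run.
      outcome.failed.push(model);
    }
  }
  return outcome;
}
```

A caller could increment its error counter and move on to the next bookmark whenever `failed` is non-empty, so a single failing request does not end the session.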

## Build

To compile the tool to JavaScript:

```bash
pnpm build
```

The compiled output is written to `dist/index.js`, which you can run with `node dist/index.js`.

## Notes

- The tool is designed for manual, human-in-the-loop evaluation
- Results are not persisted; they are only printed to the console
- Bookmark content is fetched from the Karakeep API with `includeContent=true`
- Uses Karakeep SDK (`@karakeep/sdk`) for type-safe API interactions
- Inference runs sequentially to keep state management simple
- Running with `pnpm run` is recommended (it uses tsx for development)
- **Random shuffling**: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.