| author | Mohamed Bassem <me@mbassem.com> | 2025-12-26 11:14:17 +0000 |
|---|---|---|
| committer | Mohamed Bassem <me@mbassem.com> | 2025-12-26 11:14:17 +0000 |
| commit | 1dfa5d12f6af6ca964bdfa911809a061ffdf36c2 (patch) | |
| tree | 87c734eaa5395051a0a46972ca575f2866c73dd5 /tools/compare-models/README.md | |
| parent | ecb7a710ca7ec22aa3304b8d1f6b603bb60874bc (diff) | |
chore: add a tool for comparing perf of different models
Diffstat (limited to 'tools/compare-models/README.md')
| -rw-r--r-- | tools/compare-models/README.md | 186 |
1 files changed, 186 insertions, 0 deletions
diff --git a/tools/compare-models/README.md b/tools/compare-models/README.md
new file mode 100644
index 00000000..b8ef5138
--- /dev/null
+++ b/tools/compare-models/README.md
@@ -0,0 +1,186 @@
+# Model Comparison Tool
+
+A standalone CLI tool to compare the tagging performance of two AI models using your existing Karakeep bookmarks.
+
+## Features
+
+- Fetches existing bookmarks from your Karakeep instance
+- Runs tagging inference on each bookmark with two different models
+- **Random shuffling**: Models are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias
+- Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B")
+- Interactive voting interface
+- Shows final results with winner
+
+## Setup
+
+### Environment Variables
+
+Required environment variables:
+
+```bash
+# Karakeep API configuration
+KARAKEEP_API_KEY=your_api_key_here
+KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
+
+# Models to compare
+MODEL1_NAME=gpt-4o-mini
+MODEL2_NAME=claude-3-5-sonnet
+
+# OpenAI/OpenRouter API configuration (for running inference)
+OPENAI_API_KEY=your_openai_or_openrouter_key
+OPENAI_BASE_URL=https://openrouter.ai/api/v1 # Optional, defaults to OpenAI
+
+# Optional: Number of bookmarks to test (default: 10)
+COMPARE_LIMIT=10
+```
+
+### Using OpenRouter
+
+For OpenRouter, set:
+```bash
+OPENAI_BASE_URL=https://openrouter.ai/api/v1
+OPENAI_API_KEY=your_openrouter_key
+```
+
+### Using OpenAI Directly
+
+For OpenAI directly:
+```bash
+OPENAI_API_KEY=your_openai_key
+# OPENAI_BASE_URL can be omitted for direct OpenAI
+```
+
+## Usage
+
+### Run with pnpm (Recommended)
+
+```bash
+cd tools/compare-models
+pnpm install
+pnpm run
+```
+
+### Run with environment file
+
+Create a `.env` file:
+
+```env
+KARAKEEP_API_KEY=your_api_key
+KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
+MODEL1_NAME=gpt-4o-mini
+MODEL2_NAME=claude-3-5-sonnet
+OPENAI_API_KEY=your_openai_key
+COMPARE_LIMIT=10
+```
+
+Then run:
+```bash
+pnpm run
+```
+
+### Using directly with node
+
+If you prefer to run the compiled JavaScript directly:
+
+```bash
+pnpm build
+export KARAKEEP_API_KEY=your_api_key
+export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
+export MODEL1_NAME=gpt-4o-mini
+export MODEL2_NAME=claude-3-5-sonnet
+export OPENAI_API_KEY=your_openai_key
+node dist/index.js
+```
+
+## Usage Flow
+
+1. The tool fetches your latest link bookmarks from Karakeep
+2. For each bookmark, it randomly assigns your two models to "Model A" or "Model B" and runs tagging with both
+3. You'll see a side-by-side comparison (models are randomly shuffled each time):
+   ```
+   === Bookmark 1/10 ===
+   How to Build Better AI Systems
+   https://example.com/article
+   This article explores modern approaches to...
+
+   ─────────────────────────────────────
+
+   Model A (blind):
+   • ai
+   • machine-learning
+   • engineering
+
+   Model B (blind):
+   • artificial-intelligence
+   • ML
+   • software-development
+
+   ─────────────────────────────────────
+
+   Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] >
+   ```
+
+4. Choose your preference:
+   - `1` - Vote for Model A
+   - `2` - Vote for Model B
+   - `s` or `skip` - Skip this comparison
+   - `q` or `quit` - Exit early and show current results
+
+5. After completing all comparisons (or quitting early), results are displayed:
+   ```
+   ───────────────────────────────────────
+   === FINAL RESULTS ===
+   ───────────────────────────────────────
+   gpt-4o-mini: 6 votes
+   claude-3-5-sonnet: 3 votes
+   Skipped: 1
+   Errors: 0
+   ───────────────────────────────────────
+   Total bookmarks tested: 10
+
+   🏆 WINNER: gpt-4o-mini
+   ───────────────────────────────────────
+   ```
+
+6. The actual model names are only shown in the final results - during voting you see only "Model A" and "Model B"
+
+## Bookmark Filtering
+
+The tool currently tests only:
+- **Link-type bookmarks** (not text notes or assets)
+- **Non-archived** bookmarks
+- **Latest N bookmarks** (where N is COMPARE_LIMIT)
+
+## SDK Usage
+
+This tool uses the Karakeep SDK for all API interactions:
+- Type-safe requests using `@karakeep/sdk`
+- Proper authentication handling via Bearer token
+- Pagination support for fetching multiple bookmarks
+
+
+## Error Handling
+
+- If a model fails to generate tags for a bookmark, an error is shown and comparison continues
+- Errors are counted separately in final results
+- Missing required environment variables will cause the tool to exit with a clear error message
+
+## Build
+
+To build a standalone binary:
+
+```bash
+pnpm build
+```
+
+The built binary will be in `dist/index.js`.
+
+## Notes
+
+- The tool is designed for manual, human-in-the-loop evaluation
+- No results are persisted - they're only displayed in console
+- Content is fetched with `includeContent=true` from Karakeep API
+- Uses Karakeep SDK (`@karakeep/sdk`) for type-safe API interactions
+- Inference runs sequentially to keep state management simple
+- Recommended to use `pnpm run` for the best experience (uses tsx for development)
+- **Random shuffling**: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.
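
The README in this diff describes a blind comparison: for each bookmark the two models are randomly mapped to "Model A" and "Model B", votes are cast against those labels, and real names appear only in the final tally. The commit's actual implementation is not shown on this page, so the following is only a minimal sketch of that idea. Every name in it (`assignBlind`, `recordVote`, `winner`, `Tally`) is hypothetical, and returning `null` on a tie is an assumption rather than documented behavior.

```typescript
// Hypothetical sketch; none of these names come from the Karakeep source.
type Slot = "A" | "B";

// Which real model is hidden behind each anonymous label for one bookmark.
interface BlindAssignment {
  A: string;
  B: string;
}

// Vote counts keyed by real model name.
type Tally = Record<string, number>;

// Randomly decide which model is shown as "Model A".
// `rand` is injectable so the coin flip can be made deterministic in tests.
function assignBlind(
  model1: string,
  model2: string,
  rand: () => number = Math.random,
): BlindAssignment {
  return rand() < 0.5 ? { A: model1, B: model2 } : { A: model2, B: model1 };
}

// Translate a vote for an anonymous slot back to the real model and count it.
function recordVote(tally: Tally, assignment: BlindAssignment, vote: Slot): void {
  const model = assignment[vote];
  tally[model] = (tally[model] ?? 0) + 1;
}

// Return the model with the most votes, or null when there are no votes
// or the top count is shared (tie handling is an assumption of this sketch).
function winner(tally: Tally): string | null {
  let best: string | null = null;
  let bestVotes = -1;
  let tied = false;
  for (const model of Object.keys(tally)) {
    const votes = tally[model] ?? 0;
    if (votes > bestVotes) {
      best = model;
      bestVotes = votes;
      tied = false;
    } else if (votes === bestVotes) {
      tied = true;
    }
  }
  return tied ? null : best;
}
```

Because the assignment is drawn independently per bookmark, a voter who habitually prefers the first column spreads that preference across both models, which is the position bias the README says the shuffle is meant to remove.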
