# Model Comparison Tool A standalone CLI tool to compare the tagging performance of AI models using your existing Karakeep bookmarks. ## Features - **Two comparison modes:** - **Model vs Model**: Compare two AI models against each other - **Model vs Existing**: Compare a new model against existing AI-generated tags on your bookmarks - Fetches existing bookmarks from your Karakeep instance - Runs tagging inference with AI models - **Random shuffling**: Models/tags are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias - Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B") - Interactive voting interface - Shows final results with winner ## Setup ### Environment Variables Required environment variables: ```bash # Karakeep API configuration KARAKEEP_API_KEY=your_api_key_here KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com # Comparison mode (default: model-vs-model) # - "model-vs-model": Compare two models against each other # - "model-vs-existing": Compare a model against existing AI tags COMPARISON_MODE=model-vs-model # Models to compare # MODEL1_NAME: The new model to test (always required) # MODEL2_NAME: The second model to compare against (required only for model-vs-model mode) MODEL1_NAME=gpt-4o-mini MODEL2_NAME=claude-3-5-sonnet # OpenAI/OpenRouter API configuration (for running inference) OPENAI_API_KEY=your_openai_or_openrouter_key OPENAI_BASE_URL=https://openrouter.ai/api/v1 # Optional, defaults to OpenAI # Optional: Number of bookmarks to test (default: 10) COMPARE_LIMIT=10 ``` ### Using OpenRouter For OpenRouter, set: ```bash OPENAI_BASE_URL=https://openrouter.ai/api/v1 OPENAI_API_KEY=your_openrouter_key ``` ### Using OpenAI Directly For OpenAI directly: ```bash OPENAI_API_KEY=your_openai_key # OPENAI_BASE_URL can be omitted for direct OpenAI ``` ## Usage ### Run with pnpm (Recommended) ```bash cd tools/compare-models pnpm install pnpm run ``` ### Run with environment file Create a `.env` file: ```env KARAKEEP_API_KEY=your_api_key KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com MODEL1_NAME=gpt-4o-mini MODEL2_NAME=claude-3-5-sonnet OPENAI_API_KEY=your_openai_key COMPARE_LIMIT=10 ``` Then run: ```bash pnpm run ``` ### Using directly with node If you prefer to run the compiled JavaScript directly: ```bash pnpm build export KARAKEEP_API_KEY=your_api_key export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com export MODEL1_NAME=gpt-4o-mini export MODEL2_NAME=claude-3-5-sonnet export OPENAI_API_KEY=your_openai_key node dist/index.js ``` ## Comparison Modes ### Model vs Model Mode Compare two different AI models against each other: ```bash COMPARISON_MODE=model-vs-model MODEL1_NAME=gpt-4o-mini MODEL2_NAME=claude-3-5-sonnet ``` This mode runs inference with both models on each bookmark and lets you choose which tags are better. ### Model vs Existing Mode Compare a new model against existing AI-generated tags on your bookmarks: ```bash COMPARISON_MODE=model-vs-existing MODEL1_NAME=gpt-4o-mini # MODEL2_NAME is not required in this mode ``` This mode is useful for: - Testing if a new model produces better tags than your current model - Evaluating whether to switch from one model to another - Quality assurance on existing AI tags **Note:** This mode only compares bookmarks that already have AI-generated tags (tags with `attachedBy: "ai"`). Bookmarks without AI tags are automatically filtered out. ## Usage Flow 1. The tool fetches your latest link bookmarks from Karakeep - In **model-vs-existing** mode, only bookmarks with existing AI tags are included 2. For each bookmark, it randomly assigns the options to "Model A" or "Model B" and runs tagging 3. You'll see a side-by-side comparison (randomly shuffled each time): ``` === Bookmark 1/10 === How to Build Better AI Systems https://example.com/article This article explores modern approaches to... ───────────────────────────────────── Model A (blind): • ai • machine-learning • engineering Model B (blind): • artificial-intelligence • ML • software-development ───────────────────────────────────── Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] > ``` 4. Choose your preference: - `1` - Vote for Model A - `2` - Vote for Model B - `s` or `skip` - Skip this comparison - `q` or `quit` - Exit early and show current results 5. After completing all comparisons (or quitting early), results are displayed: ``` ─────────────────────────────────────── === FINAL RESULTS === ─────────────────────────────────────── gpt-4o-mini: 6 votes claude-3-5-sonnet: 3 votes Skipped: 1 Errors: 0 ─────────────────────────────────────── Total bookmarks tested: 10 🏆 WINNER: gpt-4o-mini ─────────────────────────────────────── ``` 6. The actual model names are only shown in the final results - during voting you see only "Model A" and "Model B" ## Bookmark Filtering The tool currently tests only: - **Link-type bookmarks** (not text notes or assets) - **Non-archived** bookmarks - **Latest N bookmarks** (where N is COMPARE_LIMIT) - **In model-vs-existing mode**: Only bookmarks with existing AI tags (tags with `attachedBy: "ai"`) ## Architecture This tool leverages Karakeep's shared infrastructure: - **API Client**: Uses `@karakeep/sdk` for type-safe API interactions with proper authentication - **Inference**: Reuses `@karakeep/shared/inference` for OpenAI client with structured output support - **Prompts**: Uses `@karakeep/shared/prompts` for consistent tagging prompt generation with token management - No code duplication - all core functionality is shared with the main Karakeep application ## Error Handling - If a model fails to generate tags for a bookmark, an error is shown and comparison continues - Errors are counted separately in final results - Missing required environment variables will cause the tool to exit with a clear error message ## Build To build a standalone binary: ```bash pnpm build ``` The built binary will be in `dist/index.js`. ## Notes - The tool is designed for manual, human-in-the-loop evaluation - No results are persisted - they're only displayed in console - Content is fetched with `includeContent=true` from Karakeep API - Uses Karakeep SDK (`@karakeep/sdk`) for type-safe API interactions - Inference runs sequentially to keep state management simple - Recommended to use `pnpm run` for the best experience (uses tsx for development) - **Random shuffling**: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.