path: root/apps/workers
Commit message | Author | Age | Files | Lines
* feat(ai): Support restricting AI tags to a subset of existing tags (#2444) | Mohamed Bassem | 3 days | 1 | -1/+33
    * feat(ai): Support restricting AI tags to a subset of existing tags

    Co-authored-by: Claude <noreply@anthropic.com>
* feat(crawler): Split bookmark metadata updates into two phases for faster feedback (#2467) | Mohamed Bassem | 3 days | 1 | -22/+32
    * feat(crawler): write metadata to DB early for faster user feedback

    Split the single DB transaction in crawlAndParseUrl into two phases:
    - Phase 1: Write metadata (title, description, favicon, author, etc.) immediately after extraction, before downloading assets
    - Phase 2: Write content and asset references after all assets are stored (banner image, screenshot, pdf, html content)

    This gives users near-instant feedback with bookmark metadata while the slower asset downloads and uploads happen in the background.

    https://claude.ai/code/session_013vKTXDcb5CEve3WMszQJmZ

    * fix(crawler): move crawledAt to phase 2 DB write

    crawledAt should only be set once all assets are fully stored, not during the early metadata write.

    https://claude.ai/code/session_013vKTXDcb5CEve3WMszQJmZ

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
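A minimal sketch of the two-phase write described in this commit; the store interface and helper names below are hypothetical, not the actual crawlAndParseUrl code:

```ts
// Illustrative only: real code writes to the Karakeep DB schema, not this interface.
interface Metadata {
  title?: string;
  description?: string;
  favicon?: string;
  author?: string;
}

interface StoredAssets {
  bannerAssetId?: string;
  screenshotAssetId?: string;
  htmlContent?: string;
}

interface BookmarkStore {
  writeMetadata(bookmarkId: string, meta: Metadata): Promise<void>;
  writeContent(bookmarkId: string, assets: StoredAssets, crawledAt: Date): Promise<void>;
}

async function crawlAndStore(
  store: BookmarkStore,
  bookmarkId: string,
  extract: () => Promise<Metadata>,
  downloadAssets: () => Promise<StoredAssets>,
) {
  // Phase 1: cheap metadata lands in the DB right away, so the UI can show
  // title/description/favicon almost immediately.
  const meta = await extract();
  await store.writeMetadata(bookmarkId, meta);

  // Phase 2: the slow asset work (banner image, screenshot, pdf, html) runs
  // afterwards; crawledAt is only set once everything is stored.
  const assets = await downloadAssets();
  await store.writeContent(bookmarkId, assets, new Date());
}
```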
* fix: treat bookmark not found as a no-op in rule engine instead of a failure (#2464) | Mohamed Bassem | 3 days | 1 | -2/+9
    When a bookmark is deleted before the rule engine worker processes its event, the worker would throw an error, triggering failure metrics, error logging, and retries. This changes both the worker and RuleEngine.forBookmark to gracefully skip processing with an info log instead.

    Co-authored-by: Claude <noreply@anthropic.com>
* feat: Add separate queue for import link crawling (#2452) | Mohamed Bassem | 4 days | 3 | -35/+51
    * feat: add separate queue for import link crawling

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
* feat(metrics): add prometheus metric for bookmark crawl latency (#2461) | Mohamed Bassem | 4 days | 3 | -2/+28
    Track the time from bookmark creation to crawl completion as a histogram (karakeep_bookmark_crawl_latency_seconds). This measures the end-to-end latency users experience when adding bookmarks via extension, web, etc. Excludes recrawls (crawledAt already set) and imports (low priority jobs).

    https://claude.ai/code/session_019jTGGXGWzK9C5aTznQhdgz

    Co-authored-by: Claude <noreply@anthropic.com>
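A sketch of how such a histogram can be recorded with prom-client; only the metric name comes from the commit, the bucket boundaries and helper function are illustrative:

```ts
import { Histogram } from "prom-client";

const crawlLatency = new Histogram({
  name: "karakeep_bookmark_crawl_latency_seconds",
  help: "Time from bookmark creation to crawl completion",
  buckets: [1, 5, 15, 30, 60, 120, 300, 600], // illustrative buckets
});

// Called after a successful crawl, and only for first-time crawls
// (no crawledAt yet) of non-import jobs, per the commit description.
function observeCrawlLatency(bookmarkCreatedAt: Date) {
  const seconds = (Date.now() - bookmarkCreatedAt.getTime()) / 1000;
  crawlLatency.observe(seconds);
}
```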
* fix(ci): fix missing format error | Mohamed Bassem | 7 days | 1 | -1/+1
* feat: add extra instrumentation in the otel traces (#2453) | Mohamed Bassem | 7 days | 4 | -26/+178
* fix(import): sanitize error messages to prevent backend detail leakage (#2455) | Mohamed Bassem | 7 days | 1 | -1/+26
    The catch block in processOneBookmark was storing raw error strings via String(error) in the resultReason field, which is exposed to users through the getImportSessionResults tRPC route. This could leak internal details like database constraint errors, file paths, stack traces, or connection strings.

    Replace String(error) with getSafeErrorMessage() that only allows through:
    - TRPCError client errors (designed to be user-facing)
    - Known safe validation messages from the import worker
    - A generic fallback for all other errors

    The full error is still logged server-side for debugging.

    https://claude.ai/code/session_01F1NHE9dqio5LJ177vmSCvt

    Co-authored-by: Claude <noreply@anthropic.com>
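A sketch of the sanitization idea; the getSafeErrorMessage name and the three categories it allows come from the commit, but the allowlist entries and the exact checks are assumptions:

```ts
import { TRPCError } from "@trpc/server";

// Hypothetical allowlist of validation messages the import worker itself emits.
const KNOWN_SAFE_MESSAGES = new Set(["Unsupported file format"]);

function getSafeErrorMessage(error: unknown): string {
  // tRPC client errors are written to be shown to users.
  if (error instanceof TRPCError && error.code !== "INTERNAL_SERVER_ERROR") {
    return error.message;
  }
  if (error instanceof Error && KNOWN_SAFE_MESSAGES.has(error.message)) {
    return error.message;
  }
  // Everything else (DB errors, file paths, stack traces) gets a generic
  // fallback; the full error is still logged server-side.
  return "Import failed due to an internal error";
}
```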
* fix(import): skip counting pending items for paused sessions | Mohamed Bassem | 7 days | 1 | -7/+16
* fix(import): register import metrics to the prom registry | Mohamed Bassem | 7 days | 2 | -1/+9
* fix(import): propagate crawling/tagging failure to import status | Mohamed Bassem | 7 days | 1 | -18/+50
* fix: extra logging for the import worker | Mohamed Bassem | 7 days | 1 | -13/+39
* fix: backfill old sessions and do queue backpressure (#2449) | Mohamed Bassem | 7 days | 1 | -21/+54
    * fix: backfill old sessions and do queue backpressure
    * fix typo
* feat: Import workflow v3 (#2378) | Mohamed Bassem | 7 days | 2 | -1/+579
    * feat: import workflow v3
    * batch stage
    * revert migration
    * cleanups
    * pr comments
    * move to models
    * add allowed workers
    * e2e tests
    * import list ids
    * add missing indices
    * merge test
    * more fixes
    * add resume/pause to UI
    * fix ui states
    * fix tests
    * simplify progress tracking
    * remove backpressure
    * fix list imports
    * fix race on claiming bookmarks
    * remove the codex file
* feat: Add LLM-based OCR as alternative to Tesseract (#2442) | Mohamed Bassem | 10 days | 1 | -9/+58
    * feat(ocr): add LLM-based OCR support alongside Tesseract

    Add support for using configured LLM inference providers (OpenAI or Ollama) for OCR text extraction from images as an alternative to Tesseract.

    Changes:
    - Add OCR_USE_LLM environment variable flag (default: false)
    - Add buildOCRPrompt function for LLM-based text extraction
    - Add readImageTextWithLLM function in asset preprocessing worker
    - Update extractAndSaveImageText to route between Tesseract and LLM OCR
    - Update documentation with the new configuration option

    When OCR_USE_LLM is enabled, the system uses the configured inference model to extract text from images. If no inference provider is configured, it falls back to Tesseract.

    https://claude.ai/code/session_01Y7h7kDAmqXKXEWDmWbVkDs

    * format

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
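A sketch of the routing described above, with the config fields and the two OCR backends passed in as parameters; this is illustrative, not the actual asset-preprocessing worker code:

```ts
interface OcrConfig {
  ocrUseLlm: boolean;           // mirrors the OCR_USE_LLM flag
  inferenceConfigured: boolean; // is an OpenAI/Ollama provider set up?
}

async function extractImageText(
  image: Buffer,
  config: OcrConfig,
  readWithTesseract: (img: Buffer) => Promise<string>,
  readWithLlm: (img: Buffer, prompt: string) => Promise<string>,
): Promise<string> {
  // Use LLM OCR only when explicitly enabled AND a provider is configured;
  // otherwise fall back to Tesseract.
  if (config.ocrUseLlm && config.inferenceConfigured) {
    const prompt =
      "Extract all readable text from this image. Return only the text."; // hypothetical prompt
    return readWithLlm(image, prompt);
  }
  return readWithTesseract(image);
}
```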
* feat: batch meilisearch requests (#2441) | Mohamed Bassem | 10 days | 1 | -7/+18
    * feat: batch meilisearch requests
    * more fixes
* fix(web): don't bundle tiktoken in client bundles | Mohamed Bassem | 10 days | 2 | -2/+3
* refactor: lazy init background queues | Mohamed Bassem | 10 days | 1 | -10/+50
* fix: Accept more permissive RSS feed content types and Fix User-Agent key (#2353) | E.T. | 2026-01-11 | 1 | -2/+2
    * Fix User-Agent key and accept more permissive content types

    Some feeds serve Content-Type application/xml only and respond with a 406 error when the request's Accept header only allows application/rss+xml. This change accepts the more permissive content types application/xml and text/xml as well.

    Also fixes the header key from UserAgent to the correct User-Agent.

    * Fix: Remove trailing whitespace in feedWorker.ts

    Fix formatting on the HTTP header for RSS acceptable content types introduced in commit 6896392.

    * format
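A sketch of what the corrected feed request headers could look like; the exact User-Agent string and Accept list are assumptions, only the added application/xml and text/xml types come from the commit:

```ts
async function fetchFeed(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: {
      // "User-Agent" is the correct header key (the bug was a wrong key name).
      "User-Agent": "Karakeep feed fetcher", // hypothetical UA string
      // Some feeds only serve application/xml or text/xml and answer 406 if
      // the Accept header is too narrow.
      Accept:
        "application/rss+xml, application/atom+xml, application/xml, text/xml",
    },
  });
  if (!res.ok) {
    throw new Error(`Feed fetch failed with status ${res.status}`);
  }
  return res.text();
}
```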
* chore: add a note about hostname allowlists in the validation error message | Mohamed Bassem | 2026-01-02 | 1 | -1/+1
* chore: worker tracing (#2321) | Mohamed Bassem | 2025-12-30 | 12 | -821/+1030
* fix: reset tagging status on crawl failure (#2316) | Mohamed Bassem | 2025-12-29 | 1 | -15/+37
    * feat: add the ability to specify a different changelog version
    * fix: reset tagging status on crawl failure
    * fix missing crawlStatus in loadMulti
* feat: add customizable tag styles (#2312) | Mohamed Bassem | 2025-12-27 | 2 | -7/+37
    * feat: add customizable tag styles
    * add tag lang setting
    * ui settings cleanup
    * fix migration
    * change look of the field
    * more fixes
    * fix tests
* feat: support archiving as pdf (#2309) | Mohamed Bassem | 2025-12-27 | 2 | -3/+110
    * feat: support archiving as pdf
    * add support for manually triggering pdf downloads
    * fix submenu
    * menu cleanup
    * fix store pdf
* deps: upgrade tesseract to v7 | Mohamed Bassem | 2025-12-26 | 1 | -1/+1
* fix: preserve failure count when rescheduling rate limited domains (#2303) | Mohamed Bassem | 2025-12-25 | 1 | -38/+13
    * fix: preserve retry count when rate-limited jobs are rescheduled

    Previously, when a domain was rate-limited in the crawler worker, the job would be re-enqueued as a new job, which reset the failure count. This meant rate-limited jobs could retry indefinitely without respecting the max retry limit.

    This commit introduces a RateLimitRetryError exception that signals the queue system to retry the job after a delay without counting it as a failed attempt. The job is retried within the same invocation, preserving the original retry count.

    Changes:
    - Add RateLimitRetryError class to shared/queueing.ts
    - Update crawler worker to throw RateLimitRetryError instead of re-enqueuing
    - Update Restate queue service to handle RateLimitRetryError with delay
    - Update Liteque queue wrapper to handle RateLimitRetryError with delay

    This ensures that rate-limited jobs respect the configured retry limits while still allowing for delayed retries when domains are rate-limited.

    * refactor: use liteque's native RetryAfterError for rate limiting

    Instead of manually handling retries in a while loop, translate RateLimitRetryError to liteque's native RetryAfterError. This is cleaner and lets liteque handle the retry logic using its built-in mechanism.

    * test: add tests for RateLimitRetryError handling in restate queue

    Added comprehensive tests to verify that:
    1. RateLimitRetryError delays retry appropriately
    2. Rate-limited retries don't count against the retry limit
    3. Jobs can be rate-limited more times than the retry limit
    4. Regular errors still respect the retry limit

    These tests ensure the queue correctly handles rate limiting without exhausting retry attempts.

    * lint & format

    * fix: prevent onError callback for RateLimitRetryError

    Fixed two issues with RateLimitRetryError handling in restate queue:
    1. RateLimitRetryError now doesn't trigger the onError callback since it's not a real error, it's expected rate limiting behavior
    2. Check for RateLimitRetryError in runWorkerLogic before calling onError, ensuring the instanceof check works correctly before the error gets further wrapped by restate

    Updated tests to verify onError is not called for rate limit retries.

    * fix: catch RateLimitRetryError before ctx.run wraps it

    Changed approach to use a discriminated union instead of throwing and catching RateLimitRetryError. Now we catch the error inside the ctx.run callback before it gets wrapped by restate's TerminalError, and return a RunResult type that indicates success, rate limit, or error.

    This fixes the issue where instanceof checks would fail because ctx.run wraps all errors in TerminalError.

    * more fixes

    * rename error name

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
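A sketch of the final approach this commit converged on: a dedicated error type meaning "retry later, but don't count this attempt as a failure", caught and turned into a discriminated union before any framework wrapping. Names and shapes are illustrative, not the actual shared/queueing.ts or Restate service code:

```ts
class RateLimitRetryError extends Error {
  constructor(public readonly retryAfterMs: number) {
    super(`Rate limited, retry after ${retryAfterMs}ms`);
    this.name = "RateLimitRetryError";
  }
}

// Discriminated union returned by the job runner so the queue layer can
// distinguish "delay and retry without consuming an attempt" from real errors.
type RunOutcome =
  | { kind: "success" }
  | { kind: "rateLimited"; retryAfterMs: number }
  | { kind: "error"; error: unknown };

async function runJob(job: () => Promise<void>): Promise<RunOutcome> {
  try {
    await job();
    return { kind: "success" };
  } catch (err) {
    // Caught before any framework wrapping so `instanceof` still works.
    if (err instanceof RateLimitRetryError) {
      return { kind: "rateLimited", retryAfterMs: err.retryAfterMs };
    }
    return { kind: "error", error: err };
  }
}
```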
* feat: Add user settings to disable auto tagging/summarization (#2275) | Mohamed Bassem | 2025-12-22 | 2 | -1/+32
    * feat: Add per-user settings to disable auto-tagging and auto-summarization

    This commit adds user-level controls for AI features when they are enabled on the server. Users can now toggle auto-tagging and auto-summarization on/off from the AI Settings page.

    Changes:
    - Added autoTaggingEnabled and autoSummarizationEnabled fields to user table
    - Updated user settings schemas and API endpoints to handle new fields
    - Modified inference workers to check user preferences before processing
    - Added toggle switches to AI Settings page (only visible when server has features enabled)
    - Generated database migration for new fields
    - Exposed enableAutoTagging and enableAutoSummarization in client config

    The settings default to null (use server default). When explicitly set to false, the user's bookmarks will skip the respective AI processing.

    * revert migration

    * i18n

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
* fix: optimize tagging db queries (#2287) | Mohamed Bassem | 2025-12-22 | 1 | -18/+18
    * fix: optimize tagging db queries
    * review
    * parallel queries
    * refactoring
* fix: Fix Amazon product image extraction on amazon.com URLs (#2108) | Randall Hand | 2025-12-14 | 2 | -0/+79
    The metascraper-amazon package extracts the first .a-dynamic-image element, which on amazon.com is often the Prime logo instead of the product image. This works fine on amazon.co.uk where the product image appears first in the DOM.

    Created a custom metascraper plugin that uses more specific selectors (#landingImage, #imgTagWrapperId, #imageBlock) to target the actual product image. By placing this plugin before metascraperAmazon() in the chain, we fix image extraction while preserving all other Amazon metadata (title, brand, description).

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-authored-by: Claude <noreply@anthropic.com>
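A sketch of such a custom metascraper rule bundle; the selectors and the ordering argument come from the commit, but the exact rule shape here should be treated as illustrative rather than the plugin's actual code:

```ts
import metascraper from "metascraper";
import metascraperAmazon from "metascraper-amazon";

// Custom rule bundle: check Amazon's product-image containers before the
// generic .a-dynamic-image lookup (which can match the Prime logo).
const amazonProductImage = () => ({
  image: [
    // metascraper rules receive a cheerio instance as `htmlDom`.
    ({ htmlDom: $ }: { htmlDom: any }) =>
      $("#landingImage").attr("src") ||
      $("#imgTagWrapperId img").attr("src") ||
      $("#imageBlock img").attr("src"),
  ],
});

// Order matters: the custom bundle sits ahead of metascraper-amazon so its
// image rule wins when it finds a match.
const scraper = metascraper([amazonProductImage(), metascraperAmazon()]);
```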
* feat: use reddit API for metadata extraction. Fixes #1853 #1883 | Mohamed Bassem | 2025-12-13 | 3 | -33/+343
* fix: use GET requests for the content type request | Mohamed Bassem | 2025-12-13 | 1 | -1/+1
* feat: make asset preprocessing worker timeout configurable | Claude | 2025-12-10 | 1 | -1/+1
    - Added ASSET_PREPROCESSING_JOB_TIMEOUT_SEC environment variable with default of 60 seconds (increased from hardcoded 30 seconds)
    - Updated worker to use the configurable timeout from serverConfig
    - Added documentation for the new configuration option
* fix: migrate to metascraper-x from metascraper-twitter | Mohamed Bassem | 2025-12-08 | 2 | -3/+3
* feat: spread feed fetch scheduling deterministically over the hour (#2227) | Mohamed Bassem | 2025-12-08 | 1 | -0/+31
    Previously, all RSS feeds were fetched at the top of each hour (minute 0), which could cause load spikes. This change spreads feed fetches evenly throughout the hour using a deterministic hash of the feed ID.

    Each feed is assigned a target minute (0-59) based on its ID hash, ensuring consistent scheduling across restarts while distributing the load evenly.

    Co-authored-by: Claude <noreply@anthropic.com>
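A sketch of the deterministic spreading; the choice of hash function is an assumption (any stable hash of the feed ID gives the same property):

```ts
import { createHash } from "node:crypto";

// Map a feed id to a stable minute of the hour (0-59).
function feedFetchMinute(feedId: string): number {
  const digest = createHash("sha256").update(feedId).digest();
  // Use the first 4 bytes as an unsigned integer, then reduce into 0-59.
  return digest.readUInt32BE(0) % 60;
}

// A feed is due in the current cron tick if its assigned minute matches.
function isFeedDue(feedId: string, now: Date): boolean {
  return feedFetchMinute(feedId) === now.getMinutes();
}
```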
* fix: better extraction for youtube thumbnails. #2204 | Mohamed Bassem | 2025-12-07 | 2 | -0/+14
* feat: Add automated bookmark backup feature (#2182) | Mohamed Bassem | 2025-11-29 | 4 | -0/+573
    * feat: Add automated bookmark backup system

    Implements a comprehensive automated backup feature for user bookmarks with the following capabilities:

    Database Schema:
    - Add backupSettings table to store user backup preferences (enabled, frequency, retention)
    - Add backups table to track backup records with status and metadata
    - Add BACKUP asset type for storing compressed backup files
    - Add migration 0066_add_backup_tables.sql

    Background Workers:
    - Implement BackupSchedulingWorker cron job (runs daily at midnight UTC)
    - Create BackupWorker to process individual backup jobs
    - Deterministic scheduling spreads backup jobs across 24 hours based on user ID hash
    - Support for daily and weekly backup frequencies
    - Automated retention cleanup to delete old backups based on user settings

    Export & Compression:
    - Reuse existing export functionality for bookmark data
    - Compress exports using Node.js built-in zlib (gzip level 9)
    - Store compressed backups as assets with proper metadata
    - Track backup size and bookmark count for statistics

    tRPC API:
    - backups.getSettings - Retrieve user backup configuration
    - backups.updateSettings - Update backup preferences
    - backups.list - List all user backups with metadata
    - backups.get - Get specific backup details
    - backups.delete - Delete a backup
    - backups.download - Download backup file (base64 encoded)
    - backups.triggerBackup - Manually trigger backup creation

    UI Components:
    - BackupSettings component with configuration form
    - Enable/disable automatic backups toggle
    - Frequency selection (daily/weekly)
    - Retention period configuration (1-365 days)
    - Backup list table with download and delete actions
    - Manual backup trigger button
    - Display backup stats (size, bookmark count, status)
    - Added backups page to settings navigation

    Technical Details:
    - Uses Restate queue system for distributed job processing
    - Implements idempotency keys to prevent duplicate backups
    - Background worker concurrency: 2 jobs at a time
    - 10-minute timeout for large backup exports
    - Proper error handling and logging throughout
    - Type-safe implementation with Zod schemas

    * refactor: simplify backup settings and asset handling

    - Move backup settings from separate table to user table columns
    - Update BackupSettings model to use static methods with users table
    - Remove download mutation in favor of direct asset links
    - Implement proper quota checks using QuotaService.checkStorageQuota
    - Update UI to use new property names and direct asset downloads
    - Update shared types to match new schema

    Key changes:
    - backupSettingsTable removed, settings now in users table
    - Backup downloads use direct /api/assets/{id} links
    - Quota properly validated before creating backup assets
    - Cleaner separation of concerns in tRPC models

    * migration
    * use zip instead of gzip
    * fix drizzle
    * fix settings
    * streaming json
    * remove more dead code
    * add e2e tests
    * return backup
    * poll for backups
    * more fixes
    * more fixes
    * fix test
    * fix UI
    * fix delete asset
    * fix ui
    * redirect for backup download
    * cleanups
    * fix idempotency
    * fix tests
    * add ratelimit
    * add error handling for background backups
    * i18n
    * model changes

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
* fix: lazy load js-tiktoken in prompts module (#2176) | Mohamed Bassem | 2025-11-28 | 2 | -4/+4
    * feat: lazy load tiktoken to reduce memory footprint

    The js-tiktoken module loads a large encoding dictionary into memory immediately on import. This change defers the loading of the encoding until it's actually needed by using a lazy getter pattern. This reduces memory usage for processes that import this module but don't actually use the token encoding functions.

    * fix: use createRequire for lazy tiktoken import in ES module

    The previous implementation used bare require() which fails at runtime in ES modules (ReferenceError: require is not defined). This fixes it by using createRequire from Node's 'module' package, which creates a require function that works in ES module contexts.

    * refactor: convert tiktoken lazy loading to async dynamic imports

    Changed from createRequire to async import() for lazy loading tiktoken, making buildTextPrompt and buildSummaryPrompt async. This is cleaner for ES modules and properly defers the large tiktoken encoding data until it's actually needed.

    Updated all callers to await these async functions:
    - packages/trpc/routers/bookmarks.ts
    - apps/workers/workers/inference/tagging.ts
    - apps/workers/workers/inference/summarize.ts
    - apps/web/components/settings/AISettings.tsx (converted to useEffect)

    * feat: add untruncated prompt builders for UI previews

    Added buildTextPromptUntruncated and buildSummaryPromptUntruncated functions that don't require token counting or truncation. These are synchronous and don't load tiktoken, making them perfect for UI previews where exact token limits aren't needed.

    Updated AISettings.tsx to use these untruncated versions, eliminating the need for useEffect/useState and avoiding unnecessary tiktoken loading in the browser.

    * fix

    * fix

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
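A sketch of the lazy dynamic-import pattern the final revision describes; the encoding name and the truncation loop are illustrative, not the actual prompts-module code:

```ts
// The heavy js-tiktoken encoding tables are only pulled in on first use.
let tokenCounterPromise: Promise<(text: string) => number> | null = null;

async function getTokenCounter(): Promise<(text: string) => number> {
  if (!tokenCounterPromise) {
    tokenCounterPromise = import("js-tiktoken").then((mod) => {
      const enc = mod.getEncoding("cl100k_base"); // encoding choice is an assumption
      return (text: string) => enc.encode(text).length;
    });
  }
  return tokenCounterPromise;
}

// Callers become async, mirroring how buildTextPrompt/buildSummaryPrompt changed.
async function truncateToTokens(text: string, maxTokens: number): Promise<string> {
  const countTokens = await getTokenCounter();
  let result = text;
  // Naive shrink-until-it-fits loop, illustrative only.
  while (countTokens(result) > maxTokens) {
    result = result.slice(0, Math.floor(result.length * 0.9));
  }
  return result;
}
```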
* fix: Propagate group ids in queue calls (#2177) | Mohamed Bassem | 2025-11-27 | 5 | -4/+18
    * fix: Propagate group ids
    * fix tests
* fix: add a way to allowlist all domains from ip validation | Mohamed Bassem | 2025-11-22 | 1 | -0/+4
* deps: upgrade hono and playwright | Mohamed Bassem | 2025-11-16 | 1 | -2/+2
* deps: Upgrade typescript to 5.9 | Mohamed Bassem | 2025-11-16 | 1 | -1/+1
* feat: add Prometheus counter for HTTP status codes (#2117) | Mohamed Bassem | 2025-11-15 | 2 | -1/+13
    * feat: add Prometheus counter for crawler status codes

    Add a new Prometheus metric to track HTTP status codes encountered during crawling operations. This helps monitor crawler health and identify patterns in response codes (e.g., 200 OK, 404 Not Found, etc.).

    Changes:
    - Add crawlerStatusCodeCounter in metrics.ts with status_code label
    - Instrument crawlerWorker.ts to track status codes after page crawling
    - Counter increments for each crawl with the corresponding HTTP status code

    The metric is exposed at the /metrics endpoint and follows the naming convention: karakeep_crawler_status_codes_total

    * fix: update counter name to follow Prometheus conventions

    Change metric name from "karakeep_crawler_status_codes" to "karakeep_crawler_status_codes_total" to comply with Prometheus naming best practices for counter metrics.

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
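A sketch of the counter using prom-client; the metric name and status_code label come from the commit, the helper function is illustrative:

```ts
import { Counter } from "prom-client";

const crawlerStatusCodes = new Counter({
  name: "karakeep_crawler_status_codes_total",
  help: "HTTP status codes encountered while crawling",
  labelNames: ["status_code"],
});

// Called once per crawl with the response's HTTP status code.
function recordCrawlStatus(statusCode: number) {
  crawlerStatusCodes.inc({ status_code: String(statusCode) });
}
```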
* feat: correct default prom metrics from web and worker containers | Mohamed Bassem | 2025-11-10 | 1 | -0/+1
* fix: fix crash in crawler on invalid URL in matchesNoProxy | Mohamed Bassem | 2025-11-10 | 1 | -3/+9
* feat: add crawler domain rate limiting (#2115) | Mohamed Bassem | 2025-11-09 | 1 | -4/+80
* refactor: Allow runner functions to return results to onComplete | Mohamed Bassem | 2025-11-09 | 1 | -1/+1
* feat: add failed_permanent metric for worker monitoring (#2107) | Mohamed Bassem | 2025-11-09 | 9 | -0/+32
    * feat: add last failure timestamp metric for worker monitoring

    Add a Prometheus Gauge metric to track the timestamp of the last failure for each worker. This complements the existing failed job counter by providing visibility into when failures last occurred for monitoring and alerting purposes.

    Changes:
    - Added workerLastFailureGauge metric in metrics.ts
    - Updated all 9 workers to set the gauge on failure: crawler, feed, webhook, assetPreProcessing, inference, adminMaintenance, ruleEngine, video, search

    * refactor: track both all failures and permanent failures with counter

    Remove the gauge metric and use the existing counter to track both:
    - All failures (including retry attempts): status="failed"
    - Permanent failures (retries exhausted): status="failed_permanent"

    This provides better visibility into retry behavior and permanent vs temporary failures without adding a separate metric.

    Changes:
    - Removed workerLastFailureGauge from metrics.ts
    - Updated all 9 workers to track failed_permanent when numRetriesLeft == 0
    - Maintained existing failed counter for all failure attempts

    * style: format worker files with prettier

    ---------

    Co-authored-by: Claude <noreply@anthropic.com>
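A sketch of the failure accounting the final revision describes; the counter name and callback shape are illustrative, only the failed/failed_permanent split and the numRetriesLeft check come from the commit:

```ts
import { Counter } from "prom-client";

const workerJobs = new Counter({
  name: "karakeep_worker_jobs_total", // hypothetical name
  help: "Worker job outcomes",
  labelNames: ["worker", "status"],
});

function onJobFailed(worker: string, numRetriesLeft: number) {
  // Every failed attempt is counted, including ones that will be retried.
  workerJobs.inc({ worker, status: "failed" });
  // Only when retries are exhausted does the failure become permanent.
  if (numRetriesLeft === 0) {
    workerJobs.inc({ worker, status: "failed_permanent" });
  }
}
```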
* fix: metascraper logo to go through proxy if one configured. fixes #1863 | Mohamed Bassem | 2025-11-03 | 1 | -1/+14
* fix: fix monolith to respect crawler proxy | Mohamed Bassem | 2025-11-02 | 1 | -0/+9
* feat(rss): Add import tags from RSS feed categories (#2031) | Mohamed Bassem | 2025-11-02 | 1 | -0/+29
    * feat(feeds): Add import tags from RSS feed categories

    - Add importTags boolean field to rssFeedsTable schema (default: false)
    - Create database migration 0063_add_import_tags_to_feeds.sql
    - Update zod schemas (zFeedSchema, zNewFeedSchema, zUpdateFeedSchema) to include importTags
    - Update Feed model to handle importTags in create and update methods
    - Update feedWorker to:
      - Read title and categories from RSS parser
      - Attach categories as tags to bookmarks when importTags is enabled
      - Log warnings if tag attachment fails

    Resolves #1996

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-Authored-By: Mohamed Bassem <MohamedBassem@users.noreply.github.com>

    * feat(web): Add importTags option to feed settings UI

    - Add importTags toggle to FeedsEditorDialog (create feed)
    - Add importTags toggle to EditFeedDialog (edit feed)
    - Display as a bordered switch control with descriptive text
    - Defaults to false for new feeds

    Co-authored-by: Mohamed Bassem <MohamedBassem@users.noreply.github.com>

    * fix migration
    * remove extra migration

    ---------

    Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
    Co-authored-by: Mohamed Bassem <MohamedBassem@users.noreply.github.com>
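A sketch of the feedWorker behaviour described above, with the tag-attachment call injected as a parameter; this is illustrative, not the actual Karakeep model API:

```ts
interface FeedItem {
  title: string;
  link: string;
  categories?: string[];
}

async function maybeAttachCategoryTags(
  bookmarkId: string,
  item: FeedItem,
  importTags: boolean,
  attachTags: (bookmarkId: string, tags: string[]) => Promise<void>,
  logWarn: (msg: string) => void,
) {
  // Only act when the feed has importTags enabled and the item has categories.
  if (!importTags || !item.categories?.length) {
    return;
  }
  try {
    await attachTags(bookmarkId, item.categories);
  } catch (err) {
    // Tag attachment failures shouldn't fail the whole feed run; warn and move on.
    logWarn(`Failed to attach tags for "${item.title}": ${String(err)}`);
  }
}
```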