Commit history: apps/workers/crawlerWorker.ts
(newest first; each entry shows author, date, files changed, and -removed/+added lines)
* feat: Add AI auto summarization. Fixes #1163 (Mohamed Bassem, 2025-05-18; 1 file, -877/+0)
* chore: rename missing files/conf from Hoarder to Karakeep (#1280) (adripo, 2025-04-21; 1 file, -1/+1)
    - refactor: Rename remaining project configuration from Hoarder to Karakeep
    - some fixes
    Co-authored-by: Mohamed Bassem <me@mbassem.com>
* fix(workers): Fix dompurify to run on readability's input, not its output (Mohamed Bassem, 2025-04-21; 1 file, -4/+12)
* fix(workers): Close the browser when connecting on demand (#1151) (Chang-Yen Tseng, 2025-04-16; 1 file, -0/+3)
* chore: Rename hoarder packages to karakeep (MohamedBassem, 2025-04-12; 1 file, -8/+8)
* feat(workers): Add CRAWLER_SCREENSHOT_TIMEOUT_SEC (#1155) (Chang-Yen Tseng, 2025-03-27; 1 file, -10/+18)
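The screenshot-timeout commit above introduces a per-call time limit. A minimal sketch of such a guard in TypeScript, assuming a Promise-race-style wrapper; the helper name and wiring are illustrative, not the project's actual code:

```typescript
// Reject a long-running operation after a configurable number of seconds.
// "withTimeout" and "label" are invented names for illustration only.
function withTimeout<T>(promise: Promise<T>, timeoutSec: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutSec}s`)),
      timeoutSec * 1000,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// A fake "screenshot" that resolves quickly stays under the limit.
withTimeout(Promise.resolve("screenshot-bytes"), 5, "screenshot")
  .then((v) => console.log(v)); // screenshot-bytes
```

The same pattern works for any browser call that can hang, which is presumably why the commit makes the limit an environment variable rather than a constant.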
* feat(workers): Adds publisher and author og:meta tags to Bookmark (#1141) (erik-nilcoast, 2025-03-22; 1 file, -0/+24)
* feat: Add PDF screenshot generation and display (#995) (Ahmad Mujahid, 2025-02-17; 1 file, -0/+1)
    - Updated pdf2json to 3.1.5
    - Extract and store a screenshot from PDF files using pdf2pic
    - Installing graphicsmagick and ghostscript
    - Generate missing PDF screenshot with tidyAssets worker for backward support
    - Display PDF screenshot instead of the PDF in web if it exists
    - Display PDF screenshot in mobile app if it exists
    - Updated pnpm-lock.yaml
    - Removed console.log
    - Revert the unnecessary changes in package.json
    - Revert pnpm-lock changes
    - Prevent rendering PDF files if the screenshot is not generated
    - refactor: replace useEffect with useMemo for section initialization
    - feat: show PDF file download button and handle large PDFs by defaulting to screenshot view
    - feat: add file size to openapi spec
    - feature: Add assets preprocessing in fix mode to admin actions
    - i18n: add reprocess_assets_fix_mode translation
    - i18n: Add missing ar translations
    - A bunch of fixes
    - Fix openspec schema
    Co-authored-by: Mohamed Bassem <me@mbassem.com>
* fix: Don't rearchive singlefile uploads and consider them as archives (Mohamed Bassem, 2025-02-02; 1 file, -2/+6)
* fix: Abort all IO when workers time out instead of detaching. Fixes #742 (Mohamed Bassem, 2025-02-01; 1 file, -13/+62)
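The "abort all IO" fix replaces detaching from a stuck job with actually cancelling its in-flight work. A hedged sketch of how one AbortSignal can fan out to every pending operation; the function names are invented for illustration, and AbortController is standard in Node.js 15+:

```typescript
// A cancellable delay standing in for any IO step (fetch, screenshot, etc.).
// Real IO APIs such as fetch() accept the same signal directly.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) return reject(new Error("aborted"));
    const timer = setTimeout(resolve, ms);
    signal.addEventListener(
      "abort",
      () => { clearTimeout(timer); reject(new Error("aborted")); },
      { once: true },
    );
  });
}

const controller = new AbortController();
// A worker-level timeout would call controller.abort() when the job overruns,
// causing every operation holding this signal to reject promptly:
controller.abort();
console.log(controller.signal.aborted); // true
```

The key property is that nothing keeps running in the background after the worker gives up, which is what "detaching" failed to guarantee.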
* feat: Change webhooks to be configurable by users (Mohamed Bassem, 2025-01-19; 1 file, -2/+2)
* feat(webhook): Implement webhook functionality for bookmark events (#852) (玄猫, 2025-01-19; 1 file, -0/+4)
    - feat(webhook): Implement webhook functionality for bookmark events
      - Added WebhookWorker to handle webhook requests
      - Integrated webhook triggering in crawlerWorker after video processing
      - Updated main worker initialization to include WebhookWorker
      - Enhanced configuration to support webhook URLs, token, and timeout
      - Documented webhook configuration options in the documentation
      - Introduced zWebhookRequestSchema for validating webhook requests
    - feat(webhook): Update webhook handling and configuration
      - Changed webhook operation type from "create" to "crawled" in crawlerWorker and documentation
      - Enhanced webhook retry logic in WebhookWorker to support multiple attempts
      - Updated Docker configuration to include new webhook environment variables
      - Improved validation for webhook configuration in shared config
      - Adjusted zWebhookRequestSchema to reflect the new operation type
      - Updated documentation to clarify webhook configuration options and usage
    - minor modifications
    Co-authored-by: Mohamed Bassem <me@mbassem.com>
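The webhook entries mention a "crawled" operation validated by zWebhookRequestSchema. As a purely illustrative sketch (the field names below are assumptions, not the project's real schema), the request body a worker POSTs to each configured webhook URL might look like:

```typescript
// Hypothetical shape of a webhook request; only the "crawled" operation
// string is taken from the commit messages, the rest is invented.
interface WebhookRequest {
  bookmarkId: string;  // hypothetical field
  userId: string;      // hypothetical field
  operation: "crawled";
}

const payload: WebhookRequest = {
  bookmarkId: "bm_456",
  userId: "user_789",
  operation: "crawled",
};

// The worker would serialize this and POST it, retrying a few times on
// failure (the follow-up commit adds multi-attempt retry logic).
console.log(JSON.stringify(payload));
```

Validating the payload with a zod schema on both ends (as zWebhookRequestSchema suggests) keeps producers and consumers in sync when the operation type changes, which is exactly what the "create" to "crawled" rename required.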
* feat: Add support for singlefile extension uploads. #172 (Mohamed Bassem, 2025-01-11; 1 file, -6/+30)
* refactor: Move asset preprocessing to its own worker, out of the inference worker (Mohamed Bassem, 2024-12-26; 1 file, -17/+18)
* feature: Store crawling status code and allow users to find broken links. Fixes #169 (Mohamed Bassem, 2024-12-08; 1 file, -4/+6)
* feature(workers): Allow running hoarder without chrome as a hard dependency. Fixes #650 (Mohamed Bassem, 2024-11-30; 1 file, -11/+35)
* fix(workers): Set a timeout on the screenshot call and completely skip it if screenshotting is disabled (Mohamed Bassem, 2024-11-23; 1 file, -13/+32)
* fix(workers): Don't block connection to chrome when failing to download the adblock list. #674 (Mohamed Bassem, 2024-11-21; 1 file, -6/+22)
* chore(workers): Add extra logging for browser connection errors (Mohamed Bassem, 2024-11-21; 1 file, -1/+1)
* fix: Only update bookmark tagging/crawling status when the worker is out of retries (Mohamed Bassem, 2024-11-09; 1 file, -4/+4)
* fix: Pass arguments to monolith and yt-dlp as an array for better escaping (Mohamed Bassem, 2024-11-03; 1 file, -1/+1)
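The escaping fix above relies on a general rule: passing argv as an array (execFile/spawn) bypasses the shell entirely, so quotes, spaces, and metacharacters in a URL can never be reinterpreted as commands. A small demonstration using Node itself in place of monolith or yt-dlp:

```typescript
// With array arguments there is no shell parsing step: each element of the
// array arrives as exactly one argv entry in the child process.
import { spawnSync } from "node:child_process";

// A URL that would be dangerous if interpolated into a shell string.
const untrusted = 'https://example.com/?q="; echo pwned';

// node -e "<script>" <arg> exposes the extra arg as process.argv[1].
const result = spawnSync(
  process.execPath,
  ["-e", "console.log(process.argv[1])", untrusted],
  { encoding: "utf8" },
);

// The child receives the URL verbatim; nothing was shell-expanded.
console.log(result.stdout.trim());
```

Building a single command string and passing it through a shell would require quoting every argument correctly; the array form makes that entire class of bug impossible.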
* feature: Archive videos using yt-dlp. Fixes #215 (#525) (kamtschatka, 2024-10-28; 1 file, -49/+10)
    - Allow downloading more content from a webpage and index it (#215): added a worker that downloads videos depending on the environment variables, refactored the code a bit, added a new video asset, updated documentation
    - Some tweaks
    - Drop the dependency on the yt-dlp wrapper
    - Update openapi specs
    - Dont log an error when the url is not supported
    - Better handle supported websites that dont download anything
    Co-authored-by: Mohamed Bassem <me@mbassem.com>
* deps: Extract the queue implementation into its own repo (Mohamed Bassem, 2024-10-27; 1 file, -1/+1)
* refactor: Start tracking bookmark assets in the assets table (MohamedBassem, 2024-10-06; 1 file, -60/+83)
* refactor: Include userId in the assets table (MohamedBassem, 2024-10-06; 1 file, -0/+5)
* feature(web): Add ability to manually trigger full page archives. Fixes #398 (#418) (kamtschatka, 2024-09-30; 1 file, -3/+5)
    - [Feature Request] Ability to select what to "crawl full page archive" (#398): added the ability to start a full page crawl for links, and added refreshing links as a bulk operation as well
    - minor icon and wording changes
    Co-authored-by: MohamedBassem <me@mbassem.com>
* fix(workers): Log stacktrace on worker error. #424 (#429) (kamtschatka, 2024-09-26; 1 file, -1/+3)
    Extended logging when an exception occurs, so it is possible to see the stacktrace of a failed execution.
* fix(workers): Shutdown workers on SIGTERM (MohamedBassem, 2024-07-28; 1 file, -0/+4)
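For the SIGTERM commit, a hedged sketch of cooperative worker shutdown; the StoppableWorker interface and its stop() method are hypothetical names, not the project's actual API:

```typescript
// Each worker exposes some way to stop pulling new jobs (name is invented).
interface StoppableWorker {
  name: string;
  stop(): void;
}

// Stop every worker and report which ones were shut down.
function shutdownAll(workers: StoppableWorker[]): string[] {
  const stopped: string[] = [];
  for (const w of workers) {
    w.stop();
    stopped.push(w.name);
  }
  return stopped;
}

const workers: StoppableWorker[] = [
  { name: "crawler", stop() { /* close browser, release queue lease */ } },
  { name: "webhook", stop() { /* flush pending deliveries */ } },
];

// Docker sends SIGTERM on `docker stop`; without this handler the process
// would be SIGKILLed mid-job after the grace period.
process.on("SIGTERM", () => {
  shutdownAll(workers);
  process.exit(0);
});

console.log(shutdownAll(workers)); // [ 'crawler', 'webhook' ]
```

Handling SIGTERM matters most in containerized deployments, where an abrupt kill can leave jobs stuck in a "running" state in the queue.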
* fix: async/await issues with the new queue (#319) (kamtschatka, 2024-07-21; 1 file, -2/+2)
* refactor: Replace the usage of bullMQ with the hoarder sqlite-based queue (#309) (Mohamed Bassem, 2024-07-21; 1 file, -31/+29)
* fix: monolith not embedding SVG files correctly. Fixes #289 (#306) (kamtschatka, 2024-07-14; 1 file, -5/+2)
    Passing in the URL of the page to have the proper URL for resolving relative paths.
* refactor: added the bookmark type to the database (#256) (kamtschatka, 2024-07-01; 1 file, -0/+6)
    - refactoring asset types: extracted out functions to silently delete assets and to update them after crawling; generalized the mapping of assets to bookmark fields to make extending them easier
    - Added the bookmark type to the database: introduced an enum to have better type safety; cleaned up the code and based some code on the type directly
    - add BookmarkType.UNKNWON
    - lint and remove unused function
    Co-authored-by: MohamedBassem <me@mbassem.com>
* refactor: remove redundant code from crawler worker and refactor handling of asset types (#253) (kamtschatka, 2024-06-29; 1 file, -32/+49)
    - refactoring asset types: extracted out functions to silently delete assets and to update them after crawling; generalized the mapping of assets to bookmark fields to make extending them easier
    - revert silentDeleteAsset and hide better-sqlite3
    Co-authored-by: MohamedBassem <me@mbassem.com>
* feature: Automatically transfer image urls into bookmarked assets. Fixes #246 (MohamedBassem, 2024-06-23; 1 file, -6/+16)
* refactor: extract assets into their own database table. #215 (#220) (kamtschatka, 2024-06-23; 1 file, -29/+71)
    - Added a new table that contains the information about assets for link bookmarks, plus migration code that transfers the existing data into the new table
    - Removed the old asset columns from the database and updated the UI to use the data from the linkBookmarkAssets array
    - generalize the assets table to not be linked in particular to links
    - fix migrations post merge
    - fix missing asset ids in the getBookmarks call
    Co-authored-by: MohamedBassem <me@mbassem.com>
* feature: add support for PDF links. Fixes #28 (#216) (kamtschatka, 2024-06-22; 1 file, -57/+163)
    - Added a new sourceUrl column to the asset bookmarks
    - Added transforming a link bookmark pointing at a pdf to an asset bookmark
    - Made sure the "View Original" link is also shown for asset bookmarks that have a sourceURL
    - Updated gitignore for IDEA
    - remove pdf parsing from the crawler
    - extract the http logic into its own function to avoid duplicating the post-processing actions (openai/index)
    - Add 5s timeout to the content type fetch
    Co-authored-by: MohamedBassem <me@mbassem.com>
* fix: Trigger search re-index on bookmark tag manual updates. Fixes #208 (#210) (kamtschatka, 2024-06-09; 1 file, -5/+2)
    - Re-indexing was not covering all places when bookmark tags are changed; manual indexing worked as a workaround (#208). Introduced a new function to trigger a reindex to reduce copy/paste, and added missing reindexes when tags are deleted or bookmarks are updated
    - give functions a bit more descriptive name
    Co-authored-by: kamtschatka <simon.schatka@gmx.at>
    Co-authored-by: MohamedBassem <me@mbassem.com>
* fix(crawler): Only update the database if full page archival is enabled (MohamedBassem, 2024-05-26; 1 file, -19/+19)
* feature: Full page archival with monolith. Fixes #132 (MohamedBassem, 2024-05-26; 1 file, -1/+65)
* feature(crawler): Allow connecting to the browser's websocket address and launching the browser on demand. This enables support for browserless (MohamedBassem, 2024-05-15; 1 file, -28/+55)
* feature: Take full page screenshots #143 (#148) (kamtschatka, 2024-05-12; 1 file, -1/+2)
    Added the fullPage flag to take full-page screenshots, and updated the UI to properly show the screenshots instead of scaling them down.
    Co-authored-by: kamtschatka <simon.schatka@gmx.at>
* feature(crawler): Allow increasing crawler concurrency and configuring the storing of images and screenshots (MohamedBassem, 2024-04-26; 1 file, -0/+13)
* fix(crawler): Better extraction for amazon images (MohamedBassem, 2024-04-23; 1 file, -0/+2)
* fix(workers): Set a modern user agent and update the default viewport size (MohamedBassem, 2024-04-23; 1 file, -0/+7)
* feature: Allow recrawling bookmarks without running inference jobs (MohamedBassem, 2024-04-20; 1 file, -7/+29)
* feature: Download images and screenshots (MohamedBassem, 2024-04-20; 1 file, -28/+130)
* feature: Recrawl failed links from admin UI (#95) (Ahmad Mujahid, 2024-04-11; 1 file, -0/+20)
    - feature: Retry failed crawling URLs
    - fix: Enhancing visuals and some minor changes
* fix: Increase default navigation timeout to 30s, make it configurable, and add retries to crawling jobs (MohamedBassem, 2024-04-11; 1 file, -1/+1)
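The navigation-timeout commit pairs a configurable limit with job retries. A minimal retry-helper sketch; the environment variable name and attempt count below are illustrative, not the project's actual configuration:

```typescript
// Illustrative: read a timeout from an env var with a 30s default,
// as the commit message describes (the variable name is an assumption).
const NAV_TIMEOUT_MS =
  Number(process.env.CRAWLER_NAVIGATION_TIMEOUT_SEC ?? "30") * 1000;

// Run fn up to `attempts` times, rethrowing the last error if all fail.
async function withRetries<T>(attempts: number, fn: () => Promise<T>): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}

// Example: a crawl that fails twice and succeeds on the third attempt.
let calls = 0;
withRetries(3, async () => {
  calls++;
  if (calls < 3) throw new Error("navigation timed out");
  return "crawled";
}).then((r) => console.log(r, calls)); // crawled 3
```

Retrying at the job level (rather than inside the page navigation) lets a fresh browser context handle transient failures like slow DNS or a wedged tab.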
* fix(crawler): Skip validating URLs in metascrapper as it was already being validated. Fixes #22 (MohamedBassem, 2024-04-09; 1 file, -0/+3)
* fix(workers): Increase default timeout to 60s, make it configurable, and improve logging (MohamedBassem, 2024-04-06; 1 file, -11/+21)