karakeep commit log
Commit message Author Files +/-
fix(workers): Shutdown workers on SIGTERM MohamedBassem 2 -0/+9
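The SIGTERM fix above amounts to registering a signal handler that drains in-flight work before exiting, so container stops do not kill jobs mid-flight. A minimal sketch, where `Worker`, `stop`, and `registerShutdown` are illustrative names rather than hoarder's actual API:

```typescript
// Illustrative worker shape: anything that can finish its current job and stop.
interface Worker {
  stop(): Promise<void>;
}

// Registers a SIGTERM handler; returns the shutdown function so callers
// (and tests) can also invoke it directly.
function registerShutdown(
  workers: Worker[],
  exit: (code: number) => void = process.exit,
): () => Promise<void> {
  const shutdown = async () => {
    // Stop accepting new jobs and wait for in-flight ones to drain.
    await Promise.all(workers.map((w) => w.stop()));
    exit(0);
  };
  process.on("SIGTERM", shutdown);
  return shutdown;
}
```

Without such a handler, Node exits on SIGTERM immediately, which is why Docker `stop` would previously cut workers off mid-job.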
fix: async/await issues with the new queue (#319) kamtschatka 6 -25/+27
refactor: Replace the usage of bullMQ with the hoarder sqlite-based queue (#309) Mohamed Bassem 13 -344/+128
fix: monolith not embedding SVG files correctly. Fixes #289 (#306) kamtschatka 1 -5/+2
    passing in the URL of the page to have the proper URL for resolving relative paths
refactor: added the bookmark type to the database (#256) kamtschatka 27 -120/+1266
    * refactoring asset types
      Extracted out functions to silently delete assets and to update them after crawling
      Generalized the mapping of assets to bookmark fields to make extending them easier
    * Added the bookmark type to the database
      Introduced an enum to have better type safety
      cleaned up the code and based some code on the type directly
    * add BookmarkType.UNKNWON
    * lint and remove unused function
    ---------
    Co-authored-by: MohamedBassem <me@mbassem.com>
refactor: remove redundant code from crawler worker and refactor handling of… kamtschatka 3 -65/+80
    * refactoring asset types
      Extracted out functions to silently delete assets and to update them after crawling
      Generalized the mapping of assets to bookmark fields to make extending them easier
    * revert silentDeleteAsset and hide better-sqlite3
    ---------
    Co-authored-by: MohamedBassem <me@mbassem.com>
feature: Automatically transfer image urls into bookmarked assets. Fixes #246 MohamedBassem 2 -9/+23
refactor: extract assets into their own database table. #215 (#220) kamtschatka 6 -52/+1271
    * Allow downloading more content from a webpage and index it #215
      added a new table that contains the information about assets for link bookmarks
      created migration code that transfers the existing data into the new table
    * Allow downloading more content from a webpage and index it #215
      removed the old asset columns from the database
      updated the UI to use the data from the linkBookmarkAssets array
    * generalize the assets table to not be linked in particular to links
    * fix migrations post merge
    * fix missing asset ids in the getBookmarks call
    ---------
    Co-authored-by: MohamedBassem <me@mbassem.com>
feature: add support for PDF links. Fixes #28 (#216) kamtschatka 10 -93/+1263
    * feature request: pdf support #28
      Added a new sourceUrl column to the asset bookmarks
      Added transforming a link bookmark pointing at a pdf to an asset bookmark
      made sure the "View Original" link is also shown for asset bookmarks that have a sourceUrl
      updated gitignore for IDEA
    * remove pdf parsing from the crawler
    * extract the http logic into its own function to avoid duplicating the post-processing actions (openai/index)
    * Add 5s timeout to the content type fetch
    ---------
    Co-authored-by: MohamedBassem <me@mbassem.com>
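The PDF-link commit above detects PDF targets by fetching the link's Content-Type with a 5s cap before deciding how to process the bookmark. A minimal sketch of that idea, assuming Node 18+ with the built-in `fetch`; the function and constant names are illustrative, not the project's actual API:

```typescript
// Illustrative sketch: probe a URL's Content-Type with a 5s timeout.
const CONTENT_TYPE_TIMEOUT_MS = 5_000;

async function getContentType(url: string): Promise<string | null> {
  try {
    // AbortSignal.timeout (Node 17.3+) aborts the request if the server hangs.
    const resp = await fetch(url, {
      method: "HEAD",
      signal: AbortSignal.timeout(CONTENT_TYPE_TIMEOUT_MS),
    });
    return resp.headers.get("content-type");
  } catch {
    // On timeout or network error, fall back to treating the link as a page.
    return null;
  }
}

function isPdf(contentType: string | null): boolean {
  // Content-Type may carry parameters, e.g. "application/pdf; charset=binary".
  return contentType !== null && contentType.includes("application/pdf");
}
```

A crawler could then route `isPdf(...)` hits to asset-bookmark handling and everything else to the normal page pipeline.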
fix: Trigger search re-index on bookmark tag manual updates. Fixes #208 (#210) kamtschatka 6 -55/+41
    * re-index of database is not scanning all places when bookmark tags are changed. Manual indexing is working as a workaround #208
      introduced a new function to trigger a reindex to reduce copy/paste
      added missing reindexes when tags are deleted/bookmarks are updated
    * give functions a bit more descriptive name
    ---------
    Co-authored-by: kamtschatka <simon.schatka@gmx.at>
    Co-authored-by: MohamedBassem <me@mbassem.com>
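The re-index fix above centralizes the enqueue call into one helper that every mutation path (tag added, tag deleted, bookmark updated) goes through. A minimal sketch of that shape, with an in-memory array standing in for the project's real worker queue and all names hypothetical:

```typescript
// Stand-in for the real search-indexing job queue.
const searchIndexQueue: { bookmarkId: string }[] = [];

// One shared helper, so every mutation path triggers a re-index the same
// way instead of copy/pasting the enqueue call at each call site.
function triggerSearchReindex(bookmarkId: string): void {
  searchIndexQueue.push({ bookmarkId });
}

// Example mutation path: deleting a tag must also re-index the bookmark,
// otherwise search keeps returning it for the removed tag.
function deleteTag(bookmarkId: string, tag: string): void {
  // ... delete the tag row from the database (elided) ...
  triggerSearchReindex(bookmarkId);
}
```

The bug class this fixes is exactly the one in the commit: a mutation path that forgets the enqueue, leaving the search index stale until a manual re-index.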
fix(crawler): Only update the database if full page archival is enabled MohamedBassem 1 -19/+19
feature: Full page archival with monolith. Fixes #132 MohamedBassem 14 -7/+1259
feature(crawler): Allow connecting to browser's websocket address and launching… MohamedBassem 3 -36/+70
feature: Take full page screenshots #143 (#148) kamtschatka 4 -3/+9
    Added the fullPage flag to take full screen screenshots
    updated the UI accordingly to properly show the screenshots instead of scaling them down
    Co-authored-by: kamtschatka <simon.schatka@gmx.at>
feature(crawler): Allow increasing crawler concurrency and configure storing… MohamedBassem 3 -4/+26
fix(crawler): Better extraction for amazon images MohamedBassem 3 -0/+20
fix(workers): Set a modern user agent and update the default viewport size MohamedBassem 1 -0/+7
feature: Allow recrawling bookmarks without running inference jobs MohamedBassem 4 -9/+46
feature: Download images and screenshots MohamedBassem 22 -135/+1373
feature: Recrawl failed links from admin UI (#95) Ahmad Mujahid 8 -25/+1067
    * feature: Retry failed crawling URLs
    * fix: Enhancing visuals and some minor changes.
fix: Increase default navigation timeout to 30s, make it configurable and add… MohamedBassem 5 -6/+17
fix(crawler): Skip validating URLs in metascraper as it was already being… MohamedBassem 1 -0/+3
fix(workers): Increase default timeout to 60s, make it configurable and improve… MohamedBassem 3 -11/+29
fix(workers): Add a timeout to the crawling job to prevent it from getting… MohamedBassem 2 -1/+18
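Adding a timeout to the crawling job, as in the commit above, is typically done by racing the job against a timer so a hung page load cannot wedge a worker forever. A generic sketch of that pattern (the `withTimeout` name is mine, not necessarily the project's):

```typescript
// Race a job promise against a timer; whichever settles first wins.
function withTimeout<T>(job: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job timed out after ${ms}ms`)),
      ms,
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([job, timeout]).finally(() => clearTimeout(timer));
}
```

A crawl worker would wrap each job in `withTimeout(crawlPage(url), CRAWL_TIMEOUT_MS)` and treat the rejection as a failed crawl rather than a stuck one.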
chore(workers): Remove unused configuration options MohamedBassem 2 -6/+0
format: Add missing lint and format, and format the entire repo MohamedBassem 57 -192/+255
refactor: Validate env variables using zod MohamedBassem 7 -46/+91
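The commit above uses zod for this; here is a dependency-free sketch of the same fail-fast idea: validate the environment once at startup and crash with a clear message instead of failing later mid-request. The variable names (`API_URL`, `CRAWLER_CONCURRENCY`) are hypothetical, not the project's actual configuration keys:

```typescript
// Hypothetical config shape; the real project validates with zod schemas.
interface ServerConfig {
  apiUrl: string;
  crawlerConcurrency: number;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const apiUrl = env.API_URL;
  if (!apiUrl) throw new Error("API_URL is required");

  const concurrency = Number(env.CRAWLER_CONCURRENCY ?? "1");
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error("CRAWLER_CONCURRENCY must be a positive integer");
  }
  return { apiUrl, crawlerConcurrency: concurrency };
}
```

Calling `loadConfig(process.env)` once at boot gives every later read a typed, already-validated object, which is the main payoff of the zod refactor.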
docker: Use external chrome docker container MohamedBassem 8 -33/+61
fix(workers): Fix the leaky browser instances in workers during development MohamedBassem 3 -29/+46
fix: Simple validations for crawled URLs MohamedBassem 1 -1/+17
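"Simple validations for crawled URLs" usually means parsing the URL and rejecting anything that is not plain http(s), so the crawler never follows `file://` or `javascript:` links. A sketch under that assumption (the function name is illustrative):

```typescript
// Reject unparsable URLs and non-http(s) schemes before handing a
// bookmark's link to the crawler.
function isCrawlableUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw); // throws on anything that is not a valid absolute URL
  } catch {
    return false;
  }
  return url.protocol === "http:" || url.protocol === "https:";
}
```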
structure: Create apps dir and copy tooling dir from t3-turbo repo MohamedBassem 396 -9511/+10350
feature: Store html content of links in the database MohamedBassem 6 -0/+818
fix: Use puppeteer adblocker to block cookies notices MohamedBassem 3 -0/+120
feature: Store full link content and index them MohamedBassem 9 -1/+878
feature: Add full text search support MohamedBassem 17 -12/+440
db: Migrate from prisma to drizzle MohamedBassem 41 -975/+2177
branding: Rename app to Hoarder MohamedBassem 21 -165/+164
build: Fix docker images MohamedBassem 7 -20/+34
fix: Let the crawler wait a bit more for page load MohamedBassem 3 -3/+18
fix: Harden puppeteer against browser disconnections and exceptions MohamedBassem 3 -16/+44
feature: Add ability to refresh bookmark details MohamedBassem 5 -4/+76
fix: Fix build for workers package and add it to CI MohamedBassem 8 -70/+106
[feature] Use puppeteer for fetching websites MohamedBassem 3 -18/+998
[chore] Linting and formatting tweaking MohamedBassem 24 -67/+157
[refactor] Extract the bookmark model to be a high level model to support other… MohamedBassem 22 -308/+396
[refactor] Move the different packages to the package subdir MohamedBassem 128 -2716/+2713
[feature] Add openAI integration for extracting tags from articles MohamedBassem 9 -19/+239
[refactor] Rename the crawlers package to workers MohamedBassem 8 -126/+126
Implement metadata fetching logic in the crawler MohamedBassem 29 -264/+439
Init package and start bullmq workers MohamedBassem 12 -8/+91