author     Mohamed Bassem <me@mbassem.com>    2024-12-22 11:44:18 +0000
committer  Mohamed Bassem <me@mbassem.com>    2024-12-22 11:44:18 +0000
commit     8732056fdd0444459829942735a74405dbc4725f (patch)
tree       26e8c1685bf526dbe8c8867ca6da68798a93be89 /docs
parent     e3b8cdab187efc17465df97a21b7997f71912860 (diff)
docs: Add minimal installation docs, and fix other docs
Diffstat (limited to 'docs')
-rw-r--r--  docs/docs/02-Installation/05-pikapods.md          4
-rw-r--r--  docs/docs/02-Installation/07-minimal-install.md  49
-rw-r--r--  docs/docs/03-configuration.md                    35
3 files changed, 71 insertions, 17 deletions
diff --git a/docs/docs/02-Installation/05-pikapods.md b/docs/docs/02-Installation/05-pikapods.md
index aeddb5d4..f954645a 100644
--- a/docs/docs/02-Installation/05-pikapods.md
+++ b/docs/docs/02-Installation/05-pikapods.md
@@ -1,5 +1,9 @@
# PikaPods [Paid Hosting]
+:::info
+PikaPods shares a portion of the revenue it earns from hosting Hoarder with the maintainer of this project.
+:::
+
[PikaPods](https://www.pikapods.com/) offers managed paid hosting for many open source apps, including Hoarder.
Server administration, updates, migrations and backups are all taken care of, which makes it well suited
for less technical users. As of Nov 2024, running Hoarder there will cost you ~$3 per month.
diff --git a/docs/docs/02-Installation/07-minimal-install.md b/docs/docs/02-Installation/07-minimal-install.md
new file mode 100644
index 00000000..147c1621
--- /dev/null
+++ b/docs/docs/02-Installation/07-minimal-install.md
@@ -0,0 +1,49 @@
+# Minimal Installation
+
+:::warning
+Unless you have to, prefer the [full installation](/Installation/docker) to get all of Hoarder's features. You'll be sacrificing a lot of functionality if you go the minimal installation route.
+:::
+
+Hoarder's default installation depends on Meilisearch for full text search, Chrome for crawling, and OpenAI/Ollama for AI tagging. You can, however, run Hoarder without those dependencies if you're willing to give up the corresponding features.
+
+- If you run without Meilisearch, the search functionality will be completely disabled.
+- If you run without Chrome, crawling will still work, but you'll lose the ability to take screenshots of websites, and pages that rely on JavaScript to render won't get crawled correctly.
+- If you don't set up OpenAI/Ollama, AI tagging will be disabled.
+
+Those features are important for leveraging Hoarder's full potential, but if you're running in a constrained environment, you can use the following minimal Docker Compose file to skip all of those dependencies:
+
+```yaml
+services:
+  web:
+    image: ghcr.io/hoarder-app/hoarder:release
+    restart: unless-stopped
+    volumes:
+      - data:/data
+    ports:
+      - 3000:3000
+    environment:
+      DATA_DIR: /data
+      NEXTAUTH_SECRET: super_random_string
+volumes:
+  data:
+```
+
+Or run it with just the following `docker` command:
+
+```bash
+docker run -d \
+  --restart unless-stopped \
+  -v data:/data \
+  -p 3000:3000 \
+  -e DATA_DIR=/data \
+  -e NEXTAUTH_SECRET=super_random_string \
+  ghcr.io/hoarder-app/hoarder:release
+```
+
+:::warning
+You **MUST** change `super_random_string` to a truly random string, which you can generate with `openssl rand -hex 32`.
+:::
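+
+If you'd rather not hardcode the secret in the compose file: Docker Compose substitutes `${VARIABLE}` references from an `.env` file placed next to the compose file. Here is a minimal sketch of that approach (the `.env` mechanism is standard Compose behaviour, not anything Hoarder-specific):
+
+```yaml
+# docker-compose.yml (only the relevant parts shown)
+services:
+  web:
+    image: ghcr.io/hoarder-app/hoarder:release
+    environment:
+      DATA_DIR: /data
+      # Compose reads this value from an adjacent .env file containing a line like
+      # NEXTAUTH_SECRET=<output of `openssl rand -hex 32`>
+      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
+```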
+
+Check the [configuration docs](/configuration) for extra features you can enable, such as full page archival, full page screenshots, inference languages, etc.
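+
+As one example, if you later get access to a Chrome instance with remote debugging enabled, you can restore screenshots and JavaScript-aware crawling by pointing the minimal setup at it. Below is a sketch using the `BROWSER_WEB_URL` and `CRAWLER_FULL_PAGE_SCREENSHOT` options from the configuration docs; the `http://chrome:9222` address is a placeholder for wherever your browser's debugging endpoint is reachable:
+
+```yaml
+services:
+  web:
+    image: ghcr.io/hoarder-app/hoarder:release
+    environment:
+      DATA_DIR: /data
+      NEXTAUTH_SECRET: super_random_string
+      # HTTP debugging address of a Chrome instance (placeholder address)
+      BROWSER_WEB_URL: http://chrome:9222
+      # Optional: capture the whole page instead of only the visible part
+      CRAWLER_FULL_PAGE_SCREENSHOT: "true"
+```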
+
+
diff --git a/docs/docs/03-configuration.md b/docs/docs/03-configuration.md
index a5720092..47c3227f 100644
--- a/docs/docs/03-configuration.md
+++ b/docs/docs/03-configuration.md
@@ -61,28 +61,29 @@ Either `OPENAI_API_KEY` or `OLLAMA_BASE_URL` need to be set for automatic taggin
| INFERENCE_JOB_TIMEOUT_SEC | No | 30 | How long to wait for the inference job to finish before timing out. If you're running ollama without powerful GPUs, you might want to increase the timeout a bit. |
:::info
+
- You can append additional instructions to the prompt used for automatic tagging, in the `AI Settings` (in the `User Settings` screen)
- You can use the placeholders `$tags`, `$aiTags`, `$userTags` in the prompt. These placeholders will be replaced with all tags, ai generated tags or human created tags when automatic tagging is performed (e.g. `[hoarder, computer, ai]`)
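+- For example, appending an instruction like "Prefer reusing one of the existing tags: $userTags" will have `$userTags` expanded to the human-created tags (e.g. `[hoarder, computer, ai]`) before the prompt is sent to the model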
-:::
+ :::
## Crawler Configs
-| Name | Required | Default | Description |
-| ---------------------------------- | -------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| CRAWLER_NUM_WORKERS | No | 1 | Number of allowed concurrent crawling jobs. By default, we're only doing one crawling request at a time to avoid consuming a lot of resources. |
-| BROWSER_WEB_URL | No | Not set | The browser's http debugging address. The worker will talk to this endpoint to resolve the debugging console's websocket address. If you already have the websocket address, use `BROWSER_WEBSOCKET_URL` instead. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` are set, the worker will launch its own browser instance (assuming it has access to the chrome binary). |
-| BROWSER_WEBSOCKET_URL | No | Not set | The websocket address of browser's debugging console. If you want to use [browserless](https://browserless.io), use their websocket address here. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` are set, the worker will launch its own browser instance (assuming it has access to the chrome binary). |
-| BROWSER_CONNECT_ONDEMAND | No | false | If set to false, the crawler will proactively connect to the browser instance and always maintain an active connection. If set to true, the browser will be launched on demand only whenever a crawling is requested. Set to true if you're using a service that provides you with browser instances on demand. |
-| CRAWLER_DOWNLOAD_BANNER_IMAGE | No | true | Whether to cache the banner image used in the cards locally or fetch it each time directly from the website. Caching it consumes more storage space, but is more resilient against link rot and rate limits from websites. |
-| CRAWLER_STORE_SCREENSHOT | No | true | Whether to store a screenshot from the crawled website or not. Screenshots act as a fallback for when we fail to extract an image from a website. You can also view the stored screenshots for any link. |
-| CRAWLER_FULL_PAGE_SCREENSHOT | No | false | Whether to store a screenshot of the full page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, the screenshot will only include the visible part of the page |
-| CRAWLER_FULL_PAGE_ARCHIVE | No | false | Whether to store a full local copy of the page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, only the readable text of the page is archived. |
-| CRAWLER_JOB_TIMEOUT_SEC | No | 60 | How long to wait for the crawler job to finish before timing out. If you have a slow internet connection or a low powered device, you might want to bump this up a bit |
-| CRAWLER_NAVIGATE_TIMEOUT_SEC | No | 30 | How long to spend navigating to the page (along with its redirects). Increase this if you have a slow internet connection |
-| CRAWLER_VIDEO_DOWNLOAD | No | false | Whether to download videos from the page or not (using yt-dlp) |
-| CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE | No | 50 | The maximum file size for the downloaded video. The quality will be chosen accordingly. Use -1 to disable the limit. |
-| CRAWLER_VIDEO_DOWNLOAD_TIMEOUT_SEC | No | 600 | How long to wait for the video download to finish |
-| CRAWLER_ENABLE_ADBLOCKER | No | true | Whether to enable an adblocker in the crawler or not. If you're facing troubles downloading the adblocking lists on worker startup, you can disable this. |
+| Name | Required | Default | Description |
+| ---------------------------------- | -------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| CRAWLER_NUM_WORKERS | No | 1 | Number of allowed concurrent crawling jobs. By default, we're only doing one crawling request at a time to avoid consuming a lot of resources. |
+| BROWSER_WEB_URL                    | No       | Not set | The browser's HTTP debugging address. The worker will talk to this endpoint to resolve the debugging console's websocket address. If you already have the websocket address, use `BROWSER_WEBSOCKET_URL` instead. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` is set, the worker will fall back to plain HTTP requests, skipping screenshots and JavaScript execution. |
+| BROWSER_WEBSOCKET_URL              | No       | Not set | The websocket address of the browser's debugging console. If you want to use [browserless](https://browserless.io), use their websocket address here. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` is set, the worker will fall back to plain HTTP requests, skipping screenshots and JavaScript execution. |
+| BROWSER_CONNECT_ONDEMAND | No | false | If set to false, the crawler will proactively connect to the browser instance and always maintain an active connection. If set to true, the browser will be launched on demand only whenever a crawling is requested. Set to true if you're using a service that provides you with browser instances on demand. |
+| CRAWLER_DOWNLOAD_BANNER_IMAGE | No | true | Whether to cache the banner image used in the cards locally or fetch it each time directly from the website. Caching it consumes more storage space, but is more resilient against link rot and rate limits from websites. |
+| CRAWLER_STORE_SCREENSHOT | No | true | Whether to store a screenshot from the crawled website or not. Screenshots act as a fallback for when we fail to extract an image from a website. You can also view the stored screenshots for any link. |
+| CRAWLER_FULL_PAGE_SCREENSHOT | No | false | Whether to store a screenshot of the full page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, the screenshot will only include the visible part of the page |
+| CRAWLER_FULL_PAGE_ARCHIVE | No | false | Whether to store a full local copy of the page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, only the readable text of the page is archived. |
+| CRAWLER_JOB_TIMEOUT_SEC | No | 60 | How long to wait for the crawler job to finish before timing out. If you have a slow internet connection or a low powered device, you might want to bump this up a bit |
+| CRAWLER_NAVIGATE_TIMEOUT_SEC | No | 30 | How long to spend navigating to the page (along with its redirects). Increase this if you have a slow internet connection |
+| CRAWLER_VIDEO_DOWNLOAD | No | false | Whether to download videos from the page or not (using yt-dlp) |
+| CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE | No | 50 | The maximum file size for the downloaded video. The quality will be chosen accordingly. Use -1 to disable the limit. |
+| CRAWLER_VIDEO_DOWNLOAD_TIMEOUT_SEC | No | 600 | How long to wait for the video download to finish |
+| CRAWLER_ENABLE_ADBLOCKER | No | true | Whether to enable an adblocker in the crawler or not. If you're facing troubles downloading the adblocking lists on worker startup, you can disable this. |
## OCR Configs