feature(crawler): Allow connecting to browser's websocket address and launching the browser on demand. This enables support for browserless

author: MohamedBassem <me@mbassem.com> 2024-05-15 08:08:38 +0100
committer: MohamedBassem <me@mbassem.com> 2024-05-15 08:14:16 +0100
commit: 39025a83e041347a4c8206704e7dc2cd1e0cadd5 (patch)
tree: 53c26b0655757bdc5b5ac94ba48d24d578dc47de /docs
parent: f64a5f3237c41b600f7047c477fbf9e79eae4297 (diff)
download: karakeep-39025a83e041347a4c8206704e7dc2cd1e0cadd5.tar.zst
1 files changed, 11 insertions, 8 deletions
diff --git a/docs/docs/03-configuration.md b/docs/docs/03-configuration.md
index 83546ec8..08405a0f 100644
--- a/docs/docs/03-configuration.md
+++ b/docs/docs/03-configuration.md
@@ -38,11 +38,14 @@ Either `OPENAI_API_KEY` or `OLLAMA_BASE_URL` need to be set for automatic taggin
 
 ## Crawler Configs
 
-| Name                          | Required | Default | Description                                                                                                                                                                                                                |
-| ----------------------------- | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| CRAWLER_NUM_WORKERS           | No       | 1       | Number of allowed concurrent crawling jobs. By default, we're only doing one crawling request at a time to avoid consuming a lot of resources.                                                                             |
-| CRAWLER_DOWNLOAD_BANNER_IMAGE | No       | true    | Whether to cache the banner image used in the cards locally or fetch it each time directly from the website. Caching it consumes more storage space, but is more resilient against link rot and rate limits from websites. |
-| CRAWLER_STORE_SCREENSHOT      | No       | true    | Whether to store a screenshot from the crawled website or not. Screenshots act as a fallback for when we fail to extract an image from a website. You can also view the stored screenshots for any link.                   |
-| CRAWLER_FULL_PAGE_SCREENSHOT  | No       | false   | Whether to store a screenshot of the full page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, the screenshot will only include the visible part of the page                           |
-| CRAWLER_JOB_TIMEOUT_SEC       | No       | 60      | How long to wait for the crawler job to finish before timing out. If you have a slow internet connection or a low powered device, you might want to bump this up a bit                                                     |
-| CRAWLER_NAVIGATE_TIMEOUT_SEC  | No       | 30      | How long to spend navigating to the page (along with its redirects). Increase this if you have a slow internet connection                                                                                                  |
+| Name                          | Required | Default | Description                                                                                                                                                                                                                                                                                                                                                                        |
+| ----------------------------- | -------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| CRAWLER_NUM_WORKERS           | No       | 1       | Number of allowed concurrent crawling jobs. By default, we're only doing one crawling request at a time to avoid consuming a lot of resources.                                                                                                                                                                                                                                     |
+| BROWSER_WEB_URL               | No       | Not set | The browser's HTTP debugging address. The worker will talk to this endpoint to resolve the debugging console's websocket address. If you already have the websocket address, use `BROWSER_WEBSOCKET_URL` instead. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` is set, the worker will launch its own browser instance (assuming it has access to the Chrome binary).  |
+| BROWSER_WEBSOCKET_URL         | No       | Not set | The websocket address of the browser's debugging console. If you want to use [browserless](https://browserless.io), use their websocket address here. If neither `BROWSER_WEB_URL` nor `BROWSER_WEBSOCKET_URL` is set, the worker will launch its own browser instance (assuming it has access to the Chrome binary).                                                              |
+| BROWSER_CONNECT_ONDEMAND      | No       | false   | If set to false, the crawler will proactively connect to the browser instance and always maintain an active connection. If set to true, the browser will be launched on demand only whenever a crawl is requested. Set to true if you're using a service that provides you with browser instances on demand.                                                                       |
+| CRAWLER_DOWNLOAD_BANNER_IMAGE | No       | true    | Whether to cache the banner image used in the cards locally or fetch it each time directly from the website. Caching it consumes more storage space, but is more resilient against link rot and rate limits from websites.                                                                                                                                                         |
+| CRAWLER_STORE_SCREENSHOT      | No       | true    | Whether to store a screenshot from the crawled website or not. Screenshots act as a fallback for when we fail to extract an image from a website. You can also view the stored screenshots for any link.                                                                                                                                                                           |
+| CRAWLER_FULL_PAGE_SCREENSHOT  | No       | false   | Whether to store a screenshot of the full page or not. Disabled by default, as it can lead to much higher disk usage. If disabled, the screenshot will only include the visible part of the page                                                                                                                                                                                   |
+| CRAWLER_JOB_TIMEOUT_SEC       | No       | 60      | How long to wait for the crawler job to finish before timing out. If you have a slow internet connection or a low powered device, you might want to bump this up a bit                                                                                                                                                                                                             |
+| CRAWLER_NAVIGATE_TIMEOUT_SEC  | No       | 30      | How long to spend navigating to the page (along with its redirects). Increase this if you have a slow internet connection                                                                                                                                                                                                                                                          |
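
The three browser settings added in this patch interact, so a small sketch of the selection logic may help when reading the table. This is not the worker's actual code: `BrowserPlan` and `resolve_browser_plan` are hypothetical names, the parsing of `BROWSER_CONNECT_ONDEMAND` as the literal string `true` is an assumption, and so is the choice to prefer `BROWSER_WEBSOCKET_URL` when both URLs are set (the docs only describe the case where neither is set).

```python
import os
from typing import Mapping, NamedTuple, Optional


class BrowserPlan(NamedTuple):
    """Hypothetical summary of how the crawler would obtain a browser."""
    mode: str                 # "websocket", "http" or "launch"
    endpoint: Optional[str]   # address to connect to, None when launching locally
    on_demand: bool           # True: connect/launch only when a crawl is requested


def resolve_browser_plan(env: Mapping[str, str]) -> BrowserPlan:
    # Treat empty strings the same as unset variables.
    ws_url = env.get("BROWSER_WEBSOCKET_URL") or None
    web_url = env.get("BROWSER_WEB_URL") or None
    # Assumption: only the string "true" (case-insensitive) enables on-demand mode.
    on_demand = env.get("BROWSER_CONNECT_ONDEMAND", "false").strip().lower() == "true"

    if ws_url:
        # Assumption: an explicit websocket address wins if both URLs are set.
        return BrowserPlan("websocket", ws_url, on_demand)
    if web_url:
        # The HTTP debugging address is used to resolve the websocket address.
        return BrowserPlan("http", web_url, on_demand)
    # Neither URL is set: the worker launches its own browser instance.
    return BrowserPlan("launch", None, on_demand)


if __name__ == "__main__":
    print(resolve_browser_plan(os.environ))
```

For a service that hands out browser instances per request, the table suggests setting `BROWSER_WEBSOCKET_URL` to the service's websocket address together with `BROWSER_CONNECT_ONDEMAND=true`.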