aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorMohamedBassem <me@mbassem.com>2024-04-09 22:53:59 +0100
committerMohamedBassem <me@mbassem.com>2024-04-09 22:54:34 +0100
commit2806701318dff77b10a5574d4b26ef6032f6b9bc (patch)
tree4655587d97249f8b4729ab288d9924a3e8491942
parenta9242a56d909a61ba6d51e531763294edb6f049c (diff)
downloadkarakeep-2806701318dff77b10a5574d4b26ef6032f6b9bc.tar.zst
feature(inference): Upgrade the default vision model to the new gpt-4-turbo
-rw-r--r--apps/workers/inference.ts1
-rw-r--r--docs/docs/03-configuration.md16
-rw-r--r--docs/docs/06-openai.md2
-rw-r--r--packages/shared/config.ts2
4 files changed, 11 insertions, 10 deletions
diff --git a/apps/workers/inference.ts b/apps/workers/inference.ts
index 13b10aba..fa83140f 100644
--- a/apps/workers/inference.ts
+++ b/apps/workers/inference.ts
@@ -62,6 +62,7 @@ class OpenAIInferenceClient implements InferenceClient {
): Promise<InferenceResponse> {
const chatCompletion = await this.openAI.chat.completions.create({
model: serverConfig.inference.imageModel,
+ response_format: { type: "json_object" },
messages: [
{
role: "user",
diff --git a/docs/docs/03-configuration.md b/docs/docs/03-configuration.md
index 1307bcfd..5bf1612c 100644
--- a/docs/docs/03-configuration.md
+++ b/docs/docs/03-configuration.md
@@ -26,14 +26,14 @@ Either `OPENAI_API_KEY` or `OLLAMA_BASE_URL` need to be set for automatic taggin
- Running local models is a recent addition and not as battle tested as using OpenAI, so proceed with care (and potentially expect a bunch of inference failures).
:::
-| Name | Required | Default | Description |
-| --------------------- | -------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| OPENAI_API_KEY | No | Not set | The OpenAI key used for automatic tagging. More on that in [here](/openai). |
-| OPENAI_BASE_URL | No | Not set | If you just want to use OpenAI you don't need to pass this variable. If, however, you want to use some other openai compatible API (e.g. azure openai service), set this to the url of the API. |
-| OLLAMA_BASE_URL | No | Not set | If you want to use ollama for local inference, set the address of ollama API here. |
-| INFERENCE_TEXT_MODEL | No | gpt-3.5-turbo-0125 | The model to use for text inference. You'll need to change this to some other model if you're using ollama. |
-| INFERENCE_IMAGE_MODEL | No | gpt-4-vision-preview | The model to use for image inference. You'll need to change this to some other model if you're using ollama and that model needs to support vision APIs (e.g. llava). |
-| INFERENCE_LANG | No | english | The language in which the tags will be generated. |
+| Name | Required | Default | Description |
+| --------------------- | -------- | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| OPENAI_API_KEY | No | Not set | The OpenAI key used for automatic tagging. More on that in [here](/openai). |
+| OPENAI_BASE_URL | No | Not set | If you just want to use OpenAI you don't need to pass this variable. If, however, you want to use some other openai compatible API (e.g. azure openai service), set this to the url of the API. |
+| OLLAMA_BASE_URL | No | Not set | If you want to use ollama for local inference, set the address of ollama API here. |
+| INFERENCE_TEXT_MODEL | No | gpt-3.5-turbo-0125 | The model to use for text inference. You'll need to change this to some other model if you're using ollama. |
+| INFERENCE_IMAGE_MODEL | No | gpt-4-turbo | The model to use for image inference. You'll need to change this to some other model if you're using ollama and that model needs to support vision APIs (e.g. llava). |
+| INFERENCE_LANG | No | english | The language in which the tags will be generated. |
## Crawler Configs
diff --git a/docs/docs/06-openai.md b/docs/docs/06-openai.md
index 91e37c07..fa2a83ef 100644
--- a/docs/docs/06-openai.md
+++ b/docs/docs/06-openai.md
@@ -8,4 +8,4 @@ For text tagging, we use the `gpt-3.5-turbo-0125` model. This model is [extremel
## Image Tagging
-For image uploads, we use the `gpt-4-vision-preview` model for extracting tags from the image. You can learn more about the costs of using this model [here](https://platform.openai.com/docs/guides/vision/calculating-costs). To lower the costs, we're using the low resolution mode (fixed number of tokens regardless of image size). The gpt-4 model, however, is much more expensive than the `gpt-3.5-turbo`. Currently, we're using around 350 token per image inference which ends up costing around $0.01 per inference. So around 10x more expensive than the text tagging.
+For image uploads, we use the `gpt-4-turbo` model for extracting tags from the image. You can learn more about the costs of using this model [here](https://platform.openai.com/docs/guides/vision/calculating-costs). To lower the costs, we're using the low resolution mode (fixed number of tokens regardless of image size). The gpt-4 model, however, is much more expensive than the `gpt-3.5-turbo`. Currently, we're using around 350 tokens per image inference, which ends up costing around $0.01 per inference. So around 10x more expensive than the text tagging.
diff --git a/packages/shared/config.ts b/packages/shared/config.ts
index 75274a4e..4e444908 100644
--- a/packages/shared/config.ts
+++ b/packages/shared/config.ts
@@ -14,7 +14,7 @@ const allEnv = z.object({
OPENAI_BASE_URL: z.string().url().optional(),
OLLAMA_BASE_URL: z.string().url().optional(),
INFERENCE_TEXT_MODEL: z.string().default("gpt-3.5-turbo-0125"),
- INFERENCE_IMAGE_MODEL: z.string().default("gpt-4-vision-preview"),
+ INFERENCE_IMAGE_MODEL: z.string().default("gpt-4-turbo"),
REDIS_HOST: z.string().default("localhost"),
REDIS_PORT: z.coerce.number().default(6379),
REDIS_DB_IDX: z.coerce.number().optional(),