{"id":1723,"date":"2024-09-01T21:50:02","date_gmt":"2024-09-01T12:50:02","guid":{"rendered":"https:\/\/blog.peddals.com\/?p=1723"},"modified":"2025-01-01T15:52:39","modified_gmt":"2025-01-01T06:52:39","slug":"optimize-context-size-for-ollama-server","status":"publish","type":"post","link":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/","title":{"rendered":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue"},"content":{"rendered":"\n<p>Recently, the performance of open-source and open-weight LLMs has been amazing, and for coding assistance, DeepSeek Coder V2 Lite Instruct (16B) is sufficient, while for Japanese and English chat or translation, Llama 3.1 Instruct (8B) is enough. When running Ollama from the Terminal app and chatting, the generated text and response speed are truly surprising, making it feel like you can live without the internet for a while.<\/p>\n\n\n\n<p>However, when using the same model through Dify or Visual Studio Code&#8217;s LLM extension Continue, you may notice the response speed becomes extremely slow. In this post, I will introduce a solution to this problem. 
Your problem may be caused by something else, but since it is easy to check and fix, I recommend checking the&nbsp;<strong><a href=\"#Conclusion\">Conclusion<\/a><\/strong>&nbsp;section of this post.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Confirmed Environment<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">OS and app versions:<\/h3>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism off-numbers lang-plain\"><code>macOS: 14.5\nOllama: 0.3.8\nDify: 0.6.15\nVisual Studio Code - Insiders: 1.93.0-insider\nContinue: 0.8.47<\/code><\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">LLM and size<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td>Model name<\/td><td>Model size<\/td><td>Context length<\/td><td>Ollama download command<\/td><\/tr><tr><td>llama3.1:8b-instruct-fp16<\/td><td>16 GB<\/td><td>131072<\/td><td><code>ollama pull llama3.1:8b-instruct-fp16<\/code><\/td><\/tr><tr><td>deepseek-coder-v2:16b-lite-instruct-q8_0<\/td><td>16 GB<\/td><td>163840<\/td><td><code>ollama pull deepseek-coder-v2:16b-lite-instruct-q8_0<\/code><\/td><\/tr><tr><td>deepseek-coder-v2:16b-lite-instruct-q6_K<\/td><td>14 GB<\/td><td>163840<\/td><td><code>ollama pull deepseek-coder-v2:16b-lite-instruct-q6_K<\/code><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">A Mac with 32 GB RAM can run any of these models in memory.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Check the context length and lower it.<\/p>\n\n\n\n<p>By setting &#8220;Size of context window&#8221; in Dify or Continue to a sufficiently small value, you can solve this problem. Don&#8217;t set a large value just because the model supports it or for future use; instead, start with the default value (2048) or 4096 and test chatting with a short prompt. 
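You can also verify the effect without Dify or Continue by calling Ollama's REST API directly and passing the context size per request through the num_ctx option. Below is a minimal Python sketch under the assumption that Ollama is listening on localhost:11434 with the model from the table above; the prompt text is arbitrary:

```python
import json
import urllib.request

# Request body for Ollama's /api/generate endpoint. "num_ctx" is the
# Ollama option that sets the context window allocated for this request.
payload = {
    "model": "llama3.1:8b-instruct-fp16",
    "prompt": "Say hello in five words.",
    "stream": False,
    "options": {"num_ctx": 4096},  # start small; raise it only after testing
}

def generate(base_url: str = "http://localhost:11434") -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running Ollama server):
# print(generate())
```

If the reply comes back quickly with num_ctx at 4096 but slows down dramatically at the model's maximum, the context size is the culprit.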
If you get the response you expect, congrats, the issue is resolved.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Context size: It is also called \"context window\" or \"context length.\" It represents the total number of tokens that an LLM can process in one interaction. Token count is approximately equal to word count in English and other supported languages. In the table above, Llama 3.1 has a context size of 131072, so it can handle approximately 65,536 words of text as input and output.<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Changing Context Length<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Dify <\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Open the LLM block in the studio app and click on the model name to access detailed settings.<\/li>\n\n\n\n<li>Scroll down to find &#8220;Size of cont&#8230;&#8221; (Size of context window) and uncheck it or enter 4096.<\/li>\n\n\n\n<li>The default value is 2048 when unchecked.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"525\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-1024x525.jpg\" alt=\"\" class=\"wp-image-1694\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-1024x525.jpg 1024w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-300x154.jpg 300w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-768x394.jpg 768w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-1536x787.jpg 1536w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13.jpg 1616w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Continue (VS Code LLM extension)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Open the config.json file via the gear icon in the Continue pane.<\/li>\n\n\n\n<li>Change the <code>contextLength<\/code> and 
<code>maxTokens<\/code> values to <code>4096<\/code> and <code>2048<\/code>, respectively. Note that <code>maxTokens<\/code> is the maximum number of tokens generated by the LLM, so we set it to half of the <code>contextLength<\/code> value.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"418\" height=\"192\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13.png\" alt=\"\" class=\"wp-image-1695\" style=\"width:180px;height:auto\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13.png 418w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-13-300x138.png 300w\" sizes=\"auto, (max-width: 418px) 100vw, 418px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism off-numbers lang-json\" data-file=\"\/Users\/username\/.continue\/config.json\" data-lang=\"JSON\" data-line=\"6,11\"><code>    {\n      &quot;title&quot;: &quot;Chat: llama3.1:8b-instruct-fp16&quot;,\n      &quot;provider&quot;: &quot;ollama&quot;,\n      &quot;model&quot;: &quot;llama3.1:8b-instruct-fp16&quot;,\n      &quot;apiBase&quot;: &quot;http:\/\/localhost:11434&quot;,\n      &quot;contextLength&quot;: 4096,\n      &quot;completionOptions&quot;: {\n        &quot;temperature&quot;: 0.5,\n        &quot;top_p&quot;: 0.5,\n        &quot;top_k&quot;: 40,\n        &quot;maxTokens&quot;: 2048,\n        &quot;keepAlive&quot;: 3600\n      }\n    }<\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Checking Context Length of LLM<\/h2>\n\n\n\n<p>The easiest way is to use Ollama&#8217;s command&nbsp;<code>ollama show &lt;modelname&gt;<\/code>&nbsp;to display the <code>context length<\/code>. 
Example:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism off-numbers lang-zsh\" data-lang=\"ZSH\" data-line=\"6\"><code>% ollama show llama3.1:8b-instruct-fp16\n  Model                                          \n  \tarch            \tllama \t                         \n  \tparameters      \t8.0B  \t                         \n  \tquantization    \tF16   \t                         \n  \tcontext length  \t131072\t                         \n  \tembedding length\t4096  \t                         \n  \t                                               \n  Parameters                                     \n  \tstop\t&quot;&lt;|start_header_id|&gt;&quot;\t                      \n  \tstop\t&quot;&lt;|end_header_id|&gt;&quot;  \t                      \n  \tstop\t&quot;&lt;|eot_id|&gt;&quot;         \t                      \n  \t                                               \n  License                                        \n  \tLLAMA 3.1 COMMUNITY LICENSE AGREEMENT        \t  \n  \tLlama 3.1 Version Release Date: July 23, 2024<\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Context Length in App Settings<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Dify &gt; Model Provider &gt; Ollama<\/h3>\n\n\n\n<p>When adding an Ollama model to Dify, you can override the default value of 4096 for Model context length and Upper bound for max tokens. 
Since a low upper limit here may make debugging difficult if issues arise, it&#8217;s better to set both values to the model&#8217;s context length and adjust the Size of context window in individual AI apps.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"740\" height=\"1024\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14-740x1024.png\" alt=\"\" class=\"wp-image-1702\" style=\"width:432px;height:auto\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14-740x1024.png 740w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14-217x300.png 217w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14-768x1063.png 768w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14-1110x1536.png 1110w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/08\/image-14.png 1280w\" sizes=\"auto, (max-width: 706px) 89vw, (max-width: 767px) 82vw, 740px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Continue &gt; &#8220;models&#8221;<\/h3>\n\n\n\n<p>In the &#8220;<code>models<\/code>&#8221; section of the config.json, you can add multiple entries with different context lengths by including a description like &#8220;<code>Fastest Max Size<\/code>&#8221; or &#8220;<code>4096<\/code>&#8221;. For example, I set the title to &#8220;Chat: llama3.1:8b-instruct-fp16 (Fastest Max Size)&#8221; and changed the <code>contextLength<\/code> value to <code>24576<\/code> and the <code>maxTokens<\/code> value to <code>12288<\/code>. 
This combination was the largest that I confirmed works reliably on my Mac with 32 GB RAM.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism off-numbers lang-json\" data-file=\"\/Users\/username\/.continue\/config.json\" data-lang=\"JSON\" data-line=\"2,6,11\"><code>    {\n      &quot;title&quot;: &quot;Chat: llama3.1:8b-instruct-fp16 (Fastest Max Size)&quot;,\n      &quot;provider&quot;: &quot;ollama&quot;,\n      &quot;model&quot;: &quot;llama3.1:8b-instruct-fp16&quot;,\n      &quot;apiBase&quot;: &quot;http:\/\/localhost:11434&quot;,\n      &quot;contextLength&quot;: 24576,\n      &quot;completionOptions&quot;: {\n        &quot;temperature&quot;: 0.5,\n        &quot;top_p&quot;: 0.5,\n        &quot;top_k&quot;: 40,\n        &quot;maxTokens&quot;: 12288,\n        &quot;keepAlive&quot;: 3600\n      }\n    }<\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s happening when LLM processing is slow (based on what I see)<\/h2>\n\n\n\n<p>When using\u00a0<code>ollama run<\/code>, the LLM runs quickly, but when using Ollama through Dify or Continue, it becomes slow due to the large context length. Let&#8217;s check the process with\u00a0<code>ollama ps<\/code>. Below are two examples &#8211; the first had the maximum context length of 131072 and the second had 24576:\u00a0<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism off-numbers lang-zsh\" data-lang=\"ZSH\"><code>% ollama ps\nNAME                     \tID          \tSIZE \tPROCESSOR      \tUNTIL               \nllama3.1:8b-instruct-fp16\ta8f4d8643bb2\t49 GB\t54%\/46% CPU\/GPU\t59 minutes from now\t\n\n% ollama ps\nNAME                     \tID          \tSIZE \tPROCESSOR\tUNTIL              \nllama3.1:8b-instruct-fp16\ta8f4d8643bb2\t17 GB\t100% GPU \t4 minutes from now<\/code><\/pre><\/div>\n\n\n\n<p>In the slow case, SIZE is much larger than the actual model size (16 GB), and processing occurs on CPU at 54% and GPU at 46%. 
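The extra memory is consistent with the KV cache, which grows linearly with the requested context length. Below is a rough back-of-the-envelope sketch in Python; the architecture numbers (32 layers, 8 KV heads, head dimension 128) are Llama 3.1 8B's published configuration, and a 16-bit cache is assumed, so treat the results as a lower bound on what Ollama actually reserves:

```python
def kv_cache_bytes(num_ctx: int,
                   n_layers: int = 32,   # Llama 3.1 8B transformer layers
                   n_kv_heads: int = 8,  # grouped-query-attention KV heads
                   head_dim: int = 128,  # dimension per attention head
                   bytes_per_elem: int = 2) -> int:  # 16-bit cache entries
    """Rough KV-cache size: keys plus values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * num_ctx

# 131072-token window: cache alone is as large as the model itself.
print(kv_cache_bytes(131072) / 2**30)  # → 16.0 (GiB)
# 24576-token window: small enough to fit alongside the model.
print(kv_cache_bytes(24576) / 2**30)   # → 3.0 (GiB)
```

Under these assumptions, a 131072-token window needs about 16 GiB of cache on top of the 16 GB model, which no longer fits in 32 GB of RAM, while 24576 tokens adds only about 3 GiB; compute buffers that also scale with the window would account for the rest of the gap seen in the ollama ps output.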
It seems that Ollama allocates memory for the model as if it were much larger when a large context length is passed via the API, regardless of the actual number of tokens being processed. This is only my assumption, but that is what the output above suggests.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Finding a suitable size of context length<\/h2>\n\n\n\n<p>After understanding the situation, let&#8217;s take countermeasures. If you can live with 4096 tokens, that&#8217;s fine, but I want to process as many tokens as possible. Unfortunately, I couldn&#8217;t find Ollama&#8217;s specifications, so I adjusted the context length by hand and found that a value of 24576 (4096*6) works for Llama 3.1 8B F16 and DeepSeek-Coder-V2-Lite-Instruct Q6_K.<\/p>\n\n\n\n<p>Note that using values that are not multiples of 4096 may cause character corruption, so be careful. Also, when using Dify, the SIZE value will be smaller than in Continue.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Ollama, I&#8217;m sorry (you can skip this)<\/h2>\n\n\n\n<p>I thought Ollama&#8217;s server processing was malfunctioning because the LLM ran quickly on the CLI but became slow when used through the API. However, after trying the advice &#8220;Try setting context length to 4096&#8221; from an issue discussion about Windows + GPU, I found that it actually solved the problem.<\/p>\n\n\n\n<p>Ollama, I&#8217;m sorry for doubting you!<\/p>\n\n\n\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\"><summary>Image by Stable Diffusion (Mochi Diffusion)<\/summary>\n<p>This time I wanted an image of a small bike overtaking a luxurious van or camper, but it wasn&#8217;t as easy as I thought somehow. Most of the generated images had two bikes, a bike and a van in opposite lanes, a van partially out of frame, etc. 
Only this one had a bike leading a van.<\/p>\n\n\n\n<p>Date:<br>2024-9-1 2:57:00<\/p>\n\n\n\n<p>Model:<br>realisticVision-v51VAE_original_768x512_cn<\/p>\n\n\n\n<p>Size:<br>768 x 512<\/p>\n\n\n\n<p>Include in Image:<br>A high-speed motorcycle overtaking a luxurious van<\/p>\n\n\n\n<p>Exclude from Image:<\/p>\n\n\n\n<p>Seed:<br>2448773039<\/p>\n\n\n\n<p>Steps:<br>20<\/p>\n\n\n\n<p>Guidance Scale:<br>20.0<\/p>\n\n\n\n<p>Scheduler:<br>DPM-Solver++<\/p>\n\n\n\n<p>ML Compute Unit:<br>All<\/p>\n<\/details>\n","protected":false},"excerpt":{"rendered":"<p>Recently, the performance of open-source and open-weight LLMs has been amazing, and for coding assistance, Dee &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;A solution for slow LLMs on Ollama server when accessing from Dify or Continue&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":1721,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_locale":"en_US","_original_post":"https:\/\/blog.peddals.com\/?p=1693","footnotes":""},"categories":[25,8,23],"tags":[12],"class_list":["post-1723","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","category-macos","category-troubleshooting","tag-mac","en-US"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals Blog<\/title>\n<meta name=\"description\" content=\"To achieve optimal performance, optimize the context length of LLMs\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals Blog\" \/>\n<meta property=\"og:description\" content=\"To achieve optimal performance, optimize the context length of LLMs\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/\" \/>\n<meta property=\"og:site_name\" content=\"Peddals Blog\" \/>\n<meta property=\"article:published_time\" content=\"2024-09-01T12:50:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-01-01T06:52:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"768\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Handsome\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Handsome\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/\"},\"author\":{\"name\":\"Handsome\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"headline\":\"A solution for slow LLMs on Ollama server when accessing from Dify or Continue\",\"datePublished\":\"2024-09-01T12:50:02+00:00\",\"dateModified\":\"2025-01-01T06:52:39+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/\"},\"wordCount\":867,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2024\\\/09\\\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg\",\"keywords\":[\"mac\"],\"articleSection\":[\"AI\\\/LLM\",\"macOS\",\"Troubleshooting\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/\",\"name\":\"A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2024\\\/09\\\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg\",\"datePublished\":\"2024-09-01T12:50:02+00:00\",\"dateModified\":\"2025-01-01T06:52:39+00:00\",\"description\":\"To achieve optimal performance, optimize the context length of LLMs\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#primaryimage\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2024\\\/09\\\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg\",\"contentUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2024\\\/09\\\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg\",\"width\":768,\"height\":512},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/optimize-context-size-for-ollama-server\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u30db\u30fc\u30e0\",\"item\":\"https:\\\/\\\/blog.peddals.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A solution for slow LLMs on Ollama server when accessing from Dify or 
Continue\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#website\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/\",\"name\":\"Peddals Blog\",\"description\":\"AI, LLM, Python, Mac, Pythonista3, iOS, etc. in Japanese and English\",\"publisher\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/blog.peddals.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\",\"name\":\"Handsome\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"caption\":\"Handsome\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals Blog","description":"To achieve optimal performance, optimize the context length of LLMs","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/","og_locale":"en_US","og_type":"article","og_title":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals Blog","og_description":"To achieve optimal performance, optimize the context length of LLMs","og_url":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/","og_site_name":"Peddals Blog","article_published_time":"2024-09-01T12:50:02+00:00","article_modified_time":"2025-01-01T06:52:39+00:00","og_image":[{"width":768,"height":512,"url":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg","type":"image\/jpeg"}],"author":"Handsome","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Handsome","Est. 
reading time":"18 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#article","isPartOf":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/"},"author":{"name":"Handsome","@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"headline":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue","datePublished":"2024-09-01T12:50:02+00:00","dateModified":"2025-01-01T06:52:39+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/"},"wordCount":867,"commentCount":0,"publisher":{"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"image":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg","keywords":["mac"],"articleSection":["AI\/LLM","macOS","Troubleshooting"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/","url":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/","name":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue | Peddals 
Blog","isPartOf":{"@id":"https:\/\/blog.peddals.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#primaryimage"},"image":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg","datePublished":"2024-09-01T12:50:02+00:00","dateModified":"2025-01-01T06:52:39+00:00","description":"To achieve optimal performance, optimize the context length of LLMs","breadcrumb":{"@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#primaryimage","url":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg","contentUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2024\/09\/A-high-speed-motorcycle-overtaking-a-luxurious-van-on-highway.582.2181395213.jpg","width":768,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/blog.peddals.com\/en\/optimize-context-size-for-ollama-server\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u30db\u30fc\u30e0","item":"https:\/\/blog.peddals.com\/"},{"@type":"ListItem","position":2,"name":"A solution for slow LLMs on Ollama server when accessing from Dify or Continue"}]},{"@type":"WebSite","@id":"https:\/\/blog.peddals.com\/#website","url":"https:\/\/blog.peddals.com\/","name":"Peddals Blog","description":"AI, LLM, Python, Mac, Pythonista3, iOS, etc. 
in Japanese and English","publisher":{"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.peddals.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994","name":"Handsome","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","caption":"Handsome"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g"}}]}},"_links":{"self":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/1723","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/comments?post=1723"}],"version-history":[{"count":10,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/1723\/revisions"}],"predecessor-version":[{"id":2050,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/1723\/revisions\/2050"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/media\/1721"}],"wp:attachment":[{"href":"https:\/\/blog.peddals.co
m\/wp-json\/wp\/v2\/media?parent=1723"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/categories?post=1723"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/tags?post=1723"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}