{"id":2324,"date":"2025-02-11T19:09:14","date_gmt":"2025-02-11T10:09:14","guid":{"rendered":"https:\/\/blog.peddals.com\/?p=2324"},"modified":"2025-11-15T19:45:33","modified_gmt":"2025-11-15T10:45:33","slug":"ollama-vram-fine-tune-with-kv-cache","status":"publish","type":"post","link":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/","title":{"rendered":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)"},"content":{"rendered":"\n<p>As of January 2025, there are settings for acceleration and VRAM optimization available for trial use in Ollama. It seems that both may become standard settings soon, but since the latest version <code>0.5.7<\/code> at the time of writing requires users to set them up themselves, I will share how to do so.<\/p>\n\n\n\n<p>For those using local LLMs on Apple Silicon Mac (M series CPU), please also check out the previous article. It introduces how to allocate memory to the Mac&#8217;s GPU.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-peddals-blog wp-block-embed-peddals-blog\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"yWPOd1I3NC\"><a href=\"https:\/\/blog.peddals.com\/en\/fine-tune-vram-size-of-mac-for-llm\/\">Optimizing VRAM Settings for Using Local LLM on macOS (Fine-tuning: 1)<\/a><\/blockquote><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; visibility: hidden;\" title=\"&#8220;Optimizing VRAM Settings for Using Local LLM on macOS (Fine-tuning: 1)&#8221; &#8212; Peddals Blog\" src=\"https:\/\/blog.peddals.com\/en\/fine-tune-vram-size-of-mac-for-llm\/embed\/#?secret=sDGKVMZSg8#?secret=yWPOd1I3NC\" data-secret=\"yWPOd1I3NC\" width=\"525\" height=\"296\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Environment<\/h2>\n\n\n\n<p>This is 
a setting for Ollama itself, so it should not depend on the OS, but I only cover the macOS procedure here. Also, there seem to be ways to install Ollama by building from source code, using brew, or running it with Docker, but I don&#8217;t know how to configure these settings without using the app, so please look into that yourself. Sorry.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>macOS: Sequoia 15.1.1<\/li>\n\n\n\n<li>Ollama: 0.5.7 (Ollama.app downloadable at <a href=\"https:\/\/ollama.com\/\">the official website<\/a>.)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Official sources of information<\/h2>\n\n\n\n<p>Ollama FAQ:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/ollama\/ollama\/blob\/main\/docs\/faq.md#how-can-i-enable-flash-attention\" target=\"_blank\" rel=\"noreferrer noopener\">How can I enable Flash Attention?<\/a> (Flash Attention environment variable)<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/ollama\/ollama\/blob\/main\/docs\/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache\" target=\"_blank\" rel=\"noreferrer noopener\">How can I set the quantization type for the K\/V cache?<\/a> (K\/V cache environment variable and notes)<\/li>\n<\/ul>\n\n\n\n<p>The blog of the contributor who introduced K\/V caching features to Ollama:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/smcleod.net\/2024\/12\/bringing-k\/v-context-quantisation-to-ollama\/\" target=\"_blank\" rel=\"noreferrer noopener\">Bringing K\/V Context Quantisation to Ollama<\/a> (Technical details. Very interesting.)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Fine-tuning (2) Reduce VRAM Usage and Increase Speed with Flash Attention<\/h2>\n\n\n\n<p>The method I covered in my previous blog post above was (1), so here I will start from (2).<\/p>\n\n\n\n<p>First, enable Flash Attention in Ollama. Flash Attention helps reduce VRAM usage and also increases the computation speed of LLMs. 
As it has been explained in various documents, there don&#8217;t seem to be any negative impacts from enabling this feature. While some claim that it triples the speed, even if it doesn\u2019t quite do that, there&#8217;s no reason not to enable it if all effects are positive. It seems likely that Ollama will default to having this enabled in the future, but for now, you need to enable it yourself. If you&#8217;re using a Mac, run the following command in Terminal:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-width:calc(1 * 0.6 * .875rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>launchctl setenv OLLAMA_FLASH_ATTENTION 1<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path 
class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #DCDCAA\">launchctl<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">setenv<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">OLLAMA_FLASH_ATTENTION<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#1E1E1E;color:#c7c7c7;font-size:12px;line-height:1;position:relative\">Zsh<\/span><\/div>\n\n\n\n<p>To disable (revert), set the above value from <code>1<\/code> to <code>0<\/code>. To check the current settings, run the <code>getenv<\/code> command. Below is an example of its execution when it is enabled, returning a <code>1<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">% launchctl getenv OLLAMA_FLASH_ATTENTION<br>1<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Fine-tuning (3) Reduce VRAM Usage by K\/V Cache Quantization<\/h2>\n\n\n\n<p>K\/V cache quantization seems to be a technique that improves computational efficiency by quantizing the context cache and reducing the required memory. It is also referred to as K\/V context cache quantization at times. While fine-tuning (1) increased VRAM for loading LLMs to handle larger models or longer contexts, K\/V cache achieves similar results by reducing the amount of memory used during model execution. 
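<\/p>\n\n\n\n<p>To get a feel for the numbers: the K\/V cache size scales linearly with the bytes stored per element, so halving the element size roughly halves the cache. Here is a minimal shell sketch of that arithmetic, using purely hypothetical model dimensions (the layer, head, and context values below are for illustration only, not taken from any particular model):<\/p>

```shell
# Rough K/V cache size: 2 (K and V) x layers x kv_heads x head_dim x context x bytes/element
layers=32 kv_heads=8 head_dim=128 ctx=8192   # hypothetical dimensions, for illustration only
f16_mib=$(( 2 * layers * kv_heads * head_dim * ctx * 2 / 1024 / 1024 ))  # f16: 2 bytes/element
q8_mib=$(( 2 * layers * kv_heads * head_dim * ctx * 1 / 1024 / 1024 ))   # q8_0: ~1 byte/element
echo "f16: ${f16_mib} MiB, q8_0: ${q8_mib} MiB"  # -> f16: 1024 MiB, q8_0: 512 MiB
```

<p>(In practice <code>q8_0<\/code> also stores a small scale factor per block, so the real saving is a bit less than half.)<\/p>\n\n\n\n<p>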
Just as 8-bit quantization of a model itself causes only minor quality degradation while improving speed, K\/V cache quantization is expected to have a similarly small impact on output quality. <strong>When 8-bit quantization is applied to the K\/V cache, the required memory is roughly halved<\/strong> compared to no quantization, <strong>allowing roughly double the usable context length<\/strong>.<\/p>\n\n\n\n<p>This feature is currently marked as Experimental in Ollama, and <a href=\"https:\/\/smcleod.net\/2024\/12\/bringing-k\/v-context-quantisation-to-ollama\/#when-not-to-use-kv-context-cache-quantisation\" target=\"_blank\" rel=\"noreferrer noopener\">performance may degrade with embedding models<\/a>, vision-multimodal models, or models with high attention head counts. Accordingly, Ollama automatically disables this setting when an embedding model is detected. So, with the understanding that model compatibility can be an issue, try it out, and if quality drops, disable it. Unfortunately, there is currently no way to enable or disable this per model.<\/p>\n\n\n\n<p>Now for the settings. For the quantization type you can choose 8-bit (<code>q8_0<\/code>) or 4-bit (<code>q4_0<\/code>); the default is no quantization (f16). With 4-bit, the memory savings are larger but quality degrades further, so unless you need to squeeze in a model that otherwise could not run on the GPU alone, choose 8-bit. Additionally, Flash Attention must be enabled as a prerequisite, so please complete fine-tuning (2) above first. 
The command for Mac (in the case of 8-bit) would be as follows:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>launchctl setenv OLLAMA_KV_CACHE_TYPE \"q8_0\"<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 
2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #DCDCAA\">launchctl<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">setenv<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">OLLAMA_KV_CACHE_TYPE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;q8_0&quot;<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#1E1E1E;color:#c7c7c7;font-size:12px;line-height:1;position:relative\">Zsh<\/span><\/div>\n\n\n\n<p>To reset to default, specify &#8220;<code>f16<\/code>&#8221; as the value. To check the current setting, run the <code>getenv<\/code> command. Example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">% launchctl getenv OLLAMA_KV_CACHE_TYPE<br>q8_0<\/pre>\n\n\n\n<p>After setting up, you can run the model in Ollama and check the logs to see the quantization and cache size. 
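<\/p>\n\n\n\n<p>If you prefer checking from the command line, the relevant log line can be parsed with standard tools. A minimal sketch follows; the sample log line is embedded as data so it runs anywhere, but on a live install you would instead grep it out of the server log under <code>~\/.ollama\/logs\/<\/code> (the exact file name may vary between installs):<\/p>

```shell
# Extract the cache type and total size from a llama.cpp "KV self size" log line.
line='llama_new_context_with_model: KV self size  =  952.00 MiB, K (q8_0):  476.00 MiB, V (q8_0):  476.00 MiB'
kv_type=$(printf '%s' "$line" | sed -E 's/.*K \(([^)]+)\).*/\1/')
kv_size=$(printf '%s' "$line" | sed -E 's/.*KV self size *= *([0-9.]+) MiB.*/\1/')
echo "type=${kv_type} size=${kv_size} MiB"  # -> type=q8_0 size=952.00 MiB
```

<p>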
In the following example, it is default <code>f16<\/code> until halfway, and after the change, it becomes <code>q8_0<\/code>, showing that the overall size has decreased.<\/p>\n\n\n\n<p>(Feb 16, 2025: corrected command.)<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #DCDCAA\">%<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">grep<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;KV self size&quot;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">~\/.ollama\/logs\/server2.log<\/span><span style=\"color: #D4D4D4\">|<\/span><span style=\"color: #DCDCAA\">tail<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> 
<\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1792.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (f16):  896.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">f16<\/span><span style=\"color: #D4D4D4\">):  896.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1536.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (f16):  768.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">f16<\/span><span style=\"color: #D4D4D4\">):  768.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #B5CEA8\">512.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: 
#D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (f16):  256.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">f16<\/span><span style=\"color: #D4D4D4\">):  256.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1792.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (f16):  896.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">f16<\/span><span style=\"color: #D4D4D4\">):  896.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1792.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (f16):  896.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">f16<\/span><span style=\"color: #D4D4D4\">):  896.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span 
style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #B5CEA8\">952.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (q8_0):  476.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">q8_0<\/span><span style=\"color: #D4D4D4\">):  476.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #B5CEA8\">952.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (q8_0):  476.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">q8_0<\/span><span style=\"color: #D4D4D4\">):  476.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span 
style=\"color: #D4D4D4\">  <\/span><span style=\"color: #B5CEA8\">680.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (q8_0):  340.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">q8_0<\/span><span style=\"color: #D4D4D4\">):  340.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #B5CEA8\">816.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (q8_0):  408.00 MiB, V (<\/span><span style=\"color: #DCDCAA\">q8_0<\/span><span style=\"color: #D4D4D4\">):  408.00 MiB<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">llama_new_context_with_model:<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">KV<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">self<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">size<\/span><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">=<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1224.00<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">MiB,<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">K<\/span><span style=\"color: #D4D4D4\"> (q8_0):  612.00 MiB, V (<\/span><span style=\"color: 
#DCDCAA\">q8_0<\/span><span style=\"color: #D4D4D4\">):  612.00 MiB<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#1E1E1E;color:#c7c7c7;font-size:12px;line-height:1;position:relative\">Zsh<\/span><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Set Variables Permanently<\/h2>\n\n\n\n<p>With the above two setup methods, the settings are cleared when the Mac restarts. Below, I introduce a way to create a small app that sets the environment variables and launches automatically when you log in.<\/p>\n\n\n\n<p>1. Launch Script Editor in Applications &gt; Utilities.<\/p>\n\n\n\n<p>2. Command + N to open a new window, then copy-paste the script below. It simply sets the environment variables and then launches Ollama.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>do shell script \"launchctl setenv OLLAMA_HOST 0.0.0.0\"\ndo shell script \"launchctl setenv OLLAMA_FLASH_ATTENTION 1\"\ndo shell script \"launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0\"\ntell application \"Ollama\" to run<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #DCDCAA\">do shell script<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;launchctl setenv OLLAMA_HOST 0.0.0.0&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">do shell script<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;launchctl setenv OLLAMA_FLASH_ATTENTION 1&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">do shell script<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">tell<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #4EC9B0\">application<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&quot;Ollama&quot;<\/span><span style=\"color: #D4D4D4\"> to <\/span><span style=\"color: #DCDCAA\">run<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#1E1E1E;color:#c7c7c7;font-size:12px;line-height:1;position:relative\">AppleScript<\/span><\/div>\n\n\n\n<p>3. 
File menu &gt; Export As &gt; set the options as below, then Save:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Export As: <strong>LaunchOllama.app<\/strong><\/li>\n\n\n\n<li>Where: <strong>Application<\/strong><\/li>\n\n\n\n<li>File Format: <strong>Application<\/strong><\/li>\n<\/ul>\n\n\n\n<p>4. Apple menu &gt; Settings &gt; General &gt; Login items.<\/p>\n\n\n\n<p>5. If you already have Ollama.app in the list, click the [ &#8211; ] button to remove it.<\/p>\n\n\n\n<p>6. Click [ + ] and select the LaunchOllama.app you just created in step 3.<\/p>\n\n\n\n<p>7. Reboot your Mac, log in, navigate to <a href=\"http:\/\/localhost:11434\" target=\"_blank\" rel=\"noreferrer noopener\">http:\/\/localhost:11434<\/a> to confirm Ollama is running, and run a command such as <code>launchctl getenv OLLAMA_FLASH_ATTENTION<\/code> to confirm that <code>1<\/code> is returned. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Super Helpful Tool &#8211; Interactive VRAM Estimator<\/h2>\n\n\n\n<p>The K\/V cache feature contributor&#8217;s blog introduced earlier includes a super useful tool called the <a href=\"https:\/\/smcleod.net\/2024\/12\/bringing-k\/v-context-quantisation-to-ollama\/#interactive-vram-estimator\" target=\"_blank\" rel=\"noreferrer noopener\">Interactive VRAM Estimator<\/a>. With it, you can check whether a model you want to use will fit in your VRAM: given the model&#8217;s parameter count, quantization level, and context length, it estimates the total VRAM required for each K\/V cache quantization level.<\/p>\n\n\n\n<p>For example, in the case of <a href=\"https:\/\/ollama.com\/library\/deepseek-r1:32b\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSeek-R1:32B_Q4_K_M<\/a>, you would choose <strong>32B<\/strong> and <strong>Q4_K_M<\/strong>. 
Since we set the <strong>K\/V cache to Q8_0<\/strong> this time, watch the Total of the green bar while selecting the <strong>Context Size<\/strong> to estimate the VRAM required to run that combination.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"776\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1024x776.jpg\" alt=\"\" class=\"wp-image-2178\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1024x776.jpg 1024w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-300x227.jpg 300w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-768x582.jpg 768w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1536x1165.jpg 1536w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image.jpg 1828w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption class=\"wp-element-caption\">It estimates 16K tokens should fit in 21.5GB VRAM<\/figcaption><\/figure>\n\n\n\n<p>With 32K (= 32768) tokens, it exceeds my Mac&#8217;s VRAM of 24GB, so I&#8217;ll enable the Advanced mode in the top right to come up with a more aggressive number. By tweaking the Context Size slider while keeping an eye on the Total of Q8_0, it seems that 24K (24 * 1024 = 24576) fits within 23GB of RAM. 
Awesome, huh?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"745\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1024x745.png\" alt=\"\" class=\"wp-image-2179\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1024x745.png 1024w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-300x218.png 300w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-768x559.png 768w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-1536x1117.png 1536w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image.png 1562w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>So, here&#8217;s the result of running <code>ollama ps<\/code> after putting 24576 in the Size of context window for the generative AI app I made with Dify. It&#8217;s processing at a neat 100% GPU usage. 
Victory!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"549\" height=\"48\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-6.png\" alt=\"\" class=\"wp-image-2381\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-6.png 549w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-6-300x26.png 300w\" sizes=\"auto, (max-width: 549px) 100vw, 549px\" \/><\/figure>\n\n\n\n<p>This is where you set the context length of your AI app in Dify:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"423\" src=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5-1024x423.png\" alt=\"\" class=\"wp-image-2353\" srcset=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5-1024x423.png 1024w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5-300x124.png 300w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5-768x317.png 768w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5-1536x634.png 1536w, https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/image-5.png 1540w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Last Miscellaneous Notes<\/h2>\n\n\n\n<p>In this article and the previous one, I introduced ways to fine-tune the environment side so that LLMs run effectively. Since I only have 32GB of unified memory, using LLMs has always been a challenge for me. Thanks to new techniques like these, it has become easier to enjoy open-source LLMs than before, and I hope even one more person can do the same.<\/p>\n\n\n\n<p>I have not benchmarked execution speed, so please try it out yourself. 
Still, just by understanding how to accommodate the memory LLMs require, or how to fit them entirely into VRAM, I think you will find that recent models run at quite a practical speed. 10 tokens per second should be enough in most cases.<\/p>\n\n\n\n<p>To be honest, I think it&#8217;s tough to do much with a local LLM on just 16GB. On the other hand, with 128GB you could run multiple local LLMs in parallel.<\/p>\n\n\n\n<p>Recently, while Chinese companies&#8217; models have been highly praised for their performance, there are also discussions about prohibiting their use due to concerns over information leaks. Since you can run them locally, you don&#8217;t need to worry and can try them freely. Personally, I like the performance and quick response of the newly released French model <a href=\"https:\/\/ollama.com\/library\/mistral-small\" target=\"_blank\" rel=\"noreferrer noopener\">mistral-small:24b<\/a>. It&#8217;s also very nice that it doesn&#8217;t mix in Chinese words or characters the way Chinese-made models do (maybe I&#8217;m a bit tired of that). Does anyone know when the final (non-preview) version of QwQ will be available?<\/p>\n\n\n\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\"><summary>Image by Stable Diffusion (Mochi Diffusion)<\/summary>\n<p>I simply asked for an image of lots of goods loaded onto a llama. Initially, I had Mistral-Small 24B create prompts based on my image, but the results were completely unsatisfactory. 
It seems that, rather than writing elaborate descriptions, just listing the essential words and regenerating repeatedly leads to a more fitting result.<\/p>\n\n\n\n<p>Date:<br>2025-2-2 1:55:30<\/p>\n\n\n\n<p>Model:<br>realisticVision-v51VAE_original_768x512_cn<\/p>\n\n\n\n<p>Size:<br>768 x 512<\/p>\n\n\n\n<p>Include in Image:<br>A Llama with heavy load of luggage on it<\/p>\n\n\n\n<p>Exclude from Image:<\/p>\n\n\n\n<p>Seed:<br>2221886765<\/p>\n\n\n\n<p>Steps:<br>20<\/p>\n\n\n\n<p>Guidance Scale:<br>20.0<\/p>\n\n\n\n<p>Scheduler:<br>DPM-Solver++<\/p>\n\n\n\n<p>ML Compute Unit:<br>CPU &amp; GPU<\/p>\n<\/details>\n","protected":false},"excerpt":{"rendered":"<p>As of January 2025, there are settings for acceleration and VRAM optimization available for trial use in Ollam &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2188,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_locale":"en_US","_original_post":"https:\/\/blog.peddals.com\/?p=2170","footnotes":""},"categories":[25,8],"tags":[12],"class_list":["post-2324","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","category-macos","tag-mac","en-US"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals Blog<\/title>\n<meta name=\"description\" content=\"Setup Flash Attention and KV Cache for speed and VRAM optimization\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link 
rel=\"canonical\" href=\"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals Blog\" \/>\n<meta property=\"og:description\" content=\"Setup Flash Attention and KV Cache for speed and VRAM optimization\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/\" \/>\n<meta property=\"og:site_name\" content=\"Peddals Blog\" \/>\n<meta property=\"article:published_time\" content=\"2025-02-11T10:09:14+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-15T10:45:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"768\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Handsome\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Handsome\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"25 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/\"},\"author\":{\"name\":\"Handsome\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"headline\":\"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)\",\"datePublished\":\"2025-02-11T10:09:14+00:00\",\"dateModified\":\"2025-11-15T10:45:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/\"},\"wordCount\":1507,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg\",\"keywords\":[\"mac\"],\"articleSection\":[\"AI\\\/LLM\",\"macOS\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/\",\"name\":\"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg\",\"datePublished\":\"2025-02-11T10:09:14+00:00\",\"dateModified\":\"2025-11-15T10:45:33+00:00\",\"description\":\"Setup Flash Attention and KV Cache for speed and VRAM optimization\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#primaryimage\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg\",\"contentUrl\":\"https:\\\/\\\/blog.peddals.com\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg\",\"width\":768,\"height\":512},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/en\\\/ollama-vram-fine-tune-with-kv-cache\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u30db\u30fc\u30e0\",\"item\":\"https:\\\/\\\/blog.peddals.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#website\",\"url\":\"https:\\\/\\\/blog.peddals.com\\\/\",\"name\":\"Peddals 
Blog\",\"description\":\"AI, LLM, Python, Mac, Pythonista3, iOS, etc. in Japanese and English\",\"publisher\":{\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/blog.peddals.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/blog.peddals.com\\\/#\\\/schema\\\/person\\\/81b2dabca748c3d11a45722f02d9d994\",\"name\":\"Handsome\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\",\"caption\":\"Handsome\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals Blog","description":"Setup Flash Attention and KV Cache for speed and VRAM optimization","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals Blog","og_description":"Setup Flash Attention and KV Cache for speed and VRAM optimization","og_url":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/","og_site_name":"Peddals Blog","article_published_time":"2025-02-11T10:09:14+00:00","article_modified_time":"2025-11-15T10:45:33+00:00","og_image":[{"width":768,"height":512,"url":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg","type":"image\/jpeg"}],"author":"Handsome","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Handsome","Est. 
reading time":"25 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#article","isPartOf":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/"},"author":{"name":"Handsome","@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"headline":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)","datePublished":"2025-02-11T10:09:14+00:00","dateModified":"2025-11-15T10:45:33+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/"},"wordCount":1507,"commentCount":0,"publisher":{"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"image":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg","keywords":["mac"],"articleSection":["AI\/LLM","macOS"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/","url":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/","name":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2) | Peddals 
Blog","isPartOf":{"@id":"https:\/\/blog.peddals.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#primaryimage"},"image":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg","datePublished":"2025-02-11T10:09:14+00:00","dateModified":"2025-11-15T10:45:33+00:00","description":"Setup Flash Attention and KV Cache for speed and VRAM optimization","breadcrumb":{"@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#primaryimage","url":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg","contentUrl":"https:\/\/blog.peddals.com\/wp-content\/uploads\/2025\/02\/A-Llama-with-heavy-load-of-luggage-on-it.511.2221886765.jpg","width":768,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/blog.peddals.com\/en\/ollama-vram-fine-tune-with-kv-cache\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u30db\u30fc\u30e0","item":"https:\/\/blog.peddals.com\/"},{"@type":"ListItem","position":2,"name":"Optimizing Ollama VRAM Settings for Using Local LLM on macOS (Fine-tuning: 2)"}]},{"@type":"WebSite","@id":"https:\/\/blog.peddals.com\/#website","url":"https:\/\/blog.peddals.com\/","name":"Peddals Blog","description":"AI, LLM, Python, Mac, Pythonista3, iOS, etc. 
in Japanese and English","publisher":{"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.peddals.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/blog.peddals.com\/#\/schema\/person\/81b2dabca748c3d11a45722f02d9d994","name":"Handsome","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g","caption":"Handsome"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/51d7363349ec538c4d62c9ebe89488fd7388729ad0c9dfeebd8bb32ebfb11f17?s=96&d=mm&r=g"}}]}},"_links":{"self":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/2324","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/comments?post=2324"}],"version-history":[{"count":15,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/2324\/revisions"}],"predecessor-version":[{"id":3195,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/posts\/2324\/revisions\/3195"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/media\/2188"}],"wp:attachment":[{"href":"https:\/\/blog.peddals.co
m\/wp-json\/wp\/v2\/media?parent=2324"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/categories?post=2324"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.peddals.com\/wp-json\/wp\/v2\/tags?post=2324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}