[Record] Find out why FAITH responses suddenly slow down (on CPU) when they initially work well on GPU #37

Open

arealclimber opened this issue Nov 1, 2024 · 2 comments

Labels: 2 hard level 2, documentation (Improvements or additions to documentation)

arealclimber commented Nov 1, 2024

[Record] Find out why FAITH responses, which initially ran well on the GPU, suddenly slow down (falling back to the CPU)

# Check the ollama container logs
docker compose logs ollama

# Check CPU usage
docker stats

# Check GPU usage (refresh every second)
nvidia-smi -l 1
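
To narrow things down faster, the compose logs can be filtered to GPU-related lines and the container stats taken as a one-off snapshot (a minimal sketch; the grep patterns are just guesses at the relevant keywords, not an exhaustive list):

# Show only CUDA/GPU-related lines from the ollama container
docker compose logs ollama | grep -iE "cuda|gpu|offload"

# One-off, non-streaming snapshot of container CPU/memory usage
docker stats --no-stream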
arealclimber self-assigned this Nov 1, 2024
arealclimber added the "documentation" and "2 hard level 2" labels Nov 1, 2024

arealclimber commented:

Log that, I suspect, shows the GPU being used at first but requests switching to the CPU after a while:

docker compose logs ollama
ollama-1  | 2024/11/01 06:06:45 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
ollama-1  | time=2024-11-01T06:06:45.963Z level=INFO source=images.go:782 msg="total blobs: 17"
ollama-1  | time=2024-11-01T06:06:45.964Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
ollama-1  | time=2024-11-01T06:06:45.964Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
ollama-1  | time=2024-11-01T06:06:45.967Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2793465207/runners
ollama-1  | time=2024-11-01T06:06:48.934Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60102]"
ollama-1  | time=2024-11-01T06:06:48.934Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
ollama-1  | time=2024-11-01T06:06:49.116Z level=INFO source=types.go:105 msg="inference compute" id=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c library=cuda compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
ollama-1  | Downloading model: llama3.1
ollama-1  | [GIN] 2024/11/01 - 06:06:50 | 200 |       32.93µs |       127.0.0.1 | HEAD     "/"
ollama-1  | [GIN] 2024/11/01 - 06:06:52 | 200 |  1.108139466s |       127.0.0.1 | POST     "/api/pull"
pulling manifest
ollama-1  | pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
ollama-1  | pulling 948af2743fc7... 100% ▕████████████████▏ 1.5 KB
ollama-1  | pulling 0ba8f0e314b4... 100% ▕████████████████▏  12 KB
ollama-1  | pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B
ollama-1  | pulling 1a4c3c319823... 100% ▕████████████████▏  485 B
ollama-1  | verifying sha256 digest
ollama-1  | writing manifest
ollama-1  | removing any unused layers
ollama-1  | success
ollama-1  | Downloading model: nomic-embed-text
ollama-1  | [GIN] 2024/11/01 - 06:06:52 | 200 |      23.768µs |       127.0.0.1 | HEAD     "/"
ollama-1  | [GIN] 2024/11/01 - 06:06:53 | 200 |  965.275972ms |       127.0.0.1 | POST     "/api/pull"
pulling manifest
ollama-1  | pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
ollama-1  | pulling c71d239df917... 100% ▕████████████████▏  11 KB
ollama-1  | pulling ce4a164fc046... 100% ▕████████████████▏   17 B
ollama-1  | pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
ollama-1  | verifying sha256 digest
ollama-1  | writing manifest
ollama-1  | removing any unused layers
ollama-1  | success
ollama-1  | [GIN] 2024/11/01 - 06:07:02 | 200 |     655.475µs |      172.19.0.5 | GET      "/api/tags"
ollama-1  | [GIN] 2024/11/01 - 06:07:03 | 200 |   21.215624ms |      172.19.0.5 | POST     "/api/create"
ollama-1  | cuda driver library failed to get device context 800time=2024-11-01T09:43:32.448Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1  | time=2024-11-01T09:43:32.467Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c parallel=4 available=16593125376 required="6.2 GiB"
ollama-1  | time=2024-11-01T09:43:32.467Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.5 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1  | time=2024-11-01T09:43:32.469Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama2793465207/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 45759"
ollama-1  | time=2024-11-01T09:43:32.470Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1  | time=2024-11-01T09:43:32.470Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1  | time=2024-11-01T09:43:32.470Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1  | INFO [main] build info | build=1 commit="1e6f655" tid="132621287960576" timestamp=1730454212
ollama-1  | INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="132621287960576" timestamp=1730454212 total_threads=20
ollama-1  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="45759" tid="132621287960576" timestamp=1730454212
ollama-1  | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
ollama-1  | llama_model_loader: - kv   1:                               general.type str              = model
ollama-1  | llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
ollama-1  | llama_model_loader: - kv   3:                           general.finetune str              = Instruct
ollama-1  | llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
ollama-1  | llama_model_loader: - kv   5:                         general.size_label str              = 8B
ollama-1  | llama_model_loader: - kv   6:                            general.license str              = llama3.1
ollama-1  | llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
ollama-1  | llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1  | llama_model_loader: - kv   9:                          llama.block_count u32              = 32
ollama-1  | llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
ollama-1  | llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
ollama-1  | llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
ollama-1  | llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
ollama-1  | llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
ollama-1  | llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
ollama-1  | llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
ollama-1  | llama_model_loader: - kv  17:                          general.file_type u32              = 2
ollama-1  | llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
ollama-1  | llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
ollama-1  | llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
ollama-1  | llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
ollama-1  | llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1  | llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1  | llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1  | llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
ollama-1  | llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
ollama-1  | llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
ollama-1  | llama_model_loader: - kv  28:               general.quantization_version u32              = 2
ollama-1  | llama_model_loader: - type  f32:   66 tensors
ollama-1  | llama_model_loader: - type q4_0:  225 tensors
ollama-1  | llama_model_loader: - type q6_K:    1 tensors
ollama-1  | time=2024-11-01T09:43:32.721Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1  | llm_load_vocab: special tokens cache size = 256
ollama-1  | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-1  | llm_load_print_meta: arch             = llama
ollama-1  | llm_load_print_meta: vocab type       = BPE
ollama-1  | llm_load_print_meta: n_vocab          = 128256
ollama-1  | llm_load_print_meta: n_merges         = 280147
ollama-1  | llm_load_print_meta: vocab_only       = 0
ollama-1  | llm_load_print_meta: n_ctx_train      = 131072
ollama-1  | llm_load_print_meta: n_embd           = 4096
ollama-1  | llm_load_print_meta: n_layer          = 32
ollama-1  | llm_load_print_meta: n_head           = 32
ollama-1  | llm_load_print_meta: n_head_kv        = 8
ollama-1  | llm_load_print_meta: n_rot            = 128
ollama-1  | llm_load_print_meta: n_swa            = 0
ollama-1  | llm_load_print_meta: n_embd_head_k    = 128
ollama-1  | llm_load_print_meta: n_embd_head_v    = 128
ollama-1  | llm_load_print_meta: n_gqa            = 4
ollama-1  | llm_load_print_meta: n_embd_k_gqa     = 1024
ollama-1  | llm_load_print_meta: n_embd_v_gqa     = 1024
ollama-1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
ollama-1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-1  | llm_load_print_meta: n_ff             = 14336
ollama-1  | llm_load_print_meta: n_expert         = 0
ollama-1  | llm_load_print_meta: n_expert_used    = 0
ollama-1  | llm_load_print_meta: causal attn      = 1
ollama-1  | llm_load_print_meta: pooling type     = 0
ollama-1  | llm_load_print_meta: rope type        = 0
ollama-1  | llm_load_print_meta: rope scaling     = linear
ollama-1  | llm_load_print_meta: freq_base_train  = 500000.0
ollama-1  | llm_load_print_meta: freq_scale_train = 1
ollama-1  | llm_load_print_meta: n_ctx_orig_yarn  = 131072
ollama-1  | llm_load_print_meta: rope_finetuned   = unknown
ollama-1  | llm_load_print_meta: ssm_d_conv       = 0
ollama-1  | llm_load_print_meta: ssm_d_inner      = 0
ollama-1  | llm_load_print_meta: ssm_d_state      = 0
ollama-1  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-1  | llm_load_print_meta: model type       = 8B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_print_meta: model params     = 8.03 B
ollama-1  | llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
ollama-1  | llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
ollama-1  | llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
ollama-1  | llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: LF token         = 128 'Ä'
ollama-1  | llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: max token length = 256
ollama-1  | ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
ollama-1  | llm_load_tensors: ggml ctx size =    0.14 MiB
ollama-1  | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1  | llm_load_tensors:        CPU buffer size =  4437.80 MiB
ollama-1  | llama_new_context_with_model: n_ctx      = 8192
ollama-1  | llama_new_context_with_model: n_batch    = 512
ollama-1  | llama_new_context_with_model: n_ubatch   = 512
ollama-1  | llama_new_context_with_model: flash_attn = 0
ollama-1  | llama_new_context_with_model: freq_base  = 500000.0
ollama-1  | llama_new_context_with_model: freq_scale = 1
ollama-1  | ggml_cuda_host_malloc: failed to allocate 1024.00 MiB of pinned memory: no CUDA-capable device is detected
ollama-1  | llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
ollama-1  | llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ollama-1  | ggml_cuda_host_malloc: failed to allocate 2.02 MiB of pinned memory: no CUDA-capable device is detected
ollama-1  | llama_new_context_with_model:        CPU  output buffer size =     2.02 MiB
ollama-1  | ggml_cuda_host_malloc: failed to allocate 560.01 MiB of pinned memory: no CUDA-capable device is detected
ollama-1  | llama_new_context_with_model:  CUDA_Host compute buffer size =   560.01 MiB
ollama-1  | llama_new_context_with_model: graph nodes  = 1030
ollama-1  | llama_new_context_with_model: graph splits = 1
ollama-1  | INFO [main] model loaded | tid="132621287960576" timestamp=1730454213
ollama-1  | time=2024-11-01T09:43:33.977Z level=INFO source=server.go:632 msg="llama runner started in 1.51 seconds"
ollama-1  | [GIN] 2024/11/01 - 09:44:51 | 200 |         1m19s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 09:47:08 | 200 | 35.622098325s |      172.19.0.5 | POST     "/api/chat"
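
The giveaway in the log above is that CUDA initialization fails ("ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected") and every buffer lands on the CPU, even though the scheduler still reports "offload to cuda"; the /api/chat requests then take 1m19s and ~35s. A quick filter for this state (a sketch; the patterns simply match the warning messages seen above):

# Flag the CPU-fallback symptoms in the ollama logs
docker compose logs ollama | grep -iE "no CUDA-capable device|failed to get device context|error looking up nvidia GPU memory"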

arealclimber commented:

Ollama log showing the GPU being used again after restarting docker compose:

docker compose logs ollama
ollama-1  | 2024/11/01 09:59:46 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
ollama-1  | time=2024-11-01T09:59:46.788Z level=INFO source=images.go:782 msg="total blobs: 17"
ollama-1  | time=2024-11-01T09:59:46.788Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
ollama-1  | time=2024-11-01T09:59:46.789Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
ollama-1  | time=2024-11-01T09:59:46.789Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama878632986/runners
ollama-1  | time=2024-11-01T09:59:49.883Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60102]"
ollama-1  | time=2024-11-01T09:59:49.883Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
ollama-1  | time=2024-11-01T09:59:50.015Z level=INFO source=types.go:105 msg="inference compute" id=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c library=cuda compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
ollama-1  | Downloading model: llama3.1
ollama-1  | [GIN] 2024/11/01 - 09:59:51 | 200 |      37.194µs |       127.0.0.1 | HEAD     "/"
ollama-1  | [GIN] 2024/11/01 - 09:59:53 | 200 |   1.23522516s |       127.0.0.1 | POST     "/api/pull"
pulling manifest
ollama-1  | pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
ollama-1  | pulling 948af2743fc7... 100% ▕████████████████▏ 1.5 KB
ollama-1  | pulling 0ba8f0e314b4... 100% ▕████████████████▏  12 KB
ollama-1  | pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B
ollama-1  | pulling 1a4c3c319823... 100% ▕████████████████▏  485 B
ollama-1  | verifying sha256 digest
ollama-1  | writing manifest
ollama-1  | removing any unused layers
ollama-1  | success
ollama-1  | Downloading model: nomic-embed-text
ollama-1  | [GIN] 2024/11/01 - 09:59:53 | 200 |      21.352µs |       127.0.0.1 | HEAD     "/"
ollama-1  | [GIN] 2024/11/01 - 09:59:54 | 200 |  1.044653723s |       127.0.0.1 | POST     "/api/pull"
pulling manifest
ollama-1  | pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
ollama-1  | pulling c71d239df917... 100% ▕████████████████▏  11 KB
ollama-1  | pulling ce4a164fc046... 100% ▕████████████████▏   17 B
ollama-1  | pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
ollama-1  | verifying sha256 digest
ollama-1  | writing manifest
ollama-1  | removing any unused layers
ollama-1  | success
ollama-1  | [GIN] 2024/11/01 - 10:00:01 | 200 |     696.203µs |      172.19.0.5 | GET      "/api/tags"
ollama-1  | [GIN] 2024/11/01 - 10:00:01 | 200 |   22.716065ms |      172.19.0.5 | POST     "/api/create"
ollama-1  | time=2024-11-01T10:00:47.730Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c parallel=4 available=16593125376 required="6.2 GiB"
ollama-1  | time=2024-11-01T10:00:47.731Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.5 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1  | time=2024-11-01T10:00:47.732Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama878632986/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 38345"
ollama-1  | time=2024-11-01T10:00:47.733Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1  | time=2024-11-01T10:00:47.733Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1  | time=2024-11-01T10:00:47.733Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1  | INFO [main] build info | build=1 commit="1e6f655" tid="129334520164352" timestamp=1730455247
ollama-1  | INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="129334520164352" timestamp=1730455247 total_threads=20
ollama-1  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="38345" tid="129334520164352" timestamp=1730455247
ollama-1  | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
ollama-1  | llama_model_loader: - kv   1:                               general.type str              = model
ollama-1  | llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
ollama-1  | llama_model_loader: - kv   3:                           general.finetune str              = Instruct
ollama-1  | llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
ollama-1  | llama_model_loader: - kv   5:                         general.size_label str              = 8B
ollama-1  | llama_model_loader: - kv   6:                            general.license str              = llama3.1
ollama-1  | llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
ollama-1  | llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1  | llama_model_loader: - kv   9:                          llama.block_count u32              = 32
ollama-1  | llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
ollama-1  | llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
ollama-1  | llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
ollama-1  | llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
ollama-1  | llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
ollama-1  | llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
ollama-1  | llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
ollama-1  | llama_model_loader: - kv  17:                          general.file_type u32              = 2
ollama-1  | llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
ollama-1  | llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
ollama-1  | llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
ollama-1  | llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
ollama-1  | llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1  | llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1  | llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1  | llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
ollama-1  | llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
ollama-1  | llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
ollama-1  | llama_model_loader: - kv  28:               general.quantization_version u32              = 2
ollama-1  | llama_model_loader: - type  f32:   66 tensors
ollama-1  | llama_model_loader: - type q4_0:  225 tensors
ollama-1  | llama_model_loader: - type q6_K:    1 tensors
ollama-1  | time=2024-11-01T10:00:47.984Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1  | llm_load_vocab: special tokens cache size = 256
ollama-1  | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-1  | llm_load_print_meta: arch             = llama
ollama-1  | llm_load_print_meta: vocab type       = BPE
ollama-1  | llm_load_print_meta: n_vocab          = 128256
ollama-1  | llm_load_print_meta: n_merges         = 280147
ollama-1  | llm_load_print_meta: vocab_only       = 0
ollama-1  | llm_load_print_meta: n_ctx_train      = 131072
ollama-1  | llm_load_print_meta: n_embd           = 4096
ollama-1  | llm_load_print_meta: n_layer          = 32
ollama-1  | llm_load_print_meta: n_head           = 32
ollama-1  | llm_load_print_meta: n_head_kv        = 8
ollama-1  | llm_load_print_meta: n_rot            = 128
ollama-1  | llm_load_print_meta: n_swa            = 0
ollama-1  | llm_load_print_meta: n_embd_head_k    = 128
ollama-1  | llm_load_print_meta: n_embd_head_v    = 128
ollama-1  | llm_load_print_meta: n_gqa            = 4
ollama-1  | llm_load_print_meta: n_embd_k_gqa     = 1024
ollama-1  | llm_load_print_meta: n_embd_v_gqa     = 1024
ollama-1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
ollama-1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-1  | llm_load_print_meta: n_ff             = 14336
ollama-1  | llm_load_print_meta: n_expert         = 0
ollama-1  | llm_load_print_meta: n_expert_used    = 0
ollama-1  | llm_load_print_meta: causal attn      = 1
ollama-1  | llm_load_print_meta: pooling type     = 0
ollama-1  | llm_load_print_meta: rope type        = 0
ollama-1  | llm_load_print_meta: rope scaling     = linear
ollama-1  | llm_load_print_meta: freq_base_train  = 500000.0
ollama-1  | llm_load_print_meta: freq_scale_train = 1
ollama-1  | llm_load_print_meta: n_ctx_orig_yarn  = 131072
ollama-1  | llm_load_print_meta: rope_finetuned   = unknown
ollama-1  | llm_load_print_meta: ssm_d_conv       = 0
ollama-1  | llm_load_print_meta: ssm_d_inner      = 0
ollama-1  | llm_load_print_meta: ssm_d_state      = 0
ollama-1  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-1  | llm_load_print_meta: model type       = 8B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_print_meta: model params     = 8.03 B
ollama-1  | llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
ollama-1  | llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
ollama-1  | llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
ollama-1  | llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: LF token         = 128 'Ä'
ollama-1  | llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: max token length = 256
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1  | ggml_cuda_init: found 1 CUDA devices:
ollama-1  |   Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ollama-1  | llm_load_tensors: ggml ctx size =    0.27 MiB
ollama-1  | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1  | llm_load_tensors:        CPU buffer size =   281.81 MiB
ollama-1  | llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
ollama-1  | llama_new_context_with_model: n_ctx      = 8192
ollama-1  | llama_new_context_with_model: n_batch    = 512
ollama-1  | llama_new_context_with_model: n_ubatch   = 512
ollama-1  | llama_new_context_with_model: flash_attn = 0
ollama-1  | llama_new_context_with_model: freq_base  = 500000.0
ollama-1  | llama_new_context_with_model: freq_scale = 1
ollama-1  | llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
ollama-1  | llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
ollama-1  | llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
ollama-1  | llama_new_context_with_model: graph nodes  = 1030
ollama-1  | llama_new_context_with_model: graph splits = 2
ollama-1  | INFO [main] model loaded | tid="129334520164352" timestamp=1730455250
ollama-1  | time=2024-11-01T10:00:50.745Z level=INFO source=server.go:632 msg="llama runner started in 3.01 seconds"
ollama-1  | [GIN] 2024/11/01 - 10:01:04 | 200 | 16.933931685s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:01:16 | 200 |  9.919378341s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:02:25 | 200 | 19.677854442s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:03:41 | 200 | 12.781913808s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:04:24 | 200 |  7.010408381s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | time=2024-11-01T10:09:29.565Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c parallel=4 available=16593125376 required="6.2 GiB"
ollama-1  | time=2024-11-01T10:09:29.565Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.5 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1  | time=2024-11-01T10:09:29.567Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama878632986/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 45283"
ollama-1  | time=2024-11-01T10:09:29.567Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1  | time=2024-11-01T10:09:29.567Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1  | time=2024-11-01T10:09:29.568Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1  | INFO [main] build info | build=1 commit="1e6f655" tid="133416257802240" timestamp=1730455769
ollama-1  | INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="133416257802240" timestamp=1730455769 total_threads=20
ollama-1  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="45283" tid="133416257802240" timestamp=1730455769
ollama-1  | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
ollama-1  | llama_model_loader: - kv   1:                               general.type str              = model
ollama-1  | llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
ollama-1  | llama_model_loader: - kv   3:                           general.finetune str              = Instruct
ollama-1  | llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
ollama-1  | llama_model_loader: - kv   5:                         general.size_label str              = 8B
ollama-1  | llama_model_loader: - kv   6:                            general.license str              = llama3.1
ollama-1  | llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
ollama-1  | llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1  | llama_model_loader: - kv   9:                          llama.block_count u32              = 32
ollama-1  | llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
ollama-1  | llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
ollama-1  | llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
ollama-1  | llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
ollama-1  | llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
ollama-1  | llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
ollama-1  | llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
ollama-1  | llama_model_loader: - kv  17:                          general.file_type u32              = 2
ollama-1  | llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
ollama-1  | llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
ollama-1  | llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
ollama-1  | llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
ollama-1  | llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1  | llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1  | llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1  | llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
ollama-1  | llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
ollama-1  | llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
ollama-1  | llama_model_loader: - kv  28:               general.quantization_version u32              = 2
ollama-1  | llama_model_loader: - type  f32:   66 tensors
ollama-1  | llama_model_loader: - type q4_0:  225 tensors
ollama-1  | llama_model_loader: - type q6_K:    1 tensors
ollama-1  | time=2024-11-01T10:09:29.819Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1  | llm_load_vocab: special tokens cache size = 256
ollama-1  | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-1  | llm_load_print_meta: arch             = llama
ollama-1  | llm_load_print_meta: vocab type       = BPE
ollama-1  | llm_load_print_meta: n_vocab          = 128256
ollama-1  | llm_load_print_meta: n_merges         = 280147
ollama-1  | llm_load_print_meta: vocab_only       = 0
ollama-1  | llm_load_print_meta: n_ctx_train      = 131072
ollama-1  | llm_load_print_meta: n_embd           = 4096
ollama-1  | llm_load_print_meta: n_layer          = 32
ollama-1  | llm_load_print_meta: n_head           = 32
ollama-1  | llm_load_print_meta: n_head_kv        = 8
ollama-1  | llm_load_print_meta: n_rot            = 128
ollama-1  | llm_load_print_meta: n_swa            = 0
ollama-1  | llm_load_print_meta: n_embd_head_k    = 128
ollama-1  | llm_load_print_meta: n_embd_head_v    = 128
ollama-1  | llm_load_print_meta: n_gqa            = 4
ollama-1  | llm_load_print_meta: n_embd_k_gqa     = 1024
ollama-1  | llm_load_print_meta: n_embd_v_gqa     = 1024
ollama-1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
ollama-1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-1  | llm_load_print_meta: n_ff             = 14336
ollama-1  | llm_load_print_meta: n_expert         = 0
ollama-1  | llm_load_print_meta: n_expert_used    = 0
ollama-1  | llm_load_print_meta: causal attn      = 1
ollama-1  | llm_load_print_meta: pooling type     = 0
ollama-1  | llm_load_print_meta: rope type        = 0
ollama-1  | llm_load_print_meta: rope scaling     = linear
ollama-1  | llm_load_print_meta: freq_base_train  = 500000.0
ollama-1  | llm_load_print_meta: freq_scale_train = 1
ollama-1  | llm_load_print_meta: n_ctx_orig_yarn  = 131072
ollama-1  | llm_load_print_meta: rope_finetuned   = unknown
ollama-1  | llm_load_print_meta: ssm_d_conv       = 0
ollama-1  | llm_load_print_meta: ssm_d_inner      = 0
ollama-1  | llm_load_print_meta: ssm_d_state      = 0
ollama-1  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-1  | llm_load_print_meta: model type       = 8B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_print_meta: model params     = 8.03 B
ollama-1  | llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
ollama-1  | llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
ollama-1  | llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
ollama-1  | llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: LF token         = 128 'Ä'
ollama-1  | llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: max token length = 256
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1  | ggml_cuda_init: found 1 CUDA devices:
ollama-1  |   Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ollama-1  | llm_load_tensors: ggml ctx size =    0.27 MiB
ollama-1  | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1  | llm_load_tensors:        CPU buffer size =   281.81 MiB
ollama-1  | llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
ollama-1  | llama_new_context_with_model: n_ctx      = 8192
ollama-1  | llama_new_context_with_model: n_batch    = 512
ollama-1  | llama_new_context_with_model: n_ubatch   = 512
ollama-1  | llama_new_context_with_model: flash_attn = 0
ollama-1  | llama_new_context_with_model: freq_base  = 500000.0
ollama-1  | llama_new_context_with_model: freq_scale = 1
ollama-1  | llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
ollama-1  | llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
ollama-1  | llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
ollama-1  | llama_new_context_with_model: graph nodes  = 1030
ollama-1  | llama_new_context_with_model: graph splits = 2
ollama-1  | INFO [main] model loaded | tid="133416257802240" timestamp=1730455771
ollama-1  | time=2024-11-01T10:09:31.576Z level=INFO source=server.go:632 msg="llama runner started in 2.01 seconds"
ollama-1  | [GIN] 2024/11/01 - 10:09:45 | 200 | 15.970325893s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:09:55 | 200 |  14.03876944s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | time=2024-11-01T10:20:17.580Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-edc5e32d-a11e-5246-d105-95a845fb9f1c parallel=4 available=16593125376 required="6.2 GiB"
ollama-1  | time=2024-11-01T10:20:17.580Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.5 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1  | time=2024-11-01T10:20:17.582Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama878632986/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 36675"
ollama-1  | time=2024-11-01T10:20:17.582Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1  | time=2024-11-01T10:20:17.582Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1  | time=2024-11-01T10:20:17.582Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1  | INFO [main] build info | build=1 commit="1e6f655" tid="133835129552896" timestamp=1730456417
ollama-1  | INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="133835129552896" timestamp=1730456417 total_threads=20
ollama-1  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="36675" tid="133835129552896" timestamp=1730456417
ollama-1  | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
ollama-1  | llama_model_loader: - kv   1:                               general.type str              = model
ollama-1  | llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
ollama-1  | llama_model_loader: - kv   3:                           general.finetune str              = Instruct
ollama-1  | llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
ollama-1  | llama_model_loader: - kv   5:                         general.size_label str              = 8B
ollama-1  | llama_model_loader: - kv   6:                            general.license str              = llama3.1
ollama-1  | llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
ollama-1  | llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1  | llama_model_loader: - kv   9:                          llama.block_count u32              = 32
ollama-1  | llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
ollama-1  | llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
ollama-1  | llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
ollama-1  | llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
ollama-1  | llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
ollama-1  | llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
ollama-1  | llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
ollama-1  | llama_model_loader: - kv  17:                          general.file_type u32              = 2
ollama-1  | llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
ollama-1  | llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
ollama-1  | llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
ollama-1  | llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
ollama-1  | llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1  | llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1  | llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1  | llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
ollama-1  | llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
ollama-1  | llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
ollama-1  | llama_model_loader: - kv  28:               general.quantization_version u32              = 2
ollama-1  | llama_model_loader: - type  f32:   66 tensors
ollama-1  | llama_model_loader: - type q4_0:  225 tensors
ollama-1  | llama_model_loader: - type q6_K:    1 tensors
ollama-1  | time=2024-11-01T10:20:17.833Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1  | llm_load_vocab: special tokens cache size = 256
ollama-1  | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-1  | llm_load_print_meta: arch             = llama
ollama-1  | llm_load_print_meta: vocab type       = BPE
ollama-1  | llm_load_print_meta: n_vocab          = 128256
ollama-1  | llm_load_print_meta: n_merges         = 280147
ollama-1  | llm_load_print_meta: vocab_only       = 0
ollama-1  | llm_load_print_meta: n_ctx_train      = 131072
ollama-1  | llm_load_print_meta: n_embd           = 4096
ollama-1  | llm_load_print_meta: n_layer          = 32
ollama-1  | llm_load_print_meta: n_head           = 32
ollama-1  | llm_load_print_meta: n_head_kv        = 8
ollama-1  | llm_load_print_meta: n_rot            = 128
ollama-1  | llm_load_print_meta: n_swa            = 0
ollama-1  | llm_load_print_meta: n_embd_head_k    = 128
ollama-1  | llm_load_print_meta: n_embd_head_v    = 128
ollama-1  | llm_load_print_meta: n_gqa            = 4
ollama-1  | llm_load_print_meta: n_embd_k_gqa     = 1024
ollama-1  | llm_load_print_meta: n_embd_v_gqa     = 1024
ollama-1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
ollama-1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-1  | llm_load_print_meta: n_ff             = 14336
ollama-1  | llm_load_print_meta: n_expert         = 0
ollama-1  | llm_load_print_meta: n_expert_used    = 0
ollama-1  | llm_load_print_meta: causal attn      = 1
ollama-1  | llm_load_print_meta: pooling type     = 0
ollama-1  | llm_load_print_meta: rope type        = 0
ollama-1  | llm_load_print_meta: rope scaling     = linear
ollama-1  | llm_load_print_meta: freq_base_train  = 500000.0
ollama-1  | llm_load_print_meta: freq_scale_train = 1
ollama-1  | llm_load_print_meta: n_ctx_orig_yarn  = 131072
ollama-1  | llm_load_print_meta: rope_finetuned   = unknown
ollama-1  | llm_load_print_meta: ssm_d_conv       = 0
ollama-1  | llm_load_print_meta: ssm_d_inner      = 0
ollama-1  | llm_load_print_meta: ssm_d_state      = 0
ollama-1  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-1  | llm_load_print_meta: model type       = 8B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_print_meta: model params     = 8.03 B
ollama-1  | llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
ollama-1  | llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
ollama-1  | llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
ollama-1  | llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: LF token         = 128 'Ä'
ollama-1  | llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ollama-1  | llm_load_print_meta: max token length = 256
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1  | ggml_cuda_init: found 1 CUDA devices:
ollama-1  |   Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ollama-1  | llm_load_tensors: ggml ctx size =    0.27 MiB
ollama-1  | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1  | llm_load_tensors:        CPU buffer size =   281.81 MiB
ollama-1  | llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
ollama-1  | llama_new_context_with_model: n_ctx      = 8192
ollama-1  | llama_new_context_with_model: n_batch    = 512
ollama-1  | llama_new_context_with_model: n_ubatch   = 512
ollama-1  | llama_new_context_with_model: flash_attn = 0
ollama-1  | llama_new_context_with_model: freq_base  = 500000.0
ollama-1  | llama_new_context_with_model: freq_scale = 1
ollama-1  | llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
ollama-1  | llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
ollama-1  | llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
ollama-1  | llama_new_context_with_model: graph nodes  = 1030
ollama-1  | llama_new_context_with_model: graph splits = 2
ollama-1  | INFO [main] model loaded | tid="133835129552896" timestamp=1730456419
ollama-1  | time=2024-11-01T10:20:20.091Z level=INFO source=server.go:632 msg="llama runner started in 2.51 seconds"
ollama-1  | [GIN] 2024/11/01 - 10:20:32 | 200 | 15.476656611s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:20:47 | 200 | 13.984552173s |      172.19.0.5 | POST     "/api/chat"
ollama-1  | [GIN] 2024/11/01 - 10:20:52 | 200 | 14.902360012s |      172.19.0.5 | POST     "/api/chat"
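
After the restart, CUDA initializes again ("found 1 CUDA devices"), the KV cache and compute buffers are allocated on CUDA0, and the /api/chat requests drop back to roughly 7–20 seconds. As an interim recovery step (a sketch, assuming the service is named ollama in the compose file), restarting just that service and re-checking the GPU should be enough:

# Restart the ollama service and confirm the GPU is visible again
docker compose restart ollama
nvidia-smi -l 1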
