Testbed

The benchmark will be run on the following Apple MacBook Air (M1):

  1. Processor : Apple Silicon M1
  2. CPU cores : 8-core CPU (4 performance cores and 4 efficiency cores)
  3. GPU cores : 7-core GPU with 16-core Neural Engine
  4. Memory : 8 GB
  5. LLM model : Mistral 7B Q4_K_M
  6. LLM runner : llama-cpp-python (CPU build) & llama-cpp-python (Metal build)

(Pre-Run) CPU & Memory Utilization

Idle CPU

[Image: Idle CPU]

More than 90% of CPU resources are available.

Idle Memory

[Image: Idle Memory]

Free memory is around 1.2 GB, but the M1 can also use SSD space as swap on top of its unified memory.
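For reference, the same pre-run CPU and memory readings can also be captured from Python. The sketch below uses psutil, which is an extra dependency and not part of the benchmark itself.

import psutil

# Sample CPU utilization over one second and read current memory stats
cpu_used = psutil.cpu_percent(interval=1.0)   # % of total CPU currently in use
mem = psutil.virtual_memory()

print(f"CPU in use:   {cpu_used:.1f}%  (available: {100 - cpu_used:.1f}%)")
print(f"Memory total: {mem.total / 1024**3:.1f} GB")
print(f"Memory free:  {mem.available / 1024**3:.1f} GB")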

Prompting Scenario

Several common LLM use cases will be tested, ranging from instruction following to mathematical computation.

No | Type | Input prompt
1 | Instructional | Summarize following paragraph : My name is dega. I love riding my bike and sometimes reading some manga. I live in Indonesia. i dislike vegetables but i do like fruits
2 | Open Ended Generation | Write a short story about AI take over humanity
3 | Question & Answer | What is the capital city of Indonesia?
4 | Mathematical Equation | Solve the following problem step by step and provide the final answer : A store sells pens at $1.20 each and notebooks at $2.50 each. A customer buys a total of 12 items, some pens and some notebooks, and spends exactly $25.80. How many pens and how many notebooks did the customer buy?
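
For scripting the benchmark, these four prompts can be collected into a dict keyed by scenario. PROMPTS below is an illustrative name and not part of the original script; the prompt strings are copied verbatim from the table above.

# The four test prompts, keyed by scenario (sketch; strings copied from the table)
PROMPTS = {
    "Instructional": (
        "Summarize following paragraph : My name is dega. I love riding my bike "
        "and sometimes reading some manga. I live in Indonesia. i dislike "
        "vegetables but i do like fruits"
    ),
    "Open Ended Generation": "Write a short story about AI take over humanity",
    "Question & Answer": "What is the capital city of Indonesia?",
    "Mathematical Equation": (
        "Solve the following problem step by step and provide the final answer : "
        "A store sells pens at $1.20 each and notebooks at $2.50 each. A customer "
        "buys a total of 12 items, some pens and some notebooks, and spends "
        "exactly $25.80. How many pens and how many notebooks did the customer buy?"
    ),
}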

Default Configuration

Max tokens will be set to 128, and the stop sequence will be set to stop=["###"].
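
These defaults can be collected in one place so every scenario runs with identical settings. GEN_KWARGS below is just an illustrative name, not part of the original script.

# Shared generation settings applied to every scenario (illustrative sketch)
GEN_KWARGS = {
    "max_tokens": 128,   # cap the generated output at 128 tokens
    "stop": ["###"],     # stop generation at the next "###" marker
}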

Code Used

n_ctx = 2048 & n_threads = 8

All 8 CPU cores of the Apple Silicon M1 will be used, and the context window holds up to 2048 tokens (prompt plus generated output).

import time
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

# Prompt the model
InputPrompt = "prompt value..."
prompt = "### Instruction:\n{InputPrompt}.\n\n### Response:"

start_time = time.time()

output = llm(
    prompt=prompt,
    max_tokens=128,
    stop=["###"]
)

end_time = time.time()

# Extract response
response_text = output["choices"][0]["text"].strip()

# Calculate token count (output only)
output_tokens = len(llm.tokenize(response_text.encode("utf-8")))

# Total time
duration = end_time - start_time
tokens_per_sec = output_tokens / duration if duration > 0 else 0

# Display results
print(response_text)
print(f"\n⏱️ Response time: {duration:.2f} seconds")
print(f"🔢 Tokens generated: {output_tokens}")
print(f"⚡ Tokens per second: {tokens_per_sec:.2f}")

Execution

CPU Only

No | Type | Execution Time (s) | Tokens per Second | Tokens Generated (Output)
1 | CPU - Instructional | 201.46 | 0.15 | 30
2 | CPU - Open Ended Generation | 857.34 | 0.15 | 127
3 | CPU - Question & Answer | 74.60 | 0.13 | 10
4 | CPU - Mathematical Equation | 903.03 | 0.13 | 128

Run Record

Scenario 1 - Instructional - CPU [Image: LLM execution using CPU, result for Scenario 1]

Scenario 2 - Open Ended Generation - CPU [Image: LLM execution using CPU, result for Scenario 2]

Scenario 3 - Question & Answer - CPU [Image: LLM execution using CPU, result for Scenario 3]

Scenario 4 - Mathematical Equation - CPU [Image: LLM execution using CPU, result for Scenario 4]

GPU (Main)

Preparation - Enabling llama-cpp-python with GPU support

conda is used to manage the environment and cross-platform compatibility on the Mac M1.

llama-cpp-python needs to be built from source in order to enable GPU (Metal) computing.

Ensure that python and pip point to the conda environment, so the native arm64 build of Python is used instead of a Rosetta-emulated x86_64 Python.
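
A quick way to confirm this is to check the interpreter architecture and path from Python itself; this is a minimal sketch, assuming the conda environment is already activated.

import platform
import sys

# A native Apple Silicon Python prints "arm64";
# "x86_64" indicates a Rosetta-emulated interpreter.
print("Architecture:", platform.machine())

# The path should point inside the active conda environment.
print("Interpreter: ", sys.executable)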

First, install cmake and ninja as build tools:

conda install -c conda-forge cmake ninja

Then build from source:

CMAKE_ARGS='-DLLAMA_METAL=on -DCMAKE_VERBOSE_MAKEFILE=ON' pip install -v --force-reinstall --no-binary :all: llama-cpp-python

Offload all transformer layers to the GPU

[Image: All transformer layers offloaded (33/33)]

# Load the model
llm = Llama(
    model_path="./model/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=4,
    n_gpu_layers=-1,  # -1 offloads all transformer layers to the GPU (Metal)
)

Execute

LLAMA_METAL_DEBUG=1 python hellollm.py

Result

No | Transformer Layers Offloaded to GPU | Type | Execution Time (s) | Tokens per Second (TPS) | Tokens Generated (Output)
1 | -1 (all) | GPU - Instructional | 3.56 | 7.31 | 26
2 | -1 (all) | GPU - Open Ended Generation | 11.00 | 11.54 | 127
3 | -1 (all) | GPU - Question & Answer | 1.22 | 8.20 | 10
4 | -1 (all) | GPU - Mathematical Equation | 11.28 | 11.34 | 128

Run Record (all layers offloaded)

Scenario 1 - Instructional - GPU [Image: LLM execution using GPU, result for Scenario 1]

Scenario 2 - Open Ended Generation - GPU [Image: LLM execution using GPU, result for Scenario 2]

Scenario 3 - Question & Answer - GPU [Image: LLM execution using GPU, result for Scenario 3]

Scenario 4 - Mathematical Equation - GPU [Image: LLM execution using GPU, result for Scenario 4]

Conclusion

Based on the following setup:

  1. Model : Mistral 7B Q4_K_M
  2. LLM runner : llama-cpp-python (Metal) & llama-cpp-python (CPU)
  3. Device : Apple Silicon M1

CPU vs GPU computing for the LLM runner

Computation | Type | Execution Time (s) | Tokens per Second | Tokens Generated (Output)
CPU | Instructional | 201.46 | 0.15 | 30
CPU | Open Ended Generation | 857.34 | 0.15 | 127
CPU | Question & Answer | 74.60 | 0.13 | 10
CPU | Mathematical Computation | 903.03 | 0.13 | 128
GPU (all layers offloaded) | Instructional | 3.56 | 7.31 | 26
GPU (all layers offloaded) | Open Ended Generation | 11.00 | 11.54 | 127
GPU (all layers offloaded) | Question & Answer | 1.22 | 8.20 | 10
GPU (all layers offloaded) | Mathematical Computation | 11.28 | 11.34 | 128
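
The per-scenario speedup follows directly from this table; the sketch below computes it, with the execution times copied from the rows above.

# Execution times in seconds (CPU, GPU), copied from the comparison table above
timings = {
    "Instructional":            (201.46, 3.56),
    "Open Ended Generation":    (857.34, 11.00),
    "Question & Answer":        (74.60, 1.22),
    "Mathematical Computation": (903.03, 11.28),
}

for scenario, (cpu_s, gpu_s) in timings.items():
    print(f"{scenario}: {cpu_s / gpu_s:.1f}x faster on the GPU")

# Prints speedups of roughly 57x, 78x, 61x and 80x respectively.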

Instructional Benchmark (CPU vs GPU)

[Image: CPU vs GPU benchmark for the Instructional prompt]

*Lower is better; a lower execution time means a faster LLM response.

Open Ended Generation Benchmark (CPU vs GPU)

[Image: CPU vs GPU benchmark for the Open Ended Generation prompt]

*Lower is better; a lower execution time means a faster LLM response.

Question & Answer Benchmark (CPU vs GPU)

[Image: CPU vs GPU benchmark for the Question and Answer prompt]

*Lower is better; a lower execution time means a faster LLM response.

Mathematical Equation (CPU vs GPU)

[Image: CPU vs GPU benchmark for the Mathematical Equation prompt]

*Lower is better; a lower execution time means a faster LLM response.
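
Charts like the ones above can be reproduced from the comparison table with a short matplotlib sketch; matplotlib is an extra dependency that is not part of the benchmark itself, and the numbers below are copied from the table.

import matplotlib.pyplot as plt

# Execution times in seconds, copied from the comparison table above
scenarios = ["Instructional", "Open Ended", "Q&A", "Mathematical"]
cpu_times = [201.46, 857.34, 74.60, 903.03]
gpu_times = [3.56, 11.00, 1.22, 11.28]

x = range(len(scenarios))
width = 0.4

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar([i - width / 2 for i in x], cpu_times, width, label="CPU only")
ax.bar([i + width / 2 for i in x], gpu_times, width, label="GPU (all layers offloaded)")
ax.set_xticks(list(x))
ax.set_xticklabels(scenarios)
ax.set_ylabel("Execution time (s), lower is better")
ax.set_title("CPU vs GPU benchmark, Mistral 7B Q4_K_M on Apple M1")
ax.legend()
fig.tight_layout()
fig.savefig("cpu_vs_gpu_benchmark.png")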

As the benchmarks above show, GPU-accelerated inference is clearly faster than CPU-only inference: depending on the scenario, execution time is roughly 55 to 80 times shorter.