Creating the Python project

Generally, there are two approaches to creating a Python environment for a specific project.

It is recommended to use Miniconda to manage several Python environments.

Python 3.10.x and an Apple Silicon M1 processor are used throughout this article.

Creating the project using Miniconda

The following command creates a Python environment with a specific Python version:

conda create --name hello-world-minstral python=3.10

Creating the project using python venv

Out of the box, Python can also create a virtual environment with the following command:

python -m venv hello-world-minstral

However, the environment will be created with whichever Python version runs the command, which you can check with:

python --version

If you want Python 3.10 specifically, that interpreter has to be exposed on your PATH first, before creating the virtual environment.


That is why Miniconda is generally the recommended way to create a Python project: it makes it easier to manage different Python versions without juggling multiple $PATH entries (especially on Windows).

Note: conda also handles cross-platform compatibility. On Apple Silicon, for example, conda resolves platform-specific dependencies and installs packages built for the M1/M2/M3/M4 chips.

Installing dependencies for Mistral 7B

Switch to the newly created conda environment “hello-world-minstral”.

View the available conda environments:

conda env list

Activate the hello-world-minstral environment:

conda activate hello-world-minstral

(Image: checking the Python version and the activated conda environment)

After activating the conda environment, ensure the Python version is 3.10.18:

python --version
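The same check can also be done from inside Python; note that the exact patch version (3.10.18 here) depends on what conda resolves at environment-creation time:

import sys

# Print the full interpreter version and fail fast if it is not Python 3.10
print(sys.version)
assert sys.version_info[:2] == (3, 10), "expected Python 3.10"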

Install the inference engine (llama-cpp)

CPU Only

zsh

Initialize conda for zsh:

conda init zsh

Install llama-cpp-python:

pip install "llama-cpp-python"

bash

pip install llama-cpp-python

Wait until the installation is completed.

(Image: installing the LLM runner, llama-cpp, with support for Apple Silicon M series)

llama-cpp with GPU support

Build it directly from source with Metal enabled:

CMAKE_ARGS='-DLLAMA_METAL=on' pip install --force-reinstall --no-binary :all: llama-cpp-python
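Once a Metal-enabled build is installed, GPU offloading is controlled from Python through the n_gpu_layers parameter of the Llama constructor (-1 offloads every layer). A minimal sketch, assuming the GGUF file from the download section later in this article is already in place:

from llama_cpp import Llama

# Offload all model layers to the Metal GPU (-1 = everything, 0 = CPU only)
llm = Llama(
    model_path="model/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)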

What is an Inference Engine?


A C++ inference engine is needed to run a quantized LLM locally; the Metal build of llama-cpp provides GPU support on Apple Silicon (M1).

Inference engine = LLM runner. llama-cpp is an inference engine written in C++ that supports both CPU and GPU.

The inference engine is responsible for the following:

  1. Load the trained model (Mistral, LLaMA, etc.)
  2. Process the prompt input
  3. Generate the output (send the input to the loaded model and capture the response)

Download the Model (GGUF Format)

(Image: Q4_K_M is lightweight yet still usable)

An Apple M1 with 8 GB of RAM will be used to run the LLM. One Mistral model that fits this constraint is Mistral 7B Q4_K_M.

Mistral 7B Q4_K_M requires roughly 6 GB of RAM.
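As a rough back-of-the-envelope check (the ~7.2 billion parameter count and the ~4.5 bits-per-weight average for Q4_K_M are approximations), the quantized weights alone account for roughly 4 GB, and the KV cache plus runtime overhead push the total toward the 6 GB figure:

# Rough memory estimate for a Q4_K_M quantized 7B model (approximate figures)
params = 7.2e9             # ~7.2 billion parameters in Mistral 7B
bits_per_weight = 4.5      # Q4_K_M averages roughly 4.5 bits per weight
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Approximate weight size: {weights_gb:.1f} GB")   # ~4.1 GB; KV cache and overhead add the rest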

Download the Q4_K_M file from the following repository:

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF

(Image: Q4_K_M saved in the model folder)

Create a new folder called “model” in the project and store the Mistral 7B Q4_K_M GGUF file there:

cp mistral-7b-instruct-v0.1.Q4_K_M.gguf ./model/
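Alternatively, if the huggingface_hub package is available (pip install huggingface_hub, an extra dependency not otherwise used in this article), the file can be fetched programmatically straight into the model folder:

from huggingface_hub import hf_hub_download

# Download the Q4_K_M GGUF file directly into the local "model" folder
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    local_dir="model",
)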

Load the model & run it

Ensure llama-cpp-python is installed. In this article, version 0.3.12 is used with Python 3.10.18.
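A quick way to confirm the installed version from inside Python (recent releases of llama-cpp-python expose __version__):

import llama_cpp

# Print the installed llama-cpp-python version (0.3.12 in this article)
print(llama_cpp.__version__)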

Create a new Python file called “hellollm.py”:

from llama_cpp import Llama

# Load the model 
llm = Llama(
    model_path="model/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # GGUF file stored in the model folder
    n_ctx=2048,       # context window size in tokens
    n_threads=4,      # number of CPU threads to use
)

# Prompt the model
output = llm(
    prompt="### Instruction:\nSay hello to the world.\n\n### Response:",
    max_tokens=128,
    stop=["###"]
)

# Print result
print(output["choices"][0]["text"].strip())

(Image: running Mistral Q4_K_M using llama-cpp-python)

Run it:

python hellollm.py

You should see a response similar to the following:

Hello, world! It's great to be here and ready to assist you with any information or tasks you might have. How can I help you today?

(Image: Mistral 7B's response to the hello world instruction)

Calculating the response time

To measure the execution time from the moment the prompt is handed to the LLM runner / inference engine until the LLM returns a response, timing code can be added. First, import the time module at the top of hellollm.py:

import time

Record the start time just before the prompt is passed to the model:

# Prompt the model
prompt = "### Instruction:\nSay hello to the world.\n\n### Response:"

start_time = time.time()

output = llm(
    prompt=prompt,
    max_tokens=128,
    stop=["###"]
)

Then record the end time right after the output is generated:

output = llm(
    prompt=prompt,
    max_tokens=128,
    stop=["###"]
)
end_time = time.time()

Finally, print the response time:

print(f"\n⏱️ Response time: {end_time - start_time:.2f} seconds")