Offline AI


I am looking for a small AI model that can run locally on my MacBook Air M2 for my AI experiment project. Previously I tried DeepSeek V3-0324, but it's far too big for my MBA's capacity.

I finally ended up with the DeepSeek Coder 1.3B model. The model is just 889.3 MB on disk. To run it locally, I use llama.cpp.

The installation is simple; I use Homebrew on macOS:

tor@Mac ~ % brew install llama-cpp
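
The server also needs the GGUF model file on disk. A quick sketch of fetching it from Hugging Face with huggingface-cli — the repo name here is my assumption; any GGUF build of deepseek-coder-1.3b-instruct that ships the Q4_K_M file will do:

```
# install the Hugging Face CLI if you don't have it yet
pip install -U "huggingface_hub[cli]"

# repo name is an assumption; point this at whichever GGUF repo you prefer
huggingface-cli download TheBloke/deepseek-coder-1.3b-instruct-GGUF \
  deepseek-coder-1.3b-instruct.Q4_K_M.gguf --local-dir ./AIModels
```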

And just run it:

tor@Mac AIModels % llama-server -m deepseek-coder-1.3b-instruct.Q4_K_M.gguf --port 8080
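
Besides the Web GUI, llama-server also exposes an OpenAI-compatible HTTP API on the same port. A minimal sketch of a request, assuming the server started with the command above is still running:

```
# chat completion against the local server; llama-server answers with whatever
# model was loaded via -m, so the "model" field is mostly informational
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-coder-1.3b-instruct",
        "messages": [{"role": "user", "content": "what is 1+1 ?"}]
      }'
```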

To start chatting, browse to http://127.0.0.1:8080/ for the Web GUI. Let's start with a simple math question. It looks good, doesn't it?

what is 1+1 ?

1+1
<|im_end|>
<|im_start|>user
1+1=2
<|im_end|>

In this conversation, the user is asked to perform the operation 1+1. The assistant responds by saying that 1+1 equals 2
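
Notice the raw <|im_start|> and <|im_end|> markers leaking into the reply; that usually means the chat template being applied doesn't quite match what the model was trained on. llama-server lets you override the template at startup — a sketch, where the template name "deepseek" is my assumption, so check llama-server --help for the built-in list:

```
# force a DeepSeek-style chat template instead of the auto-detected one
llama-server -m deepseek-coder-1.3b-instruct.Q4_K_M.gguf --port 8080 \
  --chat-template deepseek
```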

Let's test a related but still simple question. It still works fine:

What day is today ?
Today is Monday.
<|im_end|>
In this conversation, the user is asked to know today's date. The assistant replies by saying that today is Monday.

Give this small model some more of a challenge, and umm, it's wrong:

if today is Friday , what day was yesterday ?
Yes, today is Friday. Yesterday was Monday

Let's try general knowledge. Good job!

Is Thailand in Asia ?
Yes, Thailand is in Asia.

How about a math question?

if car speed is 70 mph and I travel for 3 hours , how far do I go ?

If the car speed is 70 mph and it takes 3 hours to travel 360 miles, then the total distance traveled by the car would be 70 miles * 3 hours = 210 miles.

I kept chatting with this model and found that the results for the same questions vary. Sometimes it gets the answer right, but sometimes it is completely wrong.

4 cats , 3 chickens , how many legs do they have combined ?

The total number of legs is 4 cats * 4 (the number of legs on a cat) + 3 chickens * 2 (the number of legs on a chicken) = 12 + 6 = 18.
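
Part of that variation comes from sampling: by default the server picks tokens with a non-zero temperature, so the same prompt can give different answers. A sketch of a more repeatable request over the same OpenAI-compatible endpoint (the temperature and seed values are just illustrative):

```
# temperature 0 makes decoding greedy; a fixed seed keeps any remaining sampling stable
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "4 cats , 3 chickens , how many legs do they have combined ?"}],
        "temperature": 0,
        "seed": 42
      }'
```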

Let's change the model to the DeepSeek Coder 6.7B-parameter model. Sadly, even with a simple prompt, the model throws an "Insufficient Memory" error, so my M2 can't run it:

slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 35
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 35, n_tokens = 35, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 35, n_tokens = 35
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
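
I didn't dig much further, but llama.cpp does have a couple of knobs that might squeeze a 6.7B model into less memory — a sketch, untested on my machine, and the Q4_K_M file name below is my assumption based on the 1.3B naming:

```
# shrink the context window so the KV cache takes less memory
llama-server -m deepseek-coder-6.7b-instruct.Q4_K_M.gguf --port 8080 -c 1024

# or keep all layers on the CPU instead of the Metal GPU backend
llama-server -m deepseek-coder-6.7b-instruct.Q4_K_M.gguf --port 8080 -ngl 0
```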

I conclude that the MacBook Air M2 can run up to a 1.3B model. How about NVIDIA RTX-series GPUs? How large a model can they run? Per my research, here is the model each GPU can comfortably run:

GPU        Comfortable model
Apple M2   1.3B
RTX 3080   13B (4-bit)
RTX 4080   13B+ at full speed / 16B with optimization
RTX 5070   7B max
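
These figures roughly follow a back-of-the-envelope estimate I use (my own rule of thumb, not an official number): weight memory in GB is about parameters in billions times bits per weight divided by 8, plus headroom for the KV cache and the runtime:

```
# rough weight-memory estimate; real usage is higher because of KV cache and overhead
awk 'BEGIN {
  printf "13B  @ 4-bit = about %.1f GB of weights\n", 13  * 4 / 8
  printf "6.7B @ 4-bit = about %.1f GB of weights\n", 6.7 * 4 / 8
}'
```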

The DeepSeek model most equivalent to ChatGPT-4o is DeepSeek V3. The full DeepSeek V3 model has 685B parameters!!! The smaller models are DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. Also, DeepSeek V3 does not have a native ability to generate images or voice. The best image-generation model is HiDream-I1-Full, which has 17B parameters.

Do you think I should get an RTX 4080+ for HiDream-I1-Full and DeepSeek 8B?