Commit b84e9ce

feat(gemma-3): add the gemma-3 vision example

Signed-off-by: dm4 <dm4@secondstate.io>

1 parent 1978c85

File tree

7 files changed: +548, -0 lines changed

.github/workflows/llama.yml

Lines changed: 18 additions & 0 deletions

```diff
@@ -214,6 +214,24 @@ jobs:
           'Hello, world.'
           sha1sum *.wav
+      - name: Gemma-3 Vision
+        shell: bash
+        run: |
+          test -f ~/.wasmedge/env && source ~/.wasmedge/env
+          cd wasmedge-ggml/gemma-3
+          curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
+          curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-mmproj-f16.gguf
+          curl -LO https://llava-vl.github.io/static/images/monalisa.jpg
+          cargo build --target wasm32-wasip1 --release
+          time wasmedge --dir .:. \
+            --env n_gpu_layers="$NGL" \
+            --env image=monalisa.jpg \
+            --env mmproj=gemma-3-4b-it-mmproj-f16.gguf \
+            --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf \
+            target/wasm32-wasip1/release/wasmedge-ggml-gemma-3.wasm \
+            default \
+            $'<start_of_turn>user\n<start_of_image><image><end_of_image>Describe this image<end_of_turn>\n<start_of_turn>model\n'
       - name: Build llama-stream
         run: |
           cd wasmedge-ggml/llama-stream
```
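The `--nn-preload` value in the step above packs four colon-separated fields into one string: a model alias (`default`), the backend (`GGML`), the device target (`AUTO`), and the model file. WasmEdge parses this internally; as an illustration only, the split can be sketched like this (`parse_preload` is a hypothetical helper, not part of any WasmEdge API):

```rust
// Illustrative only: how a WasmEdge --nn-preload spec of the form
// alias:backend:target:path decomposes into its four fields.
fn parse_preload(spec: &str) -> Option<(&str, &str, &str, &str)> {
    // splitn(4, ..) keeps any ':' inside the model path intact.
    let mut parts = spec.splitn(4, ':');
    Some((parts.next()?, parts.next()?, parts.next()?, parts.next()?))
}

fn main() {
    let spec = "default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf";
    let (alias, backend, target, path) = parse_preload(spec).unwrap();
    // The alias is what the program later passes to build_from_cache().
    println!("alias={alias} backend={backend} target={target} path={path}");
}
```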

wasmedge-ggml/gemma-3/Cargo.toml

Lines changed: 16 additions & 0 deletions

```toml
[package]
name = "wasmedge-ggml-gemma-3"
version = "0.1.0"
edition = "2021"

[dependencies]
serde_json = "1.0"
wasmedge-wasi-nn = "0.7.1"

[[bin]]
name = "wasmedge-ggml-gemma-3"
path = "src/main.rs"

[[bin]]
name = "wasmedge-ggml-gemma-3-base64"
path = "src/base64.rs"
```

wasmedge-ggml/gemma-3/README.md

Lines changed: 107 additions & 0 deletions

# Gemma-3 Example For WASI-NN with GGML Backend

> [!NOTE]
> Please refer to [wasmedge-ggml/README.md](../README.md) for the general introduction and the setup of the WASI-NN plugin with the GGML backend. This document focuses on the specifics of the Gemma-3 example.

## Get the Gemma-3 Model

```bash
curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-mmproj-f16.gguf
```

## Prepare the Image

```bash
curl -LO https://llava-vl.github.io/static/images/monalisa.jpg
```

## Execute (with an image file)

> [!NOTE]
> You may see warnings like `key clip.vision.* not found in file.`; these are expected and can be ignored.

```console
$ wasmedge --dir .:. \
    --env mmproj=gemma-3-4b-it-mmproj-f16.gguf \
    --env image=monalisa.jpg \
    --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf \
    wasmedge-ggml-gemma-3.wasm default
```

## Execute (with a base64-encoded image)

> [!NOTE]
> You may see warnings like `key clip.vision.* not found in file.`; these are expected and can be ignored.

```console
$ wasmedge --dir .:. \
    --env mmproj=gemma-3-4b-it-mmproj-f16.gguf \
    --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf \
    wasmedge-ggml-gemma-3-base64.wasm default
```

## Results

```
USER:
Describe this image

ASSISTANT:
Okay, let's describe the image.

The image is a portrait of a man, almost certainly **Leonardo da Vinci's *Mona Lisa*.**

Here's a breakdown of the key features and overall impression:

*   **Subject:** A woman, believed to be Lisa del Giocondo, is the central figure. She is seated, turned slightly to the viewer, with her hands folded in her lap.

*   **Composition:** She's positioned in a pyramidal form, creating a sense of stability and balance.

*   **Expression:** Her most famous feature is her enigmatic smile. It's subtle, almost ambiguous, and seems to shift depending on the angle of observation. This is a key part of the painting's enduring mystery.

*   **Technique:** Da Vinci employed his signature *sfumato* technique - a subtle blending of colors and tones that creates a soft, hazy effect, particularly around her eyes and mouth. This contributes to the dreamlike quality of the painting.

*   **Background:** A landscape is visible in the background, seemingly a hazy, distant vista of mountains and water. The landscape is atmospheric and slightly blurred, further drawing attention to the figure.

*   **Color Palette:** The painting employs a muted, earthy color palette – browns, greens, golds, and blues - giving it a timeless and serene quality.

*   **Overall Impression:** The *Mona Lisa* is a masterpiece of Renaissance art. It's renowned for its realism, psychological depth, and technical brilliance. It exudes an aura of mystery and beauty, which is why it's so widely recognized and studied.

**Do you want me to delve deeper into a specific aspect of the image, such as:**

*   The historical context of the painting?
*   The techniques Da Vinci used?
*   The theories surrounding her smile?

USER:
The techniques Da Vinci used?

ASSISTANT:
Okay, let’s dive deeper into the techniques Leonardo da Vinci employed to create the *Mona Lisa*. He was a meticulous and innovative artist, and the painting showcases several groundbreaking techniques he developed and perfected. Here’s a breakdown of the key ones:

**1. Sfumato:**

*   **What it is:** This is arguably the *most* famous technique associated with the *Mona Lisa*. “Sfumato” is an Italian word meaning “smoked” or “blurred.” It’s a subtle, almost imperceptible blending of colors and tones that creates a soft, hazy effect, particularly around the edges of forms and features.
*   **How he used it:** Da Vinci achieved this by applying incredibly thin layers of oil paint – often just a glaze – and meticulously blending them without harsh lines. He’d work with a tiny brush, gradually building up the tones to create a sense of depth and softness. You see it most prominently around her eyes and mouth, contributing to the elusive quality of her smile.

**2. Chiaroscuro:**

*   **What it is:** Chiaroscuro (Italian for "light-dark") is the use of strong contrasts between light and dark to create dramatic effects.
*   **How he used it:** Da Vinci uses chiaroscuro to model Mona Lisa’s face and body, giving her a three-dimensional appearance. The subtle gradations of light and shadow define her features and create a sense of volume.

**3. Layering and Glazing:**

*   **What it is:** This technique involves applying many thin, translucent layers of paint (glazes) over a dry underpainting.
*   **How he used it:** Da Vinci built up the colors of the *Mona Lisa* through numerous thin glazes. Each layer subtly alters the color and tone of the layer beneath it. This created a luminous quality and depth of color that was far more vibrant than previous painting techniques. It also helped to create the *sfumato* effect.

**4. Aerial Perspective (Atmospheric Perspective):**

*   **What it is:** This technique uses variations in color and detail to create the illusion of depth in a landscape. Objects further away appear paler, less detailed, and bluer.
*   **How he used it:** The background landscape in the *Mona Lisa* demonstrates aerial perspective beautifully. The mountains and the distant water are rendered with muted colors and softened details, suggesting their
```

wasmedge-ggml/gemma-3/src/base64.rs

Lines changed: 203 additions & 0 deletions
Large diffs are not rendered by default.
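GitHub does not render this 203-line diff. Judging from the README's base64 invocation above, this binary presumably accepts a base64-encoded image in place of the `image` file path. For illustration only (the actual `src/base64.rs` may use a crate and a different interface), producing that kind of payload can be sketched with a hand-rolled encoder:

```rust
// Minimal base64 encoder (standard alphabet, '=' padding). This only
// illustrates the payload such a binary would consume; it is NOT the
// code from src/base64.rs, which is not rendered in this diff.
const ALPHABET: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn base64_encode(data: &[u8]) -> String {
    let mut out = String::new();
    for chunk in data.chunks(3) {
        // Pack up to 3 bytes into a 24-bit group, then emit four 6-bit symbols.
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = ((b[0] as u32) << 16) | ((b[1] as u32) << 8) | b[2] as u32;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

fn main() {
    // For a real image: base64_encode(&std::fs::read("monalisa.jpg").unwrap())
    println!("{}", base64_encode(b"hello")); // aGVsbG8=
}
```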

wasmedge-ggml/gemma-3/src/main.rs

Lines changed: 204 additions & 0 deletions

```rust
use serde_json::Value;
use std::collections::HashMap;
use std::env;
use std::io;
use wasmedge_wasi_nn::{
    self, BackendError, Error, ExecutionTarget, GraphBuilder, GraphEncoding, GraphExecutionContext,
    TensorType,
};

fn read_input() -> String {
    loop {
        let mut answer = String::new();
        io::stdin()
            .read_line(&mut answer)
            .expect("Failed to read line");
        if !answer.is_empty() && answer != "\n" && answer != "\r\n" {
            return answer.trim().to_string();
        }
    }
}

fn get_options_from_env() -> HashMap<&'static str, Value> {
    let mut options = HashMap::new();

    // Required parameters for llava
    if let Ok(val) = env::var("mmproj") {
        options.insert("mmproj", Value::from(val.as_str()));
    } else {
        eprintln!("Failed to get mmproj model.");
        std::process::exit(1);
    }
    if let Ok(val) = env::var("image") {
        options.insert("image", Value::from(val.as_str()));
    } else {
        eprintln!("Failed to get the target image.");
        std::process::exit(1);
    }

    // Optional parameters
    if let Ok(val) = env::var("enable_log") {
        options.insert("enable-log", serde_json::from_str(val.as_str()).unwrap());
    } else {
        options.insert("enable-log", Value::from(false));
    }
    if let Ok(val) = env::var("enable_debug_log") {
        options.insert(
            "enable-debug-log",
            serde_json::from_str(val.as_str()).unwrap(),
        );
    } else {
        options.insert("enable-debug-log", Value::from(false));
    }
    if let Ok(val) = env::var("ctx_size") {
        options.insert("ctx-size", serde_json::from_str(val.as_str()).unwrap());
    } else {
        options.insert("ctx-size", Value::from(4096));
    }
    if let Ok(val) = env::var("n_gpu_layers") {
        options.insert("n-gpu-layers", serde_json::from_str(val.as_str()).unwrap());
    } else {
        options.insert("n-gpu-layers", Value::from(0));
    }
    options
}

fn set_data_to_context(context: &mut GraphExecutionContext, data: Vec<u8>) -> Result<(), Error> {
    context.set_input(0, TensorType::U8, &[1], &data)
}

fn get_data_from_context(context: &GraphExecutionContext, index: usize) -> String {
    // Preserve for 4096 tokens with average token length 6
    const MAX_OUTPUT_BUFFER_SIZE: usize = 4096 * 6;
    let mut output_buffer = vec![0u8; MAX_OUTPUT_BUFFER_SIZE];
    let mut output_size = context
        .get_output(index, &mut output_buffer)
        .expect("Failed to get output");
    output_size = std::cmp::min(MAX_OUTPUT_BUFFER_SIZE, output_size);

    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}

fn get_output_from_context(context: &GraphExecutionContext) -> String {
    get_data_from_context(context, 0)
}

fn get_metadata_from_context(context: &GraphExecutionContext) -> Value {
    serde_json::from_str(&get_data_from_context(context, 1)).expect("Failed to get metadata")
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let model_name: &str = &args[1];

    // Set options for the graph. Check our README for more details:
    // https://github.com/second-state/WasmEdge-WASINN-examples/tree/master/wasmedge-ggml#parameters
    let options = get_options_from_env();

    // Create graph and initialize context.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .config(serde_json::to_string(&options).expect("Failed to serialize options"))
        .build_from_cache(model_name)
        .expect("Failed to build graph");
    let mut context = graph
        .init_execution_context()
        .expect("Failed to init context");

    // If there is a third argument, use it as the prompt and enter non-interactive mode.
    // This is mainly for the CI workflow.
    if args.len() >= 3 {
        let prompt = &args[2];
        // Set the prompt.
        println!("Prompt:\n{}", prompt);
        let tensor_data = prompt.as_bytes().to_vec();
        context
            .set_input(0, TensorType::U8, &[1], &tensor_data)
            .expect("Failed to set input");
        println!("Response:");

        // Get the number of input tokens and llama.cpp versions.
        let input_metadata = get_metadata_from_context(&context);
        println!("[INFO] llama_commit: {}", input_metadata["llama_commit"]);
        println!(
            "[INFO] llama_build_number: {}",
            input_metadata["llama_build_number"]
        );
        println!(
            "[INFO] Number of input tokens: {}",
            input_metadata["input_tokens"]
        );

        // Get the output.
        context.compute().expect("Failed to compute");
        let output = get_output_from_context(&context);
        println!("{}", output.trim());

        // Retrieve the output metadata.
        let metadata = get_metadata_from_context(&context);
        println!(
            "[INFO] Number of input tokens: {}",
            metadata["input_tokens"]
        );
        println!(
            "[INFO] Number of output tokens: {}",
            metadata["output_tokens"]
        );
        return;
    }

    let mut saved_prompt = String::new();
    let image_placeholder = "<image>";

    loop {
        println!("USER:");
        let input = read_input();

        // Gemma-3 prompt format: '<start_of_turn>user\n<start_of_image><image><end_of_image>Describe this image<end_of_turn>\n<start_of_turn>model\n'
        if saved_prompt.is_empty() {
            saved_prompt = format!(
                "<start_of_turn>user\n<start_of_image>{}<end_of_image>{}<end_of_turn>\n<start_of_turn>model\n",
                image_placeholder, input
            );
        } else {
            saved_prompt = format!(
                "{}<start_of_turn>user\n{}<end_of_turn>\n<start_of_turn>model\n",
                saved_prompt, input
            );
        }

        // Set prompt to the input tensor.
        set_data_to_context(&mut context, saved_prompt.as_bytes().to_vec())
            .expect("Failed to set input");

        // Execute the inference.
        let mut reset_prompt = false;
        match context.compute() {
            Ok(_) => (),
            Err(Error::BackendError(BackendError::ContextFull)) => {
                println!("\n[INFO] Context full, we'll reset the context and continue.");
                reset_prompt = true;
            }
            Err(Error::BackendError(BackendError::PromptTooLong)) => {
                println!("\n[INFO] Prompt too long, we'll reset the context and continue.");
                reset_prompt = true;
            }
            Err(err) => {
                println!("\n[ERROR] {}", err);
                std::process::exit(1);
            }
        }

        // Retrieve the output.
        let mut output = get_output_from_context(&context);
        println!("ASSISTANT:\n{}", output.trim());

        // Update the saved prompt.
        if reset_prompt {
            saved_prompt.clear();
        } else {
            output = output.trim().to_string();
            saved_prompt = format!("{}{}<end_of_turn>\n", saved_prompt, output);
        }
    }
}
```
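The interactive loop above keeps the whole conversation in `saved_prompt`, appending each user turn and model reply in the Gemma-3 template and clearing it when the backend reports a full context. That bookkeeping can be exercised in isolation; the sketch below restates the `format!` calls from `main` as standalone helpers (`append_turn` and `append_reply` are names introduced here, not part of the example's API):

```rust
// Restates the prompt bookkeeping from main() without the WASI-NN calls.
fn append_turn(saved: &str, user: &str) -> String {
    if saved.is_empty() {
        // The first turn carries the <image> placeholder for the input image.
        format!(
            "<start_of_turn>user\n<start_of_image><image><end_of_image>{}<end_of_turn>\n<start_of_turn>model\n",
            user
        )
    } else {
        format!(
            "{}<start_of_turn>user\n{}<end_of_turn>\n<start_of_turn>model\n",
            saved, user
        )
    }
}

fn append_reply(saved: &str, reply: &str) -> String {
    // The model's reply is closed with <end_of_turn> before the next user turn.
    format!("{}{}<end_of_turn>\n", saved, reply.trim())
}

fn main() {
    let mut prompt = String::new();
    prompt = append_turn(&prompt, "Describe this image");
    prompt = append_reply(&prompt, "A Renaissance portrait.");
    prompt = append_turn(&prompt, "Tell me more");
    println!("{}", prompt);
}
```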
Binary file not shown.
Binary file not shown.
