Commit a57926b

Merge pull request #331 from ratfactor/tokenization
First tokenization exercise
2 parents 371beb1 + ec5e15a commit a57926b

3 files changed, +175 -0 lines changed

build.zig

Lines changed: 21 additions & 0 deletions
@@ -1057,6 +1057,27 @@ const exercises = [_]Exercise{
         .output = "",
         .kind = .@"test",
     },
+    .{
+        .main_file = "103_tokenization.zig",
+        .output =
+        \\My
+        \\name
+        \\is
+        \\Ozymandias
+        \\King
+        \\of
+        \\Kings
+        \\Look
+        \\on
+        \\my
+        \\Works
+        \\ye
+        \\Mighty
+        \\and
+        \\despair
+        \\This little poem has 15 words!
+        ,
+    },
     .{
         .main_file = "999_the_end.zig",
         .output =

exercises/103_tokenization.zig

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+//
+// The functionality of the standard library is becoming increasingly
+// important in Zig. On the one hand, it is helpful to look at how
+// the individual functions are implemented, because they are wonderfully
+// suitable as templates for your own functions. On the other hand,
+// these standard functions are part of the basic equipment of Zig.
+//
+// This means that they are always available on every system.
+// Therefore it is worthwhile to deal with them in Ziglings, too.
+// It's a great way to learn important skills. For example, it is
+// often necessary to process large amounts of data from files.
+// For this kind of sequential reading and processing, Zig provides some
+// useful functions, which we will take a closer look at in the coming
+// exercises.
+//
+// A nice example of this has been published on the Zig homepage,
+// replacing the somewhat dusty 'Hello world!'.
+//
+// Nothing against 'Hello world!', but it just doesn't do justice
+// to the elegance of Zig, and it would be a pity if someone took a short
+// first look at the homepage and didn't get 'enchanted'. The present
+// example is simply better suited for that, and we will therefore
+// use it as an introduction to tokenizing, because it is wonderfully
+// suited to understanding the basic principles.
+//
+// In the following exercises we will also read and process data from
+// large files, and by then at the latest it will be clear to everyone how
+// useful all this is.
+//
+// Let's start with the analysis of the example from the Zig homepage
+// and explain the most important things.
+//
+//     const std = @import("std");
+//
+//     // Here a function from the standard library is bound to a short
+//     // name; it converts numbers in a string into the respective
+//     // integer values.
+//     const parseInt = std.fmt.parseInt;
+//
+//     // Defining a test case
+//     test "parse integers" {
+//
+//         // Four numbers are passed in a string.
+//         // Please note that the individual values are separated
+//         // either by a space or a comma.
+//         const input = "123 67 89,99";
+//
+//         // In order to be able to process the input values,
+//         // memory is required. An allocator is defined here for
+//         // this purpose.
+//         const ally = std.testing.allocator;
+//
+//         // The allocator is used to initialize a dynamic array in
+//         // which the numbers will be stored.
+//         var list = std.ArrayList(u32).init(ally);
+//
+//         // 'defer' frees the list at the end of the scope, so you can
+//         // never forget it and the testing allocator reports no leak.
+//         defer list.deinit();
+//
+//         // Now it gets exciting:
+//         // A standard tokenizer is called (Zig has several). It splits
+//         // the input at the respective separators (we remember, space
+//         // and comma) and returns an iterator over the pieces in between.
+//         var it = std.mem.tokenize(u8, input, " ,");
+//
+//         // The iterator can now be processed in a loop and the
+//         // individual numbers can be transferred.
+//         while (it.next()) |num| {
+//             // But be careful: The numbers are still only available
+//             // as strings. This is where the integer parser comes
+//             // into play, converting them into real integer values.
+//             const n = try parseInt(u32, num, 10);
+//
+//             // Finally the individual values are stored in the array.
+//             try list.append(n);
+//         }
+//
+//         // For the subsequent test, a second static array is created,
+//         // which is directly filled with the expected values.
+//         const expected = [_]u32{ 123, 67, 89, 99 };
+//
+//         // Now the numbers converted from the string can be compared
+//         // with the expected ones, so that the test is completed
+//         // successfully.
+//         for (expected, list.items) |exp, actual| {
+//             try std.testing.expectEqual(exp, actual);
+//         }
+//     }
+//
+// So much for the example from the homepage.
+// Let's summarize the basic steps again:
+//
+// - We have a set of data in sequential order, separated from each other
+//   by means of various characters.
+//
+// - For further processing, for example in an array, this data must be
+//   read in, separated and, if necessary, converted into the target format.
+//
+// - We need a buffer that is large enough to hold the data.
+//
+// - This buffer can be created either statically at compile time, if the
+//   amount of data is already known, or dynamically at runtime by using
+//   a memory allocator.
+//
+// - The data is split by a tokenizer at the respective
+//   separators and stored in the reserved memory. This usually also
+//   includes conversion to the target format.
+//
+// - Now the data can be conveniently processed further in the correct format.
+//
+// These steps are basically always the same.
+// Whether the data is read from a file or entered by the user via the
+// keyboard, for example, is irrelevant. Only the details differ,
+// and that's why Zig has different tokenizers. But more about this in
+// later exercises.
+//
+// Now we also want to write a small program to tokenize some data;
+// after all, we need some practice. Suppose we want to count the words
+// of this little poem:
+//
+//      My name is Ozymandias, King of Kings;
+//      Look on my Works, ye Mighty, and despair!
+//                                   by Percy Bysshe Shelley
+//
+//
+const std = @import("std");
+const print = std.debug.print;
+
+pub fn main() !void {
+
+    // our input
+    const poem =
+        \\My name is Ozymandias, King of Kings;
+        \\Look on my Works, ye Mighty, and despair!
+    ;
+
+    // now the tokenizer, but what do we need here?
+    var it = std.mem.tokenize(u8, poem, ???);
+
+    // print all words and count them
+    var cnt: usize = 0;
+    while (it.next()) |word| {
+        cnt += 1;
+        print("{s}\n", .{word});
+    }
+
+    // print the result
+    print("This little poem has {d} words!\n", .{cnt});
+}
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+139c139
+<     var it = std.mem.tokenize(u8, poem, ???);
+---
+>     var it = std.mem.tokenize(u8, poem, " ,;!\n");
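
Why these delimiters? The poem separates its words with spaces, commas, a semicolon, an exclamation mark and a line break, so all five characters have to be in the delimiter set for the count to come out at 15. The following is a minimal sketch of that idea, not the implementation of `std.mem.tokenize`: `isDelimiter` and `countTokens` are hypothetical helpers written only for this note, and the only standard library calls assumed are `std.mem.indexOfScalar` and `std.testing.expectEqual`. It should be runnable with `zig test`.

```zig
const std = @import("std");

// Hypothetical helper (not part of std): true if `c` is one of `delims`.
fn isDelimiter(c: u8, delims: []const u8) bool {
    return std.mem.indexOfScalar(u8, delims, c) != null;
}

// Hypothetical helper (not part of std): counts the runs of
// non-delimiter bytes in `text`. Consecutive delimiters such as ", "
// do not produce empty tokens, which mirrors the behaviour the
// exercise relies on.
fn countTokens(text: []const u8, delims: []const u8) usize {
    var count: usize = 0;
    var in_token = false;
    for (text) |c| {
        if (isDelimiter(c, delims)) {
            in_token = false;
        } else if (!in_token) {
            // First byte of a new token.
            in_token = true;
            count += 1;
        }
    }
    return count;
}

test "delimiter set from the patch yields 15 words" {
    const poem =
        \\My name is Ozymandias, King of Kings;
        \\Look on my Works, ye Mighty, and despair!
    ;
    try std.testing.expectEqual(@as(usize, 15), countTokens(poem, " ,;!\n"));
    // With a space as the only delimiter, "Kings;\nLook" is one token.
    try std.testing.expectEqual(@as(usize, 14), countTokens(poem, " "));
}
```

With a space as the only delimiter, the punctuation stays attached to the words and the two lines are joined at "Kings;\nLook", which is why the count drops to 14 and why "\n" belongs in the set.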
