Commit 43ec72f

fix(textprocessing): improve English tokenizer
1 parent e57c419 commit 43ec72f

1 file changed: +1 -1 lines changed

src/textprocessing.jl

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ function tokenizer(text::AbstractString, regexp=r"\w+")
     [text[i] for i in findall(regexp, text)]
 end
 
-function tokenizer_eng(text::AbstractString, regexp=r"\w[\w']*")
+function tokenizer_eng(text::AbstractString, regexp=r"\b\w+(?:'\w+)*\b")
     indices = findall(regexp, text)
     [endswith(text[i], "'s") ? text[i][1:prevind(text[i], end, 2)] : text[i] for i in indices]
 end
