Commit 5bc1b98

faster_tokenizers->faster_tokenizer (#2503)
Authored by joey12300 and ZeyuChen

* faster_tokenizers->faster_tokenizer
* update version
* FasterTokenizers->FasterTokenizer

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>

1 parent 5cd8fd5 · commit 5bc1b98

File tree: 12 files changed, +29 −29 lines

README_cn.md (+2 −2)

@@ -198,7 +198,7 @@ PaddleNLP targets high-frequency scenarios such as information extraction, semantic retrieval, intelligent question answering, and sentiment analysis
 
 ### High-Performance Distributed Training and Inference
 
-#### FasterTokenizers: a high-performance text processing library
+#### FasterTokenizer: a high-performance text processing library
 
 <div align="center">
     <img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -208,7 +208,7 @@
 AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
 ```
 
-For the best deployment performance, install FastTokenizers and simply enable the `use_faster=True` option on the `AutoTokenizer` API to invoke the high-performance C++ tokenization operators, easily obtaining a text-processing speedup of over 100x compared with Python. See the [FasterTokenizers documentation](./faster_tokenizers) for more usage details.
+For the best deployment performance, install FastTokenizers and simply enable the `use_faster=True` option on the `AutoTokenizer` API to invoke the high-performance C++ tokenization operators, easily obtaining a text-processing speedup of over 100x compared with Python. See the [FasterTokenizer documentation](./faster_tokenizer) for more usage details.
 
 #### ⚡️ FasterGeneration: a high-performance generation acceleration library

README_en.md (+2 −2)

@@ -199,7 +199,7 @@ For more details please refer to [Speech Command Analysis](./applications/speech
 
 ### High Performance Distributed Training and Inference
 
-#### FasterTokenizers: High Performance Text Preprocessing Library
+#### FasterTokenizer: High Performance Text Preprocessing Library
 
 <div align="center">
     <img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -209,7 +209,7 @@ For more details please refer to [Speech Command Analysis](./applications/speech
 AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
 ```
 
-Set `use_faster=True` to use the C++ Tokenizer kernel and achieve up to 100x faster text pre-processing. For more usage please refer to [FasterTokenizers](./faster_tokenizers).
+Set `use_faster=True` to use the C++ Tokenizer kernel and achieve up to 100x faster text pre-processing. For more usage please refer to [FasterTokenizer](./faster_tokenizer).
 
 #### ⚡ FasterGeneration: High Performance Generation Library
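The `use_faster=True` switch described above amounts to a registry lookup: the factory maps a model name to a slow or a fast tokenizer class depending on the flag. A minimal sketch of that dispatch; the registry and most class names here are illustrative (only `ErnieFasterTokenizer` actually appears in this diff):

```python
# Hypothetical registry mapping model names to (slow, fast) tokenizer
# class names, loosely mirroring an Auto-style factory.
REGISTRY = {
    "ernie-3.0-medium-zh": ("ErnieTokenizer", "ErnieFasterTokenizer"),
    "bert-base-chinese": ("BertTokenizer", "BertFasterTokenizer"),
}

def pick_tokenizer(name: str, use_faster: bool = False) -> str:
    # Choose the fast class only when explicitly requested.
    slow, fast = REGISTRY[name]
    return fast if use_faster else slow

print(pick_tokenizer("ernie-3.0-medium-zh", use_faster=True))  # ErnieFasterTokenizer
print(pick_tokenizer("bert-base-chinese"))                     # BertTokenizer
```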

faster_tokenizer/faster_tokenizer/demo/README.md (+7 −7)

@@ -1,23 +1,23 @@
-# FasterTokenizers Demo
+# FasterTokenizer Demo
 
 ## 1. Environment setup
 
-The current version supports Linux: Ubuntu 18.04+, CentOS 7+.
+The current version supports Linux: Ubuntu 16.04+, CentOS 7+.
 | Dependency | Version |
 |---|---|
 | cmake | >=16.0 |
 | gcc | >=8.2.0 |
 
-1. Download the FasterTokenizers prebuilt package
+1. Download the FasterTokenizer prebuilt package
 
 ```shell
-wget -c https://bj.bcebos.com/paddlenlp/faster_tokenizers/faster_tokenizers_cpp-0.1.0.tar.gz
+wget -c https://bj.bcebos.com/paddlenlp/faster_tokenizer/faster_tokenizer_cpp-0.1.0.tar.gz
 ```
 
 2. Extract the archive.
 
 ```shell
-tar xvfz faster_tokenizers_cpp-0.1.0.tar.gz
+tar xvfz faster_tokenizer_cpp-0.1.0.tar.gz
 ```
 
 ## 2. Quick start
@@ -29,8 +29,8 @@ tar xvfz faster_tokenizer_cpp-0.1.0.tar.gz
 mkdir build
 cd build
 
-# Run cmake, pointing it at the faster_tokenizers package, to generate the Makefile
-cmake .. -DFASTER_TOKENIZER_LIB=/path/to/faster_tokenizers_cpp
+# Run cmake, pointing it at the faster_tokenizer package, to generate the Makefile
+cmake .. -DFASTER_TOKENIZER_LIB=/path/to/faster_tokenizer_cpp
 
 # Build
 make

faster_tokenizer/perf/README.md (+2 −2)

@@ -1,4 +1,4 @@
-# PaddlePaddle FasterTokenizers benchmark
+# PaddlePaddle FasterTokenizer benchmark
 
 PaddleNLP v2.2.0 introduced a high-performance Transformer-style text tokenizer, PaddlePaddle FasterTokenizer. To verify its speed, PaddleNLP benchmarked it against several tokenizers commonly used in the field, chiefly the HuggingFace BertTokenizer and the TensorFlow Text BertTokenizer. The comparison tokenizes Chinese data using the bert-base-chinese model; the experimental setup is as follows:
 * [HuggingFace Tokenizers (Python)](https://github.com/huggingface/tokenizers):
@@ -26,7 +26,7 @@ import tensorflow_text as tf_text
 tf_tokenizer = tf_text.BertTokenizer(vocab)
 ```
 
-* [PaddlePaddle FasterTokenizers](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental):
+* [PaddlePaddle FasterTokenizer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental):
 
 ```python
 from paddlenlp.experimental import FasterTokenizer
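The benchmark described above times each tokenizer over a batch of Chinese text. A minimal timing harness in that spirit; the whitespace tokenizer below is a placeholder, not one of the real HuggingFace / TF-Text / FasterTokenizer objects set up in this README:

```python
import time

def benchmark(tokenize, batch, repeats=5):
    """Return the best wall-clock time (seconds) to tokenize the whole batch."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in batch:
            tokenize(text)
        # Keep the fastest of several runs to reduce timing noise.
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder tokenizer: a real run would call e.g. the tokenizer's encode method.
whitespace_tokenize = lambda s: s.split()

batch = ["这是一条用于性能测试的中文文本。"] * 1000
elapsed = benchmark(whitespace_tokenize, batch)
print(f"best of 5: {elapsed:.6f}s")
```

Taking the minimum over repeats, rather than the mean, is a common way to suppress scheduler jitter in micro-benchmarks.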

faster_tokenizer/python/faster_tokenizer/__init__.py (+1 −1)

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-__version__ = "0.1.1"
+__version__ = "0.1.2"
 
 from typing import Tuple, Union, Tuple, List

model_zoo/ernie-3.0/deploy/serving/README.md (+1 −1)

@@ -35,7 +35,7 @@ pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsingh
 ### Install the FasterTokenizer text-processing acceleration library (optional)
 If the deployment environment is Linux, installing faster_tokenizer is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
 ```
-pip install faster_tokenizers
+pip install faster_tokenizer
 ```

model_zoo/ernie-3.0/deploy/triton/README.md (+3 −3)

@@ -32,11 +32,11 @@ docker exec -it triton_server bash
 python3 -m pip install paddlenlp
 ```
 
-### Install the FasterTokenizers text-processing acceleration library (optional)
-If the deployment environment is Linux, installing faster_tokenizers is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
+### Install the FasterTokenizer text-processing acceleration library (optional)
+If the deployment environment is Linux, installing faster_tokenizer is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
 ```
 # Note: install inside the container
-python3 -m pip install faster_tokenizers
+python3 -m pip install faster_tokenizer
 ```

paddlenlp/transformers/__init__.py (+2 −2)

@@ -114,7 +114,7 @@
 from .auto.tokenizer import *
 
 # For faster tokenizer
-from ..utils.import_utils import is_faster_tokenizers_available
-if is_faster_tokenizers_available():
+from ..utils.import_utils import is_faster_tokenizer_available
+if is_faster_tokenizer_available():
     from .bert.faster_tokenizer import *
     from .ernie.faster_tokenizer import *

paddlenlp/transformers/auto/tokenizer.py (+5 −5)

@@ -21,7 +21,7 @@
 from paddlenlp.utils.downloader import COMMUNITY_MODEL_PREFIX, get_path_from_url
 from paddlenlp.utils.env import MODEL_HOME
 from paddlenlp.utils.log import logger
-from paddlenlp.utils.import_utils import is_faster_tokenizers_available
+from paddlenlp.utils.import_utils import is_faster_tokenizer_available
 
 __all__ = [
     "AutoTokenizer",
@@ -82,7 +82,7 @@
     ("ErnieFasterTokenizer", "ernie")
 ])
 # For FasterTokenizer
-if is_faster_tokenizers_available():
+if is_faster_tokenizer_available():
     TOKENIZER_MAPPING_NAMES.update(FASTER_TOKENIZER_MAPPING_NAMES)
 
 
@@ -188,7 +188,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args,
                 actual_tokenizer_class = tokenizer_class[0]
                 break
         if use_faster:
-            if is_faster_tokenizers_available():
+            if is_faster_tokenizer_available():
                 is_support_faster_tokenizer = False
                 for tokenizer_class in tokenizer_classes:
                     if tokenizer_class[1]:
@@ -204,8 +204,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args,
                     )
             else:
                 logger.warning(
-                    "Can't find the faster_tokenizers package, "
-                    "please ensure install faster_tokenizers correctly. "
+                    "Can't find the faster_tokenizer package, "
+                    "please ensure install faster_tokenizer correctly. "
                     "You can install faster_tokenizer by `pip install faster_tokenizer`"
                    "(Currently only work for linux platform).")
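The warning path renamed above implements a graceful fallback: when the fast package cannot be imported, the factory logs a warning and hands back the slow tokenizer instead of failing. A self-contained sketch of that control flow (the function and return values are illustrative, not PaddleNLP's API):

```python
import warnings

def pick_backend(use_faster: bool, fast_available: bool) -> str:
    # Mirror the control flow in from_pretrained: use the fast backend only
    # when it is both requested and importable; otherwise warn and fall back.
    if use_faster and not fast_available:
        warnings.warn(
            "Can't find the faster_tokenizer package, "
            "falling back to the slow tokenizer.")
        return "slow"
    return "fast" if use_faster else "slow"

print(pick_backend(use_faster=True, fast_available=False))  # prints "slow" after the warning
```

Warning-and-fallback keeps `use_faster=True` safe to leave on in code that also runs on platforms where the fast package is unavailable.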

paddlenlp/transformers/tokenizer_utils_faster.py (+2 −2)

@@ -74,7 +74,7 @@ def __init__(self, *args, **kwargs):
         else:
             raise ValueError(
                 "Couldn't instantiate the backend tokenizer from one of: \n"
-                "(1) a `faster_tokenizers` library serialization file, \n"
+                "(1) a `faster_tokenizer` library serialization file, \n"
                 "(2) a slow tokenizer instance to convert or \n"
                 "(3) an equivalent slow tokenizer class to instantiate and convert. \n"
                 "You need to have sentencepiece installed to convert a slow tokenizer to a fast one."
@@ -266,7 +266,7 @@ def set_truncation_and_padding(
         pad_to_multiple_of: Optional[int],
     ):
         """
-        Define the truncation and the padding strategies for fast tokenizers (provided by PaddleNLP's faster_tokenizers
+        Define the truncation and the padding strategies for fast tokenizers (provided by PaddleNLP's faster_tokenizer
         library) and restore the tokenizer settings afterwards.
 
         The provided tokenizer has no padding / truncation strategy before the managed section. If your tokenizer set a
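The docstring edited above describes a managed section: truncation and padding are applied for the duration of one call and the tokenizer's settings are restored afterwards. That save/apply/restore shape maps naturally onto a context manager; a generic sketch, where the dict of options stands in for the real tokenizer state:

```python
from contextlib import contextmanager

@contextmanager
def truncation_and_padding(settings, max_length, pad_to_multiple_of=None):
    # Save the current strategy, apply the new one, and always restore the
    # original settings when the managed section ends, even on error.
    saved = dict(settings)
    settings.update(max_length=max_length,
                    pad_to_multiple_of=pad_to_multiple_of)
    try:
        yield settings
    finally:
        settings.clear()
        settings.update(saved)

opts = {"max_length": None, "pad_to_multiple_of": None}
with truncation_and_padding(opts, max_length=128):
    assert opts["max_length"] == 128   # active inside the block
assert opts["max_length"] is None      # restored afterwards
```

The `finally` clause is what guarantees the "restore the tokenizer settings afterwards" contract even if encoding raises.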

paddlenlp/utils/import_utils.py (+1 −1)

@@ -15,6 +15,6 @@
 import importlib.util
 
 
-def is_faster_tokenizers_available():
+def is_faster_tokenizer_available():
     package_spec = importlib.util.find_spec("faster_tokenizer")
     return package_spec is not None and package_spec.has_location
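The renamed helper checks availability through `importlib.util.find_spec` without actually importing the package, so a missing or broken install never raises at import time. The same pattern generalized, using the stdlib `json` package as a stand-in for a package that is present:

```python
import importlib.util

def is_package_available(name: str) -> bool:
    # A package counts as available only if importlib can both find a spec
    # for it and resolve that spec to a concrete location on disk.
    spec = importlib.util.find_spec(name)
    return spec is not None and spec.has_location

print(is_package_available("json"))                  # True: stdlib, on disk
print(is_package_available("faster_tokenizer_xyz"))  # False: not installed
```

The `has_location` check filters out namespace-only or builtin specs that `find_spec` can return for names that are not real installed packages.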

setup.py (+1 −1)

@@ -61,7 +61,7 @@ def get_package_data_files(package, data, package_dir=None):
     url="https://github.com/PaddlePaddle/PaddleNLP",
     packages=setuptools.find_packages(
         where='.',
-        exclude=('examples*', 'tests*', 'applications*', 'faster_tokenizers*',
+        exclude=('examples*', 'tests*', 'applications*', 'faster_tokenizer*',
                  'faster_generation*', 'model_zoo*')),
     package_data={
         'paddlenlp.ops':

0 commit comments