Commit 5bc1b98

faster_tokenizers->faster_tokenizer (#2503)
Authored by joey12300 and ZeyuChen

* faster_tokenizers->faster_tokenizer
* update version
* FasterTokenizers->FasterTokenizer

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>

1 parent 5cd8fd5 · commit 5bc1b98

File tree: 12 files changed, +29 −29 lines

README_cn.md (+2 −2)

@@ -198,7 +198,7 @@ PaddleNLP targets high-frequency scenarios such as information extraction, semantic retrieval, intelligent question answering, and sentiment analysis
 
 ### High-Performance Distributed Training and Inference
 
-#### FasterTokenizers: a high-performance text processing library
+#### FasterTokenizer: a high-performance text processing library
 
 <div align="center">
     <img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -208,7 +208,7 @@
 AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
 ```
 
-For the best deployment performance, install FastTokenizers and simply enable the `use_faster=True` option on the `AutoTokenizer` API to invoke the high-performance C++ tokenization operators, easily obtaining a text-processing speedup of over 100x compared with Python. See the [FasterTokenizers documentation](./faster_tokenizers) for more usage details.
+For the best deployment performance, install FastTokenizers and simply enable the `use_faster=True` option on the `AutoTokenizer` API to invoke the high-performance C++ tokenization operators, easily obtaining a text-processing speedup of over 100x compared with Python. See the [FasterTokenizer documentation](./faster_tokenizer) for more usage details.
 
 #### ⚡️ FasterGeneration: a high-performance generation acceleration library

README_en.md (+2 −2)

@@ -199,7 +199,7 @@ For more details please refer to [Speech Command Analysis](./applications/speech
 
 ### High Performance Distributed Training and Inference
 
-#### FasterTokenizers: High Performance Text Preprocessing Library
+#### FasterTokenizer: High Performance Text Preprocessing Library
 
 <div align="center">
     <img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -209,7 +209,7 @@ For more details please refer to [Speech Command Analysis](./applications/speech
 AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
 ```
 
-Set `use_faster=True` to use the C++ Tokenizer kernel and achieve up to 100x faster text pre-processing. For more usage please refer to [FasterTokenizers](./faster_tokenizers).
+Set `use_faster=True` to use the C++ Tokenizer kernel and achieve up to 100x faster text pre-processing. For more usage please refer to [FasterTokenizer](./faster_tokenizer).
 
 #### ⚡ FasterGeneration: High Performance Generation Library
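The `use_faster=True` switch described above amounts to a registry lookup: the factory maps a model name to a slow or a fast tokenizer class depending on the flag. A minimal sketch of that dispatch; the registry and most class names here are illustrative (only `ErnieFasterTokenizer` actually appears in this diff):

```python
# Hypothetical registry mapping model names to (slow, fast) tokenizer
# class names, loosely mirroring an Auto-style factory.
REGISTRY = {
    "ernie-3.0-medium-zh": ("ErnieTokenizer", "ErnieFasterTokenizer"),
    "bert-base-chinese": ("BertTokenizer", "BertFasterTokenizer"),
}

def pick_tokenizer(name: str, use_faster: bool = False) -> str:
    # Choose the fast class only when explicitly requested.
    slow, fast = REGISTRY[name]
    return fast if use_faster else slow

print(pick_tokenizer("ernie-3.0-medium-zh", use_faster=True))  # ErnieFasterTokenizer
print(pick_tokenizer("bert-base-chinese"))                     # BertTokenizer
```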

faster_tokenizer/faster_tokenizer/demo/README.md (+7 −7)

@@ -1,23 +1,23 @@
-# FasterTokenizers Demo
+# FasterTokenizer Demo
 
 ## 1. Environment setup
 
-The current version supports Linux: Ubuntu 18.04+, CentOS 7+.
+The current version supports Linux: Ubuntu 16.04+, CentOS 7+.
 | Dependency | Version |
 |---|---|
 | cmake | >=16.0 |
 | gcc | >=8.2.0 |
 
-1. Download the FasterTokenizers prebuilt package
+1. Download the FasterTokenizer prebuilt package
 
 ```shell
-wget -c https://bj.bcebos.com/paddlenlp/faster_tokenizers/faster_tokenizers_cpp-0.1.0.tar.gz
+wget -c https://bj.bcebos.com/paddlenlp/faster_tokenizer/faster_tokenizer_cpp-0.1.0.tar.gz
 ```
 
 2. Extract the archive.
 
 ```shell
-tar xvfz faster_tokenizers_cpp-0.1.0.tar.gz
+tar xvfz faster_tokenizer_cpp-0.1.0.tar.gz
 ```
 
 ## 2. Quick start
@@ -29,8 +29,8 @@ tar xvfz faster_tokenizer_cpp-0.1.0.tar.gz
 mkdir build
 cd build
 
-# Run cmake, pointing it at the faster_tokenizers package, to generate the Makefile
-cmake .. -DFASTER_TOKENIZER_LIB=/path/to/faster_tokenizers_cpp
+# Run cmake, pointing it at the faster_tokenizer package, to generate the Makefile
+cmake .. -DFASTER_TOKENIZER_LIB=/path/to/faster_tokenizer_cpp
 
 # Build
 make

faster_tokenizer/perf/README.md (+2 −2)

@@ -1,4 +1,4 @@
-# PaddlePaddle FasterTokenizers benchmark
+# PaddlePaddle FasterTokenizer benchmark
 
 PaddleNLP v2.2.0 introduced a high-performance Transformer-style text tokenizer, PaddlePaddle FasterTokenizer. To verify its speed, PaddleNLP benchmarked it against several tokenizers commonly used in the field, chiefly the HuggingFace BertTokenizer and the TensorFlow Text BertTokenizer. The comparison tokenizes Chinese data using the bert-base-chinese model; the experimental setup is as follows:
 * [HuggingFace Tokenizers (Python)](https://github.com/huggingface/tokenizers):
@@ -26,7 +26,7 @@ import tensorflow_text as tf_text
 tf_tokenizer = tf_text.BertTokenizer(vocab)
 ```
 
-* [PaddlePaddle FasterTokenizers](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental):
+* [PaddlePaddle FasterTokenizer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental):
 
 ```python
 from paddlenlp.experimental import FasterTokenizer
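The benchmark described above times each tokenizer over a batch of Chinese text. A minimal timing harness in that spirit; the whitespace tokenizer below is a placeholder, not one of the real HuggingFace / TF-Text / FasterTokenizer objects set up in this README:

```python
import time

def benchmark(tokenize, batch, repeats=5):
    """Return the best wall-clock time (seconds) to tokenize the whole batch."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in batch:
            tokenize(text)
        # Keep the fastest of several runs to reduce timing noise.
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder tokenizer: a real run would call e.g. the tokenizer's encode method.
whitespace_tokenize = lambda s: s.split()

batch = ["这是一条用于性能测试的中文文本。"] * 1000
elapsed = benchmark(whitespace_tokenize, batch)
print(f"best of 5: {elapsed:.6f}s")
```

Taking the minimum over repeats, rather than the mean, is a common way to suppress scheduler jitter in micro-benchmarks.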

faster_tokenizer/python/faster_tokenizer/__init__.py (+1 −1)

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-__version__ = "0.1.1"
+__version__ = "0.1.2"
 
 from typing import Tuple, Union, Tuple, List

model_zoo/ernie-3.0/deploy/serving/README.md (+1 −1)

@@ -35,7 +35,7 @@ pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsingh
 ### Install the FasterTokenizer text-processing acceleration library (optional)
 If the deployment environment is Linux, installing faster_tokenizer is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
 ```
-pip install faster_tokenizers
+pip install faster_tokenizer
 ```

model_zoo/ernie-3.0/deploy/triton/README.md (+3 −3)

@@ -32,11 +32,11 @@ docker exec -it triton_server bash
 python3 -m pip install paddlenlp
 ```
 
-### Install the FasterTokenizers text-processing acceleration library (optional)
-If the deployment environment is Linux, installing faster_tokenizers is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
+### Install the FasterTokenizer text-processing acceleration library (optional)
+If the deployment environment is Linux, installing faster_tokenizer is recommended for the best text-processing efficiency and a further boost to service performance. Installation on Windows is not yet supported and will arrive in the next release.
 ```
 # Note: install inside the container
-python3 -m pip install faster_tokenizers
+python3 -m pip install faster_tokenizer
 ```

paddlenlp/transformers/__init__.py (+2 −2)

@@ -114,7 +114,7 @@
 from .auto.tokenizer import *
 
 # For faster tokenizer
-from ..utils.import_utils import is_faster_tokenizers_available
-if is_faster_tokenizers_available():
+from ..utils.import_utils import is_faster_tokenizer_available
+if is_faster_tokenizer_available():
     from .bert.faster_tokenizer import *
     from .ernie.faster_tokenizer import *

paddlenlp/transformers/auto/tokenizer.py (+5 −5)

@@ -21,7 +21,7 @@
 from paddlenlp.utils.downloader import COMMUNITY_MODEL_PREFIX, get_path_from_url
 from paddlenlp.utils.env import MODEL_HOME
 from paddlenlp.utils.log import logger
-from paddlenlp.utils.import_utils import is_faster_tokenizers_available
+from paddlenlp.utils.import_utils import is_faster_tokenizer_available
 
 __all__ = [
     "AutoTokenizer",
@@ -82,7 +82,7 @@
     ("ErnieFasterTokenizer", "ernie")
 ])
 # For FasterTokenizer
-if is_faster_tokenizers_available():
+if is_faster_tokenizer_available():
     TOKENIZER_MAPPING_NAMES.update(FASTER_TOKENIZER_MAPPING_NAMES)
 
 
@@ -188,7 +188,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args,
                 actual_tokenizer_class = tokenizer_class[0]
                 break
         if use_faster:
-            if is_faster_tokenizers_available():
+            if is_faster_tokenizer_available():
                 is_support_faster_tokenizer = False
                 for tokenizer_class in tokenizer_classes:
                     if tokenizer_class[1]:
@@ -204,8 +204,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args,
                     )
             else:
                 logger.warning(
-                    "Can't find the faster_tokenizers package, "
-                    "please ensure install faster_tokenizers correctly. "
+                    "Can't find the faster_tokenizer package, "
+                    "please ensure install faster_tokenizer correctly. "
                     "You can install faster_tokenizer by `pip install faster_tokenizer`"
                    "(Currently only work for linux platform).")
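The warning path renamed above implements a graceful fallback: when the fast package cannot be imported, the factory logs a warning and hands back the slow tokenizer instead of failing. A self-contained sketch of that control flow (the function and return values are illustrative, not PaddleNLP's API):

```python
import warnings

def pick_backend(use_faster: bool, fast_available: bool) -> str:
    # Mirror the control flow in from_pretrained: use the fast backend only
    # when it is both requested and importable; otherwise warn and fall back.
    if use_faster and not fast_available:
        warnings.warn(
            "Can't find the faster_tokenizer package, "
            "falling back to the slow tokenizer.")
        return "slow"
    return "fast" if use_faster else "slow"

print(pick_backend(use_faster=True, fast_available=False))  # prints "slow" after the warning
```

Warning-and-fallback keeps `use_faster=True` safe to leave on in code that also runs on platforms where the fast package is unavailable.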

paddlenlp/transformers/tokenizer_utils_faster.py (+2 −2)

@@ -74,7 +74,7 @@ def __init__(self, *args, **kwargs):
         else:
             raise ValueError(
                 "Couldn't instantiate the backend tokenizer from one of: \n"
-                "(1) a `faster_tokenizers` library serialization file, \n"
+                "(1) a `faster_tokenizer` library serialization file, \n"
                 "(2) a slow tokenizer instance to convert or \n"
                 "(3) an equivalent slow tokenizer class to instantiate and convert. \n"
                 "You need to have sentencepiece installed to convert a slow tokenizer to a fast one."
@@ -266,7 +266,7 @@ def set_truncation_and_padding(
         pad_to_multiple_of: Optional[int],
     ):
         """
-        Define the truncation and the padding strategies for fast tokenizers (provided by PaddleNLP's faster_tokenizers
+        Define the truncation and the padding strategies for fast tokenizers (provided by PaddleNLP's faster_tokenizer
         library) and restore the tokenizer settings afterwards.
 
         The provided tokenizer has no padding / truncation strategy before the managed section. If your tokenizer set a
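The docstring edited above describes a managed section: truncation and padding are applied for the duration of one call and the tokenizer's settings are restored afterwards. That save/apply/restore shape maps naturally onto a context manager; a generic sketch, where the dict of options stands in for the real tokenizer state:

```python
from contextlib import contextmanager

@contextmanager
def truncation_and_padding(settings, max_length, pad_to_multiple_of=None):
    # Save the current strategy, apply the new one, and always restore the
    # original settings when the managed section ends, even on error.
    saved = dict(settings)
    settings.update(max_length=max_length,
                    pad_to_multiple_of=pad_to_multiple_of)
    try:
        yield settings
    finally:
        settings.clear()
        settings.update(saved)

opts = {"max_length": None, "pad_to_multiple_of": None}
with truncation_and_padding(opts, max_length=128):
    assert opts["max_length"] == 128   # active inside the block
assert opts["max_length"] is None      # restored afterwards
```

The `finally` clause is what guarantees the "restore the tokenizer settings afterwards" contract even if encoding raises.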

paddlenlp/utils/import_utils.py (+1 −1)

@@ -15,6 +15,6 @@
 import importlib.util
 
 
-def is_faster_tokenizers_available():
+def is_faster_tokenizer_available():
     package_spec = importlib.util.find_spec("faster_tokenizer")
     return package_spec is not None and package_spec.has_location
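The renamed helper checks availability through `importlib.util.find_spec` without actually importing the package, so a missing or broken install never raises at import time. The same pattern generalized, using the stdlib `json` package as a stand-in for a package that is present:

```python
import importlib.util

def is_package_available(name: str) -> bool:
    # A package counts as available only if importlib can both find a spec
    # for it and resolve that spec to a concrete location on disk.
    spec = importlib.util.find_spec(name)
    return spec is not None and spec.has_location

print(is_package_available("json"))                  # True: stdlib, on disk
print(is_package_available("faster_tokenizer_xyz"))  # False: not installed
```

The `has_location` check filters out namespace-only or builtin specs that `find_spec` can return for names that are not real installed packages.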

setup.py (+1 −1)

@@ -61,7 +61,7 @@ def get_package_data_files(package, data, package_dir=None):
     url="https://github.com/PaddlePaddle/PaddleNLP",
     packages=setuptools.find_packages(
         where='.',
-        exclude=('examples*', 'tests*', 'applications*', 'faster_tokenizers*',
+        exclude=('examples*', 'tests*', 'applications*', 'faster_tokenizer*',
                  'faster_generation*', 'model_zoo*')),
     package_data={
         'paddlenlp.ops':

0 commit comments