This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit 307ab8e: Refine the README for TPP usage (#1498)
1 parent 1663aad commit 307ab8e

File tree: 5 files changed (+200, -22 lines)
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
This README serves as a guide to set up the backend for a code generation chatbot using the NeuralChat framework. You can deploy this code generation chatbot across various platforms, including Intel Xeon Scalable Processors, Habana's Gaudi processors (HPU), Intel Data Center and Client GPUs, and Nvidia Data Center and Client GPUs.

This example demonstrates how to deploy the code generation chatbot on Intel Xeon processors with [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) BFloat16 optimization.

# Setup Conda

First, install and configure the Conda environment:

```bash
# Download and install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda*.sh
source ~/.bashrc
conda create -n demo python=3.9
conda activate demo
```
# Install numactl

Next, install the numactl library:

```shell
sudo apt install numactl
```
# Install ITREX

```bash
git clone https://github.com/intel/intel-extension-for-transformers.git
cd ./intel-extension-for-transformers/
python setup.py install
```
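After the build finishes, a quick sanity check (a minimal sketch; it only assumes the package installs under its usual Python module name) can confirm ITREX is importable:

```bash
# Verify that intel-extension-for-transformers installed into the active conda env
python -c "import intel_extension_for_transformers; print('ITREX import OK')"
```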
# Install NeuralChat Python Dependencies

Install the NeuralChat dependencies:

```bash
pip install -r ../../../../../../requirements_cpu.txt
```

# Install Python dependencies

```bash
conda install astunparse ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses -y
conda install jemalloc gperftools -c conda-forge -y
```
# Configure the codegen.yaml

You can customize the configuration file 'codegen.yaml' to match your environment setup. Here's a table to help you understand the configurable options:

| Item                | Value                                       |
| ------------------- | ------------------------------------------- |
| host                | 0.0.0.0                                     |
| port                | 8000                                        |
| model_name_or_path  | "ise-uiuc/Magicoder-S-DS-6.7B"              |
| device              | "cpu"                                       |
| tasks_list          | ['codegen']                                 |

Note: To switch from code generation to text generation mode, adjust the model_name_or_path setting accordingly, e.g. model_name_or_path can be set to "Intel/neural-chat-7b-v3-3".
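For example, a minimal sketch of the adjusted fields in codegen.yaml for text generation (assuming text generation maps to the 'textchat' entry listed among the task choices in codegen.yaml):

```yaml
# Hypothetical adjustment for text generation mode
model_name_or_path: "Intel/neural-chat-7b-v3-3"
tasks_list: ['textchat']   # assumption: text generation uses the 'textchat' task
```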
# Run the Code Generation Chatbot Server

To start the code-generating chatbot server, use the following command:

```shell
bash run.sh
```

Note: Please adapt the core count in the commands `export OMP_NUM_THREADS=48` and `numactl -l -C 0-47 python -m run_code_gen` in the `run.sh` script to your CPU specification, which you can check with `lscpu`.
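As a rough illustration (the 32-core figures below are assumptions; substitute the values `lscpu` reports for your machine), the adjustment might look like:

```bash
# Inspect core topology to pick the thread count and core range for run.sh
lscpu | grep -E "^CPU\(s\)|Core\(s\) per socket|NUMA node"

# Example for a 32-core socket: use cores 0-31 and 32 OpenMP threads
export OMP_NUM_THREADS=32
numactl -l -C 0-31 python -m run_code_gen 2>&1 | tee run.log
```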
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is the parameter configuration file for NeuralChat Serving.

#################################################################################
#                               SERVER SETTING                                  #
#################################################################################
host: 0.0.0.0
port: 8000

model_name_or_path: "ise-uiuc/Magicoder-S-DS-6.7B"
device: "cpu"

# task choices = ['textchat', 'voicechat', 'retrieval', 'text2image', 'finetune', 'codegen']
tasks_list: ['codegen']
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Kill any existing run_code_gen process and re-run
ps -ef | grep 'run_code_gen' | awk '{print $2}' | xargs kill -9

# KMP (Intel OpenMP) tuning
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0

# OMP thread count and Intel OpenMP runtime
export OMP_NUM_THREADS=48
export LD_PRELOAD=${CONDA_PREFIX}/lib/libiomp5.so

# tcmalloc allocator
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

# Bind to local memory and cores 0-47, then launch the server
numactl -l -C 0-47 python -m run_code_gen 2>&1 | tee run.log
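Since run.sh preloads libiomp5 and libtcmalloc from the active conda environment, a quick hedged check (a sketch, assuming the `conda install mkl ...` and `conda install jemalloc gperftools ...` steps above provided these libraries) is to confirm both files exist before launching:

```bash
# Confirm the preloaded libraries are present in the conda environment
ls -l ${CONDA_PREFIX}/lib/libiomp5.so ${CONDA_PREFIX}/lib/libtcmalloc.so
```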
@@ -0,0 +1,26 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor


def main():
    # Launch the NeuralChat server with the codegen configuration and log file
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="./codegen.yaml", log_file="./codegen.log")


if __name__ == "__main__":
    main()
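Once the server is up, a simple smoke test might look like the sketch below; the endpoint path and payload are assumptions based on NeuralChat's RESTful API conventions, so check the NeuralChat API documentation for the exact route exposed by the 'codegen' task:

```bash
# Hypothetical request to the codegen endpoint on the host/port from codegen.yaml
curl -X POST http://127.0.0.1:8000/v1/code_generation \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def quick_sort(arr):"}'
```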

intel_extension_for_transformers/neural_chat/examples/deployment/codegen/backend/xeon/tpp/README.md

Lines changed: 40 additions & 22 deletions
@@ -29,8 +29,7 @@ sudo apt install numactl
 ```bash
 git clone https://github.com/intel/intel-extension-for-transformers.git
 cd ./intel-extension-for-transformers/
-pip install -r requirements.txt
-pip install -e .
+python setup.py install
 ```

 # Install NeuralChat Python Dependencies
@@ -73,21 +72,54 @@ export USE_MXFP4=1
 export KV_CACHE_INC_SIZE=512
 ```

-# Configure Multi-NumaNodes
-To use the multi-socket model parallelism with Xeon servers, you need to configure a hostfile first.
+# Configure the codegen.yaml
+
+You can customize the configuration file 'codegen.yaml' to match your environment setup. Here's a table to help you understand the configurable options:
+
+| Item                | Value                                       |
+| ------------------- | ------------------------------------------- |
+| host                | 0.0.0.0                                     |
+| port                | 8000                                        |
+| model_name_or_path  | "Phind/Phind-CodeLlama-34B-v2"              |
+| device              | "cpu"                                       |
+| use_tpp             | true                                        |
+| tasks_list          | ['codegen']                                 |
+
+Note: To switch from code generation to text generation mode, adjust the model_name_or_path setting accordingly, e.g. model_name_or_path can be set to "meta-llama/Llama-2-13b-chat-hf".
+
+# Using Single NumaNode
+To configure a single NUMA node on Xeon processors, edit the hostfile located at `../../../../../../server/config/hostfile` and set the NUMA node number to 1.
+
+## Modify hostfile
+```bash
+vim ../../../../../../server/config/hostfile
+localhost slots=1
+```
+
+## Run the Code Generation Chatbot Server
+
+To start the code-generating chatbot server, use the following command:
+
+```shell
+bash run.sh
+```
+

-Here is a example using 3 numa nodes on single socket on GNR server.
+# Configure Multi-NumaNodes
+To utilize multi-socket model parallelism on Xeon servers, you'll need to adjust the hostfile settings.
+For instance, to allocate 3 NUMA nodes on a single socket of the GNR server, modify the hostfile as shown below:

 ## Modify hostfile
 ```bash
 vim ../../../../../../server/config/hostfile
 localhost slots=3
 ```

+Afterward, run the run.sh script as previously instructed.
+

 # Configure Multi-Nodes
 To use the multi-node model parallelism with Xeon servers, you need to configure a hostfile first and make sure ssh is able between your servers.
-
 For example, you have two servers which have the IP of `192.168.1.1` and `192.168.1.2`, and each of it has 3 numa nodes on single socket.

 ## Modify hostfile
@@ -135,22 +167,8 @@ localhost slots=3
 192.168.1.2 slots=3
 ```

-# Configure the codegen.yaml
-
-You can customize the configuration file 'codegen.yaml' to match your environment setup. Here's a table to help you understand the configurable options:
-
-| Item                | Value                                    |
-| ------------------- | ---------------------------------------- |
-| host                | 0.0.0.0                                  |
-| port                | 8000                                     |
-| model_name_or_path  | "Phind/Phind-CodeLlama-34B-v2"           |
-| device              | "cpu"                                    |
-| use_tpp             | true                                     |
-| tasks_list          | ['codegen']                              |
-
-
-# Run the Code Generation Chatbot Server
-Before running the code-generating chatbot server, make sure you have already deploy the same `conda environment` and `intel-extension-for-tranformers codes` on both servers.
+## Run the Code Generation Chatbot Server
+Before running the code-generating chatbot server, make sure you have already deployed the same conda environment and intel-extension-for-transformers code on both servers.

 To start the code-generating chatbot server, use the following command:

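Assembled from the configuration table in the TPP README diff above, a hypothetical codegen.yaml for the TPP backend might look like the sketch below; it is an assumption pieced together from the table, not the exact file from the repository:

```yaml
# Hypothetical TPP codegen.yaml, following the table in the TPP README
host: 0.0.0.0
port: 8000

model_name_or_path: "Phind/Phind-CodeLlama-34B-v2"
device: "cpu"
use_tpp: true          # assumption: top-level boolean flag as listed in the table

tasks_list: ['codegen']
```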