Chatbot Rasa (Part 2): Chinese Support

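Out of the box, a fresh Rasa installation does not support Chinese conversations.

Approach to learning and configuration

The example pipeline configurations I found all differ from one another, and without trying them hands-on it is hard to judge which is better.

So I start from the simplest configuration that runs, for example the Chinese pipeline recommended in the book 《Rasa 实战：构建开源对话机器人》.
The book includes an NLU configuration for a medical chatbot. It covers only the NLU part, that is, intent and entity recognition, and has no response configuration.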

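Result

[Screenshot: Rasa Chinese chatbot]

Implemented as a web widget built on the Rasa websocket channel.

The simplest Chinese configuration

Open config.yml in the project root and change it as follows: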

recipe: default.v1

language: zh

pipeline:
  - name: JiebaTokenizer
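  - name: LanguageModelFeaturizer
    model_name: "bert"
    # The option key is model_weights (plural). If it is misspelled, Rasa logs
    # "Model weights not specified" and falls back to the default rasa/LaBSE.
    model_weights: "bert-base-chinese"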
  - name: "DIETClassifier"

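language must be changed from en to zh, that is, Chinese.
For the pipeline, see my compiled list of Rasa NLU pipeline components.
For what each component does and how they differ, see my post on the roles and differences of JiebaTokenizer, LanguageModelFeaturizer, and DIETClassifier in Rasa.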
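
A misspelled option key (for example model_weight instead of model_weights) can go unnoticed, in which case the featurizer falls back to its default weights. The following is a minimal sketch of a sanity check, not part of Rasa: check_pipeline and KNOWN_KEYS are hypothetical names, and the allow-list covers only the options used in this post, so any other valid Rasa option would also be flagged.

```python
from typing import Dict, List, Set

# Hypothetical allow-list: only the options used in this post,
# NOT the full set of options Rasa accepts for these components.
KNOWN_KEYS: Dict[str, Set[str]] = {
    "JiebaTokenizer": {"name"},
    "LanguageModelFeaturizer": {"name", "model_name", "model_weights"},
    "DIETClassifier": {"name"},
}


def check_pipeline(pipeline: List[dict]) -> List[str]:
    """Return one warning per option key that is not in KNOWN_KEYS."""
    warnings = []
    for component in pipeline:
        name = component.get("name", "<unnamed>")
        allowed = KNOWN_KEYS.get(name)
        if allowed is None:
            continue  # component not covered by this sketch
        for key in component:
            if key not in allowed:
                warnings.append(f"{name}: unknown option '{key}'")
    return warnings


# The pipeline from config.yml, written as plain Python dicts,
# with the featurizer key deliberately misspelled.
pipeline = [
    {"name": "JiebaTokenizer"},
    {
        "name": "LanguageModelFeaturizer",
        "model_name": "bert",
        "model_weight": "bert-base-chinese",
    },
    {"name": "DIETClassifier"},
]

for warning in check_pipeline(pipeline):
    print(warning)
```

In a real project the dicts would come from loading config.yml with a YAML parser; they are written inline here to keep the sketch free of third-party imports.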

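What is NLU

NLU stands for Natural Language Understanding.

What NLU does in Rasa:

The Rasa NLU module parses user input and extracts key information such as entities and intents. Custom modules, for example sentiment analysis, can also be added.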
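
To make "intent and entity" concrete, here is a small sketch that pulls the top intent and the entity values out of a parse result. summarize_parse is a hypothetical helper, not a Rasa function. The intent values are taken from the test output later in this post; the empty entities list is added here for illustration.

```python
from typing import Any, Dict, List, Tuple


def summarize_parse(result: Dict[str, Any]) -> Tuple[str, float, List[str]]:
    """Return (intent name, confidence, entity values) from an NLU parse result."""
    intent = result.get("intent") or {}
    entities = [str(entity.get("value")) for entity in result.get("entities", [])]
    return intent.get("name", ""), float(intent.get("confidence", 0.0)), entities


# Shape of a parse result as printed by `rasa shell nlu`.
example = {
    "text": "你好",
    "intent": {"name": "greet", "confidence": 0.9999979734420776},
    "entities": [],  # illustrative: no entities in a plain greeting
}

name, confidence, entities = summarize_parse(example)
print(f"intent={name} confidence={confidence:.3f} entities={entities}")
```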

Configuring nlu.yml

Edit data/nlu.yml and add some Chinese examples on top of the existing English ones.

version: "3.1"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon
    - 你好！
    - 您好！
    - 在么！
    - 在吗！
    - 喂！

- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later
    - 拜拜！
    - 再见！
    - 拜！
    - 退出。
    - 结束。
    - exit

- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct
    - 是的
    - 是

- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really
    - 不
    - 不是的
    - 不是
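
When mixing languages in one file it is easy to forget an intent. The sketch below counts, per intent, how many examples contain Chinese characters. It is a deliberately naive line-based reader that assumes the simple "- intent:" / "examples: |" layout shown above; count_chinese_examples is a hypothetical helper, and a real project would more likely load the file with a YAML parser.

```python
import re
from typing import Dict

# Basic CJK Unified Ideographs block; enough for everyday Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def count_chinese_examples(nlu_text: str) -> Dict[str, int]:
    """Count, per intent, the examples that contain at least one CJK character."""
    counts: Dict[str, int] = {}
    intent = None
    for line in nlu_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- intent:"):
            intent = stripped.split(":", 1)[1].strip()
            counts.setdefault(intent, 0)  # intents without Chinese examples still appear
        elif intent is not None and stripped.startswith("- ") and CJK.search(stripped):
            counts[intent] += 1
    return counts


sample = """
nlu:
- intent: greet
  examples: |
    - hello
    - 你好！
    - 您好！
- intent: deny
  examples: |
    - no
"""

for intent_name, count in count_chinese_examples(sample).items():
    print(f"{intent_name}: {count} Chinese example(s)")
```

An intent that reports 0 is a candidate for more Chinese training data.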

Retraining the model

The yml files under the data directory, such as nlu.yml, hold the training data.

rasa train nlu

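During training a 1.88 GB tf_model.h5 was downloaded, which is surprisingly large. The file belongs to the pretrained language model loaded by LanguageModelFeaturizer. BERT (Bidirectional Encoder Representations from Transformers) is a model built on the Transformer architecture; it learns text representations that can be used for tasks such as text classification, named entity recognition, and question answering. BERT itself is not tied to TensorFlow: tf_model.h5 is the TensorFlow version of the pretrained weights, and the .h5 extension marks it as an HDF5 file. Judging from the "Will choose default model weights: rasa/LaBSE" line in the training log further down, the weights fetched were probably rasa/LaBSE.

The trained model file, however, is only 20 MB.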

> ls -lah models/
total 44M
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:35 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:03 ../
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*

Test:

rasa shell nlu

Test results

greet intent, i.e. the greeting intent:

Next message:
你好
{
  "text": "你好",
  "intent": {
    "name": "greet",
    "confidence": 0.9999979734420776
  },

goodbye intent, i.e. the farewell intent:

Next message:
再见
{
  "text": "再见",
  "intent": {
    "name": "goodbye",
    "confidence": 0.9999972581863403
  },

These two results are as expected and at least show that Chinese is now supported. With the default en setting, Chinese input got no response at all.
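
The same checks can be scripted instead of typed into the shell. The sketch below assumes a Rasa server started separately with "rasa run --enable-api" on the default port 5005 and uses its /model/parse HTTP endpoint; the helper names build_parse_request and parse_message are made up for this example. Only the request is built and printed here, so the snippet runs without a server.

```python
import json
import urllib.request


def build_parse_request(
    text: str, base_url: str = "http://localhost:5005"
) -> urllib.request.Request:
    """Build a POST request for the /model/parse endpoint of a Rasa server."""
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/model/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_message(text: str, base_url: str = "http://localhost:5005") -> dict:
    """Send the text to a running server and return the parse result as a dict."""
    request = build_parse_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without sending it.
request = build_parse_request("你好")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, parse_message("你好") should return a dict with the same fields that rasa shell nlu prints.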

What surprised me more was the intent recognized for this message:

Next message:
我拒绝
{
  "text": "我拒绝",
  "intent": {
    "name": "deny",
    "confidence": 0.9226003289222717
  },

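I never put the word “拒绝” (refuse) into the deny examples, yet it was still recognized correctly. This suggests that a pretrained language model covering Chinese is involved. At the time I did not know which pipeline component brings it in; LanguageModelFeaturizer is the likely candidate, since it is the only component in this pipeline that loads pretrained weights.
I will look into this later.

There are also less satisfying cases: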

Next message:
你好啊
{
  "text": "好啊",
  "intent": {
    "name": "affirm",
    "confidence": 0.4897577464580536
  },
  "entities": [],
  "text_tokens": [
    [
      0,
      1
    ],
    [
      1,
      2
    ]
  ],
  "intent_ranking": [
    {
      "name": "affirm",
      "confidence": 0.4897577464580536
    },
    {
      "name": "greet",
      "confidence": 0.34744495153427124
    },

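The top candidate should have been greet, but the message was classified as affirm. The model is still not very smart, but it basically meets my needs.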
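
One way to cope with such borderline results is to treat a prediction as uncertain when the top confidence is low or too close to the runner-up. The sketch below illustrates that idea only; is_ambiguous and both thresholds are hypothetical values chosen for this example, not Rasa defaults. Rasa also ships a FallbackClassifier component with threshold options that can serve a similar purpose.

```python
from typing import Any, Dict, List


def is_ambiguous(
    ranking: List[Dict[str, Any]],
    min_confidence: float = 0.7,  # hypothetical threshold
    min_margin: float = 0.2,      # hypothetical gap between first and second
) -> bool:
    """Return True if the top intent is weak or too close to the second one."""
    if not ranking:
        return True
    ordered = sorted(ranking, key=lambda item: item["confidence"], reverse=True)
    top = ordered[0]["confidence"]
    second = ordered[1]["confidence"] if len(ordered) > 1 else 0.0
    return top < min_confidence or (top - second) < min_margin


# intent_ranking values from the output above.
ranking = [
    {"name": "affirm", "confidence": 0.4897577464580536},
    {"name": "greet", "confidence": 0.34744495153427124},
]

if is_ambiguous(ranking):
    print("Low-confidence prediction: ask the user to rephrase.")
else:
    print("Confident prediction:", ranking[0]["name"])
```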

Supporting Chinese replies

The NLU training above only adds Chinese parsing; it does not provide Chinese replies.

Add Chinese responses to domain.yml:

version: "3.1"

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge

responses:
  utter_greet:
  - text: "你好！吃了么？"

  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"

  utter_did_that_help:
  - text: "Did that help you?"

  utter_happy:
  - text: "Great, carry on!"

  utter_goodbye:
  - text: "再见"

  utter_iamabot:
  - text: "我是一个机器人，你可以叫我小远子"

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true

Retraining

The model produced earlier by rasa train nlu only parses messages and contains no reply logic, so a full retrain is needed.

Note: do not pass the nlu argument:

> rasa train

The configuration for policies was chosen automatically. It was written into the config file at 'config.yml'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Starting to train component 'JiebaTokenizer'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Finished training component 'JiebaTokenizer'.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.493 seconds.
Prefix dict has been built successfully.
2023-04-08 09:43:10 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2023-04-08 09:43:39 INFO     rasa.engine.training.hooks  - Starting to train component 'DIETClassifier'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
Epochs: 100% 300/300 [00:32<00:00,  9.16it/s, t_loss=0.282, i_acc=1]
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Finished training component 'DIETClassifier'.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'MemoizationPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'RulePolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'TEDPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'UnexpecTEDIntentPolicy' from cache.
Your Rasa model is trained and saved at 'models/20230408-094308-burning-dessert.tar.gz'.

The models directory now contains a new model file without the nlu prefix; it is 4 MB larger than the previous one.

> ls -lah models/
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  8 09:44 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 17:28 ../
-rwxrwxrwx 1 zhongwei zhongwei  24M Apr  8 09:44 20230408-094308-burning-dessert.tar.gz*
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
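
With several archives piling up in models/, a script may need to know which one is newest. This is a small sketch for your own tooling (latest_model is a hypothetical helper); Rasa chooses which model to load by itself and does not need it.

```python
from pathlib import Path
from typing import Optional


def latest_model(models_dir: str = "models") -> Optional[Path]:
    """Return the most recently modified .tar.gz in models_dir, or None."""
    directory = Path(models_dir)
    if not directory.is_dir():
        return None
    archives = list(directory.glob("*.tar.gz"))
    if not archives:
        return None
    return max(archives, key=lambda path: path.stat().st_mtime)


newest = latest_model()
print(newest if newest is not None else "No model archives found.")
```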

rasa shell

Starting rasa shell again also brings up the Rasa server and loads the newly trained model file.

> rasa shell
2023-04-08 09:46:57 INFO     root  - Connecting to channel 'cmdline' which was specified by the '--connector' argument. Any other channels will be ignored. To connect to all given channels, omit the '--connector' argument.
2023-04-08 09:46:57 INFO     root  - Starting Rasa server on http://0.0.0.0:5005
2023-04-08 09:46:57 INFO     rasa.core.processor  - Loading model models/20230408-094308-burning-dessert.tar.gz...
2023-04-08 09:46:59 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
2023-04-08 09:47:43 WARNING  rasa.shared.utils.common  - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production.
2023-04-08 09:47:50 INFO     root  - Rasa server is up and running.
Bot loaded. Type a message and press enter (use '/stop' to exit):

Chinese conversation test

Your input ->  你好
你好！吃了么？

Your input ->  你是机器人么
我是一个机器人，你可以叫我小远子

Your input ->  你是谁
我是一个机器人，你可以叫我小远子

Chinese replies now work as expected.
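
The conversation can also be driven from a script through the REST channel. The sketch below assumes the server was started with "rasa run", that the rest channel is enabled in credentials.yml, and that it listens on the default port 5005. build_message_request, extract_texts, and send_message are hypothetical helper names. Nothing is sent when the snippet runs; the reply list is a hand-written sample in the shape the REST channel returns, using the greeting from above.

```python
import json
import urllib.request
from typing import List


def build_message_request(
    message: str,
    sender: str = "test_user",
    base_url: str = "http://localhost:5005",
) -> urllib.request.Request:
    """Build a POST request for the REST channel webhook."""
    payload = {"sender": sender, "message": message}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/webhooks/rest/webhook",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_texts(replies: List[dict]) -> List[str]:
    """Keep only the text parts of the bot replies (image-only replies are skipped)."""
    return [reply["text"] for reply in replies if "text" in reply]


def send_message(message: str, sender: str = "test_user") -> List[str]:
    """Send one message to a running server and return the bot's text replies."""
    request = build_message_request(message, sender)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_texts(json.loads(response.read().decode("utf-8")))


# Hand-written sample reply, shown without contacting a server.
sample_replies = [{"recipient_id": "test_user", "text": "你好！吃了么？"}]
print(extract_texts(sample_replies))
```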

rasa train nlu errors

rasa.engine.exceptions.GraphSchemaValidationException: Component 'JiebaTokenizer' requires the following packages which are currently not installed: jieba.

Fix:

pip3 install jieba

rasa.engine.exceptions.GraphSchemaValidationException: Component 'LanguageModelFeaturizer' requires the following packages which are currently not installed: transformers.

Fix:

pip3 install transformers
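
Both errors can be caught before training starts. The sketch below uses the standard library to check whether the two packages are importable in the current Python environment; missing_packages is a hypothetical helper and must be run with the same interpreter that runs Rasa.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["jieba", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip3 install " + " ".join(missing))
else:
    print("jieba and transformers are both available.")
```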

Workaround when Hugging Face is unreachable

From mainland China you may be unable to download models from Hugging Face. See:

https://www.zhihu.com/question/599683557/answer/3352307859

References

https://rasa.com/docs/rasa/language-support/

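About the author 🌱

I am a developer from Yantai, Shandong. If you have an interesting topic or a software development need, feel free to add me on WeChat (zhongwei) for a chat, or follow my personal WeChat official account “大象工具” for more ways to get in touch.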