Chatbot Rasa (2): Chinese Support

Published: 2023-04-07 16:54:25 Author: 大象笔记

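Out of the box, a fresh Rasa installation does not support conversations in Chinese.

Strategy for learning and configuring

The pipeline configurations in the examples I found all differ from one another, and without trying them hands-on it is hard to tell which is better.

So I started from the simplest configuration that runs: the Chinese pipeline recommended in the book 《Rasa 实战:构建开源对话机器人》, which includes a sample NLU configuration for a medical bot. It covers only the NLU part, that is, recognizing intents and entities, and has no response configuration.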

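Result

Implemented with a web widget built on the Rasa websocket channel.

The simplest Chinese configuration

Open the config.yml file in the project root and change it as follows: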

recipe: default.v1

language: zh

pipeline:
  - name: JiebaTokenizer
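  - name: LanguageModelFeaturizer
    model_name: "bert"
    # The option name is model_weights (plural). The runs shown in this post used
    # the misspelled key model_weight, which Rasa ignored, so the logs below
    # report the default weights rasa/LaBSE instead of bert-base-chinese.
    model_weights: "bert-base-chinese"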
  - name: "DIETClassifier"
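
Option names in config.yml are easy to misspell, and judging by the training logs later in this post (which report "Model weights not specified"), an unrecognized key can go unnoticed. As a hedged illustration, the sketch below flags unknown keys in one pipeline entry. The set of expected keys is my assumption about what LanguageModelFeaturizer reads, and check_options is a hypothetical helper, not part of Rasa.

```python
import difflib
from typing import List

# Assumed option names for LanguageModelFeaturizer; verify them against the
# Rasa documentation for the version you run.
EXPECTED_KEYS = {"name", "model_name", "model_weights", "cache_dir"}


def check_options(component: dict) -> List[str]:
    """Return human-readable warnings for keys that are not in EXPECTED_KEYS."""
    warnings = []
    for key in component:
        if key in EXPECTED_KEYS:
            continue
        # Suggest the closest expected key when one is reasonably similar.
        close = difflib.get_close_matches(key, sorted(EXPECTED_KEYS), n=1)
        hint = f" (did you mean '{close[0]}'?)" if close else ""
        warnings.append(f"unknown option '{key}'{hint}")
    return warnings


component = {
    "name": "LanguageModelFeaturizer",
    "model_name": "bert",
    "model_weight": "bert-base-chinese",  # singular: not in the expected set
}
for warning in check_options(component):
    print(warning)
```

The pipeline entry is written here as a plain dict so the sketch needs no YAML library; in practice it would come from loading config.yml.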

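What is NLU

NLU stands for Natural Language Understanding.

What NLU does in Rasa:

The Rasa NLU module parses user input and extracts key information such as intents and entities. Custom components, for example sentiment analysis, can also be added.

Configuring nlu.yml

Edit data/nlu.yml and add some Chinese examples to the existing English training data.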

version: "3.1"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon
    - 你好!
    - 您好!
    - 在么!
    - 在吗!
    - 喂!

- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later
    - 拜拜!
    - 再见!
    - 拜!
    - 退出。
    - 结束。
    - exit

- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct
    - 是的
    - 是

- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really
    - 不
    - 不是的
    - 不是
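
When mixing languages in one file, it can help to see how many examples each intent has and how many of them are Chinese. The following is a rough sketch under simple assumptions: it only understands the flat layout used above (an "- intent:" line followed by four-space-indented "- " examples), and count_examples is a hypothetical helper. A real YAML parser would be more robust.

```python
import re
from collections import defaultdict

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs range


def count_examples(nlu_text):
    """Map intent -> [total examples, examples containing Chinese characters]."""
    counts = defaultdict(lambda: [0, 0])
    intent = None
    for line in nlu_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- intent:"):
            intent = stripped.split(":", 1)[1].strip()
        elif intent and line.startswith("    - "):
            # An example line belonging to the current intent.
            counts[intent][0] += 1
            if CJK.search(stripped):
                counts[intent][1] += 1
    return dict(counts)


sample = """\
nlu:
- intent: greet
  examples: |
    - hey
    - 你好!
    - 在吗!
- intent: deny
  examples: |
    - no
    - 不是
"""
for name, (total, chinese) in count_examples(sample).items():
    print(f"{name}: {total} examples, {chinese} with Chinese characters")
```

To run it on the real file, pass the contents of data/nlu.yml read with encoding="utf-8".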

Retraining the model

The yml files under the data directory, such as nlu.yml, hold the training data.

rasa train nlu

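During training, a 1.88G tf_model.h5 was downloaded, which is surprisingly large. (The file comes with the BERT model. BERT, Bidirectional Encoder Representations from Transformers, uses the Transformer architecture to learn text representations and can be applied to natural language processing tasks such as text classification, named entity recognition, and question answering. BERT itself is not tied to one framework; tf_model.h5 holds pretrained weights saved for TensorFlow, a widely used machine learning framework for training and deploying deep learning models, and the .h5 suffix means the file is in HDF5 format.)

The trained model file, however, is only 20M.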

> ls -lah models/
total 44M
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:35 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:03 ../
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*

Test:

rasa shell nlu
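
Besides the interactive shell, a trained model can also be queried over HTTP when the server is started with rasa run --enable-api, which exposes a /model/parse endpoint. The sketch below is a minimal illustration assuming the server listens on localhost at the default port 5005; build_parse_request and parse_message are hypothetical helper names. Only the request is built when the script runs, nothing is sent.

```python
import json
import urllib.request


def build_parse_request(text, base_url="http://localhost:5005"):
    """Build a POST request for Rasa's /model/parse endpoint (not sent yet)."""
    payload = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/model/parse",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_message(text, base_url="http://localhost:5005"):
    """Send the text to a running Rasa server and return the parsed JSON."""
    request = build_parse_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_parse_request("你好")
print(request.full_url)  # the request is only constructed here, not sent
```

With a server running, parse_message("你好") should return the same kind of JSON that rasa shell nlu prints.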

Test results

The greet intent, that is, a greeting:

Next message:
你好
{
  "text": "你好",
  "intent": {
    "name": "greet",
    "confidence": 0.9999979734420776
  },

The goodbye intent, that is, saying goodbye:

Next message:
再见
{
  "text": "再见",
  "intent": {
    "name": "goodbye",
    "confidence": 0.9999972581863403
  },

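Both results are as expected, and they at least show that Chinese is now supported. With the default en setting, Chinese input got no reply at all.

What surprised me more was the intent recognized for this one: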

Next message:
我拒绝
{
  "text": "我拒绝",
  "intent": {
    "name": "deny",
    "confidence": 0.9226003289222717
  },

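I never put the word 拒绝 ("refuse") in the training examples for the deny intent, yet it was still recognized accurately. That suggests a pretrained language model with Chinese coverage is involved. I did not know at this point which pipeline entry brings it in and meant to look into it later; judging by the training log further down, it is the LanguageModelFeaturizer, which loads pretrained weights (rasa/LaBSE in that run).

Some results were less satisfying: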

Next message:
你好啊
{
  "text": "好啊",
  "intent": {
    "name": "affirm",
    "confidence": 0.4897577464580536
  },
  "entities": [],
  "text_tokens": [
    [
      0,
      1
    ],
    [
      1,
      2
    ]
  ],
  "intent_ranking": [
    {
      "name": "affirm",
      "confidence": 0.4897577464580536
    },
    {
      "name": "greet",
      "confidence": 0.34744495153427124
    },

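Here the top intent should really have been greet, but affirm came out first. Note, though, that the text field in the output is 好啊 with two single-character tokens, not the 你好啊 that was typed, so the first character may have been lost before parsing, and 好啊 on its own does read as agreement. Not smart enough yet, but it basically meets my needs.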
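
One common way to cope with shaky predictions like the 0.49 above is to decline to act when the top intent is not confident enough, or when it is too close to the runner-up. Rasa ships a FallbackClassifier component for this purpose; the sketch below only illustrates the idea in plain Python. The thresholds 0.6 and 0.1 are arbitrary values picked for illustration, and pick_intent is a hypothetical helper.

```python
def pick_intent(intent_ranking, threshold=0.6, ambiguity=0.1):
    """Return the top intent name, or None when the prediction looks unreliable.

    intent_ranking: list of {"name": str, "confidence": float}, best first,
    shaped like the "intent_ranking" field of the parse output.
    """
    if not intent_ranking:
        return None
    top = intent_ranking[0]
    if top["confidence"] < threshold:
        return None  # not confident enough
    if len(intent_ranking) > 1:
        runner_up = intent_ranking[1]
        if top["confidence"] - runner_up["confidence"] < ambiguity:
            return None  # the top two intents are too close to call
    return top["name"]


ranking = [
    {"name": "affirm", "confidence": 0.4897577464580536},
    {"name": "greet", "confidence": 0.34744495153427124},
]
print(pick_intent(ranking))  # None, because 0.49 is below the 0.6 threshold
print(pick_intent([{"name": "greet", "confidence": 0.9999979734420776}]))  # greet
```

Returning None leaves it to the caller to ask the user to rephrase instead of answering with the wrong response.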

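Supporting Chinese replies

Training the NLU model above only enabled parsing of Chinese input; it did not enable replies in Chinese.

Add Chinese responses to domain.yml: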

version: "3.1"

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge

responses:
  utter_greet:
  - text: "你好!吃了么?"

  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"

  utter_did_that_help:
  - text: "Did that help you?"

  utter_happy:
  - text: "Great, carry on!"

  utter_goodbye:
  - text: "再见"

  utter_iamabot:
  - text: "我是一个机器人,你可以叫我小远子"

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true

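Retraining

The model produced earlier by rasa train nlu only does parsing and contains no response logic, so training has to be run again.

Note: do not pass the nlu argument: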

> rasa train

The configuration for policies was chosen automatically. It was written into the config file at 'config.yml'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Starting to train component 'JiebaTokenizer'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Finished training component 'JiebaTokenizer'.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.493 seconds.
Prefix dict has been built successfully.
2023-04-08 09:43:10 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2023-04-08 09:43:39 INFO     rasa.engine.training.hooks  - Starting to train component 'DIETClassifier'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
Epochs: 100% 300/300 [00:32<00:00,  9.16it/s, t_loss=0.282, i_acc=1]
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Finished training component 'DIETClassifier'.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'MemoizationPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'RulePolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'TEDPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'UnexpecTEDIntentPolicy' from cache.
Your Rasa model is trained and saved at 'models/20230408-094308-burning-dessert.tar.gz'.

In the models directory there is now an additional model file whose name does not start with nlu, and it is 4M larger than the earlier one.

> ls -lah models/
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  8 09:44 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 17:28 ../
-rwxrwxrwx 1 zhongwei zhongwei  24M Apr  8 09:44 20230408-094308-burning-dessert.tar.gz*
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
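
Once both kinds of archives pile up in models/, it can be handy to tell them apart programmatically. The sketch below is a small illustration that relies only on the naming seen in this post (NLU-only archives start with "nlu-"); newest_models is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory rather than real models.

```python
import tempfile
from pathlib import Path


def newest_models(models_dir):
    """Return (newest NLU-only archive, newest full archive) by modification time.

    Assumes the naming seen in this post: NLU-only models start with "nlu-".
    Either element is None when no matching archive exists.
    """
    archives = sorted(
        Path(models_dir).glob("*.tar.gz"), key=lambda p: p.stat().st_mtime
    )
    nlu_only = [p for p in archives if p.name.startswith("nlu-")]
    full = [p for p in archives if not p.name.startswith("nlu-")]
    return (nlu_only[-1] if nlu_only else None, full[-1] if full else None)


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files named like the archives listed above.
    for name in (
        "nlu-20230407-100759-obtuse-rack.tar.gz",
        "20230408-094308-burning-dessert.tar.gz",
    ):
        (Path(tmp) / name).touch()
    nlu_model, full_model = newest_models(tmp)
    print("NLU only:", nlu_model.name)
    print("full:    ", full_model.name)
```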

rasa shell

Starting rasa shell again also brings up the Rasa server and loads the newly trained model file.

> rasa shell
2023-04-08 09:46:57 INFO     root  - Connecting to channel 'cmdline' which was specified by the '--connector' argument. Any other channels will be ignored. To connect to all given channels, omit the '--connector' argument.
2023-04-08 09:46:57 INFO     root  - Starting Rasa server on http://0.0.0.0:5005
2023-04-08 09:46:57 INFO     rasa.core.processor  - Loading model models/20230408-094308-burning-dessert.tar.gz...
2023-04-08 09:46:59 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
2023-04-08 09:47:43 WARNING  rasa.shared.utils.common  - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production.
2023-04-08 09:47:50 INFO     root  - Rasa server is up and running.
Bot loaded. Type a message and press enter (use '/stop' to exit):

Chinese conversation test

Your input ->  你好
你好!吃了么?

Your input ->  你是机器人么
我是一个机器人,你可以叫我小远子

Your input ->  你是谁
我是一个机器人,你可以叫我小远子

Chinese replies work, as expected.
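
The same conversation can be driven from a script instead of the shell. The sketch below is a minimal illustration assuming the REST channel is enabled (a rest: entry in credentials.yml) and the server runs on localhost at the default port 5005. build_message_request, send_message, and reply_texts are hypothetical helper names, and the sample body is hand-written to resemble the channel's response shape, not captured output.

```python
import json
import urllib.request

WEBHOOK_PATH = "/webhooks/rest/webhook"


def build_message_request(message, sender="tester", base_url="http://localhost:5005"):
    """Build a POST request that delivers one user message to the REST channel."""
    payload = json.dumps({"sender": sender, "message": message}).encode("utf-8")
    return urllib.request.Request(
        base_url + WEBHOOK_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_texts(response_body):
    """Extract the text of each bot reply from the webhook's JSON response."""
    return [item["text"] for item in json.loads(response_body) if "text" in item]


def send_message(message, sender="tester", base_url="http://localhost:5005"):
    """Send a message to a running Rasa server and return the bot's reply texts."""
    request = build_message_request(message, sender, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return reply_texts(response.read())


# Hand-written sample body; a real server would produce this on request.
sample_body = '[{"recipient_id": "tester", "text": "你好!吃了么?"}]'
print(reply_texts(sample_body))
```

Replies that carry only an image, like utter_cheer_up in the domain above, would have no "text" key, which is why reply_texts filters on it.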

rasa train nlu errors

rasa.engine.exceptions.GraphSchemaValidationException: Component 'JiebaTokenizer' requires the following packages which are currently not installed: jieba.

Fix:

pip3 install jieba

rasa.engine.exceptions.GraphSchemaValidationException: Component 'LanguageModelFeaturizer' requires the following packages which are currently not installed: transformers.

Fix:

pip3 install transformers
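
Both errors only show up once training starts, so a quick check beforehand can save a round trip. The sketch below is a small illustration using the standard library; missing_packages is a hypothetical helper, and it assumes the import name matches the pip package name, which holds for jieba and transformers.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["jieba", "transformers"]
missing = missing_packages(required)
if missing:
    print("Install before training: pip3 install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

find_spec only looks the module up without importing it, so the check stays fast even for a heavy package like transformers.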

What to do when huggingface is unreachable

From within mainland China, downloading models from huggingface may fail. See this for a workaround:

https://www.zhihu.com/question/599683557/answer/3352307859
