Rasa Chatbot (Part 2): Chinese Support


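    Out of the box, a freshly installed Rasa project does not handle Chinese conversations.

    Strategy for learning and configuring

    The examples I found all use different pipeline configurations, and without trying them hands-on it is hard to tell which ones work better.

    So I start from the simplest configuration that actually runs, for example the Chinese pipeline recommended in the book 《Rasa 实战:构建开源对话机器人》.
    The book includes an NLU configuration for a medical bot. It covers only the NLU part, i.e. intent and entity recognition, and has no response configuration.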

    Result

    Figure: Rasa Chinese chatbot

    Built with the web widget based on the Rasa websocket channel.

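    The simplest Chinese configuration

    Open the config.yml file in the project root and change it as follows. Note that the option for the pretrained weights is spelled model_weights (plural). The runs logged later in this post appear to have used the misspelled model_weight, which was ignored: the logs say "Model weights not specified" and fall back to rasa/LaBSE.

    recipe: default.v1
    
    language: zh
    
    pipeline:
      - name: JiebaTokenizer
      - name: LanguageModelFeaturizer
        model_name: "bert"
        model_weights: "bert-base-chinese"
      - name: "DIETClassifier"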
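
Because a misspelled option key can be ignored silently rather than rejected, a quick sanity check on the pipeline can save a confusing training run. The sketch below is a hypothetical helper, not part of Rasa: it takes the pipeline as plain Python dicts and flags any key outside an assumed allowlist for LanguageModelFeaturizer (`name`, `model_name`, `model_weights`, `cache_dir`). The allowlist is an assumption made for illustration and should be checked against the Rasa documentation for your version.

```python
# Hypothetical helper (not a Rasa API): flag option keys that a component
# would probably not recognise. The allowlist is an assumption for illustration.
KNOWN_OPTIONS = {
    "LanguageModelFeaturizer": {"name", "model_name", "model_weights", "cache_dir"},
}


def unknown_options(pipeline):
    """Return (component, key) pairs whose key is not in the allowlist."""
    problems = []
    for component in pipeline:
        allowed = KNOWN_OPTIONS.get(component.get("name"))
        if allowed is None:
            continue  # no allowlist for this component, so skip it
        for key in component:
            if key not in allowed:
                problems.append((component["name"], key))
    return problems


pipeline = [
    {"name": "JiebaTokenizer"},
    {"name": "LanguageModelFeaturizer", "model_name": "bert",
     "model_weight": "bert-base-chinese"},  # singular form: likely a typo
    {"name": "DIETClassifier"},
]

print(unknown_options(pipeline))
```

Run on the pipeline above, the helper reports the singular `model_weight` key on LanguageModelFeaturizer.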
    

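    What is NLU

    NLU stands for Natural Language Understanding.

    What NLU does in Rasa:

    The Rasa NLU module parses user input and extracts key information such as intents and entities. Custom components, for example sentiment analysis, can also be added.

    Configuring nlu.yml

    Edit data/nlu.yml and add some Chinese examples on top of the existing English training data.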

    version: "3.1"
    
    nlu:
    - intent: greet
      examples: |
        - hey
        - hello
        - hi
        - hello there
        - good morning
        - good evening
        - moin
        - hey there
        - let's go
        - hey dude
        - goodmorning
        - goodevening
        - good afternoon
        - 你好!
        - 您好!
        - 在么!
        - 在吗!
        - 喂!
    
    - intent: goodbye
      examples: |
        - cu
        - good by
        - cee you later
        - good night
        - bye
        - goodbye
        - have a nice day
        - see you around
        - bye bye
        - see you later
        - 拜拜!
        - 再见!
        - 拜!
        - 退出。
        - 结束。
        - exit
    
    - intent: affirm
      examples: |
        - yes
        - y
        - indeed
        - of course
        - that sounds good
        - correct
        - 是的
        - 是
    
    - intent: deny
      examples: |
        - no
        - n
        - never
        - I don't think so
        - don't like that
        - no way
        - not really
        - 不
        - 不是的
        - 不是
    

    Retraining the model

    The yml files under the data directory, such as nlu.yml, hold the training data.

    rasa train nlu
    

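    During training, a 1.88G tf_model.h5 file is downloaded, which is surprisingly large. This file holds the pretrained language-model weights pulled in by LanguageModelFeaturizer. BERT (Bidirectional Encoder Representations from Transformers) is a model built on the Transformer architecture; it learns text representations and can be used for many natural language processing tasks, such as text classification, named entity recognition, and question answering. BERT itself is not tied to one framework: tf_model.h5 is the copy of the weights saved for TensorFlow, a widely used machine learning framework for training and deploying deep learning models, and the .h5 extension marks it as an HDF5 file.

    The trained model file, however, is only 20M.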

    > ls -lah models/
    total 44M
    drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:35 ./
    drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:03 ../
    -rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
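
As a rough sanity check on why the downloaded weights file is so large, the size can be converted into an implied parameter count. This is only a back-of-the-envelope sketch: it assumes every parameter is stored as a 4-byte float32, ignores any file overhead, and reads "1.88G" as 1.88 × 10^9 bytes.

```python
# Rough estimate: parameter count implied by a weights file,
# assuming 4-byte float32 parameters and no file overhead.
BYTES_PER_FLOAT32 = 4


def implied_parameters(file_size_bytes, bytes_per_param=BYTES_PER_FLOAT32):
    """Return the number of parameters a file of this size could hold."""
    return file_size_bytes / bytes_per_param


size_bytes = 1.88e9  # the 1.88G download, read as 1.88 * 10^9 bytes
params = implied_parameters(size_bytes)
print(f"about {params / 1e6:.0f} million parameters")
```

Under these assumptions the file corresponds to several hundred million parameters, which is in the range of a large pretrained language model and far more than the small classifier trained on top of it.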
    

    Test:

    rasa shell nlu
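
The interactive shell is convenient for spot checks. For scripted checks, a sketch like the following may help. It assumes a Rasa server has been started separately with the HTTP API enabled (for example with `rasa run --enable-api`) and is listening on localhost:5005, and it posts the text to the server's `/model/parse` endpoint. The helper names `build_parse_request` and `parse_message` are hypothetical.

```python
import json
import urllib.request

# Assumption: a Rasa server with the HTTP API enabled is running locally.
RASA_URL = "http://localhost:5005"


def build_parse_request(text, base_url=RASA_URL):
    """Build (but do not send) a POST request for the /model/parse endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/model/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_message(text, base_url=RASA_URL):
    """Send the text to the server and return the parsed result as a dict."""
    request = build_parse_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `parse_message("你好")` should return a dict with the same fields that `rasa shell nlu` prints, such as `text` and `intent`.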
    

    Test results

    The greet intent, i.e. a greeting:

    Next message:
    你好
    {
      "text": "你好",
      "intent": {
        "name": "greet",
        "confidence": 0.9999979734420776
      },
    

    The goodbye intent, i.e. saying goodbye:

    Next message:
    再见
    {
      "text": "再见",
      "intent": {
        "name": "goodbye",
        "confidence": 0.9999972581863403
      },
    

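    These two results are as expected, and they at least show that Chinese is now supported. With the default en setting, Chinese input got no reply at all.

    What surprised me more was the intent prediction below: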

    Next message:
    我拒绝
    {
      "text": "我拒绝",
      "intent": {
        "name": "deny",
        "confidence": 0.9226003289222717
      },
    

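    I never put the word 拒绝 ("refuse") in the training examples for the deny intent, yet it was still recognized correctly. That suggests a pretrained language model with Chinese coverage is doing the work. Judging from the training log, it comes from the LanguageModelFeaturizer step in the pipeline, which loads pretrained weights.
    I will look into this in more detail later.

    There are also unsatisfying cases: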

    Next message:
    你好啊
    {
      "text": "好啊",
      "intent": {
        "name": "affirm",
        "confidence": 0.4897577464580536
      },
      "entities": [],
      "text_tokens": [
        [
          0,
          1
        ],
        [
          1,
          2
        ]
      ],
      "intent_ranking": [
        {
          "name": "affirm",
          "confidence": 0.4897577464580536
        },
        {
          "name": "greet",
          "confidence": 0.34744495153427124
        },
    

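    I expected greet to be the top candidate here, but the message was classified as affirm. Looking closer, the "text" field in the output is 好啊 rather than 你好啊, and text_tokens covers only two characters, so the first character seems to have been lost before parsing. For 好啊 ("sure"), affirm is a plausible prediction. Either way the confidence is low. It is still not very smart, but it basically meets the requirement.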
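
Low-confidence or ambiguous predictions like this one are usually better handled explicitly than trusted. Rasa has a FallbackClassifier component for this inside the pipeline; the plain-Python sketch below only illustrates the idea on the output shown above and does not use Rasa itself. The function names and both threshold values are assumptions chosen for illustration, not tuned recommendations.

```python
# Sketch only: a threshold check on NLU output, written in plain Python.
# Threshold values are assumptions, not tuned recommendations.

def token_strings(parse_result):
    """Turn the [start, end] offsets in text_tokens back into substrings."""
    text = parse_result["text"]
    return [text[start:end] for start, end in parse_result["text_tokens"]]


def decide_intent(parse_result, threshold=0.6, ambiguity_threshold=0.2):
    """Return the top intent, or 'nlu_fallback' if the prediction is shaky."""
    ranking = sorted(parse_result["intent_ranking"],
                     key=lambda item: item["confidence"], reverse=True)
    top = ranking[0]
    if top["confidence"] < threshold:
        return "nlu_fallback"  # top intent is not confident enough
    if len(ranking) > 1:
        margin = top["confidence"] - ranking[1]["confidence"]
        if margin < ambiguity_threshold:
            return "nlu_fallback"  # top two intents are too close
    return top["name"]


# Values copied from the output above (ranking limited to the two entries shown).
parse_result = {
    "text": "好啊",
    "text_tokens": [[0, 1], [1, 2]],
    "intent_ranking": [
        {"name": "affirm", "confidence": 0.4897577464580536},
        {"name": "greet", "confidence": 0.34744495153427124},
    ],
}

print(token_strings(parse_result))  # ['好', '啊']
print(decide_intent(parse_result))  # nlu_fallback
```

The token helper also makes it easy to see exactly which characters reached the parser, which is how the missing first character shows up here.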

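    Supporting Chinese responses

    Training the NLU model above only enables parsing of Chinese input. It does not make the bot reply in Chinese.

    Add Chinese responses in domain.yml: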

    version: "3.1"
    
    intents:
      - greet
      - goodbye
      - affirm
      - deny
      - mood_great
      - mood_unhappy
      - bot_challenge
    
    responses:
      utter_greet:
      - text: "你好!吃了么?"
    
      utter_cheer_up:
      - text: "Here is something to cheer you up:"
        image: "https://i.imgur.com/nGF1K8f.jpg"
    
      utter_did_that_help:
      - text: "Did that help you?"
    
      utter_happy:
      - text: "Great, carry on!"
    
      utter_goodbye:
      - text: "再见"
    
      utter_iamabot:
      - text: "我是一个机器人,你可以叫我小远子"
    
    session_config:
      session_expiration_time: 60
      carry_over_slots_to_new_session: true
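
Several responses in this domain are still in English. A small check like the one below can list them. It is a sketch that assumes the responses have been loaded into a plain dict (here copied by hand from the file above) and treats a response as translated if any of its text variants contains a character from the CJK Unified Ideographs block. The function name is hypothetical.

```python
import re

# Matches any character in the CJK Unified Ideographs block.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def untranslated_responses(responses):
    """Return names of responses in which no text variant contains CJK characters."""
    missing = []
    for name, variants in responses.items():
        if not any(CJK_PATTERN.search(variant["text"]) for variant in variants):
            missing.append(name)
    return missing


# Response texts copied from the domain.yml above, as a plain dict.
responses = {
    "utter_greet": [{"text": "你好!吃了么?"}],
    "utter_cheer_up": [{"text": "Here is something to cheer you up:"}],
    "utter_did_that_help": [{"text": "Did that help you?"}],
    "utter_happy": [{"text": "Great, carry on!"}],
    "utter_goodbye": [{"text": "再见"}],
    "utter_iamabot": [{"text": "我是一个机器人,你可以叫我小远子"}],
}

print(untranslated_responses(responses))
```

For the domain above this lists utter_cheer_up, utter_did_that_help, and utter_happy as the responses still to be translated.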
    

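    Retraining

    The model trained earlier with rasa train nlu only parses input and contains no response logic, so a full retrain is needed.

    Note: run the command without the nlu argument: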

    > rasa train
    
    The configuration for policies was chosen automatically. It was written into the config file at 'config.yml'.
    2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Starting to train component 'JiebaTokenizer'.
    2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Finished training component 'JiebaTokenizer'.
    Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    Loading model cost 0.493 seconds.
    Prefix dict has been built successfully.
    2023-04-08 09:43:10 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
    All model checkpoint layers were used when initializing TFBertModel.
    
    All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
    If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
    2023-04-08 09:43:39 INFO     rasa.engine.training.hooks  - Starting to train component 'DIETClassifier'.
    /home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
      rasa.shared.utils.io.raise_warning(
    Epochs: 100% 300/300 [00:32<00:00,  9.16it/s, t_loss=0.282, i_acc=1]
    2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Finished training component 'DIETClassifier'.
    2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'MemoizationPolicy' from cache.
    2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'RulePolicy' from cache.
    2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'TEDPolicy' from cache.
    2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'UnexpecTEDIntentPolicy' from cache.
    Your Rasa model is trained and saved at 'models/20230408-094308-burning-dessert.tar.gz'.
    

    Looking at the models directory, there is now a model file without the nlu prefix, and it is 4M larger than the earlier one.

    > ls -lah models/
    drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  8 09:44 ./
    drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 17:28 ../
    -rwxrwxrwx 1 zhongwei zhongwei  24M Apr  8 09:44 20230408-094308-burning-dessert.tar.gz*
    -rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
    

    rasa shell

    Starting rasa shell again, you can see that it also starts the Rasa server and loads the newly trained model file.

    > rasa shell
    2023-04-08 09:46:57 INFO     root  - Connecting to channel 'cmdline' which was specified by the '--connector' argument. Any other channels will be ignored. To connect to all given channels, omit the '--connector' argument.
    2023-04-08 09:46:57 INFO     root  - Starting Rasa server on http://0.0.0.0:5005
    2023-04-08 09:46:57 INFO     rasa.core.processor  - Loading model models/20230408-094308-burning-dessert.tar.gz...
    2023-04-08 09:46:59 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
    All model checkpoint layers were used when initializing TFBertModel.
    
    All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
    If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
    /home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
      rasa.shared.utils.io.raise_warning(
    2023-04-08 09:47:43 WARNING  rasa.shared.utils.common  - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production.
    2023-04-08 09:47:50 INFO     root  - Rasa server is up and running.
    Bot loaded. Type a message and press enter (use '/stop' to exit):
    

    Chinese conversation test

    Your input ->  你好
    你好!吃了么?
    
    Your input ->  你是机器人么
    我是一个机器人,你可以叫我小远子
    
    Your input ->  你是谁
    我是一个机器人,你可以叫我小远子
    

    Chinese responses now work as expected.

    rasa train nlu errors

    rasa.engine.exceptions.GraphSchemaValidationException: Component 'JiebaTokenizer' requires the following packages which are currently not installed: jieba.

    Fix:

    pip3 install jieba
    

    rasa.engine.exceptions.GraphSchemaValidationException: Component 'LanguageModelFeaturizer' requires the following packages which are currently not installed: transformers.

    Fix:

    pip3 install transformers
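
Both errors come from packages missing in the Python environment, so it can be quicker to check for them before starting a training run. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the two error messages above.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages named in the error messages above.
required = ["jieba", "transformers"]
for name in missing_packages(required):
    print(f"missing: {name} (try: pip3 install {name})")
```

If nothing is printed, both packages can be imported by the interpreter that ran the check. Make sure it is the same interpreter that Rasa uses.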
    

    Working around an unreachable Hugging Face

    When working from mainland China, downloading models from Hugging Face may fail. See this answer for a workaround:

    https://www.zhihu.com/question/599683557/answer/3352307859

    References

    • https://rasa.com/docs/rasa/language-support/


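    About the author 🌱

    I am a developer from Yantai, Shandong. If you have a topic you would like to discuss, or a software development need, feel free to add me on WeChat (zhongwei) or follow my public account “大象工具”.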