Extracting ID card and passport information with the Microsoft Azure OCR API

API documentation

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/concept-id-document?view=doc-intel-3.1.0&viewFallbackFrom=form-recog-3.0.0

Python SDK

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?view=doc-intel-3.1.0&preserve-view=true&pivots=programming-language-python

Installation

pip3 install azure-ai-formrecognizer==3.3.0b1

Code example

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "xxx"  # replace with your resource endpoint
key = "xxx"       # replace with your API key

# sample document
docUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

# create your `DocumentAnalysisClient` instance and `AzureKeyCredential` variable
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

poller = document_analysis_client.begin_analyze_document_from_url(
    "prebuilt-document", docUrl)
result = poller.result()

However, this example analyzes a PDF by URL, whereas what I need is to upload a local file or pass a binary stream.

Other methods

Does DocumentAnalysisClient offer other methods?

Reference:

https://learn.microsoft.com/en-us/python/api/azure-ai-formrecognizer/azure.ai.formrecognizer.documentanalysisclient?view=azure-python

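The three DocumentAnalysisClient methods that matter here:

begin_analyze_document_from_url: the method used in the example above
begin_analyze_document: exactly what I need
close

begin_analyze_document parameters

model_id: the model ID; really just a string naming the model to use, for example ID document, invoice, and so on.
document: JPEG, PNG, PDF, TIFF, BMP, or HEIF type file stream or bytes.
locale: for US users, en-US will do.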

New example code

with open(path_to_sample_documents, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-invoice", document=f, locale="en-US"
    )
invoices = poller.result()
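
Because `document` also accepts raw bytes, the same call should work when the image arrives as an in-memory upload instead of a file on disk. Here is a minimal sketch, assuming a hypothetical helper named `analyze_bytes` (it is not part of the SDK) and a `client` that behaves like `DocumentAnalysisClient`:

```python
def analyze_bytes(client, data, model_id="prebuilt-invoice", locale="en-US"):
    """Send in-memory document bytes to the service and wait for the result.

    `client` is expected to be a DocumentAnalysisClient, or anything exposing
    the same begin_analyze_document method. This helper is hypothetical glue
    code, not an SDK function.
    """
    if not data:
        # Fail early instead of spending a request on an empty upload.
        raise ValueError("document is empty")
    poller = client.begin_analyze_document(model_id, document=data, locale=locale)
    return poller.result()
```

Taking the client as a parameter keeps the helper easy to exercise with a stand-in object during local testing.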

model_id for ID cards and passports

Back to the first document:

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/concept-id-document?view=doc-intel-3.1.0&viewFallbackFrom=form-recog-3.0.0

The model_id is:

prebuilt-idDocument

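Supported documents

Worldwide: passport books and passport cards
United States: driver's licenses, identification cards, residency permits (green cards), Social Security cards, military IDs
Europe: driver's licenses, identification cards, residency permits
India: driver's licenses, PAN cards, Aadhaar cards
Canada: driver's licenses, identification cards, residency permits (Maple Card)
Australia: driver's licenses, photo cards, Key-pass IDs (including digital versions)

Test image

I dug out the passport I got many years ago.

Field extraction

Different document types yield different fields, but some are common to all of them: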

FirstName
LastName
DocumentNumber

Pricing

https://azure.microsoft.com/zh-cn/pricing/details/form-recognizer/

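For the prebuilt models: document, layout, receipt, invoice, ID, W-2, and cards (insurance, vaccination, business cards)

$10 per 1,000 pages, which works out to $0.01 per call for a single-page image. The first 500 pages each month are free.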
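
As a quick sanity check on the arithmetic, here is a small sketch that estimates a monthly bill from the figures quoted above. The function name and its defaults are illustrative assumptions; actual pricing depends on region and tier, so check the pricing page before relying on it.

```python
def estimate_monthly_cost(pages, free_pages=500, usd_per_1000=10):
    """Estimate the monthly cost in USD for a given number of analyzed pages.

    The defaults mirror the numbers quoted in this post: the first 500 pages
    are free, then $10 per 1,000 pages.
    """
    billable = max(0, pages - free_pages)
    return billable * usd_per_1000 / 1000


print(estimate_monthly_cost(400))   # 0.0, still within the free allowance
print(estimate_monthly_cost(2500))  # 20.0, i.e. 2,000 billable pages
```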

Hello world

For example, using my passport:

#!/usr/bin/env python3

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "xxx"  # replace with your resource endpoint
key = "xxx"       # replace with your API key

print("Hello world!")

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

with open("./test.jpg", "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-idDocument", document=f, locale="en-US"
    )
result = poller.result()

# >>> result.documents[0].fields["FirstName"].value
# 'ZHONGWEI'
# >>> result.documents[0].fields["LastName"].value
# 'SUN.'
# >>> result.documents[0].fields["DocumentNumber"].value
# 'xxxxxxxxx'

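The test succeeded.

When fields cannot be extracted

I tested with a JPG that is neither an ID card nor a passport, to see what the response looks like for an invalid image: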

>>> print(result.documents[0].fields)
{'CountryRegion': DocumentField(value_type=countryRegion, value='USA', content=None, bounding_regions=[], spans=[], confidence=0.995), 'Region': DocumentField(value_type=string, value='South Carolina', content=None, bounding_regions=[], spans=[], confidence=0.99)}

>>> print(result.documents[0].fields["FirstName"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'FirstName'

>>> "FirstName" in result.documents[0].fields
False

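My first thought was that checking whether these keys exist in fields would be enough. It is not: there is another case where documents is simply an empty list, so that has to be checked first.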
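
Both failure modes can be handled in one place. Below is a minimal sketch, assuming a hypothetical helper `extract_common_fields` that returns `None` whenever the image does not yield the three common fields. The `SimpleNamespace` objects are stand-ins for the real SDK result, used only so the example runs without calling the service.

```python
from types import SimpleNamespace

COMMON_KEYS = ("FirstName", "LastName", "DocumentNumber")


def extract_common_fields(result):
    """Return the common fields as a dict, or None if they cannot be read."""
    documents = getattr(result, "documents", None)
    if not documents:
        # Case 1: the service found no document at all (empty list).
        return None
    fields = documents[0].fields
    if any(key not in fields for key in COMMON_KEYS):
        # Case 2: a document was found, but it lacks the ID fields.
        return None
    return {key: fields[key].value for key in COMMON_KEYS}


# Stand-in results for illustration only.
empty = SimpleNamespace(documents=[])
not_an_id = SimpleNamespace(
    documents=[SimpleNamespace(fields={"Region": SimpleNamespace(value="South Carolina")})]
)
print(extract_common_fields(empty))      # None
print(extract_common_fields(not_an_id))  # None
```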

TODO

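Can it detect whether a passport or ID card is forged?

Dependency conflicts

Installing the Azure SDK conflicts with the dependency versions rasa requires. After the installation, rasa no longer starts.

Error: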

ImportError: cannot import name 'LegacyVersion' from 'packaging.version' (/home/zhongwei/.local/lib/python3.8/site-packages/packaging/version.py)

In fact, pip had already warned about this while installing the Azure SDK:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
rasa 3.4.6 requires attrs<22.2,>=19.3, but you have attrs 23.1.0 which is incompatible.
rasa 3.4.6 requires jsonschema<4.17,>=3.2, but you have jsonschema 4.19.0 which is incompatible.
rasa 3.4.6 requires packaging<21.0,>=20.0, but you have packaging 23.1 which is incompatible.
rasa 3.4.6 requires prompt-toolkit<3.0.29,>=3.0, but you have prompt-toolkit 3.0.39 which is incompatible.
tmuxp 1.9.4 requires click<8.1,>7, but you have click 8.1.6 which is incompatible.

So I still need a way to keep development environments isolated from each other.
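
One possible approach is a dedicated virtual environment for the Azure SDK, so its dependencies never touch the ones rasa pins. Here is a minimal sketch using the standard library `venv` module. The helper name `create_isolated_env` and the directory name in the usage note are illustrative assumptions, not something this setup already uses.

```python
import venv
from pathlib import Path


def create_isolated_env(env_dir, with_pip=True):
    """Create a virtual environment and return the path to its pyvenv.cfg.

    Packages installed with this environment's own pip stay separate from
    the user-level site-packages where rasa lives.
    """
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"
```

After calling, for example, `create_isolated_env("azure-ocr-env")`, the SDK would be installed with that environment's own pip (on Linux, `azure-ocr-env/bin/pip install azure-ai-formrecognizer==3.3.0b1`) and the OCR service started with its `bin/python`.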

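Image too large

I was at the beach, keeping my daughter company while she paddled in the water, and testing the OCR service from my phone when it suddenly started failing. I had a bad feeling: either a port problem or an image size problem. So I hurried home...

The logs confirmed it was the image size limit: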

azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContentLength",
    "message": "The input image is too large. Refer to documentation for the maximum file size."
}

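On the free tier of Azure OCR, images are limited to 4 MB; the paid tier allows up to 500 MB.

I went back and forth on whether to compress in JavaScript before upload or in Python on the server, and chose server-side compression. After compressing, the request went through as expected.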
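
The compression code itself is not shown here, so the following is only a sketch of the control flow, assuming the 4 MB limit quoted above. `shrink_to_limit` and its `compress` argument are hypothetical names: `compress(data, quality)` stands for whatever re-encoder is used on the server, for instance a small wrapper that saves the image as JPEG with Pillow at the given quality.

```python
MAX_FREE_TIER_BYTES = 4 * 1024 * 1024  # 4 MB, the free-tier limit quoted above


def shrink_to_limit(data, compress, limit=MAX_FREE_TIER_BYTES,
                    qualities=(85, 70, 55, 40)):
    """Re-encode `data` at decreasing quality until it fits under `limit`.

    `compress(data, quality)` is supplied by the caller and must return bytes.
    Each attempt starts from the original data, so quality loss from earlier
    attempts does not accumulate.
    """
    if len(data) <= limit:
        return data  # already small enough, leave the image untouched
    for quality in qualities:
        smaller = compress(data, quality)
        if len(smaller) <= limit:
            return smaller
    raise ValueError("image still exceeds the size limit after compression")
```

Raising an error when nothing fits lets the server return a clear message to the client instead of forwarding a request that Azure will reject with InvalidContentLength.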