Using the Microsoft Azure API for OCR extraction from ID cards and passports

Published: 2023-08-10 19:48:19  Author: 大象笔记

API documentation

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/concept-id-document?view=doc-intel-3.1.0&viewFallbackFrom=form-recog-3.0.0

Python SDK

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?view=doc-intel-3.1.0&preserve-view=true&pivots=programming-language-python

Installation

pip3 install azure-ai-formrecognizer==3.3.0b1

Code example

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# sample document
docUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

# create your `DocumentAnalysisClient` instance and `AzureKeyCredential` variable
# (`endpoint` and `key` come from your own Azure resource)
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

poller = document_analysis_client.begin_analyze_document_from_url(
    "prebuilt-document", docUrl)
result = poller.result()

However, this example analyzes a PDF by URL, whereas I need to upload a local file or pass a binary stream.

Other methods

Does DocumentAnalysisClient offer other methods?

Reference:

https://learn.microsoft.com/en-us/python/api/azure-ai-formrecognizer/azure.ai.formrecognizer.documentanalysisclient?view=azure-python

DocumentAnalysisClient has three methods

begin_analyze_document parameters

New example code

with open(path_to_sample_documents, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-invoice", document=f, locale="en-US"
    )
invoices = poller.result()

model_id for ID cards and passports

Back to the first document:

https://learn.microsoft.com/zh-cn/azure/ai-services/document-intelligence/concept-id-document?view=doc-intel-3.1.0&viewFallbackFrom=form-recog-3.0.0

The model_id is:

prebuilt-idDocument

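Supported documents

Test image

I dug out a passport I had issued many years ago.

Field extraction

Different document types expose different fields, but some fields are shared:

Pricing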

https://azure.microsoft.com/zh-cn/pricing/details/form-recognizer/

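For prebuilt models: Document, Layout, Receipt, Invoice, ID, W-2, Cards (insurance, vaccination, business card)

$10 per 1,000 pages, i.e. $0.01 per call for a single-page image. The first 500 calls each month are free.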
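
To get a feel for the numbers, here is a rough estimator. It is only a sketch under my own assumptions: each call is one page, and the free quota and the paid rate combine in a single bill, which may not match how Azure actually applies its pricing tiers. The function name is hypothetical.

```python
FREE_CALLS_PER_MONTH = 500   # free quota quoted above
USD_PER_1000_PAGES = 10      # prebuilt-model rate quoted above


def estimate_monthly_cost_usd(calls: int) -> float:
    """Estimated monthly bill in USD, treating each call as one page.

    Assumption: calls beyond the free quota are billed at the paid rate.
    """
    if calls < 0:
        raise ValueError("calls must be non-negative")
    billable = max(0, calls - FREE_CALLS_PER_MONTH)
    return billable * USD_PER_1000_PAGES / 1000


print(estimate_monthly_cost_usd(1500))
```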

Hello world

Using my passport as an example:

#!/usr/bin/env python3

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "xxx"  # replace with your endpoint
key = "xxx"       # replace with your key

print("Hello world!")

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

with open("./test.jpg", "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-idDocument", document=f, locale="en-US"
    )
result = poller.result()

# >>> result.documents[0].fields["FirstName"].value
# 'ZHONGWEI'
# >>> result.documents[0].fields["LastName"].value
# 'SUN.'
# >>> result.documents[0].fields["DocumentNumber"].value
# 'xxxxxxxxx'

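The test succeeded.

When fields cannot be extracted

I tested with a jpg that is neither an ID card nor a passport, to see what the response looks like for an invalid image: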

>>> print(result.documents[0].fields)
{'CountryRegion': DocumentField(value_type=countryRegion, value='USA', content=None, bounding_regions=[], spans=[], confidence=0.995), 'Region': DocumentField(value_type=string, value='South Carolina', content=None, bounding_regions=[], spans=[], confidence=0.99)}

>>> print(result.documents[0].fields["FirstName"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'FirstName'

>>> "FirstName" in result.documents[0].fields
False

Checking whether these keys are present in fields would seem to be enough. It is not! There is another case: documents can be an empty list.
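
Both cases can be handled in one place. The helper below is a sketch, not SDK code: the name `extract_id_fields` and the choice to require all three keys are my assumptions, and the `SimpleNamespace` objects only imitate the shape of the real result (`result.documents[0].fields[name].value`) so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional

WANTED = ("FirstName", "LastName", "DocumentNumber")


def extract_id_fields(result: Any, names=WANTED) -> Optional[Dict[str, Any]]:
    """Return {name: value} for the wanted fields, or None when the image
    does not look like an ID document."""
    documents = getattr(result, "documents", None) or []
    if not documents:
        return None  # nothing was recognized: documents is an empty list
    fields = documents[0].fields or {}
    if not all(name in fields for name in names):
        return None  # something was recognized, but the key fields are missing
    return {name: fields[name].value for name in names}


# --- stand-ins that imitate the result shape, for demonstration only ---
def _field(value):
    return SimpleNamespace(value=value)


passport = SimpleNamespace(documents=[SimpleNamespace(fields={
    "FirstName": _field("ZHONGWEI"),
    "LastName": _field("SUN"),
    "DocumentNumber": _field("xxxxxxxxx"),
})])
not_an_id = SimpleNamespace(documents=[SimpleNamespace(fields={
    "CountryRegion": _field("USA"),
})])
empty = SimpleNamespace(documents=[])

print(extract_id_fields(passport))
print(extract_id_fields(not_an_id))
print(extract_id_fields(empty))
```

Returning None for both failure modes lets the caller show a single "not a valid ID image" message.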

TODO

Dependency conflicts

Installing the Azure SDK conflicts with the dependency versions rasa requires. After the install, rasa no longer starts.

Error:

ImportError: cannot import name 'LegacyVersion' from 'packaging.version' (/home/zhongwei/.local/lib/python3.8/site-packages/packaging/version.py)

In fact, pip had already warned about this while installing the Azure SDK:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
rasa 3.4.6 requires attrs<22.2,>=19.3, but you have attrs 23.1.0 which is incompatible.
rasa 3.4.6 requires jsonschema<4.17,>=3.2, but you have jsonschema 4.19.0 which is incompatible.
rasa 3.4.6 requires packaging<21.0,>=20.0, but you have packaging 23.1 which is incompatible.
rasa 3.4.6 requires prompt-toolkit<3.0.29,>=3.0, but you have prompt-toolkit 3.0.39 which is incompatible.
tmuxp 1.9.4 requires click<8.1,>7, but you have click 8.1.6 which is incompatible.

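The packaging upgrade is what breaks the import, since newer packaging releases no longer ship LegacyVersion. So I still need a way to isolate development environments.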
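
One common way to get that isolation is a separate virtual environment per project. The sketch below uses the standard-library `venv` module and does the same job as running `python3 -m venv` on the command line. The function name and the directory name in the usage note are placeholders of mine.

```python
import venv


def create_isolated_env(path: str, with_pip: bool = True) -> None:
    """Create a virtual environment at `path`, separate from ~/.local,
    so the Azure SDK's dependencies cannot overwrite the ones rasa needs."""
    venv.create(path, with_pip=with_pip)
```

For example, call `create_isolated_env("azure-ocr-env")`, activate it with `source azure-ocr-env/bin/activate`, and run the `pip3 install azure-ai-formrecognizer==3.3.0b1` step inside it.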

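Image too large

I was at the beach with my daughter, soaking our feet, and testing the OCR service from my phone when it suddenly started failing. I had a bad feeling: either a port problem or an image size problem. So I hurried home...

The logs confirmed it was the image size limit: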

azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContentLength",
    "message": "The input image is too large. Refer to documentation for the maximum file size."
}

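On the Azure OCR free tier, images are limited to 4 MB. The paid tier allows up to 500 MB.

I went back and forth on whether to compress in JS before uploading or on the server with Python. I ended up compressing on the server, and after that the request went through.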
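
For reference, server-side compression can look roughly like this. It is a sketch rather than my exact code: it assumes the third-party Pillow library, and the function name, quality steps, and halving strategy are illustrative choices. Images already under the limit are returned untouched.

```python
import io

MAX_BYTES = 4 * 1024 * 1024  # free-tier limit mentioned above


def shrink_to_limit(data: bytes, max_bytes: int = MAX_BYTES) -> bytes:
    """Return image bytes no larger than max_bytes, re-encoding as JPEG
    only when the input is too large."""
    if len(data) <= max_bytes:
        return data  # already small enough; send as-is

    # Imported lazily so the fast path above works without Pillow installed.
    from PIL import Image

    img = Image.open(io.BytesIO(data)).convert("RGB")
    quality = 85
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        out = buf.getvalue()
        if len(out) <= max_bytes:
            return out
        if quality > 40:
            quality -= 15  # first try stronger JPEG compression
        elif img.width > 1 or img.height > 1:
            # then halve the resolution and try again
            img = img.resize((max(1, img.width // 2), max(1, img.height // 2)))
        else:
            raise ValueError("cannot fit image under max_bytes")
```

Lowering quality before resolution keeps the text on the document as sharp as possible, which matters for OCR.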

I am a developer based in Yantai, Shandong. Contact the author