@xiaomizhoubaobei xiaomizhoubaobei commented Sep 4, 2025

User description

Summary by Sourcery

Refactor core OCR modules to improve logging and validation, introduce new OCR and Baidu authentication features, and tighten CI workflows across sync, release, and publish pipelines

New Features:

  • Add RecognizeGeneralTextImageWarn class for Tencent OCR warning recognition
  • Add Baidu access_token function for obtaining Baidu API credentials

Enhancements:

  • Introduce Verification.sanitize_log_data to mask sensitive data in logs
  • Improve input validation and unify logging format in GeneralBasicOCR and GeneralAccurateOCR
  • Update IDE Dockerfile to install Python dependencies via yilai.txt and clean up obsolete scripts

CI:

  • Extend GitHub Actions triggers to all pull_request lifecycle events
  • Restrict sync, release, and publish jobs to run only on merged pull requests
  • Fix version extraction regex and rename steps in the release workflow
  • Remove redundant steps in .cnb.yml and adjust publish-to-pypi conditions

PR Type

Enhancement, Other


Description

  • Add Baidu API integration with access token functionality

  • Introduce new OCR warning recognition class RecognizeGeneralTextImageWarn

  • Enhance logging security with sensitive data sanitization

  • Refactor CI workflows to restrict operations to merged PRs


Diagram Walkthrough

flowchart LR
  A["Existing Tencent OCR"] --> B["Enhanced with Logging Security"]
  C["New Baidu Module"] --> D["Access Token Function"]
  E["New OCR Warning Class"] --> F["RecognizeGeneralTextImageWarn"]
  G["CI Workflows"] --> H["Restricted to Merged PRs"]

File Walkthrough

Relevant files
Enhancement
8 files
__init__.py
Add Baidu imports and update exports                                         
+6/-1     
__init__.py
Initialize Baidu module structure                                               
+1/-0     
authorization.py
Add Baidu access token function                                                   
+12/-0   
GeneralAccurateOCR.py
Enhance logging with data sanitization                                     
+16/-5   
GeneralBasicOCR.py
Improve input validation and logging security                       
+15/-4   
RecognizeGeneralTextImageWarn.py
Add new OCR warning recognition class                                       
+112/-0 
__init__.py
Export new warning recognition class                                         
+2/-1     
verification.py
Add sensitive data sanitization utility                                   
+25/-0   
Formatting
1 files
ImageValidator.py
Update import statements formatting                                           
+2/-1     
Miscellaneous
1 files
install_sdk.sh
Remove obsolete SDK installation script                                   
+0/-57   
Configuration changes
5 files
.cnb.yml
Remove redundant dependency installation step                       
+0/-4     
publish-to-pypi.yml
Restrict publishing to merged PRs only                                     
+3/-3     
release.yml
Fix version extraction and restrict releases                         
+4/-5     
sync-to-coding.yml
Restrict sync operations to merged PRs                                     
+10/-12 
Dockerfile
Add Python dependencies installation via yilai.txt             
+3/-0     
Dependencies
1 files
yilai.txt
Add comprehensive SDK dependencies list                                   
+852/-0 
Additional files
1 files
__init__.py [link]   

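- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows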

@sourcery-ai

sourcery-ai bot commented Sep 4, 2025

Reviewer's Guide

This PR refactors CI/CD pipelines to refine workflow triggers and merge conditions, enhances Tencent OCR modules with parameterized logging and data sanitization, adds a new OCR warning handler and Baidu authorization support, and updates dependency management and container setup.

File-Level Changes

Change Details Files
Refine GitHub Actions triggers and conditions
  • Extend pull_request event types to include opened, synchronize, reopened, closed
  • Add if guards in sync and publish jobs to run only on merged PRs
  • Standardize step names and update version extraction regex in release workflow
.github/workflows/sync-to-coding.yml
.github/workflows/release.yml
.github/workflows/publish-to-pypi.yml
Enhance Tencent OCR logging and input validation
  • Introduce Verification.sanitize_log_data for sensitive data masking
  • Switch to parameterized logger calls and truncate long data
  • Strengthen empty ImageBase64/ImageUrl validation
mzapi/tencent/ocr/GeneralAccurateOCR.py
mzapi/tencent/ocr/GeneralBasicOCR.py
Add RecognizeGeneralTextImageWarn OCR handler
  • Implement new class with input validation, sanitization, and logging
  • Integrate with Tencent OCR SDK and raise detailed errors
mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py
Introduce verification utility
  • Create sanitize_log_data to truncate or mask sensitive/log data
mzapi/utlis/verification.py
Integrate Baidu OCR authorization
  • Add access_token helper for Baidu API
  • Expose baidu package in main __init__.py and update __all__
mzapi/baidu/authorization.py
mzapi/baidu/__init__.py
mzapi/__init__.py
Update dependencies and container setup
  • Add yilai.txt for Python requirements
  • Modify IDE Dockerfile to install dependencies
  • Remove redundant install steps in .cnb.yml
  • Clean up imports in ImageValidator
.ide/Dockerfile
yilai.txt
.cnb.yml
mzapi/utlis/ImageValidator.py

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @xiaomizhoubaobei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

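This pull request restructures the mzapi project and extends its functionality. The core changes streamline dependency management, add integration with a new cloud provider (Baidu Cloud), and enhance the existing Tencent Cloud OCR features while improving their security. These changes aim to improve the maintainability and extensibility of the code and the safety of sensitive-data handling.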

Highlights

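  • Dependency management: Removed the old SDK installation script (install_sdk.sh) in favor of managing Python dependencies centrally through yilai.txt, which is installed automatically during the Docker build and simplifies environment setup.
  • New Baidu Cloud SDK integration: Introduced the mzapi/baidu module with support for Baidu Cloud authorization, laying the groundwork for later Baidu Cloud services (such as NLP).
  • Tencent Cloud OCR extension and improvements: Added the RecognizeGeneralTextImageWarn class, which extends Tencent Cloud OCR with general warning recognition for text images. The GeneralAccurateOCR and GeneralBasicOCR classes were also refactored to add log data sanitization and tighter input parameter validation, improving security and robustness.
  • Log data sanitization utility: Added mzapi/utlis/verification.py, which provides the sanitize_log_data method for truncating or partially hiding sensitive data (such as Base64-encoded image data) in logs, improving log security.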

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'. (link)

General comments:

  • There’s a lot of duplicated workflow if conditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML.
  • The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
  • The .ide/Dockerfile installs dependencies from yilai.txt, which isn’t a standard name—renaming it to requirements.txt (or aligning with existing naming conventions) would clarify its purpose.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There’s a lot of duplicated workflow `if` conditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML.
- The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
- The `.ide/Dockerfile` installs dependencies from `yilai.txt`, which isn’t a standard name—renaming it to `requirements.txt` (or aligning with existing naming conventions) would clarify its purpose.

## Individual Comments

### Comment 1
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:43` </location>
<code_context>
+            # 只在没有处理器时添加处理器
+            if not self.logger.handlers:
+                handler = logging.StreamHandler()
+                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+                self.logger.addHandler(handler)
+            else:
</code_context>

<issue_to_address>
Custom logging format uses non-standard field names.

Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
</issue_to_address>

### Comment 2
<location> `mzapi/utlis/verification.py:13` </location>
<code_context>
+        if data is None:
+            return None
+        if isinstance(data, str):
+            if len(data) > max_length and 'base64' in data.lower():
+                # 处理Base64数据
+                return data[-20:] if len(data) > 100 else data[-10:]
+            elif len(data) > max_length:
+                # 处理长字符串
</code_context>

<issue_to_address>
Base64 detection logic may not reliably identify base64-encoded data.

Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
</issue_to_address>

### Comment 3
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:82` </location>
<code_context>
+
+            if ImageUrl:
+                self.logger.debug("验证图片URL: %s", ImageUrl)
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+                self.logger.debug("图片URL验证通过")
+                self.logger.debug("图片Base64验证通过")
</code_context>

<issue_to_address>
PDF format is not included in allowed extensions for URL validation.

The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.
</issue_to_address>

### Comment 4
<location> `yilai.txt:109` </location>
<code_context>
+alibabacloud_ccc20200527
+alibabacloud_ccc20200701
+alibabacloud_cciotgw20210721
+alibabacloud_cd2021127
+alibabacloud_cddc20200320
+alibabacloud_cdn20141111
</code_context>

<issue_to_address>
Possible typo in package name: 'alibabacloud_cd2021127'.

This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
alibabacloud_cd2021127
=======
alibabacloud_cd20211207
>>>>>>> REPLACE

</suggested_fix>

## Security Issues

### Issue 1
<location> `mzapi/baidu/authorization.py:11` </location>

<issue_to_address>
**security (python.requests.best-practice.use-timeout):** Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.

```suggestion
    response = requests.get(url, headers=headers, data=payload, timeout=30)
```

*Source: opengrep*
</issue_to_address>


# 只在没有处理器时添加处理器
if not self.logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))

issue (bug_risk): Custom logging format uses non-standard field names.

Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
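
To make the difference between the two format strings concrete, here is a minimal sketch using only the standard library. The logger name, message, and the helper `try_format` are made up for illustration. On recent Python versions the unknown field surfaces as a ValueError that wraps the underlying KeyError, so the sketch catches both.

```python
import logging

BAD_FORMAT = '%(pastime)s - %(name)s - %(levelness)s - %(message)s'
GOOD_FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# A hand-built record so the formatter can be exercised without a handler.
record = logging.LogRecord(
    name="demo", level=logging.INFO, pathname="demo.py", lineno=1,
    msg="client ready", args=None, exc_info=None,
)


def try_format(fmt):
    """Format the demo record, reporting a failure instead of raising."""
    try:
        return logging.Formatter(fmt).format(record)
    except (KeyError, ValueError) as exc:
        return f"formatting failed: {type(exc).__name__}"


print(try_format(BAD_FORMAT))   # the unknown fields cannot be resolved
print(try_format(GOOD_FORMAT))  # timestamp, logger name, level, message
```

When the formatter is attached to a handler, the logging module normally reports such failures through the handler's error hook rather than propagating them, so the visible symptom may be a traceback on stderr and a lost log line.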

Comment on lines 13 to 15
if len(data) > max_length and 'base64' in data.lower():
    # 处理Base64数据
    return data[-20:] if len(data) > 100 else data[-10:]

suggestion: Base64 detection logic may not reliably identify base64-encoded data.

Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
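
One way to validate the encoding rather than search for a substring is to attempt a strict decode. The sketch below is an assumption about how that could look, not code from the PR; the helper name `looks_like_base64` and the `min_length` threshold are hypothetical.

```python
import base64
import binascii


def looks_like_base64(data: str, min_length: int = 16) -> bool:
    """Heuristic check: True if data strictly decodes as standard Base64."""
    # Padded Base64 always has a length that is a multiple of 4.
    if len(data) < min_length or len(data) % 4 != 0:
        return False
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


encoded = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
print(looks_like_base64(encoded))
print(looks_like_base64("this mentions base64 but is plain text"))
```

This remains a heuristic: a plain alphanumeric string whose length happens to be a multiple of 4 also decodes cleanly, so knowing which parameter the value came from (for example ImageBase64) is the more reliable signal.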


if ImageUrl:
    self.logger.debug("验证图片URL: %s", ImageUrl)
    self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])

issue: PDF format is not included in allowed extensions for URL validation.

The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.

alibabacloud_ccc20200527
alibabacloud_ccc20200701
alibabacloud_cciotgw20210721
alibabacloud_cd2021127

issue (typo): Possible typo in package name: 'alibabacloud_cd2021127'.

This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.

Suggested change
-alibabacloud_cd2021127
+alibabacloud_cd20211207

    'Content-Type': 'application/json',
    'Accept': 'application/json'
}
response = requests.get(url, headers=headers, data=payload)

security (python.requests.best-practice.use-timeout): Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.

Suggested change
-response = requests.get(url, headers=headers, data=payload)
+response = requests.get(url, headers=headers, data=payload, timeout=30)

Source: opengrep

:param EnableDetectText: 文本检测开关,默认为true。设置为false可直接进行单行识别,适用于仅包含正向单行文本的图片场景。
:param ConfigID: 配置ID支持: OCR -- 通用场景 MulOCR--多语种场景
"""
try:

issue (code-quality): We've found these issues:

:param IsWords: 是否返回单字信息,默认false。
:return: 识别结果,返回为JSON字符串格式,包含文本识别结果、方向信息及可能的错误信息。
"""
try:

issue (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

    self.logger.info("OCR客户端初始化完成")
except Exception as e:
    self.logger.error(f"初始化失败: {str(e)}")
    raise TencentCloudSDKException("初始化失败", str(e))

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
-raise TencentCloudSDKException("初始化失败", str(e))
+raise TencentCloudSDKException("初始化失败", str(e)) from e
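
As a small illustration of what `from e` changes, the sketch below uses a hypothetical `SdkError` class as a stand-in for TencentCloudSDKException, and made-up `init_client` and `broken_factory` helpers. With explicit chaining, the original exception is preserved on `__cause__` and shown in the traceback as the direct cause.

```python
class SdkError(Exception):
    """Hypothetical stand-in for TencentCloudSDKException."""


def init_client(factory):
    """Build a client, wrapping any failure in SdkError with explicit chaining."""
    try:
        return factory()
    except Exception as e:
        # 'from e' records the original error as the explicit cause.
        raise SdkError("initialization failed", str(e)) from e


def broken_factory():
    raise ConnectionError("endpoint unreachable")


try:
    init_client(broken_factory)
except SdkError as err:
    print(type(err.__cause__).__name__, "->", err.__cause__)
```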

:param PdfPageNumber: 需要识别的PDF页面的对应页码,仅支持PDF单页识别,当上传文件为PDF且IsPdf参数值为true时有效,默认值为1。
:param Type: 识别类型,可选值为General 通用告警(支持所有类型告警),LicensePlate 车牌告警(支持翻拍告警)
"""
try:

issue (code-quality): We've found these issues:

if data is None:
    return None
if isinstance(data, str):
    if len(data) > max_length and 'base64' in data.lower():

issue (code-quality): We've found these issues:

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

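This PR reworks the project's dependency management, moving from the install_sdk.sh script to yilai.txt and pip, which is a good improvement. It also adds support for Baidu Cloud and new Tencent Cloud features, and optimizes logging to avoid recording sensitive data.

The code review found a few areas for improvement:

  • The dependency file yilai.txt contains a large number of non-essential packages, which greatly increases installation time and environment size; consider trimming it.
  • The newly added RecognizeGeneralTextImageWarn.py file has some problems, including typos in the logging format string that may cause runtime errors.
  • Error handling and logging in parts of the code could be more complete and accurate.
  • The logic of some utility functions (such as verification.py) could be clearer.

Overall, the refactoring is heading in the right direction, but the details leave room for improvement. Please see the individual review comments.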

Comment on lines 41 to 48
if not self.logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
    self.logger.addHandler(handler)
else:
    # 确保现有处理器的格式一致
    for h in self.logger.handlers:
        h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))

high

The logging format string contains typos: %(pastime)s and %(levelness)s should be %(asctime)s and %(levelname)s. These errors will raise a KeyError exception when a log record is emitted, causing the program to crash.

Suggested change
-if not self.logger.handlers:
-    handler = logging.StreamHandler()
-    handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
-    self.logger.addHandler(handler)
-else:
-    # 确保现有处理器的格式一致
-    for h in self.logger.handlers:
-        h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+if not self.logger.handlers:
+    handler = logging.StreamHandler()
+    handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
+    self.logger.addHandler(handler)
+else:
+    # 确保现有处理器的格式一致
+    for h in self.logger.handlers:
+        h.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

Comment on lines 4 to 12
def access_token(ak, sk):
    url = f"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={ak}&client_secret={sk}"
    payload = ""
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    response = requests.get(url, headers=headers, data=payload)
    return response.json().get("access_token")

medium

The access_token function has no error handling for the network request or the response. If the requests.get call fails (for example, because of network problems or DNS errors), or the Baidu API returns a non-200 status code (for example, because of a wrong ak/sk), response.json() may raise an exception and crash the whole program. Consider adding a try...except block to catch potential requests.exceptions.RequestException and json.JSONDecodeError, and checking the response status code. Also, for GET requests, passing query parameters via the params argument is the more common practice.

def access_token(ak, sk):
    url = "https://aip.baidubce.com/oauth/2.0/token"
    params = {
        "grant_type": "client_credentials",
        "client_id": ak,
        "client_secret": sk
    }
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    try:
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()  # 如果状态码不是 2xx,则抛出 HTTPError
        return response.json().get("access_token")
    except requests.exceptions.RequestException as e:
        # 建议使用日志模块记录错误
        print(f"请求 access_token 时发生网络错误: {e}")
        return None
    except ValueError:  # json.JSONDecodeError
        print("解析百度API响应失败")
        return None
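
One practical reason to pass query parameters separately instead of interpolating them into the URL is percent-encoding. The standard-library sketch below shows the idea; the helper `build_token_url` and the sample credentials are hypothetical, and the only endpoint used is the one quoted in the PR.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(ak, sk):
    """Build the token URL with every query value percent-encoded."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": ak,
        "client_secret": sk,
    })
    return f"{TOKEN_URL}?{query}"


# A secret containing '&' or '=' would corrupt an f-string URL,
# but survives a round trip once it is encoded.
url = build_token_url("my-ak", "s&k=1")
print(url)
print(parse_qs(urlsplit(url).query)["client_secret"])
```

Passing a params dictionary to an HTTP client achieves the same encoding without building the query string by hand.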

self.logger.debug("验证图片URL: %s", ImageUrl)
self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])
self.logger.debug("图片URL验证通过")
self.logger.debug("图片Base64验证通过")

medium

The log line "图片Base64验证通过" ("image Base64 validation passed") is misleading. The code only validates ImageUrl and performs no validation of the content or format of ImageBase64. Consider removing this log line or implementing a real Base64 validation function.


if ImageUrl:
    self.logger.debug("验证图片URL: %s", ImageUrl)
    self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])

medium

The validate_url call is inconsistent with the docstring. The docstring (lines 68-69) says that PDF and BMP formats are supported, but the validation here only allows ["png", "jpg", "jpeg"]. Users who pass a PDF or BMP URL will therefore hit an unexpected ValueError.

Suggested change
-self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])

self.logger.debug("验证图片URL: %s", ImageUrl)
self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
self.logger.debug("图片URL验证通过")
self.logger.debug("图片Base64验证通过")

medium

The log line "图片Base64验证通过" ("image Base64 validation passed") is misleading. The code only validates ImageUrl and performs no validation of the content or format of ImageBase64. Consider removing this log line.

Comment on lines +86 to +90
self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
                  self.sanitize_log_data.sanitize_log_data(ImageUrl,50),
                  IsPdf,
                  PdfPageNumber,
                  Type)

medium

This debug log records incomplete and incorrect parameters. The message omits ImageBase64 and mistakenly passes ImageUrl to the sanitize_log_data function. This hampers debugging, because none of the incoming ImageBase64 is visible.

Suggested change
-self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
-                  self.sanitize_log_data.sanitize_log_data(ImageUrl,50),
-                  IsPdf,
-                  PdfPageNumber,
-                  Type)
+self.logger.debug("请求参数: ImageBase64=%s, ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
+                  self.sanitize_log_data.sanitize_log_data(ImageBase64,50),
+                  ImageUrl,
+                  IsPdf,
+                  PdfPageNumber,
+                  Type)
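
The parameterized logging style used in this suggestion can be tried in isolation. The sketch below is illustrative only: `ListHandler` and `shorten` are hypothetical helpers (the latter standing in for sanitize_log_data), and the logger name and values are invented.

```python
import logging


class ListHandler(logging.Handler):
    """Collects rendered messages in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def shorten(value, max_length=50):
    """Hypothetical stand-in for sanitize_log_data: keep a short prefix only."""
    if isinstance(value, str) and len(value) > max_length:
        return f"{value[:max_length]}...({len(value)} chars)"
    return value


logger = logging.getLogger("ocr_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

image_base64 = "A" * 200
image_url = None

# Arguments are passed separately; the logger interpolates them only
# when the record is actually emitted.
logger.debug(
    "request params: ImageBase64=%s, ImageUrl=%s, IsPdf=%s",
    shorten(image_base64), image_url, False,
)
print(handler.messages[0])
```

Note that the helper call itself still runs eagerly; only the string interpolation is deferred.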

@@ -0,0 +1,25 @@
class Verification:
    @staticmethod
    def   sanitize_log_data(data, max_length=100):

medium

The sanitize_log_data method definition contains extra spaces; consider removing them to keep the code style consistent.

Suggested change
-def   sanitize_log_data(data, max_length=100):
+def sanitize_log_data(data, max_length=100):

Comment on lines 18 to 24
if isinstance(data, str):
    truncated = data[:max_length]
    # 确保截断不会破坏UTF-8字符
    while len(truncated.encode('utf-8')) > max_length:
        truncated = truncated[:-1]
    return truncated
return data

medium

There is a redundant isinstance(data, str) check here. The outer if isinstance(data, str): (line 12) has already confirmed that data is a string. The inner check can be removed to simplify the code.

                truncated = data[:max_length]
                # 确保截断不会破坏UTF-8字符
                while len(truncated.encode('utf-8')) > max_length:
                    truncated = truncated[:-1]
                return truncated

@qodo-code-review

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns

Sensitive data in logs:
While sanitization is added, current implementation may still log raw URLs and partial Base64 or access tokens. Ensure tokens/credentials are fully masked and avoid logging Base64 unless necessary.

External request hardening: The Baidu access_token call lacks timeout and error checks; add timeouts, exception handling, and verify status codes to prevent DoS or leaking error bodies.

⚡ Recommended focus areas for review

Logging Format Bug

The logging formatter uses nonstandard fields like %(pastime)s and %(levelness)s which will raise KeyError at runtime; replace with standard %(asctime)s and %(levelname)s.

    handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
    self.logger.addHandler(handler)
else:
    # 确保现有处理器的格式一致
    for h in self.logger.handlers:
        h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
self.logger.info("初始化腾讯云OCR客户端,日志级别: %s", logging.getLevelName(log_level))
HTTP Method/Headers

The Baidu token request uses GET with JSON headers and empty data; Baidu OAuth expects form-encoded POST or a clean GET without misleading headers. Validate method, params, error handling, and timeouts.

url = f"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={ak}&client_secret={sk}"
payload = ""
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json'
}
response = requests.get(url, headers=headers, data=payload)
Sanitization Logic

Base64 detection relies on substring 'base64' and returns last chars which may still leak; also stray double spaces in method name and unnecessary nested isinstance checks can cause confusion and style issues.

def   sanitize_log_data(data, max_length=100):
    """处理日志中的敏感数据
    1. 如果是Base64数据,只显示最后10个字符并标记
    2. 长字符串自动截断并标记
    :param data: 需要处理的数据
    :param max_length: 最大长度,超过该长度将被截断
    """
    if data is None:
        return None
    if isinstance(data, str):
        if len(data) > max_length and 'base64' in data.lower():
            # 处理Base64数据
            return data[-20:] if len(data) > 100 else data[-10:]
        elif len(data) > max_length:
            # 处理长字符串
            if isinstance(data, str):
                truncated = data[:max_length]
                # 确保截断不会破坏UTF-8字符
                while len(truncated.encode('utf-8')) > max_length:
                    truncated = truncated[:-1]
                return truncated
            return data
    return data
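
For comparison, here is a minimal sketch of a sanitizer that avoids echoing any part of sensitive values. It is not the PR's implementation: the function name `sanitize_for_log` and the explicit `sensitive` flag are assumptions, chosen so that the caller decides what is sensitive instead of relying on substring detection.

```python
def sanitize_for_log(data, max_length=100, sensitive=False):
    """Return a log-safe representation of data.

    Assumption: the caller knows which values are sensitive (for example
    an ImageBase64 argument) and says so through the 'sensitive' flag.
    """
    if not isinstance(data, str):
        return data
    if sensitive:
        # Report only the size, never any characters of the value.
        return f"[redacted: {len(data)} chars]"
    if len(data) > max_length:
        return f"{data[:max_length]}...[truncated: {len(data)} chars]"
    return data


print(sanitize_for_log("x" * 500, sensitive=True))
print(sanitize_for_log("https://example.com/a.png"))
```

Slicing a Python str operates on code points, so the prefix cannot split a character; a byte-length limit would need separate handling.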

@qodo-code-review

qodo-code-review bot commented Sep 4, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
High-level
Fix release workflow conditions

The release workflow now triggers on all pull_request events, but the job-level
guard was removed, so "release-build" (and the GitHub release step) can run on
unmerged PRs and likely fail due to insufficient permissions or create premature
releases. Reintroduce a job-level if (or step-level if) to run only on push or
merged pull_request events, or restrict the pull_request trigger to closed and
check merged == true to align with the PR’s stated intent.

Examples:

.github/workflows/release.yml [16-17]
  release-build:
    runs-on: ubuntu-latest

Solution Walkthrough:

Before:

# .github/workflows/release.yml
on:
  push:
    branches:
      - master
  pull_request:
    branches: [ "master" ]
    types: [opened, synchronize, reopened, closed]

jobs:
  release-build:
    runs-on: ubuntu-latest
    # No 'if' condition, so it runs on every trigger event.
    steps:
    - name: Create GitHub release
      ...

After:

# .github/workflows/release.yml
on:
  push:
    branches:
      - master
  pull_request:
    branches: [ "master" ]
    types: [closed] # Restrict trigger to 'closed'

jobs:
  release-build:
    runs-on: ubuntu-latest
    # Add a condition to only run on a merged PR.
    if: github.event.pull_request.merged == true || github.event_name == 'push'
    steps:
    - name: Create GitHub release
      ...
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical regression in the release.yml workflow where a job-level guard was removed, which would cause the release job to run incorrectly on unmerged pull requests.

High
Possible issue
Fix logging formatter typos
Suggestion Impact:The commit updated the logging formatter strings to use %(asctime)s and %(levelname)s in both the new handler and existing handlers, fixing the typos exactly as suggested.

code diff:

-                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+                handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
                 self.logger.addHandler(handler)
             else:
                 # 确保现有处理器的格式一致
                 for h in self.logger.handlers:
-                    h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+                    h.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

The logging formatter contains typos in the format string. %(pastime)s should be
%(asctime)s and %(levelness)s should be %(levelname)s. This will cause logging
to fail or display incorrect information.

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py [43]

-handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies invalid format specifiers (%(pastime)s and %(levelness)s) that would cause the logging to raise a KeyError, fixing a bug in the new class.

Medium
General
Remove misleading log message

The code logs "图片Base64验证通过" even when only URL validation was performed. This
misleading log message should be removed or moved to the appropriate validation
block.

mzapi/tencent/ocr/GeneralAccurateOCR.py [83-87]

 if ImageUrl:
     self.logger.debug("验证图片URL: %s", ImageUrl)
     self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])
     self.logger.debug("图片URL验证通过")
-    self.logger.debug("图片Base64验证通过")
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies and removes a misleading log message that incorrectly states Base64 validation passed, improving log accuracy and preventing potential confusion during debugging.

Low
Fix method definition spacing
Suggestion Impact:The commit removed the extra spaces between "def" and the method name, matching the suggested formatting change.

code diff:

-    def   sanitize_log_data(data, max_length=100):
+    def sanitize_log_data(data, max_length=100):

There are extra spaces between def and the method name, which violates Python
style conventions. Use a single space for proper formatting.

mzapi/utlis/verification.py [3]

-def   sanitize_log_data(data, max_length=100):
+def sanitize_log_data(data, max_length=100):

[Suggestion processed]

Suggestion importance[1-10]: 3

__

Why: The suggestion corrects a minor code style issue by removing extra spaces, which improves readability and adheres to PEP 8 conventions without affecting functionality.

Low

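- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows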
@xiaomizhoubaobei xiaomizhoubaobei merged commit f12c8e9 into master Sep 4, 2025
11 checks passed
@xiaomizhoubaobei xiaomizhoubaobei deleted the develop branch September 4, 2025 16:15