refactor(mzapi): Restructure the project and add new features #2

xiaomizhoubaobei · 2025-09-04T15:43:04Z

User description

Summary by Sourcery

Refactor core OCR modules to improve logging and validation, introduce new OCR and Baidu authentication features, and tighten CI workflows across sync, release, and publish pipelines

New Features:

Add RecognizeGeneralTextImageWarn class for Tencent OCR warning recognition
Add Baidu access_token function for obtaining Baidu API credentials

Enhancements:

Introduce Verification.sanitize_log_data to mask sensitive data in logs
Improve input validation and unify logging format in GeneralBasicOCR and GeneralAccurateOCR
Update IDE Dockerfile to install Python dependencies via yilai.txt and clean up obsolete scripts

CI:

Extend GitHub Actions triggers to all pull_request lifecycle events
Restrict sync, release, and publish jobs to run only on merged pull requests
Fix version extraction regex and rename steps in the release workflow
Remove redundant steps in .cnb.yml and adjust publish-to-pypi conditions

PR Type

Enhancement, Other

Description

Add Baidu API integration with access token functionality
Introduce new OCR warning recognition class RecognizeGeneralTextImageWarn
Enhance logging security with sensitive data sanitization
Refactor CI workflows to restrict operations to merged PRs

Diagram Walkthrough

flowchart LR
  A["Existing Tencent OCR"] --> B["Enhanced with Logging Security"]
  C["New Baidu Module"] --> D["Access Token Function"]
  E["New OCR Warning Class"] --> F["RecognizeGeneralTextImageWarn"]
  G["CI Workflows"] --> H["Restricted to Merged PRs"]

File Walkthrough

Relevant files

Enhancement

8 files

__init__.py `Add Baidu imports and update exports`	+6/-1
__init__.py `Initialize Baidu module structure`	+1/-0
authorization.py `Add Baidu access token function`	+12/-0
GeneralAccurateOCR.py `Enhance logging with data sanitization`	+16/-5
GeneralBasicOCR.py `Improve input validation and logging security`	+15/-4
RecognizeGeneralTextImageWarn.py `Add new OCR warning recognition class`	+112/-0
__init__.py `Export new warning recognition class`	+2/-1
verification.py `Add sensitive data sanitization utility`	+25/-0

Formatting

1 files

ImageValidator.py `Update import statements formatting`	+2/-1

Miscellaneous

1 files

install_sdk.sh `Remove obsolete SDK installation script`	+0/-57

Configuration changes

5 files

.cnb.yml `Remove redundant dependency installation step`	+0/-4
publish-to-pypi.yml `Restrict publishing to merged PRs only`	+3/-3
release.yml `Fix version extraction and restrict releases`	+4/-5
sync-to-coding.yml `Restrict sync operations to merged PRs`	+10/-12
Dockerfile `Add Python dependencies installation via yilai.txt`	+3/-0

Dependencies

1 files

yilai.txt `Add comprehensive SDK dependencies list`	+852/-0

Additional files

1 files

__init__.py	[link]

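- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows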

sourcery-ai · 2025-09-04T15:43:25Z

Reviewer's Guide

This PR refactors CI/CD pipelines to refine workflow triggers and merge conditions, enhances Tencent OCR modules with parameterized logging and data sanitization, adds a new OCR warning handler and Baidu authorization support, and updates dependency management and container setup.

File-Level Changes

Change	Details	Files
Refine GitHub Actions triggers and conditions	Extend pull_request event types to include opened, synchronize, reopened, closed; add if guards in sync and publish jobs to run only on merged PRs; standardize step names and update version extraction regex in release workflow	`.github/workflows/sync-to-coding.yml` `.github/workflows/release.yml` `.github/workflows/publish-to-pypi.yml`
Enhance Tencent OCR logging and input validation	Introduce Verification.sanitize_log_data for sensitive data masking; switch to parameterized logger calls and truncate long data; strengthen empty ImageBase64/ImageUrl validation	`mzapi/tencent/ocr/GeneralAccurateOCR.py` `mzapi/tencent/ocr/GeneralBasicOCR.py`
Add RecognizeGeneralTextImageWarn OCR handler	Implement new class with input validation, sanitization, and logging; integrate with Tencent OCR SDK and raise detailed errors	`mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py`
Introduce verification utility	Create sanitize_log_data to truncate or mask sensitive log data	`mzapi/utlis/verification.py`
Integrate Baidu OCR authorization	Add access_token helper for Baidu API; expose baidu package in main __init__.py and update __all__	`mzapi/baidu/authorization.py` `mzapi/baidu/__init__.py` `mzapi/__init__.py`
Update dependencies and container setup	Add yilai.txt for Python requirements; modify IDE Dockerfile to install dependencies; remove redundant install steps in .cnb.yml; clean up imports in ImageValidator	`.ide/Dockerfile` `yilai.txt` `.cnb.yml` `mzapi/utlis/ImageValidator.py`

gemini-code-assist

Summary of Changes

Hello @xiaomizhoubaobei, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

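This pull request restructures the mzapi project and extends its functionality. The core changes streamline dependency management, add integration with a new cloud provider (Baidu Cloud), and enhance the existing Tencent Cloud OCR features with security improvements. The changes aim to improve maintainability, extensibility, and the security of sensitive-data handling.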

Highlights

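Dependency management: Removes the old SDK installation script (install_sdk.sh) in favor of managing Python dependencies centrally in yilai.txt, which is installed automatically during the Docker build. This simplifies environment setup.
New Baidu Cloud SDK integration: Introduces the mzapi/baidu module, which supports Baidu Cloud authorization and lays the groundwork for integrating further Baidu Cloud services (such as NLP).
Tencent Cloud OCR extension and cleanup: Adds the RecognizeGeneralTextImageWarn class, extending Tencent Cloud OCR with general text-image warning recognition. GeneralAccurateOCR and GeneralBasicOCR are refactored to sanitize log data and to validate input parameters more strictly, improving security and robustness.
Log sanitization utility: Adds mzapi/utlis/verification.py, whose sanitize_log_data method truncates or partially hides sensitive data (such as Base64-encoded image data) in logs.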

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'. (link)

General comments:

There’s a lot of duplicated workflow if conditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML.
The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
The .ide/Dockerfile installs dependencies from yilai.txt, which isn’t a standard name—renaming it to requirements.txt (or aligning with existing naming conventions) would clarify its purpose.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- There’s a lot of duplicated workflow `if` conditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML.
- The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
- The `.ide/Dockerfile` installs dependencies from `yilai.txt`, which isn’t a standard name—renaming it to `requirements.txt` (or aligning with existing naming conventions) would clarify its purpose.

## Individual Comments

### Comment 1
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:43` </location>
<code_context>
+            # 只在没有处理器时添加处理器
+            if not self.logger.handlers:
+                handler = logging.StreamHandler()
+                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+                self.logger.addHandler(handler)
+            else:
</code_context>

<issue_to_address>
Custom logging format uses non-standard field names.

Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
</issue_to_address>

### Comment 2
<location> `mzapi/utlis/verification.py:13` </location>
<code_context>
+        if data is None:
+            return None
+        if isinstance(data, str):
+            if len(data) > max_length and 'base64' in data.lower():
+                # 处理Base64数据
+                return data[-20:] if len(data) > 100 else data[-10:]
+            elif len(data) > max_length:
+                # 处理长字符串
</code_context>

<issue_to_address>
Base64 detection logic may not reliably identify base64-encoded data.

Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
</issue_to_address>

### Comment 3
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:82` </location>
<code_context>
+
+            if ImageUrl:
+                self.logger.debug("验证图片URL: %s", ImageUrl)
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+                self.logger.debug("图片URL验证通过")
+                self.logger.debug("图片Base64验证通过")
</code_context>

<issue_to_address>
PDF format is not included in allowed extensions for URL validation.

The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.
</issue_to_address>

### Comment 4
<location> `yilai.txt:109` </location>
<code_context>
+alibabacloud_ccc20200527
+alibabacloud_ccc20200701
+alibabacloud_cciotgw20210721
+alibabacloud_cd2021127
+alibabacloud_cddc20200320
+alibabacloud_cdn20141111
</code_context>

<issue_to_address>
Possible typo in package name: 'alibabacloud_cd2021127'.

This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
alibabacloud_cd2021127
=======
alibabacloud_cd20211207
>>>>>>> REPLACE

</suggested_fix>

## Security Issues

### Issue 1
<location> `mzapi/baidu/authorization.py:11` </location>

<issue_to_address>
**security (python.requests.best-practice.use-timeout):** Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.

```suggestion
    response = requests.get(url, headers=headers, data=payload, timeout=30)
```

*Source: opengrep*
</issue_to_address>

sourcery-ai · 2025-09-04T15:44:44Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+            # 只在没有处理器时添加处理器
+            if not self.logger.handlers:
+                handler = logging.StreamHandler()
+                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))


issue (bug_risk): Custom logging format uses non-standard field names.

Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
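
The effect of the misspelled fields can be checked in isolation. The following is a minimal sketch, not code from this PR: `try_format` is a hypothetical helper that formats a single record with each format string. Depending on the Python version, the failure surfaces as a KeyError or a ValueError, so the sketch catches both.

```python
import logging

GOOD_FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
BAD_FORMAT = '%(pastime)s - %(name)s - %(levelness)s - %(message)s'  # as in the diff


def try_format(fmt: str, message: str) -> str:
    """Format one INFO record with fmt, or report why formatting failed."""
    record = logging.LogRecord(
        name="demo", level=logging.INFO, pathname="demo.py", lineno=1,
        msg=message, args=(), exc_info=None,
    )
    try:
        return logging.Formatter(fmt).format(record)
    except (KeyError, ValueError) as exc:
        # The unknown field names are not attributes of the LogRecord.
        return f"formatting failed: {exc!r}"


print(try_format(GOOD_FORMAT, "hello"))
print(try_format(BAD_FORMAT, "hello"))
```

When such a formatter is attached to a handler, the standard library normally reports the failure on stderr through `Handler.handleError` rather than propagating it, so the visible symptom is typically lost log lines accompanied by tracebacks.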

sourcery-ai · 2025-09-04T15:44:44Z

mzapi/utlis/verification.py

+            if len(data) > max_length and 'base64' in data.lower():
+                # 处理Base64数据
+                return data[-20:] if len(data) > 100 else data[-10:]


suggestion: Base64 detection logic may not reliably identify base64-encoded data.

Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
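
One way to act on this comment is to try decoding the string instead of searching for a marker. The sketch below is an assumption about what such a check could look like, not the project's implementation; `looks_like_base64` and its `min_length` parameter are hypothetical names.

```python
import base64
import binascii


def looks_like_base64(data: str, min_length: int = 16) -> bool:
    """Heuristically decide whether data is Base64 by attempting a strict decode."""
    # Strip an optional data-URI prefix such as "data:image/png;base64,".
    if data.startswith("data:") and ";base64," in data:
        data = data.split(";base64,", 1)[1]
    # Padded Base64 always has a length that is a multiple of 4.
    if len(data) < min_length or len(data) % 4 != 0:
        return False
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

This remains a heuristic: a purely alphanumeric string whose length is a multiple of 4 also decodes successfully, so the calling context (for example, knowing the value came from the ImageBase64 parameter) is still the more reliable signal.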

sourcery-ai · 2025-09-04T15:44:44Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+
+            if ImageUrl:
+                self.logger.debug("验证图片URL: %s", ImageUrl)
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])


issue: PDF format is not included in allowed extensions for URL validation.

The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.

sourcery-ai · 2025-09-04T15:44:44Z

yilai.txt

+alibabacloud_ccc20200527
+alibabacloud_ccc20200701
+alibabacloud_cciotgw20210721
+alibabacloud_cd2021127


issue (typo): Possible typo in package name: 'alibabacloud_cd2021127'.

This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.

Suggested change

-alibabacloud_cd2021127
+alibabacloud_cd20211207

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/baidu/authorization.py

+        'Content-Type': 'application/json',
+        'Accept': 'application/json'
+    }
+    response = requests.get(url, headers=headers, data=payload)


security (python.requests.best-practice.use-timeout): Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.

Suggested change

-    response = requests.get(url, headers=headers, data=payload)
+    response = requests.get(url, headers=headers, data=payload, timeout=30)

Source: opengrep
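
Beyond the one-line timeout fix, the token helper could also check the HTTP status and the response body. The following is a minimal sketch under stated assumptions: the caller injects an HTTP getter with the same call shape as `requests.get`, and the names `fetch_access_token`, `http_get` and `TokenError` are hypothetical and not part of mzapi.

```python
from typing import Any, Callable

TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"  # endpoint quoted in the diff


class TokenError(RuntimeError):
    """Hypothetical error type raised when no token can be obtained."""


def fetch_access_token(ak: str, sk: str,
                       http_get: Callable[..., Any],
                       timeout: float = 10.0) -> str:
    """Fetch a token through an injected getter shaped like requests.get."""
    params = {
        "grant_type": "client_credentials",
        "client_id": ak,
        "client_secret": sk,
    }
    # An explicit timeout keeps the call from blocking indefinitely.
    response = http_get(TOKEN_URL, params=params, timeout=timeout)
    # With requests.get, this raises HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    token = response.json().get("access_token")
    if not token:
        raise TokenError("response did not contain an access_token")
    return token
```

With requests installed, the call would look like `fetch_access_token(ak, sk, http_get=requests.get)`. Injecting the getter is only a convenience so the function can be exercised without network access.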

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/tencent/ocr/GeneralAccurateOCR.py

        :param EnableDetectText: 文本检测开关，默认为true。设置为false可直接进行单行识别，适用于仅包含正向单行文本的图片场景。
        :param ConfigID: 配置ID支持：  OCR -- 通用场景  MulOCR--多语种场景
        """
        try:


issue (code-quality): We've found these issues:

Replaces an empty collection equality with a boolean operation [×2] (simplify-empty-collection-comparison)

Explicitly raise from a previous error (raise-from-previous-error)

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/tencent/ocr/GeneralBasicOCR.py

            :param IsWords: 是否返回单字信息，默认false。
            :return: 识别结果，返回为JSON字符串格式，包含文本识别结果、方向信息及可能的错误信息。
            """
        try:


issue (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+            self.logger.info("OCR客户端初始化完成")
+        except Exception as e:
+            self.logger.error(f"初始化失败: {str(e)}")
+            raise TencentCloudSDKException("初始化失败", str(e))


suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change

-            raise TencentCloudSDKException("初始化失败", str(e))
+            raise TencentCloudSDKException("初始化失败", str(e)) from e
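
What `from e` buys can be shown with a small stand-alone sketch. `InitError` and `init_client` are hypothetical stand-ins for TencentCloudSDKException and the class constructor; the point is only that explicit chaining records the original exception as `__cause__`, so the traceback shows the root failure.

```python
class InitError(Exception):
    """Hypothetical stand-in for TencentCloudSDKException."""


def init_client(factory):
    """Call factory() and wrap any failure, keeping the original as the cause."""
    try:
        return factory()
    except Exception as e:
        # "from e" stores the original exception on __cause__.
        raise InitError("initialization failed", str(e)) from e


def broken_factory():
    raise KeyError("secret_id")


try:
    init_client(broken_factory)
except InitError as err:
    print(type(err.__cause__).__name__)
```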

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+        :param PdfPageNumber: 需要识别的PDF页面的对应页码，仅支持PDF单页识别，当上传文件为PDF且IsPdf参数值为true时有效，默认值为1。
+        :param Type: 识别类型，可选值为General 通用告警（支持所有类型告警）,LicensePlate 车牌告警（支持翻拍告警）
+        """
+        try:


issue (code-quality): We've found these issues:

Replaces an empty collection equality with a boolean operation [×2] (simplify-empty-collection-comparison)

Explicitly raise from a previous error (raise-from-previous-error)

sourcery-ai · 2025-09-04T15:44:45Z

mzapi/utlis/verification.py

+        if data is None:
+            return None
+        if isinstance(data, str):
+            if len(data) > max_length and 'base64' in data.lower():


issue (code-quality): We've found these issues:

Lift repeated conditional into its own if statement (lift-duplicated-conditional)

Remove unnecessary else after guard condition (remove-unnecessary-else)
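
Applied to the quoted function, the two refactorings could look roughly like this. This is a sketch that aims to keep the behavior of the snippet in the diff unchanged (including the substring-based Base64 check that another comment questions); it is not the committed code.

```python
def sanitize_log_data(data, max_length=100):
    """Shorten long strings for logging; other values pass through unchanged."""
    # Single guard: non-strings (including None) and short strings are returned as-is.
    if not isinstance(data, str) or len(data) <= max_length:
        return data
    # From here on, data is a string longer than max_length.
    if 'base64' in data.lower():
        return data[-20:] if len(data) > 100 else data[-10:]
    truncated = data[:max_length]
    # Trim further until the UTF-8 encoding fits within max_length bytes.
    while len(truncated.encode('utf-8')) > max_length:
        truncated = truncated[:-1]
    return truncated
```

Note that the loop limits the encoded byte length, so multi-byte text ends up shorter than max_length characters.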

gemini-code-assist

Code Review

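This PR reworks the project's dependency management, replacing the install_sdk.sh script with yilai.txt and pip, which is a good improvement. It also adds support for new Baidu Cloud and Tencent Cloud features and tightens logging to avoid recording sensitive data.

The review found several areas for improvement:

The dependency file yilai.txt contains a large number of non-essential packages, which greatly increases installation time and environment size; it should be trimmed.
The newly added RecognizeGeneralTextImageWarn.py has several problems, including typos in the log format string that may cause runtime errors.
Error handling and logging in some of the code could be more complete and accurate.
The logic of some utility functions (such as verification.py) could be clearer.

Overall, the refactoring is headed in the right direction, but the details leave room for improvement. Please see the individual review comments.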

gemini-code-assist · 2025-09-04T15:45:39Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+            if not self.logger.handlers:
+                handler = logging.StreamHandler()
+                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+                self.logger.addHandler(handler)
+            else:
+                # 确保现有处理器的格式一致
+                for h in self.logger.handlers:
+                    h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))


The log format string contains typos: %(pastime)s and %(levelness)s should be %(asctime)s and %(levelname)s. These errors raise a KeyError exception when a log record is emitted, crashing the program.

Suggested change

-            if not self.logger.handlers:
-                handler = logging.StreamHandler()
-                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
-                self.logger.addHandler(handler)
-            else:
-                # 确保现有处理器的格式一致
-                for h in self.logger.handlers:
-                    h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+            if not self.logger.handlers:
+                handler = logging.StreamHandler()
+                handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
+                self.logger.addHandler(handler)
+            else:
+                # 确保现有处理器的格式一致
+                for h in self.logger.handlers:
+                    h.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

yilai.txt

gemini-code-assist · 2025-09-04T15:45:39Z

mzapi/baidu/authorization.py

+def access_token(ak, sk):
+    url = f"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={ak}&client_secret={sk}"
+    payload = ""
+    headers = {
+        'Content-Type': 'application/json',
+        'Accept': 'application/json'
+    }
+    response = requests.get(url, headers=headers, data=payload)
+    return response.json().get("access_token")


The access_token function has no error handling for the network request or the response. If the requests.get call fails (for example, because of a network problem or a DNS error), or if the Baidu API returns a non-200 status code (for example, because the ak/sk pair is wrong), response.json() may raise an exception and crash the whole program. Consider adding a try...except block that catches requests.exceptions.RequestException and json.JSONDecodeError, and check the response status code. Also, for GET requests it is more common to pass query parameters through the params argument.

def access_token(ak, sk):
    url = "https://aip.baidubce.com/oauth/2.0/token"
    params = {
        "grant_type": "client_credentials",
        "client_id": ak,
        "client_secret": sk
    }
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    try:
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()  # 如果状态码不是 2xx，则抛出 HTTPError
        return response.json().get("access_token")
    except requests.exceptions.RequestException as e:
        # 建议使用日志模块记录错误
        print(f"请求 access_token 时发生网络错误: {e}")
        return None
    except ValueError:  # json.JSONDecodeError
        print("解析百度API响应失败")
        return None

gemini-code-assist · 2025-09-04T15:45:39Z

mzapi/tencent/ocr/GeneralAccurateOCR.py

+                self.logger.debug("验证图片URL: %s", ImageUrl)
                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])
                self.logger.debug("图片URL验证通过")
+                self.logger.debug("图片Base64验证通过")


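The log line “图片Base64验证通过” ("image Base64 validation passed") is misleading. The code only validates ImageUrl and never validates the content or format of ImageBase64. Consider removing this log line or implementing a real Base64 validation function.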

gemini-code-assist · 2025-09-04T15:45:40Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+
+            if ImageUrl:
+                self.logger.debug("验证图片URL: %s", ImageUrl)
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])


The validate_url call is inconsistent with the docstring. The docstring (lines 68-69) says that PDF and BMP formats are supported, but the validation here only allows ["png", "jpg", "jpeg"]. Users who pass a PDF or BMP URL will hit an unexpected ValueError.

Suggested change

-                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])

gemini-code-assist · 2025-09-04T15:45:40Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+                self.logger.debug("验证图片URL: %s", ImageUrl)
+                self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+                self.logger.debug("图片URL验证通过")
+                self.logger.debug("图片Base64验证通过")


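The log line “图片Base64验证通过” ("image Base64 validation passed") is misleading. The code only validates ImageUrl and never validates the content or format of ImageBase64. Consider removing this log line.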

gemini-code-assist · 2025-09-04T15:45:40Z

mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py

+            self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
+                              self.sanitize_log_data.sanitize_log_data(ImageUrl,50),
+                              IsPdf,
+                              PdfPageNumber,
+                              Type)


The parameters recorded by this debug log are incomplete and incorrect. The log message omits ImageBase64 and mistakenly passes ImageUrl to sanitize_log_data. This hampers debugging because no part of the incoming ImageBase64 is visible.

Suggested change

-            self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
-                              self.sanitize_log_data.sanitize_log_data(ImageUrl,50),
-                              IsPdf,
-                              PdfPageNumber,
-                              Type)
+            self.logger.debug("请求参数: ImageBase64=%s, ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s",
+                              self.sanitize_log_data.sanitize_log_data(ImageBase64,50),
+                              ImageUrl,
+                              IsPdf,
+                              PdfPageNumber,
+                              Type)

gemini-code-assist · 2025-09-04T15:45:40Z

mzapi/utlis/verification.py

@@ -0,0 +1,25 @@
+class Verification:
+    @staticmethod
+    def   sanitize_log_data(data, max_length=100):


The definition of sanitize_log_data contains extra spaces; remove them to keep the code style consistent.

Suggested change

-    def   sanitize_log_data(data, max_length=100):
+    def sanitize_log_data(data, max_length=100):

gemini-code-assist · 2025-09-04T15:45:40Z

mzapi/utlis/verification.py

+                if isinstance(data, str):
+                    truncated = data[:max_length]
+                    # 确保截断不会破坏UTF-8字符
+                    while len(truncated.encode('utf-8')) > max_length:
+                        truncated = truncated[:-1]
+                    return truncated
+                return data


There is a redundant isinstance(data, str) check here. The outer if isinstance(data, str): (line 12) has already confirmed that data is a string, so the inner check can be removed to simplify the code.

                truncated = data[:max_length]
                # 确保截断不会破坏UTF-8字符
                while len(truncated.encode('utf-8')) > max_length:
                    truncated = truncated[:-1]
                return truncated

qodo-code-review · 2025-09-04T15:55:51Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns

Sensitive data in logs: While sanitization is added, current implementation may still log raw URLs and partial Base64 or access tokens. Ensure tokens/credentials are fully masked and avoid logging Base64 unless necessary.

External request hardening: The Baidu access_token call lacks timeout and error checks; add timeouts, exception handling, and verify status codes to prevent DoS or leaking error bodies.

⚡ Recommended focus areas for review

Logging Format Bug

The logging formatter uses nonstandard fields like %(pastime)s and %(levelness)s which will raise KeyError at runtime; replace with standard %(asctime)s and %(levelname)s.

                handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
                self.logger.addHandler(handler)
            else:
                # 确保现有处理器的格式一致
                for h in self.logger.handlers:
                    h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
            self.logger.info("初始化腾讯云OCR客户端，日志级别: %s", logging.getLevelName(log_level))

HTTP Method/Headers

The Baidu token request uses GET with JSON headers and empty data; Baidu OAuth expects form-encoded POST or a clean GET without misleading headers. Validate method, params, error handling, and timeouts.

    url = f"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={ak}&client_secret={sk}"
    payload = ""
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    response = requests.get(url, headers=headers, data=payload)

Sanitization Logic

Base64 detection relies on substring 'base64' and returns last chars which may still leak; also stray double spaces in method name and unnecessary nested isinstance checks can cause confusion and style issues.

    def sanitize_log_data(data, max_length=100):
        """处理日志中的敏感数据
        1. 如果是Base64数据，只显示最后10个字符并标记
        2. 长字符串自动截断并标记
        :param data: 需要处理的数据
        :param max_length: 最大长度，超过该长度将被截断
        """
        if data is None:
            return None
        if isinstance(data, str):
            if len(data) > max_length and 'base64' in data.lower():
                # 处理Base64数据
                return data[-20:] if len(data) > 100 else data[-10:]
            elif len(data) > max_length:
                # 处理长字符串
                if isinstance(data, str):
                    truncated = data[:max_length]
                    # 确保截断不会破坏UTF-8字符
                    while len(truncated.encode('utf-8')) > max_length:
                        truncated = truncated[:-1]
                    return truncated
                return data
        return data

qodo-code-review · 2025-09-04T15:58:44Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
High-level	Fix release workflow conditions The release workflow now triggers on all pull_request events, but the job-level guard was removed, so "release-build" (and the GitHub release step) can run on unmerged PRs and likely fail due to insufficient permissions or create premature releases. Reintroduce a job-level if (or step-level if) to run only on push or merged pull_request events, or restrict the pull_request trigger to closed and check merged == true to align with the PR’s stated intent. Examples: .github/workflows/release.yml [16-17] release-build: runs-on: ubuntu-latest Solution Walkthrough: Before: # .github/workflows/release.yml on: push: branches: - master pull_request: branches: [ "master" ] types: [opened, synchronize, reopened, closed] jobs: release-build: runs-on: ubuntu-latest # No 'if' condition, so it runs on every trigger event. steps: - name: Create GitHub release ... After: # .github/workflows/release.yml on: push: branches: - master pull_request: branches: [ "master" ] types: [closed] # Restrict trigger to 'closed' jobs: release-build: runs-on: ubuntu-latest # Add a condition to only run on a merged PR. if: github.event.pull_request.merged == true \|\| github.event_name == 'push' steps: - name: Create GitHub release ... Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies a critical regression in the `release.yml` workflow where a job-level guard was removed, which would cause the release job to run incorrectly on unmerged pull requests.	High
Possible issue	✅ ~~Fix logging formatter typos~~ Suggestion Impact: The commit updated the logging formatter strings to use %(asctime)s and %(levelname)s in both the new handler and existing handlers, fixing the typos exactly as suggested. code diff: - handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) + handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) self.logger.addHandler(handler) else: # 确保现有处理器的格式一致 for h in self.logger.handlers: - h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) + h.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) The logging formatter contains typos in the format string. `%(pastime)s` should be `%(asctime)s` and `%(levelness)s` should be `%(levelname)s`. This will cause logging to fail or display incorrect information. mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py [43] -handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) +handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) `[Suggestion processed]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies invalid format specifiers (`%(pastime)s` and `%(levelness)s`) that would cause the logging to raise a `KeyError`, fixing a bug in the new class.	Medium
General	Remove misleading log message The code logs "图片Base64验证通过" even when only URL validation was performed. This misleading log message should be removed or moved to the appropriate validation block. mzapi/tencent/ocr/GeneralAccurateOCR.py [83-87] if ImageUrl: self.logger.debug("验证图片URL: %s", ImageUrl) self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"]) self.logger.debug("图片URL验证通过") - self.logger.debug("图片Base64验证通过") Apply / Chat Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies and removes a misleading log message that incorrectly states Base64 validation passed, improving log accuracy and preventing potential confusion during debugging.	Low
General: ✅ ~~Fix method definition spacing~~ (impact: Low)

Suggestion Impact: The commit removed the extra spaces between "def" and the method name, matching the suggested formatting change.

code diff:

- def sanitize_log_data(data, max_length=100):
+ def sanitize_log_data(data, max_length=100):

There are extra spaces between `def` and the method name which violates Python syntax conventions. This should be a single space for proper formatting.

mzapi/utlis/verification.py [3]

-def sanitize_log_data(data, max_length=100):
+def sanitize_log_data(data, max_length=100):

`[Suggestion processed]`

Suggestion importance[1-10]: 3

Why: The suggestion corrects a minor code style issue by removing extra spaces, which improves readability and adheres to PEP 8 conventions without affecting functionality.
Update

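- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows

refactor(mzapi): refactor the project and add new features

d0f4b88

- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows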

github-actions bot added core/核心代码 ci/持续集成 tencent/腾讯 documentation/文档 baidu/百度 labels Sep 4, 2025

gemini-code-assist bot reviewed Sep 4, 2025


sourcery-ai bot suggested changes Sep 4, 2025


Merge branch 'master' into develop

c1fc78a

gstraccini bot assigned xiaomizhoubaobei Sep 4, 2025

gstraccini bot approved these changes Sep 4, 2025


gemini-code-assist bot reviewed Sep 4, 2025


qodo-code-review bot added Possible security concern Review effort 3/5 labels Sep 4, 2025

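refactor(mzapi): refactor the project and add new features

ba0eb74

- Add a baidu directory for Baidu-related code
- Add the RecognizeGeneralTextImageWarn feature
- Refactor logging to improve security
- Update the project structure and remove redundant code
- Optimize the GitHub Actions workflows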

github-actions bot added dependencies/依赖 build/构建 labels Sep 4, 2025

xiaomizhoubaobei merged commit f12c8e9 into master Sep 4, 2025
11 checks passed

xiaomizhoubaobei had a problem deploying to pypi September 4, 2025 16:13 — with GitHub Actions Failure

xiaomizhoubaobei deleted the develop branch September 4, 2025 16:15

-response = requests.get(url, headers=headers, data=payload)
+response = requests.get(url, headers=headers, data=payload, timeout=30)
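The `timeout=30` change matters because `requests` does not apply a timeout by default, so a stalled server can block the caller indefinitely. A minimal sketch of enforcing a default in one place is shown here; `get_with_timeout` and `DEFAULT_TIMEOUT` are hypothetical names, and the HTTP function is passed in so the wrapper can be exercised without network access.

```python
DEFAULT_TIMEOUT = 30  # seconds, the value used in the suggested change


def get_with_timeout(http_get, url, **kwargs):
    """Call an HTTP GET function, filling in a timeout if the caller gave none."""
    kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
    return http_get(url, **kwargs)


# Intended use with the real library (assumes `requests` is installed):
#   import requests
#   response = get_with_timeout(requests.get, url, headers=headers, data=payload)
```

An explicit `timeout=` passed by the caller still wins, because `setdefault` only fills in the missing key.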

-raise TencentCloudSDKException("初始化失败", str(e))
+raise TencentCloudSDKException("初始化失败", str(e)) from e
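Adding `from e` records the original exception as the explicit cause of the new one, so the traceback shows both. The sketch below illustrates this with `SDKError`, a hypothetical stand-in for `TencentCloudSDKException` so that the example runs without the Tencent SDK; `init_client` and `failing_factory` are likewise illustrative.

```python
class SDKError(Exception):
    """Hypothetical stand-in for TencentCloudSDKException."""


def init_client(factory):
    """Build a client, re-raising any failure as SDKError with the cause attached."""
    try:
        return factory()
    except Exception as e:
        # `from e` sets __cause__ on the new exception.
        raise SDKError("initialization failed", str(e)) from e


def failing_factory():
    raise ValueError("bad credentials")


try:
    init_client(failing_factory)
except SDKError as err:
    print(type(err.__cause__).__name__)
```

Without `from e` the original error is still reachable through `__context__`, but the traceback labels it as occurring "during handling" rather than as the direct cause.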

-self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"])
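The second argument in these calls is an allow-list of file extensions. The project's `validate_url` implementation is not shown in this conversation, so the following is only a sketch of how such a check might work, written as a hypothetical free function over the standard library.

```python
import posixpath
from urllib.parse import urlparse


def validate_url(url, allowed_extensions):
    """Return the URL's file extension if it is on the allow-list (hypothetical sketch)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    # Look only at the path, so query strings do not affect the extension.
    extension = posixpath.splitext(parsed.path)[1].lstrip(".").lower()
    if extension not in allowed_extensions:
        raise ValueError(f"unsupported file type: {extension!r}")
    return extension
```

Under this sketch a `.pdf` or `.bmp` URL is rejected by the three-item list and accepted by the five-item one.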

	def sanitize_log_data(data, max_length=100):
	def sanitize_log_data(data, max_length=100):
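The PR describes `sanitize_log_data` as masking sensitive data in logs, but its body does not appear in this conversation. The sketch below is one plausible reading of the `(data, max_length=100)` signature, not the project's implementation: it masks a hypothetical set of sensitive dictionary keys and truncates long strings such as Base64 image payloads.

```python
# Hypothetical key names; the real list in the project is not shown in the PR.
SENSITIVE_KEYS = {"secret_id", "secret_key", "access_token", "password"}


def sanitize_log_data(data, max_length=100):
    """Mask sensitive dict values and shorten long strings before logging."""
    if isinstance(data, dict):
        return {
            key: "***" if str(key).lower() in SENSITIVE_KEYS
            else sanitize_log_data(value, max_length)
            for key, value in data.items()
        }
    if isinstance(data, str) and len(data) > max_length:
        hidden = len(data) - max_length
        return f"{data[:max_length]}...[{hidden} more chars]"
    return data
```

A call such as `logger.debug("request: %s", sanitize_log_data(params))` would then keep credentials and large payloads out of the log output.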



