-
Notifications
You must be signed in to change notification settings - Fork 0
refactor(mzapi): 重构项目并添加新功能 #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
-新增目录 baidu,用于存放百度相关代码 - 新增 RecognizeGeneralTextImageWarn 功能 - 重构日志记录,提高安全性 - 更新项目结构,删除冗余代码 - 优化 GitHub Actions 工作流
|
|
Reviewer's GuideThis PR refactors CI/CD pipelines to refine workflow triggers and merge conditions, enhances Tencent OCR modules with parameterized logging and data sanitization, adds a new OCR warning handler and Baidu authorization support, and updates dependency management and container setup. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @xiaomizhoubaobei, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
本次拉取请求对mzapi项目进行了结构性重构和功能扩展。核心变化在于优化了项目依赖管理方式,引入了新的云服务提供商(百度云)的集成能力,并对现有腾讯云OCR功能进行了增强和安全性改进。这些变更旨在提高代码的可维护性、扩展性以及敏感数据处理的安全性。
Highlights
- 依赖管理优化: 移除了旧的SDK安装脚本(install_sdk.sh),转而通过yilai.txt文件集中管理Python依赖,并在Docker构建过程中自动安装,简化了环境配置。
- 新增百度云SDK集成: 引入了mzapi/baidu模块,支持百度云的授权认证功能,为后续集成百度云服务(如NLP)奠定基础。
- 腾讯云OCR功能扩展与优化: 新增了RecognizeGeneralTextImageWarn类,扩展了腾讯云OCR的通用告警文本图像识别能力。同时,对GeneralAccurateOCR和GeneralBasicOCR类进行了重构,引入了日志数据脱敏功能,并优化了输入参数校验逻辑,提升了安全性和健壮性。
- 日志数据脱敏工具: 新增了mzapi/utlis/verification.py文件,提供sanitize_log_data方法,用于在日志中对敏感数据(如Base64编码的图片数据)进行截断或部分隐藏,增强了日志安全性。
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'. (link)
General comments:
- There’s a lot of duplicated workflow
ifconditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML. - The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
- The
.ide/Dockerfileinstalls dependencies fromyilai.txt, which isn’t a standard name—renaming it torequirements.txt(or aligning with existing naming conventions) would clarify its purpose.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There’s a lot of duplicated workflow `if` conditions across sync jobs—consider abstracting them into a reusable workflow or using a matrix to DRY up the GitHub Actions YAML.
- The three OCR classes share almost identical initialization, validation, and logging logic—extracting a common base class would reduce duplication and make future changes easier.
- The `.ide/Dockerfile` installs dependencies from `yilai.txt`, which isn’t a standard name—renaming it to `requirements.txt` (or aligning with existing naming conventions) would clarify its purpose.
## Individual Comments
### Comment 1
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:43` </location>
<code_context>
+ # 只在没有处理器时添加处理器
+ if not self.logger.handlers:
+ handler = logging.StreamHandler()
+ handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s'))
+ self.logger.addHandler(handler)
+ else:
</code_context>
<issue_to_address>
Custom logging format uses non-standard field names.
Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
</issue_to_address>
### Comment 2
<location> `mzapi/utlis/verification.py:13` </location>
<code_context>
+ if data is None:
+ return None
+ if isinstance(data, str):
+ if len(data) > max_length and 'base64' in data.lower():
+ # 处理Base64数据
+ return data[-20:] if len(data) > 100 else data[-10:]
+ elif len(data) > max_length:
+ # 处理长字符串
</code_context>
<issue_to_address>
Base64 detection logic may not reliably identify base64-encoded data.
Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
</issue_to_address>
### Comment 3
<location> `mzapi/tencent/ocr/RecognizeGeneralTextImageWarn.py:82` </location>
<code_context>
+
+ if ImageUrl:
+ self.logger.debug("验证图片URL: %s", ImageUrl)
+ self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"])
+ self.logger.debug("图片URL验证通过")
+ self.logger.debug("图片Base64验证通过")
</code_context>
<issue_to_address>
PDF format is not included in allowed extensions for URL validation.
The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.
</issue_to_address>
### Comment 4
<location> `yilai.txt:109` </location>
<code_context>
+alibabacloud_ccc20200527
+alibabacloud_ccc20200701
+alibabacloud_cciotgw20210721
+alibabacloud_cd2021127
+alibabacloud_cddc20200320
+alibabacloud_cdn20141111
</code_context>
<issue_to_address>
Possible typo in package name: 'alibabacloud_cd2021127'.
This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
alibabacloud_cd2021127
=======
alibabacloud_cd20211207
>>>>>>> REPLACE
</suggested_fix>
## Security Issues
### Issue 1
<location> `mzapi/baidu/authorization.py:11` </location>
<issue_to_address>
**security (python.requests.best-practice.use-timeout):** Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.
```suggestion
response = requests.get(url, headers=headers, data=payload, timeout=30)
```
*Source: opengrep*
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| # 只在没有处理器时添加处理器 | ||
| if not self.logger.handlers: | ||
| handler = logging.StreamHandler() | ||
| handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Custom logging format uses non-standard field names.
Using non-standard LogRecord attributes like '%(pastime)s' and '%(levelness)s' may cause runtime errors unless you have a custom logging setup. Prefer standard fields such as '%(asctime)s' and '%(levelname)s'.
mzapi/utlis/verification.py
Outdated
| if len(data) > max_length and 'base64' in data.lower(): | ||
| # 处理Base64数据 | ||
| return data[-20:] if len(data) > 100 else data[-10:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Base64 detection logic may not reliably identify base64-encoded data.
Searching for the substring 'base64' is unreliable, as base64-encoded data may not include it. Use a method that validates the encoding or considers the data's context instead.
|
|
||
| if ImageUrl: | ||
| self.logger.debug("验证图片URL: %s", ImageUrl) | ||
| self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: PDF format is not included in allowed extensions for URL validation.
The docstring indicates PDF support, but the validation only allows image formats. Please add "pdf" to the allowed extensions for consistency.
| alibabacloud_ccc20200527 | ||
| alibabacloud_ccc20200701 | ||
| alibabacloud_cciotgw20210721 | ||
| alibabacloud_cd2021127 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (typo): Possible typo in package name: 'alibabacloud_cd2021127'.
This entry differs from the standard 8-digit date format used elsewhere. Please confirm whether this is correct.
| alibabacloud_cd2021127 | |
| alibabacloud_cd20211207 |
mzapi/baidu/authorization.py
Outdated
| 'Content-Type': 'application/json', | ||
| 'Accept': 'application/json' | ||
| } | ||
| response = requests.get(url, headers=headers, data=payload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
security (python.requests.best-practice.use-timeout): Detected a 'requests' call without a timeout set. By default, 'requests' calls wait until the connection is closed. This means a 'requests' call without a timeout will hang the program if a response is never received. Consider setting a timeout for all 'requests'.
| response = requests.get(url, headers=headers, data=payload) | |
| response = requests.get(url, headers=headers, data=payload, timeout=30) |
Source: opengrep
| :param EnableDetectText: 文本检测开关,默认为true。设置为false可直接进行单行识别,适用于仅包含正向单行文本的图片场景。 | ||
| :param ConfigID: 配置ID支持: OCR -- 通用场景 MulOCR--多语种场景 | ||
| """ | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Replaces an empty collection equality with a boolean operation [×2] (
simplify-empty-collection-comparison) - Explicitly raise from a previous error (
raise-from-previous-error)
| :param IsWords: 是否返回单字信息,默认false。 | ||
| :return: 识别结果,返回为JSON字符串格式,包含文本识别结果、方向信息及可能的错误信息。 | ||
| """ | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Explicitly raise from a previous error (raise-from-previous-error)
| self.logger.info("OCR客户端初始化完成") | ||
| except Exception as e: | ||
| self.logger.error(f"初始化失败: {str(e)}") | ||
| raise TencentCloudSDKException("初始化失败", str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)
| raise TencentCloudSDKException("初始化失败", str(e)) | |
| raise TencentCloudSDKException("初始化失败", str(e)) from e |
| :param PdfPageNumber: 需要识别的PDF页面的对应页码,仅支持PDF单页识别,当上传文件为PDF且IsPdf参数值为true时有效,默认值为1。 | ||
| :param Type: 识别类型,可选值为General 通用告警(支持所有类型告警),LicensePlate 车牌告警(支持翻拍告警) | ||
| """ | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Replaces an empty collection equality with a boolean operation [×2] (
simplify-empty-collection-comparison) - Explicitly raise from a previous error (
raise-from-previous-error)
mzapi/utlis/verification.py
Outdated
| if data is None: | ||
| return None | ||
| if isinstance(data, str): | ||
| if len(data) > max_length and 'base64' in data.lower(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Lift repeated conditional into its own if statement (
lift-duplicated-conditional) - Remove unnecessary else after guard condition (
remove-unnecessary-else)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
这次的PR重构了项目的依赖管理方式,从 install_sdk.sh 脚本改为使用 yilai.txt 和 pip,这是一个很好的改进。同时,添加了对百度云和腾讯云新功能的支持,并对日志记录进行了优化,以避免记录敏感数据。
代码审查发现了一些可以改进的地方:
- 依赖文件
yilai.txt包含了大量非必需的包,这会极大地增加安装时间和环境体积,建议进行优化。 - 新添加的
RecognizeGeneralTextImageWarn.py文件中存在一些问题,包括日志格式化字符串的拼写错误,这可能会导致运行时错误。 - 部分代码的错误处理和日志记录可以更加完善和准确。
- 一些工具函数(如
verification.py)的逻辑可以更清晰。
总的来说,这次重构方向是正确的,但在细节上还有提升空间。请查看具体的审查评论。
| if not self.logger.handlers: | ||
| handler = logging.StreamHandler() | ||
| handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) | ||
| self.logger.addHandler(handler) | ||
| else: | ||
| # 确保现有处理器的格式一致 | ||
| for h in self.logger.handlers: | ||
| h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
日志格式化字符串中存在拼写错误。%(pastime)s 和 %(levelness)s 应该是 %(asctime)s 和 %(levelname)s。这些错误会导致在记录日志时抛出 KeyError 异常,从而使程序崩溃。
| if not self.logger.handlers: | |
| handler = logging.StreamHandler() | |
| handler.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) | |
| self.logger.addHandler(handler) | |
| else: | |
| # 确保现有处理器的格式一致 | |
| for h in self.logger.handlers: | |
| h.setFormatter(logging.Formatter('%(pastime)s - %(name)s - %(levelness)s - %(message)s')) | |
| if not self.logger.handlers: | |
| handler = logging.StreamHandler() | |
| handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) | |
| self.logger.addHandler(handler) | |
| else: | |
| # 确保现有处理器的格式一致 | |
| for h in self.logger.handlers: | |
| h.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) |
| def access_token(ak, sk): | ||
| url = f"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={ak}&client_secret={sk}" | ||
| payload = "" | ||
| headers = { | ||
| 'Content-Type': 'application/json', | ||
| 'Accept': 'application/json' | ||
| } | ||
| response = requests.get(url, headers=headers, data=payload) | ||
| return response.json().get("access_token") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
access_token 函数缺少对网络请求和响应的错误处理。如果 requests.get 调用失败(例如,网络问题、DNS错误),或者百度API返回非200状态码(例如,ak/sk错误),response.json() 可能会抛出异常,导致整个程序崩溃。建议添加 try...except 块来捕获潜在的 requests.exceptions.RequestException 和 json.JSONDecodeError,并检查响应的状态码。同时,对于GET请求,使用params参数传递查询参数是更常见的做法。
def access_token(ak, sk):
url = "https://aip.baidubce.com/oauth/2.0/token"
params = {
"grant_type": "client_credentials",
"client_id": ak,
"client_secret": sk
}
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json'
}
try:
response = requests.get(url, params=params, headers=headers)
response.raise_for_status() # 如果状态码不是 2xx,则抛出 HTTPError
return response.json().get("access_token")
except requests.exceptions.RequestException as e:
# 建议使用日志模块记录错误
print(f"请求 access_token 时发生网络错误: {e}")
return None
except ValueError: # json.JSONDecodeError
print("解析百度API响应失败")
return None| self.logger.debug("验证图片URL: %s", ImageUrl) | ||
| self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg", "bmp", "pdf"]) | ||
| self.logger.debug("图片URL验证通过") | ||
| self.logger.debug("图片Base64验证通过") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| if ImageUrl: | ||
| self.logger.debug("验证图片URL: %s", ImageUrl) | ||
| self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.logger.debug("验证图片URL: %s", ImageUrl) | ||
| self.validate_url.validate_url(ImageUrl, ["png", "jpg", "jpeg"]) | ||
| self.logger.debug("图片URL验证通过") | ||
| self.logger.debug("图片Base64验证通过") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s", | ||
| self.sanitize_log_data.sanitize_log_data(ImageUrl,50), | ||
| IsPdf, | ||
| PdfPageNumber, | ||
| Type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这条调试日志记录的参数不完整且有误。日志消息中缺少对 ImageBase64 的记录,并且错误地将 ImageUrl 传递给了 sanitize_log_data 函数。这会影响调试效率,因为无法看到传入的 ImageBase64 的部分信息。
| self.logger.debug("请求参数: ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s", | |
| self.sanitize_log_data.sanitize_log_data(ImageUrl,50), | |
| IsPdf, | |
| PdfPageNumber, | |
| Type) | |
| self.logger.debug("请求参数: ImageBase64=%s, ImageUrl=%s, IsPdf=%s, PdfPageNumber=%s, Type=%s", | |
| self.sanitize_log_data.sanitize_log_data(ImageBase64,50), | |
| ImageUrl, | |
| IsPdf, | |
| PdfPageNumber, | |
| Type) |
mzapi/utlis/verification.py
Outdated
| @@ -0,0 +1,25 @@ | |||
| class Verification: | |||
| @staticmethod | |||
| def sanitize_log_data(data, max_length=100): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mzapi/utlis/verification.py
Outdated
| if isinstance(data, str): | ||
| truncated = data[:max_length] | ||
| # 确保截断不会破坏UTF-8字符 | ||
| while len(truncated.encode('utf-8')) > max_length: | ||
| truncated = truncated[:-1] | ||
| return truncated | ||
| return data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
|||||||||||||||||
-新增目录 baidu,用于存放百度相关代码 - 新增 RecognizeGeneralTextImageWarn 功能 - 重构日志记录,提高安全性 - 更新项目结构,删除冗余代码 - 优化 GitHub Actions 工作流
User description
Summary by Sourcery
Refactor core OCR modules to improve logging and validation, introduce new OCR and Baidu authentication features, and tighten CI workflows across sync, release, and publish pipelines
New Features:
Enhancements:
CI:
PR Type
Enhancement, Other
Description
Add Baidu API integration with access token functionality
Introduce new OCR warning recognition class
RecognizeGeneralTextImageWarnEnhance logging security with sensitive data sanitization
Refactor CI workflows to restrict operations to merged PRs
Diagram Walkthrough
File Walkthrough
8 files
Add Baidu imports and update exportsInitialize Baidu module structureAdd Baidu access token functionEnhance logging with data sanitizationImprove input validation and logging securityAdd new OCR warning recognition classExport new warning recognition classAdd sensitive data sanitization utility1 files
Update import statements formatting1 files
Remove obsolete SDK installation script5 files
Remove redundant dependency installation stepRestrict publishing to merged PRs onlyFix version extraction and restrict releasesRestrict sync operations to merged PRsAdd Python dependencies installation via yilai.txt1 files
Add comprehensive SDK dependencies list1 files