Add Human PPI text dataset and test split#8
suencgo wants to merge 1 commit into InternScience:main from
Conversation
Summary of Changes

Hello @suencgo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new dataset for evaluating models on Human Protein-Protein Interaction (PPI) text classification. It provides the infrastructure needed to load, prompt, and evaluate models on this task, using a straightforward binary classification approach. The changes enable the system to handle a new type of biological text data with a tailored evaluation mechanism.

Highlights
Changelog
Activity
Code Review
This pull request adds a new text dataset, Human PPI, for protein-protein interaction prediction, including the dataset class, evaluation logic, and utility functions for parsing model outputs. While the changes are well-structured, two critical security concerns have been identified: an insecure deserialization vulnerability in the evaluation utility that could lead to remote code execution if a malicious prediction file is loaded, and a path traversal vulnerability in the dataset loading logic that could allow reading arbitrary files with a .tsv extension. Additionally, I've provided suggestions to improve code clarity and maintainability by removing wildcard imports, refactoring a parsing function for simplicity, and using more idiomatic pandas operations for data manipulation. Addressing these security vulnerabilities and code suggestions will enhance the overall quality and safety of this addition.
    - Compute the overall accuracy and return a one-row DataFrame:
      columns: ['Total', 'Correct', 'Accuracy']
    """
    data = load(eval_file)
The evaluate_human_ppi_binary function calls the load utility function with the eval_file argument. The load function (defined in scieval/smp/file.py) determines how to process a file based on its extension. If the file extension is .pkl, it uses pickle.load() to deserialize the content. Furthermore, load supports URLs and will automatically download a file if a URL is provided. An attacker can exploit this by providing a URL to a malicious pickle file as the eval_file, leading to arbitrary code execution on the user's system when the evaluation is run. This is a classic insecure deserialization vulnerability.
To remediate this, avoid using pickle.load() for untrusted input. Use safer serialization formats like JSON or safetensors. If pickle must be used, ensure the file comes from a trusted source and validate its integrity. Additionally, consider disabling automatic URL downloading in the load utility or restricting it to a set of trusted domains.
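One way to act on this advice is to validate the evaluation file before it ever reaches `load`. The sketch below is illustrative, not code from this PR: the `check_eval_file` helper and its allow-list are assumptions about what the evaluation pipeline actually needs.

```python
import os

# Extensions we are willing to deserialize; .pkl is deliberately absent
# because pickle.load() can execute arbitrary code. (Illustrative allow-list.)
ALLOWED_EVAL_EXTS = {".tsv", ".csv", ".json", ".xlsx"}


def check_eval_file(eval_file: str) -> str:
    """Reject URLs and unsafe extensions before handing the path to a loader."""
    if eval_file.startswith(("http://", "https://")):
        raise ValueError("Remote eval files are not allowed; download and verify manually.")
    ext = os.path.splitext(eval_file)[1].lower()
    if ext not in ALLOWED_EVAL_EXTS:
        raise ValueError(f"Refusing to load {ext!r}; allowed: {sorted(ALLOWED_EVAL_EXTS)}")
    return eval_file
```

A guard like this keeps the existing `load` utility unchanged while ensuring that pickle deserialization and silent URL downloads can never be triggered from the evaluation entry point.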
    Do not download from a remote source; read the local tsv directly
    under LMUDataRoot:
        <LMUDataRoot>/human_ppi_test.tsv
    """
    data_path = osp.join(LMUDataRoot(), f"{dataset}.tsv")
The load_data method in HumanPPIDataset constructs a file path by joining LMUDataRoot() with a user-supplied dataset name and appending .tsv. There is no validation or sanitization of the dataset name to prevent path traversal sequences like ../. An attacker could potentially use this to read arbitrary .tsv files on the system that are outside the intended data directory.
To remediate this, sanitize the dataset input by ensuring it does not contain path traversal characters (e.g., .., /, \). Alternatively, validate that the resulting data_path is still within the expected LMUDataRoot() directory.
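The second remediation option can be sketched as a small containment check. `resolve_dataset_path` is a hypothetical helper, not part of this PR; it assumes the data root and dataset name are both plain strings.

```python
import os


def resolve_dataset_path(root: str, dataset: str) -> str:
    """Join root and dataset name, refusing any path that escapes root."""
    real_root = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(real_root, f"{dataset}.tsv"))
    # realpath() collapses '..' segments and symlinks, so a traversal attempt
    # like '../../etc/passwd' resolves outside root and is rejected here.
    if os.path.commonpath([real_root, candidate]) != real_root:
        raise ValueError(f"Dataset name {dataset!r} escapes the data root")
    return candidate
```

Checking the resolved path (rather than scanning the input for `..` substrings) is more robust, since it also catches traversal via symlinks or platform-specific separators.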
from .text_base import TextBaseDataset
from .utils.human_ppi import evaluate_human_ppi_binary
from ..smp import *
Using wildcard imports (from ... import *) is generally discouraged as it can lead to namespace pollution and make it difficult to trace where functions and classes originate. It's better to explicitly import the names you need. In this case, you seem to be using osp, LMUDataRoot, and load.
- from ..smp import *
+ from ..smp import LMUDataRoot, load, osp
import re
from typing import Any

from ...smp import *
Using wildcard imports (from ... import *) is generally discouraged as it can lead to namespace pollution and make it difficult to trace where functions and classes originate. It's better to explicitly import the names you need. In this case, you seem to be using load, dump, and pd.
- from ...smp import *
+ from ...smp import load, dump, pd
def parse_human_ppi_output(output_text: str) -> int:
    """
    Parse the model output and decide between Yes (1) and No (0).

    The logic largely mirrors the local test script `test_human_ppi.py`:
    - First, check whether the output starts with Yes./No.
    - Then, check whether yes/no appears within the first few characters.
    - Then, look for an explicit Yes/No word in the first 200 characters.
    - If the text contains affirmative terms such as interaction/binding,
      combine that with a check for negation words.

    Returns:
        1 if the prediction is Yes (a PPI exists)
        0 if the prediction is No (no PPI, or undetermined)
    """
    output_text = (output_text or "").strip()
    if not output_text:
        return 0

    # 1) Check the very beginning of the output
    if re.match(r"^(Yes|yes|YES)[\s.,;:!?]", output_text):
        return 1
    if re.match(r"^(No|no|NO)[\s.,;:!?]", output_text):
        return 0

    # 2) Check the first few characters
    first_words = output_text[:50].lower()
    if first_words.startswith("yes"):
        return 1
    if first_words.startswith("no"):
        return 0

    # 3) Search for an explicit Yes/No in the first 200 characters
    head = output_text[:200]
    if re.search(r"\b(No|no|NO)\b", head):
        return 0
    if re.search(r"\b(Yes|yes|YES)\b", head):
        return 1

    # 4) Check for interaction/binding-related wording
    if re.search(r"\b(interaction|binding|interact|bind|will interact|can interact)\b", output_text, re.IGNORECASE):
        # If an explicit negation word is present, treat it as 0
        if re.search(
            r"\b(not|no|unlikely|cannot|doesn't|don't|won't|no interaction|unlikely to)\b",
            output_text,
            re.IGNORECASE,
        ):
            return 0
        return 1

    # Default to No when the output cannot be classified
    return 0
The logic for parsing the model's text output is a bit complex and can be simplified for better readability and robustness. Using re.IGNORECASE can simplify the regular expressions, and combining the initial checks can reduce redundancy. This refactoring preserves the original logic while making it cleaner.
def parse_human_ppi_output(output_text: str) -> int:
    """
    Parse the model output and decide between Yes (1) and No (0).

    The logic follows the local test script `test_human_ppi.py`, simplified:
    - First, check whether the output starts with yes/no.
    - Then, look for an explicit yes/no word in the first 200 characters.
    - If the text contains terms such as interaction/binding, combine that
      with a check for negation words.

    Returns:
        1 if the prediction is Yes (a PPI exists)
        0 if the prediction is No (no PPI, or undetermined)
    """
    output_text = (output_text or "").strip()
    if not output_text:
        return 0
    # 1) Check for "yes" or "no" at the beginning of the string, using word boundaries.
    if re.match(r"^yes\b", output_text, re.IGNORECASE):
        return 1
    if re.match(r"^no\b", output_text, re.IGNORECASE):
        return 0
    # 2) Search for explicit Yes/No words in the first 200 characters.
    # The order is important: 'no' is checked first.
    head = output_text[:200]
    if re.search(r"\bno\b", head, re.IGNORECASE):
        return 0
    if re.search(r"\byes\b", head, re.IGNORECASE):
        return 1
    # 3) Check for interaction/binding related expressions.
    if re.search(r"\b(interaction|binding|interact|bind|will interact|can interact)\b", output_text, re.IGNORECASE):
        # If it contains obvious negative words, it's a 0.
        if re.search(
            r"\b(not|no|unlikely|cannot|doesn't|don't|won't|no interaction|unlikely to)\b",
            output_text,
            re.IGNORECASE,
        ):
            return 0
        return 1
    # Default to No if unable to determine.
    return 0

# Keep only indices that appear in meta
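As a quick sanity check on the simplified parsing rules, the standalone sketch below re-states the first three checks (it is a condensed illustration, not an import from this PR):

```python
import re


def parse_yes_no(text: str) -> int:
    """Condensed Yes/No parsing rules, for illustration only."""
    text = (text or "").strip()
    if not text:
        return 0
    # Leading yes/no wins outright.
    if re.match(r"^yes\b", text, re.IGNORECASE):
        return 1
    if re.match(r"^no\b", text, re.IGNORECASE):
        return 0
    # Otherwise look for an explicit yes/no word early on; 'no' first.
    head = text[:200]
    if re.search(r"\bno\b", head, re.IGNORECASE):
        return 0
    if re.search(r"\byes\b", head, re.IGNORECASE):
        return 1
    return 0


print(parse_yes_no("Yes, these proteins interact."))   # 1
print(parse_yes_no("No interaction is expected."))     # 0
print(parse_yes_no("I think the answer is yes."))      # 1
```

Note the single-backslash `\b` word boundaries: in a raw string, `\\b` would match a literal backslash followed by `b`, which is why the doubled form in the bot's suggestion would never match real model output.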
data = data[data["index"].isin(meta["index"])]

# Parse each prediction
preds = [parse_human_ppi_output(str(x)) for x in data["prediction"]]
data["pred_label"] = preds

# Merge in the ground-truth labels
idx2label = {int(i): int(c) for i, c in zip(meta["index"], meta["category"])}
data["true_label"] = [idx2label[int(i)] for i in data["index"]]

data["correct"] = data["pred_label"] == data["true_label"]
The current implementation for merging true labels is correct, but it can be made more concise and efficient by using pandas.merge. This is more idiomatic for pandas operations and can be more performant on larger datasets.
# Keep only indices that appear in meta, and merge in the true labels
data = pd.merge(data, meta[['index', 'category']], on='index', how='inner')
data.rename(columns={'category': 'true_label'}, inplace=True)
# Parse each prediction
data['pred_label'] = [parse_human_ppi_output(str(x)) for x in data['prediction']]
data['correct'] = data['pred_label'] == data['true_label']
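For reference, the one-row result DataFrame described in the docstring (columns `['Total', 'Correct', 'Accuracy']`) could be assembled from the merged frame as sketched below; the tiny `data` frame here is fabricated purely for illustration.

```python
import pandas as pd

# Hypothetical merged frame; in the real pipeline this comes from the eval file.
data = pd.DataFrame({
    "pred_label": [1, 0, 1, 0],
    "true_label": [1, 0, 0, 0],
})
data["correct"] = data["pred_label"] == data["true_label"]

# One-row summary: total rows, correct predictions, and overall accuracy.
total = len(data)
correct = int(data["correct"].sum())
result = pd.DataFrame(
    [{"Total": total, "Correct": correct, "Accuracy": correct / total if total else 0.0}]
)
print(result)  # Total=4, Correct=3, Accuracy=0.75
```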
No description provided.