Add Human PPI text dataset and test split#8

Open
suencgo wants to merge 1 commit into InternScience:main from suencgo:feature-human-ppi

Conversation


@suencgo suencgo commented Feb 5, 2026

No description provided.

@gemini-code-assist

Summary of Changes

Hello @suencgo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new dataset for evaluating models on Human Protein-Protein Interaction (PPI) text classification. It provides the necessary infrastructure to load, prompt, and evaluate models on this specific task, focusing on a straightforward binary classification approach. The changes enable the system to handle a new type of biological text data with a tailored evaluation mechanism.

Highlights

  • New Dataset Introduction: A new HumanPPIDataset has been added, specifically designed for binary classification of Human Protein-Protein Interaction (PPI) from text.
  • Custom Evaluation Logic: The dataset includes a specialized evaluation method that parses natural language model outputs (e.g., "Yes" or "No") into binary predictions (0/1) to calculate accuracy, bypassing the need for external judges or tokenizers.
  • Integration into System: The HumanPPIDataset is now integrated into the TEXT_DATASET registry, making it discoverable and usable within the scieval framework.
Changelog
  • scieval/dataset/__init__.py
    • Imported HumanPPIDataset to make it available.
    • Added HumanPPIDataset to the TEXT_DATASET list for automatic registration.
  • scieval/dataset/human_ppi.py
    • Added a new file defining the HumanPPIDataset class.
    • Implemented load_data to read human_ppi_test.tsv from LMUDataRoot().
    • Defined build_prompt to use the 'question' field directly.
    • Implemented evaluate to use the custom evaluate_human_ppi_binary function.
  • scieval/dataset/utils/human_ppi.py
    • Added a new utility file containing parse_human_ppi_output and evaluate_human_ppi_binary.
    • parse_human_ppi_output extracts binary (0/1) predictions from free-form text outputs.
    • evaluate_human_ppi_binary calculates accuracy based on these parsed predictions and true labels.
Activity
  • No human activity has been recorded on this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request adds a new text dataset, Human PPI, for protein-protein interaction prediction, including the dataset class, evaluation logic, and utility functions for parsing model outputs. While the changes are well-structured, two critical security concerns have been identified: an insecure deserialization vulnerability in the evaluation utility that could lead to remote code execution if a malicious prediction file is loaded, and a path traversal vulnerability in the dataset loading logic that could allow reading arbitrary files with a .tsv extension. Additionally, I've provided suggestions to improve code clarity and maintainability by removing wildcard imports, refactoring a parsing function for simplicity, and using more idiomatic pandas operations for data manipulation. Addressing these security vulnerabilities and code suggestions will enhance the overall quality and safety of this addition.

    - Compute the overall accuracy and return a single-row DataFrame:
      columns: ['Total', 'Correct', 'Accuracy']
    """
    data = load(eval_file)


security-high

The evaluate_human_ppi_binary function calls the load utility function with the eval_file argument. The load function (defined in scieval/smp/file.py) determines how to process a file based on its extension. If the file extension is .pkl, it uses pickle.load() to deserialize the content. Furthermore, load supports URLs and will automatically download a file if a URL is provided. An attacker can exploit this by providing a URL to a malicious pickle file as the eval_file, leading to arbitrary code execution on the user's system when the evaluation is run. This is a classic insecure deserialization vulnerability.

To remediate this, avoid using pickle.load() for untrusted input. Use safer serialization formats like JSON or safetensors. If pickle must be used, ensure the file comes from a trusted source and validate its integrity. Additionally, consider disabling automatic URL downloading in the load utility or restricting it to a set of trusted domains.
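One way to apply this remediation is to validate the path before it ever reaches `load`. The sketch below is illustrative only: `check_eval_file` and its allow-list are hypothetical names, not part of the scieval codebase.

```python
import os.path as osp

# Formats that can be parsed without executing code on load.
# (Illustrative allow-list; per the finding above, the real `load` in
# scieval/smp/file.py falls back to pickle.load() for .pkl files.)
SAFE_EXTENSIONS = {".tsv", ".csv", ".json", ".jsonl", ".xlsx"}

def check_eval_file(eval_file: str) -> str:
    """Reject URLs and unsafe formats before handing the path to `load`."""
    if eval_file.startswith(("http://", "https://")):
        raise ValueError("remote eval files are not allowed")
    ext = osp.splitext(eval_file)[1].lower()
    if ext not in SAFE_EXTENSIONS:
        raise ValueError(f"refusing to deserialize {ext!r} files")
    return eval_file
```

`evaluate_human_ppi_binary` could then call `data = load(check_eval_file(eval_file))` so the guard runs on every evaluation path.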

    Do not download from a remote source; read the local tsv directly
    from LMUDataRoot:
        <LMUDataRoot>/human_ppi_test.tsv
    """
    data_path = osp.join(LMUDataRoot(), f"{dataset}.tsv")


security-medium

The load_data method in HumanPPIDataset constructs a file path by joining LMUDataRoot() with a user-supplied dataset name and appending .tsv. There is no validation or sanitization of the dataset name to prevent path traversal sequences like ../. An attacker could potentially use this to read arbitrary .tsv files on the system that are outside the intended data directory.

To remediate this, sanitize the dataset input by ensuring it does not contain path traversal characters (e.g., .., /, \). Alternatively, validate that the resulting data_path is still within the expected LMUDataRoot() directory.


from .text_base import TextBaseDataset
from .utils.human_ppi import evaluate_human_ppi_binary
from ..smp import *


medium

Using wildcard imports (from ... import *) is generally discouraged as it can lead to namespace pollution and make it difficult to trace where functions and classes originate. It's better to explicitly import the names you need. In this case, you seem to be using osp, LMUDataRoot, and load.

Suggested change
- from ..smp import *
+ from ..smp import LMUDataRoot, load, osp

import re
from typing import Any

from ...smp import *


medium

Using wildcard imports (from ... import *) is generally discouraged as it can lead to namespace pollution and make it difficult to trace where functions and classes originate. It's better to explicitly import the names you need. In this case, you seem to be using load, dump, and pd.

Suggested change
- from ...smp import *
+ from ...smp import load, dump, pd

Comment on lines +10 to +60
def parse_human_ppi_output(output_text: str) -> int:
    """
    Parse the model output and decide between Yes (1) and No (0).

    The logic largely mirrors the local test script `test_human_ppi.py`:
    - First, check whether the text starts with "Yes."/"No.".
    - Next, check whether yes/no appears within the first few characters.
    - Next, check the first 200 characters for an explicit Yes/No word.
    - If the text mentions affirmative terms such as interaction/binding,
      decide by also checking for negation words.

    Returns:
        1 if the prediction is Yes (a PPI exists)
        0 if the prediction is No (no PPI, or undeterminable)
    """
    output_text = (output_text or "").strip()
    if not output_text:
        return 0

    # 1) Look at the very beginning of the text.
    if re.match(r"^(Yes|yes|YES)[\s.,;:!?]", output_text):
        return 1
    if re.match(r"^(No|no|NO)[\s.,;:!?]", output_text):
        return 0

    # 2) Look at the first few characters.
    first_words = output_text[:50].lower()
    if first_words.startswith("yes"):
        return 1
    if first_words.startswith("no"):
        return 0

    # 3) Search the first 200 characters for an explicit Yes/No.
    head = output_text[:200]
    if re.search(r"\b(No|no|NO)\b", head):
        return 0
    if re.search(r"\b(Yes|yes|YES)\b", head):
        return 1

    # 4) Check for interaction/binding-related phrasing.
    if re.search(r"\b(interaction|binding|interact|bind|will interact|can interact)\b", output_text, re.IGNORECASE):
        # If an obvious negation word is present, classify as 0.
        if re.search(
            r"\b(not|no|unlikely|cannot|doesn't|don't|won't|no interaction|unlikely to)\b",
            output_text,
            re.IGNORECASE,
        ):
            return 0
        return 1

    # Default to No when the answer cannot be determined.
    return 0


medium

The logic for parsing the model's text output is a bit complex and can be simplified for better readability and robustness. Using re.IGNORECASE can simplify the regular expressions, and combining the initial checks can reduce redundancy. This refactoring preserves the original logic while making it cleaner.

def parse_human_ppi_output(output_text: str) -> int:
    """
    Parse the model output and decide between Yes (1) and No (0).

    The logic largely mirrors the local test script `test_human_ppi.py`:
    - First, check whether the text starts with yes/no.
    - Next, check the first 200 characters for an explicit Yes/No word.
    - If the text mentions affirmative terms such as interaction/binding,
      decide by also checking for negation words.

    Returns:
        1 if the prediction is Yes (a PPI exists)
        0 if the prediction is No (no PPI, or undeterminable)
    """
    output_text = (output_text or "").strip()
    if not output_text:
        return 0

    # 1) Check for "yes" or "no" at the beginning of the string, using word boundaries.
    if re.match(r"^yes\b", output_text, re.IGNORECASE):
        return 1
    if re.match(r"^no\b", output_text, re.IGNORECASE):
        return 0

    # 2) Search for explicit Yes/No words in the first 200 characters.
    # The order is important: 'no' is checked first.
    head = output_text[:200]
    if re.search(r"\bno\b", head, re.IGNORECASE):
        return 0
    if re.search(r"\byes\b", head, re.IGNORECASE):
        return 1

    # 3) Check for interaction/binding-related expressions.
    if re.search(r"\b(interaction|binding|interact|bind|will interact|can interact)\b", output_text, re.IGNORECASE):
        # If it contains obvious negation words, classify as 0.
        if re.search(
            r"\b(not|no|unlikely|cannot|doesn't|don't|won't|no interaction|unlikely to)\b",
            output_text,
            re.IGNORECASE,
        ):
            return 0
        return 1

    # Default to No when the answer cannot be determined.
    return 0

Comment on lines +87 to +98
    # Keep only indices that appear in meta.
    data = data[data["index"].isin(meta["index"])]

    # Parse the predictions.
    preds = [parse_human_ppi_output(str(x)) for x in data["prediction"]]
    data["pred_label"] = preds

    # Merge in the true labels.
    idx2label = {int(i): int(c) for i, c in zip(meta["index"], meta["category"])}
    data["true_label"] = [idx2label[int(i)] for i in data["index"]]

    data["correct"] = data["pred_label"] == data["true_label"]


medium

The current implementation for merging true labels is correct, but it can be made more concise and efficient by using pandas.merge. This is more idiomatic for pandas operations and can be more performant on larger datasets.

    # Keep only indices that appear in meta, and merge in the true labels.
    data = pd.merge(data, meta[['index', 'category']], on='index', how='inner')
    data.rename(columns={'category': 'true_label'}, inplace=True)

    # Parse the predictions.
    data['pred_label'] = [parse_human_ppi_output(str(x)) for x in data['prediction']]

    data['correct'] = data['pred_label'] == data['true_label']
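For reviewers less familiar with the idiom, the inner-merge-plus-rename pattern behaves as follows on synthetic data (a standalone sketch; the DataFrames below are made up, not the repository's files):

```python
import pandas as pd

# Synthetic stand-ins for the prediction file and the meta table.
data = pd.DataFrame({"index": [0, 1, 2, 3],
                     "prediction": ["Yes", "No", "Yes", "No"]})
meta = pd.DataFrame({"index": [0, 1, 2], "category": [1, 0, 0]})

# The inner merge keeps only indices present in meta and attaches the label,
# replacing both the isin() filter and the manual idx2label mapping.
data = pd.merge(data, meta[["index", "category"]], on="index", how="inner")
data.rename(columns={"category": "true_label"}, inplace=True)

print(len(data), list(data.columns))
# 3 ['index', 'prediction', 'true_label']
```

Row 3 of the prediction frame is dropped because it has no entry in meta, which matches the filtering behavior of the original code.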
