Milestone 2: Unified Test Prompts for Model Evaluation #7

@doctorbanu

Summary

Standardize our model testing using Apollo 11 source text with 15 prompts covering summarization, reasoning, and RAG tasks.

Proposal

Source Text: Excerpts from the Wikipedia article on Apollo 11 (the “Lunar Landing” and “Lunar Surface Operations” sections), pinned to a permanent link (~1,400 words, CC BY-SA 3.0)

15 Test Prompts:

  • 5 Summarization (easy → hard)
  • 5 Reasoning (causal, analytical, hypothetical)
  • 5 RAG (fact retrieval with ground truth answers)
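
As a rough sketch of how the 5/5/5 split could be checked automatically, the snippet below models a prompt and validates the set. The names `EvalPrompt`, `validate_prompt_set`, and `build_placeholder_set`, the category strings, and the placeholder prompt texts are all assumptions for illustration and are not the team's actual prompts or schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical category names; the proposal only fixes the three task types.
CATEGORIES = ("summarization", "reasoning", "rag")
PROMPTS_PER_CATEGORY = 5


@dataclass(frozen=True)
class EvalPrompt:
    prompt_id: str
    category: str
    text: str
    ground_truth: Optional[str] = None  # expected only for RAG prompts


def validate_prompt_set(prompts: List[EvalPrompt]) -> List[str]:
    """Return a list of problems; an empty list means the set matches 5/5/5."""
    problems = []
    counts = Counter(p.category for p in prompts)
    for category in CATEGORIES:
        if counts[category] != PROMPTS_PER_CATEGORY:
            problems.append(
                f"{category}: expected {PROMPTS_PER_CATEGORY}, found {counts[category]}"
            )
    for unknown in sorted(set(counts) - set(CATEGORIES)):
        problems.append(f"unknown category: {unknown}")
    for p in prompts:
        if p.category == "rag" and not p.ground_truth:
            problems.append(f"{p.prompt_id}: RAG prompt has no ground truth answer")
    return problems


def build_placeholder_set() -> List[EvalPrompt]:
    """Build 15 placeholder prompts (NOT real test prompts) for the check above."""
    prompts = []
    for category in CATEGORIES:
        for i in range(1, PROMPTS_PER_CATEGORY + 1):
            prompts.append(
                EvalPrompt(
                    prompt_id=f"{category}-{i}",
                    category=category,
                    text=f"<placeholder {category} prompt {i}>",
                    ground_truth="<placeholder answer>" if category == "rag" else None,
                )
            )
    return prompts
```

Running `validate_prompt_set(build_placeholder_set())` returns an empty list; dropping or mislabeling a prompt yields a human-readable problem entry instead.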

Why Apollo 11?

  • Works for all model types (DistilBERT, SLMs, commercial models)
  • Fact-dense for RAG testing
  • Properly licensed and reproducible
  • Hardware-friendly length

Open Questions for Team

  • Prompt format: JSON for automation, or plain text for simplicity?
  • Text length: Is 1,400 words optimal, or should we go shorter or longer (full sections)?
  • Multiple sources: Start with one text, or prepare multiple examples?
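
To make the prompt-format question concrete, here is a minimal sketch of one RAG prompt rendered both ways. The field names (`id`, `category`, `prompt`, `ground_truth`), the plain-text layout, and the example question are assumptions for illustration only, not an agreed schema or one of the 15 prompts.

```python
import json


def to_json_record(prompt_id, category, text, ground_truth=None):
    """Serialize one prompt as a single JSON line (easy to parse in scripts)."""
    record = {"id": prompt_id, "category": category, "prompt": text}
    if ground_truth is not None:
        record["ground_truth"] = ground_truth
    return json.dumps(record, ensure_ascii=False)


def to_plain_text(prompt_id, category, text, ground_truth=None):
    """Render the same prompt as human-readable plain text."""
    lines = [f"[{prompt_id}] ({category})", text]
    if ground_truth is not None:
        lines.append(f"Answer: {ground_truth}")
    return "\n".join(lines)


# Illustrative example only; the real prompts live in the linked documentation.
example = (
    "rag-1",
    "rag",
    "On what date did the lunar module Eagle land on the Moon?",
    "July 20, 1969",
)

json_line = to_json_record(*example)
loaded = json.loads(json_line)  # round-trips without custom parsing
plain = to_plain_text(*example)
```

The JSON form round-trips through `json.loads` with no custom parser, which is the main argument for automation; the plain-text form is easier to read and edit by hand but would need its own parsing rules.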

Next Steps

  • Please review detailed documentation here
  • Discuss and give feedback on the open questions
  • Implement after approval

Labels: documentation (Improvements or additions to documentation)
