Milestone 2: Unified Test Prompts for Model Evaluation #7

### Summary

Standardize our model testing on a single Apollo 11 source text, using 15 prompts that cover summarization, reasoning, and RAG tasks.

### Proposal

**Source Text:** Excerpts from the Wikipedia article on Apollo 11 (the “Lunar Landing” and “Lunar Surface Operations” sections), referenced by permanent link (~1,400 words, CC BY-SA 3.0)

### 15 Test Prompts:

- 5 Summarization (easy → hard)
- 5 Reasoning (causal, analytical, hypothetical)
- 5 RAG (fact retrieval with ground truth answers)
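
To make the 5/5/5 split concrete, here is a minimal sketch of how the prompt set could be represented and checked in Python. The names `EvalPrompt`, `build_placeholder_set`, and `validate` are hypothetical, and the prompt texts are placeholders rather than the actual prompts, which live in the detailed documentation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Categories and counts taken from the proposal: 3 categories x 5 prompts.
CATEGORIES = ("summarization", "reasoning", "rag")
PROMPTS_PER_CATEGORY = 5


@dataclass(frozen=True)
class EvalPrompt:
    """One test prompt (hypothetical structure, not an agreed schema)."""
    prompt_id: str
    category: str
    text: str
    ground_truth: Optional[str] = None  # only expected for RAG prompts


def build_placeholder_set() -> list:
    """Build 15 placeholder prompts; real prompt texts are not filled in here."""
    prompts = []
    for category in CATEGORIES:
        for i in range(1, PROMPTS_PER_CATEGORY + 1):
            prompts.append(
                EvalPrompt(
                    prompt_id=f"{category}-{i}",
                    category=category,
                    text=f"<{category} prompt {i} goes here>",
                    ground_truth="<expected answer>" if category == "rag" else None,
                )
            )
    return prompts


def validate(prompts) -> None:
    """Raise ValueError if the set does not match the proposed 5/5/5 layout."""
    counts = Counter(p.category for p in prompts)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    for category in CATEGORIES:
        if counts[category] != PROMPTS_PER_CATEGORY:
            raise ValueError(
                f"{category}: expected {PROMPTS_PER_CATEGORY}, got {counts[category]}"
            )
    for p in prompts:
        if p.category == "rag" and not p.ground_truth:
            raise ValueError(f"{p.prompt_id}: RAG prompt needs a ground truth answer")


if __name__ == "__main__":
    validate(build_placeholder_set())
    print("prompt set layout is valid")
```

A check like this could run before every evaluation so that all models are tested against the same complete set.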

### Why Apollo 11?

- Works for all model types (DistilBERT, SLMs, commercial)
- Fact-dense for RAG testing
- Properly licensed and reproducible
- Hardware-friendly length

### Open Questions for Team

- **Prompt format:** JSON for automation, or plain text for simplicity?
- **Text length:** Is 1,400 words optimal, or should we go shorter or longer (for example, full sections)?
- **Multiple sources:** Start with one text, or prepare multiple examples?
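
To help with the prompt format question, the sketch below shows the same prompt record in both forms. The field names (`id`, `category`, `prompt`, `ground_truth`) and the plain-text layout are assumptions for illustration only, not a decided schema.

```python
import json

# Hypothetical record; the field names and placeholder values are assumptions.
record = {
    "id": "rag-1",
    "category": "rag",
    "prompt": "<question about the source text>",
    "ground_truth": "<expected answer>",
}


def to_json(rec: dict) -> str:
    """Serialize a prompt record as JSON (easy to load in automated runs)."""
    return json.dumps(rec, indent=2, ensure_ascii=False)


def to_plain_text(rec: dict) -> str:
    """Render the same record as plain text (easy to read and copy by hand)."""
    lines = [f"[{rec['id']}] ({rec['category']})", rec["prompt"]]
    if rec.get("ground_truth"):
        lines.append(f"Expected: {rec['ground_truth']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_json(record))
    print()
    print(to_plain_text(record))
```

One possible middle ground is to keep JSON as the stored format and generate the plain-text view from it, so both audiences work from a single source.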

### Next Steps

- Please review the detailed documentation [here](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?pli=1&tab=t.grbusea2zi57)
- Discuss the open questions and provide feedback
- Implement after approval

