189 changes: 189 additions & 0 deletions test_dataset_apollo11/RATIONALE.md
@@ -0,0 +1,189 @@
# Rationale for Text Selection

## Overview

Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing”
and “Lunar surface operations” sections were selected as the unified test
dataset for the ELO2 - Green AI project.

---

## Why Apollo 11?

**Universal Knowledge:**

All major commercial models (GPT-4, Claude, Gemini) have Apollo 11 in their
training data, enabling fair comparison with and without RAG.

**Rich Factual Content:**

Dense with verifiable facts ideal for RAG testing: timestamps (20:17:40 UTC),
numbers (216 lbs of fuel, 21.55 kg of samples), names (Armstrong, Aldrin, Hamilton),
and technical terms (LGC, PLSS, EASEP).

**Accessibility:**

Wikipedia content is freely available, properly licensed (CC BY-SA 3.0), and stable
via permanent links.

**Appropriate Length:**

Complete sections total ~3,800 words; excerpted to ~1,400 words—substantial for
evaluation yet processable by smaller models on standard hardware. This length
aligns with standard benchmarks:

- Summarization tasks typically use 500-2,000 words,
- QA benchmarks 300-1,500 words,
- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015).

The excerpted length balances comprehensiveness with practical testability.

---

## Why These Excerpted Passages?

**Continuous Narrative:**

Selected passages flow from descent through surface activities, forming a natural
story arc ideal for summarization tasks requiring temporal understanding.

**Balanced Complexity:**

- Simple facts (times, names, quotes) suitable for smaller and distilled models
- Complex elements (technical problems, decision-making, procedures) challenging
for all models

**Optimal for RAG:**

Dense with retrievable facts across categories: times, quantities, names, equipment, quotes.

**Reasoning Opportunities:**

Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?),
and analytical reasoning.

**Verified Coverage:**

All 15 test prompts confirmed answerable with excerpted passages through
preliminary testing.

**Length Management:**

Complete sections (~3,800 words) would require extensive chunking for distilled models
with limited token capacity. Excerpted passages (~1,400 words) are more manageable
while maintaining comprehensive content for all test scenarios.
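
The chunking mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes word counts are an acceptable proxy for token counts, and the function name `chunk_words` and the default sizes (200 words per chunk, 20 words of overlap) are hypothetical choices.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so that a fact straddling a
    boundary still appears whole in at least one chunk.
    Sizes are in words, an assumed stand-in for model tokens.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

With these assumed defaults, a ~1,400-word text yields eight chunks, whereas a ~3,800-word text would yield roughly two and a half times as many, which is the practical cost the paragraph above refers to.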

---

## Alignment with Project Goals

**Fair Comparison:**

- Commercial models tested on likely training data
- RAG systems given the same information
- All models evaluated on identical input

**Reproducibility:**

Permanent Wikipedia link, documented excerpt selections, license documentation.

**Why Not Other Approaches?**

- Entire Wikipedia article (all sections)?
  Too long (~10,000+ words): it exceeds the processing capacity of smaller models
  and is impractical for manual verification.
- Self-written summary?
  Custom summaries cannot be reproduced by others and raise objectivity concerns,
  plus potential copyright issues.
- Multiple unrelated passages?
  Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow
  and prevent reasoning questions that require connected context.
- Technical manuals or engineering documents?
  NASA reports are too specialized, likely absent from training data, and limit
  question diversity to technical retrieval.
- Complete sections without excerpting?
  While more comprehensive, ~3,800 words presents practical challenges for smaller
  models and extends testing time. Excerpting maintains essential information
  while improving testability across architectures.

---

## Excerpt Selection Methodology

**From “Lunar landing” section:**

- Descent problems and trajectory issues
- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation
- Manual landing sequence with fuel concerns
- Landing confirmation moment

**From “Lunar surface operations” section:**

- EVA preparation and first step
- Armstrong’s famous quote and its controversy
- Surface activities and movement
- Flag planting and Nixon communication
- Scientific equipment deployment (EASEP)
- Sample collection activities
- Return to lunar module

**Omitted content:**

- Extended technical explanations of radar systems
- Detailed crew dialogue transcripts
- Some procedural minutiae

**Selection criteria:**

- Information density for prompts
- Narrative continuity
- Factual richness for RAG tasks
- Reasoning opportunities

---

## Limitations

**Excerpt nature:**

Using selected passages rather than complete sections reduces some contextual richness,
though all test prompts remain fully answerable.

**Single domain:**

Results may not generalize beyond this topic.

- *Acknowledgment:* This is a focused benchmark within a defined scope.

---

## Conclusion

The excerpted passages from *“Lunar landing”* and *“Lunar surface operations”*
sections provide:

- ✅ Practical content for all model types
- ✅ Reproducibility through permanent links and documented selections
- ✅ Balance of factual density and narrative coherence
- ✅ Support for diverse question types
- ✅ Academic integrity through proper licensing and attribution
- ✅ Alignment with Green AI benchmarking objectives

This selection enables fair, transparent comparison of AI model accuracy and environmental
efficiency while maintaining practical testability on available hardware.

---

## References

**Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
M., & Blunsom, P. (2015).**
*Teaching Machines to Read and Comprehend.*
*Advances in Neural Information Processing Systems, 28.*
[arXiv:1506.03340](https://arxiv.org/abs/1506.03340)

**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).**
*SQuAD: 100,000+ Questions for Machine Comprehension of Text.*
*Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 2383–2392.*
[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/)
181 changes: 181 additions & 0 deletions test_dataset_apollo11/README.md
@@ -0,0 +1,181 @@
# 🚀 Apollo 11 Test Dataset

## 🌕 Overview

This is the unified test dataset for comparing different AI models (commercial,
distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

The dataset consists of selected passages from Wikipedia's Apollo 11 article,
accompanied by 15 standardized prompts testing summarization, reasoning, and
retrieval-augmented generation capabilities.

---

## 📂 Dataset Contents

- **[README.md][readme]** - This file (overview and instructions)
- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text)
- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
- **[test_data.json][json]** - Complete dataset (structured format for automated
testing)
- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions

📌 **Process documentation:** For background on dataset creation decisions and
team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**

[readme]: /test_dataset_apollo11/README.md
[source]: /test_dataset_apollo11/source_text.txt
[prompts]: /test_dataset_apollo11/test_prompts.md
[json]: /test_dataset_apollo11/test_data.json
[rationale]: /test_dataset_apollo11/RATIONALE.md

---

## 📄 Source & License

- **Source:** Wikipedia - Apollo 11 article
- **URL:** <https://en.wikipedia.org/wiki/Apollo_11>
- **Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
- **Revision ID:** 1252473845 (Wikipedia internal revision number)
- **Date Accessed:** October 22, 2025
- **Sections:** Excerpted passages from "Lunar landing" and "Lunar surface
  operations"
- **Word Count:** ~1,400 words
- **Language:** English

**License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)

- ✅ Content can be used freely for research
- ✅ Wikipedia must be attributed as the source
- ✅ Derivative works must be shared under the same license

**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 3.0

**Text Structure:** Selected passages from Wikipedia sections.

- Individual sentences are unchanged; some paragraphs omitted for length management.
- Complete original sections total ~3,800 words; excerpted to ~1,400 words for
practical testing while maintaining all information necessary for the 15 test prompts.

📌 See [source_text.txt][source] for the complete excerpted text.
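
Before testing, it can be useful to confirm that a local copy of the source text matches the description above. The following is a minimal sketch under stated assumptions: the helper names, the 10% tolerance around the ~1,400-word figure, and the use of a SHA-256 fingerprint to detect accidental edits are illustrative choices, not part of the dataset.

```python
import hashlib


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def within_tolerance(count: int, target: int = 1400, tolerance: float = 0.10) -> bool:
    """Return True if `count` is within `tolerance` (a fraction) of `target`.

    The 1,400-word target comes from this README; the 10% tolerance is an
    assumption made for this sketch.
    """
    return abs(count - target) <= target * tolerance


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text.

    Comparing digests across team members is one way to check that
    everyone is testing against an identical, unmodified copy.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

To use it, read `source_text.txt` as UTF-8 text, pass the contents to `count_words` and `fingerprint`, and compare the digest with the one recorded by whoever prepared the file.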

---

## 🎯 Selection Rationale

- ✅ **Practical length** - ~1,400 words manageable for all model types including
  distilled models with standard chunking
- ✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
  technical terms)
- ✅ **Multiple complexity levels** - Both simple recall and complex reasoning can
  be tested
- ✅ **Narrative structure** - Clear sequence from descent through surface
  activities
- ✅ **All prompts answerable** - 15 test prompts verified to work with selected
  passages

The excerpts cover the dramatic descent and landing sequence, followed by
moonwalk activities, ensuring comprehensive testing across summarization,
reasoning, and RAG tasks.

📌 See [RATIONALE.md][rationale] for detailed selection methodology.

---

## 📝 Test Structure

**15 Standardized Prompts** across three categories:

### Summarization (5 prompts)

Tests the model's ability to condense and extract key information

**Difficulty:** Easy → Medium → Hard
**Examples:** Main events, challenges faced, activities performed, equipment
deployed

### Reasoning (5 prompts)

Tests the model's ability to analyze, infer, and make connections

**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
analysis
**Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
manual control? What does Margaret Hamilton's statement reveal?

### RAG - Retrieval (5 prompts)

Tests the model's ability to retrieve specific information from the source text

**Types:** Times, quotes, numbers, lists, complex multi-part facts
**Examples:** Landing time? Material collected? Scientific instruments deployed?

📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
for its structured data version.

---

## 🔧 How to Use

### General Instructions

- **All 15 prompts** should be tested across all models to ensure a fair comparison.
- Some prompts can be more challenging for smaller models,
but attempting all prompts provides comprehensive evaluation data.

**Testing Protocol:**

1. Use the source text from **[source_text.txt][source]** exactly as provided
2. Use all 15 prompts from **[test_prompts.md][prompts]** without modification
3. *(Optional)* Use **[test_data.json][json]** for automated or scripted
   testing workflows
4. Record responses for each prompt with model configuration details
5. Note any errors, failures, or unusual behaviors
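
For scripted runs, the protocol above could be sketched roughly as follows. This is an illustration under assumptions, not the project's test harness: the prompt fields (`id`, `category`, `text`) are hypothetical and may not match the actual layout of `test_data.json`, and `model_fn` stands in for whatever call invokes the model under test.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def run_prompts(
    source_text: str,
    prompts: list[dict],
    model_fn: Callable[[str, str], Optional[str]],
    config: dict,
) -> list[dict]:
    """Run every prompt against one model and record the outcome.

    `prompts` is assumed to be a list of dicts with hypothetical keys
    "id", "category", and "text". `model_fn(source_text, prompt_text)`
    is a placeholder for the real model call.
    """
    records = []
    for prompt in prompts:
        record = {
            "prompt_id": prompt["id"],
            "category": prompt["category"],
            "config": dict(config),  # model version, parameters, hardware
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": "",
        }
        try:
            response = model_fn(source_text, prompt["text"])
        except Exception as exc:
            # Failures are recorded, not dropped.
            record["response"] = "Error"
            record["details"] = repr(exc)
        else:
            if response is None or not response.strip():
                record["response"] = "No response"
            else:
                record["response"] = response
        records.append(record)
    return records
```

Passing the source text and prompts through unchanged keeps the input identical across models, and storing `config` with every record makes each response traceable to the setup that produced it.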

---

## 📊 Evaluation

For each prompt, record:

1. **Accuracy** - Is the answer factually correct?
2. **Completeness** - Are all key points covered?
3. **Specificity** - Are specific details included (times, names, numbers)?
4. **Reasoning Quality** - For reasoning prompts, is the logic sound and
   well-supported?

Maintain consistent evaluation criteria across all models for fair comparison.
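
One way to keep the criteria consistent is to record them in a fixed structure. The sketch below is only an illustration: the README does not define a rating scale, so the 0 to 2 scale (`MAX_SCORE`), the `Score` class, and the `summarize` helper are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

MAX_SCORE = 2  # assumed scale: 0 = poor, 1 = partial, 2 = good


@dataclass
class Score:
    """Manual ratings for one model response (hypothetical structure)."""

    model: str
    prompt_id: str
    accuracy: int
    completeness: int
    specificity: int
    reasoning_quality: Optional[int] = None  # only for reasoning prompts

    def __post_init__(self) -> None:
        for name in ("accuracy", "completeness", "specificity", "reasoning_quality"):
            value = getattr(self, name)
            if value is not None and not 0 <= value <= MAX_SCORE:
                raise ValueError(f"{name} must be between 0 and {MAX_SCORE}")

    def average(self) -> float:
        """Mean of the criteria that apply to this prompt."""
        values = [self.accuracy, self.completeness, self.specificity]
        if self.reasoning_quality is not None:
            values.append(self.reasoning_quality)
        return mean(values)


def summarize(scores: list[Score]) -> dict[str, float]:
    """Average the per-prompt means for each model, rounded to 2 places."""
    by_model: dict[str, list[float]] = {}
    for score in scores:
        by_model.setdefault(score.model, []).append(score.average())
    return {model: round(mean(values), 2) for model, values in by_model.items()}
```

Leaving `reasoning_quality` unset for summarization and retrieval prompts keeps their averages from being diluted by a criterion that does not apply to them.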

---

## ⚠️ Guidelines

**Critical Rules:**

- **DO NOT modify** the source text
- **DO NOT modify** the prompts
- **DO record** all test configurations (model version, parameters, hardware)
- **DO note** any failures as "No response" or "Error" with details

**Technical Notes:**

- For RAG systems: Load the source text into the database and verify indexing
before testing
- For models with token limits: Chunking may be required
- Environment: Use consistent hardware and settings when possible
- Environmental measurements: Use standardized protocols
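
The RAG note above (load the text, then verify indexing before testing) can be illustrated with a toy example. This sketch uses plain keyword overlap in place of a real vector database or embedding model, purely to show the shape of the check; the function names are hypothetical, and a real system would use its own store and retriever.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_covers_source(chunks: list[str], source_text: str) -> bool:
    """Check that every token in the source appears in at least one chunk.

    A simple stand-in for "verify indexing": it catches chunks that were
    dropped or truncated while loading.
    """
    indexed = set()
    for chunk in chunks:
        indexed.update(tokenize(chunk))
    return set(tokenize(source_text)) <= indexed


def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to `k` chunks ranked by how often query terms occur in them.

    Toy lexical scoring, not an embedding search. Chunks with no
    matching term are excluded; ties keep the original chunk order.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for score, index in scored[:k] if score > 0]
```

Running `index_covers_source` once after loading, and spot-checking a few retrieval prompts with `top_k_chunks` or the system's own retriever, gives a quick sanity check that the whole source text is reachable before any results are recorded.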

---

## 📖 How to Cite This Dataset

When referencing this dataset in reports or publications:

> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article
> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 3.0.
> Available at: <https://en.wikipedia.org/wiki/Apollo_11>

---

*For questions or issues, please contact the project team.
Good luck with testing!* 🚀