MIT-Emerging-Talent · CaesarGhazi · Oct 28, 2025
diff --git a/test_dataset_apollo11/RATIONALE.md b/test_dataset_apollo11/RATIONALE.md
@@ -0,0 +1,189 @@
+# Rationale for Text Selection
+
+## Overview  
+
+Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing”
+and “Lunar surface operations” sections were selected as the unified test
+dataset for the ELO2 - Green AI project.
+
+---
+
+## Why Apollo 11?  
+
+**Universal Knowledge:**
+
+Major commercial models (GPT-4, Claude, Gemini) almost certainly have Apollo 11 in
+their training data, enabling fair comparison with and without RAG.
+
+**Rich Factual Content:**
+
+Dense with verifiable facts ideal for RAG testing: timestamps (20:17:40 UTC),
+numbers (216 lbs fuel, 21.55 kg samples), names (Armstrong, Aldrin, Hamilton),
+and technical terms (LGC, PLSS, EASEP).
+
+**Accessibility:**
+
+Wikipedia content is freely available, properly licensed (CC BY-SA 4.0), and stable
+via permanent links.
+
+**Appropriate Length:**
+
+Complete sections total ~3,800 words; excerpted to ~1,400 words—substantial for
+evaluation yet processable by smaller models on standard hardware. This length
+aligns with standard benchmarks:
+
+- Summarization tasks typically use 500-2,000 words,
+- QA benchmarks 300-1,500 words,
+- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015).
+
+The excerpted length balances comprehensiveness with practical testability.
+
+---
+
+## Why These Excerpted Passages?  
+
+**Continuous Narrative:**
+
+Selected passages flow from descent through surface activities, forming a natural
+story arc ideal for summarization tasks requiring temporal understanding.  
+
+**Balanced Complexity:**  
+
+- Simple facts (times, names, quotes) suitable for smaller and distilled models
+- Complex elements (technical problems, decision-making, procedures) challenging
+  for all models
+
+**Optimal for RAG:**
+
+Dense with retrievable facts across categories: times, quantities, names, equipment, quotes.
+
+**Reasoning Opportunities:**
+
+Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?),
+and analytical reasoning.
+
+**Verified Coverage:**
+
+All 15 test prompts were confirmed answerable from the excerpted passages through
+preliminary testing.
+
+**Length Management:**
+
+Complete sections (~3,800 words) would require extensive chunking for distilled models
+with limited token capacity. Excerpted passages (~1,400 words) are more manageable
+while maintaining comprehensive content for all test scenarios.
+
+---
+
+## Alignment with Project Goals  
+
+**Fair Comparison:**  
+
+- Commercial models tested on likely training data  
+- RAG systems given the same information  
+- All models evaluated on identical input
+
+**Reproducibility:**
+
+Permanent Wikipedia link, documented excerpt selections, license documentation.
+
+**Why Not Other Approaches?**  
+
+- Entire Wikipedia article (all sections)?
+  Too long (~10,000+ words)—exceeds processing capacity of smaller models,
+  impractical for manual verification.  
+- Self-written summary?
+  Custom summaries cannot be reproduced by others and raise objectivity concerns
+  plus potential copyright issues.  
+- Multiple unrelated passages?
+  Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow
+  and prevent reasoning questions that require connected context.
+- Technical manuals or engineering documents?
+  NASA reports are too specialized, likely absent from training data, and limit
+  question diversity to technical retrieval.  
+- Complete sections without excerpting?
+  While more comprehensive, ~3,800 words presents practical challenges for smaller
+  models and extends testing time. Excerpting maintains essential information  
+  while improving testability across architectures.
+
+---
+
+## Excerpt Selection Methodology  
+
+**From “Lunar landing” section:**  
+
+- Descent problems and trajectory issues  
+- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation  
+- Manual landing sequence with fuel concerns  
+- Landing confirmation moment  
+
+**From “Lunar surface operations” section:**  
+
+- EVA preparation and first step  
+- Armstrong’s famous quote and its controversy  
+- Surface activities and movement  
+- Flag planting and Nixon communication  
+- Scientific equipment deployment (EASEP)  
+- Sample collection activities  
+- Return to lunar module
+
+**Omitted content:**  
+
+- Extended technical explanations of radar systems  
+- Detailed crew dialogue transcripts  
+- Some procedural minutiae  
+
+**Selection criteria:**  
+
+- Information density for prompts  
+- Narrative continuity  
+- Factual richness for RAG tasks  
+- Reasoning opportunities  
+
+---
+
+## Limitations  
+
+**Excerpt nature:**
+
+Using selected passages rather than complete sections reduces some contextual richness,
+though all test prompts remain fully answerable.  
+
+**Single domain:**
+
+Results may not generalize beyond this topic.  
+
+- *Acknowledgment:* This is a focused benchmark within defined scope.
+
+---
+
+## Conclusion  
+
+The excerpted passages from the *“Lunar landing”* and *“Lunar surface operations”*
+sections provide:
+
+✅ Practical content for all model types  
+✅ Reproducibility through permanent links and documented selections  
+✅ Balance of factual density and narrative coherence  
+✅ Support for diverse question types  
+✅ Academic integrity through proper licensing and attribution  
+✅ Alignment with Green AI benchmarking objectives  
+
+This selection enables fair, transparent comparison of AI model accuracy and environmental
+efficiency while maintaining practical testability on available hardware.
+
+---
+
+## References  
+
+**Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
+M., & Blunsom, P. (2015).**
+*Teaching Machines to Read and Comprehend.*  
+*Advances in Neural Information Processing Systems, 28.*  
+[arXiv:1506.03340](https://arxiv.org/abs/1506.03340)  
+
+**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).**
+*SQuAD: 100,000+ Questions for Machine Comprehension of Text.*  
+*Proceedings of the 2016 Conference on Empirical Methods in Natural Language
+Processing (EMNLP), pp. 2383–2392.*  
+[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/)  
diff --git a/test_dataset_apollo11/README.md b/test_dataset_apollo11/README.md
@@ -0,0 +1,181 @@
+# 🚀 Apollo 11 Test Dataset
+
+## 🌕 Overview
+
+This is the unified test dataset for comparing different AI models (commercial,
+distilled, SLM, and RAG systems) in the ELO2 - Green AI project.
+
+The dataset consists of selected passages from Wikipedia's Apollo 11 article,
+accompanied by 15 standardized prompts testing summarization, reasoning, and
+retrieval-augmented generation capabilities.
+
+---
+
+## 📂 Dataset Contents
+
+- **[README.md][readme]** - This file (overview and instructions)
+- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text)
+- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+- **[test_data.json][json]** - Complete dataset (structured format for automated
+  testing)
+- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
+
+📌 **Process documentation:** For background on dataset creation decisions and
+team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**
+
+[readme]: /test_dataset_apollo11/README.md
+[source]: /test_dataset_apollo11/source_text.txt
+[prompts]: /test_dataset_apollo11/test_prompts.md
+[json]: /test_dataset_apollo11/test_data.json
+[rationale]: /test_dataset_apollo11/RATIONALE.md
+
+---
+
+## 📄 Source & License
+
+- **Source:** Wikipedia - Apollo 11 article
+- **URL:** <https://en.wikipedia.org/wiki/Apollo_11>
+- **Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
+- **Revision ID:** 1252473845 (Wikipedia internal revision number)
+- **Date Accessed:** October 22, 2025
+- **Sections:** Excerpted passages from "Lunar landing" and "Lunar surface
+  operations"
+- **Word Count:** ~1,400 words
+- **Language:** English
+
+**License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
+
+- ✅ Content can be used freely for research
+- ✅ Wikipedia must be attributed as the source
+- ✅ Derivative works must be shared under the same license
+
+**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 4.0
+
+**Text Structure:** Selected passages from Wikipedia sections.
+
+- Individual sentences are unchanged; some paragraphs are omitted to manage length.
+- Complete original sections total ~3,800 words; excerpted to ~1,400 words for
+  practical testing while maintaining all information necessary for the 15 test prompts.
+
+📌 See [source_text.txt][source] for the complete excerpted text.
+
+---
+
+## 🎯 Selection Rationale
+
+- ✅ **Practical length** - ~1,400 words manageable for all model types including
+  distilled models with standard chunking
+- ✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
+  technical terms)
+- ✅ **Multiple complexity levels** - Both simple recall and complex reasoning can
+  be tested
+- ✅ **Narrative structure** - Clear sequence from descent through surface
+  activities
+- ✅ **All prompts answerable** - 15 test prompts verified to work with selected
+  passages
+
+The excerpts cover the dramatic descent and landing sequence, followed by
+moonwalk activities, ensuring comprehensive testing across summarization,
+reasoning, and RAG tasks.
+
+📌 See [RATIONALE.md][rationale] for detailed selection methodology.
+
+---
+
+## 📝 Test Structure
+
+**15 Standardized Prompts** across three categories:
+
+### Summarization (5 prompts)
+
+Tests the model's ability to condense and extract key information.
+
+- **Difficulty:** Easy → Medium → Hard
+- **Examples:** Main events, challenges faced, activities performed, equipment
+  deployed
+
+### Reasoning (5 prompts)
+
+Tests the model's ability to analyze, infer, and make connections.
+
+- **Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
+  analysis
+- **Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
+  manual control? What does Margaret Hamilton's statement reveal?
+
+### RAG - Retrieval (5 prompts)
+
+Tests the model's ability to retrieve specific information from the source text.
+
+- **Types:** Times, quotes, numbers, lists, complex multi-part facts
+- **Examples:** Landing time? Material collected? Scientific instruments deployed?
+
+📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
+for its structured data version.
+
+---
+
+## 🔧 How to Use
+
+### General Instructions
+
+- **All 15 prompts** should be tested across all models to ensure a fair comparison.
+- Some prompts may be more challenging for smaller models,
+  but attempting all prompts provides comprehensive evaluation data.
+
+**Testing Protocol:**
+
+1. Use the source text from **[source_text.txt][source]** exactly as provided
+2. Use all 15 prompts from **[test_prompts.md][prompts]** without modification
+3. *(Optional)* Use **[test_data.json][json]** for automated or scripted
+   testing workflows
+4. Record responses for each prompt with model configuration details
+5. Note any errors, failures, or unusual behaviors
+
+---
+
+## 📊 Evaluation
+
+For each prompt, record:
+
+1. **Accuracy** - Is the answer factually correct?
+2. **Completeness** - Are all key points covered?
+3. **Specificity** - Are specific details included (times, names, numbers)?
+4. **Reasoning Quality** - For reasoning prompts, is the logic sound and
+   well-supported?
+
+Maintain consistent evaluation criteria across all models for fair comparison.
+
+---
+
+## ⚠️ Guidelines
+
+**Critical Rules:**
+
+- **DO NOT modify** the source text
+- **DO NOT modify** the prompts
+- **DO record** all test configurations (model version, parameters, hardware)
+- **DO note** any failures as "No response" or "Error" with details
+
+**Technical Notes:**
+
+- For RAG systems: Load the source text into the database and verify indexing
+  before testing
+- For models with token limits: Chunking may be required
+- Environment: Use consistent hardware and settings when possible
+- Environmental measurements: Use standardized protocols
+
+---
+
+## 📖 How to Cite This Dataset
+
+When referencing this dataset in reports or publications:
+
+> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article
+> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 4.0.
+> Available at: <https://en.wikipedia.org/wiki/Apollo_11>
+
+---
+
+*For questions or issues, please contact the project team.  
+Good luck with testing!* 🚀
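
Note on the README's Technical Notes ("For models with token limits: Chunking may be required"): the README does not prescribe a chunking method, so the following is only a minimal sketch of one possible approach, not the project's procedure. It splits the source text into word-based chunks with a small overlap so that facts spanning a chunk boundary are not lost. The function name `chunk_words` and the defaults of 300 words per chunk and 30 words of overlap are hypothetical, and word counts are used as a rough stand-in for model tokens.

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with their neighbor.

    Assumption: words approximate tokens; the real limit depends on each model's tokenizer.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Hypothetical usage with the dataset's plain-text file:
#   with open("test_dataset_apollo11/source_text.txt", encoding="utf-8") as f:
#       chunks = chunk_words(f.read())
```

Whatever chunk size is chosen, recording it alongside the model version and hardware keeps runs comparable, in line with the README rule to record all test configurations.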