diff --git a/test_dataset_apollo11/RATIONALE.md b/test_dataset_apollo11/RATIONALE.md new file mode 100644 index 0000000..a89f132 --- /dev/null +++ b/test_dataset_apollo11/RATIONALE.md @@ -0,0 +1,189 @@ +# Rationale for Text Selection + +## Overview + +Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing” +and “Lunar surface operations” sections were selected as the unified test +dataset for the ELO2 - Green AI project. + +--- + +## Why Apollo 11? + +**Universal Knowledge:** + +All major commercial models (GPT-4, Claude, Gemini) have Apollo 11 in their +training data, enabling fair comparison with and without RAG. + +**Rich Factual Content:** + +Dense with verifiable facts ideal for RAG testing: timestamps (20:17:40 UTC), +numbers (216 lbs fuel, 21.55 kg samples), names (Armstrong, Aldrin, Hamilton), +and technical terms (LGC, PLSS, EASEP). + +**Accessibility:** + +Wikipedia content is freely available, properly licensed (CC BY-SA 3.0), and stable +via permanent links. + +**Appropriate Length:** + +Complete sections total ~3,800 words; excerpted to ~1,400 words, substantial for +evaluation yet processable by smaller models on standard hardware. This length +aligns with standard benchmarks: + +- Summarization tasks typically use 500-2,000 words, +- QA benchmarks 300-1,500 words, +- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015). + +The excerpted length balances comprehensiveness with practical testability. + +--- + +## Why These Excerpted Passages? + +**Continuous Narrative:** + +Selected passages flow from descent through surface activities, forming a natural +story arc ideal for summarization tasks requiring temporal understanding. 
+ +**Balanced Complexity:** + +- Simple facts (times, names, quotes) suitable for smaller and distilled models +- Complex elements (technical problems, decision-making, procedures) challenging + for all models + +**Optimal for RAG:** + +Dense with retrievable facts across categories: times, quantities, names, equipment, quotes. + +**Reasoning Opportunities:** + +Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?), +and analytical reasoning. + +**Verified Coverage:** + +All 15 test prompts confirmed answerable with excerpted passages through +preliminary testing. + +**Length Management:** + +Complete sections (~3,800 words) would require extensive chunking for distilled models +with limited token capacity. Excerpted passages (~1,400 words) are more manageable +while maintaining comprehensive content for all test scenarios. + +--- + +## Alignment with Project Goals + +**Fair Comparison:** + +- Commercial models tested on likely training data +- RAG systems given the same information +- All models evaluated on identical input + +**Reproducibility:** + +Permanent Wikipedia link, documented excerpt selections, license documentation. + +**Why Not Other Approaches?** + +- Entire Wikipedia article (all sections)? + Too long (~10,000+ words): exceeds processing capacity of smaller models, + impractical for manual verification. +- Self-written summary? + Custom summaries cannot be reproduced by others and raise objectivity concerns + plus potential copyright issues. +- Multiple unrelated passages? + Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow, + prevent reasoning questions requiring connected context. +- Technical manuals or engineering documents? + NASA reports are too specialized, likely absent from training data, and limit + question diversity to technical retrieval. +- Complete sections without excerpting? 
+ While more comprehensive, ~3,800 words presents practical challenges for smaller + models and extends testing time. Excerpting maintains essential information + while improving testability across architectures. + +--- + +## Excerpt Selection Methodology + +**From “Lunar landing” section:** + +- Descent problems and trajectory issues +- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation +- Manual landing sequence with fuel concerns +- Landing confirmation moment + +**From “Lunar surface operations” section:** + +- EVA preparation and first step +- Armstrong’s famous quote and its controversy +- Surface activities and movement +- Flag planting and Nixon communication +- Scientific equipment deployment (EASEP) +- Sample collection activities +- Return to lunar module + +**Omitted content:** + +- Extended technical explanations of radar systems +- Detailed crew dialogue transcripts +- Some procedural minutiae + +**Selection criteria:** + +- Information density for prompts +- Narrative continuity +- Factual richness for RAG tasks +- Reasoning opportunities + +--- + +## Limitations + +**Excerpt nature:** + +Using selected passages rather than complete sections reduces some contextual richness, +though all test prompts remain fully answerable. + +**Single domain:** + +Results may not generalize beyond this topic. + +- *Acknowledgment:* This is a focused benchmark within defined scope. 
+ +--- + +## Conclusion + +The excerpted passages from *“Lunar landing”* and *“Lunar surface operations”* +sections provide: + +✅ Practical content for all model types +✅ Reproducibility through permanent links and documented selections +✅ Balance of factual density and narrative coherence +✅ Support for diverse question types +✅ Academic integrity through proper licensing and attribution +✅ Alignment with Green AI benchmarking objectives + +This selection enables fair, transparent comparison of AI model accuracy and environmental +efficiency while maintaining practical testability on available hardware. + +--- + +## References + +**Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, +M., & Blunsom, P. (2015).** +*Teaching Machines to Read and Comprehend.* +*Advances in Neural Information Processing Systems, 28.* +[arXiv:1506.03340](https://arxiv.org/abs/1506.03340) + +**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).** +*SQuAD: 100,000+ Questions for Machine Comprehension of Text.* +*Proceedings of the 2016 Conference on Empirical Methods in Natural Language +Processing (EMNLP), pp. 2383–2392.* +[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/) diff --git a/test_dataset_apollo11/README.md b/test_dataset_apollo11/README.md new file mode 100644 index 0000000..01cf60c --- /dev/null +++ b/test_dataset_apollo11/README.md @@ -0,0 +1,181 @@ +# 🚀 Apollo 11 Test Dataset + +## 🌕 Overview + +This is the unified test dataset for comparing different AI models (commercial, +distilled, SLM, and RAG systems) in the ELO2 - Green AI project. + +The dataset consists of selected passages from Wikipedia's Apollo 11 article, +accompanied by 15 standardized prompts testing summarization, reasoning, and +retrieval-augmented generation capabilities. 
+ +--- + +## 📂 Dataset Contents + +- **[README.md][readme]** - This file (overview and instructions) +- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text) +- **[test_prompts.md][prompts]** - 15 test prompts (readable format) +- **[test_data.json][json]** - Complete dataset (structured format for automated + testing) +- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions + +📌 **Process documentation:** For background on dataset creation decisions and +team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**. + +[readme]: /test_dataset_apollo11/README.md +[source]: /test_dataset_apollo11/source_text.txt +[prompts]: /test_dataset_apollo11/test_prompts.md +[json]: /test_dataset_apollo11/test_data.json +[rationale]: /test_dataset_apollo11/RATIONALE.md + +--- + +## 📄 Source & License + +**Source:** Wikipedia - Apollo 11 article +**URL:** https://en.wikipedia.org/wiki/Apollo_11 +**Permanent Link:** https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845 +**Revision ID:** 1252473845 (Wikipedia internal revision number) +**Date Accessed:** October 22, 2025 +**Sections:** Excerpted passages from "Lunar landing" and "Lunar surface +operations" +**Word Count:** ~1,400 words +**Language:** English + +**License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) + +- ✅ Content can be used freely for research +- ✅ Wikipedia must be attributed as the source +- ✅ Derivative works must be shared under the same license + +**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 3.0 + +**Text Structure:** Selected passages from Wikipedia sections. + +- Individual sentences are unchanged; some paragraphs omitted for length management. +- Complete original sections total ~3,800 words; excerpted to ~1,400 words for +practical testing while maintaining all information necessary for the 15 test prompts. + +📌 See [source_text.txt][source] for the complete excerpted text. 
+ +--- + +## 🎯 Selection Rationale + +✅ **Practical length** - ~1,400 words manageable for all model types including +distilled models with standard chunking +✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers, +technical terms) +✅ **Multiple complexity levels** - Both simple recall and complex reasoning can +be tested +✅ **Narrative structure** - Clear sequence from descent through surface +activities +✅ **All prompts answerable** - 15 test prompts verified to work with selected +passages + +The excerpts cover the dramatic descent and landing sequence, followed by +moonwalk activities, ensuring comprehensive testing across summarization, +reasoning, and RAG tasks. + +📌 See [RATIONALE.md][rationale] for detailed selection methodology. + +--- + +## 📝 Test Structure + +**15 Standardized Prompts** across three categories: + +### Summarization (5 prompts) + +Tests model's ability to condense and extract key information + +**Difficulty:** Easy → Medium → Hard +**Examples:** Main events, challenges faced, activities performed, equipment +deployed + +### Reasoning (5 prompts) + +Tests model's ability to analyze, infer, and make connections + +**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep +analysis +**Examples:** Why did computer alarms occur? What if Armstrong hadn't taken +manual control? What does Margaret Hamilton's statement reveal? + +### RAG - Retrieval (5 prompts) + +Tests model's ability to retrieve specific information from source text + +**Types:** Times, quotes, numbers, lists, complex multi-part facts +**Examples:** Landing time? Material collected? Scientific instruments deployed? + +📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json] +for its structured data version. + +--- + +## 🔧 How to Use + +### General Instructions + +- **All 15 prompts** should be tested across all models to ensure a fair comparison. 
+- Some prompts can be more challenging for smaller models, +but attempting all prompts provides comprehensive evaluation data. + +**Testing Protocol:** + +**1.** Use the source text from **[source_text.txt][source]** exactly as provided +**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without modification +**3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted + testing workflows +**4.** Record responses for each prompt with model configuration details +**5.** Note any errors, failures, or unusual behaviors + +--- + +## 📊 Evaluation + +For each prompt, record: + +**1. Accuracy** - Is the answer factually correct? +**2. Completeness** - Are all key points covered? +**3. Specificity** - Are specific details included (times, names, numbers)? +**4. Reasoning Quality** - For reasoning prompts, is the logic sound and + well-supported? + +Maintain consistent evaluation criteria across all models for fair comparison. + +--- + +## ⚠️ Guidelines + +**Critical Rules:** + +- **DO NOT modify** the source text +- **DO NOT modify** the prompts +- **DO record** all test configurations (model version, parameters, hardware) +- **DO note** any failures as "No response" or "Error" with details + +**Technical Notes:** + +- For RAG systems: Load the source text into the database and verify indexing + before testing +- For models with token limits: Chunking may be required +- Environment: Use consistent hardware and settings when possible +- Environmental measurements: Use standardized protocols + +--- + +## 📖 How to Cite This Dataset + +When referencing this dataset in reports or publications: + +> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article +> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 3.0. +> Available at: + +--- + +*For questions or issues, please contact the project team. 
+Good luck with testing!* 🚀 diff --git a/test_dataset_apollo11/source_text.txt b/test_dataset_apollo11/source_text.txt new file mode 100644 index 0000000..51a6874 --- /dev/null +++ b/test_dataset_apollo11/source_text.txt @@ -0,0 +1,127 @@ +Apollo 11 – Lunar Descent and Moonwalk + +As the descent began, Armstrong and Aldrin found themselves passing landmarks on the +surface two or three seconds early, and reported that they were “long”; they would land +miles west of their target point. Eagle was traveling too fast. The problem could have been +mascons—concentrations of high mass in a region or regions of the Moon’s crust that +contains a gravitational anomaly, potentially altering Eagle’s trajectory. + +Five minutes into the descent burn, and 6,000 feet (1,800 m) above the surface of the +Moon, the LM guidance computer (LGC) distracted the crew with the first of several +unexpected 1201 and 1202 program alarms. Inside Mission Control Center, computer +engineer Jack Garman told Guidance Officer Steve Bales it was safe to continue the +descent, and this was relayed to the crew. The program alarms indicated “executive +overflows”, meaning the guidance computer could not complete all its tasks in real-time and +had to postpone some of them. Margaret Hamilton, the Director of Apollo Flight Computer +Programming at the MIT Charles Stark Draper Laboratory later recalled: “To blame the +computer for the Apollo 11 problems is like blaming the person who spots a fire and calls +the fire department. Actually, the computer was programmed to do more than recognize +error conditions. A complete set of recovery programs was incorporated into the software. +The software’s action, in this case, was to eliminate lower priority tasks and re-establish +the more important ones. The computer, rather than almost forcing an abort, prevented an +abort. 
If the computer hadn’t recognized this problem and taken recovery action, I doubt if +Apollo 11 would have been the successful Moon landing it was.” + +When Armstrong again looked outside, he saw that the computer’s landing target was in a +boulder-strewn area just north and east of a 300-foot-diameter (91 m) crater, so he took +semi-automatic control. Throughout the descent, Aldrin called out navigation data to +Armstrong, who was busy piloting Eagle. Now 107 feet (33 m) above the surface, +Armstrong knew their propellant supply was dwindling and was determined to land at the +first possible landing site. + +Armstrong found a clear patch of ground and maneuvered the spacecraft towards it. They +were now 100 feet (30 m) from the surface, with only 90 seconds of propellant remaining. +Lunar dust kicked up by the LM’s engine began to impair his ability to determine the +spacecraft’s motion. + +A light informed Aldrin that at least one of the 67-inch (170 cm) probes hanging from +Eagle’s footpads had touched the surface and he said: “Contact light!” Three seconds later, +Eagle landed and Armstrong shut the engine down. Aldrin immediately said “Okay, engine +stop.” + +Eagle landed at 20:17:40 UTC on Sunday July 20 with 216 pounds (98 kg) of usable fuel +remaining. Information available to the crew and mission controllers during the landing +showed the LM had enough fuel for another 25 seconds of powered flight before an abort +without touchdown would have become unsafe, but post-mission analysis showed that the +real figure was probably closer to 50 seconds. + +Armstrong acknowledged Aldrin’s completion of the post-landing checklist with “Engine +arm is off”, before responding to the CAPCOM, Charles Duke, with the words, “Houston, +Tranquility Base here. The Eagle has landed.” Duke expressed the relief at Mission Control: +“Roger, Twan—Tranquility, we copy you on the ground. You got a bunch of guys about to +turn blue. We’re breathing again. 
Thanks a lot.” + +Preparations for Neil Armstrong and Buzz Aldrin to walk on the Moon began at 23:43 UTC. +These took longer than expected; three and a half hours instead of two. Six hours and +thirty-nine minutes after landing, Armstrong and Aldrin were ready to go outside, and +Eagle was depressurized. + +Eagle’s hatch was opened at 02:39:33. Armstrong initially had some difficulties squeezing +through the hatch with his portable life support system (PLSS). At 02:51 Armstrong began +his descent to the lunar surface. Climbing down the nine-rung ladder, Armstrong pulled a +D-ring to deploy the modular equipment stowage assembly (MESA) folded against Eagle’s +side and activate the TV camera. + +Despite some technical and weather difficulties, black and white images of the first lunar +EVA were received and broadcast to at least 600 million people on Earth. + +After describing the surface dust as “very fine-grained” and “almost like a powder”, at +02:56:15, six and a half hours after landing, Armstrong stepped off Eagle’s landing pad and +declared: “That’s one small step for [a] man, one giant leap for mankind.” + +Armstrong intended to say “That’s one small step for a man”, but the word “a” is not +audible in the transmission, and thus was not initially reported by most observers of the live +broadcast. When later asked about his quote, Armstrong said he believed he said “for a +man”, and subsequent printed versions of the quote included the “a” in square brackets. + +About seven minutes after stepping onto the Moon’s surface, Armstrong collected a +contingency soil sample using a sample bag on a stick. Twelve minutes after the sample +was collected, he removed the TV camera from the MESA and made a panoramic sweep, +then mounted it on a tripod. Aldrin joined Armstrong on the surface. 
He described the view +with the simple phrase: “Magnificent desolation.” + +Armstrong said moving in the lunar gravity, one-sixth of Earth’s, was “even perhaps easier +than the simulations … It’s absolutely no trouble to walk around.” Aldrin joined him on the +surface and tested methods for moving around, including two-footed kangaroo hops. The +PLSS backpack created a tendency to tip backward, but neither astronaut had serious +problems maintaining balance. The fine soil was quite slippery. + +The astronauts planted the Lunar Flag Assembly containing a flag of the United States on +the lunar surface, in clear view of the TV camera. Aldrin remembered, “Of all the jobs I had +to do on the Moon the one I wanted to go the smoothest was the flag raising.” But the +astronauts struggled with the telescoping rod and could only insert the pole about 2 inches +(5 cm) into the hard lunar surface. Before Aldrin could take a photo of Armstrong with the +flag, President Richard Nixon spoke to them through a telephone-radio transmission, which +Nixon called “the most historic phone call ever made from the White House.” + +They deployed the EASEP, which included a Passive Seismic Experiment Package used to +measure moonquakes and a retroreflector array used for the lunar laser ranging +experiment. Then Armstrong walked 196 feet (60 m) from the LM to take photographs at +the rim of Little West Crater while Aldrin collected two core samples. He used the +geologist’s hammer to pound in the tubes—the only time the hammer was used on Apollo +11—but was unable to penetrate more than 6 inches (15 cm) deep. + +The astronauts then collected rock samples using scoops and tongs on extension handles. +Many of the surface activities took longer than expected, so they had to stop documenting +sample collection halfway through the allotted 34 minutes. Aldrin shoveled 6 kilograms +(13 lb) of soil into the box of rocks to pack them in tightly. 
Two types of rocks were found in +the geological samples: basalt and breccia. + +While on the surface, Armstrong uncovered a plaque mounted on the LM ladder, bearing +two drawings of Earth, an inscription, and signatures of the astronauts and President Nixon. +The inscription read: “Here men from the planet Earth first set foot upon the Moon July +1969, A. D. We came in peace for all mankind.” + +Mission Control used a coded phrase to warn Armstrong his metabolic rates were high, and +that he should slow down. As metabolic rates remained generally lower than expected for +both astronauts throughout the walk, Mission Control granted the astronauts a 15-minute +extension. + +Aldrin entered Eagle first. With some difficulty the astronauts lifted film and two sample +boxes containing 21.55 kilograms (47.5 lb) of lunar surface material to the LM hatch using a +flat cable pulley device called the Lunar Equipment Conveyor (LEC). Armstrong then +jumped onto the ladder’s third rung, and climbed into the LM. After transferring to LM life +support, the explorers lightened the ascent stage for the return to lunar orbit by tossing out +their PLSS backpacks, lunar overshoes, an empty Hasselblad camera, and other equipment. +The hatch was closed again at 05:11:13. They then pressurized the LM and settled down to +sleep. 
diff --git a/test_dataset_apollo11/test_data.json b/test_dataset_apollo11/test_data.json new file mode 100644 index 0000000..b4d9139 --- /dev/null +++ b/test_dataset_apollo11/test_data.json @@ -0,0 +1,139 @@ +{ + "metadata": { + "source": "Wikipedia - Apollo 11", + "url": "https://en.wikipedia.org/wiki/Apollo_11", + "permanent_link": "https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845", + "revision_id": "1252473845", + "sections": ["Lunar landing", "Lunar surface operations"], + "date_accessed": "2025-10-22", + "license": "CC BY-SA 3.0", + "note": "Excerpted passages from Wikipedia sections; individual sentences unchanged, some paragraphs omitted for length", + "word_count": "approximately 1,400 words", + "language": "English" + }, + + "source_text": "As the descent began, Armstrong and Aldrin found themselves passing landmarks on the surface two or three seconds early, and reported that they were \"long\"; they would land miles west of their target point. Eagle was traveling too fast. The problem could have been mascons—concentrations of high mass in a region or regions of the Moon's crust that contains a gravitational anomaly, potentially altering Eagle's trajectory.\n\nFive minutes into the descent burn, and 6,000 feet (1,800 m) above the surface of the Moon, the LM guidance computer (LGC) distracted the crew with the first of several unexpected 1201 and 1202 program alarms. Inside Mission Control Center, computer engineer Jack Garman told Guidance Officer Steve Bales it was safe to continue the descent, and this was relayed to the crew. The program alarms indicated \"executive overflows\", meaning the guidance computer could not complete all its tasks in real-time and had to postpone some of them. 
Margaret Hamilton, the Director of Apollo Flight Computer Programming at the MIT Charles Stark Draper Laboratory later recalled: \"To blame the computer for the Apollo 11 problems is like blaming the person who spots a fire and calls the fire department. Actually, the computer was programmed to do more than recognize error conditions. A complete set of recovery programs was incorporated into the software. The software's action, in this case, was to eliminate lower priority tasks and re-establish the more important ones. The computer, rather than almost forcing an abort, prevented an abort. If the computer hadn't recognized this problem and taken recovery action, I doubt if Apollo 11 would have been the successful Moon landing it was.\"\n\nWhen Armstrong again looked outside, he saw that the computer's landing target was in a boulder-strewn area just north and east of a 300-foot-diameter (91 m) crater, so he took semi-automatic control. Throughout the descent, Aldrin called out navigation data to Armstrong, who was busy piloting Eagle. Now 107 feet (33 m) above the surface, Armstrong knew their propellant supply was dwindling and was determined to land at the first possible landing site.\n\nArmstrong found a clear patch of ground and maneuvered the spacecraft towards it. They were now 100 feet (30 m) from the surface, with only 90 seconds of propellant remaining. Lunar dust kicked up by the LM's engine began to impair his ability to determine the spacecraft's motion.\n\nA light informed Aldrin that at least one of the 67-inch (170 cm) probes hanging from Eagle's footpads had touched the surface and he said: \"Contact light!\" Three seconds later, Eagle landed and Armstrong shut the engine down. Aldrin immediately said \"Okay, engine stop.\"\n\nEagle landed at 20:17:40 UTC on Sunday July 20 with 216 pounds (98 kg) of usable fuel remaining. 
Information available to the crew and mission controllers during the landing showed the LM had enough fuel for another 25 seconds of powered flight before an abort without touchdown would have become unsafe, but post-mission analysis showed that the real figure was probably closer to 50 seconds.\n\nArmstrong acknowledged Aldrin's completion of the post-landing checklist with \"Engine arm is off\", before responding to the CAPCOM, Charles Duke, with the words, \"Houston, Tranquility Base here. The Eagle has landed.\" Duke expressed the relief at Mission Control: \"Roger, Twan—Tranquility, we copy you on the ground. You got a bunch of guys about to turn blue. We're breathing again. Thanks a lot.\"\n\nPreparations for Neil Armstrong and Buzz Aldrin to walk on the Moon began at 23:43 UTC. These took longer than expected; three and a half hours instead of two. Six hours and thirty-nine minutes after landing, Armstrong and Aldrin were ready to go outside, and Eagle was depressurized.\n\nEagle's hatch was opened at 02:39:33. Armstrong initially had some difficulties squeezing through the hatch with his portable life support system (PLSS). At 02:51 Armstrong began his descent to the lunar surface. 
Climbing down the nine-rung ladder, Armstrong pulled a D-ring to deploy the modular equipment stowage assembly (MESA) folded against Eagle's side and activate the TV camera.\n\nDespite some technical and weather difficulties, black and white images of the first lunar EVA were received and broadcast to at least 600 million people on Earth.\n\nAfter describing the surface dust as \"very fine-grained\" and \"almost like a powder\", at 02:56:15, six and a half hours after landing, Armstrong stepped off Eagle's landing pad and declared: \"That's one small step for [a] man, one giant leap for mankind.\"\n\nArmstrong intended to say \"That's one small step for a man\", but the word \"a\" is not audible in the transmission, and thus was not initially reported by most observers of the live broadcast. When later asked about his quote, Armstrong said he believed he said \"for a man\", and subsequent printed versions of the quote included the \"a\" in square brackets.\n\nAbout seven minutes after stepping onto the Moon's surface, Armstrong collected a contingency soil sample using a sample bag on a stick. Twelve minutes after the sample was collected, he removed the TV camera from the MESA and made a panoramic sweep, then mounted it on a tripod. Aldrin joined Armstrong on the surface. He described the view with the simple phrase: \"Magnificent desolation.\"\n\nArmstrong said moving in the lunar gravity, one-sixth of Earth's, was \"even perhaps easier than the simulations ... It's absolutely no trouble to walk around.\" Aldrin joined him on the surface and tested methods for moving around, including two-footed kangaroo hops. The PLSS backpack created a tendency to tip backward, but neither astronaut had serious problems maintaining balance. The fine soil was quite slippery.\n\nThe astronauts planted the Lunar Flag Assembly containing a flag of the United States on the lunar surface, in clear view of the TV camera. 
Aldrin remembered, \"Of all the jobs I had to do on the Moon the one I wanted to go the smoothest was the flag raising.\" But the astronauts struggled with the telescoping rod and could only insert the pole about 2 inches (5 cm) into the hard lunar surface. Before Aldrin could take a photo of Armstrong with the flag, President Richard Nixon spoke to them through a telephone-radio transmission, which Nixon called \"the most historic phone call ever made from the White House.\"\n\nThey deployed the EASEP, which included a Passive Seismic Experiment Package used to measure moonquakes and a retroreflector array used for the lunar laser ranging experiment. Then Armstrong walked 196 feet (60 m) from the LM to take photographs at the rim of Little West Crater while Aldrin collected two core samples. He used the geologist's hammer to pound in the tubes—the only time the hammer was used on Apollo 11—but was unable to penetrate more than 6 inches (15 cm) deep.\n\nThe astronauts then collected rock samples using scoops and tongs on extension handles. Many of the surface activities took longer than expected, so they had to stop documenting sample collection halfway through the allotted 34 minutes. Aldrin shoveled 6 kilograms (13 lb) of soil into the box of rocks to pack them in tightly. Two types of rocks were found in the geological samples: basalt and breccia.\n\nWhile on the surface, Armstrong uncovered a plaque mounted on the LM ladder, bearing two drawings of Earth, an inscription, and signatures of the astronauts and President Nixon. The inscription read: \"Here men from the planet Earth first set foot upon the Moon July 1969, A. D. We came in peace for all mankind.\"\n\nMission Control used a coded phrase to warn Armstrong his metabolic rates were high, and that he should slow down. 
As metabolic rates remained generally lower than expected for both astronauts throughout the walk, Mission Control granted the astronauts a 15-minute extension.\n\nAldrin entered Eagle first. With some difficulty the astronauts lifted film and two sample boxes containing 21.55 kilograms (47.5 lb) of lunar surface material to the LM hatch using a flat cable pulley device called the Lunar Equipment Conveyor (LEC). Armstrong then jumped onto the ladder's third rung, and climbed into the LM. After transferring to LM life support, the explorers lightened the ascent stage for the return to lunar orbit by tossing out their PLSS backpacks, lunar overshoes, an empty Hasselblad camera, and other equipment. The hatch was closed again at 05:11:13. They then pressurized the LM and settled down to sleep.", + + "prompts": [ + { + "id": 1, + "category": "summarization", + "difficulty": "easy", + "prompt": "Summarize the main events during the Apollo 11 lunar landing in 3 sentences.", + "type": "general_summary" + }, + { + "id": 2, + "category": "summarization", + "difficulty": "easy", + "prompt": "What were the main challenges Armstrong faced while landing the Eagle?", + "type": "problem_identification" + }, + { + "id": 3, + "category": "summarization", + "difficulty": "medium", + "prompt": "Describe the activities the astronauts performed on the lunar surface.", + "type": "activity_summary" + }, + { + "id": 4, + "category": "summarization", + "difficulty": "medium", + "prompt": "Explain what scientific equipment the astronauts deployed on the Moon.", + "type": "technical_summary" + }, + { + "id": 5, + "category": "summarization", + "difficulty": "hard", + "prompt": "Compare the planned timeline for the lunar surface operations with what actually happened.", + "type": "comparative_summary" + }, + { + "id": 6, + "category": "reasoning", + "difficulty": "easy", + "prompt": "Why did the computer alarms (1201 and 1202) occur during the descent?", + "type": "causal_reasoning" + }, + { 
+ "id": 7, + "category": "reasoning", + "difficulty": "medium", + "prompt": "What would have happened if Armstrong had not taken manual control during the landing?", + "type": "hypothetical_reasoning" + }, + { + "id": 8, + "category": "reasoning", + "difficulty": "medium", + "prompt": "Why did Armstrong's famous quote become controversial?", + "type": "interpretive_reasoning" + }, + { + "id": 9, + "category": "reasoning", + "difficulty": "hard", + "prompt": "Analyze how the fuel situation during landing reflects the risk management challenges of the mission.", + "type": "analytical_reasoning" + }, + { + "id": 10, + "category": "reasoning", + "difficulty": "hard", + "prompt": "Based on the text, what does Margaret Hamilton's statement reveal about the Apollo Guidance Computer's design philosophy?", + "type": "deep_analysis" + }, + { + "id": 11, + "category": "rag", + "difficulty": "easy", + "prompt": "At what time (UTC) did Eagle land on the Moon?", + "type": "factual_retrieval", + "expected_answer": "20:17:40 UTC on July 20, 1969" + }, + { + "id": 12, + "category": "rag", + "difficulty": "easy", + "prompt": "How much lunar material did the astronauts collect?", + "type": "numerical_retrieval", + "expected_answer": "21.55 kilograms (47.5 lb)" + }, + { + "id": 13, + "category": "rag", + "difficulty": "medium", + "prompt": "What were Armstrong's famous first words when stepping on the Moon?", + "type": "quote_retrieval", + "expected_answer": "That's one small step for [a] man, one giant leap for mankind" + }, + { + "id": 14, + "category": "rag", + "difficulty": "medium", + "prompt": "What scientific instruments were included in the EASEP package?", + "type": "list_retrieval", + "expected_answer": "Passive Seismic Experiment Package and retroreflector array" + }, + { + "id": 15, + "category": "rag", + "difficulty": "hard", + "prompt": "How much usable fuel remained when Eagle landed, and how many seconds of powered flight did this represent?", + "type": "complex_retrieval", +
"expected_answer": "216 pounds (98 kg); about 25 seconds according to initial estimates, but post-mission analysis showed closer to 50 seconds" + } + ], + + "evaluation_notes": { + "testing_approach": "All 15 prompts should be tested across all models to ensure a fair comparison.", + "prompt_categories": { + "summarization": "Prompts 1-5 test condensing and extracting key information", + "reasoning": "Prompts 6-10 test analysis, inference, and logical connections", + "rag": "Prompts 11-15 test retrieval accuracy from source text" + }, + "note": "Some prompts may be more challenging for smaller models, but attempting all prompts provides complete evaluation data" + } +} diff --git a/test_dataset_apollo11/test_prompts.md b/test_dataset_apollo11/test_prompts.md new file mode 100644 index 0000000..e4c2a69 --- /dev/null +++ b/test_dataset_apollo11/test_prompts.md @@ -0,0 +1,86 @@ + + + +# 15 Standardized Test Prompts + +### Summarization Tasks (5 prompts) + +#### Prompt 1 (Easy) + +Summarize the main events during the Apollo 11 lunar landing in 3 sentences. + +#### Prompt 2 (Easy) + +What were the main challenges Armstrong faced while landing the Eagle? + +#### Prompt 3 (Medium) + +Describe the activities the astronauts performed on the lunar surface. + +#### Prompt 4 (Medium) + +Explain what scientific equipment the astronauts deployed on the Moon. + +#### Prompt 5 (Hard) + +Compare the planned timeline for the lunar surface operations with what actually +happened. + +### Reasoning Tasks (5 prompts) + +#### Prompt 6 (Easy) + +Why did the computer alarms (1201 and 1202) occur during the descent? + +#### Prompt 7 (Medium) + +What would have happened if Armstrong had not taken manual control during the landing? + +#### Prompt 8 (Medium) + +Why did Armstrong's famous quote become controversial? + +#### Prompt 9 (Hard) + +Analyze how the fuel situation during landing reflects the risk management challenges +of the mission. 
+ +#### Prompt 10 (Hard) + +Based on the text, what does Margaret Hamilton's statement reveal about the Apollo +Guidance Computer's design philosophy? + +### RAG Tasks (5 prompts) + +#### Prompt 11 (Easy) + +At what time (UTC) did Eagle land on the Moon? + +#### Prompt 12 (Easy) + +How much lunar material did the astronauts collect? + +#### Prompt 13 (Medium) + +What were Armstrong's famous first words when stepping on the Moon? + +#### Prompt 14 (Medium) + +What scientific instruments were included in the EASEP package? + +#### Prompt 15 (Hard) + +How much usable fuel remained when Eagle landed, and how many seconds of powered +flight did this represent? + +--- + +### Expected Answers for RAG Tasks + +**Prompt 11:** 20:17:40 UTC on July 20, 1969 +**Prompt 12:** 21.55 kilograms (47.5 lb) +**Prompt 13:** "That's one small step for [a] man, one giant leap for mankind" +**Prompt 14:** Passive Seismic Experiment Package and retroreflector array +(for lunar laser ranging experiment) +**Prompt 15:** 216 pounds (98 kg); estimated 25 seconds according to initial data, +but post-mission analysis showed closer to 50 seconds
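The expected answers for the RAG tasks lend themselves to a simple automated check. The following is a minimal sketch, not part of the dataset: the names `KEY_FACTS`, `normalize`, and `score_rag_answer` are hypothetical, and the scoring rule (fraction of key facts found as case-insensitive substrings) is an assumption chosen for illustration rather than the project's actual evaluation method. The key facts themselves are taken from the expected answers listed for Prompts 11-15.

```python
import re

# Key facts per RAG prompt, copied from the expected answers.
# Which fragments count as "key" is an illustrative assumption.
KEY_FACTS = {
    11: ["20:17:40"],
    12: ["21.55"],
    13: ["one small step", "giant leap"],
    14: ["seismic", "retroreflector"],
    15: ["216", "25 seconds", "50 seconds"],
}


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def score_rag_answer(prompt_id: int, answer: str) -> float:
    """Return the fraction of key facts that appear in the model's answer."""
    facts = KEY_FACTS[prompt_id]
    normalized = normalize(answer)
    hits = sum(1 for fact in facts if fact.lower() in normalized)
    return hits / len(facts)


if __name__ == "__main__":
    # A complete answer to Prompt 11 contains the single key fact.
    print(score_rag_answer(11, "Eagle landed at 20:17:40 UTC on July 20, 1969."))
    # A partial answer to Prompt 14 names only one of the two instruments.
    print(score_rag_answer(14, "The Passive Seismic Experiment Package"))
```

Substring matching is deliberately lenient: it gives credit for the fact regardless of surrounding wording, but it cannot detect an answer that states the right number in the wrong context, so manual review of borderline cases would still be needed.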