Comparing Commercial and Open-Source Language Models for Sustainable AI
This repository presents the ELO2 – GREEN AI Project, developed within the MIT Emerging Talent – AI & ML Program (2025). The work investigates the technical performance, sustainability traits, and human-perceived quality of open-source language models compared to commercial systems.
The guiding research question: To what extent can open-source LLMs provide competitive output quality while operating at significantly lower environmental cost?
Large commercial LLMs deliver strong performance but demand substantial compute and energy. This project examines whether small, accessible, and environmentally efficient open-source models, especially when enhanced with retrieval and refinement pipelines, can offer practical alternatives for everyday tasks.
The study evaluates several open-source model groups:
- Quantized Model: Mistral-7B (GGUF)
- Distilled Model: LaMini-Flan-T5-248M
- Small Models: Qwen, Gemma
- Enhanced Pipelines (applied to all model families):
  - RAG (Retrieval-Augmented Generation)
  - Recursive Editing, which includes AI-based critique and iterative refinement
These configurations serve as the optimized open-source setups used in the comparison against commercial models.
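As an illustration of the quantized setup, a GGUF build of Mistral-7B can be run locally with llama-cpp-python; the file path, context size, and sampling parameters below are assumptions, not the project's exact configuration.

```python
# Minimal sketch: local inference with a quantized Mistral-7B GGUF file.
# The model path and parameters are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,    # context window size
    n_threads=4,   # CPU threads; tune to the host machine
)

out = llm("Summarize this transcript excerpt: ...", max_tokens=256, temperature=0.2)
print(out["choices"][0]["text"])
```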
Evaluation tasks include:
- summarization
- factual reasoning
- paraphrasing
- short creative writing
- instruction following
- question answering
A targeted excerpt from the Apollo-11 mission transcripts served as the central reference text for all evaluation tasks. All prompts were constructed directly from this shared material. Using a single, consistent source ensured that every model was tested under identical informational conditions, allowing clear and fair comparison of output quality and relevance.
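For illustration, task prompts could be templated from the shared excerpt along these lines; the variable names and wording below are hypothetical, not the project's exact prompts.

```python
# Hypothetical prompt templates built from the shared Apollo 11 excerpt.
APOLLO_EXCERPT = "..."  # the transcript excerpt used across all tasks

TASK_PROMPTS = {
    "summarization": f"Summarize the following excerpt:\n{APOLLO_EXCERPT}",
    "paraphrasing": f"Paraphrase this passage in plain language:\n{APOLLO_EXCERPT}",
    "question_answering": f"Using only the excerpt below, answer the question.\n{APOLLO_EXCERPT}",
}
```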
Retrieval-Augmented Generation (RAG) was applied to multiple model families. The pipeline includes:
- document indexing
- dense similarity retrieval
- context injection through prompt augmentation
- answer synthesis using guidance prompts
RAG improved factual grounding in nearly all models.
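A minimal sketch of this pipeline, assuming sentence-transformers embeddings and plain cosine similarity (the chunking, embedding model, and helper names are illustrative, not the project's exact stack):

```python
# Illustrative RAG retrieval step; chunking and model choice are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Document indexing: embed transcript chunks once, up front.
chunks = ["...transcript chunk 1...", "...transcript chunk 2..."]
index = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """2. Dense similarity retrieval: cosine similarity via dot product."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    return [chunks[i] for i in np.argsort(index @ q)[::-1][:k]]

def build_prompt(query: str) -> str:
    """3. Context injection through prompt augmentation."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 4. Answer synthesis: pass build_prompt(query) to any model configuration above.
```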
A lightweight iterative refinement procedure was implemented:
1. Draft Generation: the primary model produces an initial output.
2. AI-Based Critique: a secondary SLM evaluates clarity, accuracy, faithfulness, and relevance.
3. Refinement Step: a revision prompt integrates the critique and generates an improved text.
4. Stopping Condition: the cycle ends after a fixed number of iterations or when the critique stabilizes.
This approach allowed weaker SLMs to yield higher-quality results without relying on large models.
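In code terms, the loop can be sketched as follows; generate() and critique() stand in for calls to the primary model and the secondary SLM, and the stopping heuristic shown is an illustrative assumption.

```python
# Minimal sketch of the critique-and-refine loop. The callables generate()
# and critique() are placeholders for the primary model and the secondary SLM.
def recursive_edit(prompt: str, generate, critique, max_iters: int = 3) -> str:
    draft = generate(prompt)                    # 1. draft generation
    previous = None
    for _ in range(max_iters):
        feedback = critique(draft)              # 2. AI-based critique
        if feedback == previous:                # 4. stop when critique stabilizes
            break
        draft = generate(                       # 3. refinement step
            f"Revise the text to address this critique.\n"
            f"Critique: {feedback}\n\nText:\n{draft}"
        )
        previous = feedback
    return draft
```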
Environmental footprint data was captured with CodeCarbon, recording:
- CPU/GPU energy usage
- Carbon emissions
- PUE-adjusted overhead (power usage effectiveness)
These measurements enabled comparison with published metrics for commercial LLMs.
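A minimal sketch of how such tracking wraps an inference pass; the project name and the placeholder workload are assumptions.

```python
# Illustrative emissions tracking with CodeCarbon around one evaluation run.
from codecarbon import EmissionsTracker

def run_evaluation() -> None:
    """Placeholder for one model's inference pass over all evaluation tasks."""
    pass

tracker = EmissionsTracker(project_name="green-ai-eval")  # writes emissions.csv
tracker.start()
try:
    run_evaluation()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2-eq

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```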
A structured Google Form experiment collected:
- source identification (commercial vs. open-source)
- quality ratings on accuracy, faithfulness, relevance, and clarity (1–5 scale)
Outputs were randomized and anonymized to avoid bias. This provided a perception-based counterpart to technical evaluation.
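The shuffling and relabeling step can be sketched as follows; the data structure and ID scheme are assumptions.

```python
# Illustrative randomization/anonymization of outputs before the survey.
import random

outputs = [
    {"model": "mistral-7b", "text": "..."},      # open-source output
    {"model": "commercial-llm", "text": "..."},  # commercial output
]

random.shuffle(outputs)  # randomize presentation order
answer_key = {f"sample_{i + 1}": o["model"] for i, o in enumerate(outputs)}  # kept private
survey_items = [{"id": f"sample_{i + 1}", "text": o["text"]}  # model identity stripped
                for i, o in enumerate(outputs)]
```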
Through our work and experiments, we raised an important question and demonstrated that open-source models, when carefully optimized, hold significant untapped potential: with the right adjustments, they can meaningfully compete with commercial LLMs.

The survey results show a promising trend: nearly 40% of respondents felt that open-source models are either fully comparable to or only slightly behind commercial systems, while 45.2% indicated that performance differences depend on the specific task, suggesting that open-source models can match commercial quality in many real-world scenarios. Only a small minority felt that commercial models were clearly superior. These findings reinforce our conclusion that, with the right optimizations and configurations, open-source models can compete meaningfully with commercial AI systems.
The research findings will be shared through formats designed for different audiences and purposes:
A comprehensive research article will document the complete experimental design, statistical analysis, and implications.
View Article
An executive presentation provides a visual overview of the research question, methodology, and key findings without requiring deep technical background.
View Presentation
A public evaluation study invites participation in assessing AI-generated texts. This crowdsourced data forms a critical component of the research.
The Study
Future work includes:
- Evaluate additional open-source model families across diverse tasks
- Test optimized pipelines in specialized domains (medical, legal, technical writing)
- Track carbon footprint across full lifecycle (training to deployment)
- Conduct ablation studies isolating RAG vs. recursive editing contributions
Special thanks to the MIT Emerging Talent Program for their guidance and feedback throughout the project.


