199 changes: 199 additions & 0 deletions wikipedia-analysis-#229lixin/README.md
@@ -0,0 +1,199 @@
# Wikipedia Language Equity Analysis

![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)
![Status](https://img.shields.io/badge/status-complete-green.svg)

**Quantifying the "Information Gap": How long does critical information take to reach different language communities?**

---

## Project Overview

When a pandemic or disaster strikes, English Wikipedia is updated within hours. But what about the billions of people who don't read English?

This project analyzes **56 time-sensitive topics** (Public Health, Climate Disasters, Human Rights) across **40 Wikipedia language editions** to quantify systematic information inequality.

### Key Findings

* **23-Point Coverage Gap**: Low-resource languages cover 23.2 percentage points fewer critical topics (54.1% vs 77.3%)
* **3.2x Update Lag**: Information in low-resource languages lags about 3.2x longer behind English (499 vs 156 days on average)
* **97.5% Missing Rate**: Essential health topics (Ebola, Monkeypox) are missing in 39 of 40 languages
* **Real Impact**: The COVID-19 page in Hausa (80M speakers) was created **4.7 years** after the English page

**Bottom Line**: Billions of people systematically receive outdated or missing life-saving information.

---

## How It Works

### Data Pipeline

1. **Data Collection** (approximately 2 hours)
   - Queries the MediaWiki API for 56 topics × 40 languages = 2,240 data points
   - Fetches creation timestamps and latest edit timestamps
   - Implements caching to reduce API load on subsequent runs

2. **Metric Calculation**
   - **Coverage**: Does the page exist? (Yes/No)
   - **Time-to-Translation**: Days between English creation and target language creation
   - **Update Lag**: Days between English latest edit and target language latest edit

3. **Visualization**
   - 6 interactive charts: heatmaps, coverage charts, distribution plots
   - Reveals patterns of information inequality
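
The collection and coverage steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `revision_query_params` and `extract_timestamp` are hypothetical, and the parameters assume the standard MediaWiki `action=query` / `prop=revisions` interface with `formatversion=2`, where `pages` is returned as a list and a non-existent page carries a `missing` flag.

```python
from datetime import datetime, timezone
from typing import Optional


def revision_query_params(title: str, oldest: bool) -> dict:
    """Build query parameters for a single revision timestamp of a page.

    oldest=True asks for the first revision (page creation);
    oldest=False asks for the most recent revision (latest edit).
    """
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": 1,
        "rvdir": "newer" if oldest else "older",
        "format": "json",
        "formatversion": 2,
    }


def extract_timestamp(response: dict) -> Optional[datetime]:
    """Return the revision timestamp, or None when the page does not exist."""
    pages = response.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return None  # Coverage metric: "No"
    revisions = pages[0].get("revisions", [])
    if not revisions:
        return None
    parsed = datetime.strptime(revisions[0]["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Example response in the assumed formatversion=2 shape.
sample = {
    "query": {
        "pages": [
            {
                "pageid": 2075,
                "ns": 0,
                "title": "Guerra",
                "revisions": [{"timestamp": "2026-01-21T10:43:05Z"}],
            }
        ]
    }
}
latest_edit = extract_timestamp(sample)
page_exists = latest_edit is not None
```

The parameters would typically be sent to a language edition's `/w/api.php` endpoint, for example with `requests.get(url, params=...)`. How the project maps an English topic to its title in each target language is not shown here.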

### Tech Stack
* **Python 3.12** (3.8+ supported, per Prerequisites)
* **Pandas** - Data transformation and analysis
* **Requests** - API client with retry logic and caching
* **Plotly** - Interactive HTML visualizations
* **PyArrow** - Efficient Parquet storage format

---

## Getting Started

### Prerequisites

Python 3.8+ installed on your system.

### Installation

```bash
# Install required libraries
pip install requests pandas plotly openpyxl pyarrow
```

### Usage

**Step 1: Run the analysis**
```bash
python wikipedia_analyzer_corrected.py
```
This collects data from Wikipedia (takes approximately 2 hours, uses cache on subsequent runs).

**Step 2: Generate visualizations**
```bash
python visualize_results.py data/language_equity_analysis_v2.csv
```
Creates 6 interactive HTML charts in the `visualizations/` folder.

**Step 3: Explore the data**
Open the generated files:
- `data/language_equity_analysis_v2.xlsx` - Excel spreadsheet
- `visualizations/summary_dashboard.html` - Interactive dashboard

---

## Data Scale

- **56 time-sensitive topics** across 7 categories (Public Health, Climate, Human Rights, etc.)
- **40 languages**: 30 major languages + 10 low-resource languages
- **2,240 data points** analyzed
- **626 missing pages** (27.9% of all topic-language pairs)
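
As a quick sanity check, the figures in this list can be reproduced with a few lines of arithmetic. The variable names are illustrative only.

```python
TOPICS = 56
MAJOR_LANGUAGES = 30
LOW_RESOURCE_LANGUAGES = 10
MISSING_PAGES = 626

languages = MAJOR_LANGUAGES + LOW_RESOURCE_LANGUAGES
data_points = TOPICS * languages              # one data point per topic-language pair
missing_share = MISSING_PAGES / data_points   # fraction of pairs with no page

print(f"{languages} languages, {data_points} data points, {missing_share:.1%} missing")
```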

---

## Example Outputs

### Summary Dashboard
Interactive 4-panel dashboard showing:
- Average update lag by language
- Coverage rates (percentage of topics available)
- Distribution of lags (box plots)
- Correlation between translation delay and maintenance

### Update Lag Heatmap
Color-coded matrix (languages × topics):
- Green = Current information
- Yellow = Moderate lag (months)
- Red = Severely outdated (years)
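
The legend above could be implemented with a simple bucketing function. The README does not state the exact cut-offs, so the 30-day and 365-day thresholds below are assumptions chosen only to match "months" and "years"; `lag_color` is a hypothetical name.

```python
def lag_color(lag_days: float) -> str:
    """Map an update lag in days to a heatmap color bucket.

    Thresholds are assumed for illustration, not taken from the project.
    """
    if lag_days < 30:      # under a month: treated as current
        return "green"
    if lag_days < 365:     # months of lag
        return "yellow"
    return "red"           # a year or more


# The two group averages reported in this README fall into different buckets.
major_bucket = lag_color(156)
low_resource_bucket = lag_color(499)
```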

### Coverage Chart
Shows which languages are missing which critical topics.

---

## Key Insights

### Major vs Low-Resource Languages

| Metric | Major Languages | Low-Resource Languages |
|--------|-----------------|------------------------|
| Coverage | 77.3% | 54.1% (-23.2 pp) |
| Avg Update Lag | 156 days | 499 days (3.2x) |
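
The two comparisons in this table use different kinds of arithmetic: the coverage gap is a difference in percentage points, while the lag figure is a ratio. A short sketch using only the numbers reported above (variable names are illustrative):

```python
major_coverage = 77.3     # percent of topics covered, major languages
low_coverage = 54.1       # percent of topics covered, low-resource languages
major_lag_days = 156
low_lag_days = 499

gap_points = major_coverage - low_coverage    # absolute gap, in percentage points
relative_gap = gap_points / major_coverage    # gap relative to major-language coverage
lag_ratio = low_lag_days / major_lag_days     # how many times longer the lag is

print(f"gap: {gap_points:.1f} pp ({relative_gap:.0%} relative), lag ratio: {lag_ratio:.1f}x")
```

In relative terms, low-resource languages cover roughly 30% fewer topics than major languages, which is a larger figure than the 23.2-point absolute gap.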

### Case Study: COVID-19

| Language | Speakers | Page Created | Info Outdated By |
|----------|----------|--------------|------------------|
| English | Baseline | Jan 5, 2020 | Current |
| German | 90M | Jan 25, 2020 (+20d) | Current |
| Bengali | 265M | Jan 24, 2020 (+19d) | 1 year |
| Hausa | 80M | Oct 2, 2024 (+1731d) | **4.7 years late** |
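
The day offsets in this table are time-to-translation values. A minimal sketch of both timestamp metrics follows, assuming whole-day differences and assuming update lag is measured as the English latest edit minus the target latest edit; the function names are hypothetical. The table gives only dates, so the example uses midnight UTC.

```python
from datetime import datetime, timezone


def time_to_translation(english_created: datetime, target_created: datetime) -> int:
    """Whole days from English page creation to target-language page creation."""
    return (target_created - english_created).days


def update_lag(english_latest: datetime, target_latest: datetime) -> int:
    """Whole days by which the target's latest edit trails the English latest edit."""
    return (english_latest - target_latest).days


def utc(year: int, month: int, day: int) -> datetime:
    """Helper: midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc)


english_created = utc(2020, 1, 5)
german_days = time_to_translation(english_created, utc(2020, 1, 25))
bengali_days = time_to_translation(english_created, utc(2020, 1, 24))

# Converting the reported Hausa delay of 1731 days into years.
hausa_years = 1731 / 365.25
```

Date-only arithmetic from Jan 5, 2020 to Oct 2, 2024 gives 1732 days. The reported 1731 is consistent with computing on full timestamps and truncating to whole days.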

---

## Repository Structure

```
wikipedia-language-equity/
├── wikipedia_analyzer_corrected.py # Main analysis script
├── visualize_results.py # Visualization generator
├── target_languages_40.txt # 40 target languages
├── topics_critical_50.txt # 56 critical topics
├── data/
│ ├── language_equity_analysis_v2.csv
│ ├── language_equity_analysis_v2.xlsx
│ └── language_equity_analysis_v2.parquet
├── visualizations/ # 6 interactive charts
│ └── summary_dashboard.html
└── cache/ # API response cache
```

---

## Challenges & Limitations

### Challenges
* **API Rate Limiting**: Requires polite delays (0.5-1s) between requests
* **Runtime**: Complete analysis takes approximately 2 hours (caching speeds up subsequent runs)
* **Data Quality**: Some wikis have inconsistent data or are less maintained
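
The rate-limiting and caching behavior described above might look something like the sketch below. It is an assumption-laden illustration rather than the project's code: `cached_fetch`, the hashed file naming, and the injectable `sleep` argument are all hypothetical. Only the `cached_at` / `response` envelope mirrors the cached files in this repository.

```python
import hashlib
import json
import time
from datetime import datetime
from pathlib import Path
from typing import Callable


def cached_fetch(
    key: str,
    fetch: Callable[[], dict],
    cache_dir: Path,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Return the cached response for `key`, fetching politely on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary titles become safe file names (assumed scheme).
    path = cache_dir / (hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json")

    if path.exists():
        # Cache hit: no network call and no delay.
        return json.loads(path.read_text(encoding="utf-8"))["response"]

    sleep(delay_seconds)  # polite pause before hitting the API
    response = fetch()
    envelope = {"cached_at": datetime.now().isoformat(), "response": response}
    path.write_text(json.dumps(envelope, ensure_ascii=False), encoding="utf-8")
    return response
```

Passing the network call in as `fetch` keeps the caching logic testable without touching the API. On a re-run, every cached key skips both the request and the delay, which is why subsequent runs are much faster.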

### Limitations
* **Timestamp ≠ Quality**: Latest edit time is a proxy; doesn't measure content accuracy
* **Coverage ≠ Completeness**: Page existence doesn't mean information is adequate
* **English Baseline**: Assumes English Wikipedia is the "gold standard" (Western bias)

---

## Policy Recommendations

Based on the findings:

1. **Emergency Response Protocol**: Critical health topics should be translated into all 40 languages within 7 days
2. **Support Low-Resource Languages**: Provide funding and tools to the Burmese, Hausa, and Nepali communities
3. **Automated Monitoring**: Build a real-time dashboard that tracks update lags
4. **Governance Reform**: Establish a "Language Equity Committee" with binding standards

---

## License

This project is open source under the MIT License.
Wikipedia content is licensed under CC BY-SA 4.0.

---

## Acknowledgments

* Wikipedia community for maintaining multilingual knowledge
* Wikimedia Foundation for API access
* Open source libraries: pandas, plotly, requests

---

**Project Status**: Complete
**Data Collection Date**: January 26, 2026
**Total Analysis Time**: Approximately 2 hours
**Recommended Re-run**: Quarterly for updated metrics
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:47:39.679077",
"response": {
"continue": {
"rvcontinue": "20260122084518|28730331",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 19071,
"ns": 0,
"title": "Hajléktalanság",
"revisions": [
{
"timestamp": "2026-01-23T11:05:59Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:18:23.480574",
"response": {
"continue": {
"rvcontinue": "20260101230449|36594158",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 896,
"ns": 0,
"title": "Deprem",
"revisions": [
{
"timestamp": "2026-01-16T17:23:20Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:07:49.929785",
"response": {
"continue": {
"rvcontinue": "20200324025327|774938",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 111337,
"ns": 0,
"title": "विश्वव्यापी महामारी",
"revisions": [
{
"timestamp": "2020-03-24T02:49:04Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:23:55.405311",
"response": {
"continue": {
"rvcontinue": "20250708005515|35608093",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 3466229,
"ns": 0,
"title": "Ülke içinde yerinden edilmiş kişi",
"revisions": [
{
"timestamp": "2025-11-13T17:33:37Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:35:46.528987",
"response": {
"continue": {
"rvcontinue": "20260121103800|149019218",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 2075,
"ns": 0,
"title": "Guerra",
"revisions": [
{
"timestamp": "2026-01-21T10:43:05Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:30:38.319014",
"response": {
"continue": {
"rvcontinue": "20251223075247|70383776",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 144665,
"ns": 0,
"title": "Mensenhandel",
"revisions": [
{
"timestamp": "2025-12-23T07:53:40Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:31:07.087962",
"response": {
"continue": {
"rvcontinue": "20241028140059|24348507",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 28616,
"ns": 0,
"title": "Mučení",
"revisions": [
{
"timestamp": "2025-09-04T06:37:18Z"
}
]
}
]
}
}
}
@@ -0,0 +1,23 @@
{
"cached_at": "2026-01-26T00:29:42.324760",
"response": {
"continue": {
"rvcontinue": "20060116063141|690330",
"continue": "||"
},
"query": {
"pages": [
{
"pageid": 133978,
"ns": 0,
"title": "Цензура",
"revisions": [
{
"timestamp": "2006-01-16T06:28:31Z"
}
]
}
]
}
}
}