CSV Migration Tools automatically generates CSV data migration schemas. It analyzes source and target CSV formats, then uses AI to intelligently map columns and values between different data structures.
Data migration projects often involve transforming CSV files from one format to another. A typical scenario involves a CSV file with hundreds or thousands of rows that needs to be converted to match a completely different structure. Column names don't match, boolean values use different representations (Y/N vs true/false), categorical codes require mapping (ELEC → Electronics), and complex grouped values present additional challenges (R+W+D → view, edit, delete).
This problem is particularly common for businesses using Point of Sale (POS) or inventory management applications. When switching from one application to another, the exported data from the current system rarely matches the import format required by the new system.
For example, a retail store migrating from an older POS system to a modern cloud-based solution might export product data where status values are coded as "A" (active), "D" (discontinued), or "L" (low stock). The new system, however, expects full text values like "in_stock", "discontinued", and "low_stock". Similarly, the old system might use category codes like "ELEC", "FURN", or "FOOD", while the new system requires complete category names like "Electronics", "Furniture", or "Food & Beverage".
The challenge multiplies when dealing with hundreds or thousands of product records, multiple enum fields requiring transformation, and complex grouped values such as permissions or access rights. Manual conversion becomes impractical, error-prone, and time-consuming.
The traditional approach requires manually comparing both CSVs, creating a mapping document, writing transformation code, testing, debugging, and repeating the process. For a 50-column CSV with dozens of enum mappings, this process can take hours or even days.
This project was created to address these challenges through a structured, automated approach, specifically targeting scenarios where businesses need to migrate data between different software systems.
The initial solution involved creating a structured JSON schema format that explicitly defines how to map data between formats. This approach establishes a clear contract for data transformation.
Example Target CSV:
| name | is_active | category | permissions |
|---|---|---|---|
| User 1 | true | Electronics | view |
| User 2 | false | Furniture | view, edit |
| User 3 | true | Electronics | view, edit, delete |
The target schema defines the destination structure:
[
  {
    "column": "name",
    "values": []
  },
  {
    "column": "is_active",
    "values": ["true", "false"]
  },
  {
    "column": "category",
    "values": ["Electronics", "Furniture"]
  },
  {
    "column": "permissions",
    "values": ["view", "view, edit", "view, edit, delete"]
  }
]

- `column` - The CSV column name (from CSV headers)
- `values` - Array of all possible categorical values for this column. Populated only if:
  - The column contains a limited, fixed set of values (enum, category, status, boolean)
  - The column has no relational dependency on other columns (not a foreign key or related name field)
  - Set to empty array `[]` if the column contains dynamic values (IDs, names, numbers, dates, free text)
Example Source CSV:
| username | active_status | category | access_rights |
|---|---|---|---|
| Nao Toyama | Y | ELEC | R+W+D |
| Aoki Hina | Y | ELEC | R+W |
| May'n | N | FURN | R |
The source schema maps source columns to the target structure, including value transformations:
[
  {
    "column": "username",
    "target_column": "name",
    "values": []
  },
  {
    "column": "active_status",
    "target_column": "is_active",
    "values": ["Y", "N"],
    "values_mapping": {
      "Y": "true",
      "N": "false"
    }
  },
  {
    "column": "category",
    "target_column": "category",
    "values": ["ELEC", "FURN"],
    "values_mapping": {
      "ELEC": "Electronics",
      "FURN": "Furniture"
    }
  },
  {
    "column": "access_rights",
    "target_column": "permissions",
    "values": ["R", "R+W", "R+W+D"],
    "values_mapping": {
      "R": "view",
      "R+W": "view, edit",
      "R+W+D": "view, edit, delete"
    }
  }
]

- `column` - The source CSV column name (header)
- `target_column` - The target schema column name that `column` maps to
- `values` - Array of all possible categorical values for this column. Populated only if:
  - The column contains a limited, fixed set of values (enum, category, status, boolean)
  - The column has no relational dependency on other columns (not a foreign key or related name field)
  - Set to empty array `[]` if the column contains dynamic values (IDs, names, numbers, dates, free text)
- `values_mapping` - Object that maps source CSV values to target schema values. Populated only when:
  - BOTH source `values` AND target `values` are non-empty (both are categorical)
  - Maps each source categorical value to its corresponding target categorical value
  - Set to `null` if either source or target `values` is empty (one or both are dynamic)
Applying this mapping schema to the source data produces the converted output:

| name | is_active | category | permissions |
|---|---|---|---|
| Nao Toyama | true | Electronics | view, edit, delete |
| Aoki Hina | true | Electronics | view, edit |
| May'n | false | Furniture | view |
The convert_csv.go file (separated from generate_schemas.go) reads these schemas and performs the actual data transformation. This separation of concerns allows the conversion logic to remain independent of how schemas are created.
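As an illustration of what that transformation amounts to, here is a minimal sketch in Go. It is not the actual convert_csv.go; the struct simply mirrors the source mapping schema shown above, and convertRow shows how values_mapping would be applied to each cell while dynamic columns pass through unchanged.

```go
package main

import "fmt"

// SourceColumn mirrors one entry of the source mapping schema shown above.
type SourceColumn struct {
	Column        string            `json:"column"`
	TargetColumn  string            `json:"target_column"`
	Values        []string          `json:"values"`
	ValuesMapping map[string]string `json:"values_mapping"`
}

// convertRow maps one source row (header -> value) onto the target columns,
// translating categorical values via values_mapping when one is defined.
func convertRow(row map[string]string, schema []SourceColumn) map[string]string {
	out := make(map[string]string, len(schema))
	for _, col := range schema {
		value := row[col.Column]
		if col.ValuesMapping != nil {
			if mapped, ok := col.ValuesMapping[value]; ok {
				value = mapped
			}
		}
		out[col.TargetColumn] = value
	}
	return out
}

func main() {
	schema := []SourceColumn{
		{Column: "username", TargetColumn: "name"},
		{Column: "active_status", TargetColumn: "is_active",
			ValuesMapping: map[string]string{"Y": "true", "N": "false"}},
	}
	row := map[string]string{"username": "Nao Toyama", "active_status": "Y"}
	fmt.Println(convertRow(row, schema)) // map[is_active:true name:Nao Toyama]
}
```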
The manual approach provides complete control over every mapping decision. Each transformation rule is explicitly defined, making the process deterministic and reproducible. There are no external dependencies, no AI services required, and the entire process works offline. Data privacy is guaranteed since information never leaves the local machine. For repetitive migrations where the same format is converted regularly, schemas can be created once and reused indefinitely. Debugging is straightforward since issues can be traced directly to specific mapping rules.
Creating schemas manually is time-consuming, particularly for large CSVs with many columns. A 38-column product catalog CSV, for example, requires hours to map completely, identifying all enum values, checking each column for relationships, and typing out the JSON structure. The process is error-prone, with easy opportunities to miss columns or introduce typos in mappings. The work itself is tedious, especially when analyzing 50+ columns and their enum values. There's also an initial learning curve to understand the schema format. For one-off migrations, the setup time often exceeds the value gained.
The breakthrough came from recognizing that analyzing CSV structure and creating mappings is fundamentally a pattern recognition problem. By feeding sample data to an AI with clear instructions about the desired output, the grunt work of schema creation can be automated.
However, this raised an important question: which AI approach works best?
The first AI implementation used local models via Ollama, offering an attractive privacy-focused solution.
Configuration:
Model: "qwen2.5-coder:0.5b"
Endpoint: "http://localhost:11434/api/generate"Initial testing with a 0.5B parameter model running on a standard laptop produced poor results. Column mappings were incorrect, enum values were missed, and the model occasionally hallucinated columns that didn't exist in the source data.
Testing progressed to larger models: first deepseek-r1 (1.5B parameters), then mistral (7B parameters). The mistral 7B model showed significant improvements, correctly identifying most mappings and even catching subtle semantic relationships. For instance, it properly recognized that location_id and location_name form a relational pair rather than fixed enums.
Local AI provides complete privacy guarantees: data never leaves the local machine. There are no API costs beyond the initial model download. The program works offline without internet connectivity and has no rate limits, allowing unlimited schema generations. Users have full control over model selection and can experiment with different options.
Larger models require significant resources, typically 16GB+ RAM for decent performance. Processing is slow, with schema generation taking seconds to minutes per schema on consumer hardware. Quality varies significantly by model size: smaller models produce poor results, while larger models consume more resources. Setup complexity involves installing Ollama and downloading multi-gigabyte model files. Not all users have hardware capable of running larger models effectively.
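For reference, a request against the local endpoint above looks roughly like this in Go. This is a minimal sketch of a call to Ollama's /api/generate API, not the project's ai.go, which may structure the prompt and error handling differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Non-streaming request to the local Ollama generate endpoint.
	body, _ := json.Marshal(map[string]any{
		"model":  "qwen2.5-coder:0.5b",
		"prompt": "Analyze these CSV headers and sample rows and return a JSON schema ...", // schema-generation prompt
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err) // e.g. Ollama is not running; start it with ollama serve
	}
	defer resp.Body.Close()

	// With stream=false, the model's full reply is in the "response" field.
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Response)
}
```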
The alternative approach uses cloud AI services, providing access to more powerful models with faster response times and better results. However, this raises data privacy concerns since information is sent to external servers.
Configuration:
Model: "gpt-oss:120b"
Endpoint: "https://ollama.com/api/chat"Ollama's cloud service was selected for several reasons: it's free within reasonable limits, uses open-source models, provides fast response times (5-10 seconds), and consistently delivers quality results.
Testing revealed impressive performance. The gpt-oss 120B model correctly identified all column mappings, properly handled subtle grouped enum transformations (R+W+D → view, edit, delete), accurately flagged relational columns (supplier_id/supplier_name pairs), generated clean and properly formatted JSON, and completed the entire process in seconds.
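The cloud call differs mainly in the endpoint, the chat-style payload, and the API key. A minimal sketch follows, assuming the key from OLLAMA_API_KEY is sent as a Bearer token; the project's ai.go may handle this differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Chat-style request body for the hosted endpoint.
	body, _ := json.Marshal(map[string]any{
		"model": "gpt-oss:120b",
		"messages": []map[string]string{
			{"role": "user", "content": "Map this source CSV sample onto the target schema ..."},
		},
		"stream": false,
	})

	req, _ := http.NewRequest(http.MethodPost, "https://ollama.com/api/chat", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OLLAMA_API_KEY")) // assumed auth scheme

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The reply text is carried in message.content of the response JSON.
	raw, _ := io.ReadAll(resp.Body)
	fmt.Println(string(raw))
}
```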
Critical insight regarding data privacy:
Send only sample data to the cloud service, never complete datasets. Include just enough rows to represent all possible enum values and data patterns. Sensitive information can be stripped while preserving the structure and variety of values.
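One simple way to build such a sample is to keep the header plus any row that introduces a categorical value not seen before. The sketch below is illustrative only; the enum column indices and file name are assumptions, and this step is not part of the tool itself.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

// sampleRows keeps the header plus every row that introduces a value not yet
// seen in one of the given enum columns, so each categorical value appears at least once.
func sampleRows(records [][]string, enumCols []int) [][]string {
	sample := [][]string{records[0]}
	seen := map[string]bool{}
	for _, row := range records[1:] {
		keep := false
		for _, c := range enumCols {
			key := fmt.Sprintf("%d=%s", c, row[c])
			if !seen[key] {
				seen[key] = true
				keep = true
			}
		}
		if keep {
			sample = append(sample, row)
		}
	}
	return sample
}

func main() {
	f, err := os.Open("input/source_data_3.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	records, _ := csv.NewReader(f).ReadAll()
	// Hypothetical: columns 1 (status) and 2 (category) are the enum columns.
	for _, row := range sampleRows(records, []int{1, 2}) {
		fmt.Println(row)
	}
}
```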
Cloud AI is significantly faster, completing in 5-10 seconds versus 45-95 seconds for local models. It provides access to powerful models with 120B+ parameters, delivering consistent quality with fewer errors. No special hardware is required: any machine with internet access works. Models are always available without downloads or updates. Most importantly, only sample data needs to be sent, not complete datasets.
Data privacy concerns exist since information is transmitted to external servers, even if only samples. Internet connectivity is required, so offline work is impossible. An API key is needed, which requires registering for the service. While Ollama offers generous free tiers, there are usage limits and potential costs for heavy usage. Rate limiting may throttle high-volume operations. The program also depends on external service availability.
Choose the manual approach when:
- Privacy requirements are non-negotiable, such as in regulated industries like healthcare or finance
- The same format is converted repeatedly, making the initial time investment worthwhile
- The structure is relatively simple, with few columns and straightforward mappings
- Complex business rules or edge cases require precise control
- AI access is restricted, such as in air-gapped systems or secure environments
This method works best for ongoing ETL pipelines, regulated industries, and simple structures where one-time setup effort pays long-term dividends.
Choose local AI when:
- Privacy is important but not absolutely critical, such as with internal business data
- Good hardware is available (16GB+ RAM, modern CPU)
- Time constraints are flexible, allowing 1-2 minutes for processing
- There's no budget for API services
- Offline capability is needed due to intermittent internet connectivity
This approach suits one-off migrations, privacy-conscious teams with adequate hardware, and situations where cloud services are unavailable or undesirable.
Choose cloud AI when:
- Speed is important, requiring results in seconds rather than minutes
- Mappings are complex, with many columns and intricate enum transformations
- Creating representative sample data is acceptable from a privacy standpoint
- Internet connectivity is reliable
- The team frequently works with different migration formats
This method works best for most modern use cases, development teams, and rapid prototyping scenarios where iteration speed matters.
In practice, a hybrid approach often works best:
The process begins with cloud AI for initial generation, providing a fast and accurate first pass in approximately 15 seconds. This is followed by manual review and tweaks to fix any AI mistakes. The reviewed schemas are saved for reuse, making future migrations of the same format instant. When privacy requirements are strict, local AI can be used instead of cloud services.
This hybrid approach combines the speed of AI with the control of manual review and the privacy options when needed.
The AI's effectiveness depends entirely on sample data quality. If a CSV contains 50 possible status values but the sample only shows 3, the AI will only map those 3.
Solution: Ensure sample data represents all possible values for enum columns, including edge cases and rare combinations.
AI doesn't understand domain-specific business logic. If "status=pending + age>30 days" should map to "overdue", this logic must be handled manually or in post-processing.
Solution: Use AI for structural mapping, then add business logic separately in custom transformation code.
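For instance, a derived rule like the one above could run as a separate pass over the converted rows. The field names and the 30-day threshold below are purely illustrative, not part of the tool:

```go
package main

import (
	"fmt"
	"time"
)

// applyBusinessRules runs after the schema-driven conversion and encodes
// logic the AI mapping cannot express, e.g. "pending older than 30 days is overdue".
func applyBusinessRules(row map[string]string) map[string]string {
	if row["status"] == "pending" {
		if created, err := time.Parse("2006-01-02", row["created_at"]); err == nil {
			if time.Since(created) > 30*24*time.Hour {
				row["status"] = "overdue"
			}
		}
	}
	return row
}

func main() {
	row := map[string]string{"status": "pending", "created_at": "2024-01-01"}
	fmt.Println(applyBusinessRules(row))
}
```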
AI can miss subtle patterns or make incorrect assumptions about relationships between columns, particularly with ambiguous or unusual data patterns.
Solution: Always review generated schemas before production use, testing with a subset of real data.
Cloud AI services impose usage limits. While Ollama's free tier is generous, it's not unlimited for heavy production use.
Solution: Use sample data to reduce payload sizes, or consider upgrading for heavy usage scenarios.
Prerequisites:
- Go 1.23 or higher
- For Cloud AI: Ollama API key from https://ollama.com/settings/keys
- For Local AI: Ollama installed locally (https://ollama.com)
Clone the repository:
git clone https://github.com/ashr-tech/csv-migration-tools.git
cd csv-migration-tools

Set up the Ollama API key for cloud mode:
Windows (PowerShell):
$env:OLLAMA_API_KEY="your-api-key-here"

macOS/Linux:

export OLLAMA_API_KEY="your-api-key-here"

Generate the schemas:

go run generator/generate_schemas.go

You will be asked for these parameters:
- Path to source sample CSV (migration source sample)
- Path to target sample CSV (migration destination sample)
- AI mode (optional): either `CLOUD` (default) or `LOCAL`
- Name for the schema (suffix for output file name)
Example:
go run generator/generate_schemas.go
Please enter the source sample CSV path: input/samples/source_sample_data_3.csv
Please enter the target sample CSV path: input/samples/target_sample_data_3.csv
Please enter AI mode (CLOUD/LOCAL) [default: CLOUD]:
Please enter a name for the schemas: 3

The tool generates two JSON schema files in the output/schemas/ directory:
- target_schema_3.json - Target data structure schema
- source_schema_3.json - Source to target mapping schema
Run the converter:

go run converter/convert_csv.go

You will be asked for these parameters:
- Path to source data CSV (migration source data)
- Path to source schema JSON (migration source data schema)
- Path to target schema JSON (migration destination data schema)
- Name for the schema (Suffix for output file name)
Example:
go run converter\convert_csv.go
Please enter the source data CSV path: input\source_data_3.csv
Please enter the source schema JSON path: output\schemas\source_schema_3.json
Please enter the target schema JSON path: output\schemas\target_schema_3.json
Please enter a name for the output file: 3

The tool generates the converted CSV file in the output/ directory.
The schema generation process follows a two-phase approach:
Target Schema Generation analyzes the target CSV structure including headers and sample data. It identifies columns with fixed value sets such as enums, categories, and boolean flags. The program distinguishes between fixed values and dynamic or relational data, then outputs a JSON schema describing the target structure.
Source Schema Generation analyzes the source CSV structure and compares it with the target schema. It maps source columns to their corresponding target columns, creates value transformation mappings for enum fields, and handles complex cases like grouped values.
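Put in code terms, the flow is roughly the following. This is a schematic sketch: callAI stands in for the real AI call in ai.go, and the prompt wording is invented.

```go
package main

import "fmt"

// callAI is a stand-in for the real request to the local or cloud model.
func callAI(prompt string) string {
	return "[]" // the model's JSON schema reply would be returned here
}

func main() {
	// Phase 1: derive the target schema from the target sample CSV alone.
	targetPrompt := "List each target column and its fixed set of values: <target sample CSV>"
	targetSchema := callAI(targetPrompt)

	// Phase 2: map the source sample onto that schema, adding target_column and values_mapping.
	sourcePrompt := fmt.Sprintf("Given the target schema %s, map these source columns onto it: <source sample CSV>", targetSchema)
	sourceSchema := callAI(sourcePrompt)

	fmt.Println(targetSchema)
	fmt.Println(sourceSchema)
}
```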
AI detects relational data patterns indicating foreign key relationships, identifies semantic meaning in abbreviated codes, maps grouped enum values (e.g., "R+W" → "view, edit"), and preserves column order from the original CSV.
The repository includes sample CSV files for testing:
- input/samples/target_sample_3.csv - Target format example (14 rows)
- input/samples/source_sample_3.csv - Source format example (14 rows)
- input/source_data_3.csv - Larger source dataset (50 rows)
csv-migration-tools/
├── config/
│ └── config.go # Model and endpoint config
├── converter/
│ └── convert_csv.go # CSV converter functions
├── generator/
│ └── generate_schemas.go # Schemas generation functions
├── ai/
│ └── ai.go # AI API call functions
├── types/
│ └── types.go # Data type definitions
├── utils/
│ └── utils.go # Utility functions (CSV/JSON handling)
├── input/
│ └── samples/ # Sample CSV files
├── output/
│ └── schemas/ # Generated schema files
└── README.md
"OLLAMA_API_KEY is not set"
Verify the API key has been exported in the current shell session.
Local Ollama connection refused
Ensure Ollama is running locally:
ollama serve

JSON parse errors
The AI sometimes includes markdown formatting. The tool automatically strips common patterns, but if errors persist, check the console output for the raw AI response.
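That cleanup essentially means trimming code fences from the model's reply before parsing it as JSON. A minimal sketch of such a step (the tool's actual handling may cover more patterns):

````go
package main

import (
	"fmt"
	"strings"
)

// stripMarkdownFences removes a leading ```json (or bare ```) fence and a
// trailing ``` fence so the remaining text can be fed to json.Unmarshal.
func stripMarkdownFences(s string) string {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "```json")
	s = strings.TrimPrefix(s, "```")
	s = strings.TrimSuffix(s, "```")
	return strings.TrimSpace(s)
}

func main() {
	raw := "```json\n[{\"column\": \"name\", \"values\": []}]\n```"
	fmt.Println(stripMarkdownFences(raw)) // [{"column": "name", "values": []}]
}
````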
Schema doesn't match expectations
Review the console output showing the AI prompts and responses. Consider adjusting sample CSV data to be more representative, ensuring enough data rows to show all enum values, and verifying that column relationships are clear in the data.
Contributions are welcome. Please submit issues or pull requests through the GitHub repository.
MIT License - see LICENSE file for details
Created by elrizwiraswara (https://github.com/elrizwiraswara) with some help from Claude.ai
