Commit 1c3fa12

Improved the prompt for correcting llm output to be more strict
1 parent bc0ecd4 commit 1c3fa12

1 file changed: +37 -26 lines changed

conversion2025/mathpix_to_llm_to_in2lambda_to_JSON.ipynb

Lines changed: 37 additions & 26 deletions
@@ -308,7 +308,7 @@
 "# Uses gpt-4o-mini:\n",
 "# - more intelligent\n",
 "llm_mini = ChatOpenAI(\n",
-" model=\"gpt-4o-mini\",\n",
+" model=\"gpt-4.1-mini\",\n",
 " api_key=os.environ[\"OPENAI_API_KEY\"],\n",
 " )"
 ]
@@ -407,8 +407,9 @@
 " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n",
 "\n",
 " 3. **JSON Formatting Rules:**\n",
-" - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup`. This is the most important rule.\n",
-" - **Newlines:** Use `\\n` for newlines within the JSON string values.\n",
+" - **Backslash Escaping (Very Important):**\n",
+" - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac`.\n",
+" - For **newlines**, you MUST use a single `\\n`. Do NOT escape it as `\\\\n`.\n",
 " - **Content Integrity:** Preserve all text, LaTeX (`$...$`, `$$...$$`), and image tags perfectly. Do not alter or summarize content.\n",
 " - **Strict Schema:** Ensure the final JSON has no trailing commas and includes all fields, even if they are empty.\n",
 " \"\"\"\n",
@@ -545,22 +546,22 @@
 "\n",
 " 1. **Content Splitting:**\n",
 " - From the input `question_content`, identify the main introductory text (the stem) and place it in the `content` field.\n",
-" - Identify all sub-questions (e.g., \"(a)\", \"(b)\", \"i.\", \"ii.\") and place their text into the `parts` list.\n",
-" - Parts may also be implied.\n",
-" - All Question Must have at least one part.\n",
+" - Identify all sub-questions (e.g., \"(a)\", \"(b)\", \"i.\", \"ii.\") and place their text into the `parts` list. Sub-questions may also be implied.\n",
+" - Questions with no sub-questions should have a single part in the `parts` list, which is the entire question text.\n",
 " - Ensure that images references are correctly placed with their respective parts.\n",
 " - Preserve all content perfectly, including text, LaTeX, and image tags like `![pictureTag](filename.jpg)`.\n",
 " - Ensure no solution content is included in the `content` or `parts` fields.\n",
-" - The `title` should be a concise summary of the question.\n",
+" - You may choose what the title of the question should be.\n",
 " - The `images` list should be copied exactly from the input.\n",
 "\n",
 " 2. **Output Format (Crucial):**\n",
 " - You MUST output ONLY a single, raw, valid JSON string.\n",
 " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n",
 "\n",
 " 3. **JSON Formatting Rules:**\n",
-" - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup`. This is the most important rule.\n",
-" - **Newlines:** Use `\\n` for newlines within the JSON string values.\n",
+" - **Backslash Escaping (Very Important):**\n",
+" - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac`.\n",
+" - For **newlines**, you MUST use a single `\\n`. Do NOT escape it as `\\\\n`.\n",
 " - **Content Integrity:** Preserve all text, LaTeX (`$...$`, `$$...$$`), and image tags (`![pictureTag](...)`) perfectly. Do not alter or summarize content.\n",
 " \"\"\"\n",
 "\n",
@@ -581,7 +582,8 @@
 " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n",
 "\n",
 " 3. **JSON Formatting Rules:**\n",
-" - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup` in the JSON string. This is the most important rule.\n",
+" - **Backslash Escaping (Very Important):**\n",
+" - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac` in the JSON string. This is the most important rule.\n",
 " - **Newlines:** Use `\\n` for newlines within the JSON string values.\n",
 " - **Math Delimiters:** Ensure all math delimiters (`$...$` and `$$...$$`) are correctly balanced and preserved.\n",
 " \"\"\"\n",
@@ -713,16 +715,23 @@
 " content: str = Field(..., description=\"The main content of the question\")\n",
 "\n",
 "llm_task_text_check = r\"\"\"\n",
-" Your task is to validate and correct the content within the `part_text` field of the provided JSON input.\n",
+" Your task is to validate and correct the content of the provided JSON fields to ensure it is clean, well-formatted, and valid Texdown (Markdown with LaTeX).\n",
 " You MUST return ONLY a single, raw, valid JSON string that strictly follows the original schema. Do NOT add any explanations, comments, or markdown code blocks.\n",
 "\n",
 " Apply these correction rules to the content inside the JSON fields:\n",
-" 1. **JSON Escaping:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must be written as `\\\\cup`. Never escape backslashes for newlines (`\\n`), as they should remain as is.\n",
-" 2. **Math Delimiters:** All mathematical content must be enclosed in `$...$` for inline math or `$$...$$` for display math. Ensure all delimiters are correctly balanced and closed. '$' and '$$' should not be used for any other purpose. Move all `\\n` outside the math delimiters.\n",
-" 3. **Display Math:** `$$` delimiters must be on their own separate lines.\n",
-" 4. **Image Tags:** Preserve image tags like `![pictureTag](filename.jpg)` exactly as they are.\n",
-" 5. **Content Integrity:** Do not change, paraphrase, or summarize any text, formulas, or image links. Only fix formatting errors according to these rules.\n",
-" 6. **Newlines:** Use `\\n` for newlines within the JSON string values.\n",
+" 1. **JSON Escaping:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\frac` must become `\\\\frac`. Do not escape backslashes for newlines (`\\n`).\n",
+" 2. **Enforce Math Delimiters:** This is the most important rule. Any text containing LaTeX commands (e.g., `\\lim`, `\\frac`, `\\sin`, `\\alpha`, `^`, `_`) or mathematical structures that is NOT already enclosed in `$..$` or `$$..$$` MUST be wrapped.\n",
+" - Use `$$...$$` for standalone equations and `$...$` for inline math.\n",
+" - Be careful to not wrap text that is already correctly formatted with LaTeX math delimiters.\n",
+" 3. **Display Math Formatting:** This rule is critical. Display math blocks MUST be formatted strictly as follows: a blank line, the opening `$$` on its own line, the LaTeX content, the closing `$$` on its own line, and a blank line.\n",
+" - **Incorrect:** `...text $$x=y$$ more text...`\n",
+" - **Incorrect:** `...text\\n$$\\nx=y\\n\\n$$\\nmore text...`\n",
+" - **Correct:** `...text\\n\\n$$\\nx=y\\n$$\\n\\nmore text...`\n",
+" 4. **LaTeX Environments:** Environments like `aligned`, `cases`, `matrix`, `gathered`, etc., must be entirely contained within a single display math block (`$$...$$`). Ensure that every `\\begin{...}` has a matching `\\end{...}`.\n",
+" 7. **Markdown Lists:** Ensure that markdown lists (e.g., using `*`, `-`, or `1.`) are correctly formatted with proper indentation and spacing.\n",
+" 8. **Spacing and Readability:** Ensure there is a single blank line between paragraphs, lists, and other block elements to improve readability. Remove any excessive blank lines.\n",
+" 9. **Cleanup Redundancy:** Correct or remove any repetitive or nonsensical phrases that may be artifacts from OCR (e.g., \"is monotone. is monotone\" should be corrected to \"is monotone\").\n",
+" 10. **Content Integrity:** Do not change, paraphrase, or summarize any text, formulas, or image links. Only fix formatting, spacing, and structural errors according to these rules.\n",
 " \"\"\"\n",
 "\n",
 "def validate_part_text(part_text_data):\n",
@@ -736,7 +745,7 @@
 " }\n",
 " \n",
 " validation_prompt = f\"\"\"\n",
-" Your task is to extract a JSON with the following structure exactly:\n",
+" Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n",
 " {part_text_parser.get_format_instructions()}\n",
 "\n",
 " Your task is to validate and correct the content within the `part_text` field of the provided JSON input.\n",
@@ -777,7 +786,7 @@
 " }\n",
 " \n",
 " validation_prompt = f\"\"\"\n",
-" Your task is to extract a JSON with the following structure exactly:\n",
+" Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n",
 " {part_solution_parser.get_format_instructions()}\n",
 "\n",
 " Your task is to validate and correct the content within the `part_solution` field of the provided JSON input.\n",
@@ -819,7 +828,7 @@
 " }\n",
 " \n",
 " validation_prompt = f\"\"\"\n",
-" Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n",
 " {content_parser.get_format_instructions()}\n",
 "\n",
 " Your task is to validate and correct the content within the `title` and `content` fields of the provided JSON input.\n",
@@ -937,12 +946,14 @@
 " print(json.dumps(extracted_dict, indent=2))\n",
 " print(\"Now validating the content...\")\n",
 "\n",
-" # content_validated_dict = content_texdown_check(extracted_dict)\n",
-" # print(\"successfully validated the content.\")\n",
-" # print(json.dumps(content_validated_dict, indent=2))\n",
-" # print(\"successfully converted markdown to JSON.\")\n",
-" \n",
-" return extracted_dict"
+" # return extracted_dict\n",
+"\n",
+" content_validated_dict = content_texdown_check(extracted_dict)\n",
+" print(\"successfully validated the content.\")\n",
+" print(json.dumps(content_validated_dict, indent=2))\n",
+" print(\"successfully converted markdown to JSON.\")\n",
+"\n",
+" return content_validated_dict"
 ]
 },
 {
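
The prompts changed in this commit ask the model to double LaTeX backslashes, keep newlines as a single `\n` escape, and put display-math `$$` delimiters on their own lines with blank lines around the block. The sketch below is not part of the notebook; it only illustrates those rules using the standard library. The function names `parse_content` and `display_math_ok` are hypothetical, and the delimiter check is a simplified reading of rule 3 rather than a full Texdown validator.

```python
import json


def parse_content(raw_json: str) -> str:
    """Parse a JSON object and return its `content` string (hypothetical helper)."""
    return json.loads(raw_json)["content"]


def display_math_ok(text: str) -> bool:
    """Simplified check of the display-math rule: every `$$` sits alone on its
    line, an opening `$$` follows a blank line, a closing `$$` precedes one."""
    lines = text.split("\n")
    inside = False
    for i, line in enumerate(lines):
        if "$$" not in line:
            continue
        if line.strip() != "$$":
            return False  # delimiter shares its line with other text
        if not inside and i > 0 and lines[i - 1].strip() != "":
            return False  # no blank line before the opening delimiter
        if inside and i + 1 < len(lines) and lines[i + 1].strip() != "":
            return False  # no blank line after the closing delimiter
        inside = not inside
    return not inside  # delimiters must be balanced


# Target text after parsing: real newlines and a single LaTeX backslash.
text = "Evaluate\n\n$$\n\\frac{1}{2}\n$$\n\nexactly."

# json.dumps produces the escaping the prompts ask for: the LaTeX backslash
# is doubled, while each newline becomes a single \n escape.
encoded = json.dumps({"content": text})
assert parse_content(encoded) == text

# Under-escaped LaTeX is silently corrupted when the letter after the
# backslash happens to be a JSON escape: \f is read as a form feed.
corrupted = parse_content(r'{"content": "$\frac{1}{2}$"}')

# Other commands such as \cup are invalid JSON escapes and fail to parse.
try:
    parse_content(r'{"content": "$A \cup B$"}')
    cup_parsed = True
except json.JSONDecodeError:
    cup_parsed = False

# Over-escaped newlines survive parsing as a literal backslash followed by n.
over_escaped = parse_content(r'{"content": "line one\\nline two"}')
```

The two failure modes are asymmetric: `\frac` parses without error but loses its backslash, which may be why the revised prompts switch the example from `\cup` to `\frac`, although the commit does not say so.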
