|
308 | 308 | "# Uses gpt-4.1-mini:\n", |
309 | 309 | "# - more intelligent\n", |
310 | 310 | "llm_mini = ChatOpenAI(\n", |
311 | | - " model=\"gpt-4o-mini\",\n", |
| 311 | + " model=\"gpt-4.1-mini\",\n", |
312 | 312 | " api_key=os.environ[\"OPENAI_API_KEY\"],\n", |
313 | 313 | " )" |
314 | 314 | ] |
|
407 | 407 | " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n", |
408 | 408 | "\n", |
409 | 409 | " 3. **JSON Formatting Rules:**\n", |
410 | | - " - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup`. This is the most important rule.\n", |
411 | | - " - **Newlines:** Use `\\n` for newlines within the JSON string values.\n", |
| 410 | + " - **Backslash Escaping (Very Important):**\n", |
| 411 | + " - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac`.\n", |
| 412 | + " - For **newlines**, you MUST use a single `\\n`. Do NOT escape it as `\\\\n`.\n", |
412 | 413 | " - **Content Integrity:** Preserve all text, LaTeX (`$...$`, `$$...$$`), and image tags perfectly. Do not alter or summarize content.\n", |
413 | 414 | " - **Strict Schema:** Ensure the final JSON has no trailing commas and includes all fields, even if they are empty.\n", |
414 | 415 | " \"\"\"\n", |
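
The backslash rule in this hunk can be checked with the standard `json` module. The following is a minimal sketch, not part of the notebook: the sample strings are made up for illustration, and raw string literals stand in for the exact characters a model might emit.

```python
import json

# Correct: LaTeX backslash doubled, newline written as a single \n escape.
good = r'{"content": "Compute $\\frac{1}{2}$.\nThen simplify."}'
parsed = json.loads(good)["content"]
# parsed holds one backslash before "frac" and a real newline character.

# Wrong: a single backslash before "frac" is read as the JSON escape \f
# (form feed), so the text is silently corrupted instead of rejected.
corrupted = json.loads(r'{"content": "$\frac{1}{2}$"}')["content"]

# Wrong: \c is not a JSON escape at all, so parsing fails outright.
try:
    json.loads(r'{"content": "$A \cup B$"}')
    cup_error = None
except json.JSONDecodeError as exc:
    cup_error = exc

# Wrong: an over-escaped newline survives as the two characters "\" and "n".
literal = json.loads(r'{"content": "line one\\nline two"}')["content"]
```

The `\frac` case is a plausible reason to prefer it over `\cup` as the prompt's example: an unescaped `\cup` fails loudly, while an unescaped `\frac` parses without error and loses the command.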
|
545 | 546 | "\n", |
546 | 547 | " 1. **Content Splitting:**\n", |
547 | 548 | " - From the input `question_content`, identify the main introductory text (the stem) and place it in the `content` field.\n", |
548 | | - " - Identify all sub-questions (e.g., \"(a)\", \"(b)\", \"i.\", \"ii.\") and place their text into the `parts` list.\n", |
549 | | - " - Parts may also be implied.\n", |
550 | | - " - All Question Must have at least one part.\n", |
| 549 | + " - Identify all sub-questions (e.g., \"(a)\", \"(b)\", \"i.\", \"ii.\") and place their text into the `parts` list. Sub-questions may also be implied.\n", |
| 550 | + " - Questions with no sub-questions should have a single part in the `parts` list, which is the entire question text.\n", |
551 | 551 | " - Ensure that image references are correctly placed with their respective parts.\n", |
552 | 552 | " - Preserve all content perfectly, including text, LaTeX, and image tags like ``.\n", |
553 | 553 | " - Ensure no solution content is included in the `content` or `parts` fields.\n", |
554 | | - " - The `title` should be a concise summary of the question.\n", |
| 554 | + " - You may choose what the title of the question should be.\n", |
555 | 555 | " - The `images` list should be copied exactly from the input.\n", |
556 | 556 | "\n", |
557 | 557 | " 2. **Output Format (Crucial):**\n", |
558 | 558 | " - You MUST output ONLY a single, raw, valid JSON string.\n", |
559 | 559 | " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n", |
560 | 560 | "\n", |
561 | 561 | " 3. **JSON Formatting Rules:**\n", |
562 | | - " - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup`. This is the most important rule.\n", |
563 | | - " - **Newlines:** Use `\\n` for newlines within the JSON string values.\n", |
| 562 | + " - **Backslash Escaping (Very Important):**\n", |
| 563 | + " - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac`.\n", |
| 564 | + " - For **newlines**, you MUST use a single `\\n`. Do NOT escape it as `\\\\n`.\n", |
564 | 565 | " - **Content Integrity:** Preserve all text, LaTeX (`$...$`, `$$...$$`), and image tags (``) perfectly. Do not alter or summarize content.\n", |
565 | 566 | " \"\"\"\n", |
566 | 567 | "\n", |
|
581 | 582 | " - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n", |
582 | 583 | "\n", |
583 | 584 | " 3. **JSON Formatting Rules:**\n", |
584 | | - " - **Escape Backslashes:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must become `\\\\cup` in the JSON string. This is the most important rule.\n", |
| 585 | + " - **Backslash Escaping (Very Important):**\n", |
| 586 | + " - For **LaTeX commands**, a single backslash `\\` MUST be escaped as a double backslash `\\\\`. For example, `\\frac` must become `\\\\frac` in the JSON string. This is the most important rule.\n", |
585 | 587 | " - **Newlines:** Use `\\n` for newlines within the JSON string values.\n", |
586 | 588 | " - **Math Delimiters:** Ensure all math delimiters (`$...$` and `$$...$$`) are correctly balanced and preserved.\n", |
587 | 589 | " \"\"\"\n", |
|
713 | 715 | " content: str = Field(..., description=\"The main content of the question\")\n", |
714 | 716 | "\n", |
715 | 717 | "llm_task_text_check = r\"\"\"\n", |
716 | | - " Your task is to validate and correct the content within the `part_text` field of the provided JSON input.\n", |
| 718 | + " Your task is to validate and correct the content of the provided JSON fields to ensure it is clean, well-formatted, and valid Texdown (Markdown with LaTeX).\n", |
717 | 719 | " You MUST return ONLY a single, raw, valid JSON string that strictly follows the original schema. Do NOT add any explanations, comments, or markdown code blocks.\n", |
718 | 720 | "\n", |
719 | 721 | " Apply these correction rules to the content inside the JSON fields:\n", |
720 | | - " 1. **JSON Escaping:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\cup` must be written as `\\\\cup`. Never escape backslashes for newlines (`\\n`), as they should remain as is.\n", |
721 | | - " 2. **Math Delimiters:** All mathematical content must be enclosed in `$...$` for inline math or `$$...$$` for display math. Ensure all delimiters are correctly balanced and closed. '$' and '$$' should not be used for any other purpose. Move all `\\n` outside the math delimiters.\n", |
722 | | - " 3. **Display Math:** `$$` delimiters must be on their own separate lines.\n", |
723 | | - " 4. **Image Tags:** Preserve image tags like `` exactly as they are.\n", |
724 | | - " 5. **Content Integrity:** Do not change, paraphrase, or summarize any text, formulas, or image links. Only fix formatting errors according to these rules.\n", |
725 | | - " 6. **Newlines:** Use `\\n` for newlines within the JSON string values.\n", |
| 722 | + " 1. **JSON Escaping:** All LaTeX backslashes (`\\`) MUST be escaped as double backslashes (`\\\\`). For example, `\\frac` must become `\\\\frac`. Do not escape backslashes for newlines (`\\n`).\n", |
| 723 | + " 2. **Enforce Math Delimiters:** This is the most important rule. Any text containing LaTeX commands (e.g., `\\lim`, `\\frac`, `\\sin`, `\\alpha`, `^`, `_`) or mathematical structures that is NOT already enclosed in `$...$` or `$$...$$` MUST be wrapped.\n", |
| 724 | + " - Use `$$...$$` for standalone equations and `$...$` for inline math.\n", |
| 725 | + " - Be careful to not wrap text that is already correctly formatted with LaTeX math delimiters.\n", |
| 726 | + " 3. **Display Math Formatting:** This rule is critical. Display math blocks MUST be formatted strictly as follows: a blank line, the opening `$$` on its own line, the LaTeX content, the closing `$$` on its own line, and a blank line.\n", |
| 727 | + " - **Incorrect:** `...text $$x=y$$ more text...`\n", |
| 728 | + " - **Incorrect:** `...text\\n$$\\nx=y\\n\\n$$\\nmore text...`\n", |
| 729 | + " - **Correct:** `...text\\n\\n$$\\nx=y\\n$$\\n\\nmore text...`\n", |
| 730 | + " 4. **LaTeX Environments:** Environments like `aligned`, `cases`, `matrix`, `gathered`, etc., must be entirely contained within a single display math block (`$$...$$`). Ensure that every `\\begin{...}` has a matching `\\end{...}`.\n", |
| 731 | + " 5. **Markdown Lists:** Ensure that markdown lists (e.g., using `*`, `-`, or `1.`) are correctly formatted with proper indentation and spacing.\n", |
| 732 | + " 6. **Spacing and Readability:** Ensure there is a single blank line between paragraphs, lists, and other block elements to improve readability. Remove any excessive blank lines.\n", |
| 733 | + " 7. **Cleanup Redundancy:** Correct or remove any repetitive or nonsensical phrases that may be artifacts from OCR (e.g., \"is monotone. is monotone\" should be corrected to \"is monotone\").\n", |
| 734 | + " 8. **Content Integrity:** Do not change, paraphrase, or summarize any text, formulas, or image links. Only fix formatting, spacing, and structural errors according to these rules.\n", |
726 | 735 | " \"\"\"\n", |
727 | 736 | "\n", |
728 | 737 | "def validate_part_text(part_text_data):\n", |
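
The display math layout described in the rules above (blank line, `$$`, content, `$$`, blank line) can also be enforced deterministically after the model responds. This is a rough sketch under stated assumptions: `normalize_display_math` is a hypothetical helper that does not exist in the notebook, and the regex assumes `$$` is used only as a display math delimiter.

```python
import re

# Matches a $$...$$ block together with any whitespace around it.
DISPLAY_MATH = re.compile(r"\s*\$\$(.+?)\$\$\s*", re.DOTALL)


def normalize_display_math(text: str) -> str:
    """Put every display math block on its own lines, padded by blank lines."""

    def rewrite(match: re.Match) -> str:
        body = match.group(1).strip()
        return f"\n\n$$\n{body}\n$$\n\n"

    return DISPLAY_MATH.sub(rewrite, text)


inline_block = normalize_display_math("text $$x=y$$ more text")
loose_block = normalize_display_math("text\n$$\nx=y\n\n$$\nmore text")
already_ok = normalize_display_math("text\n\n$$\nx=y\n$$\n\nmore text")
```

A block at the very start or end of the text keeps its padding newlines in this sketch, so callers may want to strip the result.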
|
736 | 745 | " }\n", |
737 | 746 | " \n", |
738 | 747 | " validation_prompt = f\"\"\"\n", |
739 | | - " Your task is to extract a JSON with the following structure exactly:\n", |
| 748 | + " Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n", |
740 | 749 | " {part_text_parser.get_format_instructions()}\n", |
741 | 750 | "\n", |
742 | 751 | " Your task is to validate and correct the content within the `part_text` field of the provided JSON input.\n", |
|
777 | 786 | " }\n", |
778 | 787 | " \n", |
779 | 788 | " validation_prompt = f\"\"\"\n", |
780 | | - " Your task is to extract a JSON with the following structure exactly:\n", |
| 789 | + " Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n", |
781 | 790 | " {part_solution_parser.get_format_instructions()}\n", |
782 | 791 | "\n", |
783 | 792 | " Your task is to validate and correct the content within the `part_solution` field of the provided JSON input.\n", |
|
819 | 828 | " }\n", |
820 | 829 | " \n", |
821 | 830 | " validation_prompt = f\"\"\"\n", |
822 | | - " Your task is to extract a JSON with the following structure exactly:\n", |
| 831 | + " Your task is to extract a JSON with the following structure exactly, to be parsed by a pydantic model:\n", |
823 | 832 | " {content_parser.get_format_instructions()}\n", |
824 | 833 | "\n", |
825 | 834 | " Your task is to validate and correct the content within the `title` and `content` fields of the provided JSON input.\n", |
|
937 | 946 | " print(json.dumps(extracted_dict, indent=2))\n", |
938 | 947 | " print(\"Now validating the content...\")\n", |
939 | 948 | "\n", |
940 | | - " # content_validated_dict = content_texdown_check(extracted_dict)\n", |
941 | | - " # print(\"successfully validated the content.\")\n", |
942 | | - " # print(json.dumps(content_validated_dict, indent=2))\n", |
943 | | - " # print(\"successfully converted markdown to JSON.\")\n", |
944 | | - " \n", |
945 | | - " return extracted_dict" |
| 949 | + " # return extracted_dict\n", |
| 950 | + "\n", |
| 951 | + " content_validated_dict = content_texdown_check(extracted_dict)\n", |
| 952 | + " print(\"successfully validated the content.\")\n", |
| 953 | + " print(json.dumps(content_validated_dict, indent=2))\n", |
| 954 | + " print(\"successfully converted markdown to JSON.\")\n", |
| 955 | + "\n", |
| 956 | + " return content_validated_dict" |
946 | 957 | ] |
947 | 958 | }, |
948 | 959 | { |
|