gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser#137837
serhiy-storchaka merged 12 commits into python:main
…arser
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
bb7b873 to
2153a4c
Doc/library/html.parser.rst
Outdated
Create a parser instance able to parse invalid markup.

If *convert_charrefs* is ``True`` (the default), all character
references (except the ones in ``script``/``style`` elements) are
This should be updated now that the list has been expanded.
It might be easier to have a short section about parsing modes, listing each mode, which elements trigger it, whether charrefs are converted or not, and when the state is terminated.
Here we could then say
- references (except the ones in ``script``/``style`` elements) are
+ references (except the ones in RAWTEXT tags) are
with RAWTEXT linking to that section.
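For readers who want to see the behaviour the quoted documentation line describes, here is a minimal sketch. It uses only `script`, which was already handled this way before this PR, so it should not depend on the new element list. The `CharrefDemo` class and its attribute names are made up for this illustration and are not part of `html.parser`.

```python
from html.parser import HTMLParser


class CharrefDemo(HTMLParser):
    """Hypothetical helper: collects text inside and outside <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_script = False
        self.script_text = []
        self.other_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        target = self.script_text if self.in_script else self.other_text
        target.append(data)


demo = CharrefDemo()
demo.feed("<p>a &amp; b</p><script>if (a &amp;&amp; b) {}</script>")
demo.close()

print("".join(demo.other_text))   # charrefs converted outside <script>
print("".join(demo.script_text))  # charrefs left untouched inside <script>
```

A "parsing modes" section could use an example of this shape for each mode, showing which elements trigger it and whether charrefs are converted.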
Do we need to document this here? This is a part of the HTML5 specification. What will the user get from this information?
Lib/html/parser.py
Outdated
    self.set_cdata_mode(tag)
elif tag == "plaintext":
    self.set_cdata_mode(tag)
    self.interesting = re.compile(r'\z')
I think it would be better to move this into set_cdata_mode by adding a third branch to the if/else that sets self.interesting.
I considered this option. But should we repeat the condition tag == "plaintext" in two places, or add "plaintext" to CDATA_CONTENT_ELEMENTS or RCDATA_CONTENT_ELEMENTS? In either case we would need to repeat "plaintext" twice. It could also create an asymmetry with "noscript" if special cases are handled in different places. That is how I arrived at the current code.
Another option is to use the special value escapable=None to switch to PLAINTEXT mode.
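To make the effect of the end-only pattern in the diff concrete, here is a small sketch. It assumes that `\Z` behaves the same as the `\z` used in the diff. As far as I know `\z` is only accepted by `re` in recent Python versions, so `\Z` is used here. The variable names are illustrative only.

```python
import re

# Stand-in for the "interesting" pattern used in PLAINTEXT mode. The diff
# writes r'\z'; r'\Z' is used here on the assumption that it is equivalent
# and accepted by more Python versions.
end_only = re.compile(r"\Z")

text = "raw <b>markup</b> &amp; even </plaintext> stays text"
match = end_only.search(text)

# The only match is the empty one at the very end of the input, so a
# scanner driven by this pattern would treat everything before it as data.
print(match.start() == len(text))
```

With such a pattern nothing inside the element, not even a `</plaintext>` end tag, would stop the scan, which matches the description of PLAINTEXT as a mode that is never terminated.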
This PR seems to address 3 issues:
The difference between states is the following:
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Lib/html/parser.py
Outdated
@@ -448,6 +458,10 @@ def parse_starttag(self, i):
    self.set_cdata_mode(tag)
elif tag in self.RCDATA_CONTENT_ELEMENTS:
    self.set_cdata_mode(tag, escapable=True)
elif self.scripting and tag == "noscript":
    self.set_cdata_mode(tag)
elif tag == "plaintext":
    self.set_cdata_mode(tag, escapable=None)
I'm not very fond of (ab)using escapable=None for PLAINTEXT mode.
Currently the set_cdata_mode function does two things:
- determines where the closing tag/end is, which depends on the tag passed;
- determines whether charrefs are converted, which depends on the value passed to escapable;
Even though there is some duplication, I would prefer something like this:
if (tag in self.CDATA_CONTENT_ELEMENTS or
    (self.scripting and tag == "noscript") or
    tag == "plaintext"):
    self.set_cdata_mode(tag, escapable=False)
elif tag in self.RCDATA_CONTENT_ELEMENTS:
    self.set_cdata_mode(tag, escapable=True)

This makes it clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.
Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
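The suggestion above can be sketched as a standalone toy, shown here only to illustrate the shape of the proposal. `ModeSketch`, `on_starttag`, the module-level tuples, and the exact regular expressions are all assumptions made for this example. The real code lives in Lib/html/parser.py and may differ in detail.

```python
import re

# Element lists assumed from the PR description and the HTML specification;
# the actual class attributes in Lib/html/parser.py may differ.
CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe",
                          "noembed", "noframes")
RCDATA_CONTENT_ELEMENTS = ("textarea", "title")


class ModeSketch:
    """Hypothetical stand-in for the mode-selection part of a parser."""

    def __init__(self, scripting=False):
        self.scripting = scripting
        self.cdata_elem = None
        self.escapable = None
        self.interesting = None

    def set_cdata_mode(self, tag, *, escapable=False):
        self.cdata_elem = tag.lower()
        self.escapable = escapable
        end_tag = r"</%s(?=[\t\n\r\f />])" % re.escape(self.cdata_elem)
        if self.cdata_elem == "plaintext":
            # PLAINTEXT never ends: only end of input is interesting.
            self.interesting = re.compile(r"\Z")
        elif escapable:
            # RCDATA: stop at a charref or at the matching end tag.
            self.interesting = re.compile("&|" + end_tag, re.IGNORECASE)
        else:
            # RAWTEXT: only the matching end tag is interesting.
            self.interesting = re.compile(end_tag, re.IGNORECASE)

    def on_starttag(self, tag):
        # Single dispatch point, following the structure proposed above.
        if (tag in CDATA_CONTENT_ELEMENTS or
                (self.scripting and tag == "noscript") or
                tag == "plaintext"):
            self.set_cdata_mode(tag, escapable=False)
        elif tag in RCDATA_CONTENT_ELEMENTS:
            self.set_cdata_mode(tag, escapable=True)


raw = ModeSketch()
raw.on_starttag("xmp")
rcdata = ModeSketch()
rcdata.on_starttag("textarea")
plain = ModeSketch()
plain.on_starttag("plaintext")
noscript_off = ModeSketch(scripting=False)
noscript_off.on_starttag("noscript")
noscript_on = ModeSketch(scripting=True)
noscript_on.on_starttag("noscript")
```

In this sketch the caller decides only whether charrefs are converted, while `set_cdata_mode` alone decides what counts as interesting for each tag, which is the separation the comment asks for.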
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
…hon into htmlparser-rawtext
Thank you for your review @ezio-melotti.
Thanks @serhiy-storchaka for the PR 🌮🎉. I'm working now to backport this PR to: 3.13, 3.14.
…arser (pythonGH-137837)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
GH-140841 is a backport of this pull request to the 3.14 branch.
GH-140842 is a backport of this pull request to the 3.13 branch.
Backporting to older Python versions should be from 3.13. |
…n HTMLParser (pythonGH-137837) (pythonGH-140842)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)
Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
📚 Documentation preview 📚: https://cpython-previews--137837.org.readthedocs.build/