Summary
PHP's default HTML parser is built on the libxml2 library. libxml2 does not currently support HTML5 parsing. Work on it is under way, but when we contacted the maintainers about this they said it will take a while before the feature is implemented. This means that the built-in HTML parser behind loadHTML, DOMImplementation, etc. does not follow the same parsing rules as modern web browsers.
This behaviour becomes security-relevant when HTML sanitizers use the built-in HTML parser.
We have come across multiple PHP sanitizers that are vulnerable to bypasses due to using the built-in parser, and we think that the root cause can’t be addressed without significant changes by libxml2.
PoC
Here are some examples of how attackers can leverage these parsing differentials in order to bypass sanitizers.
1. Comments:
According to the XML specification (XHTML), comments must end with the characters -->. On the other hand, the HTML specification states that a comment’s text “must not start with the string >, nor start with the string ->”.
When the following string is parsed in a browser, the comment ends before the p tag. When it is parsed with PHP, the p tag is considered part of the comment:
Input: <!--><p>
This can be done with either <!--> or <!--->.
An attacker can submit the following payload: <!--><xss>-->. The PHP parser considers the xss tag part of a comment, while the browser ends the comment right before it and renders the xss tag as expected.
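The differential above can be modelled with a short, language-neutral sketch. The Python below is not PHP's or libxml2's actual code. It assumes the legacy parser closes a comment only at "-->" (as described above) and approximates the HTML5 tokenizer with three rules: ">" or "->" right after "<!--" closes the comment immediately, otherwise the first "-->" or "--!>" does. The function names are hypothetical.

```python
def xml_style_comment_end(html: str, start: int = 0) -> int:
    """Index just past the comment when only '-->' can close it (assumed legacy rule)."""
    assert html.startswith("<!--", start)
    close = html.find("-->", start + 4)
    return len(html) if close == -1 else close + 3


def html5_comment_end(html: str, start: int = 0) -> int:
    """Index just past the comment under a simplified HTML5 tokenizer rule."""
    assert html.startswith("<!--", start)
    body = start + 4
    # '<!-->' and '<!--->' close the comment immediately.
    for abrupt in (">", "->"):
        if html.startswith(abrupt, body):
            return body + len(abrupt)
    # Otherwise the comment ends at the first '-->' or '--!>'.
    candidates = []
    for terminator in ("-->", "--!>"):
        pos = html.find(terminator, body)
        if pos != -1:
            candidates.append((pos, len(terminator)))
    if not candidates:
        return len(html)  # unterminated comment swallows the rest of the input
    pos, length = min(candidates)
    return pos + length


payload = "<!--><xss>-->"
# What each model leaves over as live markup after the comment:
print(repr(payload[xml_style_comment_end(payload):]))  # ''
print(repr(payload[html5_comment_end(payload):]))      # '<xss>-->'
```

Under the legacy rule nothing is left for a sanitizer to inspect, while the HTML5 rule leaves the xss tag outside the comment.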
2. RCDATA/RAWTEXT elements
In HTML5, other element parsing types were introduced:
- RCDATA
  - textarea
  - title
- RAWTEXT
  - noframes
  - noembed
  - iframe
  - xmp
  - style
- OTHERS
  - noscript - depends if scripting is enabled (enabled by default in browsers).
  - plaintext
  - script
PHP’s parser is oblivious to these distinctions. There are multiple ways an attacker can bypass a sanitizer because of the resulting mis-parse, such as:
- <iframe><!--</iframe><xss>--></iframe>
- <noframes><style></noframes><xss></style></noframes>
- …
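To see why these payloads work in a browser, here is a minimal Python sketch of how HTML5 treats RCDATA/RAWTEXT content: everything after the start tag is plain text until the first matching end tag, so markup inside it is inert. The helper name raw_text_split is hypothetical, and the model is deliberately simplified (no attributes on the start tag, no character-reference decoding for RCDATA, and script, plaintext and noscript are left out).

```python
import re

# Elements listed above whose content HTML5 tokenizes as text, not markup.
RAW_TEXT_ELEMENTS = {"textarea", "title", "noframes", "noembed", "iframe", "xmp", "style"}


def raw_text_split(html: str) -> tuple:
    """Split '<tag>...' into (tag, text content, remainder) the simplified HTML5 way."""
    opening = re.match(r"<([a-zA-Z]+)>", html)
    if opening is None or opening.group(1).lower() not in RAW_TEXT_ELEMENTS:
        raise ValueError("expected an attribute-less RCDATA/RAWTEXT start tag")
    tag = opening.group(1).lower()
    # The content ends at the first end tag with the same name, nothing else.
    closing = re.compile(r"</" + re.escape(tag) + r"(?:[\s/][^>]*)?>", re.IGNORECASE)
    end = closing.search(html, opening.end())
    if end is None:
        return tag, html[opening.end():], ""
    return tag, html[opening.end():end.start()], html[end.end():]


for payload in (
    "<iframe><!--</iframe><xss>--></iframe>",
    "<noframes><style></noframes><xss></style></noframes>",
):
    tag, text, remainder = raw_text_split(payload)
    print(tag, repr(text), repr(remainder))
# iframe '<!--' '<xss>--></iframe>'
# noframes '<style>' '<xss></style></noframes>'
```

In both cases the remainder, which the browser parses as ordinary markup, begins with the xss tag.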
3. Foreign content elements
HTML5 introduced two foreign elements (math and svg), whose content follows different parsing rules than regular HTML. PHP’s parser does not take this into account either, which causes further parsing differentials and sanitizer bypasses such as:
- <svg><p><style><!--</style><xss>--></style>
- …
4. DOCTYPE element
The !DOCTYPE element in XML/XHTML is more complex, allowing more characters and element nesting than in HTML5. In contrast, the HTML doctype ends at the first occurrence of the “greater than” sign >.
Parsing either of the following strings renders an xss tag in the browser but not in PHP:
<!DOCTYPE HTML PUBLIC "-//W3C//DTDHTML4.01//EN" "><xss>">
<!DOCTYPE HTML SYSTEM "><xss>">
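The two doctype rules can be contrasted with a small Python sketch. It is a simplified model, not libxml2's implementation: the HTML5 side ends the doctype at the first ">", and the XML-style side is assumed to ignore ">" inside quoted literals (internal subsets in square brackets are not modelled). Both function names are hypothetical.

```python
def html5_doctype_end(html: str) -> int:
    """HTML5 rule: the first '>' ends the DOCTYPE, even inside a quoted identifier."""
    close = html.find(">")
    return len(html) if close == -1 else close + 1


def quote_aware_doctype_end(html: str) -> int:
    """XML-style simplification: '>' inside a quoted literal does not end the declaration."""
    quote = None
    for i, ch in enumerate(html):
        if quote is not None:
            if ch == quote:  # closing quote of the current literal
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i + 1
    return len(html)


payload = '<!DOCTYPE HTML SYSTEM "><xss>">'
print(repr(payload[html5_doctype_end(payload):]))        # '<xss>">'
print(repr(payload[quote_aware_doctype_end(payload):]))  # ''
```

The quote-aware model consumes the whole payload as one declaration, whereas the HTML5 rule leaves the xss tag behind as markup.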
5. Element name starting with underscore
According to the XML specification, element names must start with a letter or underscore, unlike HTML, where tag names use only ASCII alphanumerics and must begin with an ASCII letter.
Input: <p><_test>/<p>
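As a rough illustration of the two naming rules, the Python sketch below checks whether the character after "<" may begin a tag name. The XML pattern covers only the ASCII subset of the XML NameStartChar production (letters, underscore, colon), the HTML5 pattern reflects the tokenizer's requirement of an ASCII letter, and opens_tag is a hypothetical helper rather than any parser's API.

```python
import re

XML_NAME_START = re.compile(r"[A-Za-z_:]")  # ASCII subset of XML NameStartChar
HTML5_TAG_START = re.compile(r"[A-Za-z]")   # HTML5 tag names begin with an ASCII letter


def opens_tag(text: str, rule: "re.Pattern") -> bool:
    """True if `text` begins with '<' followed by a valid first name character."""
    return len(text) > 1 and text[0] == "<" and rule.match(text[1]) is not None


print(opens_tag("<_test>", XML_NAME_START))   # True: treated as an element
print(opens_tag("<_test>", HTML5_TAG_START))  # False: '<' is emitted as text
```

A sanitizer and a browser that disagree on this check will build different trees from the same input.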
Impact
Sanitizers that use the built-in PHP parser are inherently vulnerable to bypasses, because the parser does not build the same document that a browser does.
Recommendation
This issue is known but is not made clear to PHP users. After this report, the PHP team added a red warning to the documentation: