Scheme parsed incorrectly

Posted by Michael Gusev 
April 16, 2013 10:52AM

The code below is parsed incorrectly:

<a href="{:test:}"></a>

The problem is incorrect scheme detection in the URL: the parser takes "{" as the scheme, although "{" is not a valid scheme character.

RFC 1738
2.1. The main parts of URLs

Scheme names consist of a sequence of characters. The lower case
letters "a"--"z", digits, and the characters plus ("+"), period
("."), and hyphen ("-") are allowed. For resiliency, programs
interpreting URLs should treat upper case letters as equivalent to
lower case in scheme names (e.g., allow "HTTP" as well as "http").
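
To make the quoted rule concrete, here is a minimal Python sketch of a scheme-name check. The function name normalize_scheme is hypothetical and is not part of HTML Purifier; it only illustrates the character set the RFC paragraph describes, including the advice to accept upper case.

```python
import re

# Characters allowed by the quoted paragraph: letters, digits, "+", ".", "-".
# Upper case is accepted on input and folded to lower case, as the RFC suggests.
_SCHEME_RE = re.compile(r'[A-Za-z0-9+.\-]+')

def normalize_scheme(name):
    """Return the lower-cased scheme, or None if it has disallowed characters.

    Hypothetical helper for illustration only.
    """
    if _SCHEME_RE.fullmatch(name):
        return name.lower()
    return None

print(normalize_scheme("HTTP"))     # http
print(normalize_scheme("svn+ssh"))  # svn+ssh
print(normalize_scheme("{"))        # None
```

Under this rule "{" cannot be a scheme, so it should not be split off from "{:test:}".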

Patch to fix this:

diff --git a/library/HTMLPurifier/URIParser.php b/library/HTMLPurifier/URIParser.php
index 7179e4ab8991077aa2ff4b3a4fca0a4eafd416e0..a7e5dd66eaab4fa6cc8b3daa3856ec450f1c35ce 100644
--- a/library/HTMLPurifier/URIParser.php
+++ b/library/HTMLPurifier/URIParser.php
@@ -30,7 +30,7 @@ class HTMLPurifier_URIParser
         // Note that ["<>] are an addition to the RFC's recommended
         // characters, because they represent external delimeters.
         $r_URI = '!'.
-            '(([^:/?#"<>]+):)?'. // 2. Scheme
+            '(([a-zA-Z0-9\.\+\-]+):)?'. // 2. Scheme
             '(//([^/?#"<>]*))?'. // 4. Authority
             '([^?#"<>]*)'.       // 5. Path
             '(\?([^#"<>]*))?'.   // 7. Query
Re: Scheme parsed incorrectly
April 16, 2013 04:57PM
commit 6e37ecd1c8389db63445fcfe7490db1b7b6a8383
Author: Edward Z. Yang <ezyang@mit.edu>
Date:   Tue Apr 16 13:46:00 2013 -0700

    Make URI parsing algorithm more strict.