NeatHtml: Displaying Untrusted Content
Securely, Efficiently, and Accessibly

How To Fight Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and Phishing
with JavaScript Judo, Layout Lockdown, and a Table Trap

by Dean Brettle
Last updated 06/17/08 07:35:47PM

Table of Contents

1 Overview

2 Technical Details

2.1 Defending Most Users with JavaScript Judo and Layout Lockdown

2.1.1 Preventing XSS and Removing Automated CSRF Attacks with JavaScript Judo

2.1.1.1 Playing Hide and Seek with Untrusted Content

2.1.1.2 Filtering Untrusted Content with JavaScript

2.1.1.3 Displaying the Filtered Content

2.1.2 Preventing Phishing and Containing Vandalism with Layout Lockdown

2.2 Defending No-Script Users with Regex Replacements, Layout Lockdown, and a Table Trap

2.2.1 Sacrificing Well-Formedness

2.2.2 Preventing Phishing and Containing Vandalism with Layout Lockdown and a Table Trap

2.2.2.1 Handling HTML Comments and CDATA Sections in Untrusted Content

2.2.2.2 Allowing Tables in Untrusted Content

2.2.3 Removing Automated CSRF Attacks by Disabling Suspicious Tags and Attributes

2.2.4 Preventing CSS Counter Manipulation

2.3 Preventing NeatHtml from Being Used for Denial of Service Attacks

2.4 Discouraging Link Spam

3 Future Work

3.1 Improving Server-Side Performance

3.1.1 Making Style Checking Optional

3.1.2 Avoiding No-Script Processing When JavaScript Is Enabled

3.2 Adding New Features and Ports

3.2.1 Improving CSRF Removal

3.2.2 Supporting Inline Untrusted Content

3.2.3 Running Untrusted Scripts in a Restricted Environment

3.2.4 Porting the Server-Side Code to Other Environments

1 Overview

NeatHtml™ is a highly-portable open source website component that displays untrusted content securely, efficiently, and accessibly. Untrusted content is any content that is not trusted by the website owner. Typical examples include blog comments, forum posts, or user pages on social networking sites. NeatHtml uses an “accept only known good” (whitelist) approach to security to help prevent attacks which are not yet known. It focuses on preventing Cross-Site Scripting (XSS) attacks but can also prevent phishing attacks and remove automated Cross-Site Request Forgery (CSRF) attacks. In this context, phishing attacks are attacks which try to display untrusted content where the user would trust it, and automated CSRF attacks are CSRF attacks that do not require any user action beyond viewing the untrusted content.

NeatHtml primarily uses three simple but effective techniques. The first technique uses JavaScript's document.writeln() to inject markup (e.g. “<!--”) that will hide the untrusted content from the browser's HTML parser until the untrusted content has been filtered. This "JavaScript Judo" technique uses the source of an attacker's power (i.e. JavaScript) against him. Using JavaScript to filter the untrusted content also minimizes server load. The technique was inspired by Stefano Di Paola's work on Preventing XSS with Data Binding. The primary disadvantage to the data binding approach is that the untrusted content is not in the normal document flow. This causes accessibility problems for users who disable script and for search engine spiders. NeatHtml avoids this problem by leaving the untrusted content in the document flow but using JavaScript-injected markup to hide it from the browser until it has been filtered. As a result, search engine spiders and users who disable script see untrusted content normally.

NeatHtml's two other techniques restrict where the untrusted content will be displayed. The “Layout Lockdown” technique is merely a DIV element styled with “overflow: hidden” combined with workarounds for Internet Explorer 6. The “Table Trap” technique protects no-script users from untrusted content which tries to close a containing trusted element like the Layout Lockdown DIV element. A Table Trap prevents this by enclosing the untrusted content in a TABLE element and prohibiting the untrusted content from closing the TABLE. All three techniques are discussed in more detail in Section 2.

NeatHtml should work with any browser that supports both JavaScript 1.3 and a few DOM APIs. It does not use the browser's internal XML/HTML parser to parse the untrusted content, thereby eliminating many browser compatibility issues. It has been tested against:

NeatHtml consists of the NeatHtml.js JavaScript library and a small server-side component for ASP.NET. The server-side component is approximately 400 lines of code and should be easy to port to other web development platforms. To facilitate porting and testing, NeatHtml includes a JavaScript test framework and a demo page which uses the test framework and demonstrates the capabilities of NeatHtml. NeatHtml is licensed under the Lesser General Public License (LGPL), a business-friendly open source license.

NeatHtml is currently available for download as a mature development snapshot. It primarily needs independent testing before an official release. Bug reports, feature requests, questions, comments, and other contributions are welcome.

2 Technical Details

This section describes the various techniques that NeatHtml uses. To simplify the presentation, the problem is broken into smaller subproblems and the techniques used to address each of the subproblems are described. NeatHtml uses all of the techniques in combination; other solutions could use a subset of the techniques to address a subset of the problems. For example, if the untrusted content is an entire user page on a social networking site, preventing phishing and vandalism might not be a concern and the associated techniques would not need to be applied.

2.1 Defending Most Users with JavaScript Judo and Layout Lockdown

As of January 2007, approximately 94% of users have JavaScript enabled. Defenses which rely on JavaScript directly protect users that have JavaScript enabled. Such defenses also indirectly protect no-script users by reducing the expected value of a non-targeted attack by a factor of over 16. NeatHtml uses the techniques described in this section to achieve these benefits. It also uses other techniques, described later, to directly protect no-script users.
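The factor quoted above follows from the 94% figure, under the assumption that a non-targeted attack which is blocked for script users can only reach the remaining no-script users. A quick check of the arithmetic (the variable names are illustrative only):

```javascript
// Share of users with JavaScript enabled, as quoted above (January 2007).
const scriptEnabledShare = 0.94;

// Assumption: an attack that is blocked for script users can only reach
// the remaining no-script users.
const noScriptShare = 1 - scriptEnabledShare;

// Reduction in the expected value of a non-targeted attack.
const reductionFactor = 1 / noScriptShare;

console.log(reductionFactor.toFixed(1)); // about 16.7, i.e. "over 16"
```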

2.1.1 Preventing XSS and Removing Automated CSRF Attacks with JavaScript Judo

2.1.1.1 Playing Hide and Seek with Untrusted Content

To prevent XSS and remove automated CSRF attacks, the untrusted content must be filtered before the browser interprets it as HTML. To prevent XSS, the filter must at least address SCRIPT elements, onEvent attributes, and expression() calls in inline styles. To remove automated CSRF attacks, the filter must at least address IMG, IFRAME, and OBJECT elements, and url() calls in inline styles. This filtering is normally performed on the server, using either a blacklist or a whitelist approach. The blacklist approach typically produces less load on the server but is less secure than a whitelist approach. A whitelist approach requires parsing the untrusted content (both the HTML and its inline styles) before passing it to the browser. Such parsing produces enough load on the server that it may not be practical to do the parsing each time the untrusted content is sent to the browser. Filtering the untrusted content when it is created would solve the load problem, but has its own disadvantages. It is generally more complicated to implement because the type of filtering required can vary over time and across users. For example, it might be desirable to let the content's author (or an administrator) see their unfiltered content for editing (or auditing) purposes. Filtering the content at creation time also means that if a bug is fixed in the filter, all previously filtered content needs to be refiltered. Keeping track of which content has been filtered with which filter can add significant complexity to an application. That complexity can be especially onerous when trying to secure existing applications.

To avoid the above issues, NeatHtml does most of its filtering using JavaScript in the browser. This means that the untrusted content must be hidden from the browser's HTML parser until it has been filtered. One way to achieve this is to enclose the untrusted content in an HTML comment and then use JavaScript to extract it, filter it, and replace the comment with the result. Before doing this, the server must replace all “--” with “&#34;&#34;” in the untrusted content. That preprocessing ensures that the comment is not ended prematurely. The preprocessing is undone by the JavaScript code before the untrusted content is filtered. The result is below. The ProcessUntrusted() method is responsible for extracting the preprocessed untrusted content, undoing the preprocessing, filtering the untrusted content, and replacing the comment with the filtered result.

<!-- preprocessed untrusted content -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

This works well in most browsers, but fails in Safari and Konqueror because those browsers don't provide access to comments from JavaScript. An alternative is to enclose the untrusted content in an XMP element (after removing or replacing “</xmp>”), like this:

<xmp> preprocessed untrusted content </xmp> 
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

This appears to work in most browsers, including Safari and Konqueror. However, the XMP element is not part of the HTML4 standard, so future browsers might not support it and might even ignore the XMP tags and parse the untrusted content as HTML. Also, strictly speaking, even earlier versions of HTML specified that the string “</” was not allowed inside the XMP element. Luckily most browsers seem to ignore that part of the specification and treat everything until the “</xmp>” as plain text.

NeatHtml dynamically chooses which of the above techniques to use. The server prepends a test comment followed by script which calls NeatHtml's BeginUntrusted() method. That method checks whether the comment is accessible. If it is, the script injects “<!--” using document.writeln(). If comments are not accessible, the script injects “<xmp>”. The server replaces both “--” and “</xmp>” in the untrusted content and follows the content with the string “<!-- > --><!-- <xmp></xmp><! -->”, like this:

<!-- test comment -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.BeginUntrusted();
</script>

preprocessed untrusted content
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

The string “<!-- > --><!-- <xmp></xmp><! -->” is designed to close either an open comment or an open XMP element. The “>” in the first comment is needed because many browsers interpret “--” as ending a comment but allow a new comment to be started if another “--” occurs before the tag is closed with a “>”. The comment enclosing the “<xmp></xmp>” hides those tags from browsers where the XMP element is not used and ensures that they don't cause validation errors. The last “<!” hides the ending “-->” from browsers where the XMP element is used.
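The choice between the two hiding methods can be sketched as follows. This is a minimal illustration, not NeatHtml's actual code: the text does not say how BeginUntrusted() tests whether the comment is accessible, so the sketch assumes it looks for a comment node among the preceding siblings of the calling script element. The function names are hypothetical.

```javascript
const TEXT_NODE = 3;    // DOM nodeType value for text nodes
const COMMENT_NODE = 8; // DOM nodeType value for comment nodes

// Assumption: the test comment is "accessible" if it shows up as a comment
// node just before the calling script element (ignoring whitespace text).
function commentsAccessible(scriptElement) {
  let node = scriptElement.previousSibling;
  while (node && node.nodeType === TEXT_NODE) {
    node = node.previousSibling;
  }
  return Boolean(node) && node.nodeType === COMMENT_NODE;
}

// Choose the markup that hides the untrusted content from the HTML parser.
function chooseHidingMarkup(scriptElement) {
  return commentsAccessible(scriptElement) ? '<!--' : '<xmp>';
}

// Hypothetical stand-in for BeginUntrusted(): inject the chosen markup.
function beginUntrustedSketch(doc, scriptElement) {
  doc.writeln(chooseHidingMarkup(scriptElement));
}
```

In a browser, `doc` would be `document` and `scriptElement` the script element that is currently executing.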

Hiding the untrusted content using HTML injected by JavaScript has another major advantage – the untrusted content is not hidden from no-script users. They will see the untrusted content in the normal document flow. Of course this means that it is important to do additional server-side preprocessing to prevent attacks on no-script users. That is described in section 2.2.

On a related note, the untrusted content will not be hidden if the call to NeatHtml.DefaultFilter.BeginUntrusted() fails for any reason. The most common reason would be forgetting to include NeatHtml.js on the page. To ensure that such a mistake does not result in untrusted content being displayed to script users, the server adds a fail-safe:

<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>
preprocessed untrusted content
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

That will cause the untrusted content to be hidden in a comment if something goes wrong. Using '...\074!-' + '-' instead of '...<!--' simply ensures that the script source code doesn't open any tags (\074 is the octal escape for '<') or end any comments.
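The escape can be checked with a few lines of script. The snippet below is only an illustration meant to be run outside of an HTML page; it uses parseInt() rather than a literal octal escape so that it also runs in strict mode.

```javascript
// 74 in octal is 60 in decimal, which is the character code of '<'.
const lessThan = String.fromCharCode(parseInt('74', 8));

// Building the string in pieces yields the same text that the fail-safe
// writes, without the comment-opening sequence appearing as one literal.
const injected = 'NeatHtml not found' + lessThan + '!-' + '-';

console.log(injected === 'NeatHtml not found<!--'); // true
```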

Dynamically choosing which way to hide the untrusted content makes it a little harder for ProcessUntrusted() to find it. To assist with this, BeginUntrusted() notes the script element that called it so that ProcessUntrusted() can extract the untrusted content from the following node (i.e. the comment or XMP element). Also, the server adds a sentinel string after the untrusted content so that ProcessUntrusted() can determine where the untrusted content ends and doesn't have to worry about the portion of “<!-- > --><!-- <xmp></xmp><! -->” which is not applicable for the hiding method being used. Any string which would not generally occur in benign content will work as a sentinel string. The sentinel that NeatHtml uses is the hidden INPUT element shown below:

<!-- test comment -->
<script type='text/javascript'> try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); } </script>
preprocessed untrusted content
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

NeatHtml uses a hidden input element because it is a valid element that is not displayed to the user and should not occur in benign untrusted content.
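A minimal sketch of how the sentinel could be used to cut the hidden text down to the untrusted content is shown below. The function name is hypothetical, and searching for the last occurrence of the sentinel is an assumption of this sketch (it keeps any attacker-supplied copy of the sentinel inside the portion that gets filtered); the text does not say how ProcessUntrusted() locates it.

```javascript
// Start of the sentinel element that the server appends after the content.
const SENTINEL_PREFIX = "<input name='NeatHtmlEndUntrusted'";

// Return everything before the server's sentinel, or null if the sentinel
// is missing (in which case nothing should be displayed).
function extractUntrusted(hiddenText) {
  const end = hiddenText.lastIndexOf(SENTINEL_PREFIX);
  if (end === -1) {
    return null;
  }
  return hiddenText.slice(0, end);
}
```

Here `hiddenText` stands for the text of the comment or XMP element that follows the script element noted by BeginUntrusted().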

2.1.1.2 Filtering Untrusted Content with JavaScript

Once ProcessUntrusted() has obtained the preprocessed untrusted content and undone the server's preprocessing, it needs to filter the original untrusted content. To maximize security and provide the best user experience, ProcessUntrusted() actually parses the untrusted content and converts it to secure, well-formed XHTML that all browsers should parse consistently. The parser can handle poorly formed “tag soup” HTML. Specifically, the filter:

Applications can configure the parser to call arbitrary functions for particular element or attribute names. The functions used to provide the whitelisting functionality described above are part of NeatHtml's JavaScript API, so that it is simple to add elements and attributes to the whitelists. The whitelists used by default can be found in NeatHtml.js by searching for “allowedTags”, “prohibitedTags” (i.e. tags whose content is removed), “allowedAttrs”, and “allowedProps”. The whitelist regular expression used for style property values is part of the “StyleDeclRe” used to find style property declarations within the style attribute.

To make the filter's job slightly easier, it ignores everything after the first top-level element. To accommodate this, the server encloses the preprocessed untrusted content in a DIV element, like this:

<!-- test comment -->
<script type='text/javascript'> try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); } </script>
<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>

From ProcessUntrusted()'s perspective, the start and end DIV tags are part of the untrusted content.

2.1.1.3 Displaying the Filtered Content

Once the content has been filtered, ProcessUntrusted() puts the filtered content into the DOM. To simplify this process and to ensure that all of the untrusted content and associated leading and trailing elements are removed from the DOM, the server provides an outer DIV element, like this:

<div> 
<!-- test comment -->
<script type='text/javascript'> try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); } </script>
<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>

ProcessUntrusted() finds that DIV element by searching the ancestors of the script element that called BeginUntrusted() and stopping at the first DIV element. It then sets the innerHTML member of that DIV element to the filtered content.
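The ancestor search described above can be sketched as follows. This is an illustration under the assumption that only the standard parentNode, tagName, and innerHTML members are needed; the function names are hypothetical and are not part of NeatHtml's API.

```javascript
// Walk up from the script element that called BeginUntrusted() and stop at
// the first DIV ancestor, i.e. the outer DIV provided by the server.
function findEnclosingDiv(scriptElement) {
  let node = scriptElement.parentNode;
  while (node) {
    if (node.tagName && node.tagName.toUpperCase() === 'DIV') {
      return node;
    }
    node = node.parentNode;
  }
  return null;
}

// Replace the contents of the outer DIV with the filtered markup.
function displayFiltered(scriptElement, filteredHtml) {
  const container = findEnclosingDiv(scriptElement);
  if (container) {
    container.innerHTML = filteredHtml;
  }
  return container;
}
```

Setting innerHTML on the outer DIV also discards the test comment, the script elements, and the sentinel, since they are all children of that DIV.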

2.1.2 Preventing Phishing and Containing Vandalism with Layout Lockdown

With the techniques described so far, users with script enabled will see the untrusted content after it has been filtered to prevent XSS and CSRF attacks. However, an attacker could still use styles (e.g. absolute positioning or negative margins) to display malicious content on any part of the page. This ability could be used to vandalize the page or to launch a phishing attack in which the user is tricked into trusting the attacker's content because of where it appears on the page.

CSS provides a mechanism to restrict the area where content is displayed. If an element has the style “overflow: hidden”, then the browser is supposed to clip the content to the element's containing box. To take advantage of this, the server could style the outer DIV element, like this:

<div style='overflow: hidden;' > 
<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>

<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>

That works in most browsers, but Internet Explorer 6 (IE6) and earlier do not hide overflow content unless the DIV's dimensions have been explicitly set, either via a style or from script. To address this, the server adds a call to another NeatHtml JavaScript method: ResizeContainer(). On IE6 and earlier, ResizeContainer() sets the dimensions based on the computed dimensions of the filtered untrusted content.

<div style='overflow: hidden;'>
<!-- test comment -->
<script type='text/javascript'> try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); } </script>
<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

NeatHtml also does some additional styling which is not security related:

<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0;
'>
<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>

<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

The class attribute makes it easy for designers to style all untrusted content similarly (e.g. give it all the same background color). The “position: relative” allows untrusted content to be positioned absolutely within the DIV. The “border: none; padding: 0; margin: 0” declarations simply make it easier for designers to style untrusted content by enclosing it in a separate DIV element for which they control all the style properties.

There is one other small way that the untrusted content could change something outside of the containing box. On browsers that support CSS counters the untrusted content can use a style of “counter-increment: counterName” or “counter-reset: counterName” to change the values of counters. This could be used to change the numbering of paragraphs, chapters, etc. To avoid this, NeatHtml's client-side filter does not include “counter-increment” and “counter-reset” on the whitelist of property names.
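The effect of a property-name whitelist can be illustrated with the sketch below. The whitelist shown is hypothetical (the real one is in NeatHtml.js under “allowedProps”), the function name is invented for this example, and splitting the style on semicolons is a simplification; NeatHtml itself uses a regular expression to find declarations and also checks property values.

```javascript
// Hypothetical whitelist for illustration only. Note that neither
// "counter-increment" nor "counter-reset" appears in it.
const allowedProps = new Set(['color', 'background-color', 'font-weight', 'text-align']);

// Keep only the declarations whose property name is on the whitelist.
function filterStyle(styleText) {
  return styleText
    .split(';')
    .map((decl) => decl.trim())
    .filter((decl) => {
      const colon = decl.indexOf(':');
      if (colon === -1) {
        return false; // not a "name: value" declaration
      }
      const name = decl.slice(0, colon).trim().toLowerCase();
      return allowedProps.has(name);
    })
    .join('; ');
}

console.log(filterStyle('color: red; counter-reset: chapter')); // "color: red"
```

Because the filter accepts only known-good names, the counter properties are dropped without having to be listed anywhere.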

2.2 Defending No-Script Users with Regex Replacements, Layout Lockdown, and a Table Trap

The techniques described above are designed to protect the overwhelming majority of users – those with JavaScript enabled. Protecting no-script users requires doing additional server-side preprocessing. NeatHtml tries to minimize the amount of server-side preprocessing while maximizing end-user security and usability. This section describes NeatHtml's preprocessing techniques in general terms. See the Filter.cs source code for full details.

2.2.1 Sacrificing Well-Formedness

To maximize speed, NeatHtml's server-side component uses regular expressions (regexes) to manipulate the untrusted content directly as a string. This technique should be faster than the true parsing done by the client-side filter, especially for benign content. Unfortunately, the regex manipulations described below will not ensure that the preprocessed untrusted content parsed by the browser is well-formed XML. For example, no attempt is made to ensure that every start tag has a matching end tag, and named entities are not converted to numeric entities. For applications which require well-formed pages, the techniques described in this section are not appropriate. For those applications, options include:

2.2.2 Preventing Phishing and Containing Vandalism with Layout Lockdown and a Table Trap

Recall that the Layout Lockdown technique used for script users required calling the client-side function ResizeContainer() to ensure that overflow content was hidden from users of IE6 and earlier. Without script, NeatHtml needs to ensure that a fixed size scrollable outer DIV is used. To avoid affecting users of other browsers, the server uses a conditional comment, like this:

<!--[if gte IE 7]><!-->
<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0;'>
<!--<![endif]-->
<!--[if lt IE 7]>
<div class='NeatHtml'
style='width:
NoScriptDownlevelIEWidth ; height: NoScriptDownlevelIEHeight ;
overflow:auto; position:relative; border:none; padding:0; margin:0;'>
<![endif]-->

<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>

<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

The values NoScriptDownlevelIEWidth and NoScriptDownlevelIEHeight are configurable and default to “100%” and “400px”, respectively.

If the untrusted content is known to be well-formed and the containing page does not use CSS counters, the above technique is all that is needed to prevent phishing and contain all vandalism. Preventing CSS counter manipulation requires more complex server-side processing and is described in section 2.2.4. The remainder of this section describes the techniques the server uses to support ill-formed untrusted content.

To prevent untrusted content that leaves an attribute or tag half open (i.e. a missing quote or “>”) from affecting trusted content, NeatHtml uses both single and double quoted attributes in the sentinel element already being used to mark the end of untrusted content for the client-side script:

<!--[if gte IE 7]><!-->
<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0;'>
<!--<![endif]-->
<!--[if lt IE 7]>
<div class='NeatHtml'
style='width:
NoScriptDownlevelIEWidth ; height: NoScriptDownlevelIEHeight ;
overflow:auto; position:relative; border:none; padding:0; margin:0;'>
<![endif]-->

<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>
<div>

preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

As a result, a missing quote or “>” would at most cause the inner DIV end tag and the sentinel element to be hidden from the browser. Since neither serves any security function, that is acceptable.

Next, consider untrusted content that tries to close the outer DIV element so that it can mount a phishing attack. To prevent this, NeatHtml uses a Table Trap. A Table Trap relies on the fact that modern browsers require TABLE elements to be explicitly closed. If a TABLE is enclosed in other elements, end tags for the enclosing elements are ignored as long as the TABLE element remains open. That means that the server can contain ill-formed untrusted content that does not contain TABLE end tags by enclosing it in a TABLE element, like below. Note: the styles associated with the TABLE and TD elements are simply to ensure that using this technique does not add any space around the untrusted content.

<!--[if gte IE 7]><!-->
<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0;'>
<!--<![endif]-->
<!--[if lt IE 7]>
<div class='NeatHtml'
style='width:
NoScriptDownlevelIEWidth ; height: NoScriptDownlevelIEHeight ;
overflow:auto; position:relative; border:none; padding:0; margin:0;'>
<![endif]-->

<table style='border-spacing: 0;'><tr><td style='padding: 0;'>
<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>
<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<!-- > --><!-- <xmp></xmp><! -->
</td></tr></table> <script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>NeatHtml.DefaultFilter.ResizeContainer();</script>

The Table Trap prevents untrusted content that doesn't contain TABLE end tags from displaying outside the outer DIV but does not prevent untrusted content from pulling trusted content into the outer DIV. In particular, if the untrusted content starts but does not end an IFRAME, OBJECT, SCRIPT, or CDATA element/section, then the trusted content which follows the outer DIV in the page will be pulled into the outer DIV and will appear to the user to be untrusted. This will occur because, like TABLE elements, the aforementioned elements must be explicitly closed. As a result the outer DIV's end tag will be ignored if such an element is still open. Although it's hard to imagine how such content could be used for phishing, it could certainly be used to vandalize the page. For that reason, the server closes any open SCRIPT elements, hides IFRAME and OBJECT tags, and disables CDATA sections as described below.

To close any open SCRIPT elements, the server simply adds an empty SCRIPT element after the untrusted content, like this:

<!--[if gte IE 7]><!-->
<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0;'>
<!--<![endif]-->
<!--[if lt IE 7]>
<div class='NeatHtml'
style='width:
NoScriptDownlevelIEWidth ; height: NoScriptDownlevelIEHeight ;
overflow:auto; position:relative; border:none; padding:0; margin:0;'>
<![endif]-->

<table style='border-spacing: 0;'><tr><td style='padding: 0;'>
<!-- test comment -->
<script type='text/javascript'>
try { NeatHtml.DefaultFilter.BeginUntrusted(); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>
<div>
preprocessed untrusted content
</div>
<input name='NeatHtmlEndUntrusted' type='hidden' value="" />
<script type="text/javascript"></script>
<!-- > --><!-- <xmp></xmp><! -->
</td></tr></table>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ProcessUntrusted();
</script>
</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

If the untrusted content leaves open a SCRIPT element, the SCRIPT start tag above will be ignored (because SCRIPT elements are not allowed to nest) and the SCRIPT end tag will end the SCRIPT element started by the untrusted content.

To hide IFRAME and OBJECT tags, the server searches for all potential tags using the regex:

<(/)?([a-z][a-z0-9_:]*)? 

The “(/)” group will match for all potential end tags and the “([a-z][a-z0-9_:]*)” group will match the tag name for any element tag.

If there is a tag name but it is not on a whitelist (which would not contain TABLE, OBJECT, or IFRAME), the server prepends “NeatHtmlReplace_” to the tag name (e.g. “<iframe” becomes “<NeatHtmlReplace_iframe”). This effectively hides the tag from no-script users. When script is enabled, ProcessUntrusted() removes the “NeatHtmlReplace_” from the tag name, before filtering begins. In the current implementation everything on the client-side tag name whitelist is also on the server's whitelist, so renaming the tag is essentially the same as deleting it. However, renaming the tag leaves open the possibility of handling these elements differently when script is enabled.

Note: the tag name whitelist allows SCRIPT tags. Removing them from the whitelist would cause them to be renamed and as a result the script source code would be displayed to the user. While that isn't a security issue, it impacts usability. Having SCRIPT tags on the whitelist is not a security risk because no-script users won't run the scripts, and for scripting users the client-side filter will remove SCRIPT elements and their contents.

If no tag name is captured (e.g. “![CDATA[” does not match “([a-z][a-z0-9_:]*)”) the server disables the tag by replacing the initial “<” with “<NeatHtmlLt />&lt;”. For no-script users, the browser ignores the “<NeatHtmlLt />” and the user just sees “<”. For script users, the “<NeatHtmlLt />” allows ProcessUntrusted() to undo the replacement and pass the original untrusted content through the client-side filter. This gives the client-side filter a chance to handle any tags that did not match the server's regex.
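The two replacements described above (renaming tags that are not on the whitelist, and disabling a “<” that is not followed by a tag name) can be sketched in a single pass. The actual server-side component is written for ASP.NET; this JavaScript version is only an illustration. The whitelist contents, the function name, and the case-insensitive matching are assumptions.

```javascript
// The potential-tag regex from the text; the "g" and "i" flags are assumed.
const POTENTIAL_TAG_RE = /<(\/)?([a-z][a-z0-9_:]*)?/gi;

// Hypothetical whitelist. It includes SCRIPT, as noted above, and leaves
// out TABLE, OBJECT, and IFRAME.
const serverAllowedTags = new Set(['a', 'b', 'i', 'p', 'div', 'span', 'script']);

function disableSuspiciousTags(html) {
  return html.replace(POTENTIAL_TAG_RE, (match, slash, name) => {
    if (!name) {
      // No tag name (e.g. "<![CDATA[" or a bare "<"): show a literal "<"
      // to no-script users and leave a marker that script can undo.
      return '<NeatHtmlLt />&lt;' + (slash || '');
    }
    if (serverAllowedTags.has(name.toLowerCase())) {
      return match; // whitelisted tag, left untouched
    }
    // Not whitelisted: rename the tag so that browsers ignore it.
    return '<' + (slash || '') + 'NeatHtmlReplace_' + name;
  });
}

console.log(disableSuspiciousTags('<iframe src=x>'));
// prints: <NeatHtmlReplace_iframe src=x>
```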

The techniques described thus far should be sufficient to defend no-script users from phishing attacks and vandalism on pages that don't use CSS counters. The only drawbacks are:

For applications where benign untrusted content is not expected to contain these types of markup, where CSS counters are not used, and where defending no-script users from CSRF attacks is not a requirement, the above techniques should be sufficient. For other applications, NeatHtml goes on to address these limitations, as described below.

2.2.2.1 Handling HTML Comments and CDATA Sections in Untrusted Content

Recall that the JavaScript Judo technique requires that the server replace all “--” with “&#34;&#34;”. Unfortunately, that replacement will cause any HTML comments in the untrusted content to be visible to no-script users. To reduce this problem, the server removes all strings matching:

<!--[^-]*(?:-[^-]+)*-->

before replacing “--” with “&#34;&#34;”. That matches any HTML comment that does not contain “--”. Comments containing “--” are not removed by the server because matching them requires a considerably more complex regex. Since displaying such comments is not a security issue and the HTML standards discourage use of “--” in comments, NeatHtml avoids the additional complexity.
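The behavior of the comment regex can be checked with a short script. The sketch below covers only the removal step; the function name is hypothetical and the global flag is an assumption needed to remove every match.

```javascript
// The comment regex from the text, with the "g" flag to match repeatedly.
const SIMPLE_COMMENT_RE = /<!--[^-]*(?:-[^-]+)*-->/g;

// Remove every comment that does not contain "--" in its body.
function removeSimpleComments(html) {
  return html.replace(SIMPLE_COMMENT_RE, '');
}

// A comment containing single hyphens is removed.
console.log(removeSimpleComments('a<!-- x - y -->b'));  // "ab"

// A comment containing "--" does not match and is left in place.
console.log(removeSimpleComments('a<!-- x -- y -->b')); // "a<!-- x -- y -->b"
```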

The server handles CDATA sections by replacing all complete CDATA sections with the HTML-encoded content of the CDATA section. A complete CDATA section is anything matching the following regex:

<!\[CDATA\[([^\]]*(?:\][^\]]+)*)\]\]>

The content of the CDATA section (i.e. the part that the server HTML-encodes) is the part matching the first grouping construct.
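A minimal JavaScript sketch of this replacement is shown below. The helper names htmlEncode and encodeCdataSections are assumptions made for illustration, and the set of characters that htmlEncode escapes is a common choice rather than something the text specifies.

```javascript
// Regex quoted from the text: matches a complete CDATA section and
// captures its content in the first group.
const CDATA_SECTION = /<!\[CDATA\[([^\]]*(?:\][^\]]+)*)\]\]>/g;

// Hypothetical encoder: escapes the characters that are significant in HTML.
function htmlEncode(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Replaces each complete CDATA section with its HTML-encoded content.
function encodeCdataSections(untrusted) {
  return untrusted.replace(CDATA_SECTION, (whole, content) => htmlEncode(content));
}
```

Incomplete CDATA sections do not match the regex, so they are left for the tag-disabling step described earlier.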

2.2.2.2 Allowing Tables in Untrusted Content

The algorithm used to allow tables has not been extensively tested and using it involves taking an unnecessary risk if benign content is not expected to contain TABLE elements. For that reason, support for displaying untrusted tables to no-script users is optional, and disabled by default.

When support for TABLE elements is enabled, NeatHtml keeps track of how many TABLE elements are open and closes any that remain open at the end of the untrusted content. This requires accurately identifying TABLE tags that will actually cause the browser to start or end a TABLE element. The regex used above to match potential tags is not sufficient for the following reasons:

NeatHtml attempts to address most of these situations by using the following regex instead:

<(/)?(([a-z][a-z0-9_:]*)?(?:[ \t\n\r]+([_:a-z][_:a-z0-9.]*)((?:[ \t\n\r]*=[ \t\n\r]*("[^<"]*"|'[^<']*'|[^"'][^ \t\n\r<>]*))?))*([ \t\n\r]*/?>)?)

The following groups match/capture different parts of the tag:

Group: (/)
Description: Matches for end tags.

Group: ([a-z][a-z0-9_:]*)
Description: Captures the tag name.

Group: ([_:a-z][_:a-z0-9.]*)
Description: Captures each attribute name. Each regex match would have one capture for each attribute in the tag.

Group: ("[^<"]*"|'[^<']*'|[^"'][^ \t\n\r<>]*)
Description: Captures the attribute value. Each regex match would have one capture for each attribute value in the tag. The attribute value must be a single quoted string, a double quoted string, or an unquoted string that contains neither whitespace nor angle brackets.

Group: ([ \t\n\r]*/?>)
Description: Captures the part of the tag after the last attribute, referred to below as the “tag tail”. The tag tail includes any trailing whitespace, followed by either “/>” or just “>”.

The server continues to disable tags when there is no tag name, and additionally disables tags if there is no tag tail, or if there are attributes on an end tag, or an end tag ends with “/>”. The server disables the tag by replacing the initial “<” with “<NeatHtmlLt />&lt;” as described earlier. The server also continues to rename tags that are not on a whitelist, by prepending “NeatHtmlReplace_” to the tag name as described earlier. Additionally, the server canonicalizes attributes by quoting any unquoted attribute values, and adding attribute values wherever they are missing.
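The disable and rename rules above can be sketched in JavaScript as follows. This is an illustration under several assumptions: the function name preprocessTags and its whitelist parameter are hypothetical, matching is assumed to be case-insensitive, and attribute canonicalization is omitted. Note also that JavaScript keeps only the last capture of a repeated group, so the sketch can tell whether a tag has attributes but cannot enumerate them the way the server-side regex engine can.

```javascript
// The tag regex from the text, written as a JavaScript literal.
// Groups: 1 = "/", 2 = everything after it, 3 = tag name,
// 4 = last attribute name, 5 = last "=value" part, 6 = last value, 7 = tag tail.
const TAG = /<(\/)?(([a-z][a-z0-9_:]*)?(?:[ \t\n\r]+([_:a-z][_:a-z0-9.]*)((?:[ \t\n\r]*=[ \t\n\r]*("[^<"]*"|'[^<']*'|[^"'][^ \t\n\r<>]*))?))*([ \t\n\r]*\/?>)?)/gi;

// Hypothetical helper: whitelist is a Set of lowercase tag names.
function preprocessTags(untrusted, whitelist) {
  return untrusted.replace(TAG, (whole, slash, rest, name, attrName, eqValue, value, tail) => {
    const isEndTag = slash !== undefined;
    const malformed =
      !name ||                                   // no tag name
      !tail ||                                   // no tag tail
      (isEndTag && (attrName !== undefined ||    // attributes on an end tag
                    tail.includes("/")));        // end tag ending with "/>"
    if (malformed) {
      // Disable the tag by replacing its initial "<".
      return "<NeatHtmlLt />&lt;" + whole.slice(1);
    }
    if (!whitelist.has(name.toLowerCase())) {
      // Rename tags that are not on the whitelist.
      return "<" + (slash || "") + "NeatHtmlReplace_" + rest;
    }
    return whole;
  });
}
```

For example, with a whitelist containing only “b”, the input “<blink>x</blink>” becomes “<NeatHtmlReplace_blink>x</NeatHtmlReplace_blink>”, and the stray “<” in “1 < 2” is disabled.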

The above regex manipulations result in preprocessed untrusted content where all tags are whitelisted and well-formed (i.e. all attributes have valid names and a quoted value that does not contain “<”). This makes it possible to accurately identify well-formed TABLE tags, but those tags could still occur:

Based on limited testing, TABLE elements appear to be allowed almost anywhere within the body of a document. The only apparent exception is that they are not allowed inside of another TABLE element unless they are also inside of a cell (i.e. TD or TH) within that TABLE element. To account for this, NeatHtml keeps track of how many allowed TABLE elements are currently open, and whether a TABLE start tag is currently allowed. A TABLE start tag is only allowed if the most recent TABLE-related tag was a TD or TH start tag, or an allowed TABLE end tag (i.e. where an allowed TABLE element was closed), or no TABLE elements are open. To ensure that SCRIPT elements can't be used to hide TABLE start or end tags or TD or TH start tags, NeatHtml inserts “<script></script>” before those tags to force closed any open SCRIPT elements. If NeatHtml encounters a TABLE start or end tag where it is not allowed, it prefixes the tag name with “NeatHtmlReplace_” to disable the tag.
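The bookkeeping described above might look roughly like the following JavaScript sketch. It works on an already-tokenized list of tags rather than on raw markup, and it leaves out the “<script></script>” insertion. The function name trackTables, the token shape, and in particular the set of tag names treated as “TABLE-related” are assumptions made for illustration; the text does not spell them out.

```javascript
// Assumption: which tag names count as "TABLE-related" is a guess.
const TABLE_RELATED = new Set(["table", "tr", "td", "th", "thead", "tbody", "tfoot"]);

// tags: array of { name: string, end: boolean } in document order.
// Returns a "keep" or "rename" decision per tag, plus the number of
// allowed TABLE elements still open at the end (which must be closed).
function trackTables(tags) {
  let open = 0;                 // allowed TABLE elements currently open
  let lastAllowsTable = false;  // most recent TABLE-related tag permits a TABLE start
  const decisions = [];
  for (const tag of tags) {
    const name = tag.name.toLowerCase();
    if (!TABLE_RELATED.has(name)) {
      decisions.push("keep");
    } else if (name === "table" && !tag.end) {
      if (open === 0 || lastAllowsTable) {
        open++;
        lastAllowsTable = false;
        decisions.push("keep");
      } else {
        decisions.push("rename"); // prefix with NeatHtmlReplace_
      }
    } else if (name === "table") {
      if (open > 0) {
        open--;
        lastAllowsTable = true;   // an allowed TABLE element was closed
        decisions.push("keep");
      } else {
        decisions.push("rename");
      }
    } else {
      // Only TD and TH start tags permit a nested TABLE start tag.
      lastAllowsTable = !tag.end && (name === "td" || name === "th");
      decisions.push("keep");
    }
  }
  return { decisions, unclosed: open };
}
```

A caller would append one TABLE end tag for each element counted in unclosed once the end of the untrusted content is reached.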

2.2.3 Removing Automated CSRF Attacks by Disabling Suspicious Tags and Attributes

To remove automated CSRF attacks, the server uses the same regex used to identify well-formed tags, but applies three additional restrictions. It uses a more restrictive tag name whitelist (e.g. IMG tags are not allowed by default), disables any attributes whose name is not on a whitelist (e.g. BACKGROUND is not allowed), and disables any style attributes containing property names not on a whitelist (e.g. background-image is not allowed) or containing a suspicious property value. The server disables attributes by appending “_NeatHtmlReplace” to the attribute name. That effectively hides the attribute from no-script users. When script is enabled, the client-side filter removes the “_NeatHtmlReplace” from the end of the attribute name before filtering it. This gives the client-side filter a chance to handle any attributes that the server cannot. In particular, the client-side filter can remove individual style properties that are not whitelisted, instead of disabling the entire style attribute.
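As an illustration of attribute disabling, the following JavaScript sketch appends “_NeatHtmlReplace” to every attribute name that is not on a whitelist. It assumes the tag has already been canonicalized so that every attribute has a quoted value; the function name disableAttributes and the allowed parameter are hypothetical.

```javascript
// Matches one canonicalized attribute: whitespace, name, "=", quoted value.
const ATTRIBUTE = /([ \t\n\r]+)([_:a-z][_:a-z0-9.]*)([ \t\n\r]*=[ \t\n\r]*(?:"[^<"]*"|'[^<']*'))/gi;

// Hypothetical helper: allowed is a Set of lowercase attribute names.
function disableAttributes(tag, allowed) {
  return tag.replace(ATTRIBUTE, (whole, space, name, eqValue) =>
    allowed.has(name.toLowerCase())
      ? whole
      : space + name + "_NeatHtmlReplace" + eqValue);
}
```

Because the whole attribute, including its quoted value, is consumed by each match, text inside a value that happens to look like an attribute is not rewritten.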

To inspect style property names and property values, the server uses the following regex:

^(?: *(-?[_a-z][_a-z0-9-]*) *:(?:\((?<=rgb\()|[ -%')-9<-[\]-~])*(?:;|$))*$

That matches a style attribute value only if the value is a sequence of declarations. Each declaration must consist of a property name optionally surrounded by spaces, followed by a “:”, an allowed property value and an optional “;”. The server compares the captured property names against a whitelist, disabling the entire attribute if one is not on the list. Note that allowed property values can contain only the following characters:

The allowed style value regex is designed to match the overwhelming majority of benign styles seen in typical HTML markup, while not matching any styles that could contain automated CSRF attacks. In those cases where a benign style does not match the regex, no-script users see the associated content, but with no style applied.
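The style check can be sketched in JavaScript as shown below. The overall regex is the one quoted above; the case-insensitive flag, the second regex used to collect property names (JavaScript keeps only the last capture of a repeated group), and the function name isStyleAllowed are assumptions made for illustration. The lookbehind requires a JavaScript engine that supports it.

```javascript
// The style regex from the text; "(" is accepted only directly after "rgb".
const STYLE_VALUE = /^(?: *(-?[_a-z][_a-z0-9-]*) *:(?:\((?<=rgb\()|[ -%')-9<-[\]-~])*(?:;|$))*$/i;

// Collects property names once the whole value is known to match
// (validated values cannot contain ":" or ";").
const PROPERTY_NAME = /(?:^|;) *(-?[_a-z][_a-z0-9-]*) *:/gi;

// Hypothetical helper: allowedProperties is a Set of lowercase property names.
function isStyleAllowed(style, allowedProperties) {
  if (!STYLE_VALUE.test(style)) return false;
  for (const match of style.matchAll(PROPERTY_NAME)) {
    if (!allowedProperties.has(match[1].toLowerCase())) return false;
  }
  return true;
}
```

With a whitelist of “color” and “font-weight”, a value such as “color: rgb(255, 0, 0)” is accepted, while “background-image: url(...)” is rejected by the regex and “counter-reset: section” is rejected by the whitelist.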

2.2.4 Preventing CSS Counter Manipulation

Recall that untrusted content can use “counter-reset” or “counter-increment” to manipulate CSS counters that are used by trusted content. To ensure that trusted content can use CSS counters securely, the property name whitelist used by the server to remove CSRF attacks does not contain those properties.

2.3 Preventing NeatHtml from Being Used for Denial of Service Attacks

Consider an attack that uses pathological untrusted content in an attempt to mount a DoS attack against either the server or the user's browser. Specifically, content which contains a very large number of “<” characters, or a very large number of attributes might take the server or ProcessUntrusted() a long time to process. Additionally, content with a large number of “&” characters can take ProcessUntrusted() a long time to process. NeatHtml was designed to take time proportional to the size of the untrusted content, but the constant of proportionality can be high for the cases just mentioned. Processing is slow in these cases because they cause a large number of regex matches and subsequent function calls. To address this, NeatHtml tracks the number of regex matches that have occurred, and stops processing when a configurable limit is exceeded. By default, the limit is 10000.
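One way to enforce such a limit is sketched below in JavaScript. The names MatchLimitExceeded and replaceWithLimit are hypothetical, and throwing an exception is only one possible way to “stop processing”; the text does not say what NeatHtml does with the content once the limit is hit.

```javascript
// Hypothetical error type used to signal that the match limit was exceeded.
class MatchLimitExceeded extends Error {}

// Runs a global regex replacement but gives up once more than `limit`
// matches have been seen. The regex must have the "g" flag.
function replaceWithLimit(text, regex, replacer, limit = 10000) {
  let matches = 0;
  return text.replace(regex, (...args) => {
    matches++;
    if (matches > limit) {
      throw new MatchLimitExceeded("match limit of " + limit + " exceeded");
    }
    return replacer(...args);
  });
}
```

A caller would catch MatchLimitExceeded and fall back to some safe behavior, such as not displaying the content at all.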

2.4 Discouraging Link Spam

Link spam is more of an annoyance than a security threat, but NeatHtml is in a position to discourage it. Since NeatHtml parses attribute names and values, it ensures that any tag with an HREF attribute has a REL attribute with a value of “nofollow”. Since the major search engines don't consider nofollow links when computing rankings, this removes one of the largest incentives for link spam. It does not prevent link spam entirely though. Some spammers do not check whether links are marked as “nofollow” and some hope that normal users will follow the links even if search engine spiders don't.
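A rough JavaScript sketch of the nofollow rule follows. It assumes a canonicalized start tag in which every attribute has a quoted value, and it assumes that any existing REL attribute is replaced rather than merged; the function name enforceNofollow is hypothetical.

```javascript
// Matches one canonicalized attribute and captures its name.
const ANY_ATTRIBUTE = /[ \t\n\r]+([_:a-z][_:a-z0-9.]*)[ \t\n\r]*=[ \t\n\r]*(?:"[^<"]*"|'[^<']*')/gi;

// Hypothetical helper: ensures rel="nofollow" on any tag with an HREF attribute.
function enforceNofollow(tag) {
  let hasHref = false;
  // Drop any existing REL attribute and note whether an HREF is present.
  const withoutRel = tag.replace(ANY_ATTRIBUTE, (whole, name) => {
    const lower = name.toLowerCase();
    if (lower === "href") hasHref = true;
    return lower === "rel" ? "" : whole;
  });
  if (!hasHref) return tag;
  // Insert the REL attribute just before the tag tail.
  return withoutRel.replace(/[ \t\n\r]*\/?>$/, (tail) => ' rel="nofollow"' + tail);
}
```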

3 Future Work

This section describes ways that NeatHtml could be extended or improved. Potential performance improvements are discussed as well as additional functionality.

3.1 Improving Server-Side Performance

3.1.1 Making Style Checking Optional

Removing CSRF attacks and preventing CSS counter manipulation requires that the server check all style property names and values. To improve performance for applications where preventing those attacks is not a requirement, that processing should be made optional.

3.1.2 Avoiding No-Script Processing When JavaScript Is Enabled

Almost all of NeatHtml's server-side processing is dedicated to protecting no-script users. If the server knows that the user has JavaScript enabled, then all of that work can be skipped. To determine that a user has JavaScript enabled, NeatHtml could use JavaScript on a page to set a cookie. To detect if these users later disable JavaScript, a NOSCRIPT element can be used. The NOSCRIPT element would contain an image and/or a link to a page. The server would delete the cookie when the image or page was retrieved.

When the server detected the cookie indicating JavaScript was enabled, it would place the untrusted content in a JavaScript string - escaping quotes and newlines appropriately - and send something like this to the browser:

<div class='NeatHtml' style='overflow: hidden; position: relative;
border: none; padding: 0; margin: 0; '>

<script type='text/javascript'>
try { NeatHtml.DefaultFilter.ProcessUntrusted(' escaped untrusted content '); }
catch (ex) { document.writeln('NeatHtml not found\074!-' + '-'); }
</script>
<noscript>
<a target='_blank' href='RemoveScriptEnabledCookiePageURL'
><img src='RemoveScriptEnabledCookieImageURL'
alt='Click here and then...'
/></a>
Please reload this page to view without JavaScript enabled.
</noscript>

</div>
<script type='text/javascript'>
NeatHtml.DefaultFilter.ResizeContainer();
</script>

If JavaScript was still enabled, ProcessUntrusted() would replace the contents of the containing DIV element with the filtered untrusted content.

If the user had disabled JavaScript, the exact behavior would depend on how the browser handled images. If the user's browser automatically downloaded images, it would retrieve a 1x1 transparent GIF from RemoveScriptEnabledCookieImageURL – allowing the server to delete the cookie – and the user would be asked to reload the page. If the user's browser did not automatically retrieve images, the user would be asked to click a link before reloading the page. The link would display RemoveScriptEnabledCookiePageURL in a new window which would allow the server to delete the cookie.
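The step of placing untrusted content in a JavaScript string needs care, because the string is embedded in a SCRIPT element. The following sketch shows one way to do the escaping; it is not taken from NeatHtml. The function name toScriptStringLiteral is hypothetical, and it produces a double-quoted literal rather than the single-quoted one shown in the markup above.

```javascript
// Hypothetical helper: returns a JavaScript string literal (including the
// surrounding double quotes) that is safe to embed inside a SCRIPT element.
function toScriptStringLiteral(untrusted) {
  return JSON.stringify(untrusted)        // escapes quotes, backslashes, newlines
    .replace(/</g, "\\u003C")             // prevents "</script>" and "<!--"
    .replace(/>/g, "\\u003E")             // prevents "-->" and "]]>"
    .replace(/\u2028/g, "\\u2028")        // line separators break string literals
    .replace(/\u2029/g, "\\u2029");       // in older JavaScript engines
}
```

The server would then emit something like ProcessUntrusted( followed by the literal and a closing parenthesis.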

3.2 Adding New Features and Ports

3.2.1 Improving CSRF Removal

NeatHtml currently removes automated CSRF attacks by removing anything that could cause the browser to make a secondary request, including images. Unfortunately, that approach removes benign images from untrusted content and does not address user-initiated CSRF attacks where a user is tricked into clicking a link designed to launch a CSRF attack.

Most CSRF attacks rely on the fact that the user's cookies are sent to the target site. RFC 2965 states that cookies must not be sent on cross-domain redirects. If browsers actually follow that rule, it should be possible to prevent all CSRF attacks that rely on cookies by changing URLs in untrusted content so that they go through a cross-domain redirector.

3.2.2 Supporting Inline Untrusted Content

NeatHtml currently displays untrusted content in a DIV element. Browsers display DIV elements as CSS blocks by default. As a result, the only way to use NeatHtml to handle untrusted HTML that should be displayed inline is to treat the entire block element that contains the untrusted content as untrusted, even though much of it may be trusted. In addition to being inconvenient, this also allows the untrusted content to mount phishing and vandalism attacks on the trusted content in the block.

It is worth investigating whether untrusted content could be displayed inline by styling the DIV element to use “display: inline” instead. It's not clear whether overflow would be hidden by the browser, particularly if the browser is IE6 or earlier.

3.2.3 Running Untrusted Scripts in a Restricted Environment

For some applications, it would be desirable to allow untrusted scripts to run if their capabilities could be restricted. This would be particularly valuable for mashup applications and social networking sites. Since NeatHtml can modify an untrusted script before it runs, it might be able to create a restricted environment for these scripts. For example, NeatHtml could do the following:

  1. Use a regex to find all potential global identifiers in the script. False positives would not impact functionality or security (as long as the matches are syntactically valid identifiers), so this regex could be fairly liberal.

  2. Enclose the script in a function declaration, and before the function declaration add local (i.e. “var”) declarations for all the global identifiers that are not on a whitelist. This hides those global objects from the untrusted script. For whitelisted identifiers, NeatHtml could provide a proxy object. The proxy object would be responsible for restricting access to the underlying global object. For example, the proxy object for the built-in global eval() function would treat its string parameter as an untrusted script, processing it the same as an untrusted top-level script before passing it to the built-in eval() function. The proxy for the Function object constructor would work the same way.

  3. Convert the resulting script to a quoted JavaScript string (e.g. escaping quotes and newlines), try to pass the quoted string to the Function object constructor, and call the newly constructed function. This would provide a way to catch syntax and execution errors in the untrusted script.

With the above technique, the following script:

window.alert('XSS');
eval("ale" + "rt('XSS');");
(new Function("ale" + "rt('XSS');"))();

would become:

try {
  (new Function(""
    + " var window = null;\n"
    + " var eval = EvalProxy;\n"
    + " var Function = FunctionProxy;\n"
    + " RunUntrustedScript();\n"
    + " function RunUntrustedScript() {\n"
    + " window.alert('XSS');\n"
    + " eval(\"ale\" + \"rt('XSS');\");\n"
    + " (new Function(\"ale\" + \"rt('XSS');\"))();\n"
    + " }\n"
  ))();
}
catch (ex) { /* Optionally handle errors */ }
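A generator for wrappers of this kind might look like the following sketch. It is only an illustration of steps 1 and 2 and makes no claim to be a safe sandbox (for instance, it does nothing about “this”). The function name wrapUntrustedScript, the proxies parameter, and the reserved-word list are all assumptions; the returned source would still have to be passed to the Function constructor inside a try/catch as in step 3.

```javascript
// Words that cannot be declared with "var"; skipped when hiding identifiers.
const RESERVED = new Set(("break case catch class const continue debugger default " +
  "delete do else enum export extends false finally for function if import in " +
  "instanceof let new null return static super switch this throw true try typeof " +
  "var void while with yield await").split(" "));

// proxies maps an identifier (e.g. "eval") to the name of its proxy
// (e.g. "EvalProxy"); every other identifier found is hidden with null.
function wrapUntrustedScript(script, proxies = {}) {
  // Step 1: liberal match of anything that looks like an identifier.
  const found = new Set(script.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || []);
  const lines = [];
  // Step 2: shadow each identifier with a local declaration.
  for (const name of found) {
    if (RESERVED.has(name) || name === "RunUntrustedScript") continue;
    const hasProxy = Object.prototype.hasOwnProperty.call(proxies, name);
    lines.push("var " + name + " = " + (hasProxy ? proxies[name] : "null") + ";");
  }
  lines.push("RunUntrustedScript();");
  lines.push("function RunUntrustedScript() {");
  lines.push(script);
  lines.push("}");
  return lines.join("\n");
}
```

For the script “window.alert('XSS');” the generated source begins with “var window = null;”, so calling it throws instead of displaying the alert.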

Unfortunately, there is at least one problem with the above technique. Although it hides the global eval() function behind a proxy, it is not able to hide the deprecated eval() method that exists for every object. It's not clear how hard it would be to address that. It might be as easy as calling something like “Object.prototype.eval = EvalProxy”. However, if that isn't sufficient, the solution would be considerably more complicated and would probably look something like this:

  1. Remove all comments.

  2. Split the script into code, quoted strings, and regexes.

  3. In code portions of the script, replace the regex “\.[ \t\r\n]*eval” with “.EvalProxy” to prevent access to the eval method with the dot operator.

  4. In code portions of the script, find all “[” characters that are used to access object properties instead of start array literals.

  5. Replace the aforementioned “[” characters with “[NeatHtml.Eval2EvalProxy(” and replace the matching “]” characters with “)]”. That would make:
    x["ev"+"al"]
    become:
    x[NeatHtml.Eval2EvalProxy("ev"+"al")]
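The text does not define Eval2EvalProxy itself. One plausible reading, consistent with step 3 replacing “.eval” with “.EvalProxy”, is a function that maps the property name “eval” to “EvalProxy” and passes every other name through. The following sketch shows that reading and is an assumption, not NeatHtml code.

```javascript
// Hypothetical implementation: converts the computed property name once,
// so an object with a changing toString() cannot return a different
// value when the property is actually looked up.
function Eval2EvalProxy(propertyName) {
  if (typeof propertyName === "symbol") return propertyName; // symbols are never "eval"
  const key = String(propertyName);
  return key === "eval" ? "EvalProxy" : key;
}
```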

In addition to restricting object access, the filter could also try to restrict CPU usage of the untrusted script. That is a considerably more complex problem, especially if the untrusted script is allowed to use regular expressions.

Despite these challenges, using NeatHtml to restrict untrusted scripts is a goal worth pursuing, primarily because it would not require any changes to the software installed on the user's computer.

3.2.4 Porting the Server-Side Code to Other Environments

NeatHtml's current server-side component is written for ASP.NET, but it is designed to be small and simple, so that it is easy to port to other development environments. This means that in addition to NeatHtml for ASP.NET, it should be easy to create NeatHtml for Java, NeatHtml for PHP, etc.

Other types of ports would also be interesting. First, consider a port that could be plugged into the web server (e.g. NeatHtml for Apache, or NeatHtml for IIS) to filter HTTP responses as they were sent to the browser. To allow the add-on to find the untrusted content in the HTTP responses, developers would only need to mark it securely, perhaps like this:

<NeatHtmlBeginRawUntrusted token="token" />
raw untrusted content
<NeatHtmlEndRawUntrusted token="token" />

where token would be any value that the author of the untrusted content could not determine before submitting the content. This would ensure that the untrusted content couldn't forge the “<NeatHtmlEndRawUntrusted>” tag to get content past the add-on.
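A web server add-on could locate the marked content with something like the following JavaScript sketch. The function name extractRawUntrusted is hypothetical, the token is assumed to be an unpredictable value generated per response by the application, and treating an unterminated section as untrusted through the end of the response is a design assumption rather than something the text specifies.

```javascript
// Hypothetical helper: returns the raw untrusted sections of a response
// that are delimited by markers carrying the expected token.
function extractRawUntrusted(response, token) {
  const begin = '<NeatHtmlBeginRawUntrusted token="' + token + '" />';
  const end = '<NeatHtmlEndRawUntrusted token="' + token + '" />';
  const sections = [];
  let pos = 0;
  for (;;) {
    const start = response.indexOf(begin, pos);
    if (start === -1) break;
    const contentStart = start + begin.length;
    const stop = response.indexOf(end, contentStart);
    if (stop === -1) {
      // Assumption: with no matching end marker, the rest is untrusted.
      sections.push(response.slice(contentStart));
      break;
    }
    sections.push(response.slice(contentStart, stop));
    pos = stop + end.length;
  }
  return sections;
}
```

An end marker forged inside the untrusted content does not terminate the section unless it carries the right token.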

If marking untrusted content in such a way became common enough, the server-side code could be ported to proxy server or browser add-ons. Such add-ons would allow users to protect themselves while requiring no additional server-side computational resources and a minimum amount of application developer effort.