FilterDataInput Component

Note: Do not view this file in a browser. Use a text editor instead; otherwise the examples showing escaping will be meaningless.

** Overview **

This component provides extra functionality for filtering data for illegal or corruptive HTML constructs.

It provides this functionality in two pieces: a new Idoc script function called encodeHtml and a filter hook
that automatically scrubs all input data for dangerous HTML constructs.

The encodeHtml function is defined as follows.

Function encodeHtml(str, rule, wordbreakrules)

	str -- String to encode.
	rule -- Rule to apply when encoding. The following values are allowed.
		"none" -- No conversion is done.
		"unsafe" -- Only well known unsafe script tags are encoded. The current list is:
			"script", "applet", "object", "html", "body", "head", "form", "input", "select", "option", "textarea".
		"exceptsafe" -- Only well known safe script tags are NOT encoded. This list includes:
			"font", "span", "strong", "p", "b", "i", "br", "a", "img", "hr", "center", "link",
			"blockquote", "bq", "fn", "note", "tab", "code", "credit", "del", "dfn", "em", "h1", "h2", "h3", "h4", "h5",
			"blink", "s", "small", "sub", "sup", "tt", "u", "ins", "kbd", "q", "person", "samp", "var", "ul", "li",
			"math", "over", "left", "right", "text", "above", "below", "bar", "dot", "ddot", "hat", "tilde", "vec", "sqrt",
			"root", "of", "array", "row", "item".
		"lfexceptsafe" -- Like exceptsafe except that line feed (Ascii 10) characters are turned into HTML break tags (br). 
			However, line feeds inside of HTML tags are NOT turned into "br" tags. Also some tags which are safe for the "exceptsafe" 
			option are not safe for this option. This list includes: "br", "p", "ul", "li".
		Except for the rule "none", all the rules have special HTML comment handling. In particular, all HTML comments are allowed through.
		But while inside an HTML comment all less than (<) and greater than (>) symbols are encoded. This, of course, does not apply
		to the HTML closing signature (-->). If there is an unterminated comment, the encoding function appends the HTML comment close 
		signature (-->). Additionally, for all rules except the rule "none", any parentheses inside an attribute value of a tag are encoded 
		to %28 (for '(') or %29 (for ')'). Otherwise, any character that is escaped is escaped using the XML (&xxxx;) type encoding.
	wordbreakrules -- This optional parameter specifies whether long strings without space characters are to be broken up and what
		maximum word size to apply. Either the string "wordbreak" or "nowordbreak" can be specified. An additional parameter, maxlinelength=XXX
		(separated by a comma from the "wordbreak" parameter if present), sets the maximum line length desired.
		The default is to turn on "wordbreak" if the rule "lfexceptsafe" is specified and to use a maxlinelength of 120 characters. The
		word break functionality is only usable through this Idoc script function because the function is used for display and 
		is not applied before the data is stored.
	(return) -- Returns the encoded string.
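
As a rough illustration of how the "unsafe" and "exceptsafe" rules differ, the following Python sketch encodes whole tags based on the tag name. It is not the component's implementation. The name encode_html_sketch, the regular expression, and the use of &lt; and &gt; as the escape form are assumptions made for this example. The safe list is abbreviated, and comment handling, attribute escaping, "lfexceptsafe", and word breaking are left out.

```python
import re

# Tag names taken from the rule descriptions above. SAFE_TAGS is abbreviated
# here; the full "exceptsafe" list is longer.
UNSAFE_TAGS = {"script", "applet", "object", "html", "body", "head",
               "form", "input", "select", "option", "textarea"}
SAFE_TAGS = {"font", "span", "strong", "p", "b", "i", "br", "a", "img", "hr"}

# Simplified tag matcher (an assumption of this sketch): "<", optional "/",
# a tag name, then anything up to the next ">".
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def encode_html_sketch(text, rule="unsafe"):
    """Encode the angle brackets of tags that the given rule rejects."""
    if rule not in ("none", "unsafe", "exceptsafe"):
        raise ValueError("rule not covered by this sketch: " + rule)
    if rule == "none":
        return text

    def encode_if_rejected(match):
        name = match.group(2).lower()
        if rule == "unsafe":
            rejected = name in UNSAFE_TAGS
        else:  # "exceptsafe"
            rejected = name not in SAFE_TAGS
        if not rejected:
            return match.group(0)
        # Keep the tag text but replace its delimiters.
        return "&lt;" + match.group(0)[1:-1] + "&gt;"

    return TAG_PATTERN.sub(encode_if_rejected, text)
```

With the "unsafe" rule, a b tag passes through and a script tag is encoded. With "exceptsafe", a tag such as div is encoded because it is not on the safe list.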

** Level of Encoding Configuration Entries **
	
The component filters all input data received by the content server at the default "unsafe" encode HTML option. This choice can be changed
by using the following configuration entries:

HtmlDataInputFilterLevel={one of none, unsafe, exceptsafe, or lfexceptsafe}. If "exceptsafe" or "lfexceptsafe" is chosen, then "unsafe" will be applied
to GET style requests (unless HtmlDataScriptableInputFilterLevel is used). Doing a higher level of encoding on GET requests breaks content server 
operation since <$...$> and other tags are routinely passed in as part of the parameter data on URLs. The higher level of filtering will be applied 
to POSTs and in particular all values that are stored persistently will be encoded before they are stored.
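
The GET versus POST behavior described here can be summarized in a short Python sketch. It only restates the rules in this section and is not the component's code. The function name and arguments are hypothetical, and treating every non-GET request like a POST is an assumption.

```python
def effective_filter_level(method, filter_level="unsafe", scriptable_level=None):
    """Pick the encoding level for a request, following the text above.

    filter_level stands in for HtmlDataInputFilterLevel and scriptable_level
    for HtmlDataScriptableInputFilterLevel (None means "not set").
    """
    if method.upper() != "GET":
        # POST style requests get the configured level.
        return filter_level
    if scriptable_level is not None:
        # An explicit scriptable level overrides the GET downgrade.
        return scriptable_level
    if filter_level in ("exceptsafe", "lfexceptsafe"):
        # Higher levels are reduced to "unsafe" on GET style requests.
        return "unsafe"
    return filter_level
```

For example, effective_filter_level("GET", "exceptsafe") gives "unsafe", while the same level on a POST stays at "exceptsafe".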

HtmlDataScriptableInputFilterLevel={one of none, unsafe, exceptsafe, or lfexceptsafe}. This forces the encoding of parameters on all GET style requests
to be at the specified level. If "exceptsafe" is used, then unpredictable user interface results may occur. In particular, if a parameter is not 
"protected" using the configuration entry HtmlDataInputProtectedScriptingParameters, it cannot have Idoc script in it (the Idoc script would be escaped).

HtmlDataInputEncodeDocAndUserFieldsAsExceptSafe -- Bumps up the encoding of certain parameters that are likely to be stored in critical database tables.
If a request is likely to store values in a database then enabling this value will cause the values that are to be put into the database to be encoded
at the "exceptsafe" level (all HTML tags will be encoded except for a small list -- "exceptsafe" in rule parameter for encodeHtml above). The fact that
these values are encoded may not be apparent until you visit either the update doc info or update user profile pages.

HtmlDataInputProtectedScriptingParameters -- Parameters that are not allowed to go above "exceptsafe" when being encoded. In particular, Idoc script can
be put into these parameters regardless of the current global encoding level. The default list is 

	QueryText,queryText,RedirectUrl,RedirectParams,Text1,Text2,pageText1,pageText2,ResultsTitle,PageTitle,descriptionScript,DataScript
	
HtmlDataInputProtectedScriptingParametersExtraKeys -- This is similar to HtmlDataInputProtectedScriptingParameters but it only adds to the list
of protected scripting parameters.

Also, rules can be applied to specific incoming fields using the configuration entry

HtmlDataInputCustomFieldLevels={comma separated list paired into field name and rule, ex: MyField1,exceptsafe,Myfield2,lfexceptsafe,Myfield3,none}. The
encoding is applied to all requests.
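
To make the paired format concrete, here is a small Python sketch that turns such a value into a field-to-rule mapping. The function name and the error handling are assumptions for illustration. The component's own parsing may differ.

```python
VALID_RULES = ("none", "unsafe", "exceptsafe", "lfexceptsafe")


def parse_custom_field_levels(value):
    """Parse "field,rule,field,rule,..." into a dict of field name -> rule."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    if len(items) % 2 != 0:
        raise ValueError("expected field name and rule pairs")
    levels = {}
    # Even positions hold field names, odd positions hold rules.
    for name, rule in zip(items[0::2], items[1::2]):
        if rule not in VALID_RULES:
            raise ValueError("unknown rule: " + rule)
        levels[name] = rule
    return levels
```

Applied to the example value above, the result maps MyField1 to "exceptsafe", Myfield2 to "lfexceptsafe", and Myfield3 to "none".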


** Encoding Algorithm Configuration Entries **

The list of "unsafe" tags can be extended by the following configuration entry.

HtmlDataInputExtraUnsafeTags -- A comma separated list of additional tags that should be considered unsafe. If a tag in this list
is in the "exceptsafe" list, it will no longer be considered an "exceptsafe" tag.
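
The effect on the two tag lists can be sketched in Python as follows. The function name is hypothetical and lower-casing the tag names is an assumption. The sketch only shows that an extra unsafe tag is added to one list and removed from the other.

```python
def effective_tag_lists(unsafe_tags, safe_tags, extra_unsafe_value):
    """Apply a comma separated HtmlDataInputExtraUnsafeTags value to both lists."""
    extra = {tag.strip().lower()
             for tag in extra_unsafe_value.split(",") if tag.strip()}
    # Extra tags become unsafe and lose any "exceptsafe" status.
    return set(unsafe_tags) | extra, set(safe_tags) - extra
```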

The filter also looks for double quotes (") inside values of a preset list of parameter names (such as IDC_Name) and changes all of them
into the XML escape version &quot;. There are some configuration entries that control this and they are valid for all requests.

HtmlDataInputEncodeDoubleQuotes -- Comma separated list of parameters which have their double quotes encoded. This stops the values from being
	used in javascript literal string attacks. Note that these parameters are special because it is legitimate for them to have single quotes
	in their values, but their values are not encoded by Idoc script when used in url or javascript constructions. Any custom components that use these
	values to build javascript expressions or urls need to make sure to use surrounding double quotes (") and not single quotes (').
	If this entry is missing the default entry is
	
	SearchProviders,FromPageUrl,dSecurityGroup,dDocAccount,IDC_Name

HtmlDataInputEncodeDoubleQuotesExtraKeys -- Comma separated list to add to the previous list. This allows the list to be augmented without having to reference
the existing entries in the list. The default is the empty string.
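
A minimal Python sketch of how the base list and the extra keys might combine is shown below. The function names and the dictionary of request parameters are assumptions for illustration. Only the default list and the &quot; replacement come from the text above.

```python
DEFAULT_DOUBLE_QUOTE_KEYS = "SearchProviders,FromPageUrl,dSecurityGroup,dDocAccount,IDC_Name"


def split_keys(value):
    """Split a comma separated configuration value into a list of names."""
    return [key.strip() for key in value.split(",") if key.strip()]


def encode_double_quotes(params, keys=DEFAULT_DOUBLE_QUOTE_KEYS, extra_keys=""):
    """Replace double quotes with &quot; in the selected parameters only."""
    selected = set(split_keys(keys)) | set(split_keys(extra_keys))
    return {name: value.replace('"', "&quot;") if name in selected else value
            for name, value in params.items()}
```

Passing extra_keys="MyParam" (a made-up parameter name) adds MyParam to the selected set without restating the default entries.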

Some parameters need to be fully escaped. Usually this means that the parameters are expected to be keywords or numeric. Because their values tend
to be numeric or come from a limited list of keyword strings, the Idoc script that references them tends to use them in an unsafe fashion when
constructing the page. In particular, this can mean unsafe references in javascript because the javascript may assume the value to be numeric
or the name of a variable.

HtmlDataInputEncodeFully -- Comma separated list of parameters which have single quote ('), double quote ("), ampersand (&), less than (<), greater
than (>), left parenthesis ((), right parenthesis ()), and backslash (\) all encoded. If this entry is missing then the default entry is

	ResultCount,ResultTemplate,dID,dRevClassID,numTopics,subscribeService,unsubscribeService,SortOrder,SortField,SearchEngineName,SearchQueryFormat,
	MaxSavedSearchResults,Repository,ftx

HtmlDataInputEncodeFullyExtraKeys -- Comma separated list to add to the previous list. This allows the list to be augmented without having to reference
the existing entries in the list. The default is the empty string.
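
The following Python sketch shows what full encoding of one value could look like for the eight characters named above. The function name is hypothetical, and the hexadecimal &#xNN; form is an assumption. This document only states that XML type encoding is used.

```python
# Characters named in the HtmlDataInputEncodeFully description:
# ' " & < > ( ) and backslash.
FULLY_ENCODED_CHARS = "'\"&<>()\\"


def encode_fully(value):
    """Replace each listed character with a hexadecimal character reference."""
    return "".join("&#x%02x;" % ord(ch) if ch in FULLY_ENCODED_CHARS else ch
                   for ch in value)
```

A purely numeric value such as 42 is unchanged, while a value that tries to smuggle in a function call has its quotes and parentheses replaced.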

If additional protection is required when encoding at the exceptsafe level, more special characters can be escaped. This can create a level of
extra protection and not only protect against cross site scripting attacks but also prevent malformed pages from bad usage of HTML constructs. Doing
this is generally not recommended because of the degraded user experience that it can cause.

HtmlDataInputEscapePotentiallyUnsafeCharacters -- If set to true, then causes additional characters to be escaped regardless of whether they
are inside or outside a tag. This extra encoding only occurs if "exceptsafe" or higher level encoding is requested (usually both
HtmlDataInputFilterLevel and HtmlDataScriptableInputFilterLevel have been set to exceptsafe so that only "protected" fields do not have
their special characters escaped). By default, only the double quote (") is escaped but additional characters can be escaped using
the configuration entry HtmlDataInputPotentiallyUnsafeCharacters. ** Warning - Escaping special characters such as double quote will
limit the utility of the content server and may create issues that will be perceived as a malfunctioning system. For example, if you
enter a double quote in a document metadata field (such as the title), then next time you view the metadata field it will contain an 
escaped version of the double quote which may be reported by end users as a visual display problem. It may also limit the flexibility
of some search queries and create problems with copying and pasting values from the metadata fields.

HtmlDataInputPotentiallyUnsafeCharacters -- The list of characters (do not comma separate) to escape using xml encoding (&#xNN;) when
HtmlDataInputEscapePotentiallyUnsafeCharacters is true and the encoding level is exceptsafe or higher. The default is double quote (").
** Warning - If you choose to escape the ampersand character (&) it will cause already escaped character sequences to have the ampersand
in the escape sequence escaped again. For example, if you put an ampersand into a document title and keep going to the update doc info 
page and doing update, the ampersand will be escaped again and again on each update creating a longer and longer title. This escaping
rule may also make it impossible for users to properly enter URL paths that have parameters. For example, in http://mysite.com?arg1=1&arg2=2
the ampersand will be escaped making it difficult to copy and directly use the URL.
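
The repeated escaping described in the warning can be reproduced with a short Python sketch. The function name and the sample title are made up, and the exact &#xNN; output format is assumed from the description above.

```python
def escape_chars(value, unsafe_chars='"'):
    """Escape every character found in unsafe_chars as &#xNN;."""
    return "".join("&#x%02x;" % ord(ch) if ch in unsafe_chars else ch
                   for ch in value)


# Simulate three updates of the same title with "&" in the character list.
title = "Tom & Jerry"
history = [title]
for _ in range(3):
    title = escape_chars(title, unsafe_chars='"&')
    history.append(title)
```

Each pass escapes the ampersand of the previous escape sequence again, so every entry in history is longer than the one before it.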

Note 1: This component only works with 7.5 (unlike the earlier versions -- but the earlier versions did not intercept updates to the database very well).

Note 2: There are a number of ways in which browsers make cross site scripting attacks too easy through how they handle encoding of javascript inside an
attribute of a tag. In order to make this example easier to view in a browser we replace < with [ and > with ].

Consider the construction [iframe src="javascript:alert('XSS')][/iframe] (put back in < for [ and > for ] to get actual text). If you put this on a page, 
you will see a popup dialog that says XSS. If you try to encode the ( that occurs before the 'XSS' as any one of %28, &#28, &#28;, or \u0028, 
you will still see the popup. The browser is very aggressive at decoding character sequences before trying to execute the javascript. So 
even though %28, &#28;, and \u0028 really should not all be respected in the same literal string, they are by IE. IE goes even further in that 
the trailing semicolon on &#28 does not have to be present. This component tries to foil this attempt to execute javascript by taking ( -> &#amp;28; 
and in the alternative constructs it converts the & to &amp; and the \ to \\. So &#28 becomes &#amp;28; and \u0028 becomes \\u0028.
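
The general idea can be tried out in Python, using the standard library decoders as a stand-in for a lenient browser. This is only an analogy and says nothing definite about IE itself. The neutralize function is a hypothetical helper that applies the & to &amp; and \ to \\ conversions mentioned above, and the example uses the hexadecimal form &#x28 that appears in the 11/9/2006 entry below.

```python
import html
from urllib.parse import unquote


def neutralize(value):
    """Escape & and backslash so character escapes no longer decode to '('."""
    return value.replace("&", "&amp;").replace("\\", "\\\\")


# A lenient decoder accepts the reference even without the trailing semicolon.
decoded_entity = html.unescape("alert&#x28'XSS'&#x29")

# URL style escapes decode to the same call.
decoded_percent = unquote("alert%28'XSS'%29")

# After neutralizing, one decoding pass gives back the literal text instead.
decoded_safe = html.unescape(neutralize("alert&#x28'XSS'&#x29"))
```

Here decoded_entity and decoded_percent both contain a real function call, while decoded_safe still contains the harmless literal text &#x28.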

Note 3: This component also provides a patch for the java class intradoc.common.IdcLogWriter. This patch prevents raw HTML from being put into the
log file, which would otherwise create an opportunity for cross site scripting attacks.

12/20/2006
-----------------

The parameter MaxSavedSearchResults was not being fully encoded. There was a reference to it that did not use "#env." as prefix so it
created a vector for attack.

12/4/2006
-----------------

Added support for configuration variable HtmlDataInputAllowEnvVarsEncodingAtExceptSafe which allows environment variables such as QUERY_STRING
to be encoded at "exceptsafe" (but only if other configuration specifies such an encoding level). This configuration entry cannot be used
unless a somewhat customized version of the content server user interface is created (the "trays" layout may not behave correctly otherwise).

11/21/2006
-----------------

Added functionality to support the new configuration entries HtmlDataInputEscapePotentiallyUnsafeCharacters and HtmlDataInputPotentiallyUnsafeCharacters.
Also put in code to defend against more obscure cross site scripting (CSS) attacks that involve assignment to self.location.href and so
do not actually have to use < or > characters. It is unknown whether such an attack presents a real vulnerability. Also put in code to examine
HTTP headers supplied by WebDAV calls.

11/9/2006
-----------------
Put in defense against &#x28 (without ; on end) for IE and also encoded ( -> &#amp;#28; instead of %28 (and similarly for the right parenthesis).

11/7/2006
-----------------
Added support for the configuration entry HtmlDataInputExtraUnsafeTags. Also improved the escaping of characters in values of attributes in tag
constructions. For an example of the enhancement, if you have <span onMouseOver="alert(&#28;'test'&#29;">, the ampersand (&) will be encoded 
into &amp;. This is done because the javascript parser in the browsers will decode &#28;'test'&#29; into ('test') which allows the formation 
of a javascript function call. 

6/23/2006
-----------------
Made the component work with form submits when encoding at the "exceptsafe" level. In particular DataScript and any parameter that ends with ":default" needed special
handling.

1/25/2006
-----------------
Bundled with readme.htm for distribution to support site.

11/18/2005
-----------------
Put in support for configuration entry HtmlDataInputEncodeDocAndUserFieldsAsExceptSafe. Also reduced scope of HtmlDataInputFilterLevel so that levels
of encoding "exceptsafe" or higher only apply to "POST" style requests and not "GET" requests. Also fixed issue with HtmlDataInputFilterLevel not applying
to some of the metadata in a checkin or to parameters to certain document action commands (CHECKOUT, etc.).

11/10/2005
-----------------
Changed to supporting only 7.5.1 and removed the following lines from the component definition file.

HttpImplementor
filterdatainput.FilterDataInputServiceHttpImplementor

Also changed the lines

filterDataInputBeforeInitLocale
filterdatainput.FilterDataInputEventFilter
prepareforfilterinput
1
filterDataInputAfterInitLocale
filterdatainput.FilterDataInputEventFilter
globalfilterinput
1
checkCredentials
filterdatainput.FilterDataInputEventFilter
docfilterinput
1
filterDataInputCheckDoneFiltering
filterdatainput.FilterDataInputEventFilter
checkerrorexitfilterinput
1

to (the filter events are built into 7.5.1)

beforeInitLocale
filterdatainput.FilterDataInputEventFilter
prepareforfilterinput
1
afterInitLocale
filterdatainput.FilterDataInputEventFilter
globalfilterinput
1
validateStandard
filterdatainput.FilterDataInputEventFilter
docfilterinput
1
onServiceRequestError
filterdatainput.FilterDataInputEventFilter
checkerrorexitfilterinput
1

	
10/5/2005
-----------------
Added support for doing forced word breaks when using the encodeHtml Idoc script function.

3/15/2005
-----------------
Fixed issue where values loaded out of search results or data files would be encoded because of a second pass on encoding data values
for certain services.

2/10/2005
-----------------
Fixed issue where an early data validation error would prevent filtering of input data.

1/27/2005
-----------------
Fixed issue where line feeds were not encoded if they came before the first "<" character (see option lfexceptsafe above).

1/24/2005
-----------------
Fixed issue where linefeed characters were only encoded if html tags were present in text (see option lfexceptsafe above).
	
1/19/2005
-----------------
Made processing of a < following a < more subtle. The second < would many times be encoded unnecessarily. The new logic will look to see
if the first < is starting a safe (or invalid) tag. If so the next < will not be encoded, but it may terminate processing of the prior tag
and start the processing of a new tag. See UCF p51030105.

1/12/2005
-----------------
Added code to also escape QUERY_STRING parameter sent by web server. Also added code to escape double quotes in a limited list of parameters.

12/30/2004
-----------------

Made filter apply to GET_SEARCH_RESULTS service and every service that requires only anonymous read privilege to access. 
Before it applied only to services that required a login or verified real security.

11/9/2004
-----------------

Created component. Attached to the component is an hcst page that exhibits the encodeHtml Idoc script function.


