
13 Browser and Server Problems

TtH translates TeX into standard HTML and takes account as far as possible of the idiosyncrasies of the major browsers. Nevertheless, there are several problems that are associated with the browsers, and a few that are associated with web servers. Authors and publishers should recognize that these are not TtH bugs. Font-related problems are complicated. If you don't need all the gory details, you might want to read section 13.1 and then skip to 13.3.

13.1 Accessing Symbol Fonts: Overview

Many of the most serious difficulties in rendering mathematics in HTML are associated with the need for extra symbols. In addition to various Greek letters and mathematical operators, one needs access to the glyphs used to build up, from parts, the large brackets that match the height of built-up fractions. These symbols are almost universally present on systems with graphical browsers, which all have a "Symbol" font, generally based on the one made freely available by Adobe. The problem lies in accessing the font, because of shortcomings in the browsers and in the HTML standards that relate to font use.

In brief, there are three ways to access the symbol fonts; these will be described in more detail below. The following table indicates which of these approaches to accessing the symbol fonts works with which browser. It also outlines which of the mathematics rendering improvements via CSS positioning are satisfactory.

                  Symbol Encoding                            CSS Positioning
                  8-bit numeric  Adobe Private  Unicode 3.2  relative  height compress
TtH switch        -u0            -u1            -u2          -y2       -y1
Browser:
MSIE 5.0          Yes            No             No           Yes       Buggy
Mozilla 1.x X     Alias/Font     Buggy          Buggy        Yes       Yes
Firefox 1.x X     Alias/Font     Buggy          Buggy        Yes       Yes
Firefox 1.x Win   Yes            Buggy          Buggy        Yes       Yes
Konqueror 1.9.8   Alias          No             No           Yes       Yes
Firefox 3.5 X     No             Buggy          Ugly         Yes       Yes
Chrome 4.0 X      No             Buggy          Ugly         Yes       Yes
Firefox 3.5 Win   Yes            No             Buggy        Yes       Yes
MSIE 8.0 Win      Yes            No             Ugly         Yes       Yes

This situation is painful. The 8-bit numeric symbol access method, which was pioneered by TtH and used to be its default, once worked with a significant number of browsers but needed additional font settings for X-window systems. However, Mozilla and Firefox have systematically moved towards disabling this method under Linux and OS X, presumably because they consider it not standards-compliant. They have not properly implemented the Unicode 3.2 alternative, because the glyphs they use for built-up delimiters are incorrectly sized and leave ugly gaps. In some cases the spacing is completely erroneous. One is left with the choice between the traditional 8-bit approach, which works well with all MS Windows systems up to Vista but does not work with most recent X-based operating systems, and Unicode 3.2, which works with most browsers but is badly buggy in Windows Firefox and ugly everywhere.

In the interests of an eventual rationalization of this situation, TtH has made the Unicode 3.2 coding its default from the 2010 version 3.87 on, but this is by no means universally satisfactory.

13.2 Accessing Symbol Fonts: Details

Prior to HTML4.0, that is, during the major phase of the evolution of HTML, the default encoding for HTML documents was ISO-8859-1 (sometimes called ISO Latin-1). The document encoding defines a mapping between the bytes of the file itself and characters. The HTML4.0 standard draws a strict (but often confused) distinction between the document "character set", sometimes referred to more recently as the character "repertoire" (which refers to all the characters that might be used in it), and the "document encoding" (which encodes a subset of the character set by mapping its characters to bytes). The confusion is compounded by the entrenched usage of the term "charset" to refer to the "document encoding" (not the character set). This usage is presumably a reflection of the prior lack of any significant distinction between the two.

Purists since the adoption of HTML4.0 regard the selection of a glyph as governed by the process: (byte) code → glyph-name → font-glyph. In this view, even though the font contains the glyphs in a well defined order, the glyph is accessed not by its position in the font but by its name. For example, in a document with ISO-8859-1 encoding, the byte with decimal value 97 maps to the "latin small letter a", which is accessed from the font on that basis. On this view, it is not possible, or rather ought not to be possible, to access the Greek letter alpha by specifying that the font is Symbol and the byte value is decimal 97, despite the fact that the Greek alpha is indeed in the same position in the Symbol font as the lower case a in its font. This is because (the story goes) 97 means "latin small letter a" and the Symbol font simply does not contain the latin small letter a.
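The first step of that chain, from byte code to character name, can be illustrated with Python's standard unicodedata module. This is only a sketch of the standards view, not anything TtH or a browser does; the helper name glyph_name is hypothetical.

```python
import unicodedata


def glyph_name(byte_value: int, encoding: str = "iso-8859-1") -> str:
    """Return the character name that one byte denotes under a document encoding.

    glyph_name is a hypothetical helper, used here only for illustration.
    """
    # Step 1: the document encoding maps the byte to a character.
    char = bytes([byte_value]).decode(encoding)
    # Step 2: the character has a name that is independent of any font.
    return unicodedata.name(char)


print(glyph_name(97))
```

On the purist view, it is this name, not the number 97, that is then looked up in the font.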

In practice, of course, most browsers, including Internet Explorer (to 8.x), have not taken so pedantic an approach. In a document that is encoded in the same order as the fonts on the system, as is the case for ISO-8859 on systems other than the (old) Macintosh, the browser maps code to glyph directly on the basis of numeric position in the font. Therefore it is perfectly sensible to specify eight-bit code 97 and Symbol font to obtain alpha. In other words, the browsers treat the Symbol font as if it were an ISO-8859 font even though, as far as the glyph names are concerned, it is not. It can be argued, even within the world-view of standards lawyers, that a document that does not explicitly specify its encoding (and TtH documents do not) could be considered to obey its own font encoding or some unspecified encoding, in which case bytes ought to be permitted to refer directly to numeric font positions, in just this fashion, regardless of whether the font is identified as ISO 8859. But such arguments are usually a waste of breath. In any case, recent versions of Mozilla and its derivatives on the Windows operating system will properly render symbols provided they are told that the DOCTYPE is HTML 4.0, not HTML 4.01. This is the reason why TtH has reverted to giving its documents this rather out-of-date DOCTYPE.

On the X-windows system, a distinction between fonts is provided directly in the system via the font naming conventions. Mozilla takes notice of this font allocation by permitting access only to fonts whose names end in 8859-1, for default-encoded documents. The Symbol font is not one of those fonts unless additional steps are taken. Enabling the Symbol font requires specification of some system font aliases, or installation of a specially encoded Symbol font, which then ensures that the Symbol font is treated as if it were ISO-8859-1 encoded. Notice that this type of problem arises for any document that wants to access fonts for more than one script. Thus, any document desiring a mixture of, for example, western and Cyrillic characters would face the same problem.

To summarise, the Symbol font is present on practically every computer on the planet that runs a graphical browser. Under the MS Windows operating system, IE to version 8.x and Mozilla (Gecko)-based browsers treat the Symbol font as if it were a numerically encoded font compatible with ISO 8859-1 encoding, provided the DOCTYPE is HTML 4.0 Transitional. Treating the font as such enables the glyphs to be accessed using eight-bit codes in just the same way as standard ASCII characters. This is the way that documents have accessed these glyphs for years.
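The difference between the eight-bit and Unicode styles of access can be sketched in a few lines of Python. The table below is a small, partial excerpt of the Symbol font layout, the function names are hypothetical, and the font-tag markup is an assumed form of numeric access rather than a statement of exactly what TtH emits.

```python
# Partial table: position in the Symbol font -> Unicode code point.
SYMBOL_TO_UNICODE = {
    0x61: 0x03B1,  # position of 'a' -> Greek small letter alpha
    0x62: 0x03B2,  # position of 'b' -> Greek small letter beta
    0x67: 0x03B3,  # position of 'g' -> Greek small letter gamma
    0x64: 0x03B4,  # position of 'd' -> Greek small letter delta
    0x70: 0x03C0,  # position of 'p' -> Greek small letter pi
}


def eight_bit_markup(position: int) -> str:
    """Numeric access: an ordinary byte plus a font change (assumed markup)."""
    return '<font face="symbol">{}</font>'.format(chr(position))


def unicode_markup(position: int) -> str:
    """Unicode access: a numeric character reference, with no font change."""
    return "&#{};".format(SYMBOL_TO_UNICODE[position])


print(eight_bit_markup(97))
print(unicode_markup(97))
```

The first form depends on the browser mapping byte 97 to whatever sits at position 97 in the Symbol font; the second depends only on the browser having some font that covers the requested code point.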

The HTML4.01 standard says that Unicode (ISO 10646, also called UCS) is the character set of HTML, and that the way characters outside the current document encoding should be accessed is through Unicode code points. Unicode is backwardly compatible with ISO 8859-1 in a way that we need not dwell on. Unicode is supposed to fix all the font problems that are described here, and with luck eventually it will indeed help. The problem is that (1) Unicode is enormous, so only a tiny fraction of it is so far supported, and (2) in its original incarnation Unicode does not even assign points to the parts of large delimiters that are needed for mathematics. They are present in the new version of Unicode, 3.2, which is becoming current. However, as the table above shows, no browser cleanly supports the new Unicode assignments. Mozilla used to support some assignments of points in Unicode's designated "private use area" to the glyphs we need. Apparently these assignments have become de facto standards for the Adobe Symbol font in typographic circles. No other browser supports them. They are not and, according to Unicode principles, never will be part of the Unicode standard, and appear to be on the way out.
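As an illustration of the delimiter parts that Unicode 3.2 added, the following Python sketch stacks the three pieces of a tall left parenthesis. The code points are those of the Unicode Miscellaneous Technical block; the function tall_left_paren is a hypothetical helper and does not reproduce the HTML that TtH generates.

```python
import unicodedata
from typing import List

LEFT_PAREN_UPPER = "\u239B"      # LEFT PARENTHESIS UPPER HOOK
LEFT_PAREN_EXTENSION = "\u239C"  # LEFT PARENTHESIS EXTENSION
LEFT_PAREN_LOWER = "\u239D"      # LEFT PARENTHESIS LOWER HOOK


def tall_left_paren(rows: int) -> List[str]:
    """Return one glyph per row for a left parenthesis spanning `rows` lines."""
    if rows < 2:
        raise ValueError("a built-up delimiter needs at least two rows")
    middle = [LEFT_PAREN_EXTENSION] * (rows - 2)
    return [LEFT_PAREN_UPPER] + middle + [LEFT_PAREN_LOWER]


for piece in tall_left_paren(4):
    print(piece, unicodedata.name(piece))
```

Whether the stacked pieces join cleanly depends on the glyphs the browser chooses, which is exactly where the "Buggy" and "Ugly" entries in the table come from.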

The option that mathematics web publishing currently has, then, is either an approach that works with Windows browsers but which purists say is not consistent with latest standards, or a representation that is consistent with the standard but useless with some browsers. It would be really nice if the browsers would get their act together on mathematical symbols.

13.3 Printing

In many browsers, the printing fonts are hard coded into the browser and the font-changing commands are ignored when printing. For that reason, visitors viewing TtH documents will often not be able to print readable versions of documents with lots of mathematics. This problem could, and should, be fixed in the browsers. However, if you want your readers to be able to print a high-quality paper copy of the file, then you probably want to make available to them either the TeX source or a common page-description format such as PostScript or PDF. Since HTML documents download and display so much faster and better than these other formats on the screen, TtH's translation provides the natural medium for people to browse, but not necessarily the best medium for paper production.

13.4 Netscape/Mozilla Composer

Netscape Composer and Mozilla Composer are too clever for their own good. If you run an HTML document produced by TtH through Netscape Composer, all sorts of internal translations are performed that are detrimental to its eventual display. For example, if you subsequently save the document with the usual encoding set (Western), the eight-bit codes that work with Macs are replaced with HTML4.0 entities such as &ograve; or &pound;. This effectively breaks the document for viewing on Macs because it undoes everything just explained. Even if you use User-Defined encoding, which prevents this particular substitution, Composer will rearrange the document in various ways that it thinks are better, but that make the display of the document worse. The moral is, don't run TtH documents through Netscape Composer. You therefore cannot use the "publish" facility of Composer. Transferring the document to the server with plain old ftp will keep it away from Composer's clutches.

13.5 Other Browser Bugs

Font-changing commands do not propagate from cell to cell of HTML tables. In rendering equations (using tables) TtH circumvents this bug (excuse me, feature) at the cost of significant extra effort and slightly verbose HTML. However, for tables generated by \halign or \begin{tabular}, TtH takes no special steps to avoid this problem. A change of font face in a cell, for example by \it, will not carry over to the next cell. A document containing this problem will not pass some HTML validations. It is prevented if every cell of a TeX table is enclosed in braces and the required style applied separately to every cell - a serious annoyance.

Tables are incapable of being properly embedded within a line of text. They generally force a new line. This is quite a significant handicap when translating in-line material that could use a table. It can be argued that this behaviour is required by the HTML standard. Specifically, the <p> element is defined as having in-line attributes which prevent it from containing any elements defined as being block type, of which <table> or actually strictly <td> is one. However, even if you ensure that text is not inside a <p>, most browsers force a new line.

13.6 Web server problems

The HTML files that TtH produces are encoded using the charset ISO-8859-1, like most web files. In newer Linux systems the default file encoding on the computer is in many cases now UTF-8. For the characters with codes above 127, this can cause problems with the web server. The web server may wrongly assume that the HTML file is a UTF-8-encoded file, and declare this assumption in the HTTP Content-Type header that it sends to browsers when they access the file. For Gecko-based browsers, the HTTP Content-Type declaration overrides any internal file declaration of the encoding of the file. Consequently, the browser treats this file as if it is UTF-8 encoded, with the result that codes above 127 are misinterpreted. This is an inadequacy in the web server (Apache is known to behave this way in some situations).
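The misinterpretation is easy to reproduce outside a browser. The following Python sketch decodes the same ISO-8859-1 byte under both assumptions; the function decode_as is a hypothetical helper, and replacing invalid bytes is only one of the ways a browser might react.

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode bytes under an assumed charset, replacing invalid sequences."""
    return raw.decode(charset, errors="replace")


# "e with acute accent" is the single byte 0xE9 in ISO-8859-1.
raw = "\u00e9".encode("iso-8859-1")

correct = decode_as(raw, "iso-8859-1")  # the accented letter, as intended
wrong = decode_as(raw, "utf-8")         # U+FFFD, the replacement character

print(repr(correct), repr(wrong))
```

A lone byte 0xE9 is not a complete UTF-8 sequence, so a decoder that has been told the file is UTF-8 cannot recover the intended character.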

There are several options to work around this problem.

It is possible to convert all files from ISO-8859-1 to UTF-8 encoding, using a utility called iconv, present on most modern Linux installations. This is not an attractive solution because then when the files are browsed locally (via file://...) they will display incorrectly. Locally, the browser does not have the HTTP Content-Type declaration to guide (or misguide) it, and it thinks the files are ISO-8859-1 encoded. But if they've been converted, they are not.
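For readers without iconv, the same conversion can be sketched in Python. The function name latin1_to_utf8 and the file names in the usage note are hypothetical; the sketch rewrites only the bytes and does not touch any encoding declaration inside the file.

```python
from pathlib import Path


def latin1_to_utf8(src: str, dst: str) -> None:
    """Re-encode one file from ISO-8859-1 to UTF-8 (hypothetical helper)."""
    # Every byte value is valid ISO-8859-1, so this decode cannot fail.
    text = Path(src).read_bytes().decode("iso-8859-1")
    # Characters above code 127 become multi-byte sequences in UTF-8.
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as latin1_to_utf8("paper.html", "paper-utf8.html") would write the converted copy alongside the original. The drawback described above still applies to the converted file.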

The better approach seems to be to fix the web server so that it gets the file content-type right. This can be done on a per-directory basis by creating a file called .htaccess in the directory. This file should contain the line:

  AddType text/html;charset=ISO-8859-1 html

This tells the server that all files in this directory and its subdirectories that have extension html are to be considered of type HTML and encoded with the ISO-8859-1 charset.

Unfortunately some web servers are configured not to pay attention to the .htaccess file. If yours is one, you have to get the webmaster to edit the server configuration file (/etc/httpd/conf/httpd.conf). The lines that read AllowOverride None must read instead AllowOverride FileInfo. Alternatively, get the webmaster to change the line in that configuration file that reads AddDefaultCharset UTF-8 to read instead

AddDefaultCharset ISO-8859-1

and once the server is restarted all your troubles will be over without any of those pesky .htaccess files.
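To confirm which charset a server actually declares, one can inspect the Content-Type header it sends. The sketch below uses only the Python standard library; the function names are hypothetical, and any URL passed to server_charset is the reader's own.

```python
import urllib.request
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or None if none is declared.
    return msg.get_content_charset()


def server_charset(url: str) -> Optional[str]:
    """Fetch a URL and report the charset declared in its HTTP header."""
    with urllib.request.urlopen(url) as response:
        return response.headers.get_content_charset()


print(declared_charset("text/html;charset=ISO-8859-1"))
```

If server_charset reports utf-8 for a page produced by TtH with ISO-8859-1 encoding, the server configuration described above has not yet taken effect.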

There are other ways of accomplishing the same thing in the web server, if you are a guru. Information is available at the W3C FAQ.
