Anchors are seriously screwed up in Tiki. This applies to Tiki versions up to, at least, 25.0. I’ve dug into the mess and here I’m summarizing my findings.
The URL standard defines exactly how anchor names in URLs (the optional part starting with “#”) are formed. It’s available here.
What Tiki calls anchors and subsection parts of links is called a “URL-fragment string” in the URL standard. See here for the formal definition.
A URL-fragment string consists of zero or more “URL units”. “The URL units are URL code points and percent-encoded bytes.” See here.
URL code points are ASCII alphanumeric characters, certain ASCII non-alphanumeric characters, and the Unicode code points in the range U+00A0 to U+10FFFD, except surrogates (the code points U+D800 to U+DFFF, which are reserved for UTF-16 and never stand for a character on their own) and noncharacters. See here. This means that most Unicode characters, including all letters, are allowed in URL-fragment strings, whether they are in the ASCII range or not. Spaces aren’t URL code points and need to be “percent-encoded” as “%20”.
A “percent-encoded byte” is a percent character followed by two hexadecimal digits.
This means that any byte sequence (and so any Unicode string) can be used as an anchor name; it only needs to be properly encoded as a URL-fragment string. Every character can be encoded as one or several “percent-encoded bytes” (the bytes of its UTF-8 encoding), but most characters don’t need to be percent-encoded and can be written literally.
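To make the rules above concrete, here is a minimal sketch of such an encoding. It is written in Python for brevity, not in Tiki's PHP, and the names `is_url_code_point` and `encode_fragment` are my own, not anything from Tiki or the URL standard. It writes every URL code point literally and percent-encodes the UTF-8 bytes of everything else.

```python
import string

# ASCII characters that the URL standard lists as URL code points.
_ASCII_URL_CODE_POINTS = set(
    string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~"
)


def is_url_code_point(ch: str) -> bool:
    """Return True if the single character ch is a URL code point."""
    cp = ord(ch)
    if cp < 0x80:
        return ch in _ASCII_URL_CODE_POINTS
    if cp < 0xA0 or cp > 0x10FFFD:
        return False
    if 0xD800 <= cp <= 0xDFFF:        # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:        # noncharacters U+FDD0..U+FDEF
        return False
    if (cp & 0xFFFE) == 0xFFFE:       # noncharacters ending in FFFE or FFFF
        return False
    return True


def encode_fragment(name: str) -> str:
    """Encode name so that it consists only of URL units."""
    out = []
    for ch in name:
        if is_url_code_point(ch):
            out.append(ch)            # allowed literally
        else:
            # Percent-encode each byte of the character's UTF-8 encoding.
            # A lone surrogate cannot be UTF-8 encoded and raises here.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(encode_fragment("my section"))   # my%20section
print(encode_fragment("Übersicht"))    # Übersicht
print(encode_fragment("100%"))         # 100%25
```

Note that the percent character itself is not a URL code point, so a literal “%” in an anchor name comes out as “%25”.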
In Tiki, the ANAME and ALINK plugins don’t encode the anchor name as a URL-fragment string. Instead, in the generated HTML page, they replace each sequence of disallowed characters with a single underscore. Only the ASCII letters (upper and lower case A-Z) and digits are allowed. This code is in both of those plugins:
// the following replace is necessary to maintain compliance with XHTML 1.0 Transitional
// and the same behavior as tikilib.php. This will change when the world arrives at XHTML 1.0 Strict.
$aname = preg_replace('/[^a-zA-Z0-9]+/', '_', $aname);
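To see what this replacement does to real anchor names, here is the same regular expression applied in Python. The function name `legacy_aname` is just my label for it. The PHP pattern has no `u` modifier, so it works on bytes rather than characters, but because each run of disallowed bytes collapses into one underscore, the results should be the same as in this character-based sketch.

```python
import re


def legacy_aname(name: str) -> str:
    """Apply the same pattern as the preg_replace call in ANAME/ALINK."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", name)


for name in ["Hello World!", "Über uns", "日本語", "中文"]:
    print(repr(name), "->", repr(legacy_aname(name)))

# 'Hello World!' -> 'Hello_World_'
# 'Über uns' -> '_ber_uns'
# '日本語' -> '_'
# '中文' -> '_'
```

Any two names without ASCII letters or digits end up as the same anchor, a single underscore.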
In Tiki, links to an anchor in a certain page can be generated by the ALINK plugin and from the subsection part of an internal link (if present). The latter gets encoded as a URL-fragment string. This means that the anchor name gets encoded in two different ways in the generated HTML: one way for the anchor definition (generated by the ANAME plugin), and another for internal links to that anchor (generated from the subsection part).
However, Tiki’s generation of the URL-fragment string percent-encodes more characters than necessary. According to the URL standard, all letters are allowed here. But, as far as I can see, only ASCII alphanumeric characters are written literally; other letters are percent-encoded. This leads to ugly URLs. In the address bar, Firefox replaces percent-encoded characters (including spaces) with the characters they encode, so the URL doesn’t look ugly there. But when you copy the URL, you get the encoding as it is specified in the HTML code.
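The effect can be illustrated with Python's `urllib.parse.quote`. This is only a stand-in that roughly behaves as described above (it also keeps a few ASCII punctuation characters literal); it is not Tiki's actual code. The example name is made up.

```python
from urllib.parse import quote, unquote

name = "Übersicht der Änderungen"

# Percent-encode every non-ASCII letter, as described above.
over_encoded = quote(name, safe="")
print(over_encoded)
# %C3%9Cbersicht%20der%20%C3%84nderungen

# Decoding gives back the readable form, which is roughly what the
# address bar displays.
print(unquote(over_encoded))
# Übersicht der Änderungen
```

Under the URL standard only the two spaces would need encoding here; “Ü” and “Ä” could stay as they are.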
In Tiki, anchors are automatically generated for headings. They are treated differently again. Spaces are replaced by underscores. Then, it seems, all other characters except ASCII alphanumeric ones are removed.
Tables of contents, generated by the maketoc plugin, are also affected. They consist of links to headings, which are encoded as just described for heading anchors.
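Here is a small Python sketch of the behaviour I observed for heading anchors. It is a reconstruction from the generated HTML, not Tiki's actual code, and `heading_anchor` is a hypothetical name.

```python
import re


def heading_anchor(heading: str) -> str:
    """Assumed rule: spaces become underscores, then everything that is
    not an ASCII letter, digit or underscore is dropped."""
    with_underscores = heading.replace(" ", "_")
    return re.sub(r"[^a-zA-Z0-9_]", "", with_underscores)


print(repr(heading_anchor("Getting Started!")))  # 'Getting_Started'
print(repr(heading_anchor("Über uns")))          # 'ber_uns'
print(repr(heading_anchor("見出し")))             # ''
```

Compare this with the ANAME rule: there, “Über uns” becomes “_ber_uns”, so the two rules disagree even for European names.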
This situation is unsatisfactory. You can make a link to an anchor by using a regular internal Tiki link with a subsection. You can also use the ALINK plugin. If you use an internal link, you need to manually encode the subsection part. But there are two ways to encode it: the way the ANAME plugin does, and the way the anchors generated from headings get encoded.
Both ways allow only ASCII alphanumeric characters and throw away all others. What about Asian languages, which don’t use Latin characters at all? In order to make an anchor, their users need to use Latin characters. And what about the automatically generated anchors for headings? These can’t be used at all, because such anchors are encoded as the empty string.
The solution is to properly encode anchor names as URL-fragment strings, the way it is specified in the URL standard. The ANAME/ALINK plugins, the subsection part of internal links, the anchors generated for headings and the links generated by the maketoc plugin would all use the same encoding when translated to HTML. However, characters which don’t need to be percent-encoded shouldn’t be. They should be written literally. Otherwise, for non-Latin languages, you get anchors which consist exclusively of percent-encoded bytes.
Doing it right (this way) would break compatibility with older wiki pages. When changing the encoding this way, there’s no problem with the ANAME and ALINK plugins. The encoded anchor names would be different, but they would still match. But with regular internal Tiki links that link to anchors, we have a problem. This is because the Tiki user has been forced to manually encode the subsection part in the page’s source code, according to the two ways discussed above. When we change those two ways as described, the subsections will no longer match the anchors.
This could be resolved by encoding the anchors with the new encoding (as a URL-fragment string) as well as with the old encoding (which throws away non-ASCII letters). This means that two anchors are generated in the HTML code for each occurrence of an anchor definition in the Tiki page. For anchors which consist only of ASCII alphanumeric characters, both encodings would be the same.
The subsection part of internal Tiki links would be encoded the new way when generating HTML. Old links, which have the manual encoding in them, won’t get changed by the new encoding. They will match the old anchors. New links, which don’t have the old encoding in the subsection part, will match the new anchors.
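The following Python sketch shows how the double anchors and the link encoding could fit together. All names are hypothetical, and the new encoding is simplified: it writes every character from U+00A0 upwards literally and ignores the noncharacter exclusions. For the old encoding it only covers the ANAME rule; heading anchors would need their own old rule in the same place.

```python
import re
from urllib.parse import quote

# ASCII punctuation that is a URL code point; quote() already keeps
# letters, digits and "_.-~" literal.
_SAFE = "!$&'()*+,/:;=?@"


def new_encoding(name: str) -> str:
    """Simplified URL-fragment encoding: non-ASCII stays literal."""
    return "".join(
        ch if ord(ch) >= 0xA0 else quote(ch, safe=_SAFE) for ch in name
    )


def old_encoding(name: str) -> str:
    """The ANAME/ALINK rule: disallowed runs become one underscore."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", name)


def anchor_ids(name: str) -> list[str]:
    """Anchors to emit for one anchor definition: new first, old second,
    or just one if both encodings agree."""
    ids = [new_encoding(name)]
    old = old_encoding(name)
    if old not in ids:
        ids.append(old)
    return ids


def link_fragment(subsection: str) -> str:
    """Fragment for an internal link; always uses the new encoding."""
    return new_encoding(subsection)


print(anchor_ids("Über uns"))      # ['Über%20uns', '_ber_uns']
print(anchor_ids("Intro2"))        # ['Intro2']
print(link_fragment("_ber_uns"))   # old, manually encoded link: _ber_uns
print(link_fragment("Über uns"))   # new link: Über%20uns
```

The old, manually encoded subsection passes through the new encoding unchanged because it contains only ASCII letters, digits and underscores, so it still finds the old anchor.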
Doing it this way wouldn’t require any transformation of old Tiki pages’ source code.
Volker Wysk, post at volker hyphen wysk dot de, 2023-01-18, version 3