Anchors in Tiki

Anchors are seriously screwed up in Tiki. This applies to Tiki versions up to, at least, 25.0. I’ve dug into the mess and here I’m summarizing my findings.

Formalities

The URL standard defines exactly how anchor names in URLs (the optional part starting with “#”) are formed. It’s available here.

This means that all byte sequences (all Unicode strings) can be used as names for URL-fragment strings; they only need to be properly encoded. Every character can be encoded as one or several “percent-encoded bytes”, but most characters don’t need to be percent-encoded and can be written literally.
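To make “one or several percent-encoded bytes” concrete, here is a minimal Python sketch. It is only an illustration: the function name `percent_encode_all` is made up, UTF-8 is assumed as the byte encoding, and every character is deliberately percent-encoded, including those that could be written literally.

```python
from urllib.parse import unquote


def percent_encode_all(text: str) -> str:
    """Percent-encode every UTF-8 byte of text, leaving nothing literal."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


# Depending on the character, one, two or three bytes are needed here.
for ch in ["a", "ü", "語"]:
    print(ch, "->", percent_encode_all(ch))

# Decoding restores the original string, whatever characters it contains.
name = "Über uns: 日本語"
assert unquote(percent_encode_all(name)) == name
```

The round trip at the end is the point: any Unicode string survives being written as a fragment, so there is no technical need to restrict anchor names to ASCII.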

Anchors in Tiki

  1. In Tiki, the ANAME and ALINK plugins don’t encode the anchor name as a URL-fragment string. Instead, in the generated HTML page, they replace each sequence of disallowed characters with a single underscore. Only ASCII letters (upper- and lowercase A-Z) and digits are allowed. This code is in both of those plugins:

    // the following replace is necessary to maintain compliance with XHTML 1.0 Transitional
    // and the same behavior as tikilib.php. This will change when the world arrives at XHTML 1.0 Strict.
    $aname = preg_replace('/[^a-zA-Z0-9]+/', '_', $aname);
  2. In Tiki, links to an anchor in a given page can be generated by the ALINK plugin or from the subsection part of an internal link (if present). The latter gets encoded as a URL-fragment string. This means the anchor name is encoded in two different ways in the generated HTML: one way for the anchor definition (generated by the ANAME plugin) and another for internal links to that anchor (generated from the subsection part).

    However, Tiki’s generation of the URL-fragment string percent-encodes more characters than necessary. According to the URL standard, all letters are allowed here. But - as far as I can see - only ASCII alphanumeric characters are written literally; all other characters are percent-encoded. This leads to ugly URLs. In the address bar, Firefox displays percent-encoded characters (including spaces) as the characters they encode, so the URL doesn’t look ugly there. But when you copy the URL, you get the encoding as it is specified in the HTML code.

  3. In Tiki, anchors are automatically generated for headings. They are treated differently yet again. Each sequence of spaces is replaced by a single underscore and, it seems, all other characters except ASCII alphanumeric ones are removed.

  4. Tables of contents, generated by the maketoc plugin, are also affected. They consist of links to headings, which are encoded as described in item 3.
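
The three encodings from the list above can be compared side by side with a small Python sketch. This is not Tiki’s code. Only the regular expression in `aname_anchor` is taken from the plugins; the other two functions merely mimic the behaviour I observed, and all function names are made up for this illustration.

```python
import re


def aname_anchor(name: str) -> str:
    """ANAME/ALINK: each run of characters outside a-z, A-Z, 0-9 becomes one '_'.

    The PHP original works on bytes, this works on characters; the result is
    the same because a whole run collapses into a single underscore.
    """
    return re.sub(r"[^a-zA-Z0-9]+", "_", name)


def subsection_fragment(name: str) -> str:
    """Subsection of an internal link, as observed: ASCII alphanumerics stay
    literal, everything else is percent-encoded (assumption: as UTF-8 bytes)."""
    return "".join(
        ch if ch.isascii() and ch.isalnum()
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in name
    )


def heading_anchor(name: str) -> str:
    """Heading anchors, as observed: runs of spaces become one '_', then every
    character that is neither ASCII alphanumeric nor '_' is dropped
    (assumption: underscores survive the second step)."""
    return re.sub(r"[^a-zA-Z0-9_]", "", re.sub(r" +", "_", name))


for name in ["Getting started", "Über uns", "日本語"]:
    print(repr(name), "|", aname_anchor(name), "|",
          subsection_fragment(name), "|", repr(heading_anchor(name)))
```

For a plain ASCII name the ANAME and heading encodings agree, but for “Über uns” they give `_ber_uns` and `ber_uns`, while the link generated from a subsection points to `%C3%9Cber%20uns`. For “日本語” the heading anchor is the empty string.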

Discussion

This situation is unsatisfactory. You can link to an anchor by using a regular internal Tiki link with a subsection. You can also use the ALINK plugin. If you use an internal link, you need to manually encode the subsection part. But there are two ways to encode it - the way the ANAME plugin does, and the way the anchors generated from headings get encoded.

Both ways allow only ASCII alphanumeric characters and throw away all other characters. What about Asian languages, which don’t use Latin characters at all? In order to make an explicit anchor, their users need to fall back on Latin characters. And what about the automatically generated anchors for headings? Those can’t be used at all, because such anchors are encoded as the empty string.

The solution is to properly encode anchor names as URL-fragment strings, the way it is specified in the URL standard. The ANAME/ALINK plugins, the subsection part of internal links, the anchors generated for headings and the links generated by the maketoc plugin would all use the same encoding when translated to HTML. However, characters which don’t need to be percent-encoded shouldn’t be percent-encoded. They should be written literally. Otherwise, for non-Latin languages, you get anchors which consist exclusively of percent-encoded bytes.

Doing it right (this way) would break compatibility with older wiki pages. Changing the encoding causes no problem for the ANAME and ALINK plugins: the encoded anchor names would be different, but they would still match each other. Regular internal Tiki links which point to anchors are a problem, however. This is because the Tiki user has been forced to manually encode the subsection part in the page’s source code, according to one of the two ways discussed above. When we change those two ways as described, the subsections will no longer match the anchors.

This could be resolved by encoding the anchors with the new encoding (as a URL-fragment string) as well as with the old encoding (which throws away non-ASCII letters). This means that two anchors are generated in the HTML code for each occurrence of an anchor definition in the Tiki page. For anchors which consist only of ASCII alphanumeric characters, both encodings would be the same.
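
Here is a rough Python sketch of that idea, not a patch for Tiki. The function names are made up, the old encoding shown is the ANAME/ALINK one, and the set of characters left literal is my reading of the URL standard’s “URL code points”, which should be checked against the standard before relying on it.

```python
import re

# Assumption: ASCII punctuation that may appear literally in a fragment.
LITERAL_PUNCTUATION = "!$&'()*+,-./:;=?@_~"


def is_literal(ch: str) -> bool:
    """True if ch can be written literally instead of percent-encoded."""
    cp = ord(ch)
    if cp < 0x80:
        return ch.isalnum() or ch in LITERAL_PUNCTUATION
    if 0xD800 <= cp <= 0xDFFF:                        # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return 0xA0 <= cp <= 0x10FFFD


def new_encoding(name: str) -> str:
    """Percent-encode only what has to be encoded (UTF-8 bytes assumed).

    Lone surrogates are not handled; encoding them raises UnicodeEncodeError.
    """
    return "".join(
        ch if is_literal(ch)
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in name
    )


def old_encoding(name: str) -> str:
    """The ANAME/ALINK replacement currently used by Tiki."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", name)


def anchor_names(name: str) -> list:
    """Anchor names to emit: the new one first, the old one if it differs."""
    names = [new_encoding(name)]
    old = old_encoding(name)
    if old not in names:
        names.append(old)
    return names


for name in ["Introduction", "Über uns", "日本語"]:
    print(repr(name), "->", anchor_names(name))
```

With this sketch “Introduction” yields a single anchor, while “Über uns” yields `Über%20uns` plus the legacy `_ber_uns`. For anchors generated from headings, `old_encoding` would have to be the heading variant instead.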

The subsection part of internal Tiki links would be encoded the new way when generating HTML. Old links, which have the manual encoding in them, won’t get changed by the new encoding. They will match the old anchors. New links, which don’t have the old encoding in the subsection part, will match the new anchors.

Doing it this way wouldn’t require any transformation of old Tiki pages’ source code.

Relevance for Moin2Tiki

When generating an ordinary, internal Tiki link from a MoinMoin link which has an anchor in it, we get a Tiki link with a subsection. This subsection needs to be encoded manually, as described above. That’s because Tiki (up to version 25.0, at least) fails to do this for normal, internal links when generating HTML.

The problem is that we have two encodings (see above) and it isn’t clear which one should apply. It could be the ANAME/ALINK encoding or the heading encoding. They differ when characters other than ASCII letters and digits are in the anchor name. In order to do it right, the right encoding has to be determined.

Alas, Moin2Tiki parses only the selected pages. When the target page, which the anchor is in, isn’t selected on the command line, Moin2Tiki doesn’t know if the anchor part links to an explicit anchor (created by the ANAME plugin) or to one generated from a heading. In this case, Moin2Tiki takes pot luck and encodes it as an anchor generated from a heading. This is because such anchors should be much more frequent than explicit anchors.
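
The decision described above can be sketched in Python as follows. This is not Moin2Tiki’s actual code: the function names and the `known_anchors` mapping are hypothetical, the heading encoding only mimics Tiki’s observed behaviour, and falling back to the heading encoding for an anchor missing from a parsed page is an extra assumption of this sketch.

```python
import re


def aname_encoding(name: str) -> str:
    """Encoding used by Tiki's ANAME/ALINK plugins."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", name)


def heading_encoding(name: str) -> str:
    """Encoding of heading anchors, as observed (assumption, see lead-in)."""
    return re.sub(r"[^a-zA-Z0-9_]", "", re.sub(r" +", "_", name))


def encode_subsection(page: str, anchor: str, known_anchors: dict) -> str:
    """Encode the subsection part of a link to `anchor` on `page`.

    known_anchors maps each parsed page name to a dict from anchor name to
    its kind, "explicit" or "heading". Pages that were not selected on the
    command line are simply absent from it.
    """
    kind = known_anchors.get(page, {}).get(anchor)
    if kind == "explicit":
        return aname_encoding(anchor)
    # Heading anchor, or nothing known about the target: guess "heading".
    return heading_encoding(anchor)


known = {"Projects": {"Überblick": "explicit"}}
print(encode_subsection("Projects", "Überblick", known))  # target page was parsed
print(encode_subsection("Archive", "Überblick", known))   # target page unknown
```

The two calls differ only in whether the target page was parsed, and they produce `_berblick` and `berblick` respectively, so a wrong guess yields a link that doesn’t match its anchor.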