User:PerfektesChaos/js/WikiSyntaxTextMod/flow/format


Syntax readability

In the sixth step, source readability for human beings is improved, and uniform formatting makes constructs detectable for scripts, bots, and people evaluating dumps or searching source code day to day.

Normally this does not affect page rendering.

Character Entities

Named entities for graphical characters according to HTML4 are replaced. They confuse less technically experienced users, and the characters are available through toolbars nowadays. If someone had no access to editing aids and entered a character as an entity, it is converted automatically, without loss of information, into a single Unicode character.
- An exception is made for nbsp, the markup-language syntax escapes amp gt lt quot apos, and other invisible codes like thinsp.
Numerical entities &#xhhhh; or &#ddd; for graphical (visible) characters are replaced
- with the same exclusion list as for named entities
- and excluding wikisyntax escapes for [ ] | = { }
- and no control codes
- only below x2800 = 10,240 decimal – those originate from the European region including Greek, Russian and mathematical neighbours. Such fonts are rather widely distributed, and little effort is needed to enter such a character.
- On the other hand it seems legitimate that Vietnamese, Tamil or Korean glyphs written as numerical entities from x2800 = 10,240 decimal (Braille) onwards are kept readable and modifiable. It should be taken into account that authors may not have installed such fonts and see ￭ only.
- In contrast to this behaviour, which targets Latin and other letter-based languages, entities within text sequences ([interlanguage] link or entire page) written in CJK (jp ko zh) are converted into ideograms, if recognized.
Since an entity may be protected by nowiki or syntaxhighlight, or a comment might clarify that &Tau; &Alpha; &Chi; &Epsilon; is the real meaning of “ΤΑΧΕ”, entities are not replaced in the first step but only after unchangeable areas have been identified.
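
The numerical entity rules above can be sketched as follows. This is only an illustration under stated assumptions: the function name decodeNumericEntities and the KEEP set are hypothetical, the list of invisible codes is reduced to nbsp and thinsp, and protected regions and the CJK special case are not handled.

```javascript
// Hypothetical helper, not part of WikiSyntaxTextMod itself.
// Code points that stay encoded as entities (illustrative subset).
const KEEP = new Set([
  0x22, 0x26, 0x27, 0x3C, 0x3E,        // quot amp apos lt gt
  0x5B, 0x5D, 0x7C, 0x3D, 0x7B, 0x7D,  // wikisyntax escapes [ ] | = { }
  0xA0, 0x2009                         // nbsp, thinsp
]);

// C0 and C1 control codes are never decoded.
function isControl(cp) {
  return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}

// Replace &#xhhhh; and &#ddd; by the character itself when the
// code point is visible, not excluded, and below x2800.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    if (cp >= 0x2800 || isControl(cp) || KEEP.has(cp)) {
      return match; // keep the entity unchanged
    }
    return String.fromCodePoint(cp);
  });
}
```

For example, decodeNumericEntities("&#xE9; &#x4E2D;") decodes the first entity to é and leaves the second one untouched because it lies at or above x2800.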

Percent sign

Since 2007 the MediaWiki software automatically inserts a non-breaking space between a digit and a percent sign. If older texts contain an explicit non-breaking space entity there, or good-faith authors have recently inserted such an entity or the UCS character itself, it is replaced by an ASCII space.
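
A minimal sketch of this replacement, assuming that the entity appears as &nbsp;, &#160; or &#xA0; or as the character U+00A0 itself; the function name normalizePercentSpace is hypothetical and protected regions are ignored:

```javascript
// Hypothetical helper: exchange a non-breaking space between a digit
// and a percent sign for a plain ASCII space.
function normalizePercentSpace(text) {
  return text.replace(/(\d)(?:&nbsp;|&#160;|&#[xX][aA]0;|\u00A0)%/g, "$1 %");
}
```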

Line break

More than two consecutive line breaks outside protected regions (syntaxhighlight etc.) are reduced to two line breaks.
Every [[Category: and every interlanguage link (if not yet on Wikidata) gets a line of its own.
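
Both rules can be sketched in a few lines. The function names are hypothetical, only the English namespace name Category is recognized, and the exclusion of protected regions as well as the Wikidata check for interlanguage links are left out:

```javascript
// Hypothetical helper: reduce runs of three or more line breaks to two.
function collapseLineBreaks(text) {
  return text.replace(/\n{3,}/g, "\n\n");
}

// Hypothetical helper: start a new line before every [[Category:
// that does not already begin a line.
function categoriesOnOwnLine(text) {
  return text.replace(/([^\n])(\[\[Category:)/g, "$1\n$2");
}
```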

Headline text separated by spaces

In many projects it is common to put one space between the equal signs of the wikisyntax headline markup and the headline text to improve legibility. Depending on the project this will be standardized.
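
As an illustration, the following sketch normalizes headlines to exactly one space on each side. It assumes a project that prefers the spaced form; the function name spaceHeadlines is hypothetical.

```javascript
// Hypothetical helper: "==Title==" and "==  Title ==" both become
// "== Title ==". The same number of equal signs is required on both sides.
function spaceHeadlines(text) {
  return text.replace(/^(={1,6})[ \t]*(.*?)[ \t]*\1[ \t]*$/gm, "$1 $2 $1");
}
```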

`<gallery>`

In picture galleries the following rules are applied:

If an indentation was found, all lines are indented by the maximum number of detected spaces.
The namespace (mostly File:) is no longer required (rev:79639) and is discarded as redundant.
The name of the image file is decoded like a wikilink.
If there is a user defined wikilink modification, it is executed.
If necessary, the name of the image file is protected against changes.
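
The first two gallery rules might look like this in a simplified form. The function name formatGallery is hypothetical, only the English namespace names File and Image are recognized, and decoding, user defined modifications and protection of file names are not covered:

```javascript
// Hypothetical helper working on the text between <gallery> and </gallery>.
function formatGallery(body) {
  const lines = body.split("\n");
  // Largest number of leading spaces found on any line.
  const indent = Math.max(0, ...lines.map((line) => line.match(/^ */)[0].length));
  return lines
    .map((line) => {
      const entry = line.trim();
      if (entry === "") {
        return "";
      }
      // Drop the redundant namespace prefix and apply the common indentation.
      return " ".repeat(indent) + entry.replace(/^(?:File|Image) *: */i, "");
    })
    .join("\n");
}
```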

`<ref>`

It is common practice to begin the content immediately after the opening <ref> within text, without any spaces or line breaks in between. The same goes for the closing </ref>, which follows the content without any space or line break. This formatting is ensured.
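
A rough sketch of this whitespace removal, with a hypothetical function name. Self-closing tags such as <ref name="x" /> are left alone, and as a simplification opening tags whose attributes contain a slash are not matched at all:

```javascript
// Hypothetical helper: remove whitespace directly after an opening
// <ref ...> and directly before the closing </ref>.
function tightenRefs(text) {
  return text
    .replace(/(<ref(?:\s[^<>\/]*)?>)\s+/g, "$1")
    .replace(/\s+(<\/ref>)/g, "$1");
}
```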

That is invisible on the rendered page. Furthermore there are typographic rules on how to join the resulting footnote mark with the surrounding text, the sentence or word. That is beyond syntax polishing and might be established with user defined rules.

<small> within references has no effect, depending on skin and style preferences, or might lead to an indecipherable letter size. Therefore <small> tags are deleted.

Within <references>………</references> blocks each <ref…name= and </ref> is put on a line of its own to make it easier to distinguish the individual references (especially when using cite templates).

Table attributes

For the entire table, table rows and leading cells, attribute syntax is formatted similarly to tags.

Tags, templates, links

These have already been formatted and adapted in previous steps.

Localized syntax elements in unique format

In non-English projects like the German Wikipedia, the following are replaced according to project-specific rules:

#REDIRECT or localised variant – instead of REDIRECT or redirect or Redirect
{{DEFAULTSORT: or localised variant – instead of Defaultsort etc.
[[File: or localised variant – instead of Image or others.
- image (media) parameters downcased and localised standard variant
[[Category: or localised variant – instead of category.

Examples of user defined modifications

Users may define, on their own responsibility, their own cosmetics to extend the automatic polishing described above.

HTML markup

checkwiki #26 checkwiki #38

When copying from external text sources, authors sometimes put HTML markup <i>……</i> or <b>……</b> into wikitext. This should be wikified.

mw.libs.WikiSyntaxTextMod.config.mod.plain  =  [
       ["([^'])<(em|i)>([^'<\n]+)</ *\\2>([^'])",
        "$1''$3''$4",
        "gi"],
       ["([^'])<(strong|b)>([^'<\n]+)</ *\\2>([^'])",
        "$1'''$3'''$4",
        "gi"]
                                               ];

This works automatically for brief fragments, but a further apostrophe »'«, line breaks, other HTML elements and protected regions pose more difficult problems and need manual interpretation. Also ''……'' is rendered differently.
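
To try such rules outside the wiki, the sketch below applies an array of [pattern, replacement, flags] entries to a string. The helper applyRules is hypothetical, and the default flag "g" for entries without a third element is an assumption; how WikiSyntaxTextMod itself evaluates the entries is not described here.

```javascript
// Same two rules as in the configuration, kept in a local array.
const rules = [
  ["([^'])<(em|i)>([^'<\n]+)</ *\\2>([^'])", "$1''$3''$4", "gi"],
  ["([^'])<(strong|b)>([^'<\n]+)</ *\\2>([^'])", "$1'''$3'''$4", "gi"]
];

// Hypothetical helper: run every rule over the text in order.
// Assumption: a missing flags entry means "g".
function applyRules(ruleList, text) {
  return ruleList.reduce(
    (result, [pattern, replacement, flags = "g"]) =>
      result.replace(new RegExp(pattern, flags), replacement),
    text
  );
}

const wikified = applyRules(rules, "This is <i>important</i> and <b>bold</b> text.");
```

Here wikified contains ''important'' and '''bold''' in place of the HTML elements.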

Exponents

The well known ANSI characters may be inserted easily:

mw.libs.WikiSyntaxTextMod.config.mod.plain  =  [
                 ["m<sup>2</sup>",
                  "m²"],
                 ["m<sup>3</sup>",
                  "m³"],
                 …
                                               ];

However, for fragments and in music the <sup>2</sup> format is common and preferred visually; for measurement units like m² or cm³ or m/s², generally only the small exponent is meaningful.

With Unicode there are more superscript digits at 8304–8319 and algebraic signs as well as subscripts at 8320–8334 (H₂O, CO₂). However, currently it cannot be presumed that such codes are present in the font used by the reader for rendering. Therefore formulas should be built with <sup> or <sub>.

Wikisyntax bullets separated by spaces from content

At the beginning of a line, list characters like * and others should be separated from the content by a space to make them easier to recognize:

mw.libs.WikiSyntaxTextMod.config.mod.plain  =  [
                 ["(\n[*#:;]+)([^\n *#:;])",
                  "$1 $2"],
                 ["\n(:+) +\\{\\|",
                  "\n$1{|"],
                 …
                                               ];

The second rule re-establishes table indentation, which would otherwise not be interpreted correctly. In general it is not recommended to format tables this way.^[1]
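
The interplay of the two rules can be checked with a short sketch. It applies the patterns one after the other with the flag "g", which is an assumption about how entries without flags are evaluated; the variable names are illustrative only.

```javascript
// The two rules from the configuration, kept in a local array.
const bulletRules = [
  ["(\n[*#:;]+)([^\n *#:;])", "$1 $2"],
  ["\n(:+) +\\{\\|", "\n$1{|"]
];

const source = "\n*item\n**sub item\n# already spaced\n:{| class=\"wikitable\"";

// The first rule also inserts a space in ":{|"; the second rule removes it again.
let polished = source;
for (const [pattern, replacement] of bulletRules) {
  polished = polished.replace(new RegExp(pattern, "g"), replacement);
}
```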

Sometimes a compact format of definition lists is used like

;Term1:Meaning of 1
;Term2:something different with meaning 2

Formally this is correct, and for very brief terms and explanations it might be less questionable. However, human reading is made easier by the expanded form

; Term1
: Meaning of 1
; Term2
: something different with meaning 2

which can be produced by

mw.libs.WikiSyntaxTextMod.config.mod.plain  =  [
                 ["\n; *([^ :\n][^:\n]*) *: *([^ \n])",
                  "\n; $1\n: $2"],
                 …
                                               ];
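
A minimal sketch of this definition list rule applied with plain JavaScript. The pattern is written with balanced parentheses so that it compiles; $1 is the term and $2 the first character of the meaning. The flag "g" and the variable names are assumptions for illustration.

```javascript
const definitionRule = ["\n; *([^ :\n][^:\n]*) *: *([^ \n])", "\n; $1\n: $2"];

const compact = "\n;Term1:Meaning of 1\n;Term2:something different with meaning 2";

// Split every compact ";term:meaning" line into two lines.
const expanded = compact.replace(
  new RegExp(definitionRule[0], "g"),
  definitionRule[1]
);
```

Text that already uses the two-line form is not matched by the pattern and therefore stays unchanged.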

Remarks

[1] Actually, the table start {| is expected at the beginning of a line. Over the years an undocumented feature has made it possible to detect the table start even if only colons are used for indentation. It is better and easier to understand for both humans and machines to declare explicit CSS indentation with {| style="margin-left:2em" and to use {| at the beginning of a line only.
