User:PerfektesChaos/js/WikiSyntaxTextMod/flow/format
In the sixth step, source readability for human readers is improved, and uniform formatting makes constructs detectable for scripts, bots, and people evaluating dumps or searching the source code day to day.
Normally this does not affect page rendering.
Character Entities
- Named entities for graphical characters according to HTML4 are replaced. They confuse less technically experienced users, and the characters are available from toolbars nowadays. If someone without access to editing aids entered a character as an entity, it is converted automatically, without loss of information, into a single Unicode character.
  - An exception is made for nbsp, the ML syntax escapes amp gt lt quot apos, and other invisible codes like thinsp.
- Numerical entities &#xhhhh; or &#ddd; for graphical (visible) characters are replaced
  - with the same exclusion list as named entities,
  - excluding wikisyntax escapes for [ ] | = { },
  - excluding control codes,
  - only below x2800 = 10240 decimal; those characters originate from the European region, including Greek, Russian and mathematical neighbours. Such fonts are rather widely distributed, and little effort is needed to enter such a character.
- On the other hand, it seems legitimate that Vietnamese, Tamil or Korean glyphs written as numerical entities from x2800 = 10240 decimal (Braille) onwards are kept readable and modifiable. It should be taken into account that authors may not have such fonts installed and see ■ only.
- In contrast to this behaviour, which targets Latin and other letter-based languages, entities inside text sequences written in CJK (jp ko zh), whether an [interlanguage] link or an entire page, are converted into ideograms if recognized.
- Since an entity may be protected by nowiki or syntaxhighlight, or a comment might clarify that Τ Α Χ Ε is the real meaning of “ΤΑΧΕ”, entities are not replaced in the first step but only after unchangeable areas have been identified.
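The replacement rules above can be sketched roughly as follows. This is only an illustration, not the actual WikiSyntaxTextMod code: the function name `replaceEntities`, the tiny `NAMED` sample table and the exact range treated as control codes are assumptions, and protected areas as well as the CJK special case are ignored.

```javascript
// Illustrative sketch; names and tables are assumptions, not the tool's API.
// Entities that are never replaced, with their code points.
const KEEP = new Map([
  ["nbsp", 0xA0], ["amp", 0x26], ["gt", 0x3E], ["lt", 0x3C],
  ["quot", 0x22], ["apos", 0x27], ["thinsp", 0x2009]
]);
// Small sample of HTML4 named entities; a real table would be much longer.
const NAMED = new Map([["auml", "ä"], ["euro", "€"], ["alpha", "α"]]);
// Code points that stay encoded: the exclusion list plus wikisyntax escapes.
const KEEP_CODES = new Set(KEEP.values());
for (const ch of "[]|={}") KEEP_CODES.add(ch.codePointAt(0));
const LIMIT = 0x2800; // 10240 decimal; entities from here on are kept

function isControl(code) {
  return code < 0x20 || (code >= 0x7F && code <= 0x9F);
}

function replaceEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (!body.startsWith("#")) {
      // Named entity: keep it if excluded or unknown to the sample table.
      if (KEEP.has(body) || !NAMED.has(body)) return match;
      return NAMED.get(body);
    }
    const code = /^#x/i.test(body)
      ? parseInt(body.slice(2), 16)
      : parseInt(body.slice(1), 10);
    if (code >= LIMIT || isControl(code) || KEEP_CODES.has(code)) return match;
    return String.fromCodePoint(code);
  });
}
```

In this sketch `&auml;` and `&#xE4;` both become ä, while `&amp;`, `&#91;` and `&#x2800;` are left untouched.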
Percent sign
Since 2007 the MediaWiki software has automatically inserted a non-breaking space between a digit and a percent sign. If older texts still contain such an entity, or well-meaning authors have recently inserted the entity or the corresponding UCS character, it is replaced by an ASCII space.
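A minimal sketch of this replacement, assuming the non-breaking space appears as a named entity, a numerical entity or the character itself; the helper name `normalizePercentSpace` is hypothetical and protected areas are not considered:

```javascript
// Hypothetical helper: replaces a non-breaking space (entity or character)
// between a digit and a percent sign by a plain ASCII space.
function normalizePercentSpace(text) {
  return text.replace(/(\d)(?:&nbsp;|&#0*160;|&#x0*A0;|\u00A0)%/gi, "$1 %");
}
```

Text that already uses an ASCII space, or where the percent sign does not follow a digit, is left unchanged.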
Line break
- More than two consecutive line breaks outside protected areas (syntaxhighlight etc.) are reduced to two line breaks.
- Every [[Category: and every interlanguage link (if not yet on Wikidata) gets a line of its own.
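Both rules can be approximated with simple regular expressions. The function names are hypothetical, the sketch does not detect protected areas, and it only breaks the line before a category, not after it:

```javascript
// Reduce runs of three or more line breaks to exactly two.
function collapseLineBreaks(text) {
  return text.replace(/\n{3,}/g, "\n\n");
}

// Start a new line before every [[Category: that is not at line start.
// Links of the form [[:Category:...]] are not affected.
function categoriesOnOwnLine(text) {
  return text.replace(/([^\n])(\[\[Category:)/g, "$1\n$2");
}
```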
Headline text separated by spaces
In many projects it is common to put one space between the equal signs of the wikisyntax headline markup and the headline text, which improves readability. Depending on the project, this is standardized.
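As a rough sketch, such a standardization could look like the following. The helper `normalizeHeadlines` and its `spaced` parameter are assumptions standing in for the project-specific setting:

```javascript
// Hypothetical helper: puts exactly one space (spaced = true) or no space
// (spaced = false) between the equal signs and the headline text.
function normalizeHeadlines(text, spaced = true) {
  const pad = spaced ? " " : "";
  return text.replace(
    /^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$/gm,
    (match, eq, title) => eq + pad + title + pad + eq
  );
}
```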
<gallery>
In picture galleries the following rules are applied:
- If there was an indentation found, all lines will be indented by the maximum number of detected spaces.
- The namespace (mostly File:) is no longer required (rev:79639) and is discarded since it is redundant.
- The name of the image file is decoded like a wikilink.
- If there is a user defined wikilink modification this will be executed.
- If there is a necessity the name of the image file is protected against changes.
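The first two rules might be sketched as below. The function name is hypothetical, only the English namespace names File: and Image: are assumed, and decoding, user defined modifications and protection of file names are left out:

```javascript
// Hypothetical sketch for the lines between <gallery> and </gallery>.
function formatGalleryLines(body) {
  const lines = body.split("\n");
  const used = lines.filter((line) => line.trim() !== "");
  // Largest indentation found on any non-empty line (0 if none).
  const indent = Math.max(
    0,
    ...used.map((line) => line.length - line.trimStart().length)
  );
  return lines
    .map((line) => {
      // Drop the redundant namespace prefix.
      const entry = line.trim().replace(/^(?:File|Image):\s*/i, "");
      return entry === "" ? "" : " ".repeat(indent) + entry;
    })
    .join("\n");
}
```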
<ref>
It is common practice to begin the content immediately after the opening <ref> within text, without any spaces or line breaks in between. The same goes for the closing </ref>, which follows the content without any space or line break. This formatting is ensured.
This is invisible on the rendered page. Furthermore, there are typographic rules on how to join the resulting footnote mark with the surrounding text, sentence or word. That is beyond syntax polishing and may be handled with user defined rules.
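For references within running text, the tightening could be sketched like this. The helper name `tightenRefs` is an assumption; self-closing tags such as <ref name="a" /> are deliberately skipped, and the separate layout used inside <references> blocks is not handled:

```javascript
// Hypothetical helper: removes whitespace directly after an opening <ref ...>
// and directly before a closing </ref>.
function tightenRefs(text) {
  return text
    // Opening tag with optional attributes, but not a self-closing one.
    .replace(/(<ref(?:\s[^<>]*)?(?<!\/)>)\s+/gi, "$1")
    .replace(/\s+(<\/ref\s*>)/gi, "$1");
}
```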
<small> within references has no effect, depending on skin and style preferences, or may lead to an indecipherable letter size. Therefore <small> tags are deleted.
Within <references>………</references> blocks, each <ref …name= and </ref> is put on a line of its own to make it easier to distinguish the individual references (especially when using cite templates).
Table attributes
For the entire table, table rows and leading cells, attribute syntax is formatted similarly to tags.
Tags, templates, links
This has already been formatted and adapted in previous steps.
Localized syntax elements in unique format
In non-English projects like the German Wikipedia, the following are replaced according to project-specific rules:
- #REDIRECT or localised variant, instead of REDIRECT or redirect or Redirect
- {{DEFAULTSORT: or localised variant, instead of Defaultsort etc.
- [[File: or localised variant, instead of Image or others.
  - image (media) parameters downcased and in the localised standard variant
- [[Category: or localised variant, instead of category.
For more on keywords see localisation.
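The idea can be illustrated with a small sketch. The `KEYWORDS` object and the function `unifyKeywords` are hypothetical; a localised project would supply its own variants in place of the English defaults used here, and image parameters are not covered:

```javascript
// Hypothetical project settings; English defaults stand in for localised variants.
const KEYWORDS = { redirect: "#REDIRECT", file: "File", category: "Category" };

function unifyKeywords(text, kw = KEYWORDS) {
  return text
    // Redirect keyword only at the very beginning of the page.
    .replace(/^#redirect\b/i, kw.redirect)
    .replace(/\[\[\s*(?:file|image)\s*:\s*/gi, "[[" + kw.file + ":")
    .replace(/\[\[\s*category\s*:\s*/gi, "[[" + kw.category + ":")
    .replace(/\{\{\s*defaultsort\s*:/gi, "{{DEFAULTSORT:");
}
```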
Examples of user defined modifications
Users may define, on their own responsibility, their own cosmetic changes to extend the automatic polishing described above.
HTML markup
When copying from external text sources, authors sometimes put HTML markup <b>……</b> or <i>……</i> into wikitext. This should be wikified.
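mw.libs.WikiSyntaxTextMod.config.mod.plain = [
   // <em>…</em> or <i>…</i> on one line, not adjacent to an apostrophe
   ["([^'])<(em|i)>([^'<\n]+)</ *\\2>([^'])",
    "$1''$3''$4",
    "gi"],
   // <strong>…</strong> or <b>…</b> under the same conditions
   ["([^'])<(strong|b)>([^'<\n]+)</ *\\2>([^'])",
    "$1'''$3'''$4",
    "gi"]
];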
This can be done automatically for brief fragments, but a further apostrophe »'«, line breaks, other HTML elements and protected regions pose more difficult problems and need manual interpretation. Also, ''<i>……</i>'' is rendered differently.
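To try such rules outside the wiki, one can assume that each entry is a triple of pattern, replacement and flags that is applied like String.prototype.replace with a RegExp built from the pattern. That is an assumption about the tool, as is the default of "g" when no flags are given; `applyPlainRules` is a hypothetical stand-in, not part of WikiSyntaxTextMod.

```javascript
// Hypothetical stand-in for the rule engine; not the tool's real API.
function applyPlainRules(text, rules) {
  return rules.reduce(
    (acc, [pattern, replacement, flags = "g"]) =>
      acc.replace(new RegExp(pattern, flags), replacement),
    text
  );
}

const htmlRules = [
  ["([^'])<(em|i)>([^'<\n]+)</ *\\2>([^'])", "$1''$3''$4", "gi"],
  ["([^'])<(strong|b)>([^'<\n]+)</ *\\2>([^'])", "$1'''$3'''$4", "gi"]
];

const sample = "A <b>bold</b> and an <i>italic</i> word.";
console.log(applyPlainRules(sample, htmlRules));
// prints: A '''bold''' and an ''italic'' word.
```

Because each pattern consumes one character on either side of the element, two elements separated by a single character are not both converted in one pass under this reading of the rules.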
Exponents
The well-known ANSI characters can be inserted easily:
mw.libs.WikiSyntaxTextMod.config.mod.plain = [
   ["m<sup>2</sup>",
    "m²"],
   ["m<sup>3</sup>",
    "m³"],
   …
];
However, for fragments and in music the <sup>2</sup> format is common and visually preferred; for measurement units like m² or cm³ or m/s² only the small exponent is generally meaningful.
With Unicode there are more superscript digits at 8304–8319, and algebraic signs as well as subscripts at 8320–8334 (H₂O, CO₂). However, it currently cannot be presumed that such codes are present in the font used by the reader for rendering. Therefore formulas should be built with <sup> or <sub> as shown.
Wikisyntax bullets separated by spaces from content
At the beginning of a line, bullet characters like * and others should be separated from the content by a space to make them easier to recognize:
mw.libs.WikiSyntaxTextMod.config.mod.plain = [
   ["(\n[*#:;]+)([^\n *#:;])",
    "$1 $2"],
   ["\n(:+) +\\{\\|",
    "\n$1{|"],
   …
];
The second rule re-establishes table indentation, which would otherwise not be interpreted correctly. In general it is not recommended to format tables this way.[1]
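The interplay of the two patterns can be checked with plain regular expressions. The sketch below writes them as RegExp literals and assumes they are applied one after the other with global matching, which is an assumption about the tool; `spaceBullets` is a hypothetical name.

```javascript
// The two patterns from the configuration, written as RegExp literals.
const bulletRules = [
  [/(\n[*#:;]+)([^\n *#:;])/g, "$1 $2"], // space after the bullet characters
  [/\n(:+) +\{\|/g, "\n$1{|"]            // undo the space before an indented {|
];

function spaceBullets(text) {
  return bulletRules.reduce(
    (acc, [pattern, replacement]) => acc.replace(pattern, replacement),
    text
  );
}

console.log(JSON.stringify(spaceBullets("\n*item\n#:nested\n:{| class=x\n* ok")));
// prints: "\n* item\n#: nested\n:{| class=x\n* ok"
```

Since both patterns start with a line break, a bullet on the very first line of the text is not touched in this sketch.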
Sometimes a compact format of definition lists is used like
;Term1:Meaning of 1
;Term2:something different with meaning 2
Formally this is correct, and for very brief terms and explanations it may be acceptable. However, human reading is made easier by the form
; Term1
: Meaning of 1
; Term2
: something different with meaning 2
which can be produced by
mw.libs.WikiSyntaxTextMod.config.mod.plain = [
   // $1 = term, $2 = first character of the meaning
   ["\n; *([^ :\n][^:\n]*) *: *([^ \n])",
    "\n; $1\n: $2"],
   …
];
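The effect on the compact example can be checked with a pattern that has one capture group for the term and one for the first character of the meaning. The sketch assumes global matching and uses the hypothetical name `expandDefinitions`:

```javascript
// Pattern with two capture groups: $1 = term, $2 = first character of the meaning.
const definitionPattern = /\n; *([^ :\n][^:\n]*) *: *([^ \n])/g;

function expandDefinitions(text) {
  return text.replace(definitionPattern, "\n; $1\n: $2");
}

const compact = "\n;Term1:Meaning of 1\n;Term2:something different with meaning 2";
console.log(expandDefinitions(compact));
```

Lines that are already in the expanded form do not match, because the term line contains no colon, so applying the rule repeatedly changes nothing further.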
Remarks
- ^ Actually, the table start {| is expected at the beginning of a line. Over the years an undocumented feature has made it possible to detect the table start even if only colons are used for indentation. It is better and easier to understand, for both humans and machines, to declare explicit CSS indentation with {| style="margin-left:2em" and to use {| at the beginning of a line only.