User:Monkbot/task 18: cosmetic cs1 template cleanup
Monkbot task 18 is a WP:COSMETICBOT task to clean up cs1|2 templates. It is authorized by Wikipedia:Bots/Requests for approval/Monkbot 18.
cs1|2 templates are transcluded in more than 4.7 million pages. Task 18 is constrained to the canonical list of cs1|2 templates and their more common redirects. Wrapper templates are not considered.
Task 18 does these things:
- deletes empty known and unknown named parameters
- deletes empty positional parameters
- deletes or repairs parameters inside html comments
- deletes non-contributing parameter/value pairs
- hyphenates cs1|2 parameters when they are written using the to-be-deprecated all-run-together form
- converts known language names assigned to |language= to their MediaWiki codes
This task strives to leave template style as it found it, so it does not change whitespace in and around template parameters. There is one exception to that whitespace rule: blank lines within vertically formatted cs1|2 templates are removed. Though executed by WP:AWB, this task does not perform AWB general fixes.
Task 18 operates only in mainspace. Any article that has {{bots|deny=monkbot 18}} shall be skipped. Additionally, task 18 maintains a list of articles that it will not edit. Contact the bot's operator to get an article added to the list.
Update 2020-12-15: A modified version of the bot (namespace restriction and empty positional parameter portions disabled) is created for use on templates that wrap the cs1|2 canonical templates. This version will be run manually only in template namespace and is identified as task 18a.
Update 2021-01-10: Because the bot depends on results from mw:Help:CirrusSearch to focus its work, an alternate form of the bot task (18b) has been created that focuses on templates that are not cs1|2 templates but are (in many but not all cases) wrapper templates that use a cs1|2 template. Task 18b runs separately from the main task 18.
delete empty parameters
template | initial count | count | search |
---|---|---|---|
{{cite arXiv}} | 76 | 0 | search |
{{cite AV media}} | 3,149 | 1,008 | search |
{{cite AV media notes}} | 1,845 | 9 | search |
{{cite bioRxiv}} | 6 | 1 | search |
{{cite book}} | 88,787 (timed out) | 7,296 (timed out) | search |
{{cite citeseerx}} | 0 | 0 | search |
{{cite conference}} | 1,246 | 11 | search |
{{cite encyclopedia}} | 8,393 | 4,886 | search |
{{cite episode}} | 2,882 | 297 | search |
{{cite interview}} | 636 | 191 | search |
{{cite journal}} | 111,144 (timed out) | 1,177 (timed out) | search |
{{cite magazine}} | 9,576 | 113 | search |
{{cite mailing list}} | 30 | 0 | search |
{{cite map}} | 2,620 | 1,835 | search |
{{cite news}} | 69,918 (timed out) | 569 (timed out) | search |
{{cite newsgroup}} | 41 | 0 | search |
{{cite podcast}} | 287 | 15 | search |
{{cite press release}} | 2,246 | 37 | search |
{{cite report}} | 2,325 | 79 | search |
{{cite serial}} | 54 | 0 | search |
{{cite sign}} | 111 | 3 | search |
{{cite speech}} | 101 | 2 | search |
{{cite ssrn}} | 0 | 0 | search |
{{cite techreport}} | 57 | 0 | search |
{{cite thesis}} | 2,354 | 93 | search |
{{cite web}} | 148,445 (timed out) | 72,171 (timed out) | search |
{{citation}} | 21,644 | 545 (timed out) | search |
totals | 477,973 | 90,338 | |
date | 2020-11-13 | 2021-02-03 | |
It is to be expected that some significant portion of cs1|2 transclusions will hold some number of superfluous parameters. One purpose of this task is to remove those superfluous parameters. Empty parameters in cs1|2 templates occur for various reasons: the parameter is no longer required or not required in 'this' citation; a skeleton template was copied from a template documentation page and only partially filled in; deprecated-parameter error messages were 'fixed' by removing the parameter's value; the template was reworked; the template was placed by automated or semiautomated tools that inserted empty parameters; there are, no doubt, other explanations. For cs1|2, empty parameters serve no purpose.
Editors commonly complain that inline references make it more difficult to read wikitext. Empty parameters in cs1|2 templates occupy space for no meaningful purpose and, as a consequence, contribute to the wikitext readability problem.
The table lists all of the cs1|2 templates. The second column states the approximate number of mainspace articles using the template where at least one parameter was empty. The counts were last updated on the date listed at the bottom of the table.
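The removal of empty named parameters can be sketched as follows. This is a minimal Python illustration, not the bot's C# code: the three patterns and the name `remove_empty_named` are assumptions made for the example, and the sketch presumes that wikilinks and nested templates (which also contain pipes) have already been hidden, as the bot's script does before it edits anything.

```python
import re

# Assumed patterns for this sketch; the bot's own patterns differ in detail.
# Empty named parameter at the end of a line, where the next line starts a new
# parameter or closes the template. The newline itself is kept.
EMPTY_AT_LINE_END = re.compile(r"\|[^=|{}]+=[ \t]*(?=[\r\n]\s*[|}])")
# Empty named parameter followed on the same line by a pipe or closing brace.
EMPTY_INLINE = re.compile(r"\|[^=|{}]+=[ \t]*(?=[|}])")
# One or more blank (or whitespace-only) lines left behind by the deletions.
BLANK_LINES = re.compile(r"(\n[ \t]*)+\n")


def remove_empty_named(template: str) -> tuple[str, int]:
    """Return the cleaned template and the number of parameters removed."""
    template, at_line_end = EMPTY_AT_LINE_END.subn("", template)
    template, inline = EMPTY_INLINE.subn("", template)
    template = BLANK_LINES.sub("\n", template)  # close up blank lines
    return template, at_line_end + inline


vertical = "{{cite web\n|title=T\n|date=\n|url=x\n}}"
print(remove_empty_named(vertical))
# ('{{cite web\n|title=T\n|url=x\n}}', 1)
```

A parameter whose name and `=` sit on one line with the value on the next line is not treated as empty here, because the lookahead requires the following line to begin with a pipe or closing brace.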
delete empty positional parameters
cs1|2 does not support positional (unnamed) parameters. A positional parameter is one that does not contain an assignment operator (=). Positional parameters that contain nothing or only whitespace have two forms:
||
|}}
Task 18 deletes the first pipe and any whitespace that exists between it and the succeeding pipe or brace.
Task 18 ignores positional parameters that have content other than whitespace because cs1|2 already emits a reader-facing "text ignored" error message for those positional parameters.
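A minimal Python sketch of that rule, assuming a single regular expression is enough for illustration (the pattern and the function name `delete_empty_positional` are hypothetical, not taken from the bot's C# script): a pipe, plus any whitespace after it, is deleted when the next character is another pipe or a closing brace.

```python
import re

# Hypothetical pattern: a pipe, optional whitespace, then (lookahead only)
# the succeeding pipe or closing brace, which is left in place.
EMPTY_POSITIONAL = re.compile(r"\|\s*(?=[|}])")


def delete_empty_positional(template: str) -> tuple[str, int]:
    """Return the cleaned template and the number of deletions made."""
    return EMPTY_POSITIONAL.subn("", template)


print(delete_empty_positional("{{cite web |title=T || url=x |}}"))
# ('{{cite web |title=T | url=x }}', 2)
```

Positional parameters with real content are untouched because the pipe in front of them is followed by text, not by another pipe or brace.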
delete empty named parameters in html comments
To cs1|2, parameters that are hidden with html comment markup (<!--content-->) can appear to be empty positional parameters because MediaWiki removes html comments and their content before handing the template to Module:Citation/CS1. cs1|2 templates entirely within html comment markup are ignored.
Empty parameters that have this form are deleted along with the html comment markup:
|<!-- param= -->
Similarly, empty parameters inside html comment markup are deleted; parameters with assigned values are retained:
<!-- |param=value |param= -->
→<!-- |param=value -->
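Both deletions can be sketched in Python as follows. The patterns and the name `delete_empty_commented` are assumptions for illustration only; the bot's C# script uses its own patterns, which are not reproduced on this page.

```python
import re

# Assumed patterns for this sketch.
# |<!-- param= -->  : a piped comment holding nothing but an empty named parameter
PIPED_EMPTY = re.compile(r"\|\s*<!--\s*[^=|<>]+=\s*-->")
# any html comment; group 1 is the comment's content
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# an empty |param= inside a comment, followed by another pipe or the end of the content
EMPTY_NAMED = re.compile(r"\|[^=|]+=\s*(?=\||$)")


def delete_empty_commented(template: str) -> str:
    template = PIPED_EMPTY.sub("", template)

    def clean(match: re.Match) -> str:
        inner = EMPTY_NAMED.sub("", match.group(1))
        # a comment left with nothing but whitespace is dropped entirely
        return f"<!--{inner}-->" if inner.strip() else ""

    return COMMENT.sub(clean, template)


print(delete_empty_commented("{{cite web |title=T <!-- |url=x |date= -->}}"))
# {{cite web |title=T <!-- |url=x -->}}
```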
repair parameters in html comments
To cs1|2, parameters that are hidden with html comment markup (<!--content-->) can appear to be empty positional parameters because MediaWiki removes html comments and their content before handing the template to Module:Citation/CS1. cs1|2 templates entirely within html comment markup are ignored.
Parameters that have this form are repaired by moving the preceding pipe into the html comment markup:
|<!-- param=value -->
→<!-- |param=value -->
A similar configuration is commonly found. This form may or may not look like an empty positional parameter to cs1|2 depending on the surrounding wikimarkup. This form is repaired for the purposes of consistency:
|<!-- param=value |-->
→<!-- |param=value --> |
Comments that do not have an assignment operator (=) but which occupy all of the space between a pair of pipes or a pipe and the template's closing brace are repaired by moving the leading pipe inside the html comment:
|<!-- plain text html comment --> |
→<!-- |plain text html comment --> |
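The three repairs can be sketched in Python. This is an illustration under stated assumptions, not the bot's implementation: the patterns, the `REPAIRS` table and the name `repair_commented` are all hypothetical, and the trailing-pipe form is handled first so that the other two patterns never see it.

```python
import re

# Assumed (pattern, replacement) pairs, applied in this order.
REPAIRS = [
    # |<!-- param=value |-->   to   <!-- |param=value --> |
    (re.compile(r"\|(\s*<!--\s*)([^|<>]+=[^|<>]*?)\s*\|\s*-->"), r"\1|\2 --> |"),
    # |<!-- param=value -->    to   <!-- |param=value -->
    (re.compile(r"\|(\s*<!--\s*)([^=|<>]+=\s*[^\s|<>][^|<>]*?)(\s*-->)"), r"\1|\2\3"),
    # |<!-- plain text --> |   to   <!-- |plain text --> |
    (re.compile(r"\|(\s*<!--\s*)([^=|<>]+?)(\s*-->)(?=\s*(?:\||\}\}))"), r"\1|\2\3"),
]


def repair_commented(template: str) -> tuple[str, int]:
    """Return the repaired template and the number of repairs made."""
    total = 0
    for pattern, replacement in REPAIRS:
        template, count = pattern.subn(replacement, template)
        total += count
    return template, total


print(repair_commented("{{cite web |title=T |<!-- date=2020 -->}}"))
# ('{{cite web |title=T <!-- |date=2020 -->}}', 1)
```

After the trailing-pipe repair, the pipe that now follows the comment may itself be an empty positional parameter; in this sketch that is left for a separate empty-positional-parameter pass.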
delete non-contributing parameters
There are several parameters and specific parameter/value pairs that, under certain conditions, do not contribute to the rendering of the template. Task 18 deletes these:
- |deadurl= – is routinely added (with the assigned value y) to cs1|2 templates created by the unmaintained tool ReFill. Because |deadurl= is no longer supported by cs1|2, when it is present in cs1|2 templates, cs1|2 emits the error message "Unknown parameter |deadurl= ignored (|url-status= suggested)" and adds the article to Category:Pages with citations using unsupported parameters, which gnomes keep mostly empty. When task 18 encounters |deadurl=y (or the similar |deadurl=yes and |deadurl=true and their hyphenated forms), the parameter is deleted. Replacement is not necessary because |deadurl=y (when it was supported) was redundant to the cs1|2 default state. See the |url-status= documentation.
- |mode= – this parameter accepts one of two keywords: cs1 or cs2. The purpose is to direct the template to render in the specified style. For example, {{citation}} is a native cs2 template. {{citation}} will render as if it were a cs1 template when given |mode=cs1; similarly, cs1 templates render in cs2 style when given |mode=cs2. Setting |mode= to the template's native style does not change the rendering. Task 18 deletes |mode= parameters when they match the template's native style. See the |mode= documentation.
- |name-list-style= – controls how name lists are rendered. When assigned one of the values amp, ampersand, and, &, or serial, cs1|2 inserts an ampersand or the word 'and' between the last two names of a name list. When none of the name lists have two or more names, the parameter does not change the rendering. When assigned the value vanc, name lists are rendered in Vancouver style. |name-list-style=vanc requires |last= and |first= (or aliases); without |first= aliases, the parameter does not change the rendering. See the |name-list-style= documentation.
- |<name-list>-link=, |<name-list>-mask= [a] – these wikilink or mask a <name> in an associated <name-list>. When there is no associated |<name>= (for example, |author-mask14=2 but no |author14=), these parameters do not change the rendering.
- |no-pp= – suppresses the p. and pp. page annotation when either of |page= or |pages= has a value; when there is no |page= or |pages= parameter, this parameter does not change the rendering. When found in {{citation}} templates where |journal= has an assigned value, or when found in {{cite journal}}, |no-pp= has no effect because pagination in journal citations does not use the p. and pp. annotation.
- |orig-year= – a free-form parameter that accompanies |year=, |date= or, when neither of those is present, |publication-date=. When none of those are present, |orig-year= is not rendered.
- |postscript= – specifies the rendered template's terminal punctuation. Default terminal punctuation for cs1 is a dot; for cs2, terminal punctuation is omitted. |postscript=. in cs1 templates and |postscript=none in cs2 templates do not change the rendering. The form |postscript=<!--none--> is ignored by cs1|2 templates because the html comment makes |postscript= an empty parameter.
- |ref=harv – At the 2020-04-18 update to Module:Citation/CS1 (this edit), |ref=harv was internally defined as the default state, so the explicit |ref=harv does not change the template's function. Use of this parameter/value pair is tracked in Category:CS1 maint: ref=harv. See the |ref= documentation.
- |url-status= – like its predecessors |dead-url= and |deadurl=, |url-status= is a display control parameter that determines how a cs1|2 template will be rendered when that template has |archive-url= and |archive-date=. When |archive-url= and |archive-date= are empty or omitted, |url-status= is non-functional. See the |url-status= documentation.
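As one worked case, the |mode= rule can be sketched in Python. The sketch assumes that a template named `citation` is native cs2 and that any template whose name begins with `cite ` is native cs1; redirects and wrapper templates are not considered, and the patterns and function names are hypothetical rather than taken from the bot's script.

```python
import re
from typing import Optional

# Assumed patterns for this sketch.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|}]+?)\s*[|}]")
MODE_PARAM = re.compile(r"\|\s*mode\s*=\s*(cs1|cs2)\s*(?=[|}])")


def native_style(template: str) -> Optional[str]:
    """Guess the native style from the template name (simplifying assumption)."""
    match = TEMPLATE_NAME.match(template)
    name = match.group(1).lower() if match else ""
    if name == "citation":
        return "cs2"
    if name.startswith("cite "):
        return "cs1"
    return None


def remove_native_mode(template: str) -> tuple[str, int]:
    """Delete |mode= only when its value matches the template's native style."""
    native = native_style(template)
    removed = 0

    def replace(match: re.Match) -> str:
        nonlocal removed
        if match.group(1) != native:
            return match.group(0)  # mode overrides the native style: keep it
        removed += 1
        return ""

    return MODE_PARAM.sub(replace, template), removed


print(remove_native_mode("{{cite book |title=T |mode=cs1 |year=2000}}"))
# ('{{cite book |title=T |year=2000}}', 1)
```

The other parameters in the list follow the same shape: test the condition under which the parameter cannot affect rendering, and delete it only then.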
hyphenate cs1|2 parameter names
parameter | initial count | count | search |
---|---|---|---|
|accessdate= | 2,804,881 | 1,841,029 | search |
|archivedate= | 837,655 | 444,873 | search |
|archiveurl= | 834,802 | 444,601 | search |
|authorlink= | 268,232 | 85,737 | search |
|authormask= | 4 | 0 | search |
|booktitle= | 3,345 | 28 | search |
|chapterurl= | 17,289 | 250 | search |
|conferenceurl= † | 427 | 0 | search |
|contributionurl= † | 40 | 0 | search |
|displayauthors= | 1 | 1 | search |
|editorlink= | 1 | 0 | search |
|editormask= | 1 | 0 | search |
|episodelink= | 2,539 | 2 | search |
|laydate= † | 1,049 | 13 | search |
|laysource= † | 1,091 | 13 | search |
|layurl= † | 538 | 2 | search |
|mailinglist= | 379 | 76 | search |
|mapurl= | 81 | 31 | search |
|nopp= | 2,902 | 35 | search |
|notracking= | 1 | 1 | search |
|origyear= | 44,022 | 9,841 | search |
|publicationdate= | 542 | 75 | search |
|publicationplace= | 189 | 0 | search |
|sectionurl= † | 65 | 4 | search |
|serieslink= | 5,042 | 8 | search |
|seriesno= † | 1,075 | 13 | search |
|subjectlink= | 0 | 0 | search |
|timecaption= † | 39 | 0 | search |
|titlelink= † | 3,518 | 93 | search |
|transcripturl= | 910 | 1 | search |
|author[0-9]+link= | 648 (timed out) | 126 (timed out) | search |
|author[0-9]+mask= | 0 (timed out) | 0 (timed out) | search |
|editor[0-9]+link= | 20 (timed out) | 0 (timed out) | search |
|editor[0-9]+mask= | 0 (timed out) | 0 (timed out) | search |
|subject[0-9]+link= | 0 (timed out) | 0 (timed out) | search |
totals | 4,831,328 | 2,826,853 | |
date | 2020-11-13 | 2021-02-03 | |
† parameters that are currently deprecated or for which support has been withdrawn
Because cs1|2 is an amalgam of several individually developed templates, it acquired a variety of parameter-name styles: |lowercase=, |Capitalized=, |camelCase=, |underscore_separated=, |space separated=, |hyphen-separated=, |allruntogether=. The |Capitalized=, |camelCase=, |underscore_separated=, and |space separated= parameter name styles have all been deprecated and support for these styles withdrawn in favor of the |lowercase= and |hyphen-separated= forms as a result of this RfC. For parameter names that are multiword, cs1|2 is gradually shifting to prefer the hyphenated form. The table lists the all-run-together form with the approximate number of articles that transclude cs1|2 templates using these parameters. Task 18 replaces the all-run-together forms of these parameters with the hyphenated forms.
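The replacement can be sketched in Python along the lines of the `hyphenate` function in the script at the end of this page. The map below is a small illustrative subset chosen for the example, and the lookahead for `=` is an assumption added here so that only complete parameter names are matched; the contents of the bot's real `hyphenate_map` are not shown on this page.

```python
import re

# Illustrative subset only. Keys are regex fragments; in the enumerated form
# the enumerator is captured (group 2, because group 1 is the leading pipe)
# and carried into the hyphenated replacement.
HYPHENATE_MAP = {
    r"accessdate": r"access-date",
    r"archivedate": r"archive-date",
    r"archiveurl": r"archive-url",
    r"authorlink": r"author-link",
    r"author(\d+)link": r"author\2-link",
}


def hyphenate(template: str) -> tuple[str, int]:
    """Return the template with hyphenated names and the number of renames."""
    total = 0
    for key, value in HYPHENATE_MAP.items():
        pattern = re.compile(r"(\|\s*)" + key + r"(?=\s*=)")
        template, count = pattern.subn(r"\g<1>" + value, template)
        total += count
    return template, total


print(hyphenate("{{cite web |url=x |accessdate=2020-01-01 |author2link=Jane Doe}}"))
# ('{{cite web |url=x |access-date=2020-01-01 |author2-link=Jane Doe}}', 2)
```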
The |accessdate= and |authorlink= parameter names have been included in AWB's genfixes since this edit. There has been some pushback against using genfixes to normalize cs1|2 parameter name style. See Wikipedia talk:AutoWikiBrowser/Archive 32 § Citation parameter renaming.
There are commonly used tools that continue to insert the all-run-together names. An interface protected edit request has been submitted to normalize all-run-together parameter names emitted by WP:RefToolbar. See MediaWiki talk:RefToolbarConfig.js § normalize parameter names.
convert language names to codes
As a courtesy to editors at other-language wikis, it is desirable to use the MediaWiki language codes instead of English-language language names in |language= so that cs1|2 can render language names using the other wiki's language without the need for an editor to make a translation. Task 18 will replace language names with the associated MediaWiki code. See the |language= documentation and the list of (mostly) English-language language names and associated codes.
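A minimal Python sketch of the conversion, assuming a single language name per |language= value and a case-insensitive lookup (both assumptions of this example). The four-entry map stands in for the bot's much larger `language_map`, whose contents are not shown on this page.

```python
import re

# Tiny illustrative map: lower-cased English language name to MediaWiki code.
LANGUAGE_MAP = {"french": "fr", "german": "de", "japanese": "ja", "spanish": "es"}

# group 1: '|language=' with its whitespace; group 2: the value; group 3: trailing whitespace
LANGUAGE_PARAM = re.compile(r"(\|\s*language\s*=\s*)([^|}]*?)(\s*)(?=[|}])")


def convert_language(template: str) -> tuple[str, int]:
    """Replace known language names with codes; unknown values are left alone."""
    converted = 0

    def replace(match: re.Match) -> str:
        nonlocal converted
        code = LANGUAGE_MAP.get(match.group(2).lower())
        if code is None:
            return match.group(0)
        converted += 1
        return match.group(1) + code + match.group(3)

    return LANGUAGE_PARAM.sub(replace, template), converted


print(convert_language("{{cite web |title=T |language=French |url=x}}"))
# ('{{cite web |title=T |language=fr |url=x}}', 1)
```

Whitespace around the value is captured and written back unchanged, in keeping with the task's aim of leaving template style as it was found.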
edit summary messaging
For each edit made, task 18 creates an edit summary detailing what was done. For example, this edit summary from a test (no save) of Ida B. Wells:
- Task 18 (cosmetic) (dev test): eval 146 templates: del empty params (1251×); hyphenate params (4×); del pos params (1×); del |ref=harv (4×);
Each summary always begins with the 'Task 18 (cosmetic) (dev test): eval n templates:' leader where n is the total number of cs1|2 templates and recognized redirects that task 18 evaluated. The remaining portion of the edit summary is assembled from one or more of these:
- del cmtd params (n×); – deleted commented parameters
- rep cmtd params (n×); – repaired commented parameters
- del empty params (n×); – deleted empty parameters
- hyphenate params (n×); – hyphenated parameter names
- del pos params (n×); – deleted empty positional parameters
- del |ref=harv (n×); – deleted |ref=harv
- del |mode= (n×); – deleted |mode=
- del |postscript= (n×); – deleted |postscript=
- del |url-status= (n×); – deleted |url-status=
- del |deadurl= (n×); – deleted |deadurl=
- del |no-pp= (n×); – deleted |no-pp=
- del |orig-year= (n×); – deleted |orig-year=
- del |name-list-style= (n×); – deleted |name-list-style=
- del |<name-list>-mask= (n×); – deleted |<name-list>-mask=
- del |<name-list>-link= (n×); – deleted |<name-list>-link=
- cvt lang vals (n×); – converted language values
- skip (n×); – during development of task 18, it found malformed cs1|2 templates that contain the opening {| or closing |} that belongs to wikitables. An example of this is found in this version of Satish Shah, which is missing the closing }} for this reference. Left unchecked, task 18 would see the table content as part of that broken {{cite news}} template and delete many of the double pipes used to separate cells in the wikitable. This particular garbage is detected and skipped, but editors are endlessly clever when it comes to breaking cs1|2 templates.
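The wikitable check behind the skip entry can be sketched in Python. It follows the approach of the `hide` function in the script at the end of this page: table markup is swapped for sentinel strings before any template is examined, and a template that contains a sentinel is left alone. The function names are hypothetical; the sentinel strings are the ones the script uses.

```python
import re

OPEN_TABLE = "__0T4BL3__"   # stands in for {|
CLOSE_TABLE = "__CT4BL3__"  # stands in for |}


def hide_table_markup(text: str) -> str:
    text = text.replace("{|", OPEN_TABLE)
    # |} closes a table only when it is not the start of |}} (a template's end)
    return re.sub(r"\|\}(?!\})", CLOSE_TABLE, text)


def unhide_table_markup(text: str) -> str:
    return text.replace(OPEN_TABLE, "{|").replace(CLOSE_TABLE, "|}")


def holds_table_markup(template: str) -> bool:
    """True when a (hidden) template has swallowed wikitable markup."""
    return OPEN_TABLE in template or CLOSE_TABLE in template


broken = '{{cite news |title=T\n{| class="wikitable"\n| a || b\n|}\n}}'
print(holds_table_markup(hide_table_markup(broken)))
# True
```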
AWB truncates long edit summaries (phab:T199347) so the messaging is necessarily terse in an attempt to list everything accomplished by task 18 in the summary. When that is not possible, task 18 truncates its summary and adds an ellipsis.
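The truncation rule is small enough to restate as a Python port of the script's `summary_concat` function. The 247-character limit is the one named in the script's comments; the constant name `LIMIT` is introduced here for the sketch.

```python
LIMIT = 247  # limit used by the script's summary_concat


def summary_concat(summary: str, text: str) -> str:
    """Append text to summary, or an ellipsis when text would not fit."""
    if "..." in summary:  # already truncated: add nothing more
        return summary
    if len(summary) + len(text) + 3 <= LIMIT:  # +3 reserves room for a later ellipsis
        return summary + text
    return summary + "..."


summary = "Task 18 (cosmetic):"
summary = summary_concat(summary, " eval 146 templates:")
summary = summary_concat(summary, " del empty params (1251×);")
print(summary)
# Task 18 (cosmetic): eval 146 templates: del empty params (1251×);
```

Because three characters are always held in reserve, a summary built this way never exceeds the limit, even after the ellipsis is appended.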
known issues
When an article transcludes a named reference that exactly matches a named reference in the article, and task 18 edits the article's named reference, MediaWiki will emit this error message:
- The named reference "$1" was defined multiple times with different content (help page).
Task 18 cannot know that a cs1|2 template that it edits is mirrored in a transcluded template. If you know of articles where this is a problem, give a list of those articles to the bot's operator so that the bot can be instructed to skip those articles.
notes
- a. ^ The |<name-list>-link= and |<name-list>-mask= parameters are:
|author-link=
|authorn-link=
|author-linkn=
|contributor-link=
|contributorn-link=
|contributor-linkn=
|editor-link=
|editorn-link=
|editor-linkn=
|interviewer-link=
|interviewern-link=
|interviewer-linkn=
|translator-link=
|translatorn-link=
|translator-linkn=
|author-mask=
|authorn-mask=
|author-maskn=
|contributor-mask=
|contributorn-mask=
|contributor-maskn=
|editor-mask=
|editorn-mask=
|editor-maskn=
|interviewer-mask=
|interviewern-mask=
|interviewer-maskn=
|translator-mask=
|translatorn-mask=
|translator-maskn=
script
// This script does cosmetic work:
// repairs / removes commented parameters
// deletes empty parameters
// replaces un-hyphenated parameters with their hyphenated forms
// deletes |ref=harv
// replaces language names in |language= with their MediaWiki ISO 639-like codes
//
// wikitext searches:
// html comments:
// hastemplate:"Module:Citation/CS1" insource:/\{ *[Cc]it[ae][^\}]*\| *\<!\-\-+ *[^\>]+\-\-+\>/
// mode:
// hastemplate:Citation insource:/\{ *[Cc]itation[^\}]*\| *mode *= *cs2/
// hastemplate:"Module:Citation/CS1" insource:/\{ *[Cc]ite[^\}]*\| *mode *= *cs1/
// deadurl:
// hastemplate:"Module:Citation/CS1" insource:"deadurl"
// postscript:
// hastemplate:Citation insource:/\{ *[Cc]itation[^\}]*\| *postscript *= *none/
// hastemplate:"Module:Citation/CS1" insource:/\{ *[Cc]ite[^\}]*\| *postscript *= *\./
// hastemplate:"Module:Citation/CS1" insource:/\{ *[Cc]it[ae][^\}]*\| *postscript *= *\<!\-\- *[Nn][Oo][Nn][Ee]/
//
// category:
// CS1 errors: empty unknown parameters
//
public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
{
    if (0 != wikiNamespace ) // if <wikiNamespace> is not mainspace
    {
        if (-1 == ArticleTitle.IndexOf ("User:Trappist the monk")) // don't skip pages in my own user space
        {
            Skip = true;
            Summary = @"article not in mainspace: " + wikiNamespace;
            return ArticleText; // abandon this edit
        }
    }
    if (skip_map.ContainsKey (ArticleTitle)) // if <ArticleTitle> is in the skip list
    {
        Skip = true;
        Summary = @"article in skip list";
        return ArticleText; // abandon this edit
    }
    Skip = false; // assume that something will be changed
    // Summary = "[[User:Monkbot/task 18|Task 18 (cosmetic)]] (dev test):"; // uses redirect to User:Monkbot/task 18: cosmetic cs1 template cleanup
    // Summary = "[[User:Monkbot/task 18|Task 18 (cosmetic)]] (BRFA trial):";
    Summary = "[[User:Monkbot/task 18|Task 18 (cosmetic)]]:";
    int total_count = 0; // total number of cs1|2 templates
    int skip_count = 0; // skip count incremented when cite template holds wikitable markup
    int comment_deleted = 0;
    int comment_repaired = 0;
    int empty_positional_deleted = 0;
    int params_removed_count = 0;
    int params_hyphenated_count = 0;
    int ref_harv_count = 0;
    int url_status_count = 0;
    int deadurl_count = 0;
    int nopp_count = 0;
    int orig_count = 0;
    int mode_count = 0;
    int ps_count = 0;
    int nls_count = 0;
    int mask_count = 0;
    int link_count = 0;
    int language_count = 0;
    //---------------------------< B E G I N >--------------------------------------------------------------------
    ArticleText = hide (ArticleText, IS_CS1); // hide all templates except cs1|2 and hide wikilinks
    //---------------------------< M A I N >----------------------------------------------------------------------
    //
    // this is main
    //
    if (Regex.Match (ArticleText, template_pattern).Success)
        ArticleText = Regex.Replace (ArticleText, template_pattern,
            delegate(Match match)
            {
                string template = match.Groups[0].Value; // this will be returned if no changes
                if ((-1 != template.IndexOf("__0T4BL3__")) || (-1 != template.IndexOf("__CT4BL3__"))) // if table markup found in the cite template abandon
                {
                    skip_count++;
                    return template;
                }
                total_count++; // bump total number of cs1|2 templates tally
                template = counted_replace (template, comment_params_1_pattern, "$2$1$3$5$4", ref comment_repaired); // |<!-- param=value |--> to <!-- |param=value -->|
                template = counted_replace (template, comment_params_2_pattern, "$2$1$3", ref comment_repaired); // |<!-- param=value --> to <!-- |param=value -->
                template = counted_replace (template, comment_params_3_pattern, "", ref params_removed_count); // |<!-- param= --> to (delete)
                template = counted_replace (template, comment_params_4_pattern, "$1", ref params_removed_count); // <!-- |param=value |param= --> to <!-- |param=value -->
                template = counted_replace (template, comment_params_5_pattern, "$2$1$3", ref comment_repaired); // |<!-- plain text without = or | or } --> to <!-- |plain text -->
                template = counted_replace (template, empty_positional_parameter_nl, "$1", ref empty_positional_deleted); // || or |} to | or } (newline variant)
                template = counted_replace (template, empty_positional_parameter, "$1", ref empty_positional_deleted); // || or |} to | or }
                template = counted_replace (template, ref_harv_pattern, "", ref ref_harv_count); // |ref=harv to (delete)
                template = mode_remove (template, ref mode_count);
                template = postscript_remove (template, ref ps_count);
                template = empty_param_remove (template, ref params_removed_count); // this is a prerequisite function for these:
                template = url_status_remove (template, ref url_status_count, ref deadurl_count);
                template = hyphenate (hyphenate_map, template, ref params_hyphenated_count); // this is a prerequisite function for these:
                template = nopp_remove (template, ref nopp_count);
                template = orig_year_remove (template, ref orig_count);
                template = name_list_style_remove (template, ref nls_count);
                template = name_link_mask_remove (template, ref mask_count, @"mask");
                template = name_link_mask_remove (template, ref link_count, @"link");
                template = language (language_map, template, ref language_count);
                while (Regex.Match (template, empty_comment_pattern).Success) // cleanup; delete <!-- --> (empty or only whitespace)
                    template = Regex.Replace (template, empty_comment_pattern, "");
                return template;
            });
    //---------------------------< F I N I S H >------------------------------------------------------------------
    ArticleText = unhide (ArticleText); // unhide all that is hidden
    if (0 != total_count) // build our edit summary
    {
        Summary = summary_concat (Summary, " eval " + total_count + " template" + (1 == total_count ? ":" : "s:"));
        if (0 != comment_deleted)
            Summary = summary_concat (Summary, " del cmtd params (" + comment_deleted + "×);");
        if (0 != comment_repaired)
            Summary = summary_concat (Summary, " rep cmtd params (" + comment_repaired + "×);");
        if (0 != params_removed_count)
            Summary = summary_concat (Summary, " del empty params (" + params_removed_count + "×);");
        if (0 != params_hyphenated_count)
            Summary = summary_concat (Summary, " hyphenate params (" + params_hyphenated_count + "×);");
        if (0 != empty_positional_deleted)
            Summary = summary_concat (Summary, " del pos params (" + empty_positional_deleted + "×);");
        if (0 != ref_harv_count)
            Summary = summary_concat (Summary, " del |ref=harv (" + ref_harv_count + "×);");
        if (0 != mode_count)
            Summary = summary_concat (Summary, " del |mode= (" + mode_count + "×);");
        if (0 != ps_count)
            Summary = summary_concat (Summary, " del |postscript= (" + ps_count + "×);");
        if (0 != url_status_count)
            Summary = summary_concat (Summary, " del |url-status= (" + url_status_count + "×);");
        if (0 != deadurl_count)
            Summary = summary_concat (Summary, " del |deadurl= (" + deadurl_count + "×);");
        if (0 != nopp_count)
            Summary = summary_concat (Summary, " del |no-pp= (" + nopp_count + "×);");
        if (0 != orig_count)
            Summary = summary_concat (Summary, " del |orig-year= (" + orig_count + "×);");
        if (0 != nls_count)
            Summary = summary_concat (Summary, " del |name-list-style= (" + nls_count + "×);");
        if (0 != mask_count)
            Summary = summary_concat (Summary, " del |<name-list>-mask= (" + mask_count + "×);");
        if (0 != link_count)
            Summary = summary_concat (Summary, " del |<name-list>-link= (" + link_count + "×);");
        if (0 != language_count)
            Summary = summary_concat (Summary, " cvt lang vals (" + language_count + "×);");
        if (0 != skip_count)
            Summary = summary_concat (Summary, " skip (" + skip_count + "×);");
    }
    else
    {
        Skip = true;
        Summary = Summary + " no templates to evaluate;";
    }
    System.IO.StreamWriter sw;
    string log_file = @"Z:\Wikipedia\AWB\Monkbot_tasks\Task18_" + DateTimeOffset.Now.ToString("u").Substring (0, 10) + @".csv";
    if (!System.IO.File.Exists (log_file))
    { // for csv import, pipe is string delimiter here because not legal wp title char
        sw = System.IO.File.AppendText (log_file);
        sw.WriteLine ("|Article|, |evaluated|, |deleted comment|, |repaired comment|, |empty params|, |hyphenated|, |empty pos params|, |ref=harv|, |mode|, |postscript|, "
            + "|url-status|, |deadurl|, |no-pp|, |orig-year|, |name-list-style|, |<name-list>-mask|, |<name-list>-link|, |language|, |skipped|");
        sw.Close();
    }
    if (Skip)
    {
        sw = System.IO.File.AppendText (log_file);
        sw.WriteLine ("|" + ArticleTitle + "|," + total_count);
        sw.Close();
    }
    else
    {
        sw = System.IO.File.AppendText (log_file);
        sw.WriteLine ("|" + ArticleTitle + "|," + total_count + "," + comment_deleted + "," + comment_repaired + "," + params_removed_count + "," + params_hyphenated_count
            + "," + empty_positional_deleted + "," + ref_harv_count + "," + mode_count + "," + ps_count + "," + url_status_count + "," + deadurl_count
            + "," + nopp_count + "," + orig_count + "," + nls_count + "," + mask_count + "," + link_count + "," + language_count + "," + skip_count);
        sw.Close();
    }
    return ArticleText;
}
//===========================<< S U P P O R T >>==============================================================
//---------------------------< H I D E >----------------------------------------------------------------------
//
// HIDE TEMPLATES: find templates that are not <dont_hide>; replace the opening {{ with __0P3N__, the closing }}
// with __CL0S3__, and internal | (pipes) with __P1P3__
//
// single curly braces in urls and other parameter values can confuse other regex in this code so replace {
// with __0CU!21Y__ and } with __CCU!21Y__
//
private string hide (string ArticleText, string dont_hide)
{
    string pattern = @"\{\{(?!\s*" + dont_hide + @")[^\{\}]*\}\}";
    if (Regex.Match (ArticleText, pattern).Success)
    {
        ArticleText = Regex.Replace(ArticleText, pattern,
            delegate(Match match)
            {
                string fixed_template; // a hidden template is assembled here
                string raw_template = match.Groups[0].Value; // the whole template
                pattern = @"\{\{"; // hide the opening {{
                fixed_template = Regex.Replace (raw_template, pattern, "__0P3N__");
                pattern = @"\}\}"; // hide the closing }}
                fixed_template = Regex.Replace (fixed_template, pattern, "__CL0S3__");
                pattern = @"\|"; // and hide the pipes
                fixed_template = Regex.Replace (fixed_template, pattern, "__P1P3__");
                return fixed_template;
            });
    }
    pattern = @"(\<!\-{2,}\s*[^\>\|\}]*)\{\{(\s*" + dont_hide + @"[^\}]*)\}\}([^\>]*\-{2,}\>)"; // <!-- {{citx...}} -->
    ArticleText = Regex.Replace(ArticleText, pattern, "$1__0P3N__$2__CL0S3__$3");
    pattern = @"\{\|"; // open table markup
    ArticleText = Regex.Replace(ArticleText, pattern, "__0T4BL3__");
    pattern = @"\|\}(?!\})"; // close table markup (but not the |}} that ends a template)
    ArticleText = Regex.Replace(ArticleText, pattern, "__CT4BL3__");
    pattern = @"([^\{])\{([^\{])"; // single opening curly brace
    ArticleText = Regex.Replace(ArticleText, pattern, "$1__0CU!21Y__$2");
    pattern = @"([^\}])\}([^\}])"; // single closing curly brace
    ArticleText = Regex.Replace(ArticleText, pattern, "$1__CCU!21Y__$2");
    pattern = @"\[\[(?![Ff]ile|[Ii]mage)([^\|\]]+)\|([^\]]+)\]\]"; // HIDE complex wikilinks: [[article title|label]] to __WL1NK_O__article title__P1P3__label__WL1NK_C__
    ArticleText = Regex.Replace(ArticleText, pattern, "__WL1NK_O__$1__P1P3__$2__WL1NK_C__"); // [[File: with wikilinks inside can be confusing
    pattern = @"\[\[([^\]]+)\]\]"; // HIDE simple wikilinks: [[article title]] to __WL1NK_O__article title__WL1NK_C__
    ArticleText = Regex.Replace(ArticleText, pattern, "__WL1NK_O__$1__WL1NK_C__");
    return ArticleText;
}
//---------------------------< U N H I D E >------------------------------------------------------------------
//
// UNHIDE TEMPLATES: find templates and wikilinks that are hidden; replace the 'hide' keywords with the
// appropriate wiki markup
//
private string unhide (string ArticleText)
{
    ArticleText = Regex.Replace(ArticleText, @"__WL1NK_O__", "[["); // UNHIDE: replace __WL1NK_O__ with [[
    ArticleText = Regex.Replace(ArticleText, @"__WL1NK_C__", "]]"); // UNHIDE: replace __WL1NK_C__ with ]]
    ArticleText = Regex.Replace(ArticleText, @"__P1P3__", "|"); // UNHIDE: replace __P1P3__ with |
    ArticleText = Regex.Replace(ArticleText, @"__0T4BL3__", "{|"); // UNHIDE: replace __0T4BL3__ with {|
    ArticleText = Regex.Replace(ArticleText, @"__CT4BL3__", "|}"); // UNHIDE: replace __CT4BL3__ with |}
    ArticleText = Regex.Replace(ArticleText, @"__0CU!21Y__", "{"); // UNHIDE: replace __0CU!21Y__ with {
    ArticleText = Regex.Replace(ArticleText, @"__CCU!21Y__", "}"); // UNHIDE: replace __CCU!21Y__ with }
    ArticleText = Regex.Replace(ArticleText, @"__0P3N__", "{{"); // UNHIDE: replace __0P3N__ with {{
    ArticleText = Regex.Replace(ArticleText, @"__CL0S3__", "}}"); // UNHIDE: replace __CL0S3__ with }}
    return ArticleText;
}
//---------------------------< S U M M A R Y _ C O N C A T >--------------------------------------------------
//
// concatenates text onto an existing edit summary string, limiting the string to a length of no more than 247
// characters. When <summary> appended with <text> would be longer than the allowed 247 character limit, this
// function replaces <text> with an ellipsis. Once an ellipsis is added, no more <text> can be added to <summary>
//
private string summary_concat (string summary, string text)
{
    if (0 <= summary.IndexOf ("...")) // if ellipsis already present in <summary>, abandon
        return summary;
    if (247 >= (summary.Length + text.Length + 3)) // if <text> fits within the 247 char limit (+ 3 to make sure we can add ellipsis later if necessary)
        return summary + text; // append <text> to <summary> and done
    return summary + "..."; // <text> does not fit; append ellipsis instead
}
//---------------------------< C O U N T E D _ R E P L A C E >------------------------------------------------
//
// common function to replace <pattern> with <replace> and bump <count> until no more <pattern>
//
private string counted_replace (string template, string pattern, string replace, ref int count)
{
    Regex rgx = new Regex (pattern); // make a new regex from <pattern>
    while (rgx.IsMatch (template)) // look for <pattern> in <template>
    {
        template = rgx.Replace (template, replace, 1); // replace one copy of <pattern> with <replace>
        count++; // bump the counter
    }
    return template;
}
//---------------------------< E M P T Y _ P A R A M _ R E M O V E >------------------------------------------
//
// This function removes all empty named parameters from a template, attempting to leave what remains the same form.
//
// this is a multi-step process that attempts to handle most of the vagaries of how templates are written in
// wikitext. In general there are three basic 'styles': horizontal – all parameters written on a single
// line of text, vertical – all parameters written singly one-to-a-line, and a mix of the two – multiple lines
// where each has one or more parameters.
//
// 1. where the parameter name & '=' are on one line and the value on a following line, put the value on the same line as the '='
// 2. for mixed, when empties are followed by new line; remove the empty but leave the newline
// 3. for any, empties are followed by pipe or closing }; remove the empty but leave the | or }
// 4. the preceding steps can leave blank lines; remove the blank lines
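//
// illustration (hypothetical template, not taken from an article): for the horizontal template
//    {{cite book |title=Foo |author= |year=2001 |page=}}
// steps 1 and 2 find nothing because there are no newlines; step 3 removes |author= and then |page= so the
// function returns
//    {{cite book |title=Foo |year=2001 }}
// and <params_removed_count> is bumped by 2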
//
private string empty_param_remove (string template, ref int params_removed_count)
{
string pattern;
pattern = @"(\|[^=]+=[ \t]*)[\r\n]+(?!\s*[\|\}])"; // parameter name & '=' on one line, value on a following line
template = counted_replace (template, pattern, "$1", ref params_removed_count);
pattern = @"\|[^=]+=[ \t]*([\r\n]+)"; // empty followed by new line
template = counted_replace (template, pattern, "$1", ref params_removed_count);
pattern = @"\|[^=]+=\s*([\|\}])"; // empty followed by pipe or at end of template
template = counted_replace (template, pattern, "$1", ref params_removed_count);
pattern = @"([\r\n]+)[ \t]*[\r\n]+"; // close up multiple new lines
while (Regex.Match(template, pattern).Success)
template = Regex.Replace(template, pattern, "$1");
return template;
}
//---------------------------< H Y P H E N A T E >------------------------------------------------------------
//
// <hyphenate_map> lists unhyphenated parameter names (key) and their hyphenated counterparts (value). The
// enumerated forms of these keys contain a capture so that the enumerator is included in the hyphenated
// replacements
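//
// illustration: the key author(\d+)link is searched for as (\| *)author(\d+)link so $1 is the pipe with any
// spaces that follow it and $2 is the enumerator; the replacement is $1author$2-link so a (hypothetical)
// |author2link= is rewritten as |author2-link=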
//
private string hyphenate (Dictionary<string, string> hyphenate_map, string template, ref int params_hyphenated_count)
{
string pattern;
foreach(KeyValuePair<string, string> entry in hyphenate_map)
{
pattern = @"(\| *)" + entry.Key; // does this unhyphenated form exist in <template>?
template = counted_replace (template, pattern, "$1" + entry.Value, ref params_hyphenated_count);
}
return template;
}
//---------------------------< L A N G U A G E >--------------------------------------------------------------
//
// replaces |language=<language name> with <language code> from <language_map> for i18n
//
// extract language name and trim. Make a lowercase copy, lc_lang_name, for indexing into the language
// dictionary. If the lc_lang_name is in the dictionary, escape any parentheses in trimmed language value for
// use as the regex pattern to be replaced.
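//
// illustration (hypothetical parameter value): |language=German, French is split at the comma into 'German'
// and ' French'; both trimmed lowercase names are in <language_map> so the parameter is rewritten as
// |language=de, fr and <count> is bumped by 2
//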
private string language (Dictionary<string, string> language_map, string template, ref int count)
{
bool changed = false;
string raw_pattern = @"\|\s*(?:language|lang)\s*=\s*([^\|\}]+)";
Match match = Regex.Match (template, raw_pattern);
string lang_param = match.Groups[0].Value; // the whole match; this gets modified and is used to replace raw_pattern at the end
if (match.Success) // if there is a |language= parameter
{
string[] lang_array = match.Groups[1].Value.Split (new string[] {","}, StringSplitOptions.None); // split the parameter value at the comma
string pattern;
for (int i=0; i < lang_array.Length; i++)
{
string lc_lang_name = lang_array[i].Trim().ToLower(); // trimmed lowercase value for indexing into the dictionary
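if (language_map.ContainsKey (lc_lang_name) && !lang_array[i].Trim().Equals (language_map[lc_lang_name])) // if in the dictionary and not already written as its code (name == code, 'fon' or 'luo', would otherwise loop forever in counted_replace())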
{
string lang_name_pattern = lang_array[i].Trim();
pattern = @"\("; // escape any left-side parentheses
lang_name_pattern = Regex.Replace (lang_name_pattern, pattern, @"\(");
pattern = @"\)"; // escape any right-side parentheses
lang_name_pattern = Regex.Replace (lang_name_pattern, pattern, @"\)");
lang_param = counted_replace (lang_param, lang_name_pattern, language_map[lc_lang_name], ref count); // replace language name with MediaWiki code
changed = true;
}
}
}
if (changed)
template = Regex.Replace (template, raw_pattern, lang_param); // replace original |language=<name list> with modified |language=<code list>
return template;
}
//---------------------------< U R L _ S T A T U S _ R E M O V E >--------------------------------------------
//
// removes |url-status=<anything> when |archive-url= and |archive-date= are empty or missing (and |deadurl=y always)
//
// this function to be called after empty parameters have been removed
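//
// illustration (hypothetical template): {{cite web |url=http://example.com |title=T |url-status=live}} has no
// |archive...= parameter so |url-status=live is removed, leaving {{cite web |url=http://example.com |title=T }}
// and <url_status_count> is bumped by 1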
//
private string url_status_remove (string template, ref int url_status_count, ref int deadurl_count)
{
string pattern = @"\|\s*archive"; // remove |url-status=<anything> when |archive-url= and |archive-date= not present
if (!Regex.Match(template, pattern).Success) // if no archive parameters
{
pattern = @"\|\s*url-status\s*=\s*\b[^\|\}]*([\|\}])"; // |url-status= with any assigned value
template = counted_replace (template, pattern, "$1", ref url_status_count);
}
pattern = @"\|\s*dead\-?url\s*=\s*(?:yes|true|y)\b([\s\|\}])"; // remove reFill-added |deadurl=y
template = counted_replace (template, pattern, "$1", ref deadurl_count);
return template;
}
//---------------------------< M O D E _ R E M O V E >--------------------------------------------------------
//
// removes |mode=cs1 from cs1 templates; removes |mode=cs2 from cs2 templates
//
private string mode_remove (string template, ref int count)
{
string pattern = @"\{\{\s*([Cc]it[ae][^\|]*)";
string template_name = Regex.Match(template, pattern).Groups[1].Value.Trim().ToLower(); // template name: citation, cite book, cite, etc
if (template_name.Equals ("citation") || template_name.Equals ("cite"))
{
pattern = @"\|\s*mode\s*=\s*cs2\s*"; // remove |mode=cs2 from citation template
template = counted_replace (template, pattern, "", ref count);
}
else // here for cs1 template names
{
pattern = @"\|\s*mode\s*=\s*cs1\s*"; // remove |mode=cs1 from cite xxx template
template = counted_replace (template, pattern, "", ref count);
}
return template;
}
//---------------------------< P O S T S C R I P T _ R E M O V E >--------------------------------------------
//
// removes |postscript=. from cs1 templates; removes |postscript=none from cs2 templates; removes |postscript=<!--none-->
// from both
//
private string postscript_remove (string template, ref int ps_count)
{
string pattern = @"\|\s*postscript\s*=\s*\<!\-\-\s*[Nn][Oo][Nn][Ee]\s*\-\-\>\s*"; // various flavors of <!--none-->
template = counted_replace (template, pattern, "", ref ps_count);
pattern = @"\{\{\s*([Cc]it[ae][^\|]*)";
string template_name = Regex.Match(template, pattern).Groups[1].Value.Trim().ToLower(); // template name: citation, cite book, cite, etc
if (template_name.Equals ("citation") || template_name.Equals ("cite")) // {{citation}} and its primary redirect {{cite}}
{
pattern = @"\|\s*postscript\s*=\s*[Nn][Oo][Nn][Ee]\s*"; // remove |postscript=none from citation template
template = counted_replace (template, pattern, "", ref ps_count);
}
else // here for cs1 template names
{
pattern = @"\|\s*postscript\s*=\s*\.+\s*([\|\}])"; // remove |postscript=. from cite xxx template
template = counted_replace (template, pattern, "$1", ref ps_count);
}
return template;
}
//---------------------------< N O P P _ R E M O V E >--------------------------------------------------------
//
// removes |no-pp=<anything> when |page=, |pages=, |p=, and |pp= are missing (or have been deleted because empty)
//
// removes |no-pp= from {{cite journal}} and from {{citation |journal=<anything> |...}} because journal cites
// don't use p. and pp. prefixes
//
// this function to be called after empty parameters have been removed
//
private string nopp_remove (string template, ref int nopp_count)
{
string pattern = @"\|\s*(?:pages?|pp|p)\b"; // remove |no-pp=<anything> when |page=, |pages=, |p=, and |pp= not present
if (!Regex.Match(template, pattern).Success) // if no page(s) parameters
{
pattern = @"\|\s*no\-?pp\s*=\s*\b[^\|\}]*([\|\}])"; // |no-pp= with any assigned value
template = counted_replace (template, pattern, "$1", ref nopp_count);
return template;
}
pattern = @"\{\{\s*[Cc]ite\s*[Jj]ournal";
bool journal = Regex.Match(template, pattern).Success;
pattern = @"\{\{\s*(?:[Cc]itation|[Cc]ite)(?=\s*\|)[^\}]*\|\s*journal\s*=\s*[^\|\}]+";
bool citation = Regex.Match(template, pattern).Success;
if (journal || citation)
{
pattern = @"\|\s*no\-?pp\s*=\s*\b[^\|\}]*([\|\}])"; // |no-pp= with any assigned value
template = counted_replace (template, pattern, "$1", ref nopp_count);
}
return template;
}
//---------------------------< O R I G _ Y E A R _ R E M O V E >----------------------------------------------
//
// removes |orig-year=<anything> when |year= and |date= are missing (or have been deleted because empty)
//
// this function to be called after empty parameters have been removed and after hyphenation applied
//
private string orig_year_remove (string template, ref int orig_count)
{
string pattern = @"\|\s*(?:year|date|air-date)\b"; // remove |orig-year=<anything> when |year=, |date=, and |air-date= not present
if (!Regex.Match(template, pattern).Success) // if |year=, |date=, and aliases not present
{
pattern = @"\|\s*publication\-date\b"; // |publication-date= promotes to |date= when |date= and |year= not present
if (!Regex.Match(template, pattern).Success) // if |publication-date= not present
{
pattern = @"\|\s*orig-year\s*=\s*\b[^\|\}]*([\|\}])"; // |orig-year= with any assigned value
template = counted_replace (template, pattern, "$1", ref orig_count);
}
}
return template;
}
//---------------------------< N A M E _ L I S T _ S T Y L E _ R E M O V E >----------------------------------
//
// removes |name-list-style=<value> when:
// <value> == 'vanc' and no |<first-alias>2=<name> parameters
// <value> == 'amp' | 'ampersand' | 'and' | '&' | 'serial' and no |<last-alias>2=<name> parameters
//
// this function to be called after empty parameters have been removed
//
private string name_list_style_remove (string template, ref int nls_count)
{
string pattern = @"\|\s*name\-list\-style\s*=(\s*(?:ampersand|amp|and|&|serial|vanc))\s*[\|\}]";
Match nls_match = Regex.Match (template, pattern);
bool val = false;
// |name-list-style= requires one of these with value for amp, ampersand, and, &, serial
string nls_names_pattern = @"last2|author\-last2|author2\-last|author2|surname2|subject2|host2|contributor\-last2|contributor2\-last|contributor2|contributor\-surname2|contributor2\-surname|editor\-last2|editor2\-last|editor2|editor\-surname2|editor2\-surname|interviewer\-last2|interviewer2\-last|interviewer2|translator\-last2|translator2\-last|translator2|translator\-surname2|translator2\-surname";
// |name-list-style= requires one of these with value for vanc
string nls_vanc_pattern = @"first\d*|author\-first\d*|author\d*\-first|given\d*|author\-given\d*|author\d*\-given|contributor\-first\d*|contributor\d*\-first|contributor\-given\d*|contributor\d*\-given|editor\-first\d*|editor\d*\-first|editor\-given\d*|editor\d*\-given|interviewer\-first\d*|interviewer\d*\-first|interviewer\-given\d*|interviewer\d*\-given|translator\-first\d*|translator\d*\-first|translator\-given\d*|translator\d*\-given";
if (nls_match.Success) // if there is |name-list-style=<value>
{
val = nls_match.Groups[1].Value.Trim().Equals ("vanc"); // get and trim <value>; see if it is 'vanc'
if ((val && !Regex.Match (template, nls_vanc_pattern).Success) || (!val && !Regex.Match (template, nls_names_pattern).Success))
{
pattern = @"\|\s*name\-list\-style\s*=[^\|\}]*([\|\}])"; // none of the required parameters present so delete
template = counted_replace (template, pattern, "$1", ref nls_count);
}
}
return template;
}
//---------------------------< N A M E _ L I N K _ M A S K _ R E M O V E >----------------------------------------------
//
// removes |<name>-mask=<value>, |<name>n-mask=<value>, and |<name>-maskn=<value>, when:
// |<name>n= is missing;
//
// this function to be called after empty parameters have been removed
//
private string name_link_mask_remove (string template, ref int count, string link_mask)
{
string lm_param_name = @"";
string lm_param_enum = @"";
string lm_pattern1 = @"((?:author|subject|contributor|editor|interviewer|translator))\-" + link_mask + @"(\d*)";
string lm_pattern2 = @"((?:author|subject|contributor|editor|interviewer|translator))(\d*)\-" + link_mask;
string pattern = @"";
string[] param_array; // array of parameters in <template>; [0] is '{{<template name' so ignore that; [n] has '}}' in rvalue but rvalue is stripped
Match match;
param_array = template.Split (new string[] {"|"}, StringSplitOptions.None); // split the template at the pipe
for (int i=1; i < param_array.Length; i++) // begin at [1] because param_array[0] has template name and opening braces
{
lm_param_name = @""; // reset the name
pattern = @"\s*([^=]+)\s*=.*"; // pattern to remove leading white space, assignment operator, and rvalue
param_array[i] = Regex.Replace (param_array[i], pattern, "$1"); // strip; keep only the parameter name
match = Regex.Match (param_array[i], lm_pattern1); // look for params with trailing enumerators
if (match.Success)
{
lm_param_name = match.Groups[1].Value; // this is the root name (without enumerator and without '-mask' or '-link')
lm_param_enum = match.Groups[2].Value; // here is the enumerator or empty string
}
else
{
match = Regex.Match (param_array[i], lm_pattern2); // look for link|mask params with embedded enumerators
if (match.Success)
{
lm_param_name = match.Groups[1].Value;
lm_param_enum = match.Groups[2].Value;
}
else
continue; // not a mask or link parameter so do next parameter in param_array
}
if ("" == lm_param_enum || "1" == lm_param_enum) // when no enumerator, same as enumerator of 1
lm_param_enum = "1?"; // rewrite enumerator to accept zero or one "1" as enumerator
lm_param_name = lm_param_name.Trim();
switch (lm_param_name) // get a pattern according to the link|mask parameter's root name
{
case "subject":
case "author":
pattern = @"(?:last" + lm_param_enum + @"|author\-last" + lm_param_enum + @"|author" + lm_param_enum + @"\-last|author" + lm_param_enum + @"|surname" + lm_param_enum + @"|host" + lm_param_enum + @"|subject" + lm_param_enum + @"|vauthors)";
break;
case "contributor":
pattern = @"(?:contributor\-last" + lm_param_enum + @"|contributor" + lm_param_enum + @"\-last|contributor" + lm_param_enum + @"|contributor\-surname" + lm_param_enum + @"|contributor" + lm_param_enum + @"\-surname)";
break;
case "editor":
pattern = @"(?:editor\-last" + lm_param_enum + @"|editor" + lm_param_enum + @"\-last|editor" + lm_param_enum + @"|editor\-surname" + lm_param_enum + @"|editor" + lm_param_enum + @"\-surname|veditors)";
break;
case "interviewer":
pattern = @"(?:interviewer\-last" + lm_param_enum + @"|interviewer" + lm_param_enum + @"\-last|interviewer" + lm_param_enum + @")";
break;
case "translator":
pattern = @"(?:translator\-last" + lm_param_enum + @"|translator" + lm_param_enum + @"\-last|translator" + lm_param_enum + @"|translator\-surname" + lm_param_enum + @"|translator" + lm_param_enum + @"\-surname)";
break;
}
if ("author" == lm_param_name || "editor" == lm_param_name)
{
match = Regex.Match (template, @"\|\s*" + @"(?:vauthors|veditors)" + @"\s*=([^\|\}]*)");
if (match.Success)
{
int i_lm_param_enum = ("1?" == lm_param_enum)? 1 : int.Parse(lm_param_enum);
string vparam = match.Groups[1].Value;
string[] vp_cnt = vparam.Split(',');
if (i_lm_param_enum > vp_cnt.Length) // if enumerator is greater than the number of names (number of commas + 1) there is no matching name
{
pattern = @"\|\s*" + param_array[i] + @"\s*=[^\r\n\|\}]*"; // make a regex for the parameter to be deleted
template = counted_replace (template, pattern, "", ref count); // and delete it
continue;
}
}
}
pattern = @"\|\s*" + pattern + @"\s*=[^\|\}]*"; // add introductory pipe, assignment operator, and rvalue regex
match = Regex.Match (template, pattern); // try to match the enumerated <name> param
if (!match.Success)
{
pattern = @"\|\s*" + param_array[i] + @"\s*=[^\r\n\|\}]*"; // make a regex for the parameter to be deleted
template = counted_replace (template, pattern, "", ref count); // and delete it
}
}
return template;
}
//===========================<< S T A T I C D A T A >>======================================================
static string IS_CS1 = @"(?:[Cc]ite[_\-\s]*(?=album\-notes|[Aa][Vv] media|[Aa][Vv] media notes|article|ar[Xx]iv|audio|bio[Rr]xiv|blog|book|chapter|conference|contribution|dictionary|dissertation|document|DVD|dvd|encyclopa?edia|episode|image|interview|[Jj]ournal|letter|liner notes|[Mm]agazine|mailing ?list|manual|map|media release|media|newsgroup|newspaper|(?:[Nn]ews(?!group|paper))|[Nn]ew|paper|plaque|podcast|press release|press|publication|pr|radio|report|serial|sign|speech|techreport|thesis|video|[Ww]eb|periodical)|[Cc]itation(?=\s*\|)|[Cc]ite(?=\s*\|)|[Cc]it news|[Cc]it web|[Cc]ita web|[Cc]itar notícia|[Cc]itat web|[Cc]ite we|[Ww]eb cite)";
static string template_pattern = @"\{\{\s*" + IS_CS1 + @"[^\}]+\}\}"; // basic cs1|2 template pattern
static string comment_params_1_pattern = @"(\|\s*)(\<!\-{2,}\s*)([^=\>\}]*=\s*[^\>]+)(\|\s*)(\-{2,}\>)"; // $2$1$3$5$4
static string comment_params_2_pattern = @"(\|\s*)(\<!\-{2,}\s*)([^=\>\}]*=\s*\b[^\>]+\-{2,}\>)"; // $2$1$3 always test this first; move pipe inside comment markup
static string comment_params_3_pattern = @"\|\s*\<!\-{2,}\s*[^=\>\}]*=\s*\-{2,}\>\s*"; // next; delete when match
static string comment_params_4_pattern = @"\|\s*[^=\>\}]*=\s*(\-{2,}\>)"; // $1 remove empty parameter leave comment close
static string comment_params_5_pattern = @"(\|\s*)(\<!\-{2,}\s*)([^=\>\|\}]*\-{2,}\>\s*[\|\}])"; // $2$1$3 |<!--Added by DASHBot--> | or } to <!--|Added by DASHBot--> | or }
static string empty_comment_pattern = @"\<!\-{2,}\s*\-{2,}\>"; // remove empty html comments
static string empty_positional_parameter_nl = @"\|[ \t]*([\r\n]+[ \t]*[\|\}])"; // remove empty positional parameters – newline variant to keep next line on the next line
static string empty_positional_parameter = @"\|\s*([\|\}])"; // remove empty positional parameters
static string ref_harv_pattern = @"\|\s*ref\s*=\s*harv[\t ]*"; // remove |ref=harv
//---------------------------< H Y P H E N A T I O N D A T A >----------------------------------------------
//
// unhyphenated parameter names (key of dictionary k/v pair) are replaced with hyphenated parameter names (value
// of dictionary k/v pair). Some keys are regex patterns (enumerated parameter names)
//
static Dictionary<string, string> hyphenate_map = new Dictionary<string, string>()
{
{"accessdate", "access-date"},
{"archivedate", "archive-date"},
{"archiveurl", "archive-url"},
{"authorlink", "author-link"},
{"authormask", "author-mask"},
{"booktitle", "book-title"},
{"chapterurl", "chapter-url"},
{"conferenceurl", "conference-url"},
{"contributionurl", "contribution-url"},
{"displayauthors", "display-authors"},
{"editorlink", "editor-link"},
{"episodelink", "episode-link"},
{"laydate", "lay-date"},
{"laysource", "lay-source"},
{"layurl", "lay-url"},
{"mailinglist", "mailing-list"},
{"mapurl", "map-url"},
{"nopp", "no-pp"},
{"notracking", "no-tracking"},
{"origyear", "orig-year"},
{"publicationdate", "publication-date"},
{"publicationplace", "publication-place"},
{"sectionurl", "section-url"},
{"serieslink", "series-link"},
{"seriesno", "series-no"},
{"subjectlink", "subject-link"},
{"timecaption", "time-caption"},
{"titlelink", "title-link"},
{"transcripturl", "transcript-url"},
{"author(\\d+)link", "author$2-link"},
{"authorlink(\\d+)", "author-link$2"},
{"author(\\d+)mask", "author$2-mask"},
{"authormask(\\d+)", "author-mask$2"},
{"editor(\\d+)link", "editor$2-link"},
{"editorlink(\\d+)", "editor-link$2"},
{"editor(\\d+)mask", "editor$2-mask"},
{"editormask(\\d+)", "editor-mask$2"},
{"subject(\\d+)link", "subject$2-link"},
{"subjectlink(\\d+)", "subject-link$2"},
};
//---------------------------< L A N G U A G E S >------------------------------------------------------------
//
// language names in |language= or |lang= (key of dictionary k/v pair) are replaced with MediaWiki language codes
// (value of dictionary k/v pair). These k/v pairs are taken from the list at:
// [[Template:Citation_Style_documentation/language/doc#Language_names]]
//
static Dictionary<string, string> language_map = new Dictionary<string, string>
{
{"abaza", "abq"},
{"abkhazian", "ab"},
{"achinese", "ace"},
{"acoli", "ach"},
{"adangme", "ada"},
{"adyghe (cyrillic script)", "ady-cyrl"},
{"adyghe", "ady"},
{"afar", "aa"},
{"afrihili", "afh"},
{"afrikaans", "af"},
{"aghem", "agq"},
{"ainu", "ain"},
{"akan", "ak"},
{"akkadian", "akk"},
{"akkala sámi", "sia"},
{"akoose", "bss"},
{"alabama", "akz"},
{"albanian", "sq"},
{"aleut", "ale"},
{"algerian arabic", "arq"},
{"ambonese malay", "abs"},
{"american english", "en-us"},
{"american sign language", "ase"},
{"amharic", "am"},
{"amis", "ami"},
{"ancient egyptian", "egy"},
{"ancient greek", "grc"},
{"angika", "anp"},
{"ao naga", "njo"},
{"arabic", "ar"},
{"aragonese", "an"},
{"aramaic", "arc"},
{"araona", "aro"},
{"arapaho", "arp"},
{"arawak", "arw"},
{"armenian", "hy"},
{"aromanian", "roa-rup"},
{"arpitan", "frp"},
{"assamese", "as"},
{"asturian", "ast"},
{"asu", "asa"},
{"atikamekw", "atj"},
{"atsam", "cch"},
{"australian english", "en-au"},
{"austrian german", "de-at"},
{"avaric", "av"},
{"avestan", "ae"},
{"awadhi", "awa"},
{"aymara", "ay"},
{"azerbaijani", "az"},
{"badaga", "bfq"},
{"bafia", "ksf"},
{"bafut", "bfd"},
{"bakhtiari", "bqi"},
{"balinese", "ban"},
{"balkan romani", "rmn"},
{"baltic romani", "rml"},
{"baluchi", "bal"},
{"bambara", "bm"},
{"bamun", "bax"},
{"banjar", "bjn"},
{"basa banyumasan", "map-bms"},
{"basaa", "bas"},
{"bashkir", "ba"},
{"basque", "eu"},
{"batak mandailing", "btm"},
{"batak toba (latin script)", "bbc-latn"},
{"batak toba", "bbc"},
{"bavarian", "bar"},
{"beja", "bej"},
{"belarusian (taraškievica orthography)", "be-x-old"},
{"belarusian", "be"},
{"bemba", "bem"},
{"bena", "bez"},
{"bengali", "bn"},
{"betawi", "bew"},
{"bhojpuri", "bho"},
{"biblical hebrew", "hbo"},
{"bihari", "bh"},
{"bikol", "bik"},
{"bini", "bin"},
{"bishnupriya", "bpy"},
{"bislama", "bi"},
{"blackfoot", "bla"},
{"blin", "byn"},
{"blissymbols", "zbl"},
{"bodo", "brx"},
{"bosnian", "bs"},
{"brahui", "brh"},
{"braj", "bra"},
{"brazilian portuguese", "pt-br"},
{"breton", "br"},
{"british english", "en-gb"},
{"buginese", "bug"},
{"bulgarian", "bg"},
{"bulu", "bum"},
{"bunun", "bnn"},
{"buriat", "bua"},
{"burmese", "my"},
{"caddo", "cad"},
{"cajun french", "frc"},
{"canadian english", "en-ca"},
{"canadian french", "fr-ca"},
{"cantonese", "zh-yue"},
{"capiznon", "cps"},
{"carib", "car"},
{"carpathian romani", "rmc"},
{"catalan", "ca"},
{"cayuga", "cay"},
{"cebuano", "ceb"},
{"central atlas tamazight", "tzm"},
{"central bikol", "bcl"},
{"central dusun", "dtp"},
{"central kurdish", "ckb"},
{"central yupik", "esu"},
{"chadian arabic", "shu"},
{"chagatai", "chg"},
{"chakma", "ccp"},
{"chamorro", "ch"},
{"chavacano", "cbk-zam"},
{"chechen", "ce"},
{"cherokee", "chr"},
{"cheyenne", "chy"},
{"chibcha", "chb"},
{"chickasaw", "cic"},
{"chiga", "cgg"},
{"chilcotin", "clc"},
{"chimborazo highland quichua", "qug"},
{"chinese (china)", "zh-cn"},
{"chinese (hong kong)", "zh-hk"},
{"chinese (macau)", "zh-mo"},
{"chinese (malaysia)", "zh-my"},
{"chinese (min nan)", "zh-min-nan"},
{"chinese (singapore)", "zh-sg"},
{"chinese (taiwan)", "zh-tw"},
{"chinese", "zh"},
{"chinook jargon", "chn"},
{"chipewyan", "chp"},
{"choctaw", "cho"},
{"chukchi", "ckt"},
{"church slavic", "cu"},
{"chuukese", "chk"},
{"chuvash", "cv"},
{"classical chinese", "zh-classical"},
{"classical newari", "nwc"},
{"classical syriac", "syc"},
{"comorian", "swb"},
{"congo swahili", "sw-cd"},
{"coptic", "cop"},
{"cornish", "kw"},
{"corsican", "co"},
{"cree", "cr"},
{"crimean tatar (cyrillic script)", "crh-cyrl"},
{"crimean tatar (latin script)", "crh-latn"},
{"crimean tatar", "crh"},
{"croatian", "hr"},
{"cypriot greek", "el-cy"},
{"czech", "cs"},
{"dagbani", "dag"},
{"dakota", "dak"},
{"dalecarlian", "dlc"},
{"danish", "da"},
{"dargwa", "dar"},
{"dari", "prs"},
{"dazaga", "dzg"},
{"delaware", "del"},
{"dinka", "din"},
{"divehi", "dv"},
{"dogri", "doi"},
{"dogrib", "dgr"},
{"doteli", "dty"},
{"duala", "dua"},
{"dutch", "nl"},
{"dyula", "dyu"},
{"dzongkha", "dz"},
{"east cree", "crl"},
{"eastern balochi", "bgp"},
{"eastern canadian (aboriginal syllabics)", "ike-cans"},
{"eastern canadian (latin script)", "ike-latn"},
{"eastern cham (arabic script)", "cjm-arab"},
{"eastern cham (cham script)", "cjm-cham"},
{"eastern cham (latin script)", "cjm-latn"},
{"eastern cham", "cjm"},
{"eastern frisian", "frs"},
{"eastern mari", "mhr"},
{"eastern pwo", "kjp"},
{"eastern yiddish", "ydd"},
{"efik", "efi"},
{"egyptian arabic", "arz"},
{"ekajuk", "eka"},
{"elamite", "elx"},
{"embu", "ebu"},
{"emilian", "egl"},
{"emiliano-romagnolo", "eml"},
{"english", "en"},
{"erzya", "myv"},
{"esperanto", "eo"},
{"estonian", "et"},
{"etruscan", "ett"},
{"european portuguese", "pt-pt"},
{"european spanish", "es-es"},
{"ewe", "ee"},
{"ewondo", "ewo"},
{"extremaduran", "ext"},
{"eyak", "eya"},
{"fang", "fan"},
{"fanti", "fat"},
{"faroese", "fo"},
{"fiji hindi (latin script)", "hif-latn"},
{"fiji hindi", "hif"},
{"fijian", "fj"},
{"filipino", "fil"},
{"finnish kalo", "rmf"},
{"finnish", "fi"},
{"flemish", "nl-be"},
{"fon", "fon"},
{"frafra", "gur"},
{"french", "fr"},
{"friulian", "fur"},
{"fulah", "ff"},
// {"ga", "gaa"}, // language name ga is same as code for language name Irish; this is the only case
{"gagauz", "gag"},
{"galician", "gl"},
{"gamilaraay", "kld"},
{"gan (simplified)", "gan-hans"},
{"gan (traditional)", "gan-hant"},
{"gan chinese", "gan"},
{"ganda", "lg"},
{"gayo", "gay"},
{"gbaya", "gba"},
{"geez", "gez"},
{"georgian", "ka"},
{"german (formal address)", "de-formal"},
{"german", "de"},
{"gheg albanian", "aln"},
{"ghomala", "bbj"},
{"gilaki", "glk"},
{"gilbertese", "gil"},
{"goan konkani (devanagari script)", "gom-deva"},
{"goan konkani (latin script)", "gom-latn"},
{"goan konkani", "gom"},
{"gondi", "gon"},
{"gorontalo", "gor"},
{"gothic", "got"},
{"grebo", "grb"},
{"greek", "el"},
{"guarani", "gn"},
{"guernésiais", "nrf-gg"},
{"guianan creole", "gcr"},
{"gujarati", "gu"},
{"gusii", "guz"},
{"gwichʼin", "gwi"},
{"haida", "hai"},
{"haitian creole", "ht"},
{"hakka chinese", "hak"},
{"hausa", "ha"},
{"hawaiian", "haw"},
{"hazaragi", "haz"},
{"hebrew", "he"},
{"herero", "hz"},
{"hiligaynon", "hil"},
{"hindi", "hi"},
{"hiri motu", "ho"},
{"hittite", "hit"},
{"hmong", "hmn"},
{"hungarian", "hu"},
{"hunsrik", "hrx"},
{"hupa", "hup"},
{"iban", "iba"},
{"ibibio", "ibb"},
{"icelandic", "is"},
{"ido", "io"},
{"igbo", "ig"},
{"ilocano", "ilo"},
{"inari sami", "smn"},
{"indonesian", "id"},
{"ingrian", "izh"},
{"ingush", "inh"},
{"innu", "moe"},
{"interlingua", "ia"},
{"interlingue", "ie"},
{"inuktitut", "iu"},
{"inupiaq", "ik"},
{"iriga bicolano", "bto"},
{"irish", "ga"},
{"island carib", "crb"},
{"italian", "it"},
{"jamaican creole english", "jam"},
{"japanese (hiragana script)", "ja-hira"},
{"japanese (kana script)", "ja-hrkt"},
{"japanese (kanji script)", "ja-hani"},
{"japanese (katakana script)", "ja-kana"},
{"japanese", "ja"},
{"javanese", "jv"},
{"jinyu (simplified)", "cjy-hans"},
{"jinyu (traditional)", "cjy-hant"},
{"jinyu", "cjy"},
{"jju", "kaj"},
{"jola-fonyi", "dyo"},
{"judeo-arabic", "jrb"},
{"judeo-persian", "jpr"},
{"jutish", "jut"},
{"jèrriais", "nrf-je"},
{"kabardian (cyrillic script)", "kbd-cyrl"},
{"kabardian", "kbd"},
{"kabiye", "kbp"},
{"kabuverdianu", "kea"},
{"kabyle", "kab"},
{"kachin", "kac"},
{"kaingang", "kgp"},
{"kako", "kkj"},
{"kalaallisut", "kl"},
{"kalenjin", "kln"},
{"kalmyk", "xal"},
{"kamba", "kam"},
{"kanembu", "kbl"},
{"kannada", "kn"},
{"kanuri", "kr"},
{"kara-kalpak", "kaa"},
{"karachay-balkar", "krc"},
{"karelian", "krl"},
{"kashmiri (arabic script)", "ks-arab"},
{"kashmiri (devanagari script)", "ks-deva"},
{"kashmiri", "ks"},
{"kashubian", "csb"},
{"kawi", "kaw"},
{"kawésqar", "alc"},
{"kazakh (arabic script)", "kk-arab"},
{"kazakh (china)", "kk-cn"},
{"kazakh (cyrillic script)", "kk-cyrl"},
{"kazakh (kazakhstan)", "kk-kz"},
{"kazakh (latin script)", "kk-latn"},
{"kazakh (turkey)", "kk-tr"},
{"kazakh", "kk"},
{"kelantan-pattani malay", "mfa"},
{"kemi sámi", "sjk"},
{"kenyang", "ken"},
{"khakas", "kjh"},
{"khasi", "kha"},
{"khmer", "km"},
{"khotanese", "kho"},
{"khowar", "khw"},
{"kikuyu", "ki"},
{"kildin sami", "sjd"},
{"kimbundu", "kmb"},
{"kinaray-a", "krj"},
{"kinyarwanda", "rw"},
{"kirmanjki", "kiu"},
{"klingon", "tlh"},
{"kom", "bkm"},
{"komi-permyak", "koi"},
{"komi", "kv"},
{"kongo", "kg"},
{"konkani", "kok"},
{"korean (north korea)", "ko-kp"},
{"korean", "ko"},
{"koro", "kfo"},
{"kosraean", "kos"},
{"kotava", "avk"},
{"koyra chiini", "khq"},
{"koyraboro senni", "ses"},
{"koyukon", "koy"},
{"kpelle", "kpe"},
{"krio", "kri"},
{"kuanyama", "kj"},
{"kumyk", "kum"},
{"kurdish (arabic script)", "ku-arab"},
{"kurdish (latin script)", "ku-latn"},
{"kurdish", "ku"},
{"kurukh", "kru"},
{"kutenai", "kut"},
{"kvensk", "fkv"},
{"kwasio", "nmg"},
{"kyrgyz", "ky"},
{"kölsch", "ksh"},
{"kʼicheʼ", "quc"},
{"ladin", "lld"},
{"ladino", "lad"},
{"lahnda", "lah"},
{"lak", "lbe"},
{"laki", "lki"},
{"lakota", "lkt"},
{"lamba", "lam"},
{"langi", "lag"},
{"lao", "lo"},
{"latgalian", "ltg"},
{"latin american spanish", "es-419"},
{"latin", "la"},
{"latvian", "lv"},
{"laz", "lzz"},
{"lezghian", "lez"},
{"ligurian", "lij"},
{"limburgish", "li"},
{"lingala", "ln"},
{"lingua franca nova", "lfn"},
{"literary chinese", "lzh"},
{"lithuanian", "lt"},
{"livonian", "liv"},
{"livvi-karelian", "olo"},
{"lojban", "jbo"},
{"lombard", "lmo"},
{"louisiana creole", "lou"},
{"low german", "nds"},
{"low saxon", "nds-nl"},
{"lower silesian", "sli"},
{"lower sorbian", "dsb"},
{"lozi", "loz"},
{"luba-katanga", "lu"},
{"luba-lulua", "lua"},
{"luiseno", "lui"},
{"lule sami", "smj"},
{"lunda", "lun"},
{"luo", "luo"},
{"luxembourgish", "lb"},
{"luyia", "luy"},
{"maba", "mde"},
{"macedonian", "mk"},
{"machame", "jmc"},
{"madurese", "mad"},
{"mafa", "maf"},
{"magahi", "mag"},
{"maharashtrian konkani", "knn"},
{"main-franconian", "vmf"},
{"maithili", "mai"},
{"makasar", "mak"},
{"makhuwa-meetto", "mgh"},
{"makonde", "kde"},
{"malagasy", "mg"},
{"malay", "ms"},
{"malayalam", "ml"},
{"maltese", "mt"},
{"manchu", "mnc"},
{"mandaic", "mid"},
{"mandar", "mdr"},
{"mandingo", "man"},
{"manipuri", "mni"},
{"manx", "gv"},
{"maori", "mi"},
{"mapuche", "arn"},
{"mara", "mrh"},
{"marathi", "mr"},
{"mari", "chm"},
{"marshallese", "mh"},
{"marwari (india)", "rwr"},
{"marwari", "mwr"},
{"masai", "mas"},
{"mazanderani", "mzn"},
{"medumba", "byv"},
{"megleno-romanian (cyrillic script)", "ruq-cyrl"},
{"megleno-romanian (greek script)", "ruq-grek"},
{"megleno-romanian (latin script)", "ruq-latn"},
{"megleno-romanian", "ruq"},
{"mende", "men"},
{"mentawai", "mwv"},
{"meru", "mer"},
{"metaʼ", "mgo"},
{"mexican spanish", "es-mx"},
{"mi'kmaq", "mic"},
{"middle dutch", "dum"},
{"middle english", "enm"},
{"middle french", "frm"},
{"middle high german", "gmh"},
{"middle irish", "mga"},
{"middle low german", "gml"},
{"min dong chinese", "cdo"},
{"min nan chinese", "nan"},
{"minangkabau", "min"},
{"mingrelian", "xmf"},
{"mirandese", "mwl"},
{"mizo", "lus"},
{"modern standard arabic", "ar-001"},
{"mohawk", "moh"},
{"moksha", "mdf"},
{"moldavian", "ro-md"},
{"moldovan", "mo"},
{"mon", "mnw"},
{"mongo", "lol"},
{"mongolian", "mn"},
{"montenegrin", "cnr"},
{"monégasque", "lij-mc"},
{"morisyen", "mfe"},
{"moroccan arabic", "ary"},
{"mossi", "mos"},
{"mundang", "mua"},
{"munsee", "umu"},
{"muscogee", "mus"},
{"musi", "mui"},
{"muslim tat", "ttt"},
{"mycenaean greek", "gmy"},
{"myene", "mye"},
{"najdi arabic", "ars"},
{"nama", "naq"},
{"naskapi", "nsk"},
{"nauru", "na"},
{"navajo", "nv"},
{"ndonga", "ng"},
{"neapolitan", "nap"},
{"nederlands (informeel)", "nl-informal"},
{"nepali", "ne"},
{"newari", "new"},
{"ngambay", "sba"},
{"ngiemboon", "nnh"},
{"ngomba", "jgo"},
{"nheengatu", "yrl"},
{"nias", "nia"},
{"nigerian pidgin", "pcm"},
{"niuean", "niu"},
{"nogai", "nog"},
{"norfuk / pitkern", "pih"},
{"norman", "nrm"},
{"north ndebele", "nd"},
{"northern frisian", "frr"},
{"northern luri", "lrc"},
{"northern sami", "se"},
{"northern sotho", "nso"},
{"northern thai", "nod"},
{"norwegian bokmål", "nb"},
{"norwegian nynorsk", "nn"},
{"norwegian", "no"},
{"novial", "nov"},
{"nuer", "nus"},
{"numidian", "nxm"},
{"nyamwezi", "nym"},
{"nyanja", "ny"},
{"nyankole", "nyn"},
{"nyasa tonga", "tog"},
{"nyoro", "nyo"},
{"nyungar", "nys"},
{"nzima", "nzi"},
{"nāhuatl", "nah"},
{"n’ko", "nqo"},
{"o'odham", "ood"},
{"occitan", "oc"},
{"odia", "or"},
{"ojibwa", "oj"},
{"old english", "ang"},
{"old french", "fro"},
{"old high german", "goh"},
{"old irish", "sga"},
{"old japanese (hiragana script)", "ojp-hira"},
{"old japanese (kanji script)", "ojp-hani"},
{"old japanese", "ojp"},
{"old norse", "non"},
{"old persian", "peo"},
{"old provençal", "pro"},
{"old turkish", "otk"},
{"oromo", "om"},
{"osage", "osa"},
{"ossetic", "os"},
{"ottoman turkish", "ota"},
{"pahlavi", "pal"},
{"paiwan", "pwn"},
{"palatine german", "pfl"},
{"palauan", "pau"},
{"pali (siddham script)", "pi-sidd"},
{"pali", "pi"},
{"pampanga", "pam"},
{"pangasinan", "pag"},
{"papiamento", "pap"},
{"papora-hoanya", "ppu"},
{"pashto", "ps"},
{"pazeh", "uun"},
{"pennsylvania german", "pdc"},
{"persian", "fa"},
{"phoenician (latin script)", "phn-latn"},
{"phoenician (phoenician script)", "phn-phnx"},
{"phoenician", "phn"},
{"picard", "pcd"},
{"piedmontese", "pms"},
{"pite sami", "sje"},
{"pitjantjatjara", "pjt"},
{"plautdietsch", "pdt"},
{"pohnpeian", "pon"},
{"polish", "pl"},
{"pontic", "pnt"},
{"portuguese", "pt"},
{"prussian", "prg"},
{"pular", "fuf"},
{"punic", "xpu"},
{"punjabi", "pa"},
{"putèr", "rm-puter"},
{"puyuma", "pyu"},
{"quechua", "qu"},
{"quenya", "qya"},
{"rajasthani", "raj"},
{"rapanui", "rap"},
{"rarotongan", "rar"},
{"riffian", "rif"},
{"romagnol", "rgn"},
{"romanian", "ro"},
{"romansh", "rm"},
{"romany", "rom"},
{"rombo", "rof"},
{"rotuman", "rtm"},
{"roviana", "rug"},
{"rumantsch grischun", "rm-rumgr"},
{"rundi", "rn"},
{"russia buriat", "bxr"},
{"russian", "ru"},
{"rusyn", "rue"},
{"rwa", "rwk"},
{"saho", "ssy"},
{"saisiyat", "xsy"},
{"sakha", "sah"},
{"sakizaya", "szy"},
{"samaritan aramaic", "sam"},
{"samburu", "saq"},
{"samoan", "sm"},
{"samogitian", "sgs"},
{"sandawe", "sad"},
{"sango", "sg"},
{"sangu", "sbp"},
{"sanskrit (siddham script)", "sa-sidd"},
{"sanskrit", "sa"},
{"santali", "sat"},
{"saraiki (arabic script)", "skr-arab"},
{"saraiki", "skr"},
{"sardinian", "sc"},
{"sasak", "sas"},
{"sassarese sardinian", "sdc"},
{"saterland frisian", "stq"},
{"saurashtra", "saz"},
{"scots", "sco"},
{"scottish gaelic", "gd"},
{"selayar", "sly"},
{"selkup", "sel"},
{"sena", "seh"},
{"seneca", "see"},
{"serbian (cyrillic script)", "sr-ec"},
{"serbian (latin script)", "sr-el"},
{"serbian", "sr"},
{"serbo-croatian", "sh"},
{"serer", "srr"},
{"seri", "sei"},
{"seselwa creole french", "crs"},
{"shambala", "ksb"},
{"shan", "shn"},
{"shawiya (arabic script)", "shy-arab"},
{"shawiya (latin script)", "shy-latn"},
{"shawiya (tifinagh script)", "shy-tfng"},
{"shawiya", "shy"},
{"shona", "sn"},
{"sichuan yi", "ii"},
{"sicilian", "scn"},
{"sidamo", "sid"},
{"silesian", "szl"},
{"simple english", "simple"},
{"simplified chinese", "zh-hans"},
{"sindarin", "sjn"},
{"sindhi", "sd"},
{"sinhala", "si"},
{"sinte romani", "rmo"},
{"siraya", "fos"},
{"sirionó", "srq"},
{"skolt sami", "sms"},
{"slave", "den"},
{"slovak", "sk"},
{"slovenian", "sl"},
{"soga", "xog"},
{"sogdien", "sog"},
{"somali", "so"},
{"soninke", "snk"},
{"south azerbaijani", "azb"},
{"south ndebele", "nr"},
{"southern altai", "alt"},
{"southern balochi", "bcc"},
{"southern kurdish", "sdh"},
{"southern luri", "luz"},
{"southern sami", "sma"},
{"southern sotho", "st"},
{"spanish", "es"},
{"sranan tongo", "srn"},
{"standard moroccan tamazight", "zgh"},
{"sukuma", "suk"},
{"sumerian", "sux"},
{"sundanese", "su"},
{"surmiran", "rm-surmiran"},
{"sursilvan", "rm-sursilv"},
{"susu", "sus"},
{"sutsilvan", "rm-sutsilv"},
{"swahili", "sw"},
{"swati", "ss"},
{"swedish", "sv"},
{"swiss french", "fr-ch"},
{"swiss german", "gsw"},
{"swiss high german", "de-ch"},
{"syriac", "syr"},
{"tachelhit (latin script)", "shi-latn"},
{"tachelhit (tifinagh script)", "shi-tfng"},
{"tachelhit", "shi"},
{"tagalog", "tl"},
{"tahitian", "ty"},
{"taita", "dav"},
{"tajik (cyrillic script)", "tg-cyrl"},
{"tajik (latin script)", "tg-latn"},
{"tajik", "tg"},
{"talossan", "tzl"},
{"talysh", "tly"},
{"tamashek", "tmh"},
{"tamil", "ta"},
{"tarantino", "roa-tara"},
{"taroko", "trv"},
{"tasawaq", "twq"},
{"tatar (cyrillic script)", "tt-cyrl"},
{"tatar (latin script)", "tt-latn"},
{"tatar", "tt"},
{"tayal", "tay"},
{"taíno", "tnq"},
{"telugu", "te"},
{"ter sámi", "sjt"},
{"tereno", "ter"},
{"teso", "teo"},
{"tetum", "tet"},
{"thai", "th"},
{"thao", "ssf"},
{"tibetan", "bo"},
{"tigre", "tig"},
{"tigrinya", "ti"},
{"timne", "tem"},
{"tiv", "tiv"},
{"tlingit", "tli"},
{"tobelo", "tlb"},
{"tok pisin", "tpi"},
{"tokelau", "tkl"},
{"tongan", "to"},
{"tornedalen finnish", "fit"},
{"tosk albanian", "als"},
{"traditional chinese", "zh-hant"},
{"traveller norwegian", "rmg"},
{"tsakhur", "tkr"},
{"tsakonian", "tsd"},
{"tsimshian", "tsi"},
{"tsonga", "ts"},
{"tswana", "tn"},
{"tulu", "tcy"},
{"tumbuka", "tum"},
{"tungag", "lcm"},
{"tunisian arabic (arabic script)", "aeb-arab"},
{"tunisian arabic (latin script)", "aeb-latn"},
{"tunisian arabic", "aeb"},
{"turkish", "tr"},
{"turkmen", "tk"},
{"turoyo", "tru"},
{"tuvalu", "tvl"},
{"tuvinian", "tyv"},
{"twi", "tw"},
{"tyap", "kcg"},
{"udmurt", "udm"},
{"ugaritic", "uga"},
{"ukrainian", "uk"},
{"umbundu", "umb"},
{"ume sami", "sju"},
{"unsupported language", "mis"},
{"upper sorbian", "hsb"},
{"urdu", "ur"},
{"uyghur (arabic script)", "ug-arab"},
{"uyghur (latin script)", "ug-latn"},
{"uyghur", "ug"},
{"uzbek (cyrillic script)", "uz-cyrl"},
{"uzbek (latin script)", "uz-latn"},
{"uzbek", "uz"},
{"vai", "vai"},
{"vallader", "rm-vallader"},
{"venda", "ve"},
{"venetian", "vec"},
{"veps", "vep"},
{"vietnamese", "vi"},
{"vlax romani", "rmy"},
{"volapük", "vo"},
{"votic", "vot"},
{"vunjo", "vun"},
{"võro", "fiu-vro"},
{"wallisian", "wls"},
{"walloon", "wa"},
{"walser", "wae"},
{"waray", "war"},
{"warlpiri", "wbp"},
{"washo", "was"},
{"wayuu", "guc"},
{"welsh-romani", "rmw"},
{"welsh", "cy"},
{"west coast bajau", "bdr"},
{"west flemish", "vls"},
{"western abenaki", "abe"},
{"western armenian", "hyw"},
{"western balochi", "bgn"},
{"western cham (arabic script)", "cja-arab"},
{"western cham (cham script)", "cja-cham"},
{"western cham (latin script)", "cja-latn"},
{"western cham", "cja"},
{"western frisian", "fy"},
{"western mari", "mrj"},
{"western punjabi", "pnb"},
{"wolaytta", "wal"},
{"wolof", "wo"},
{"wu chinese", "wuu"},
{"xhosa", "xh"},
{"xiang chinese", "hsn"},
{"yangben", "yav"},
{"yao", "yao"},
{"yapese", "yap"},
{"yemba", "ybb"},
{"yiddish", "yi"},
{"yoruba", "yo"},
{"zapotec", "zap"},
{"zarma", "dje"},
{"zaza", "zza"},
{"zazaki", "diq"},
{"zeelandic", "zea"},
{"zenaga", "zen"},
{"zhuang", "za"},
{"zoroastrian dari", "gbz"},
{"zulu", "zu"},
{"zuni", "zun"},
{"español (formal)", "es-formal"},
{"magyar (formal)", "hu-formal"},
{"multiple languages", "mul"},
{"no linguistic content", "zxx"},
{"unknown language", "und"},
{"себертатар", "sty"},
{"ᬩᬲᬩᬮᬶ", "ban-bali"},
};
//---------------------------< S K I P _ L I S T >------------------------------------------------------------
//
// list of articles that task 18 will not edit, keyed by article title; the reason for each entry, where known,
// is noted beside it
//
//
Dictionary<string, bool> skip_map = new Dictionary<string, bool>()
{
// {<article name here>, true},
{"Bist du bei mir", true}, // per reverts by Editor Francis Schonken
{"Concerto for Two Violins (Bach)", true},
};
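//---------------------------< I S _ S K I P P E D _ A R T I C L E >------------------------------------------
//
// Illustrative sketch only, not part of the bot as published: one way the skip list above could be consulted.
// The function name and the article_title parameter are hypothetical.  The sketch assumes that the caller passes
// the article title spelled exactly as it is in skip_map; Dictionary<string, bool> keys are case-sensitive by default.
// Returns true when the title is listed with a value of true; unlisted titles return false.
//
bool is_skipped_article (string article_title)
{
	bool skip;
	return skip_map.TryGetValue (article_title, out skip) && skip;	// false when the title is not in the list
}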
// Monkbot_task_18_cosmetic_cs1_template_and_parameter_fixes.cs