Bug report: bad decoding of U+03B5 ε (epsilon)

About U+03B5 ε GREEK SMALL LETTER EPSILON (&epsi;, &epsilon;)

  • Issue: of the two HTML entities for ε (&epsi; and &epsilon;), one is resolved by mw.text.decode() in such a way that the plain character is then not found by mw.ustring.gsub(). The alternative HTML entity has no such issue.
Report limitations: The original report and bug reproduction are at enwiki Module talk:DecodeEncode, where en:module:DecodeEncode and en:module:String are used live. At Phabricator, pseudocode may be used and some "results" may be hardcoded. In text the escape &amp; is used, not inside function calls. Lua patterns are not used ("no %").
  • To reproduce:
1. Create research string:
X&epsi;1X&epsilon;2X (shows live and unedited as: Xε1Xε2X)
2. Render the string by decode() (as inner function)
3. Then, on the rendered result, use gsub() to replace the plain character ε with E (as outer function):
mw.ustring.gsub( s=(mw.text.decode( s=X&epsi;1X&epsilon;2X, decodeNamedEntities=true ) ), pattern=ε, repl=E ) [is pseudo-code, see note. 21:10, 7 February 2023 (UTC)]
4. Result3 (s&r pattern use ε from Xε1X):
XE1XE2X
5. Result4 (s&r pattern use ε from Xε2X):
XE1XE2X
  • Expected: XE1XE2X (only one character ε exists)
{{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}
→ XE1XE2X
-DePiep (talk) 21:10, 7 February 2023 (UTC)
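
A minimal sketch for narrowing this kind of mismatch down, assuming only plain Lua string functions: dumping the bytes of the decoded string and of the search pattern shows whether the two ε are really the same byte sequence. The helper name hexdump is hypothetical and not part of either module. In Scribunto the argument would be the result of mw.text.decode(); here literals stand in for it.

```lua
-- Hypothetical helper: return the bytes of s as space-separated hex pairs.
local function hexdump(s)
	local bytes = {}
	for i = 1, #s do
		bytes[#bytes + 1] = string.format('%02X', string.byte(s, i))
	end
	return table.concat(bytes, ' ')
end

-- U+03B5 ε is the two bytes CE B5 in UTF-8 (decimal escapes 206, 181).
local epsilon = '\206\181'
print(hexdump('X' .. epsilon .. '1'))  -- 58 CE B5 31

-- An entity that was left undecoded would show up as ASCII bytes instead.
print(hexdump('&epsi;'))               -- 26 65 70 73 69 3B
```

If the dump of the decoded string and the dump of the pattern differ at the position of ε, a plain search cannot match, whatever the two strings look like when rendered.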

Workaround A, ad hoc

Workaround A, ad hoc: add an innermost function that first replaces, in the research string, the badly decoded epsilon entity with the one that decodes properly:

A1: {{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s={{#invoke:String|replace|source=Xε1Xε2X|pattern=ε|replace=ε|plain=true}}}}|pattern=ε|replace=E|plain=true}}
XE1XE2X

Workaround B, in module (THIN SPACE example)

Workaround B: early in en:module:DecodeEncode, replace the badly decoded epsilon entity with the one that decodes properly.

About THIN SPACE: it looks like character U+2009 THIN SPACE (&thinsp;, &ThinSpace;) has a similar issue: one of its two entities is decoded properly, the other is not.

Currently in code:

function p._decode( s, subset_only )
	local ret = nil;
    s = mw.ustring.gsub( s, ' ', ' ' ) -- Workaround for bug:   gets properly decoded in decode, but   doesn't.
	ret = mw.text.decode( s, not subset_only )
	return ret
end

In en:module:DecodeEncode/sandbox, I have coded a similar handling of EPSILON:

module:DecodeEncode, module:DecodeEncode/sandbox diff
function p._decode( s, subset_only )
	local ret = nil;
	-- U+2009 THIN SPACE: workaround for bug: HTML entity   is decoded incorrect. Entity   gets decoded properly
	s = mw.ustring.gsub( s, ' ', ' ' )
	-- U+03B5 ε GREEK SMALL LETTER EPSILON: workaround for bug (phab:T328840): HTML entity ε is decoded incorrect for gsub(). Entity ε gets decoded properly
	s = mw.ustring.gsub( s, 'ε', 'ε' )
	ret = mw.text.decode( s, not subset_only )
	return ret
end
  • /sandbox tests:
B. {{#invoke:String|replace|source={{#invoke:DecodeEncode/sandbox|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}
B1. ResultB1 (s&r pattern use ε from Xε1X): XE1XE2X
B2. ResultB2 (s&r pattern use ε from Xε2X): XE1XE2X

I propose to edit the module along this way.
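
The same idea can be sketched as a small table-driven helper, assuming plain Lua string functions (entity names are ASCII, so byte-oriented functions are enough). The names replace_plain and normalize_entities, and the placeholder entities &old; and &new;, are hypothetical: they stand in for whichever entity pair needs the workaround. In the module, the normalized string would then be passed on to mw.text.decode().

```lua
-- Replace every literal occurrence of `from` in `s` with `to` (no Lua patterns).
local function replace_plain(s, from, to)
	if from == '' then
		return s -- nothing to search for
	end
	local out, pos = {}, 1
	while true do
		local i, j = string.find(s, from, pos, true) -- true = plain search
		if not i then
			break
		end
		out[#out + 1] = string.sub(s, pos, i - 1)
		out[#out + 1] = to
		pos = j + 1
	end
	out[#out + 1] = string.sub(s, pos)
	return table.concat(out)
end

-- Hypothetical alias table: each pair is { entity to rewrite, entity to use instead }.
local function normalize_entities(s, aliases)
	for _, pair in ipairs(aliases) do
		s = replace_plain(s, pair[1], pair[2])
	end
	return s
end

-- Placeholder names, not real entities.
local aliases = { { '&old;', '&new;' } }
print(normalize_entities('X&old;1X&new;2X', aliases)) -- X&new;1X&new;2X
```

Keeping the pairs in one table means a further entity with the same problem only needs one more line, not another gsub() call with its own comment.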

Workaround C (mw, Lua)

Changes in mw, Lua: I have no idea.

testcases EPSILON

  • Original failure, now solved (no longer showing):
(hardcoded explanation here): in the cell marked  N, the result showed as "XE1Xε2X". That is: wikitext input "ε" was not recognised and replaced. -DePiep (talk) 07:49, 19 February 2023 (UTC)
EPSILON ε ε error & fix proposal (16 Feb 2023)

| 1: id | 2: entity code | 3: plain | 4: mod:..decode(&entity;) | 5: replace(decode(..)) with E, pattern=hardcoded ⟨ε⟩ from plain: (s=&entity;) / (s=checkstring) | 6: mod:..decode/sandbox |
| --- | --- | --- | --- | --- | --- |
| checkstring | X&epsi;1X&epsilon;2X | >Xε1Xε2X< | >Xε1Xε2X< | | |
| EPSI | &epsi; | >ε< | >ε< | E / XE1XE2X | E / XE1XE2X |
| EPSILON | &epsilon; | >ε< | >ε< | E / XE1XE2X  N | E / XE1XE2X |

This is a similar fix to the one U+2009 THIN SPACE (&thinsp;, &ThinSpace;) has (though the original cause of the bug may be different for THIN SPACE).
  • Phabricator T328840 did not gain traction. Would be mw-level, not this module.
-DePiep (talk) 06:22, 16 February 2023 (UTC)

Template-protected edit request on 16 February 2023

Issue: bad decoding of HTML entity &epsi;  N
re U+03B5 ε GREEK SMALL LETTER EPSILON (&epsi;, &epsilon;)
Change: fix by replacing with entity &epsilon;  Y before applying decode(). See § Workaround B for the code diff and background; minor comment change
Discussion: (1) reported at T328840, no responses (mw-level); (2) bug report here not challenged
Testcases: See § testcases EPSILON.
DePiep (talk) 06:49, 16 February 2023 (UTC)
  Done * Pppery * it has begun... 03:11, 19 February 2023 (UTC)

NBSP behaviour

Leaving this note here.

About NBSP, U+00A0 NO-BREAK SPACE (&nbsp;, &NonBreakingSpace;). With input &nbsp; I am experiencing problems reminiscent of § epsilon (T328840, now resolved).

When nested like (replace|s=(decode|s=AB&nbsp;YZ)|replace=AB_YZ), it returns breaking code (it breaks when used in or with HTML/CSS code such as span, sup, class).

No time to build the reproduction/test, so I have to leave it for now. Not reported on phab. DePiep (talk) 07:27, 20 February 2023 (UTC)

Template-protected edit request on 21 March 2023

Please replace all code in Module:DecodeEncode with that of module:DecodeEncode/sandbox.

Change: apply require('strict'), and declare functions local explicitly. DePiep (talk) 14:34, 21 March 2023 (UTC)

Invitation is out. -DePiep (talk) 14:49, 21 March 2023 (UTC)
Upd: Gonnym has made large improvements, so the sandboxdiff is large. I do not see strict-related changes. DePiep (talk) 21:31, 21 March 2023 (UTC)
The changes are good and no globals remain. The two mw.ustring could be string. Johnuniq (talk) 06:40, 22 March 2023 (UTC)
thx. As said, please someone with trust perform ER because me editing/commenting in between does not help. DePiep (talk) 08:18, 22 March 2023 (UTC)
  Done — Martin (MSGJ · talk) 18:35, 22 March 2023 (UTC)