Module talk:Plain text

strip_apostrophe_markup

@Galobtter: The function string.gsub() simply returns the string unchanged when the pattern does not match, so you don't need to test for each case first. Also ' doesn't need to be escaped when used as a search pattern. You can't sensibly export the strip_apostrophe_markup function, so it should be local, or could just go inline. You can simplify strip_apostrophe_markup to:

local function strip_apostrophe_markup(txt)
	txt = txt:gsub("'''''", ""):gsub("''''", ""):gsub("'''", ""):gsub("''", "")
	return txt
end

In the main function, text should be a local variable:

local text = frame.args[1]

I don't like altering code while others are developing it, so I'll leave you to update it as you see fit. --RexxS (talk) 19:56, 14 April 2018 (UTC)

I replaced the mw.ustring.gsub with plain gsub because ustring is a lot slower than gsub and is not needed in this module. The optimization is not necessary, but since people are looking at the code I thought it worth mentioning that wikitext is always UTF-8, which means Lua gsub works well with the patterns in this module. Lua gsub works in any language with a pattern like '[12]' ('1' or '2'), but mw.ustring.gsub would be needed for a pattern like '[১২]' (which might be used at the Bengali Wikipedia to search for their equivalent). In the first case (Lua gsub), the pattern finds the first location matching any of the bytes between [ and ]. In the Bengali case, each digit is three bytes in UTF-8, so there are six bytes between the square brackets. If Lua gsub were used, it would look for any of those bytes rather than for either digit. Johnuniq (talk) 09:47, 18 April 2018 (UTC)
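
A minimal sketch of the byte-level behaviour described in the comment above, written for plain Lua. mw.ustring exists only inside Scribunto, so it is not called here. The Bengali digits are spelled as decimal byte escapes, and the sample strings are illustrative assumptions rather than anything taken from the module.

```lua
-- UTF-8 bytes of the Bengali digits 1, 2 and 3 (U+09E7, U+09E8, U+09E9)
local one   = "\224\167\167"
local two   = "\224\167\168"
local three = "\224\167\169"

-- ASCII set: every member is a single byte, so string.gsub behaves as expected
local ascii_out = ("a1b2c"):gsub("[12]", "")

-- Multi-byte set: string.gsub treats the bytes 224, 167 and 168 as the set
-- members, so it also eats the first two bytes of the unrelated digit 3
local broken_out, removed = three:gsub("[" .. one .. two .. "]", "")

print(ascii_out)    -- abc
print(removed)      -- 2
print(#broken_out)  -- 1 (a lone continuation byte, no longer valid UTF-8)
```

The patterns in this module only use ASCII characters inside sets, which is why the plain string library is safe here.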

Could remove indentations

Can be combined with leading spaces: gsub("^[:;%s]+", "") -- 𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 (talk) 20:31, 24 May 2021 (UTC)
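
A short sketch of how the suggested pattern behaves in isolation. The wrapper name strip_leading_indent is hypothetical and is not part of the module.

```lua
-- Remove leading wikitext indentation markers (":" and ";") together with
-- leading whitespace in a single substitution.
local function strip_leading_indent(text)
	-- The outer parentheses drop gsub's second return value (the match count)
	return (text:gsub("^[:;%s]+", ""))
end

print(strip_leading_indent(":: ; indented reply"))  -- indented reply
print(strip_leading_indent("no change: here"))      -- no change: here
```

Because the pattern is anchored with ^, colons and semicolons later in the string are left alone.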

Performance improvements (and other) in the sandbox

I made a few performance (and other) improvements to this module in the sandbox based on the work with Module:User scripts table (for which I started using Module:Plain text, and ended up forking and customising it for the needs there). The two performance improvements are:

  1. Use greedy [^x]+x instead of ungreedy .-x whenever possible; and
  2. Use a single gsub for all File:, Category:, Media:, etc, instead of a gsub for each.

𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 13:48, 21 June 2021 (UTC)
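
The two ideas above could look roughly like the following sketch. It is not the sandbox code: the DISCARD table, the function name strip_special_links and the exact pattern are assumptions made for illustration.

```lua
-- Hypothetical set of link prefixes whose links are dropped entirely (lowercase)
local DISCARD = { file = true, image = true, category = true, media = true }

local function strip_special_links(text)
	-- One gsub handles every [[prefix:...]] link. [^%]]+ is greedy but cannot
	-- run past a "]", so it stops where the lazy .- would have stopped.
	-- Limitation: a link nested inside a file caption ends the match early.
	return (text:gsub("%[%[%s*([^:%]|]+):[^%]]+%]%]", function(prefix)
		if DISCARD[prefix:lower()] then
			return ""   -- drop the whole link
		end
		return nil      -- nil tells gsub to keep the original match
	end))
end

print(strip_special_links("a [[File:X.png|thumb]] b [[Category:Y]] c [[wikt:foo]]"))
-- a  b  c [[wikt:foo]]
```

A table lookup in the callback replaces one pass over the text per prefix, which is where the saving would come from.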

nowiki text removed?

The documentation has in its example: <nowiki>?</nowiki> (a question mark in a nowiki tag).

The module removes this wikitext altogether, including the question mark. Why is this "other stuff" to be removed? -DePiep (talk) 13:41, 2 September 2021 (UTC)

Tag stripping

Currently, this module strips both the tags and their contents for all HTML-style tags, except for <span>, <i>, <b>, <em>, and <strong> (and the last three only because I just added cases for them). However, there are a variety of other tags which are valid in wikitext, and whose contents arguably should be kept after discarding the tags themselves, e.g. <h2>, <dfn>, <sup>, <u>. These could continue to be added here individually, but I think it's probably simpler to reverse the module's behavior: discard the contents only for a curated list of tags, and otherwise keep the contents.

The main issue I can see with that would be for <sub> and <sup>, where just stripping them often results in confusing text, e.g. stripping "2<sup>32</sup>" would produce "232", or "v<sub>e</sub>" producing "ve"; in these cases it might be better to replace the tags with "^"/"_" (resulting, for the aforementioned examples, in output of "2^32" and "v_e") or other appropriate characters (though the suggested characters, I believe, are the ones most often used for indicating super/subscript when formatting options are limited). ディノ千?!☎ Dinoguy1000 04:13, 6 October 2021 (UTC)
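
A rough sketch of the reversed behaviour proposed in this comment, combined with the "^"/"_" idea. The tag list in DISCARD_CONTENTS and the function name strip_tags are hypothetical; which tags belong on the list is exactly what the discussion leaves open.

```lua
-- Hypothetical curated list: tags whose contents are dropped along with the tag
local DISCARD_CONTENTS = { "gallery", "math", "syntaxhighlight" }

local function strip_tags(text)
	-- Mark superscripts and subscripts so "2<sup>32</sup>" does not collapse to "232"
	text = text:gsub("<sup[^>]*>(.-)</sup>", "^%1")
	text = text:gsub("<sub[^>]*>(.-)</sub>", "_%1")
	-- Drop the listed tags together with their contents
	for _, tag in ipairs(DISCARD_CONTENTS) do
		text = text:gsub("<" .. tag .. "[^>]*>.-</" .. tag .. ">", "")
	end
	-- Every remaining tag is removed, but its contents are kept
	text = text:gsub("</?%a[^>]*>", "")
	return text
end

print(strip_tags("2<sup>32</sup>"))                  -- 2^32
print(strip_tags("v<sub>e</sub>"))                   -- v_e
print(strip_tags("a <u>b</u> <math>x^2</math>c"))    -- a b c
```

The sketch ignores self-closing tags and tag names that share a prefix, both of which a real implementation would need to handle.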

Reasonable, especially when whitelist/blacklist are argued well & systematically -- and so more stable. Module history shows that it was never approached this systematically.
Now, the documentation has this peculiar sentence "other stuff that needs removing from short descriptions". Looks like it was purpose-built for WP:SHORTDESC, WP:SDFORMAT then. But it is actually unused in {{short description}}; ask WP:WPSHORTDESC? And what effect does this have on the existing 1M+ usage (that's the module; {{Plain text}} has 35k)? For this, should the proposed extended removals be put in a separate function? -DePiep (talk) 06:06, 6 October 2021 (UTC)
I saw that the original intention was for SHORTDESCs before editing the module and starting this discussion, though I didn't actually check to see if the module is currently used for that; it's mildly amusing to me that it isn't.
To be honest, I have no idea how this change might affect current uses. If performance isn't too much of a concern, we could simply add tracking for cases where nonspecific tags are being stripped and give things a while to filter in before looking through them and seeing if anything interesting appears. That being said, I'd expect the vast majority of uses to be via other templates or modules; some quick searches show it's only being directly used in ~3 dozen templates and modules, which shouldn't be too hard to look through by hand (though TBH I don't know what I'd be looking for).
So what tags should be fully discarded? There's the obvious <br /> (which is already stripped, though not very robustly), and the currently-not-stripped <hr />, <wbr />, and most (almost all?) of the parser/extension tags (and maybe <!-- comment -->, though that doesn't get displayed anyways); <table> and <div> are potential candidates, though I could definitely see arguments for keeping their contents at least sometimes (so maybe make them optional somehow?).
Conversely, looking through WP:HTML reminds me of <abbr>, which would probably also need some specific consideration akin to <sup>/<sub>. The contents of <bdi>/<bdo> might be able to just be presented as they are in the original string, but I'm not an expert in this area, so at the least it would probably require a bit of discussion.
I'm getting away from this discussion at this point, but after thinking about it earlier I concluded that probably the best method for stripping templates while optionally keeping some of their contents would be for templates/modules to have some sort of "plain output" mode, that would be "safe" for applications like this or WP:NAVPOP, which currently just strips most templates entirely. Though obviously this would require quite a bit of work, and some planning/consideration on the implementation (which I don't have many thoughts on myself, but maybe one idea would be adding some feature to TemplateData to indicate "safe" parameters to output directly, assuming TemplateData doesn't already have such a feature). ディノ千?!☎ Dinoguy1000 08:39, 6 October 2021 (UTC)
I've started {{Navbox wikitext-handling templates}}, to see what is related. -DePiep (talk) 20:06, 6 October 2021 (UTC)

Since I've been thinking about this module today anyways, I realized that the link stripping here duplicates the stripping done by Module:Delink, albeit probably less robustly and (I think?) without catching as many cases. Are there any major reasons (other than performance maybe) not to just use Module:Delink for that functionality in this module? ディノ千?!☎ Dinoguy1000 08:42, 6 October 2021 (UTC)

We need a more general approach to any "non-ascii"-stripping. Think HTML, html-tags, wm-extension tags, wikicode like [[{{!}}]], parser-strips, etc. -DePiep (talk) 20:27, 6 October 2021 (UTC)

Keeping contents of <sup>/<sub>

(starting a new topic because #Tag stripping is old and only touches on these tags as a side-comment)

At WP:VPT#Wikilinks from italic titles, I came across this template as a solution for one editor's issue (italics in ship-names), but the lack of preservation for superscript/subscript text breaks another related use (chemical names). The general goal is to convert a properly formatted visual text into a bluelink. But as simple examples that currently don't work correctly for that use, we would want H<sub>2</sub>O (H2O) to become "H2O" not "HO" and <sup>3</sup>H (3H) to become "3H" not "H". DMacks (talk) 16:04, 26 June 2022 (UTC)

This happens because of line 21. Could be fixed by adding these lines ahead of line 21:
		:gsub('<sub.->(.-)</sub>', '%1') --remove subscript markup; retain contents
		:gsub('<sup.->(.-)</sup>', '%1') --remove superscript markup; retain contents
caveat lector: not tested
Trappist the monk (talk) 16:25, 26 June 2022 (UTC)
I came here to ask for this feature, once again. Given existing behaviour & usage, I suggest this should be an opt-in by parameter (|"keep-tag-content"=T [F-by-default]). (alternative: by fork)
Other tag content to be kept? <sup>, <sub>, <hn>, <dfn>, <u>? <nowiki> content cannot be kept because of unknown code injection.
From #Tag stripping: replace <sup>, .. with ⟨^⟩ or generic whitespace, etcetera? As option? -DePiep (talk) 07:53, 21 February 2023 (UTC)
I have adjusted the sandbox and added underline, subscript, and superscript test cases to the testcases page. Is there further testing that needs to be done, or should I roll this out to a million pages? – Jonesey95 (talk) 15:41, 23 July 2023 (UTC)
Looks good! Johnuniq (talk) 23:27, 23 July 2023 (UTC)
Done. Thanks to Trappist the monk for the code. – Jonesey95 (talk) 05:29, 24 July 2023 (UTC)
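
Since the two gsub lines in this thread were posted untested, here is a small standalone check of how they behave on the examples given by DMacks. The wrapper function keep_script_contents is hypothetical; only the two patterns come from the thread, and the deployed module code may differ.

```lua
local function keep_script_contents(text)
	-- The two substitutions proposed in this thread, applied on their own
	text = text:gsub('<sub.->(.-)</sub>', '%1') -- remove subscript markup; retain contents
	text = text:gsub('<sup.->(.-)</sup>', '%1') -- remove superscript markup; retain contents
	return text
end

print(keep_script_contents("H<sub>2</sub>O"))  -- H2O
print(keep_script_contents("<sup>3</sup>H"))   -- 3H
```

The lazy .- after the tag name also skips any attributes, for example a class or style on the opening tag.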

Strips plus signs.

Sadly, I discovered that this template strips plus signs, and Template:Strip tags only strips span and div. Johnjbarton (talk) 22:52, 8 October 2024 (UTC)