Support entities (anyway, please!) about djot HOT 11 OPEN

jgm commented on June 19, 2024

Support entities (anyway, please!)

from djot.

Comments (11)

marrus-sh commented on June 19, 2024

i think supporting inserting characters by codepoint is a good thing—especially with invisible or confusable characters it can be useful. i think HTML entity names are not so good; many of them are essentially legacy and the coverage is not necessarily complete or well‐thought‐out.

i don’t like the XML/HTML entity reference syntax because it makes the decimal form of codepoints &#NNNN; easier to type than the hexadecimal form &#xNNNN;. hexadecimal makes much more sense for unicode and i’m not sure that it even makes sense for decimal codepoints to be supported.

why not extend the emoji syntax to allow arbitrary characters by unicode codepoint, like :U+2764:? perhaps even multiple characters could be included, such as :U+2764.FE0E: (. is commonly used in unicode documentation for delimiting sequences of codepoints). emoji are already a kind of entity reference, after all.
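A minimal sketch of how the proposed `:U+XXXX:` form, including dot-separated sequences, might be expanded. The function names `expand_codepoints` and `expand_refs` are hypothetical and not part of djot; the sketch assumes Lua 5.3 or later for the `utf8` library and works on plain strings rather than a parsed document.

```lua
-- Hypothetical helper: expand the text between the colons of a ":U+XXXX:" reference.
-- Returns the UTF-8 string, or nil if the body is not a valid codepoint sequence.
local function expand_codepoints(body)
  local hex_seq = body:match("^U%+([%x%.]+)$")
  if not hex_seq then return nil end
  local cps = {}
  -- Append a trailing dot so every part, including the last, ends with one.
  for part in (hex_seq .. "."):gmatch("(.-)%.") do
    -- Each part must be one to six hex digits; empty parts are rejected.
    if not part:match("^%x%x?%x?%x?%x?%x?$") then return nil end
    local cp = tonumber(part, 16)
    -- Reject values beyond Unicode's range and the surrogate block.
    if cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then return nil end
    cps[#cps + 1] = cp
  end
  return utf8.char(table.unpack(cps))
end

-- Replace every well-formed reference in a string; anything else is left as written,
-- because gsub keeps the original match when the callback returns nil.
local function expand_refs(text)
  return (text:gsub(":(U%+[%x%.]+):", expand_codepoints))
end

print(expand_refs("I :U+2764.FE0E: djot"))
```

Malformed references such as `:U+110000:` pass through unchanged, which keeps the author's text visible instead of silently dropping it.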

jgm commented on June 19, 2024

"entities are essential for comfortable writing of mixed-language texts - e.g. when mixing right-to-left and left-to-right languages as is common e.g. in United Arab Emirates, Qatar, etc."

Can you explain a bit more why entities help with this? (E.g. give an example?)

marrus-sh commented on June 19, 2024

@uvtc the bigger concern is invisible characters, for example variation selectors, right‐to‐left and left‐to‐right marks, ligation marks (zero‐width joiner and zero‐width non‐joiner), characters which allow breaks (zero‐width space) and prevent them (word joiner), “shy” hyphens, etc…… in some text editors it may be possible to inspect whether these characters are present (CotEditor for example is very good), but in others it may not, and regardless simply having those characters written out in the text is often much easier to handle.

as an example, the codepoint U+3402 㐂 has five different registered variations, which may be indicated by appending the variation selectors U+E0100..U+E0104. the font you are using when composing your document is not necessarily going to be the same one that you use when rendering it, so it may not support all of the different variants. it would be very useful to be able to write 㐂{:U+E0102:} to explicitly indicate the third variant, because (depending on fonts etc) the composed form 㐂󠄂 might not look any different than the character without any variations applied.
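To illustrate why an invisible variation selector is hard to spot, here is a small sketch that lists the codepoints of a string. The helper name `show_codepoints` is hypothetical, and the sketch assumes Lua 5.3 or later. The strings are built with `utf8.char` so that no invisible character has to appear in the source.

```lua
-- Hypothetical helper: render every codepoint of a UTF-8 string as U+XXXX.
local function show_codepoints(s)
  local out = {}
  for _, cp in utf8.codes(s) do
    out[#out + 1] = string.format("U+%04X", cp)
  end
  return table.concat(out, " ")
end

local base = utf8.char(0x3402)              -- the ideograph from the example above
local variant = utf8.char(0x3402, 0xE0102)  -- same ideograph plus a variation selector

print(show_codepoints(base))     -- U+3402
print(show_codepoints(variant))  -- U+3402 U+E0102
```

Depending on the font, both strings may render identically, yet the second one carries an extra codepoint that only shows up when the codepoints are listed like this.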

similar arguments extend to things like wanting to type no‐{:U+2060:}break to add a word‐joiner to suppress line breaking, etc…

as for having to remember the unicode codepoints as opposed to the names, i think many people probably would prefer writing {:U+E0102:} rather than {:variation_selector-19:}; it’s much shorter and easier to skim over in a line of text. in any case, supporting the latter would require unicode character name lookups, which would make implementation a little bit more difficult.

bpj commented on June 19, 2024

If so, then Lua 5.3 style with braces, \u{123}, so that one need only type as many digits as necessary.
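For reference, this is how the braced escape behaves in Lua string literals themselves (Lua 5.3 or later). It only demonstrates the language feature being cited; whether djot would adopt the same form is an open question in this thread.

```lua
-- Lua 5.3+ string literals accept \u{XXX} with as many hex digits as needed.
local eng   = "\u{14b}"    -- U+014B, three digits suffice
local heart = "\u{2764}"   -- U+2764, four digits
local vs19  = "\u{E0102}"  -- U+E0102, five digits, no zero padding required

-- Print the decoded codepoints to confirm each literal holds one character.
print(utf8.codepoint(eng), utf8.codepoint(heart), utf8.codepoint(vs19))
```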

bpj commented on June 19, 2024

That said, I think :0x14b: and :331: and hopefully {:0x14b:} and {:331:} would be a reasonable syntax as an extension of existing emoji syntax (which IMO should include {:emoji:}), since it might allow processors to support custom names: :entity:, {:Unicode name:}, or whatever.

wooorm commented on June 19, 2024

As there are escapes already, why not add unicode escapes as supported in many programming languages? Along the lines of \u1234
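A rough sketch of what expanding such escapes could look like, covering both a fixed four-digit `\uXXXX` form and a braced `\u{...}` form. The function name `expand_escapes` is hypothetical, neither form is djot syntax today, and the sketch assumes Lua 5.3 or later.

```lua
-- Hypothetical expander for two escape styles in document text:
--   \uXXXX    exactly four hex digits
--   \u{X...}  one to six hex digits in braces
local function expand_escapes(text)
  local function to_char(hex)
    local cp = tonumber(hex, 16)
    if cp > 0x10FFFF then return nil end  -- nil leaves the escape as written
    return utf8.char(cp)
  end
  -- Two passes keep the sketch short; a real parser would scan the text once.
  text = text:gsub("\\u{(%x%x?%x?%x?%x?%x?)}", to_char)
  text = text:gsub("\\u(%x%x%x%x)", to_char)
  return text
end

print(expand_escapes("a \\u2192 b"))
print(expand_escapes("\\u{1F600}"))
```

With the fixed-width form, `\u1F600` is read as U+1F60 followed by a literal `0`, so codepoints above U+FFFF need some delimited or longer form.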

dumblob commented on June 19, 2024

I do not care about the syntax here but would like to point out that entities are essential for comfortable writing of mixed-language texts, e.g. when mixing right-to-left and left-to-right languages, as is common in the United Arab Emirates, Qatar, etc.

So any solution you come up with here has to be readable (and comfortable to write) for characters that change the text direction, etc.

uvtc commented on June 19, 2024

Is the purpose for supporting entities to let you put in unicode characters when you're unable to insert the actual unicode character into your source? (That is, you know the character you want but cannot copy/paste it into your content file? Is it common to know the codepoint but not be able to copy/paste the character in?)

:U+2192: (for "→") is pretty syntax, and symmetric with emoji syntax, but not very readable (unless you happen to know that 2192 means "→"). Those HTML entities are more readable :&rarr: (and potentially easier to remember), though I agree with @marrus-sh about their problems.

I didn't realize that the list of djot-supported emojis was so large. Seems like adding 10 or 20 commonly-used readable unicode char names like :right-arrow: wouldn't be too crazy, would it?

dumblob commented on June 19, 2024

@jgm sorry for the delay - yes, the intent is mostly what @marrus-sh wrote above. Namely to make visible all those characters (incl. future ones) which change or influence overall "style", "form", "layout", "paragraphing", etc.

jgm commented on June 19, 2024

See my idea in #112 of generalizing the syntax currently used for emojis.
The idea would be that :smile: is parsed as, say,

special text="smile"

If you use emojis, you can use this syntax for them with a filter that inserts the emoji character proper to the alias. But you could just as easily use a different filter to associate whatever unicode string you like.
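The filter idea above could be sketched as a lookup from alias to replacement string. Everything here is an assumption for illustration: the alias names, the `resolve_symbol` callback, and the plain-string `expand_symbols` driver are hypothetical and do not use the actual djot filter API, which would operate on parsed nodes.

```lua
-- Hypothetical alias table; a document author would choose their own entries.
local aliases = {
  ["right-arrow"] = "\u{2192}",
  ["wj"]  = "\u{2060}",  -- word joiner
  ["rlm"] = "\u{200F}",  -- right-to-left mark
  ["lrm"] = "\u{200E}",  -- left-to-right mark
}

-- Stand-in for a filter callback: given the text between the colons,
-- return the replacement string, or nil to keep the symbol as written.
local function resolve_symbol(alias)
  return aliases[alias]
end

-- Plain-string demonstration of applying the callback to every :alias:.
local function expand_symbols(text)
  return (text:gsub(":([%w_+-]+):", resolve_symbol))
end

print(expand_symbols("a :right-arrow: b"))
```

Because unknown aliases return nil, a symbol such as `:smile:` passes through untouched and can still be handled by a separate emoji filter.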

bpj commented on June 19, 2024

I have written a simple Pandoc filter which replaces codepoint escapes like :0x14b:, :331: in strings with characters.

Gotcha: a literal colon (:) next to a digit must be escaped as :58:/:0x3a:!

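-- Pandoc Lua filter: replace :NNN: and :0xHHH: codepoint escapes in Str elements.
local char = utf8.char
-- Captures the whole escape and, separately, the text between the colons.
local pat = '(%:(%w+)%:)'
local function subst (match, id)
  -- If tonumber accepts it, treat it as a codepoint (decimal or 0x-prefixed hex).
  local cp = tonumber(id)
  if cp then
    -- utf8.char raises a scarcely helpful error for out-of-range values, so wrap it.
    local ok, res = pcall(char, cp)
    if ok then
      return res
    end
    error("Failed to convert " .. tostring(match) .. " to a character:\n\t" .. tostring(res))
  end
  -- Not numeric (e.g. :smile:): leave the text untouched.
  return match
end
function Str (str)
  -- The extra parentheses drop gsub's second return value (the substitution count).
  return pandoc.Str((str.text:gsub(pat, subst)))
end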
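To try the substitution outside pandoc, and to see the colon gotcha in action, the same pattern can be exercised on plain strings. This is a sketch rather than part of the filter: the `expand` wrapper is a hypothetical name, and unlike the filter it keeps the original text instead of raising an error when conversion fails.

```lua
-- Same pattern and numeric test as the filter, minus the pandoc wrapper.
local pat = '(%:(%w+)%:)'
local function subst(match, id)
  local cp = tonumber(id)
  if cp then
    local ok, res = pcall(utf8.char, cp)
    if ok then return res end
  end
  return match  -- unlike the filter, keep the text instead of raising an error
end

local function expand(text)
  return (text:gsub(pat, subst))
end

print(expand("e:0x14b:"))           -- the escape becomes U+014B
print(expand("at 10:30:00"))        -- ":30:" is also treated as an escape
print(expand("at 10:58:30:58:00"))  -- escaped colons come out as ":"
```

In the second call the time stamp loses its middle field, because `:30:` is read as codepoint 30 (a control character); the third call shows the workaround of writing each colon as `:58:`.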

It could easily be ported to a djot filter using my pure-Lua char function from #44 (comment)
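For context, a pure-Lua replacement for `utf8.char` might look like the sketch below. This is not the function from #44; it is an independent illustration with a hypothetical name, written with plain arithmetic so that it does not depend on the `utf8` library or on bitwise operators.

```lua
-- Hypothetical stand-in for utf8.char on Lua versions without the utf8 library.
-- Encodes one codepoint as UTF-8 using only division and modulo.
local function codepoint_to_utf8(cp)
  assert(cp >= 0 and cp <= 0x10FFFF and cp % 1 == 0, "codepoint out of range")
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)                                   -- 1 byte: 0xxxxxxx
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 0x40),              -- 2 bytes: 110xxxxx 10xxxxxx
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 0x1000),            -- 3 bytes
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    return char(0xF0 + floor(cp / 0x40000),           -- 4 bytes
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

print(codepoint_to_utf8(0x14b))
```

The sketch does not reject surrogate codepoints (U+D800 to U+DFFF); a stricter version would check for them before encoding.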
