s from GNU Unifont
wcwidth() sometimes gives unreasonable widths for characters. To try to fix this, I've decided to read the widths in from GNU Unifont, a monospaced font with extensive coverage of the Unicode code space.
The following section requires FontForge to be installed. The TTF files are converted to FontForge's SFD format, which is plain text and easy to parse.
for fontfile in ["unifont-$version", "unifont_upper-$version"]
    isfile("$fontfile.ttf") || download("http://unifoundry.com/pub/unifont-$version/font-builds/$fontfile.ttf", "$fontfile.ttf")
    run(`fontforge -c "Open(\"$fontfile.ttf\");Save(\"$fontfile.sfd\");Quit(0);"`)
end
#Read sfdfile for character widths
function parsesfd(filename::String, CharWidths::Dict{Int,Int}=Dict{Int,Int}())
    state = :seekchar
    codepoint = width = nothing
    for line in readlines(open(filename))
        if state==:seekchar #StartChar: nonmarkingreturn
            if contains(line, "StartChar: ")
                codepoint = nothing
                width = nothing
                state = :readdata
            end
        elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
            contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
            contains(line, "Width:") && (width = int(split(line)[2]))
            if codepoint!=nothing && width!=nothing
                CharWidths[codepoint] = width #Record the width once both fields are seen
                state = :seekchar
            end
        end
    end
    CharWidths
end
@time CharWidths=parsesfd("unifont-$version.sfd")
println("Number of character widths read: ", length(CharWidths))
@time CharWidths=parsesfd("unifont_upper-$version.sfd", CharWidths)
println("Number of character widths read: ", length(CharWidths))
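As a minimal sketch of the same two-state parser in current Julia syntax (the notebook itself targets Julia 0.3), the following reads an in-memory SFD fragment instead of a file. The sample text and the name parse_sfd_widths are made up for illustration; only the StartChar / Encoding / Width layout is taken from the comments in parsesfd. The division by 512 mirrors the scaling the notebook applies to the Unifont widths later on.

```julia
# Sketch: two-state parser for SFD glyph records, written for Julia 1.x.
# `parse_sfd_widths` and `sample` are hypothetical names used only here.
function parse_sfd_widths(io::IO, widths::Dict{Int,Int}=Dict{Int,Int}())
    state = :seekchar
    codepoint = nothing
    width = nothing
    for line in eachline(io)
        if state == :seekchar
            if occursin("StartChar: ", line)
                codepoint = nothing
                width = nothing
                state = :readdata
            end
        else  # state == :readdata
            # The second number after "Encoding:" is taken as the code point,
            # matching split(line)[3] in parsesfd
            occursin("Encoding:", line) && (codepoint = parse(Int, split(line)[3]))
            occursin("Width:", line) && (width = parse(Int, split(line)[2]))
            if codepoint !== nothing && width !== nothing
                widths[codepoint] = width
                state = :seekchar
            end
        end
    end
    return widths
end

# Invented sample records, not copied from a real Unifont SFD file
sample = """
StartChar: A
Encoding: 65 65 36
Width: 512
EndChar
StartChar: uni4E00
Encoding: 19968 19968 9000
Width: 1024
EndChar
"""

raw = parse_sfd_widths(IOBuffer(sample))
# Scale font units to terminal columns, as the notebook does with v÷512
columns = Dict(c => w ÷ 512 for (c, w) in raw)
```

Passing an IO rather than a file name keeps the sketch testable without FontForge installed.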
This section assumes that libmojibake is installed and available somewhere in the current library path.
#Load data for Unicode general categories
general_category_abbr = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
]
abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]
catcode(c)=unsafe_load(ccall((:utf8proc_get_property,:libmojibake), Ptr{Uint16}, (Int32,), c))
catabbr(c)=abbr(catcode(c)) #Abbreviation of the general category of code point c
catabbr (generic function with 1 method)
#Load data for Unicode character names
isfile("UnicodeData.txt") || download("http://www.unicode.org/Public/UNIDATA/UnicodeData.txt",
    "UnicodeData.txt")
charname = Dict{Uint32,String}()
for line in readlines(open("UnicodeData.txt"))
    tokens = split(line, ';')
    #Field 2 is the character name, field 11 the Unicode 1.0 name
    length(tokens)≥11 && (charname[uint32("0x"*tokens[1])] = tokens[2]*"/"*tokens[11])
end
#UAX 11: East Asian Width
isfile("EastAsianWidth.txt") || download("http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt", "EastAsianWidth.txt")
for line in readlines(open("EastAsianWidth.txt"))
    #Strip comments
    line[1] == '#' && continue
    precomment = split(line, '#')[1]
    #Parse code point range and width code
    tokens = split(precomment, ';')
    length(tokens)≥2 || continue
    charrange = tokens[1]
    width = strip(tokens[2])
    #Parse code point range into Julia UnitRange
    rangetokens = split(charrange, "..")
    charstart = uint32("0x"*rangetokens[1])
    charend = uint32("0x"*rangetokens[length(rangetokens)>1 ? 2 : 1])
    #Assign widths
    for c in charstart:charend
        width=="N" && continue #Ignore neutral characters
        CharWidths[c]=(width=="W" || width=="F") ? 2 : #Wide or full
                      (width=="Na"|| width=="H" || width=="A") ? 1 : #Narrow or half or ambiguous (default to narrow in non-East-Asian contexts, which we can assume to be the default)
                      error("Unknown East Asian width code: $width for code point: $c")
    end
end
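The same parsing steps can be sketched in current Julia syntax against a few in-memory lines, which makes the range handling easy to check without downloading anything. The function name parse_eaw and the sample lines are assumptions made for this illustration; the sample only imitates the layout of EastAsianWidth.txt, and the width mapping is the one used above (W and F give 2, Na, H and A give 1, N is skipped).

```julia
# Sketch: parse EastAsianWidth.txt-style lines into a code point => columns table (Julia 1.x).
# `parse_eaw` and `sample` are hypothetical names; the sample imitates the file's layout.
function parse_eaw(io::IO)
    widths = Dict{UInt32,Int}()
    for line in eachline(io)
        precomment = strip(first(split(line, '#')))   # drop trailing comments
        isempty(precomment) && continue               # blank or comment-only line
        tokens = split(precomment, ';')
        length(tokens) ≥ 2 || continue
        code = strip(tokens[2])
        code == "N" && continue                       # neutral: no width assigned here
        bounds = split(strip(tokens[1]), "..")        # "XXXX" or "XXXX..YYYY"
        lo = parse(UInt32, first(bounds), base=16)
        hi = parse(UInt32, last(bounds), base=16)
        w = code in ("W", "F") ? 2 :
            code in ("Na", "H", "A") ? 1 :
            error("Unknown East Asian width code: $code")
        for c in lo:hi
            widths[c] = w
        end
    end
    return widths
end

sample = """
# comment line
0041..005A;Na   # Lu  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00A1;A          # Po       INVERTED EXCLAMATION MARK
3000;F          # Zs       IDEOGRAPHIC SPACE
4E00..4E02;W    # Lo   [3] CJK UNIFIED IDEOGRAPH
0000..001F;N    # Cc  [32] <control>
"""
eaw = parse_eaw(IOBuffer(sample))
```

Treating a single code point as a one-element range (first and last of the split are the same token) avoids a separate branch for lines without "..".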
#Unicode character type for pretty printing
import Base:convert, show
type UniChar
    c::Char
end
convert(::Type{Char}, u::UniChar)=u.c
function show(io::IO, uc::UniChar)
    c = uc.c
    print(io, "0x", hex(c,6), " '", char(c))
    print(io, "' category: ", catabbr(c), " name: ", get(charname, c, "N/A"))
end
UniChar(0x0423), UniChar('A')
(0x000423 'У' category: Lu name: CYRILLIC CAPITAL LETTER U/,0x000041 'A' category: Lu name: LATIN CAPITAL LETTER A/)
I assume here that a printable character is a valid Unicode character that is not in the Cn, Cs or Cc Unicode general categories.
#Working definition of isprintable
isprintable(c::Union(Char,Integer)) = c ≤ 0x10ffff && is_valid_char(c) && isprintable_category(catcode(c))
function isprintable_category(category)
    !( category==Base.UTF8proc.UTF8PROC_CATEGORY_CN #Unassigned
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CS #Surrogate
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CC #Control
    || category==0 #Unknown; most of these should be CN (JuliaLang/julia#7792)
    )
end
isprintable_category (generic function with 1 method)
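For comparison, here is a rough sketch of the same working definition in current Julia, without the libmojibake ccall. It assumes that Base.Unicode.category_abbrev, a non-exported Base function, is available and returns the two-letter general category; the name isprintable_sketch is hypothetical and not part of Base.

```julia
# Sketch for Julia 1.x: "printable" = valid code point whose general category
# is not Cn, Cs or Cc. `isprintable_sketch` is a hypothetical name.
# Base.Unicode.category_abbrev is non-exported, so treat its use as an assumption.
const NONPRINTING = ("Cn", "Cs", "Cc")   # unassigned, surrogate, control

function isprintable_sketch(c::Char)
    isvalid(c) || return false
    return !(Base.Unicode.category_abbrev(c) in NONPRINTING)
end

# Non-negative integer code points: check validity before constructing the Char
isprintable_sketch(x::Integer) = isvalid(Char, x) && isprintable_sketch(Char(x))
```

For example, isprintable_sketch('A') holds, while a control character such as '\n' and a surrogate such as 0xd800 are rejected.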
The following code classifies all printable Unicode code points by their width in the font and the width reported by the system wcwidth().
#Convert font units to character cells
for (c,v) in CharWidths
    CharWidths[c] = v÷512
end
#Classify characters
Boxes = Dict()
for c in 0x0000:0x10ffff
    width=isprintable(c) ? (haskey(CharWidths,c) ? CharWidths[c] : -1) : -1
    idx = (width, int(ccall(:wcwidth, Int32, (Uint32,), c)))
    Boxes[idx] = push!(get(Boxes, idx, {}), c)
end
#Output table in GFM format
print("fnt\\sys")
for j=-1:2
    print("\t | ", j )
end
println("\n"*"------- | "^3 * "-------")
for i=-1:2
    print("__", i, "__")
    for j=-1:2
        print("\t | ")
        i==j && print("__") #Bold the diagonal, where font and system agree
        print(length(get(Boxes, (i,j), {})))
        i==j && print("__")
    end
    println()
end
fnt\sys | -1 | 0 | 1 | 2
------- | ------- | ------- | -------
__-1__ | __866893__ | 1 | 2604 | 0
__0__ | 833 | __89__ | 448 | 0
__1__ | 2803 | 111 | __143522__ | 0
__2__ | 8078 | 2 | 3685 | __85043__
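The bucketing behind this table can be sketched in current Julia with toy width functions in place of the font data and the wcwidth ccall. The names classify, fontwidth and syswidth are hypothetical stand-ins. The toy data only encodes one disagreement that shows up later in this notebook: the soft hyphen, U+00AD, which the system reports as width 1 and which ends up as width 0 in CharWidths.

```julia
# Sketch (Julia 1.x): bucket code points by (font width, system width).
# `classify`, `fontwidth`, and `syswidth` are hypothetical stand-ins for
# the CharWidths lookup and the wcwidth ccall used in the notebook.
function classify(fontwidth, syswidth, codepoints)
    boxes = Dict{Tuple{Int,Int},Vector{UInt32}}()
    for c in codepoints
        key = (fontwidth(c), syswidth(c))
        push!(get!(boxes, key, UInt32[]), c)   # create the bucket on first use
    end
    return boxes
end

# Toy data: the two sources agree except on U+00AD (soft hyphen)
fontwidth(c) = c == 0xad ? 0 : 1
syswidth(c) = 1
boxes = classify(fontwidth, syswidth, 0xa0:0xaf)

# Off-diagonal buckets are the discrepancies worth inspecting
disagreements = Dict(k => length(v) for (k, v) in boxes if k[1] != k[2])
```

Keying the dictionary on the tuple means each cell of the table is just the length of one bucket, and the off-diagonal cells fall out of a single filter.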
#Break down discrepancies by general category
function printbreakdown(characters)
    catcounts = Dict()
    for c in characters
        ca = catabbr(c)
        catcounts[ca] = push!(get(catcounts, ca, {}), c)
    end
    for (ca, cnt) in sort!([c for c in catcounts])
        println(ca,": ",length(cnt))
    end
end
for i=-1:2, j=-1:2
    i==j && continue
    println("\n\nfont = $i, system = $j")
    haskey(Boxes,(i,j)) && printbreakdown(Boxes[i,j])
end
font = -1, system = 0
Cc: 1

font = -1, system = 1
00: 2 Cs: 2048 Ll: 100 Lu: 91 So: 363

font = -1, system = 2

font = 0, system = -1
Cf: 144 Lo: 2 Mc: 168 Me: 4 Mn: 515

font = 0, system = 1
Cf: 2 Mc: 115 Mn: 327 So: 2 Zl: 1 Zp: 1

font = 0, system = 2

font = 1, system = -1
Ll: 411 Lm: 159 Lo: 1071 Lu: 241 Mn: 241 Nd: 49 Nl: 59 No: 150 Pc: 1 Pd: 6 Pe: 7 Pf: 6 Pi: 6 Po: 89 Ps: 8 Sc: 12 Sk: 61 Sm: 4 So: 222

font = 1, system = 0
Mn: 111

font = 1, system = 2

font = 2, system = -1
Cf: 4 Ll: 51 Lm: 11 Lo: 6389 Lu: 29 Mc: 1 Mn: 2 Nd: 162 Nl: 13 No: 106 Pe: 2 Po: 150 Ps: 2 Sc: 6 Sm: 46 So: 1104

font = 2, system = 0
Mn: 2

font = 2, system = 1
Ll: 223 Lm: 2 Lo: 1650 Lu: 207 Mn: 6 Nd: 137 Nl: 1 No: 60 Pd: 2 Pe: 7 Po: 62 Ps: 7 Sc: 4 Sm: 431 So: 883 Zs: 3
#Let's drill down into the Cf characters more closely
for c in sort!([keys(CharWidths)...])
    if catabbr(c)=="Cf"
        println(charwidth(char(c)), " ", CharWidths[c], " ", UniChar(c))
    end
end
Most of these should have width zero, except the Arabic characters [[0x0601:0x0603]..., 0x06dd], of which the standard says:
"Unlike most other format control characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order." - Unicode Standard v.6.2.0, Section 8.2 - Arabic (p.256)
for c in sort!([keys(CharWidths)...])
    if catabbr(c)=="Cf" && c∉[0x0601, 0x0602, 0x0603,0x06dd]
        CharWidths[c]=0 #Format characters other than the Arabic signs get zero width
    end
end
for c in sort!([keys(CharWidths)...])
    if catabbr(c)=="Cf" && (charwidth(char(c)) != CharWidths[c])
        println(charwidth(char(c)), " ", CharWidths[c], " ", UniChar(c))
    end
end
1 0 0x0000ad '' category: Cf name: SOFT HYPHEN/
0 2 0x000601 '' category: Cf name: ARABIC SIGN SANAH/
0 2 0x000602 '' category: Cf name: ARABIC FOOTNOTE MARKER/
0 2 0x000603 '' category: Cf name: ARABIC SIGN SAFHA/
0 2 0x0006dd '' category: Cf name: ARABIC END OF AYAH/
1 0 0x00200b '' category: Cf name: ZERO WIDTH SPACE/
I think the remaining discrepancies can all be resolved in favor of the current values.
#Let's drill down into the spacing characters more closely
for c in 0x0000:0xffff
    if catabbr(c)=="Zl" || catabbr(c)=="Zp" || catabbr(c)=="Zs"
        println(charwidth(char(c)), " ", CharWidths[c], " ", UniChar(c))
    end
end
1 1 0x000020 ' ' category: Zs name: SPACE/
1 1 0x0000a0 ' ' category: Zs name: NO-BREAK SPACE/NON-BREAKING SPACE
1 2 0x001680 ' ' category: Zs name: OGHAM SPACE MARK/
1 1 0x002000 ' ' category: Zs name: EN QUAD/
1 1 0x002001 ' ' category: Zs name: EM QUAD/
1 1 0x002002 ' ' category: Zs name: EN SPACE/
1 1 0x002003 ' ' category: Zs name: EM SPACE/
1 1 0x002004 ' ' category: Zs name: THREE-PER-EM SPACE/
1 1 0x002005 ' ' category: Zs name: FOUR-PER-EM SPACE/
1 1 0x002006 ' ' category: Zs name: SIX-PER-EM SPACE/
1 1 0x002007 ' ' category: Zs name: FIGURE SPACE/
1 1 0x002008 ' ' category: Zs name: PUNCTUATION SPACE/
1 1 0x002009 ' ' category: Zs name: THIN SPACE/
1 1 0x00200a ' ' category: Zs name: HAIR SPACE/
1 2 0x002028 ' ' category: Zl name: LINE SEPARATOR/
1 2 0x002029 ' ' category: Zp name: PARAGRAPH SEPARATOR/
1 2 0x00202f ' ' category: Zs name: NARROW NO-BREAK SPACE/
1 1 0x00205f ' ' category: Zs name: MEDIUM MATHEMATICAL SPACE/
2 2 0x003000 ' ' category: Zs name: IDEOGRAPHIC SPACE/
#By definition, should have zero width (on the same line)
#0x002028 ' ' category: Zl name: LINE SEPARATOR/
CharWidths[0x2028]=0
#0x002029 ' ' category: Zp name: PARAGRAPH SEPARATOR/
CharWidths[0x2029]=0
#By definition, should be narrow = width of 1 en space
#0x00202f ' ' category: Zs name: NARROW NO-BREAK SPACE/
CharWidths[0x202f]=1
#By definition, should be wide = width of 1 em space
#0x002001 ' ' category: Zs name: EM QUAD/
CharWidths[0x2001]=2
#0x002003 ' ' category: Zs name: EM SPACE/
CharWidths[0x2003]=2
for c in sort!([keys(CharWidths)...])
    if (catabbr(c)=="Zl" || catabbr(c)=="Zp" || catabbr(c)=="Zs") && (charwidth(char(c)) != CharWidths[c])
        println(charwidth(char(c)), " ", CharWidths[c], " ", UniChar(c))
    end
end
1 2 0x001680 ' ' category: Zs name: OGHAM SPACE MARK/
1 2 0x002001 ' ' category: Zs name: EM QUAD/
1 2 0x002003 ' ' category: Zs name: EM SPACE/
1 0 0x002028 ' ' category: Zl name: LINE SEPARATOR/
1 0 0x002029 ' ' category: Zp name: PARAGRAPH SEPARATOR/
The Ogham space mark could be either width 1 or 2 depending on the choice of font glyph. The remaining discrepancies can all be resolved in favor of the current values in CharWidths.
#Print one C if-statement covering the code point range ucs, all of width v
function printccode(ucs::UnitRange, v::Integer)
    firstc, lastc = first(ucs), last(ucs)
    if length(ucs)==1
        println("\tif (ucs==0x", hex(firstc), ") return ", v, ";")
    else
        println("\tif (ucs>=0x", hex(firstc), " && ucs<=0x", hex(lastc), ") return ", v, ";")
    end
end
currentv = 0
firstc = 0
lastc = 0
println("int mk_wcwidth(wchar_t ucs)")
for c in 0x0000:0x10ffff
    #If not printable character, wcwidth = -1
    #If charwidth is not defined, assume category Cn (unassigned = not printable)
    v = isprintable(c) ? (haskey(CharWidths, c) ? CharWidths[c] : -1) : -1
    if v ≠ currentv
        #Width changed: emit the run that just ended and start a new one
        lastc>0 && printccode(firstc:lastc, currentv)
        firstc, currentv = c, v
    end
    lastc = c
end
printccode(firstc:lastc, currentv)
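The run-length grouping in this loop can also be written as a small pure function, which makes it easy to test on a toy width table before running it over the whole code space. This is only a sketch in current Julia syntax; width_runs, emit_c and the toy width function are hypothetical names and data, not part of the notebook.

```julia
# Sketch (Julia 1.x): group consecutive code points with equal width into ranges
# and emit one C `if` per range. `width_runs` and `emit_c` are hypothetical names.
function width_runs(widthof, codepoints::AbstractUnitRange)
    runs = Tuple{UnitRange{Int},Int}[]
    isempty(codepoints) && return runs
    firstc = Int(first(codepoints))
    currentv = widthof(firstc)
    for c in Int(first(codepoints))+1:Int(last(codepoints))
        v = widthof(c)
        if v != currentv
            push!(runs, (firstc:c-1, currentv))   # close the finished run
            firstc, currentv = c, v
        end
    end
    push!(runs, (firstc:Int(last(codepoints)), currentv))   # final run
    return runs
end

function emit_c(io::IO, runs)
    for (r, v) in runs
        lo, hi = string(first(r), base=16), string(last(r), base=16)
        if length(r) == 1
            println(io, "\tif (ucs==0x", lo, ") return ", v, ";")
        else
            println(io, "\tif (ucs>=0x", lo, " && ucs<=0x", hi, ") return ", v, ";")
        end
    end
end

# Toy width function standing in for the isprintable/CharWidths lookup
toy(c) = c < 0x20 ? -1 : (c == 0x7f ? -1 : 1)
runs = width_runs(toy, 0x00:0x80)
emit_c(stdout, runs)
```

Separating the grouping from the printing avoids the mutable globals of the loop above, and it always emits the first run, including one that ends at code point 0.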
