
Conversation

@clsource commented Mar 31, 2021

Regarding #891

This implementation is not UTF-8 complete (no emojis).
But at least it gives room for most Roman/Latin-based languages to use variable names
with non-ASCII characters plus some symbols.

@mhermier (Contributor) commented Apr 1, 2021

I'm not against this change at all, but before accepting it, I think wren-cli should be fixed to accept UTF-8 input.

@PureFox48 (Contributor)

I'm in favor of a change of this nature as it would improve language coverage considerably without going the full UTF-8 route.

I'd certainly include all the upper and lower case letters in ISO 8859-1, as these come in nice big blocks, though I'm not so sure about allowing symbols as well, since we don't even allow those (except underscore) in the ASCII range.

@clsource (Author) commented Apr 1, 2021

I'm in favor of a change of this nature as it would improve language coverage considerably without going the full UTF-8 route.

I'd certainly include all the upper and lower case letters in ISO 8859-1, as these come in nice big blocks, though I'm not so sure about allowing symbols as well, since we don't even allow those (except underscore) in the ASCII range.

There are some symbols, like Ø or ß, which are letters in some European languages.
So I thought it was a good idea to include some mathematical symbols as well, like π or Δ,
and some currency symbols like $ and the at sign @, since they are not keywords and have no meaning to Wren.

@clsource (Author) commented Apr 1, 2021

You know, I have been thinking about this implementation, and it may be easier to have a trie data structure for this that returns true or false depending on the bytes of the character.

https://www.techiedelight.com/trie-implementation-insert-search-delete/

@mhermier (Contributor) commented Apr 1, 2021 via email

@clsource (Author) commented Apr 1, 2021

The problem is that at this stage of compilation the function is called char by char,
and each char byte by byte (for non-ASCII values).

If I want to have Ñ, I would see two values, 0xC3 0x91, split across two function calls.

So, if I'm not mistaken, all the lead bytes of multi-byte UTF-8 sequences start from 0xC2 (0xC0 and 0xC1 never appear in valid UTF-8), but I failed to implement something using only that value.

In wren_utils.c we have some interesting functions that tell how many bytes to expect for a single character:

int wrenUtf8EncodeNumBytes(int value)
{
  ASSERT(value >= 0, "Cannot encode a negative value.");
  
  if (value <= 0x7f) return 1;
  if (value <= 0x7ff) return 2;
  if (value <= 0xffff) return 3;
  if (value <= 0x10ffff) return 4;
  return 0;
}
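The decode side is the mirror image of that function: the lead byte alone tells you how many bytes the sequence occupies, which is exactly what a char-by-char lexer needs. Here is a minimal sketch of that check (a sketch for illustration, not the actual wren_utils.c code):

```c
#include <assert.h>
#include <stdint.h>

// Sketch (not the actual Wren source): length of a UTF-8 sequence,
// judged from its lead byte alone. Returns 0 for continuation bytes
// and other invalid lead bytes.
static int utf8NumBytesFromLeadByte(uint8_t byte)
{
  if ((byte & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
  if ((byte & 0xe0) == 0xc0) return 2; // 110xxxxx: 2-byte sequence
  if ((byte & 0xf0) == 0xe0) return 3; // 1110xxxx: 3-byte sequence
  if ((byte & 0xf8) == 0xf0) return 4; // 11110xxx: 4-byte sequence
  return 0; // 10xxxxxx continuation byte, or invalid lead byte
}
```

For example, the lead byte of Ñ (0xC3) reports 2, so the lexer knows the next byte belongs to the same character.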

@mhermier (Contributor) commented Apr 1, 2021 via email

@mhermier (Contributor) commented Apr 1, 2021

Looking at the code, I think "currentChar" should become a value, and we should store a pointer to the current position to replace the actual "currentChar--". This will have some impact and will unfortunately need to be benchmarked properly...

@clsource (Author) commented Apr 1, 2021

OK, I somehow achieved full UTF-8 support, but it requires "duck" thinking.

static bool isAsciiName(uint8_t c)
{
  return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_');
}

static bool isUtf8Name(uint8_t ch)
{
  if (ch < 0 || ch > 0x10ffff) {
    return false;
  }
  
  // Support the traditional ASCII-only identifiers
  if (ch <= 0x7f && isAsciiName(ch)) {
    return true;
  }

  // Assume it is a valid UTF-8 character if the value is above 127
  // (ASCII) and not beyond the maximum Unicode code point
  return (ch > 0x7f && ch <= 0x10ffff);
}

@mhermier (Contributor) commented Apr 1, 2021

The type is invalid (uint8_t -> int). You can simplify the logic to:

static bool isAsciiName(uint8_t c)
{
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c == '_';
}

static bool isUtf8Name(int ch)
{
  return isAsciiName(ch) ||
         (ch > 0x7f && ch <= 0x10ffff);   // Assume it is a valid UTF-8 code point
}

Since most passed chars are (hopefully) valid and probably ASCII, let's test the expected case first. Pre-filtering will most probably only make sense when refining UTF-8 characters.
The chance of 'ch' being promoted to a register value is high, so the comparison logic should be fast and will probably be optimized by the compiler, but writing it that way will help with debugging.

@PureFox48 (Contributor)

When I have some spare time, I like to read back through old issues as this helps me understand how Wren got to be where it is now (I've only been involved since 2020).

I came across #68 earlier today and, when asked whether not supporting UTF-8 identifiers was by design, @munificent replied as follows:

By design. From what I've heard from Java and JS folks, Unicode identifiers ended up being a security and maintainability headache. For what it's worth, Ruby only allows ASCII letters in identifiers and Matz and company are Japanese.

Lua apparently uses the current locale to decide which identifiers are valid, which I think means code may not be portable across machines.

Although that was back in 2015, I think those remarks are still pertinent to this PR today.

I think it would be asking for trouble to try and support the whole of UTF-8. I'm not against extending the identifier range to include the letters in ISO-8859-1 because I think that would be generally useful and I can't see any particular problems there - it is always clear whether they're lower case or upper case and any accent is part of the symbol. But I think it would be unwise to go beyond that.

@clsource (Author) commented Apr 14, 2021

I think that many of those languages have a heavy load of baggage that Wren simply does not have (they were born in the 90s, years before UTF-8 became widespread). Hence they have those maintainability problems.

http://xahlee.info/comp/unicode_support_ruby_python_elisp.html

  • Ruby has robust support for Unicode, starting with version 1.9 (2007).
  • Unicode can be used in variable or function names.
  • Non-letter Unicode math symbols are allowed.
# -*- coding: utf-8 -*-
# ruby

 = "♥"

def λ n
  n + "美"
end

p (λ ) == "♥美" # true

What I can see is maybe a way to enable UTF-8 variable-name support explicitly.

Other languages use

# -*- coding: utf-8 -*-  # also emacs, python, convention.

Since Wren now has attributes, we can use them to declare, at the start of the file, that UTF-8 variable names are allowed.

#! /usr/bin/env wren
#! coding = "utf-8"

Using that attribute would tell the compiler to allow UTF-8 identifiers and consider other security measures if needed. Since it would be optional, it would not affect performance for existing non-UTF-8 files.

This would affect only identifiers, since all the other UTF-8 components are already in standard Wren.

I think the only missing part for full UTF-8 support in Wren is allowing UTF-8 identifiers :)

@PureFox48 (Contributor)

Yes, but Wren has a problem which most other programming languages do not have - it needs to be able to distinguish between upper and lower case letters for class scope purposes.

To figure that out for characters with codepoints > 255, you'd probably need to refer to a table.

Also some writing systems (Chinese, Japanese, Korean, Arabic etc) have no distinction between upper and lower case characters. What are you going to do about those?

Another problem is that accented symbols in Unicode can either be a single character or two characters (the accent being separate) though these may be indistinguishable in some fonts. So should these be treated as single letters even if they are represented by two Unicode codepoints?

The use of emojis as identifiers can present even worse problems as the emoji may look like a single character though in reality it consists of several codepoints.

It's these sort of problems which have led me to believe that supporting codepoints beyond 255 is more trouble than it's worth for Wren.

It's possible that you might be able to support other alphabets which have upper/lower case distinction such as Greek and Cyrillic by the use of attributes though I think that's something for the future as attributes are just user-defined at present.

@clsource (Author)

Yes, but Wren has a problem which most other programming languages do not have - it needs to be able to distinguish between upper and lower case letters for class scope purposes.

Also some writing systems (Chinese, Japanese, Korean, Arabic etc) have no distinction between upper and lower case characters. What are you going to do about those?

I think one easy solution to this is just to consider any non-ASCII character as lowercase by default. If you really need to use emojis or non-ASCII symbols in your class name, then just start with a capital ASCII letter, like Class_♥.

Another problem is that accented symbols in Unicode can either be a single character or two characters (the accent being separate) though these may be indistinguishable in some fonts. So should these be treated as single letters even if they are represented by two Unicode codepoints?

The use of emojis as identifiers can present even worse problems as the emoji may look like a single character though in reality it consists of several codepoints.

I don't know what problems there would be with the use of accented characters or emojis in identifiers. They worked well in the tests. Are there any other tests that should be run to verify they work well?

@PureFox48 (Contributor)

I think one easy solution to this is just to consider any non-ASCII character as lowercase by default. If you really need to use emojis or non-ASCII symbols in your class name, then just start with a capital ASCII letter, like Class_♥.

Yes, but that would mean that they would be inaccessible from code within a class which only sees identifiers beginning with a capital letter.

Also, if you're a Chinese programmer, you might not be too happy about sticking, say, an 'X' on the front of your identifiers so they can be accessed from within a class.

I don't know what problems there would be with the use of accented characters or emojis in identifiers. They worked well in the tests. Are there any other tests that should be run to verify they work well?

They will work for many characters because it won't matter if they are represented by single or multiple code-points. The problems start when you have what looks to the user like the same character but can be represented by different combinations in Unicode. A table might help with some of these but not with some more recent emojis which can consist of numerous code-points, some of them zero width.

Another problem is that most Westerners find it difficult to read identifiers written in oriental languages because of the large number of ideographs used. In contrast, it is a relatively easy matter for say a Chinese person to read identifiers written in European languages because our alphabets are quite small.

@ChayimFriedman2 (Contributor) commented Apr 14, 2021

Let me ask something more basic:

What's the use?

Non-ASCII characters are hard to type on keyboards, and even if you do have the character on yours, in a world of code collaboration you can't be sure that other programmers working on your code will have it as well.

But even if you don't have this problem, what will you gain from emoji identifiers? This is nice and all, but doesn't really help to clarify the meaning of the code.

And about writing code in your language, well:

  1. Collaboration, see above.
  2. Programming languages are not even English, and they look like other natural languages even less than that. The keywords are still English, and class Dog { bark() { ... } } does not look like a natural language. In a human language I would say, "there is a thing named 'dog' that can bark, and when it barks it does ...".
    Designing a programming language in another natural language may be worthwhile, but changing only the identifiers isn't enough.
  3. Do experienced programmers need this? No, they don't: we all speak English. Do we need to change the language for people (kids?) who are not even English speakers?

Lastly, I want to apologize if someone has gotten the feeling that I attacked them - this is definitely not personal; I just want the best for Wren.

@PureFox48 (Contributor)

@ChayimFriedman2

Would you be in favor of extending the identifier character set to include ISO-8859-1 letters which is (more or less) what @clsource started with?

This increases coverage considerably as it includes nearly all the characters used in the major Western European languages. It's true that some of them will be awkward to type for folks who use English keyboards, but they won't be for people using a keyboard appropriate for their (supported) native language.

@ChayimFriedman2 (Contributor)

I am not European and I program only in English, but I think the question is simple: will you use those characters for identifiers in your work? And what about your scripts?

If not, then it's not worth it. If yes, we can discuss whether the complexities of implementing it outweigh the benefits.

@PureFox48 (Contributor) commented Apr 14, 2021

Apart from an occasional é I don't think as a native English speaker I'll use them myself.

But I'm sure that @clsource as a native Spanish speaker will use characters such as ñ and the rest of us will have no trouble reading them.

@ChayimFriedman2 (Contributor) commented Apr 14, 2021

Then the question is, why not just use n?

We may have no trouble reading them, but typing them is... not fun. And either the word is very similar to English (apart from the ñ) - then why not just use English - or it's very different but written with Latin characters - then we won't be able to read it anyway.

@PureFox48 (Contributor)

Well, sometimes you can 'read' words written in foreign languages even if you don't know what they mean :)

My problem is that, if say my native language were French or German, I'd feel frustrated that I couldn't write my identifiers in my native language. This makes me feel guilty, as an English speaker, that I don't have this problem.

@ChayimFriedman2 (Contributor)

Well, I'm not a native English speaker, and I can tell you I've never felt this frustration. So you can feel better 😄

@PureFox48 (Contributor)

Fair enough :)

@PureFox48 (Contributor)

Although I like the idea of supporting ISO-8859-15 letters in principle, it would be slightly more awkward than ISO-8859-1 because the additional letters don't, of course, appear in a contiguous Unicode block and would therefore need to be checked for individually.
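For illustration, the letters that ISO-8859-15 adds over ISO-8859-1 map to scattered Unicode code points, so a check would indeed have to name them one by one. A sketch (not existing Wren code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Sketch: the letters ISO-8859-15 adds over ISO-8859-1 (Œ œ Š š Ÿ Ž ž)
// sit at scattered Unicode code points, so each is tested individually
// rather than as a range.
static bool isLatin9ExtraLetter(uint32_t cp)
{
  switch (cp)
  {
    case 0x0152: case 0x0153: // Œ œ
    case 0x0160: case 0x0161: // Š š
    case 0x0178:              // Ÿ
    case 0x017D: case 0x017E: // Ž ž
      return true;
    default:
      return false;
  }
}
```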

@PureFox48 (Contributor)

Incidentally, even ISO-8859-15 would not be perfect as I gather that (from 2017) a capital ß (ẞ) has been officially introduced into German orthography.

@clsource (Author) commented Apr 24, 2021

If you did that, then it would be awkward to access methods beginning with a non-ASCII character from within the class itself. Static methods would always need to be prefixed by the class name and instance methods by this.

In that case, it is better to use an approach similar to Go's:

the first letter must be ASCII-compatible.

https://golang.org/ref/spec#Identifiers

https://stackoverflow.com/questions/33947871/does-go-golang-actually-permit-unicode-characters-in-variable-names

I know it may be a nuisance to write an ASCII-compatible letter first to allow Wren to detect scope properly (and avoid using tables to know which symbols are letters and which are numbers). But it's a small price to pay to allow Chinese, Japanese, Korean and other non-Latin-based scripts.

@PureFox48 (Contributor) commented Apr 24, 2021

Well, the identifier naming rules for Go are a generalization of the usual ASCII rules (letters, digits or underscores and not beginning with a digit) to the full Unicode character set.

Letters are defined as any characters in the Lu, Ll, Lt, Lm or Lo categories.

Digits are defined as any characters in the Nd category.

Now there are more than 130,000 characters classified as letters and 650 characters classified as digits. So you'd certainly need a large table to sort that out and keep it synchronized with successive Unicode versions.

The nearest thing Go has to Wren's class scope problem is their exported identifier rule which determines whether or not an identifier can be accessed from another package. To be exportable, an identifier must begin with a Unicode upper case letter (category Lu) of which there are 1,791 characters in Unicode 13.0.

This means that you can't create an exportable identifier in a language such as Chinese and the usual solution to this is to stick some upper case character such as X at the front of an otherwise Chinese identifier. The Go team know this isn't satisfactory but have been unable to come up with a better solution.

If I understand it correctly, your latest idea is to allow a Wren identifier to consist of any Unicode characters whatsoever except that the first character must belong to the Lu category for it to be regarded as upper case and must not be a digit (to avoid confusion with numbers). We'd still need a (smallish) table to sort that out but I suppose it could be made to work.
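Such a table could be sketched as a sorted list of code-point ranges searched with a binary search. The ranges below are a tiny illustrative subset of Lu (ASCII, Latin-1, Greek, Cyrillic capitals), nothing like the full 1,791-character category:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } Range;

// Tiny illustrative subset of the Lu category, sorted and disjoint.
static const Range kUpperRanges[] = {
  {0x0041, 0x005A},                   // A-Z
  {0x00C0, 0x00D6}, {0x00D8, 0x00DE}, // Latin-1 capitals (skipping ×)
  {0x0391, 0x03A9},                   // Greek capitals (U+03A2 gap ignored here)
  {0x0410, 0x042F},                   // Cyrillic capitals
};

static bool isUpperCodePoint(uint32_t cp)
{
  size_t lo = 0, hi = sizeof(kUpperRanges) / sizeof(kUpperRanges[0]);
  while (lo < hi) // binary search over the sorted ranges
  {
    size_t mid = lo + (hi - lo) / 2;
    if (cp < kUpperRanges[mid].lo) hi = mid;
    else if (cp > kUpperRanges[mid].hi) lo = mid + 1;
    else return true;
  }
  return false;
}
```

The real maintenance cost is not the search but keeping the range data in sync with each Unicode release, which is the burden discussed above.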

However, you'd still have the other problems - separable accents, Emojis - I referred to earlier and, as Unicode symbols would be allowed, we could end up with some very strange looking identifiers.

So, personally, I'm still not keen on the idea.

@clsource (Author)

If I understand it correctly, your latest idea is to allow a Wren identifier to consist of any Unicode characters whatsoever except that the first character must belong to the Lu category for it to be regarded as 'upper case'. We'd still need a (smallish) table to sort that out but I suppose it could be made to work.

More or less, but I don't know if a table would really be needed for this to work.

The algorithm I was thinking of is:

// Current C functions
static bool isAsciiName(uint8_t c)
{
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c == '_';
}

static bool isUtf8Name(int ch)
{
  return isAsciiName(ch) ||
         (ch > 0x7f && ch <= 0x10ffff);   // Assume it is a valid UTF-8 code point
}
// Pseudocode

if (identifier_detected(name)) {

 var first_character = name[0]

 if (isAsciiName(first_character)) {

   // The first character is a valid letter from [a-z, A-Z, _]
   return isUtf8Name(name)

 } else {
   throw new Error("Not a valid identifier Error")
 }
}
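The pseudocode above can be sketched in C roughly as follows (hypothetical helper names, not part of this PR): the first byte must be an ASCII letter or underscore; every later byte may be an ASCII identifier character or any byte >= 0x80, which is assumed to belong to a UTF-8 sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: ASCII characters allowed to start an identifier.
static bool isAsciiIdentStart(uint8_t c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Sketch of the proposed rule: ASCII start, then ASCII identifier
// characters or raw non-ASCII bytes (treated as UTF-8 payload).
static bool isValidIdentifier(const char *name, size_t length)
{
  if (length == 0 || !isAsciiIdentStart((uint8_t)name[0])) return false;
  for (size_t i = 1; i < length; i++)
  {
    uint8_t c = (uint8_t)name[i];
    bool ascii = isAsciiIdentStart(c) || (c >= '0' && c <= '9');
    if (!ascii && c < 0x80) return false; // reject ASCII punctuation, spaces
  }
  return true;
}
```

Note this accepts the non-ASCII bytes blindly; it does not verify that they form well-formed UTF-8 sequences, which is exactly the simplification being proposed.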

However, you'd still have the other problems - separable accents, Emojis - I referred to earlier and, as Unicode symbols would be allowed, we could end up with some very strange looking identifiers.

With the proposed algorithm this is solved, since it forces you to use a valid ASCII letter, either lowercase or uppercase, at the start of identifiers that contain UTF-8 characters.

xÑandú = "Big Bird"
x🇨🇱 = "CL - Chile"

@PureFox48 (Contributor) commented Apr 24, 2021

Incidentally, it struck me after making my earlier post that you'd still need to disallow identifiers beginning with digits to avoid confusion with numbers (I've edited the post accordingly), though you might be able to restrict that to the digits 0-9 rather than the full Nd character set.

However, I'm confused now by what you're saying here:

With the proposed algorithm this is solved since it is forcing you to use a valid ascii letter either in lowercase or uppercase identifiers that contains UTF8 characters.

You seem to be saying here that a valid identifier would need to begin with an ASCII (not Lu or Ll) letter or underscore. If so, that wouldn't be a good solution for those languages which can distinguish between upper and lower case letters.

@clsource (Author) commented Apr 24, 2021

You seem to be saying here that a valid identifier would either need to begin with an ASCII (not Lu or Lt) letter or underscore. If so, that wouldn't be a good solution for those languages which can distinguish between upper and lower case letters.

Yes, in my current proposal the first character must be a traditional Wren identifier character. I know it can seem a little odd for some languages to require at least the first character to be ASCII a-z, A-Z or _. But it's the most complete way of supporting UTF-8 without tables and complex validations that I can currently think of.

To work around that issue, maybe a little patch is to choose a character from the normal ASCII table to introduce a UTF-8 identifier, like $.

static bool isAsciiName(uint8_t c)
{
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c == '_' || c == '$'; // added $ as a valid char
}
// using $ symbol to avoid making odd words.
var $ñandú = "Ñandú"

// In Classes you could use normal C or another letter of your choosing
class CÑandú {}

// Can also separate them with underscore

var u_ñandú = "Ñandú"

class C_Ñandú {}

@PureFox48 (Contributor)

I'm not fond of the idea of allowing identifiers to begin with $.

If you don't want to use a table, then you could extend the first letter requirement from ASCII to ISO-8859-1(5) which would at least enable speakers of the major European languages to write the entire identifier in their native language and to easily determine whether it was upper case or not.

@clsource (Author)

To support those without a table, we would need to check which UTF-8 values belong to letters that have both uppercase and lowercase forms, and which belong to letters that have only one case or are emojis.

So, to support more than ASCII in the first letter, we would need to know the value ranges, so we can tell which letters are allowed by the range the value falls in.

if (value_is_in_valid_range(first_character)) {
   return isUtf8Name(identifier)
}

So, in summary, we need to determine the value ranges allowed for the first character in order to decide whether it is an uppercase or lowercase letter. If the character is an emoji, or a letter outside the allowed ranges, it would be rejected.

@PureFox48 (Contributor)

True, though that would be relatively easy for ISO-8859-1, a subset of Unicode, because the upper and lower case characters have been conveniently placed in blocks.
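For ISO-8859-1, those blocks make the case test almost free. A sketch over code points (not existing Wren code; note that × U+00D7 and ÷ U+00F7 are the two non-letter holes in the blocks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Sketch: case classification for the Latin-1 letter blocks,
// working on code points (not raw UTF-8 bytes).
static bool isLatin1Upper(uint32_t cp)
{
  return (cp >= 'A' && cp <= 'Z') ||
         (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7); // À..Þ, skipping ×
}

static bool isLatin1Lower(uint32_t cp)
{
  return (cp >= 'a' && cp <= 'z') ||
         (cp >= 0xDF && cp <= 0xFF && cp != 0xF7); // ß..ÿ, skipping ÷
}
```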

@clsource (Author) commented Apr 24, 2021

Here is the complete list with bytes
https://www.fileformat.info/info/charset/UTF-8/list.htm
https://www.fileformat.info/info/charset/UTF-8/grid.htm

So we only need to check which bytes are allowed in the first character.

@mhermier (Contributor)

It will not help here, but on my personal branch I removed the capital-letter requirement using unary left . That capitalization rule always hits you so hard that it becomes unmanageable...

@clsource (Author)

I think another option for solving this without restricting the first character is just extending the capitalization requirement to cover ISO-8859-1(5).

That way you could use any UTF-8 in identifiers with no first-letter requirement. They would all be considered lowercase except the ones in ISO-8859-1.

@clsource (Author)

Where in the code is such a rule defined? Maybe I can tinker with something :^)

@clsource (Author)

I think one of the ways is just putting the current implementation behind an experimental, optional flag when compiling Wren:

-DUTF8IDENTIFIERS or similar.

That way anyone can have the current UTF-8 support if needed, but it is not officially supported until all the major use cases are properly tested and implemented.
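Gating it behind the flag could look like this sketch (using the flag name suggested above; this is not an existing Wren build option):

```c
#include <stdbool.h>
#include <stdint.h>

// Sketch: the experimental identifier rule is compiled in only when
// Wren is built with -DUTF8IDENTIFIERS; default builds keep the
// ASCII-only rule unchanged.
static bool isName(uint32_t c)
{
#ifdef UTF8IDENTIFIERS
  if (c > 0x7f) return true; // experimental: accept any non-ASCII value
#endif
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
```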

@mhermier (Contributor) commented Apr 29, 2021

One idea: why not add callback functions like to_upper and to_lower to the configuration, so users needing UTF-8 support can plug in their UTF-8 library of choice, or whatever user code? From those we can additionally derive is_upper, is_lower, str_to_upper and str_to_lower. It would create a situation where strings must be UTF-8 binary encoded, but the code points would not necessarily need to follow UTF-8, though that would be recommended.

@ChayimFriedman2 (Contributor)

I think it's better to use a preprocessor flag for that. Functions like that add unnecessary complexity, and I suspect they may also slow down compilation (of Wren code, not of the Wren core itself) significantly.

@mhermier (Contributor)

Well, making a small utility function to test upper/lower case and call a callback if provided is not rocket science. I doubt it will bloat compilation time enough to be worrying... I would be more concerned about the impact it could have on running code.

@ChayimFriedman2 (Contributor)

Not because of that - because the call will not be inlined.

@mhermier (Contributor)

Tried the following mock implementation in compiler explorer:

#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*CB)(uint32_t c);
void write(uint32_t);

inline uint32_t wrenToUpper(CB cb, uint32_t codePoint) {
  if (cb != NULL) return cb(codePoint);
  return codePoint >= 'a' && codePoint <= 'z' ? codePoint - 'a' + 'A' : codePoint;
}

void printToUpper(CB cb, const uint32_t *str) {
    while (*str != 0) {
        write(wrenToUpper(cb, *str));
        str++;
    }
}

The implementation is trivial enough to be inlined at -O1, so there is no fear at all about inlining in the default case, and if we call an external library it will be slow anyway.
If you are really worried it is not a wanted functionality, as a bonus, the invocation can be restricted to a limited number of places. But I would not recommend that solution since it might affect the API.

@ChayimFriedman2 (Contributor)

It'll not be inlined, since that would mean inlining the entire compiler. Your example is just not realistic enough.

@cxw42 (Contributor) commented Apr 30, 2021

Three brief thoughts -

1.

#948 (comment)

and keep it synchronized with successive Unicode versions.

Just want to emphasize that actual Unicode is indeed an ongoing maintenance burden. I know that would not be the case for some of the non-table approaches discussed above.

2.

My only request would be that punctuation not be included in the set of valid identifier characters. I think it is good to keep syntax options open.

3.

Permitting any Unicode character >0x7f in an identifier lets you use whitespace in the middle of identifiers! Example (contains some profanity). This may or may not be desirable :) .

@mhermier (Contributor)

For that reason I think it is nice to delegate the responsibility to the user, and stick to 7-bit ASCII as it will virtually never change. Putting users in control lets them use whatever variant/version of UTF-8 they want, and doesn't tie Wren to a particular version.

@jube commented Apr 30, 2021

I think it is very dangerous to extend valid identifiers without a full Unicode library.

At some point, you will have to tell whether two identifiers are equal, and that is where the problems start, because some "letters" can be "written" in many ways. For example, "é" can be written as U+00E9, but it can also be written as U+0065 U+0301 (e plus combining acute accent). If an identifier is written once with the first form and once with the second, are the two identifiers equal? This is the problem of Unicode equivalence, and it is a very difficult one. It could lead to security problems.
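The two spellings of "é" can be seen directly in the bytes. A byte-wise comparison, which is all a compiler without a Unicode library can do, treats them as different identifiers:

```c
#include <stdbool.h>
#include <string.h>

// "é" in NFC: single code point U+00E9, encoded as bytes C3 A9.
static const char *kNfcEAcute = "\xc3\xa9";
// "é" in NFD: 'e' (U+0065) + combining acute (U+0301), bytes 65 CC 81.
static const char *kNfdEAcute = "e\xcc\x81";

// Without normalization, identifier equality is plain byte equality,
// so the two spellings above are two different identifiers.
static bool sameIdentifierBytes(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```

Both strings render identically in most fonts, which is exactly why this is a readability and security trap.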

Another problem with the wrenToUpper that was proposed earlier is that some characters, when transformed to uppercase, yield two characters. The usual example is the German eszett ("ß"), which gives "SS" in uppercase. Once again, it's not that simple, and it requires a good Unicode library.

@jube commented Apr 30, 2021

FYI, my native language is French and I don't care at all about Unicode identifiers; I am totally OK with ASCII and English in programs. In fact, I hate it when there is a mix of two languages in a program (even in the comments); it feels very weird. English (or rather a small subset of simple technical English) is the lingua franca of programming, and I am fine with this.

@joshgoebel commented Apr 30, 2021

I think wren-cli should be fixed to accept UTF-8 input.

I'm a bit late to this party, but for anyone who wants to test it, I implemented the CLI REPL working with code points instead of bytes:

wren-lang/wren-cli#71

@PureFox48 mentioned this pull request May 17, 2021
@clsource (Author)

I think I can close this PR. Clearly, supporting UTF-8 brings unexpected troubles that would require a huge amount of discussion, and could overcomplicate the core.

Extending the valid identifier character list is a good idea, though. But I will leave that implementation for a future PR.

This PR can be used as an example of a possible implementation.

Thanks for all who shared your thoughts 👍
