Extended valid identifier character list #948
Conversation
|
I'm not against that change at all, but before accepting it, I think the wren-cli should be fixed to accept UTF-8 input. |
|
I'm in favor of a change of this nature as it would improve language coverage considerably without going the full UTF-8 route. I'd certainly include all the upper and lower case letters in ISO 8859-1 as these come in nice big blocks, though I'm not so sure about allowing symbols as well, since we don't even allow those (except underscore) in the ASCII range. |
There are some symbols like |
|
You know, I have been thinking about this implementation and it may be easier to have a trie data structure for this that returns true or false depending on the bytes of the character. https://www.techiedelight.com/trie-implementation-insert-search-delete/ |
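A rough sketch of what such a byte-keyed trie could look like (all names and the node layout below are illustrative only, not taken from the Wren codebase): each UTF-8 byte sequence that should count as an identifier character is inserted once, and the lexer then asks the trie whether the bytes it just read are allowed.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
// One child slot per possible byte value; heavy on memory but trivial to walk.
typedef struct ByteTrie {
  struct ByteTrie* children[256];
  bool isValid; // true if the byte sequence ending at this node is an allowed identifier char
} ByteTrie;
static ByteTrie* byteTrieNew(void) { return calloc(1, sizeof(ByteTrie)); }
// Mark the UTF-8 byte sequence [bytes, bytes + length) as a valid identifier character.
static void byteTrieInsert(ByteTrie* root, const uint8_t* bytes, size_t length) {
  ByteTrie* node = root;
  for (size_t i = 0; i < length; i++) {
    if (node->children[bytes[i]] == NULL) node->children[bytes[i]] = byteTrieNew();
    node = node->children[bytes[i]];
  }
  node->isValid = true;
}
// Return true if the byte sequence was previously inserted as valid.
static bool byteTrieContains(const ByteTrie* root, const uint8_t* bytes, size_t length) {
  const ByteTrie* node = root;
  for (size_t i = 0; i < length; i++) {
    node = node->children[bytes[i]];
    if (node == NULL) return false;
  }
  return node->isValid;
}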
|
This thing seems overkill.
I think we should have an API like:
double wrenStrToDouble(const char* start, const char** end, int* err, struct config* cfg)
That way we can retrieve an error, return the end pointer after parsing even if interpreting fails, and pass various configuration, like default number base, locale... if required. That way we should be able to plug almost all conversions into that.
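A minimal sketch of that API built on top of strtod(), just to illustrate the calling convention; wrenStrToDouble and struct config are taken from the signature proposed above and do not exist in Wren today:
#include <errno.h>
#include <stdlib.h>
struct config {
  int base; // default number base, locale settings, ...; unused in this sketch
};
static double wrenStrToDouble(const char* start, const char** end, int* err, struct config* cfg) {
  (void)cfg;               // configuration would be consulted here
  errno = 0;
  char* stop;
  double value = strtod(start, &stop);
  *end = stop;             // the end pointer is returned even when parsing fails
  *err = (stop == start || errno != 0) ? 1 : 0;
  return value;
}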
|
|
The problem is that at this stage in compilation the function is called char by char. If I want to have ... So, if I'm not mistaken, all the values in UTF-8 start from ... At:
int wrenUtf8EncodeNumBytes(int value)
{
ASSERT(value >= 0, "Cannot encode a negative value.");
if (value <= 0x7f) return 1;
if (value <= 0x7ff) return 2;
if (value <= 0xffff) return 3;
if (value <= 0x10ffff) return 4;
return 0;
} |
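For what it's worth, a byte-at-a-time lexer does not need to see the whole code point up front: the lead byte alone says how long the sequence is. A small standalone sketch (not one of Wren's own helpers) of that classification:
#include <stdint.h>
// Number of bytes in the UTF-8 sequence that starts with this byte,
// or 0 if the byte is a continuation byte (10xxxxxx) or invalid.
static int utf8SequenceLength(uint8_t byte) {
  if ((byte & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
  if ((byte & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;
}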
|
Responding to too many threads at once... I don't understand what you are
trying to solve. We have all the minimal functions to decode UTF-8; you
only have to use them on the source...
|
|
Looking at the code, I think "currentChar" should become a value, and we should store a pointer to the current position to replace the current "currentChar--". This will have some impact and will need to be benchmarked properly, unfortunately... |
|
Ok, I somehow achieved full UTF-8 support. But it requires "duck" thinking.
static bool isAsciiName(uint8_t c)
{
return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_');
}
static bool isUtf8Name(uint8_t ch)
{
if (ch < 0 || ch > 0x10ffff) {
return false;
}
// Support the traditional ascii only identifiers
if (ch <= 0x7f && isAsciiName(ch)) {
return true;
}
// Assume it is a valid UTF-8 character:
// the value is above 127 (ASCII), no larger than the maximum UTF-8 code point, and encoded in at most 4 bytes
return (ch > 0x7f && ch <= 0x10ffff);
} |
|
The type is invalid (uint8_t -> int). You can simplify the logic to:
static bool isAsciiName(uint8_t c)
{
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
static bool isUtf8Name(int ch)
{
return isAsciiName(ch) ||
(ch > 0x7f && ch <= 0x10ffff); // Assume it is a valid UTF-8 character
}
Since most passed chars are (hopefully) valid and probably ASCII, let's expect the expected first. Pre-filtering will most probably only make sense when refining UTF-8 characters. |
|
When I have some spare time, I like to read back through old issues as this helps me understand how Wren got to be where it is now (I've only been involved since 2020). I came across #68 earlier today and, when asked whether not supporting UTF-8 identifiers was by design, @munificent replied as follows:
Although that was back in 2015, I think those remarks are still pertinent to this PR today. I think it would be asking for trouble to try and support the whole of UTF-8. I'm not against extending the identifier range to include the letters in ISO-8859-1 because I think that would be generally useful and I can't see any particular problems there - it is always clear whether they're lower case or upper case and any accent is part of the symbol. But I think it would be unwise to go beyond that. |
|
I think that many of those languages have a heavy load of baggage that Wren simply does not have (they were born in the 90s, many years before UTF-8 became a standard). Hence they have those maintainability problems. http://xahlee.info/comp/unicode_support_ruby_python_elisp.html
# -*- coding: utf-8 -*-
# ruby
♥ = "♥"
def λ n
n + "美"
end
p (λ ♥) == "♥美" # trueWhat I can see is maybe a way to define utf8 variable name support explicitly. Other languages use Since Wren now have attributes we can use them to define that UTF8 Variable names are allowed. Using that attribute would tell the compiler to allow utf8 identifiers and consider other security measures if needed. Since it would be optional it would not affect performance to previous non utf8 files. This would affect only the identifiers. Since all the other UTF8 components are already in standard Wren. I think the only missing part for full utf8 support in Wren is to allow utf8 identifiers :) |
|
Yes, but Wren has a problem which most other programming languages do not have - it needs to be able to distinguish between upper and lower case letters for class scope purposes. To figure that out for characters with codepoints > 255, you'd probably need to refer to a table. Also, some writing systems (Chinese, Japanese, Korean, Arabic, etc.) have no distinction between upper and lower case characters. What are you going to do about those? Another problem is that accented symbols in Unicode can either be a single character or two characters (the accent being separate), though these may be indistinguishable in some fonts. So should these be treated as single letters even if they are represented by two Unicode codepoints? The use of emojis as identifiers can present even worse problems, as the emoji may look like a single character though in reality it consists of several codepoints. It's these sorts of problems which have led me to believe that supporting codepoints beyond 255 is more trouble than it's worth for Wren. It's possible that you might be able to support other alphabets which have an upper/lower case distinction, such as Greek and Cyrillic, by the use of attributes, though I think that's something for the future as attributes are just user-defined at present. |
I think one easy solution to this is just to consider any non-ASCII character as lowercase by default. If you really need to use emojis or non-ASCII symbols for your class name, then just start with a capital ASCII letter like
I don't know what problems the usage of accented characters or emojis in identifiers would cause. They worked well in the tests. Are there any other tests that should be made to certify they work well? |
Yes, but that would mean that they would be inaccessible from code within a class which only sees identifiers beginning with a capital letter. Also, if you're a Chinese programmer, you might not be too happy about sticking, say, an 'X' on the front of your identifiers so they can be accessed from within a class.
They will work for many characters because it won't matter if they are represented by single or multiple code-points. The problems start when you have what looks to the user like the same character but can be represented by different combinations in Unicode. A table might help with some of these but not with some more recent emojis which can consist of numerous code-points, some of them zero width. Another problem is that most Westerners find it difficult to read identifiers written in oriental languages because of the large number of ideographs used. In contrast, it is a relatively easy matter for say a Chinese person to read identifiers written in European languages because our alphabets are quite small. |
|
Let me ask something more basic: what's the use? Non-ASCII characters are hard to type on keyboards, and even if you do have the character on yours, in a world of code collaboration you can't be sure that other programmers working on your code will have it as well. But even if you don't have this problem, what will you gain from emoji identifiers? This is nice and all, but it doesn't really help to clarify the meaning of the code. And about writing code in your language, well:
Lastly, I want to apologize if someone has gotten the feeling that I attacked them - this is definitely not personal; I just want the best for Wren. |
|
Would you be in favor of extending the identifier character set to include ISO-8859-1 letters which is (more or less) what @clsource started with? This increases coverage considerably as it includes nearly all the characters used in the major Western European languages. It's true that some of them will be awkward to type for folks who use English keyboards but won't be for people who are using a keyboard appropriate for their (supported) native language. |
|
I am not European and I program only in English, but I think the question is simple: will you use those characters for identifiers in your work? And what about your scripts? If not, then it's not worth it. If yes, we can discuss whether the complexities of implementing it outweigh the benefits. |
|
Apart from an occasional ... But I'm sure that @clsource, as a native Spanish speaker, will use characters such as |
|
Then the question is, why not just use ...? We may have no trouble reading them, but typing them is... not fun. And either the word is very similar to English (apart from the |
|
Well, sometimes you can 'read' words written in foreign languages even if you don't know what they mean :) My problem is that, if say my native language were French or German, I'd feel frustrated that I couldn't write my identifiers in my native language. This makes me feel guilty, as an English speaker, that I don't have this problem. |
|
Well, I'm not a native English speaker, and I can tell you I've never felt this frustration. So you can feel better 😄 |
|
Fair enough :) |
|
Although I like the idea of supporting ISO-8859-15 letters in principle, it would be slightly more awkward to do than ISO-8859-1 because the additional letters don't of course appear in a Unicode block and would therefore need to be checked for individually. |
|
Incidentally, even ISO-8859-15 would not be perfect as I gather that (from 2017) a capital |
In that case, then it is better to use an approach similar to Go's. The first letter must be ASCII-compatible: https://golang.org/ref/spec#Identifiers I know it may be a nuisance to write an ASCII-compatible letter first to allow Wren to detect scope properly (and to avoid using tables for knowing which symbols are letters and which are numbers). But it's a small price to pay to allow Chinese, Japanese, Korean and other non-Latin-based symbols. |
|
Well, the identifier naming rules for Go are a generalization of the usual ASCII rules (letters, digits or underscores and not beginning with a digit) to the full Unicode character set. Letters are defined as any characters in the Lu, Ll, Lt, Lm or Lo categories. Digits are defined as any characters in the Nd category. Now there are more than 130,000 characters classified as letters and 650 characters classified as digits. So you'd certainly need a large table to sort that out and keep it synchronized with successive Unicode versions. The nearest thing Go has to Wren's class scope problem is their exported identifier rule which determines whether or not an identifier can be accessed from another package. To be exportable, an identifier must begin with a Unicode upper case letter (category Lu) of which there are 1,791 characters in Unicode 13.0. This means that you can't create an exportable identifier in a language such as Chinese and the usual solution to this is to stick some upper case character such as X at the front of an otherwise Chinese identifier. The Go team know this isn't satisfactory but have been unable to come up with a better solution. If I understand it correctly, your latest idea is to allow a Wren identifier to consist of any Unicode characters whatsoever except that the first character must belong to the Lu category for it to be regarded as upper case and must not be a digit (to avoid confusion with numbers). We'd still need a (smallish) table to sort that out but I suppose it could be made to work. However, you'd still have the other problems - separable accents, Emojis - I referred to earlier and, as Unicode symbols would be allowed, we could end up with some very strange looking identifiers. So, personally, I'm still not keen on the idea. |
More or less. But I don't know if a table would really be needed for this to work. The algorithm I was thinking of is:
// Current C functions
static bool isAsciiName(uint8_t c)
{
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
static bool isUtf8Name(int ch)
{
return isAsciiName(ch) ||
(ch > 0x7f && ch <= 0x10ffff); // Assume it is a valid UTF-8 character
}
// Pseudocode
if (identifier_detected(name)) {
var first_character = name[0]
if (isAsciiName(first_character)) {
// The first character is a valid letter from [a-z, A-Z, _]
return isUtf8Name(name)
} else {
throw new Error("Not a valid identifier Error")
}
}
With the proposed algorithm this is solved, since it forces you to use a valid ASCII letter at the start of both lowercase and uppercase identifiers that contain UTF-8 characters.
xÑandú = "Big Bird"
x🇨🇱 = "CL - Chile" |
|
Incidentally, it struck me after making my earlier post that you'd still need to disallow identifiers beginning with digits to avoid confusion with numbers (I've edited the post accordingly) though you might be able to restrict that to the digits 0-9 rather than the full Nd character set. However, I'm confused now by what you're saying here:
You seem to be saying here that a valid identifier would need to begin with either an ASCII (not Lu or Ll) letter or an underscore. If so, that wouldn't be a good solution for those languages which do distinguish between upper and lower case letters. |
Yes, in my current proposal the first character must be a traditional Wren identifier character. I know it can be a little odd for some languages to require at least the first character to be ASCII. To solve that issue, maybe a little patch is to choose a character from the normal ASCII table that could be used to delimit a UTF-8 identifier, like $:
static bool isAsciiName(uint8_t c)
{
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_' || c == '$'; // added $ as a valid char
}
// using $ symbol to avoid making odd words.
var $ñandú = "Ñandú"
// In Classes you could use normal C or another letter of your choosing
class CÑandú {}
// Can also separate them with underscore
var u_ñandú = "Ñándú"
class C_Ñandú {} |
|
I'm not fond of the idea of allowing identifiers to begin with $. If you don't want to use a table, then you could extend the first-letter requirement from ASCII to ISO-8859-1(5), which would at least enable speakers of the major European languages to write the entire identifier in their native language and to easily determine whether it was upper case or not. |
|
To support those without a table, we would need to check which values in UTF-8 come from letters that have both uppercase and lowercase forms, and which come from letters that only have one case or are emojis. So, to support more than ASCII in the first letter, we would need to know the value ranges where we can differentiate the allowed letters by the range the value is in.
if (value_is_in_valid_range(first_character)) {
return isUtf8Name(identifier)
}
So, in summary, we need to determine the maximum value that would be allowed in the first character in order to decide whether it is an uppercase or lowercase letter. If the character is an emoji or a letter that is not in the allowed range, then it would be rejected. |
|
True, though that would be relatively easy for ISO-8859-1, a subset of Unicode, because the upper and lower case characters have been conveniently placed in blocks. |
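For illustration, a check against those Latin-1 blocks is only a couple of range tests (the two gaps are the multiplication sign U+00D7 and the division sign U+00F7, which sit inside the letter blocks but are not letters); the function names here are just a sketch:
#include <stdbool.h>
#include <stdint.h>
static bool isLatin1Upper(uint32_t cp) {
  return cp >= 0xC0 && cp <= 0xDE && cp != 0xD7;
}
static bool isLatin1Lower(uint32_t cp) {
  // 0xDF is 'ß', which has no single-character upper-case form in this range.
  return cp >= 0xDF && cp <= 0xFF && cp != 0xF7;
}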
|
Here is the complete list with bytes. So we only need to check which bytes are allowed in the first char. |
|
It will not help here, but on my personal branch I removed the capital letter requirements using unary left |
|
I think another option for solving this without restricting the first character is just extending the capitalization requirement to consider ISO-8859-1(5). That way you can use any UTF-8 in identifiers. |
|
Where in the code is such a rule defined? Maybe I can tinker with something :^) |
|
I think one of the ways is: just putting the current implementation behind an experimental optional flag when compiling Wren.
That way anyone can have the current UTF-8 support if needed, but it is not officially supported until all the major use cases are properly tested and implemented. |
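A sketch of what such an opt-in could look like; the flag name WREN_UTF8_IDENTIFIERS is hypothetical and isName here only mimics the shape of the compiler's identifier check:
#include <stdbool.h>
#include <stdint.h>
static bool isName(uint32_t c) {
#ifdef WREN_UTF8_IDENTIFIERS
  // Experimental: accept any non-ASCII code point as an identifier character.
  if (c > 0x7f && c <= 0x10ffff) return true;
#endif
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
Default builds would keep the ASCII-only behaviour and pay no extra cost; only builds compiled with the flag get the experimental acceptance rule.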
|
One idea, why not add a callback function like |
|
I think it's better to use a preprocessor flag for that. Functions like that add unnecessary complexity, and I suspect they may also slow down compilation (of Wren code, not of the Wren core itself) significantly. |
|
Well, making a small utility function to test upper/lower case and call a callback if provided is not rocket science; I doubt it will bloat the compilation time so much that it is worrying... I would be more worried about the impact it could have on running code. |
|
Not because of that, but because the call will not be inlined. |
|
Tried the following mock implementation in Compiler Explorer:
#include <stdint.h>
#include <stddef.h>
typedef uint32_t (*CB)(uint32_t c);
void write(uint32_t);
inline uint32_t wrenToUpper(CB cb, uint32_t codePoint) {
if (cb != NULL) return cb(codePoint);
return codePoint >= 'a' && codePoint <= 'z' ? codePoint - 'a' + 'A' : codePoint;
}
void printToUpper(CB cb, const uint32_t *str) {
while (*str != 0) {
write(wrenToUpper(cb, *str));
str++;
}
}
The implementation is trivial enough to be inlined at -O1, so there is no fear at all about inlining for the default case, and if we call an external library it will be slow anyway. |
|
It'll not be inlined, since that would mean inlining the entire compiler. Your example is just not realistic enough. |
|
Three brief thoughts:
1. Just want to emphasize that actual Unicode is indeed an ongoing maintenance burden. I know that would not be the case for some of the non-table approaches discussed above.
2. My only request would be that punctuation not be included in the set of valid identifier characters. I think it is good to keep syntax options open.
3. Permitting any Unicode character > 0x7f in an identifier lets you use whitespace in the middle of identifiers! Example (contains some profanity). This may or may not be desirable :) |
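To make the third point concrete, here is a tiny self-contained example of the range-only rule discussed in this thread accepting the no-break space U+00A0, i.e. a character that renders as blank inside an identifier:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
static bool isUtf8Name(uint32_t ch) {
  return ch > 0x7f && ch <= 0x10ffff; // the "anything above ASCII" rule
}
int main(void) {
  printf("%d\n", isUtf8Name(0x00A0)); // prints 1: NO-BREAK SPACE is accepted
  return 0;
}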
|
For that reason I think it is nice to delegate the responsibility to the user and stick to 7-bit ASCII, as it will virtually never change. Putting users in control lets them use whatever variant/version of UTF-8 they want, and does not tie Wren to a particular version. |
|
I think it is very dangerous to extend valid identifiers without a full Unicode library. At some point, you will have to tell if two identifiers are equal, and that is the start of the problems, because some "letters" can be "written" in many ways. For example, "é" can be written as U+00E9, but it can also be written as U+0065 U+0301 (e and a combining acute accent). If your identifier is written once with the first version and once with the second version, are both identifiers equal? This problem is linked to Unicode equivalence and is a very difficult problem. It could lead to security problems. Another problem with the |
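A tiny illustration of that equivalence problem: the same visible "é" encoded as the precomposed code point U+00E9 versus as "e" plus a combining acute accent (U+0065 U+0301) yields different UTF-8 bytes, so a byte-wise comparison treats the two spellings as different identifiers:
#include <stdio.h>
#include <string.h>
int main(void) {
  const char precomposed[] = "\xC3\xA9";  // U+00E9, 2 bytes
  const char decomposed[]  = "e\xCC\x81"; // U+0065 U+0301, 3 bytes
  int same = strlen(precomposed) == strlen(decomposed) &&
             memcmp(precomposed, decomposed, strlen(precomposed)) == 0;
  printf("%d\n", same); // prints 0: the spellings do not compare equal
  return 0;
}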
|
FYI, my native language is French and I don't care at all about Unicode identifiers; I am totally OK with ASCII and English in programs. In fact, I hate it when there is a mix of two languages in a program (even in the comments); it feels very weird. English (or rather a small subset of simple technical English) is the lingua franca in programming, and I am fine with this. |
I'm a bit late to this party, but for anyone who wants to test it, I implemented the CLI REPL working with code points instead of bytes: |
|
I think I can close this PR. Clearly, supporting UTF-8 brings unexpected troubles that would require a huge amount of discussion and could overcomplicate the core. Extending the valid identifier character list is a good idea, though, but I will leave that implementation for a future PR. This PR can be used as an example of a possible implementation. Thanks to all who shared your thoughts 👍 |
Regarding #891
This implementation is not UTF-8 complete (no emojis),
but at least it gives room for most Roman/Latin-based languages to use variable names
with non-ASCII chars + some symbols.