Implemented IE-0005

Graham Nelson 2022-10-29 12:11:58 +01:00
parent cb252deab4
commit 13894d3816
30 changed files with 35312 additions and 16080 deletions

@ -1,6 +1,6 @@
# Inform 7
[Version](notes/versioning.md): 10.2.0-beta+6V87 'Krypton' (28 October 2022)
[Version](notes/versioning.md): 10.2.0-beta+6V88 'Krypton' (29 October 2022)
## About Inform
@ -147,8 +147,6 @@ Other extensions shipped with Inform are not presented as webs, but as single fi
* [English Language by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/English Language.i7x>) - __v1__
* [Metric Units by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Metric Units.i7x>) - __v2__
* [Rideable Vehicles by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Rideable Vehicles.i7x>) - __v3__
* [Unicode Character Names by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Unicode Character Names.i7x>) - __v1__
* [Unicode Full Character Names by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Unicode Full Character Names.i7x>) - __v1__
### Website templates and interpreters shipped with Inform

@ -1,3 +1,3 @@
Prerelease: beta
Build Date: 28 October 2022
Build Number: 6V87
Build Date: 29 October 2022
Build Number: 6V88

@ -2,7 +2,7 @@
"is": {
"type": "kit",
"title": "BasicInformExtrasKit",
"version": "10.2.0-beta+6V87"
"version": "10.2.0-beta+6V88"
},
"kit-details": {
"has-priority": 1

@ -2,7 +2,7 @@
"is": {
"type": "kit",
"title": "BasicInformKit",
"version": "10.2.0-beta+6V87"
"version": "10.2.0-beta+6V88"
},
"needs": [ {
"unless": {

@ -2,7 +2,7 @@
"is": {
"type": "kit",
"title": "CommandParserKit",
"version": "10.2.0-beta+6V87"
"version": "10.2.0-beta+6V88"
},
"needs": [ {
"need": {

@ -2,7 +2,7 @@
"is": {
"type": "kit",
"title": "EnglishLanguageKit",
"version": "10.2.0-beta+6V87"
"version": "10.2.0-beta+6V88"
},
"needs": [ {
"need": {

@ -2,7 +2,7 @@
"is": {
"type": "kit",
"title": "WorldModelKit",
"version": "10.2.0-beta+6V87"
"version": "10.2.0-beta+6V88"
},
"needs": [ {
"need": {

File diff suppressed because it is too large

@ -1,5 +1,3 @@
Include Unicode Character Names by Graham Nelson.
[The problem here arises from the Greek letter lambda actually being
known to Unicode as lamda.]

@ -1,3 +0,0 @@
At sigil translates into Unicode as 64.
At sigil translates into Unicode as 64.
At sigil translates into Unicode as 67.

@ -0,0 +1 @@
At sigil translates into Unicode as 64.

@ -1 +0,0 @@
Bell sound translates into Unicode as "7".

@ -1 +1 @@
Roman Dimidia Sextula Sign translates into Unicode as 65940.
Laboratory is a room. "A sign shows: [unicode Roman Dimidia Sextula Sign]."

@ -1,20 +1,18 @@
Inform 7 build 6L26 has started.
I've now read your source text, which is 33 words long.
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
I've also read English Language by Graham Nelson, which is 2288 words long.
I've also read Unicode Character Names by Graham Nelson, which is 28382 words long.
Inform 7 v10.2.0 has started.
I've now read your source text, which is 26 words long.
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
I've also read English Language by Graham Nelson, which is 2330 words long.
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
Problem__ PM_MidTextUnicode
>--> In the sentence 'admire unicode Greek capital letter lambda' (source
text, line 11), I was expecting to read a unicode character, but instead
text, line 9), I was expecting to read a unicode character, but instead
found some text that I couldn't understand - 'unicode Greek capital letter
lambda'. Maybe you intended this to produce a Unicode character? Unicode
characters can be written either using their decimal numbers - for
instance, 'Unicode 2041' - or with their standard names - 'Unicode Latin
small ligature oe'. For efficiency reasons these names are only available
if you ask for them; to make them available, you need to 'Include Unicode
Character Names by Graham Nelson' or, if you really need more, 'Include
Unicode Full Character Names by Graham Nelson'.
small ligature oe'. For the full list of those names, see the Unicode
standard version 15.0.0.
I was trying to match this phrase:
admire (unicode greek capital letter lambda - unicode character)
But I didn't recognise 'unicode greek capital letter lambda'.
Inform 7 has finished: 68 centiseconds used.
Inform 7 has finished.

@ -1,9 +0,0 @@
Inform 7 build 6L26 has started.
I've now read your source text, which is 21 words long.
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
I've also read English Language by Graham Nelson, which is 2288 words long.
Problem__ PM_UnicodeAlready
>--> You wrote 'At sigil translates into Unicode as 67' (source text, line 3):
but this Unicode character name has already been translated, so there must
be some duplication somewhere.
Inform 7 has finished: 17 centiseconds used.

@ -0,0 +1,13 @@
Inform 7 v10.2.0 has started.
I've now read your source text, which is 7 words long.
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
I've also read English Language by Graham Nelson, which is 2330 words long.
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
Problem__ PM_UnicodeDeprecated
>--> You wrote 'At sigil translates into Unicode as 64' (source text, line 1):
but the sentence 'X translates into Unicode as Y' has been removed from the
Inform language, because it is now redundant. Inform already knows all the
names in the Unicode standard. If you're getting this problem message
because you included the extension 'Unicode Full Character Names' or
'Unicode Character Names', all you need do is to not include it.
Inform 7 has finished.

@ -1,9 +0,0 @@
Inform 7 build 6L26 has started.
I've now read your source text, which is 7 words long.
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
I've also read English Language by Graham Nelson, which is 2288 words long.
Problem__ PM_UnicodeNonLiteral
>--> You wrote 'Bell sound translates into Unicode as "7"' (source text, line 1):
but a Unicode character name must be translated into a literal decimal
number written out in digits, which this seems not to be.
Inform 7 has finished: 17 centiseconds used.

@ -1,9 +1,11 @@
Inform 7 build 6L26 has started.
I've now read your source text, which is 9 words long.
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
I've also read English Language by Graham Nelson, which is 2288 words long.
Inform 7 v10.2.0 has started.
I've now read your source text, which is 12 words long.
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
I've also read English Language by Graham Nelson, which is 2330 words long.
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
Problem__ PM_UnicodeOutOfRange
>--> You wrote 'Roman Dimidia Sextula Sign translates into Unicode as 65940' (source
text, line 1): but Inform can only handle Unicode characters in the 16-bit
range, from 0 to 65535.
Inform 7 has finished: 17 centiseconds used.
>--> You wrote '"A sign shows: [unicode Roman Dimidia Sextula Sign]."' (source
text, line 1): but this character value is beyond the range which the
current story could handle, which is from 0 to (hexadecimal) FFFF for
stories compiled to the Z-machine, and otherwise 0 to 1FFFF.
Inform 7 has finished.

@ -88,14 +88,12 @@ sense once kinds and instances exist.
}
@h Translation into Unicode.
The following handles sentences like:
The following sentence form is now deprecated:
>> leftwards harpoon with barb upwards translates into Unicode as 8636.
The subject "leftwards harpoon with barb upwards" is parsed against the
Unicode character names known already to make sure that this new translation
doesn't disagree with an existing one (that is, doesn't translate to a
different code number).
Until Inform 10.1, this equated a Unicode name to its code point value; see
IE-0005 and //values: Unicode Literals// for what now happens instead.
The sentence "X translates into Y as Z" has this sense provided Y matches:
@ -104,6 +102,7 @@ The sentence "X translates into Y as Z" has this sense provided Y matches:
unicode
@ =
int PM_UnicodeDeprecated_thrown = FALSE;
int Translations::translates_into_unicode_as_SMF(int task, parse_node *V, wording *NPs) {
wording SW = (NPs)?(NPs[0]):EMPTY_WORDING;
wording OW = (NPs)?(NPs[1]):EMPTY_WORDING;
@ -119,55 +118,22 @@ int Translations::translates_into_unicode_as_SMF(int task, parse_node *V, wordin
}
break;
case PASS_2_SMFT:
@<Create the Unicode character name@>;
if (PM_UnicodeDeprecated_thrown == FALSE) {
PM_UnicodeDeprecated_thrown = TRUE;
StandardProblems::sentence_problem(Task::syntax_tree(),
_p_(PM_UnicodeDeprecated),
"the sentence 'X translates into Unicode as Y' has been removed "
"from the Inform language",
"because it is now redundant. Inform already knows all the names "
"in the Unicode standard. If you're getting this problem message "
"because you included the extension 'Unicode Full Character Names' "
"or 'Unicode Character Names', all you need do is to not include it.");
}
break;
}
return FALSE;
}
@ And this parses the noun phrases of such sentences. Note that the numeric
value has to be given in decimal -- I was tempted to allow hexadecimal here,
but life's too short. Unicode translation sentences are really only
technicalities needed by the built-in extensions, and those are mechanically
generated anyway; Inform authors never type them.
=
<translates-into-unicode-sentence-subject> ::=
( ... ) |
...
<translates-into-unicode-sentence-object> ::=
<cardinal-number-unlimited> | ==> { UnicodeLiterals::max(R[1]), - }
... ==> @<Issue PM_UnicodeNonLiteral problem@>
@<Issue PM_UnicodeNonLiteral problem@> =
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeNonLiteral),
"a Unicode character name must be translated into a literal decimal "
"number written out in digits",
"which this seems not to be.");
return FALSE;
@ And here the name is created as a miscellaneous excerpt meaning.
@<Create the Unicode character name@> =
wording SP = Node::get_text(V->next);
wording OP = Node::get_text(V->next->next);
if (<translates-into-unicode-sentence-object>(OP) == FALSE) return FALSE;
int cc = <<r>>;
<translates-into-unicode-sentence-subject>(SP);
wording CN = GET_RW(<translates-into-unicode-sentence-subject>, 1);
if ((<unicode-character-name>(CN)) && (<<r>> != cc)) {
StandardProblems::sentence_problem(Task::syntax_tree(),
_p_(PM_UnicodeAlready),
"this Unicode character name has already been translated",
"so there must be some duplication somewhere.");
return FALSE;
}
Nouns::new_proper_noun(CN, NEUTER_GENDER, ADD_TO_LEXICON_NTOPT, MISCELLANEOUS_MC,
Diagrams::new_PROPER_NOUN(OP), Task::language_of_syntax());
@h Translation into Inter.
There are three sentences here, but the first is now deprecated: it has split
off into two different meanings, each with its own wording for clarity.

@ -16,6 +16,7 @@ which use this module:
@e TEXT_SUBSTITUTIONS_DA
@e VARIABLE_CREATIONS_DA
@e TABLES_DA
@e UNICODE_DATA_MREASON
=
COMPILE_WRITER(instance *, Instances::log)
@ -31,6 +32,7 @@ void ValuesModule::start(void) {
Log::declare_aspect(TEXT_SUBSTITUTIONS_DA, L"text substitutions", FALSE, FALSE);
Log::declare_aspect(VARIABLE_CREATIONS_DA, L"variable creations", FALSE, FALSE);
Log::declare_aspect(TABLES_DA, L"table construction", FALSE, FALSE);
Memory::reason_name(UNICODE_DATA_MREASON, "Unicode data");
REGISTER_WRITER('O', Instances::log);
REGISTER_WRITER('q', Equations::log);
REGISTER_WRITER('Z', NonlocalVariables::log);

@ -2,7 +2,8 @@
To manage the names assigned to Unicode character values.
@ The following is called only on excerpts from the source where it is a
@h Parsing.
The following is called only on excerpts from the source where it is a
fairly safe bet that a Unicode character is referred to. For example, when
the player types either of these:
@ -17,11 +18,18 @@ the player types either of these:
<unicode-character-name> ==> { -, Rvalues::from_Unicode(R[1], W) }
<unicode-character-name> internal {
parse_node *p = Lexicon::retrieve(MISCELLANEOUS_MC, W);
if ((p) && (Node::get_type(p) == PROPER_NOUN_NT)) {
int N = Vocabulary::get_literal_number_value(
Lexer::word(Wordings::first_wn(Node::get_text(p))));
==> { N, - };
TEMPORARY_TEXT(N)
WRITE_TO(N, "%W", W);
for (int i=0; i<Str::len(N); i++)
Str::put_at(N, i, Characters::toupper(Str::get_at(N, i)));
int U = UnicodeLiterals::parse(N);
DISCARD_TEXT(N)
if (U >= 0) {
if ((TargetVMs::is_16_bit(Task::vm())) && (U >= 0x10000)) {
@<Issue PM_UnicodeOutOfRange@>;
U = 65;
}
==> { UnicodeLiterals::max(U), - };
return TRUE;
}
==> { fail nonterminal };
@ -31,11 +39,296 @@ the player types either of these:
=
int UnicodeLiterals::max(int cc) {
if ((cc < 0) || (cc >= 0x10000)) {
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeOutOfRange),
"Inform can only handle Unicode characters in the 16-bit range",
"from 0 to 65535.");
if ((cc < 0) || (cc >= MAX_UNICODE_CODE_POINT)) {
@<Issue PM_UnicodeOutOfRange@>;
return 65;
}
return cc;
}
@<Issue PM_UnicodeOutOfRange@> =
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeOutOfRange),
"this character value is beyond the range which the current story "
"could handle",
"which is from 0 to (hexadecimal) FFFF for stories compiled to the "
"Z-machine, and otherwise 0 to 1FFFF.");
@h Code points.
Each distinct code point in the Unicode specification will correspond to one
of these:
@d MAX_UNICODE_CODE_POINT 0x20000
@e Cc_UNICODE_CAT from 1 /* Other, Control */
@e Cf_UNICODE_CAT /* Other, Format */
@e Cn_UNICODE_CAT /* Other, Not Assigned: no character actually has this */
@e Co_UNICODE_CAT /* Other, Private Use */
@e Cs_UNICODE_CAT /* Other, Surrogate */
@e Ll_UNICODE_CAT /* Letter, Lowercase */
@e Lm_UNICODE_CAT /* Letter, Modifier */
@e Lo_UNICODE_CAT /* Letter, Other */
@e Lt_UNICODE_CAT /* Letter, Titlecase */
@e Lu_UNICODE_CAT /* Letter, Uppercase */
@e Mc_UNICODE_CAT /* Mark, Spacing Combining */
@e Me_UNICODE_CAT /* Mark, Enclosing */
@e Mn_UNICODE_CAT /* Mark, Non-Spacing */
@e Nd_UNICODE_CAT /* Number, Decimal Digit */
@e Nl_UNICODE_CAT /* Number, Letter */
@e No_UNICODE_CAT /* Number, Other */
@e Pc_UNICODE_CAT /* Punctuation, Connector */
@e Pd_UNICODE_CAT /* Punctuation, Dash */
@e Pe_UNICODE_CAT /* Punctuation, Close */
@e Pf_UNICODE_CAT /* Punctuation, Final quote */
@e Pi_UNICODE_CAT /* Punctuation, Initial quote */
@e Po_UNICODE_CAT /* Punctuation, Other */
@e Ps_UNICODE_CAT /* Punctuation, Open */
@e Sc_UNICODE_CAT /* Symbol, Currency */
@e Sk_UNICODE_CAT /* Symbol, Modifier */
@e Sm_UNICODE_CAT /* Symbol, Math */
@e So_UNICODE_CAT /* Symbol, Other */
@e Zl_UNICODE_CAT /* Separator, Line */
@e Zp_UNICODE_CAT /* Separator, Paragraph */
@e Zs_UNICODE_CAT /* Separator, Space */
=
typedef struct unicode_point {
int code_point; /* in the range 0 to MAX_UNICODE_CODE_POINT - 1 */
struct text_stream *name; /* e.g. "RIGHT-FACING ARMENIAN ETERNITY SIGN" */
int category; /* one of the |*_UNICODE_CAT| values above */
int tolower; /* -1 if no mapping to lower case is available, or a code point */
int toupper; /* -1 if no mapping to upper case is available, or a code point */
int totitle; /* -1 if no mapping to title case is available, or a code point */
} unicode_point;
unicode_point UnicodeLiterals::new_code_point(int C) {
unicode_point up;
up.code_point = C;
up.name = NULL;
up.category = Cn_UNICODE_CAT;
up.tolower = -1;
up.toupper = -1;
up.totitle = -1;
return up;
}
@ Storage for these is managed on demand, in a flexibly-sized array:
=
unicode_point *unicode_points = NULL; /* array indexed by code point */
int unicode_points_extent = 0; /* current number of entries in that array */
int max_known_unicode_point = 0;
unicode_point *UnicodeLiterals::code_point(int U) {
if ((U < 0) || (U >= MAX_UNICODE_CODE_POINT)) internal_error("Unicode point out of range");
UnicodeLiterals::ensure_data();
if (U >= unicode_points_extent) {
int new_extent = unicode_points_extent;
if (new_extent == 0) new_extent = 1;
while (new_extent <= U) new_extent = 2*new_extent;
unicode_point *new_unicode_points = (unicode_point *)
(Memory::calloc(new_extent, sizeof(unicode_point), UNICODE_DATA_MREASON));
for (int i=0; i<unicode_points_extent; i++)
new_unicode_points[i] = unicode_points[i];
for (int i=unicode_points_extent; i<new_extent; i++)
new_unicode_points[i] = UnicodeLiterals::new_code_point(i);
if (unicode_points_extent > 0)
Memory::I7_array_free(unicode_points,
UNICODE_DATA_MREASON, unicode_points_extent, sizeof(unicode_point));
unicode_points = new_unicode_points;
unicode_points_extent = new_extent;
}
if (U > max_known_unicode_point) max_known_unicode_point = U;
return &(unicode_points[U]);
}
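The growth policy above (start at 1, double until the index fits) can be tried in isolation with a sketch like the following. It is a simplification under stated assumptions: the `point` struct is a cut-down stand-in for `unicode_point`, the names `point_at` and `current_extent` are hypothetical, and the standard `realloc` is swapped in for the allocate, copy and free sequence used with Inform's memory manager.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int code_point;
    int category;
} point; /* cut-down stand-in for unicode_point */

static point *points = NULL; /* array indexed by code point */
static int extent = 0;       /* current number of entries */

/* Returns the entry for index u, growing the array by doubling if needed. */
point *point_at(int u) {
    assert(u >= 0);
    if (u >= extent) {
        int new_extent = (extent == 0) ? 1 : extent;
        while (new_extent <= u) new_extent *= 2;
        point *grown = realloc(points, (size_t) new_extent * sizeof(point));
        if (grown == NULL) abort(); /* out of memory: give up in this sketch */
        for (int i = extent; i < new_extent; i++) {
            grown[i].code_point = i; /* fresh entries know their own index */
            grown[i].category = 0;
        }
        points = grown;
        extent = new_extent;
    }
    return &points[u];
}

int current_extent(void) { return extent; }
```

As with the original, a pointer returned before a later growth may no longer be valid afterwards, so callers should not hold on to one across calls.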
@ The standard Inform distribution includes the current Unicode specification's
main data file. Although parsing that file is relatively fast, we do it only
on demand, because it's not small (about 2 MB of text) and is often not needed.
=
dictionary *UnicodeData_lookup = NULL;
void UnicodeLiterals::ensure_data(void) {
if (UnicodeData_lookup == NULL) {
UnicodeData_lookup = Dictionaries::new(65536, FALSE);
filename *F = InstalledFiles::filename(UNICODE_DATA_IRES);
TextFiles::read(F, FALSE, "can't open UnicodeData file", TRUE,
&UnicodeLiterals::read_line, NULL, NULL);
LOG("Read Unicode data to code point 0x%06x in %f\n", max_known_unicode_point, F);
}
}
@ The format of this file is admirably stable. Lines look like so:
= (text)
0067;LATIN SMALL LETTER G;Ll;0;L;;;;;N;;;0047;;0047
1C85;CYRILLIC SMALL LETTER THREE-LEGGED TE;Ll;0;L;;;;;N;;;0422;;0422
1FAA1;SEWING NEEDLE;So;0;ON;;;;;N;;;;;
=
Each line corresponds to a code point. They're presented in the file in ascending
order of these values, but we make no use of that fact. Each line contains fields
divided by semicolons, and semicolon characters are illegal in any field.
@d CODE_VALUE_UNICODE_DATA_FIELD 0
@d NAME_UNICODE_DATA_FIELD 1
@d GENERAL_CATEGORY_UNICODE_DATA_FIELD 2
@d COMBINING_CLASSES_UNICODE_DATA_FIELD 3
@d BIDIRECTIONAL_CATEGORY_UNICODE_DATA_FIELD 4
@d DECOMPOSITION_MAPPING_UNICODE_DATA_FIELD 5
@d DECIMAL_DIGIT_VALUE_UNICODE_DATA_FIELD 6
@d DIGIT_VALUE_UNICODE_DATA_FIELD 7
@d NUMERIC_VALUE_UNICODE_DATA_FIELD 8
@d MIRRORED_UNICODE_DATA_FIELD 9
@d OLD_NAME_UNICODE_DATA_FIELD 10
@d ISO_10646_COMMENT_UNICODE_DATA_FIELD 11
@d UC_MAPPING_UNICODE_DATA_FIELD 12
@d LC_MAPPING_UNICODE_DATA_FIELD 13
@d TC_MAPPING_UNICODE_DATA_FIELD 14
=
void UnicodeLiterals::read_line(text_stream *text, text_file_position *tfp, void *vm) {
Str::trim_white_space(text);
wchar_t c = Str::get_first_char(text);
if (c == 0) return;
text_stream *name = Str::new();
TEMPORARY_TEXT(category)
int U[16], field_number = 0;
for (int f=0; f<16; f++) U[f] = 0;
@<Parse the fields@>;
if ((field_number > 1) && (U[CODE_VALUE_UNICODE_DATA_FIELD] < MAX_UNICODE_CODE_POINT)) {
int c = Cn_UNICODE_CAT;
@<Determine the category code@>;
unicode_point *up = UnicodeLiterals::code_point(U[CODE_VALUE_UNICODE_DATA_FIELD]);
@<Initialise the unicode point structure@>;
@<Add to the dictionary of character names@>;
}
DISCARD_TEXT(category)
}
@<Parse the fields@> =
for (int i=0; i<Str::len(text); i++) {
wchar_t c = Str::get_at(text, i);
if (c == ';') field_number++;
else switch (field_number) {
case CODE_VALUE_UNICODE_DATA_FIELD:
case UC_MAPPING_UNICODE_DATA_FIELD:
case LC_MAPPING_UNICODE_DATA_FIELD:
case TC_MAPPING_UNICODE_DATA_FIELD: {
int H = -1;
if ((c >= '0') && (c <= '9')) H = (int) (c - '0');
if ((c >= 'A') && (c <= 'F')) H = (int) (c - 'A' + 10);
if (H >= 0) U[field_number] = U[field_number]*16 + H;
break;
}
case NAME_UNICODE_DATA_FIELD:
PUT_TO(name, c);
break;
case GENERAL_CATEGORY_UNICODE_DATA_FIELD:
PUT_TO(category, c);
break;
}
}
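For readers who want to experiment with the file format outside Inform, here is a minimal standalone sketch of the same field-by-field scan. The names `ucd_record` and `parse_ucd_line` are hypothetical, and fixed-size buffers stand in for Inform's text streams. Unlike the code above, the sketch records an empty mapping field as -1, which is a choice made here for illustration.

```c
#include <assert.h>
#include <string.h>

#define NAME_MAX_LEN 128

typedef struct {
    int code_point;
    char name[NAME_MAX_LEN];
    char category[3];
    int upper, lower, title; /* -1 when the field is empty */
} ucd_record;

/* Folds one upper-case hexadecimal digit into *value; other characters
   are ignored. A value still at -1 is started from 0. */
static void accumulate_hex(int *value, char c) {
    int h = -1;
    if (c >= '0' && c <= '9') h = c - '0';
    if (c >= 'A' && c <= 'F') h = c - 'A' + 10;
    if (h < 0) return;
    if (*value < 0) *value = 0;
    *value = *value * 16 + h;
}

/* Scans one semicolon-divided line; returns the number of fields seen. */
int parse_ucd_line(const char *line, ucd_record *r) {
    assert(line != NULL && r != NULL);
    memset(r, 0, sizeof *r);
    r->upper = r->lower = r->title = -1;
    int field = 0;
    size_t name_len = 0, cat_len = 0;
    for (const char *p = line; *p != '\0'; p++) {
        if (*p == ';') { field++; continue; }
        switch (field) {
            case 0: accumulate_hex(&r->code_point, *p); break;
            case 1:
                if (name_len + 1 < NAME_MAX_LEN) r->name[name_len++] = *p;
                break;
            case 2:
                if (cat_len + 1 < sizeof r->category) r->category[cat_len++] = *p;
                break;
            case 12: accumulate_hex(&r->upper, *p); break;
            case 13: accumulate_hex(&r->lower, *p); break;
            case 14: accumulate_hex(&r->title, *p); break;
            default: break; /* other fields are not needed here */
        }
    }
    return field + 1;
}
```

Applied to the `LATIN SMALL LETTER G` line quoted above, this yields code point 0x67, category `Ll`, and upper and title mappings of 0x47, with no lower-case mapping.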
@<Determine the category code@> =
if (Str::eq(category, I"Cc")) c = Cc_UNICODE_CAT;
else if (Str::eq(category, I"Cf")) c = Cf_UNICODE_CAT;
else if (Str::eq(category, I"Cn")) c = Cn_UNICODE_CAT;
else if (Str::eq(category, I"Co")) c = Co_UNICODE_CAT;
else if (Str::eq(category, I"Cs")) c = Cs_UNICODE_CAT;
else if (Str::eq(category, I"Ll")) c = Ll_UNICODE_CAT;
else if (Str::eq(category, I"Lm")) c = Lm_UNICODE_CAT;
else if (Str::eq(category, I"Lo")) c = Lo_UNICODE_CAT;
else if (Str::eq(category, I"Lt")) c = Lt_UNICODE_CAT;
else if (Str::eq(category, I"Lu")) c = Lu_UNICODE_CAT;
else if (Str::eq(category, I"Mc")) c = Mc_UNICODE_CAT;
else if (Str::eq(category, I"Me")) c = Me_UNICODE_CAT;
else if (Str::eq(category, I"Mn")) c = Mn_UNICODE_CAT;
else if (Str::eq(category, I"Nd")) c = Nd_UNICODE_CAT;
else if (Str::eq(category, I"Nl")) c = Nl_UNICODE_CAT;
else if (Str::eq(category, I"No")) c = No_UNICODE_CAT;
else if (Str::eq(category, I"Pc")) c = Pc_UNICODE_CAT;
else if (Str::eq(category, I"Pd")) c = Pd_UNICODE_CAT;
else if (Str::eq(category, I"Pe")) c = Pe_UNICODE_CAT;
else if (Str::eq(category, I"Pf")) c = Pf_UNICODE_CAT;
else if (Str::eq(category, I"Pi")) c = Pi_UNICODE_CAT;
else if (Str::eq(category, I"Po")) c = Po_UNICODE_CAT;
else if (Str::eq(category, I"Ps")) c = Ps_UNICODE_CAT;
else if (Str::eq(category, I"Sc")) c = Sc_UNICODE_CAT;
else if (Str::eq(category, I"Sk")) c = Sk_UNICODE_CAT;
else if (Str::eq(category, I"Sm")) c = Sm_UNICODE_CAT;
else if (Str::eq(category, I"So")) c = So_UNICODE_CAT;
else if (Str::eq(category, I"Zl")) c = Zl_UNICODE_CAT;
else if (Str::eq(category, I"Zp")) c = Zp_UNICODE_CAT;
else if (Str::eq(category, I"Zs")) c = Zs_UNICODE_CAT;
else LOG("Unknown category '%S'\n", category);
@<Initialise the unicode point structure@> =
up->name = name;
up->category = c;
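/* an empty mapping field parses as 0: record that as -1, meaning "no mapping" */
up->tolower = (U[LC_MAPPING_UNICODE_DATA_FIELD] > 0)?U[LC_MAPPING_UNICODE_DATA_FIELD]:-1;
up->toupper = (U[UC_MAPPING_UNICODE_DATA_FIELD] > 0)?U[UC_MAPPING_UNICODE_DATA_FIELD]:-1;
up->totitle = (U[TC_MAPPING_UNICODE_DATA_FIELD] > 0)?U[TC_MAPPING_UNICODE_DATA_FIELD]:-1;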
@ Control codes in Unicode, a residue of ASCII, are given no names by the
standard. For example:
= (text)
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
=
Indeed, at present every code with category |Cc| has the pseudo-name |<control>|.
So we will mostly not allow these to be referred to by name in Inform. (In theory we
could read the ISO-10646 comment as if it were a name: here, that would be
"END OF TRANSMISSION", which isn't too bad. But "FORM FEED (FF)" and
"CHARACTER TABULATION" are less persuasive, and anyway, we don't actually want
users to insert control characters into Inform text literals.)
@<Add to the dictionary of character names@> =
text_stream *index = NULL;
if (c == Cc_UNICODE_CAT) {
if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 9) index = I"TAB";
if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 10) index = I"NEWLINE";
} else {
index = name;
}
if (index) {
Dictionaries::create(UnicodeData_lookup, index);
Dictionaries::write_value(UnicodeData_lookup, index, (void *) up);
}
@h Using the Unicode data.
The first lookup here is slow, since it requires us to parse the Unicode
specification data file. But after that everything runs quite swiftly.
=
int UnicodeLiterals::parse(text_stream *N) {
UnicodeLiterals::ensure_data();
if (Dictionaries::find(UnicodeData_lookup, N)) {
unicode_point *up = Dictionaries::read_value(UnicodeData_lookup, N);
return up->code_point;
}
return -1;
}
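The lookup path, in which the name is put into upper case and then matched exactly, can be sketched in standalone C as follows. This is an illustration under assumptions: the three-entry `sample_table` and the function name `lookup_code_point` are hypothetical, and a linear search stands in for the dictionary used above. Like `UnicodeLiterals::parse`, it returns -1 when the name is unknown.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef struct {
    const char *name; /* official name, in upper case */
    int code_point;
} named_point;

/* A tiny hypothetical sample; the real data comes from UnicodeData.txt. */
static const named_point sample_table[] = {
    { "LATIN SMALL LETTER G", 0x0067 },
    { "GREEK CAPITAL LETTER LAMDA", 0x039B },
    { "SEWING NEEDLE", 0x1FAA1 },
};

/* Returns the code point for a name in any letter case, or -1 if unknown. */
int lookup_code_point(const char *name) {
    assert(name != NULL);
    char upper[128];
    size_t n = strlen(name);
    if (n >= sizeof upper) return -1; /* too long to be in the sample */
    for (size_t i = 0; i <= n; i++)   /* i == n copies the terminator */
        upper[i] = (char) toupper((unsigned char) name[i]);
    size_t count = sizeof sample_table / sizeof sample_table[0];
    for (size_t i = 0; i < count; i++)
        if (strcmp(sample_table[i].name, upper) == 0)
            return sample_table[i].code_point;
    return -1;
}
```

This also shows why the test case about "lambda" fails to match: the official name spells it LAMDA, so `lookup_code_point("Greek capital letter lambda")` finds nothing.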
@ We won't go too far down the Unicode rabbit-hole, but here are functions which
may some day be useful:
=
int UnicodeLiterals::tolower(int C) {
unicode_point *up = UnicodeLiterals::code_point(C);
int D = up->tolower;
if (D >= 0) return D;
return C;
}
int UnicodeLiterals::toupper(int C) {
unicode_point *up = UnicodeLiterals::code_point(C);
int D = up->toupper;
if (D >= 0) return D;
return C;
}
int UnicodeLiterals::totitle(int C) {
unicode_point *up = UnicodeLiterals::code_point(C);
int D = up->totitle;
if (D >= 0) return D;
return C;
}
int UnicodeLiterals::category(int C) {
unicode_point *up = UnicodeLiterals::code_point(C);
return up->category;
}

@ -2817,11 +2817,8 @@ common misunderstanding.
"%PMaybe you intended this to produce a Unicode character? "
"Unicode characters can be written either using their decimal "
"numbers - for instance, 'Unicode 2041' - or with their standard "
"names - 'Unicode Latin small ligature oe'. For efficiency reasons "
"these names are only available if you ask for them; to make them "
"available, you need to 'Include Unicode Character Names by Graham "
"Nelson' or, if you really need more, 'Include Unicode Full "
"Character Names by Graham Nelson'.");
"names - 'Unicode Latin small ligature oe'. For the full list of "
"those names, see the Unicode standard version 15.0.0.");
Problems::issue_problem_end();
@<Issue PM_UnknownCondition problem@> =

@ -75,8 +75,6 @@ And just for fun, this time we'll make the grid prettier, too; but this will wor
say "[state of room 2A][state of room 2B][state of room 2C][state of room 2D][state of room 2E] downstairs[line break]";
say "[unicode box drawings light up and right][bottom bar][unicode box drawings light up and left][variable letter spacing][line break]"
Include Unicode Character Names by Graham Nelson.
To say top bar:
repeat with N running from 1 to 9:
if the remainder after dividing N by 2 is 0, say "[unicode box drawings light down and horizontal]";

@ -12,9 +12,7 @@ The trick here is that colored output is done in different ways by the Z-Machine
For the suit symbols, we'll want the Unicode extension included with Inform:
{**}Include Unicode Character Names by Graham Nelson.
Rule for printing the name of a card (called target) while grouping together:
{**}Rule for printing the name of a card (called target) while grouping together:
say "[rank of the target as abbreviated value][suit of the target as symbol]".
To say (current suit - a suit) as symbol:

@ -8,8 +8,6 @@ The following example puts Inform's support for exotic lettering through its pac
The story headline is "Pushing the Limits of Unicode in IF". The story description is "This is a demanding test for Unicode compliance by Z-machine interpreters."
Include Unicode Character Names by Graham Nelson.
Include Basic Screen Effects by Emily Short.
The Château Bibliothèque Français is east of the Deutsche Universität Bücherei. "From this Borgesian construction, doorways lead into anterooms in each of the four cardinal directions." South of the Bibliothèque is the Miscellany Mañana. North of the Bibliothèque is the Íslendingabók. East of the Bibliothèque is Alphabet Soup.

@ -2868,10 +2868,8 @@ Moreover, the player is not allowed to type these characters in commands during
(d) <b>Characters which might work in quoted text, or might not</b>. The Arabic and Hebrew alphabets are fairly likely to be available; miscellaneous symbols are sometimes legible to the player, sometimes not. Other alphabets are chancier still. (If a work of IF depends on these being visible, it may be necessary to instruct players to use specific interpreters, or to provide a way for the player to test that all will be well.)
[x] Unicode characters {PM_SayUnicode} {PM_MidTextUnicode}
[x] Unicode characters {PM_SayUnicode} {PM_MidTextUnicode} {PM_UnicodeOutOfRange}
^^{Unicode Character Names / Full Character Names+ext+} ^^{extensions: specific extensions: Unicode Character Names}
^^{extensions: specific extensions: Unicode Full Character Names}
^^{characters (letters): Unicode (arbitrary symbols)}
^^{characters (letters) <-- Unicode}
^^^{+tosay+"[(unicode character)]" --> unicode character}
@ -2882,30 +2880,35 @@ Unicode characters can be named (or numbered) directly in text. For example:
"[unicode 321]odz Churchyard"
produces a Polish slashed L. If the Unicode Character Names or Unicode Full Character Names extensions are included, characters can also be named as well as numbered:
produces a Polish slashed L. Characters can also be named as well as numbered:
"[unicode Latin capital letter L with stroke]odz Churchyard"
The Unicode standard assigns character numbers to essentially every marking used in text from any human language: its full range is enormous. (Note that Inform writes these numbers in decimal: many reference charts show them in hexadecimal, or base 16, which can cause confusion.) Inform can only handle codes [unicode 32] up to [unicode 65535], so it is not quite so catholic, but the range is still enormous enough that code numbers are unfamiliar to the eye. Inform therefore allows us to use the official Unicode 4.1 names for characters, instead of their decimal numbers, <i>provided</i> we have Included the necessary extension like so:
The Unicode standard assigns character numbers to essentially every marking used in text from any human language: its full range is enormous. (Note that Inform writes these numbers in decimal: many reference charts show them in hexadecimal, or base 16, which can cause confusion.)
{*}Include Unicode Character Names by Graham Nelson.
This extension provides names for some 2900 of the most commonly used characters. It means, for instance, that we can write text such as:
This means, for instance, that we can write text such as:
"Dr Zarkov unveils the new [unicode Hebrew letter alef] Nought drive."
"Omar plays 4[unicode black spade suit] with an air of triumph."
Admittedly, these can get a little verbose:
Admittedly, character names can get a little verbose:
"[unicode Greek small letter omega with psili and perispomeni and ypogegrammeni]"
Inform can "only" handle codes [unicode 32] up to [unicode 131071]. If the story settings are to compile to the Z-machine, this range stops at 65535, so many emoji characters - say, [unicode fish cake with swirl design] - can only be used if the story will compile to Glulx or another modern target. By default, stories are compiled the modern way, so this should not be a problem in practice.
There are far too many possible names to list here: formally, any character name in the Basic Multilingual Plane or the Supplementary Multilingual Plane of version 15.0.0 of the Unicode standard can be used. See:
https://en.wikipedia.org/wiki/Plane_(Unicode)
But before getting carried away, we should remember the hazards: Inform allows us to type, say, "[unicode Saturn]" (an astrological sign) but it appears only as a black square if the resulting story is played by an interpreter using a font which lacks the relevant sign. For instance, Zoom for OS X uses the Lucida Grande and Apple Symbol fonts by default, and this combination does contain the Saturn sign: but Windows Frotz tends to use the Tahoma font by default, which does not. (Another issue is that the fixed letter spacing font, such as used in the status line, may not contain all the characters that the font of the main text contains.) To write something with truly outré characters is therefore a little chancy: users would have to be told quite carefully what interpreter and font to use to play it.
The "Unicode Character Names" extension, which is pre-installed in the standard distribution of Inform, defines names for the Latin, Greek, Cyrillic, Hebrew and Braille alphabets, together with currency and miscellaneous other symbols, including some for drawing boxes and arrows. It is only optionally installed because even this is quite large: but in case it should still prove inadequate, an alternative can be used:
At one time, Inform could only use named Unicode values in a story which had first included an extension:
{*}Include Unicode Full Character Names by Graham Nelson.
Include Unicode Character Names by Graham Nelson.
Include Unicode Full Character Names by Graham Nelson.
This includes all 12,997 named characters in the 16-bit range of the Unicode 4.1 standard: it is the size of a small novel and its inclusion will slow Inform down. But if you want to experiment with Arabic, ecclesiastical Georgian, Cherokee, Tibetan, Syriac, the International Phonetic Alphabet, hexagrams or the unified Canadian aboriginal syllabics, "Unicode Full Character Names" (again built into Inform) is the extension for you.
This is no longer the case: no such inclusion need now be made, and indeed, those extensions have been removed from Inform as redundant.
[x] Displaying quotations
@ -18058,20 +18061,20 @@ A third way to define an adjective, which should be used only if speed is except
The escape "*1" is expanded to the value on which the adjective is being tested. (This is usually faster than calling a routine, but in case of side-effects, the "*1" should occur only once in the condition, just as with a C macro.) To repeat: if in doubt, use the I6 routine method above.
[x] Naming Unicode characters {PM_UnicodeAlready} {PM_UnicodeNonLiteral} {PM_UnicodeOutOfRange}
[x] Naming Unicode characters
^^{characters (letters): Unicode (arbitrary symbols): defining new names for}
^^{translates as...+assert+: Unicode characters}
^^{Unicode Character Names / Full Character Names+ext+} ^^{extensions: specific extensions: Unicode Character Names}
^^{extensions: specific extensions: Unicode Full Character Names}
Inform allows the Unicode characters to be identified either with a decimal number or by name, but it has none of the character names built-in, and for efficiency reasons it only learns them when necessary.
Users normally teach these names to Inform by including one of the extensions "Unicode Character Names" or "Unicode Full Character Names", which consist of many hundreds of sentences like so:
At one time Inform allowed names to be given to Unicode character values with
sentences like so:
anticlockwise open circle arrow translates into Unicode as 8634.
Nothing restricts this usage to those extensions.
These sentences now produce a problem message: instead, Inform recognises
exactly the names in the Unicode standard.
[x] Overriding definitions in kits {PM_BadI6Inclusion} {PM_BeforeTheLibrary} {PM_WhenDefiningUnknown} {PM_IncludeInsteadOf}

@ -166,8 +166,6 @@ Other extensions shipped with Inform are not presented as webs, but as single fi
{extension author: Graham Nelson title: English Language}
{extension author: Graham Nelson title: Metric Units}
{extension author: Graham Nelson title: Rideable Vehicles}
{extension author: Graham Nelson title: Unicode Character Names}
{extension author: Graham Nelson title: Unicode Full Character Names}
### Website templates and interpreters shipped with Inform

@ -23,6 +23,7 @@ but they're just plain old files, and are not managed by Inbuild as "copies".
@e EXTENSION_DOCUMENTATION_MODEL_IRES
@e RESOURCE_JSON_REQS_IRES
@e REGISTRY_JSON_REQS_IRES
@e UNICODE_DATA_IRES
=
filename *InstalledFiles::filename(int ires) {
@ -44,6 +45,8 @@ filename *InstalledFiles::filename(int ires) {
return Filenames::in(misc, I"resource.jsonr");
case REGISTRY_JSON_REQS_IRES:
return Filenames::in(misc, I"registry.jsonr");
case UNICODE_DATA_IRES:
return Filenames::in(misc, I"UnicodeData.txt");
case CBLORB_REPORT_MODEL_IRES:
return InstalledFiles::varied_by_platform(models, I"CblorbModel.html");