mirror of https://github.com/ganelson/inform.git
Implemented IE-0005
parent
cb252deab4
commit
13894d3816
|
@@ -1,6 +1,6 @@
 # Inform 7
 
-[Version](notes/versioning.md): 10.2.0-beta+6V87 'Krypton' (28 October 2022)
+[Version](notes/versioning.md): 10.2.0-beta+6V88 'Krypton' (29 October 2022)
 
 ## About Inform
 
|
@@ -147,8 +147,6 @@ Other extensions shipped with Inform are not presented as webs, but as single fi
 * [English Language by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/English Language.i7x>) - __v1__
 * [Metric Units by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Metric Units.i7x>) - __v2__
 * [Rideable Vehicles by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Rideable Vehicles.i7x>) - __v3__
-* [Unicode Character Names by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Unicode Character Names.i7x>) - __v1__
-* [Unicode Full Character Names by Graham Nelson](<inform7/Internal/Extensions/Graham Nelson/Unicode Full Character Names.i7x>) - __v1__
 
 ### Website templates and interpreters shipped with Inform
 
|
|
|
@@ -1,3 +1,3 @@
 Prerelease: beta
-Build Date: 28 October 2022
-Build Number: 6V87
+Build Date: 29 October 2022
+Build Number: 6V88
|
|
File diff suppressed because it is too large
File diff suppressed because it is too large
|
@@ -2,7 +2,7 @@
 "is": {
 "type": "kit",
 "title": "BasicInformExtrasKit",
-"version": "10.2.0-beta+6V87"
+"version": "10.2.0-beta+6V88"
 },
 "kit-details": {
 "has-priority": 1
|
|
|
@@ -2,7 +2,7 @@
 "is": {
 "type": "kit",
 "title": "BasicInformKit",
-"version": "10.2.0-beta+6V87"
+"version": "10.2.0-beta+6V88"
 },
 "needs": [ {
 "unless": {
|
|
|
@@ -2,7 +2,7 @@
 "is": {
 "type": "kit",
 "title": "CommandParserKit",
-"version": "10.2.0-beta+6V87"
+"version": "10.2.0-beta+6V88"
 },
 "needs": [ {
 "need": {
|
|
|
@@ -2,7 +2,7 @@
 "is": {
 "type": "kit",
 "title": "EnglishLanguageKit",
-"version": "10.2.0-beta+6V87"
+"version": "10.2.0-beta+6V88"
 },
 "needs": [ {
 "need": {
|
|
|
@@ -2,7 +2,7 @@
 "is": {
 "type": "kit",
 "title": "WorldModelKit",
-"version": "10.2.0-beta+6V87"
+"version": "10.2.0-beta+6V88"
 },
 "needs": [ {
 "need": {
|
|
File diff suppressed because it is too large
|
@@ -1,5 +1,3 @@
-Include Unicode Character Names by Graham Nelson.
-
 [The problem here arises from the Greek letter lambda actually being
 known to Unicode as lamda.]
 
|
|
|
@@ -1,3 +0,0 @@
-At sigil translates into Unicode as 64.
-At sigil translates into Unicode as 64.
-At sigil translates into Unicode as 67.
|
|
@@ -0,0 +1 @@
+At sigil translates into Unicode as 64.
|
|
@@ -1 +0,0 @@
-Bell sound translates into Unicode as "7".
|
|
@@ -1 +1 @@
-Roman Dimidia Sextula Sign translates into Unicode as 65940.
+Laboratory is a room. "A sign shows: [unicode Roman Dimidia Sextula Sign]."
|
|
|
@ -1,20 +1,18 @@
|
|||
Inform 7 build 6L26 has started.
|
||||
I've now read your source text, which is 33 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2288 words long.
|
||||
I've also read Unicode Character Names by Graham Nelson, which is 28382 words long.
|
||||
Inform 7 v10.2.0 has started.
|
||||
I've now read your source text, which is 26 words long.
|
||||
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2330 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
|
||||
Problem__ PM_MidTextUnicode
|
||||
>--> In the sentence 'admire unicode Greek capital letter lambda' (source
|
||||
text, line 11), I was expecting to read a unicode character, but instead
|
||||
text, line 9), I was expecting to read a unicode character, but instead
|
||||
found some text that I couldn't understand - 'unicode Greek capital letter
|
||||
lambda'. Maybe you intended this to produce a Unicode character? Unicode
|
||||
characters can be written either using their decimal numbers - for
|
||||
instance, 'Unicode 2041' - or with their standard names - 'Unicode Latin
|
||||
small ligature oe'. For efficiency reasons these names are only available
|
||||
if you ask for them; to make them available, you need to 'Include Unicode
|
||||
Character Names by Graham Nelson' or, if you really need more, 'Include
|
||||
Unicode Full Character Names by Graham Nelson'.
|
||||
small ligature oe'. For the full list of those names, see the Unicode
|
||||
standard version 15.0.0.
|
||||
I was trying to match this phrase:
|
||||
admire (unicode greek capital letter lambda - unicode character)
|
||||
But I didn't recognise 'unicode greek capital letter lambda'.
|
||||
Inform 7 has finished: 68 centiseconds used.
|
||||
Inform 7 has finished.
|
||||
|
|
|
@ -1,9 +0,0 @@
|
|||
Inform 7 build 6L26 has started.
|
||||
I've now read your source text, which is 21 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2288 words long.
|
||||
Problem__ PM_UnicodeAlready
|
||||
>--> You wrote 'At sigil translates into Unicode as 67' (source text, line 3):
|
||||
but this Unicode character name has already been translated, so there must
|
||||
be some duplication somewhere.
|
||||
Inform 7 has finished: 17 centiseconds used.
|
|
@ -0,0 +1,13 @@
|
|||
Inform 7 v10.2.0 has started.
|
||||
I've now read your source text, which is 7 words long.
|
||||
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2330 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
|
||||
Problem__ PM_UnicodeDeprecated
|
||||
>--> You wrote 'At sigil translates into Unicode as 64' (source text, line 1):
|
||||
but the sentence 'X translates into Unicode as Y' has been removed from the
|
||||
Inform language, because it is now redundant. Inform already knows all the
|
||||
names in the Unicode standard. If you're getting this problem message
|
||||
because you included the extension 'Unicode Full Character Names' or
|
||||
'Unicode Character Names', all you need do is to not include it.
|
||||
Inform 7 has finished.
|
|
@ -1,9 +0,0 @@
|
|||
Inform 7 build 6L26 has started.
|
||||
I've now read your source text, which is 7 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2288 words long.
|
||||
Problem__ PM_UnicodeNonLiteral
|
||||
>--> You wrote 'Bell sound translates into Unicode as "7"' (source text, line 1):
|
||||
but a Unicode character name must be translated into a literal decimal
|
||||
number written out in digits, which this seems not to be.
|
||||
Inform 7 has finished: 17 centiseconds used.
|
|
@ -1,9 +1,11 @@
|
|||
Inform 7 build 6L26 has started.
|
||||
I've now read your source text, which is 9 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 42597 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2288 words long.
|
||||
Inform 7 v10.2.0 has started.
|
||||
I've now read your source text, which is 12 words long.
|
||||
I've also read Basic Inform by Graham Nelson, which is 7772 words long.
|
||||
I've also read English Language by Graham Nelson, which is 2330 words long.
|
||||
I've also read Standard Rules by Graham Nelson, which is 34310 words long.
|
||||
Problem__ PM_UnicodeOutOfRange
|
||||
>--> You wrote 'Roman Dimidia Sextula Sign translates into Unicode as 65940' (source
|
||||
text, line 1): but Inform can only handle Unicode characters in the 16-bit
|
||||
range, from 0 to 65535.
|
||||
Inform 7 has finished: 17 centiseconds used.
|
||||
>--> You wrote '"A sign shows: [unicode Roman Dimidia Sextula Sign]."' (source
|
||||
text, line 1): but this character value is beyond the range which the
|
||||
current story could handle, which is from 0 to (hexadecimal) FFFF for
|
||||
stories compiled to the Z-machine, and otherwise 0 to 1FFFF.
|
||||
Inform 7 has finished.
|
||||
|
|
|
@ -88,14 +88,12 @@ sense once kinds and instances exist.
|
|||
}
|
||||
|
||||
@h Translation into Unicode.
|
||||
The following handles sentences like:
|
||||
The following sentence form is now deprecated:
|
||||
|
||||
>> leftwards harpoon with barb upwards translates into Unicode as 8636.
|
||||
|
||||
The subject "leftwards harpoon with barb upwards" is parsed against the
|
||||
Unicode character names known already to make sure that this new translation
|
||||
doesn't disagree with an existing one (that is, doesn't translate to a
|
||||
different code number).
|
||||
Until Inform 10.1, this equated a Unicode name to its code point value; see
|
||||
IE-0005 and //values: Unicode Literals// for what now happens instead.
|
||||
|
||||
The sentence "X translates into Y as Z" has this sense provided Y matches:
|
||||
|
||||
|
@ -104,6 +102,7 @@ The sentence "X translates into Y as Z" has this sense provided Y matches:
|
|||
unicode
|
||||
|
||||
@ =
|
||||
int PM_UnicodeDeprecated_thrown = FALSE;
|
||||
int Translations::translates_into_unicode_as_SMF(int task, parse_node *V, wording *NPs) {
|
||||
wording SW = (NPs)?(NPs[0]):EMPTY_WORDING;
|
||||
wording OW = (NPs)?(NPs[1]):EMPTY_WORDING;
|
||||
|
@ -119,55 +118,22 @@ int Translations::translates_into_unicode_as_SMF(int task, parse_node *V, wordin
|
|||
}
|
||||
break;
|
||||
case PASS_2_SMFT:
|
||||
@<Create the Unicode character name@>;
|
||||
if (PM_UnicodeDeprecated_thrown == FALSE) {
|
||||
PM_UnicodeDeprecated_thrown = TRUE;
|
||||
StandardProblems::sentence_problem(Task::syntax_tree(),
|
||||
_p_(PM_UnicodeDeprecated),
|
||||
"the sentence 'X translates into Unicode as Y' has been removed "
|
||||
"from the Inform language",
|
||||
"because it is now redundant. Inform already knows all the names "
|
||||
"in the Unicode standard. If you're getting this problem message "
|
||||
"because you included the extension 'Unicode Full Character Names' "
|
||||
"or 'Unicode Character Names', all you need do is to not include it.");
|
||||
}
|
||||
break;
|
||||
}
|
||||
return FALSE;
|
||||
}
|
||||
|
||||
@ And this parses the noun phrases of such sentences. Note that the numeric
|
||||
values has to be given in decimal -- I was tempted to allow hexadecimal here,
|
||||
but life's too short. Unicode translation sentences are really only
|
||||
technicalities needed by the built-in extensions, and those are mechanically
|
||||
generated anyway; Inform authors never type them.
|
||||
|
||||
=
|
||||
<translates-into-unicode-sentence-subject> ::=
|
||||
( ... ) |
|
||||
...
|
||||
|
||||
<translates-into-unicode-sentence-object> ::=
|
||||
<cardinal-number-unlimited> | ==> { UnicodeLiterals::max(R[1]), - }
|
||||
... ==> @<Issue PM_UnicodeNonLiteral problem@>
|
||||
|
||||
@<Issue PM_UnicodeNonLiteral problem@> =
|
||||
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeNonLiteral),
|
||||
"a Unicode character name must be translated into a literal decimal "
|
||||
"number written out in digits",
|
||||
"which this seems not to be.");
|
||||
return FALSE;
|
||||
|
||||
@ And here the name is created as a miscellaneous excerpt meaning.
|
||||
|
||||
@<Create the Unicode character name@> =
|
||||
wording SP = Node::get_text(V->next);
|
||||
wording OP = Node::get_text(V->next->next);
|
||||
if (<translates-into-unicode-sentence-object>(OP) == FALSE) return FALSE;
|
||||
int cc = <<r>>;
|
||||
|
||||
<translates-into-unicode-sentence-subject>(SP);
|
||||
wording CN = GET_RW(<translates-into-unicode-sentence-subject>, 1);
|
||||
if ((<unicode-character-name>(CN)) && (<<r>> != cc)) {
|
||||
StandardProblems::sentence_problem(Task::syntax_tree(),
|
||||
_p_(PM_UnicodeAlready),
|
||||
"this Unicode character name has already been translated",
|
||||
"so there must be some duplication somewhere.");
|
||||
return FALSE;
|
||||
}
|
||||
|
||||
Nouns::new_proper_noun(CN, NEUTER_GENDER, ADD_TO_LEXICON_NTOPT, MISCELLANEOUS_MC,
|
||||
Diagrams::new_PROPER_NOUN(OP), Task::language_of_syntax());
|
||||
|
||||
@h Translation into Inter.
|
||||
There are three sentences here, but the first is now deprecated: it has split
|
||||
off into two different meanings, each with its own wording for clarity.
|
||||
|
|
|
@ -16,6 +16,7 @@ which use this module:
|
|||
@e TEXT_SUBSTITUTIONS_DA
|
||||
@e VARIABLE_CREATIONS_DA
|
||||
@e TABLES_DA
|
||||
@e UNICODE_DATA_MREASON
|
||||
|
||||
=
|
||||
COMPILE_WRITER(instance *, Instances::log)
|
||||
|
@ -31,6 +32,7 @@ void ValuesModule::start(void) {
|
|||
Log::declare_aspect(TEXT_SUBSTITUTIONS_DA, L"text substitutions", FALSE, FALSE);
|
||||
Log::declare_aspect(VARIABLE_CREATIONS_DA, L"variable creations", FALSE, FALSE);
|
||||
Log::declare_aspect(TABLES_DA, L"table construction", FALSE, FALSE);
|
||||
Memory::reason_name(UNICODE_DATA_MREASON, "Unicode data");
|
||||
REGISTER_WRITER('O', Instances::log);
|
||||
REGISTER_WRITER('q', Equations::log);
|
||||
REGISTER_WRITER('Z', NonlocalVariables::log);
|
||||
|
|
|
@ -2,7 +2,8 @@
|
|||
|
||||
To manage the names assigned to Unicode character values.
|
||||
|
||||
@ The following is called only on excerpts from the source where it is a
|
||||
@h Parsing.
|
||||
The following is called only on excerpts from the source where it is a
|
||||
fairly safe bet that a Unicode character is referred to. For example, when
|
||||
the player types either of these:
|
||||
|
||||
|
@ -17,11 +18,18 @@ the player types either of these:
|
|||
<unicode-character-name> ==> { -, Rvalues::from_Unicode(R[1], W) }
|
||||
|
||||
<unicode-character-name> internal {
|
||||
parse_node *p = Lexicon::retrieve(MISCELLANEOUS_MC, W);
|
||||
if ((p) && (Node::get_type(p) == PROPER_NOUN_NT)) {
|
||||
int N = Vocabulary::get_literal_number_value(
|
||||
Lexer::word(Wordings::first_wn(Node::get_text(p))));
|
||||
==> { N, - };
|
||||
TEMPORARY_TEXT(N)
|
||||
WRITE_TO(N, "%W", W);
|
||||
for (int i=0; i<Str::len(N); i++)
|
||||
Str::put_at(N, i, Characters::toupper(Str::get_at(N, i)));
|
||||
int U = UnicodeLiterals::parse(N);
|
||||
DISCARD_TEXT(N)
|
||||
if (U >= 0) {
|
||||
if ((TargetVMs::is_16_bit(Task::vm())) && (U >= 0x10000)) {
|
||||
@<Issue PM_UnicodeOutOfRange@>;
|
||||
U = 65;
|
||||
}
|
||||
==> { UnicodeLiterals::max(U), - };
|
||||
return TRUE;
|
||||
}
|
||||
==> { fail nonterminal };
|
||||
|
@ -31,11 +39,296 @@ the player types either of these:
|
|||
|
||||
=
|
||||
int UnicodeLiterals::max(int cc) {
|
||||
if ((cc < 0) || (cc >= 0x10000)) {
|
||||
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeOutOfRange),
|
||||
"Inform can only handle Unicode characters in the 16-bit range",
|
||||
"from 0 to 65535.");
|
||||
if ((cc < 0) || (cc >= MAX_UNICODE_CODE_POINT)) {
|
||||
@<Issue PM_UnicodeOutOfRange@>;
|
||||
return 65;
|
||||
}
|
||||
return cc;
|
||||
}
|
||||
|
||||
@<Issue PM_UnicodeOutOfRange@> =
|
||||
StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeOutOfRange),
|
||||
"this character value is beyond the range which the current story "
|
||||
"could handle",
|
||||
"which is from 0 to (hexadecimal) FFFF for stories compiled to the "
|
||||
"Z-machine, and otherwise 0 to 1FFFF.");
|
||||
|
||||
@h Code points.
|
||||
Each distinct code point in the Unicode specification will correspond to one
|
||||
of these:
|
||||
|
||||
@d MAX_UNICODE_CODE_POINT 0x20000
|
||||
|
||||
@e Cc_UNICODE_CAT from 1 /* Other, Control */
|
||||
@e Cf_UNICODE_CAT /* Other, Format */
|
||||
@e Cn_UNICODE_CAT /* Other, Not Assigned: no character actually has this */
|
||||
@e Co_UNICODE_CAT /* Other, Private Use */
|
||||
@e Cs_UNICODE_CAT /* Other, Surrogate */
|
||||
@e Ll_UNICODE_CAT /* Letter, Lowercase */
|
||||
@e Lm_UNICODE_CAT /* Letter, Modifier */
|
||||
@e Lo_UNICODE_CAT /* Letter, Other */
|
||||
@e Lt_UNICODE_CAT /* Letter, Titlecase */
|
||||
@e Lu_UNICODE_CAT /* Letter, Uppercase */
|
||||
@e Mc_UNICODE_CAT /* Mark, Spacing Combining */
|
||||
@e Me_UNICODE_CAT /* Mark, Enclosing */
|
||||
@e Mn_UNICODE_CAT /* Mark, Non-Spacing */
|
||||
@e Nd_UNICODE_CAT /* Number, Decimal Digit */
|
||||
@e Nl_UNICODE_CAT /* Number, Letter */
|
||||
@e No_UNICODE_CAT /* Number, Other */
|
||||
@e Pc_UNICODE_CAT /* Punctuation, Connector */
|
||||
@e Pd_UNICODE_CAT /* Punctuation, Dash */
|
||||
@e Pe_UNICODE_CAT /* Punctuation, Close */
|
||||
@e Pf_UNICODE_CAT /* Punctuation, Final quote */
|
||||
@e Pi_UNICODE_CAT /* Punctuation, Initial quote */
|
||||
@e Po_UNICODE_CAT /* Punctuation, Other */
|
||||
@e Ps_UNICODE_CAT /* Punctuation, Open */
|
||||
@e Sc_UNICODE_CAT /* Symbol, Currency */
|
||||
@e Sk_UNICODE_CAT /* Symbol, Modifier */
|
||||
@e Sm_UNICODE_CAT /* Symbol, Math */
|
||||
@e So_UNICODE_CAT /* Symbol, Other */
|
||||
@e Zl_UNICODE_CAT /* Separator, Line */
|
||||
@e Zp_UNICODE_CAT /* Separator, Paragraph */
|
||||
@e Zs_UNICODE_CAT /* Separator, Space */
|
||||
|
||||
=
|
||||
typedef struct unicode_point {
|
||||
int code_point; /* in the range 0 to MAX_UNICODE_CODE_POINT - 1 */
|
||||
struct text_stream *name; /* e.g. "RIGHT-FACING ARMENIAN ETERNITY SIGN" */
|
||||
int category; /* one of the |*_UNICODE_CAT| values above */
|
||||
int tolower; /* -1 if no mapping to lower case is available, or a code point */
|
||||
int toupper; /* -1 if no mapping to upper case is available, or a code point */
|
||||
int totitle; /* -1 if no mapping to title case is available, or a code point */
|
||||
} unicode_point;
|
||||
|
||||
unicode_point UnicodeLiterals::new_code_point(int C) {
|
||||
unicode_point up;
|
||||
up.code_point = C;
|
||||
up.name = NULL;
|
||||
up.category = Cn_UNICODE_CAT;
|
||||
up.tolower = -1;
|
||||
up.toupper = -1;
|
||||
up.totitle = -1;
|
||||
return up;
|
||||
}
|
||||
|
||||
@ Storage for these is managed on demand, in a flexibly-sized array:
|
||||
|
||||
=
|
||||
unicode_point *unicode_points = NULL; /* array indexed by code point */
|
||||
int unicode_points_extent = 0; /* current number of entries in that array */
|
||||
int max_known_unicode_point = 0;
|
||||
|
||||
unicode_point *UnicodeLiterals::code_point(int U) {
|
||||
if ((U < 0) || (U >= MAX_UNICODE_CODE_POINT)) internal_error("Unicode point out of range");
|
||||
UnicodeLiterals::ensure_data();
|
||||
if (U >= unicode_points_extent) {
|
||||
int new_extent = unicode_points_extent;
|
||||
if (new_extent == 0) new_extent = 1;
|
||||
while (new_extent <= U) new_extent = 2*new_extent;
|
||||
unicode_point *new_unicode_points = (unicode_point *)
|
||||
(Memory::calloc(new_extent, sizeof(unicode_point), UNICODE_DATA_MREASON));
|
||||
for (int i=0; i<unicode_points_extent; i++)
|
||||
new_unicode_points[i] = unicode_points[i];
|
||||
for (int i=unicode_points_extent; i<new_extent; i++)
|
||||
new_unicode_points[i] = UnicodeLiterals::new_code_point(i);
|
||||
if (unicode_points_extent > 0)
|
||||
Memory::I7_array_free(unicode_points,
|
||||
UNICODE_DATA_MREASON, unicode_points_extent, sizeof(unicode_point));
|
||||
unicode_points = new_unicode_points;
|
||||
unicode_points_extent = new_extent;
|
||||
}
|
||||
if (U > max_known_unicode_point) max_known_unicode_point = U;
|
||||
return &(unicode_points[U]);
|
||||
}
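
The doubling strategy used by the function above can be sketched in standard C. This is an illustration only: `point_record`, `point_table` and `point_table_at` are hypothetical names, and `realloc` is swapped in for the allocate, copy and free sequence that the hunk performs through the `Memory::` functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for unicode_point and its backing array. */
typedef struct point_record {
    int code_point;
    int category;   /* 0 stands for "not assigned" in this sketch */
} point_record;

typedef struct point_table {
    point_record *entries;
    size_t extent;  /* current number of entries */
} point_table;

/* Returns the entry for index u, doubling the array until u fits.
   Returns NULL if memory cannot be allocated. */
point_record *point_table_at(point_table *t, size_t u) {
    assert(t != NULL);
    if (u >= t->extent) {
        size_t new_extent = (t->extent == 0) ? 1 : t->extent;
        while (new_extent <= u) new_extent *= 2;
        point_record *grown = realloc(t->entries, new_extent * sizeof(point_record));
        if (grown == NULL) return NULL;
        /* Only the newly added tail needs initialising. */
        for (size_t i = t->extent; i < new_extent; i++) {
            grown[i].code_point = (int) i;
            grown[i].category = 0;
        }
        t->entries = grown;
        t->extent = new_extent;
    }
    return &t->entries[u];
}
```

Doubling keeps the number of reallocations logarithmic in the highest index requested, which matters when a data file is read in ascending order.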
|
||||
|
||||
@ The standard Inform distribution includes the current Unicode specification's
|
||||
main data file. Although parsing that file is relatively fast, we do it only
|
||||
on demand, because it's not small (about 2 MB of text) and is often not needed.
|
||||
|
||||
=
|
||||
dictionary *UnicodeData_lookup = NULL;
|
||||
void UnicodeLiterals::ensure_data(void) {
|
||||
if (UnicodeData_lookup == NULL) {
|
||||
UnicodeData_lookup = Dictionaries::new(65536, FALSE);
|
||||
filename *F = InstalledFiles::filename(UNICODE_DATA_IRES);
|
||||
TextFiles::read(F, FALSE, "can't open UnicodeData file", TRUE,
|
||||
&UnicodeLiterals::read_line, NULL, NULL);
|
||||
LOG("Read Unicode data to code point 0x%06x in %f\n", max_known_unicode_point, F);
|
||||
}
|
||||
}
|
||||
|
||||
@ The format of this file is admirably stable. Lines look like so:
|
||||
= (text)
|
||||
0067;LATIN SMALL LETTER G;Ll;0;L;;;;;N;;;0047;;0047
|
||||
1C85;CYRILLIC SMALL LETTER THREE-LEGGED TE;Ll;0;L;;;;;N;;;0422;;0422
|
||||
1FAA1;SEWING NEEDLE;So;0;ON;;;;;N;;;;;
|
||||
=
|
||||
Each line corresponds to a code point. They're presented in the file in ascending
|
||||
order of these values, but we make no use of that fact. Each line contains fields
|
||||
divided by semicolons, and semicolon characters are illegal in any field.
|
||||
|
||||
@d CODE_VALUE_UNICODE_DATA_FIELD 0
|
||||
@d NAME_UNICODE_DATA_FIELD 1
|
||||
@d GENERAL_CATEGORY_UNICODE_DATA_FIELD 2
|
||||
@d COMBINING_CLASSES_UNICODE_DATA_FIELD 3
|
||||
@d BIDIRECTIONAL_CATEGORY_UNICODE_DATA_FIELD 4
|
||||
@d DECOMPOSITION_MAPPING_UNICODE_DATA_FIELD 5
|
||||
@d DECIMAL_DIGIT_VALUE_UNICODE_DATA_FIELD 6
|
||||
@d DIGIT_VALUE_UNICODE_DATA_FIELD 7
|
||||
@d NUMERIC_VALUE_UNICODE_DATA_FIELD 8
|
||||
@d MIRRORED_UNICODE_DATA_FIELD 9
|
||||
@d OLD_NAME_UNICODE_DATA_FIELD 10
|
||||
@d ISO_10646_COMMENT_UNICODE_DATA_FIELD 11
|
||||
@d UC_MAPPING_UNICODE_DATA_FIELD 12
|
||||
@d LC_MAPPING_UNICODE_DATA_FIELD 13
|
||||
@d TC_MAPPING_UNICODE_DATA_FIELD 14
|
||||
|
||||
=
|
||||
void UnicodeLiterals::read_line(text_stream *text, text_file_position *tfp, void *vm) {
|
||||
Str::trim_white_space(text);
|
||||
wchar_t c = Str::get_first_char(text);
|
||||
if (c == 0) return;
|
||||
text_stream *name = Str::new();
|
||||
TEMPORARY_TEXT(category)
|
||||
int U[16], field_number = 0;
|
||||
for (int f=0; f<16; f++) U[f] = 0;
|
||||
@<Parse the fields@>;
|
||||
if ((field_number > 1) && (U[CODE_VALUE_UNICODE_DATA_FIELD] < MAX_UNICODE_CODE_POINT)) {
|
||||
int c = Cn_UNICODE_CAT;
|
||||
@<Determine the category code@>;
|
||||
unicode_point *up = UnicodeLiterals::code_point(U[CODE_VALUE_UNICODE_DATA_FIELD]);
|
||||
@<Initialise the unicode point structure@>;
|
||||
@<Add to the dictionary of character names@>;
|
||||
}
|
||||
DISCARD_TEXT(category)
|
||||
}
|
||||
|
||||
@<Parse the fields@> =
|
||||
for (int i=0; i<Str::len(text); i++) {
|
||||
wchar_t c = Str::get_at(text, i);
|
||||
if (c == ';') field_number++;
|
||||
else switch (field_number) {
|
||||
case CODE_VALUE_UNICODE_DATA_FIELD:
|
||||
case UC_MAPPING_UNICODE_DATA_FIELD:
|
||||
case LC_MAPPING_UNICODE_DATA_FIELD:
|
||||
case TC_MAPPING_UNICODE_DATA_FIELD: {
|
||||
int H = -1;
|
||||
if ((c >= '0') && (c <= '9')) H = (int) (c - '0');
|
||||
if ((c >= 'A') && (c <= 'F')) H = (int) (c - 'A' + 10);
|
||||
if (H >= 0) U[field_number] = U[field_number]*16 + H;
|
||||
break;
|
||||
}
|
||||
case NAME_UNICODE_DATA_FIELD:
|
||||
PUT_TO(name, c);
|
||||
break;
|
||||
case GENERAL_CATEGORY_UNICODE_DATA_FIELD:
|
||||
PUT_TO(category, c);
|
||||
break;
|
||||
}
|
||||
}
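
For readers without the inweb toolchain, the field-splitting logic can be sketched as self-contained C. The names `ucd_record`, `parse_hex_field`, `copy_field` and `parse_ucd_line` are hypothetical; the field numbers follow the definitions above. Unlike the accumulation loop above, which starts every numeric field at zero, this sketch records -1 for an empty mapping field; that is a design choice of the sketch, not something the commit does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 128

typedef struct ucd_record {
    int code_point;
    char name[NAME_MAX_LEN];
    char category[3];
    int upper, lower, title;  /* -1 where the field is empty */
} ucd_record;

/* Parses len upper-case hex digits; returns -1 if empty or malformed. */
static int parse_hex_field(const char *s, size_t len) {
    if (len == 0) return -1;
    int v = 0;
    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        int h;
        if (c >= '0' && c <= '9') h = c - '0';
        else if (c >= 'A' && c <= 'F') h = c - 'A' + 10;
        else return -1;
        v = v * 16 + h;
    }
    return v;
}

/* Copies a field into dst, truncating to fit and always terminating. */
static void copy_field(char *dst, size_t cap, const char *s, size_t len) {
    if (len >= cap) len = cap - 1;
    memcpy(dst, s, len);
    dst[len] = '\0';
}

/* Returns 1 if the line has exactly 15 fields and a valid code value. */
int parse_ucd_line(const char *line, ucd_record *out) {
    assert(line != NULL && out != NULL);
    const char *start = line;
    int field = 0;
    out->code_point = out->upper = out->lower = out->title = -1;
    out->name[0] = '\0';
    out->category[0] = '\0';
    for (const char *p = line; ; p++) {
        if (*p == ';' || *p == '\0') {
            size_t len = (size_t) (p - start);
            switch (field) {
                case 0:  out->code_point = parse_hex_field(start, len); break;
                case 1:  copy_field(out->name, sizeof out->name, start, len); break;
                case 2:  copy_field(out->category, sizeof out->category, start, len); break;
                case 12: out->upper = parse_hex_field(start, len); break;
                case 13: out->lower = parse_hex_field(start, len); break;
                case 14: out->title = parse_hex_field(start, len); break;
                default: break;  /* other fields are ignored */
            }
            field++;
            if (*p == '\0') break;
            start = p + 1;
        }
    }
    return (field == 15) && (out->code_point >= 0);
}
```

Applied to the sample line for LATIN SMALL LETTER G quoted earlier, this yields code point 0x67, category `Ll`, upper-case and title-case mappings of 0x47, and no lower-case mapping.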
|
||||
|
||||
@<Determine the category code@> =
|
||||
if (Str::eq(category, I"Cc")) c = Cc_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Cf")) c = Cf_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Cn")) c = Cn_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Co")) c = Co_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Cs")) c = Cs_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Ll")) c = Ll_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Lm")) c = Lm_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Lo")) c = Lo_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Lt")) c = Lt_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Lu")) c = Lu_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Mc")) c = Mc_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Me")) c = Me_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Mn")) c = Mn_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Nd")) c = Nd_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Nl")) c = Nl_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"No")) c = No_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Pc")) c = Pc_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Pd")) c = Pd_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Pe")) c = Pe_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Pf")) c = Pf_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Pi")) c = Pi_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Po")) c = Po_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Ps")) c = Ps_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Sc")) c = Sc_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Sk")) c = Sk_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Sm")) c = Sm_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"So")) c = So_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Zl")) c = Zl_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Zp")) c = Zp_UNICODE_CAT;
|
||||
else if (Str::eq(category, I"Zs")) c = Zs_UNICODE_CAT;
|
||||
else LOG("Unknown category '%S'\n", category);
|
||||
|
||||
@<Initialise the unicode point structure@> =
|
||||
up->name = name;
|
||||
up->category = c;
|
||||
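up->tolower = (U[LC_MAPPING_UNICODE_DATA_FIELD] > 0)?(U[LC_MAPPING_UNICODE_DATA_FIELD]):(-1); /* an empty field parses as 0: store -1 for "no mapping" */
|
||||
up->toupper = (U[UC_MAPPING_UNICODE_DATA_FIELD] > 0)?(U[UC_MAPPING_UNICODE_DATA_FIELD]):(-1);
|
||||
up->totitle = (U[TC_MAPPING_UNICODE_DATA_FIELD] > 0)?(U[TC_MAPPING_UNICODE_DATA_FIELD]):(-1);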
|
||||
|
||||
@ Control codes in Unicode, a residue of ASCII, are given no names by the
|
||||
standard. For example:
|
||||
= (text)
|
||||
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
|
||||
=
|
||||
Indeed, at present every code with category |Cc| has the pseudo-name |<control>|.
|
||||
So we will mostly not allow these to be referred to by name in Inform. (In theory we
|
||||
could read the ISO-10646 comment as if it were a name: here, that would be
|
||||
"END OF TRANSMISSION", which isn't too bad. But "FORM FEED (FF)" and
|
||||
"CHARACTER TABULATION" are less persuasive, and anyway, we don't actually want
|
||||
users to insert control characters into Inform text literals.)
|
||||
|
||||
@<Add to the dictionary of character names@> =
|
||||
text_stream *index = NULL;
|
||||
if (c == Cc_UNICODE_CAT) {
|
||||
if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 9) index = I"TAB";
|
||||
if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 10) index = I"NEWLINE";
|
||||
} else {
|
||||
index = name;
|
||||
}
|
||||
if (index) {
|
||||
Dictionaries::create(UnicodeData_lookup, index); /* key on index, so that TAB and NEWLINE are findable */
|
||||
Dictionaries::write_value(UnicodeData_lookup, index, (void *) up);
|
||||
}
|
||||
|
||||
@h Using the Unicode data.
|
||||
The first lookup here is slow, since it requires us to parse the Unicode
|
||||
specification data file. But after that everything runs quite swiftly.
|
||||
|
||||
=
|
||||
int UnicodeLiterals::parse(text_stream *N) {
|
||||
UnicodeLiterals::ensure_data();
|
||||
if (Dictionaries::find(UnicodeData_lookup, N)) {
|
||||
unicode_point *up = Dictionaries::read_value(UnicodeData_lookup, N);
|
||||
return up->code_point;
|
||||
}
|
||||
return -1;
|
||||
}
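
The lookup depends on names being stored in upper case, which is why the parsing hunk earlier converts the author's wording to upper case before calling this function. A minimal sketch of that normalise-then-look-up step, using a three-entry sample table in place of the full dictionary (`name_entry`, `SAMPLE_NAMES` and `lookup_code_point` are hypothetical names):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef struct name_entry {
    const char *name;   /* upper case, as in the Unicode data file */
    int code_point;
} name_entry;

/* A tiny sample standing in for the dictionary built from the data file. */
static const name_entry SAMPLE_NAMES[] = {
    { "LATIN SMALL LETTER G", 0x0067 },
    { "LATIN CAPITAL LETTER L WITH STROKE", 0x0141 },
    { "SEWING NEEDLE", 0x1FAA1 }
};

/* Returns the code point for a name written in any letter case, or -1. */
int lookup_code_point(const char *written) {
    assert(written != NULL);
    char key[128];
    size_t n = strlen(written);
    if (n >= sizeof key) return -1;
    /* Copy including the terminator, converting to upper case. */
    for (size_t i = 0; i <= n; i++)
        key[i] = (char) toupper((unsigned char) written[i]);
    for (size_t i = 0; i < sizeof SAMPLE_NAMES / sizeof SAMPLE_NAMES[0]; i++)
        if (strcmp(key, SAMPLE_NAMES[i].name) == 0)
            return SAMPLE_NAMES[i].code_point;
    return -1;
}
```

A name absent from the table, such as "Greek capital letter lambda", returns -1, which corresponds to the failure case reported by PM_MidTextUnicode in the test output above.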
|
||||
|
||||
@ We won't go too far down the Unicode rabbit-hole, but here are functions which
|
||||
may some day be useful:
|
||||
|
||||
=
|
||||
int UnicodeLiterals::tolower(int C) {
|
||||
unicode_point *up = UnicodeLiterals::code_point(C);
|
||||
int D = up->tolower;
|
||||
if (D >= 0) return D;
|
||||
return C;
|
||||
}
|
||||
int UnicodeLiterals::toupper(int C) {
|
||||
unicode_point *up = UnicodeLiterals::code_point(C);
|
||||
int D = up->toupper;
|
||||
if (D >= 0) return D;
|
||||
return C;
|
||||
}
|
||||
int UnicodeLiterals::totitle(int C) {
|
||||
unicode_point *up = UnicodeLiterals::code_point(C);
|
||||
int D = up->totitle;
|
||||
if (D >= 0) return D;
|
||||
return C;
|
||||
}
|
||||
int UnicodeLiterals::category(int C) {
|
||||
unicode_point *up = UnicodeLiterals::code_point(C);
|
||||
return up->category;
|
||||
}
|
||||
|
|
|
@ -2817,11 +2817,8 @@ common misunderstanding.
|
|||
"%PMaybe you intended this to produce a Unicode character? "
|
||||
"Unicode characters can be written either using their decimal "
|
||||
"numbers - for instance, 'Unicode 2041' - or with their standard "
|
||||
"names - 'Unicode Latin small ligature oe'. For efficiency reasons "
|
||||
"these names are only available if you ask for them; to make them "
|
||||
"available, you need to 'Include Unicode Character Names by Graham "
|
||||
"Nelson' or, if you really need more, 'Include Unicode Full "
|
||||
"Character Names by Graham Nelson'.");
|
||||
"names - 'Unicode Latin small ligature oe'. For the full list of "
|
||||
"those names, see the Unicode standard version 15.0.0.");
|
||||
Problems::issue_problem_end();
|
||||
|
||||
@<Issue PM_UnknownCondition problem@> =
|
||||
|
|
|
@ -75,8 +75,6 @@ And just for fun, this time we'll make the grid prettier, too; but this will wor
|
|||
say "[state of room 2A][state of room 2B][state of room 2C][state of room 2D][state of room 2E] downstairs[line break]";
|
||||
say "[unicode box drawings light up and right][bottom bar][unicode box drawings light up and left][variable letter spacing][line break]"
|
||||
|
||||
Include Unicode Character Names by Graham Nelson.
|
||||
|
||||
To say top bar:
|
||||
repeat with N running from 1 to 9:
|
||||
if the remainder after dividing N by 2 is 0, say "[unicode box drawings light down and horizontal]";
|
||||
|
|
|
@ -12,9 +12,7 @@ The trick here is that colored output is done in different ways by the Z-Machine
|
|||
|
||||
For the suit symbols, we'll want the Unicode extension included with Inform:
|
||||
|
||||
{**}Include Unicode Character Names by Graham Nelson.
|
||||
|
||||
Rule for printing the name of a card (called target) while grouping together:
|
||||
{**}Rule for printing the name of a card (called target) while grouping together:
|
||||
say "[rank of the target as abbreviated value][suit of the target as symbol]".
|
||||
|
||||
To say (current suit - a suit) as symbol:
|
||||
|
|
|
@ -8,8 +8,6 @@ The following example puts Inform's support for exotic lettering through its pac
|
|||
|
||||
The story headline is "Pushing the Limits of Unicode in IF". The story description is "This is a demanding test for Unicode compliance by Z-machine interpreters."
|
||||
|
||||
Include Unicode Character Names by Graham Nelson.
|
||||
|
||||
Include Basic Screen Effects by Emily Short.
|
||||
|
||||
The Château Bibliothèque Français is east of the Deutsche Universität Bücherei. "From this Borgesian construction, doorways lead into anterooms in each of the four cardinal directions." South of the Bibliothèque is the Miscellany Mañana. North of the Bibliothèque is the Íslendingabók. East of the Bibliothèque is Alphabet Soup.
|
||||
|
|
|
@ -2868,10 +2868,8 @@ Moreover, the player is not allowed to type these characters in commands during
|
|||
|
||||
(d) <b>Characters which might work in quoted text, or might not</b>. The Arabic and Hebrew alphabets are fairly likely to be available; miscellaneous symbols are sometimes legible to the player, sometimes not. Other alphabets are chancier still. (If a work of IF depends on these being visible, it may be necessary to instruct players to use specific interpreters, or to provide a way for the player to test that all will be well.)
|
||||
|
||||
[x] Unicode characters {PM_SayUnicode} {PM_MidTextUnicode}
|
||||
[x] Unicode characters {PM_SayUnicode} {PM_MidTextUnicode} {PM_UnicodeOutOfRange}
|
||||
|
||||
^^{Unicode Character Names / Full Character Names+ext+} ^^{extensions: specific extensions: Unicode Character Names}
|
||||
^^{extensions: specific extensions: Unicode Full Character Names}
|
||||
^^{characters (letters): Unicode (arbitrary symbols)}
|
||||
^^{characters (letters) <-- Unicode}
|
||||
^^^{+tosay+"[(unicode character)]" --> unicode character}
|
||||
|
@ -2882,30 +2880,35 @@ Unicode characters can be named (or numbered) directly in text. For example:
|
|||
|
||||
"[unicode 321]odz Churchyard"
|
||||
|
||||
produces a Polish slashed L. If the Unicode Character Names or Unicode Full Character Names extensions are included, characters can also be named as well as numbered:
|
||||
produces a Polish slashed L. Characters can also be named as well as numbered:
|
||||
|
||||
"[unicode Latin capital letter L with stroke]odz Churchyard"
|
||||
|
||||
The Unicode standard assigns character numbers to essentially every marking used in text from any human language: its full range is enormous. (Note that Inform writes these numbers in decimal: many reference charts show them in hexadecimal, or base 16, which can cause confusion.) Inform can only handle codes [unicode 32] up to [unicode 65535], so it is not quite so catholic, but the range is still enormous enough that code numbers are unfamiliar to the eye. Inform therefore allows us to use the official Unicode 4.1 names for characters, instead of their decimal numbers, <i>provided</i> we have Included the necessary extension like so:
|
||||
The Unicode standard assigns character numbers to essentially every marking used in text from any human language: its full range is enormous. (Note that Inform writes these numbers in decimal: many reference charts show them in hexadecimal, or base 16, which can cause confusion.)
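A minimal sketch of the decimal-versus-hexadecimal point above, written in Python rather than Inform because it concerns only the arithmetic. Reference charts usually give a code in the form U+0141, whereas Inform text would write [unicode 321]. The function name `inform_decimal` is hypothetical and is not part of Inform.

```python
import unicodedata


def inform_decimal(chart_code: str) -> int:
    """Convert a chart-style hexadecimal code ('U+0141' or '0141') to decimal."""
    text = chart_code.strip().upper()
    if text.startswith("U+"):
        text = text[2:]          # drop the conventional 'U+' prefix
    return int(text, 16)         # charts are base 16; Inform expects base 10


if __name__ == "__main__":
    code = inform_decimal("U+0141")
    print(code)                          # 321
    print(unicodedata.name(chr(code)))   # LATIN CAPITAL LETTER L WITH STROKE
```

Going the other way, `format(321, "04X")` gives the chart form `0141`.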
|
||||
|
||||
{*}Include Unicode Character Names by Graham Nelson.
|
||||
|
||||
This extension provides names for some 2900 of the most commonly used characters. It means, for instance, that we can write text such as:
|
||||
This means, for instance, that we can write text such as:
|
||||
|
||||
"Dr Zarkov unveils the new [unicode Hebrew letter alef] Nought drive."
|
||||
"Omar plays 4[unicode black spade suit] with an air of triumph."
|
||||
|
||||
Admittedly, these can get a little verbose:
|
||||
Admittedly, character names can get a little verbose:
|
||||
|
||||
"[unicode Greek small letter omega with psili and perispomeni and ypogegrammeni]"
|
||||
|
||||
Inform can "only" handle codes [unicode 32] up to [unicode 131071], and note that if the story settings are to compile to the Z-machine, this range stops at 65535: thus many emoji characters - say, [unicode fish cake with swirl design] - can only be used if the story will compile to Glulx or another modern target. But by default, stories are compiled the modern way, so this should not be a problem in practice.
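The range limits in this paragraph can be sketched as a small check, here in Python rather than Inform. The bounds 32, 65535 and 131071 come from the text; the function name `usable_targets` and the target labels are hypothetical, and "Glulx" stands in for "Glulx or another modern target".

```python
INFORM_MIN = 32          # lowest code the text says Inform handles
Z_MACHINE_MAX = 65535    # upper limit when compiling to the Z-machine
INFORM_MAX = 131071      # upper limit for Glulx and other modern targets


def usable_targets(code: int) -> list:
    """Return the (hypothetical) target labels on which this code is usable."""
    if code < INFORM_MIN or code > INFORM_MAX:
        return []                        # outside the range Inform handles at all
    if code <= Z_MACHINE_MAX:
        return ["Z-machine", "Glulx"]    # within the 16-bit range
    return ["Glulx"]                     # above 65535: modern targets only


if __name__ == "__main__":
    fish_cake = ord("\U0001F365")        # FISH CAKE WITH SWIRL DESIGN
    print(fish_cake, usable_targets(fish_cake))
    print(321, usable_targets(321))
```

The fish cake character has decimal code 127845, which is above 65535 but below 131071, so it falls into the Glulx-only case described in the text.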
|
||||
|
||||
There are far too many possible names to list here: formally, any character name in the Basic Multilingual Plane or the Supplementary Multilingual Plane of version 15.0.0 of the Unicode standard can be used. See:
|
||||
|
||||
https://en.wikipedia.org/wiki/Plane_(Unicode)
|
||||
|
||||
But before getting carried away, we should remember the hazards: Inform allows us to type, say, "[unicode Saturn]" (an astrological sign) but it appears only as a black square if the resulting story is played by an interpreter using a font which lacks the relevant sign. For instance, Zoom for OS X uses the Lucida Grande and Apple Symbol fonts by default, and this combination does contain the Saturn sign: but Windows Frotz tends to use the Tahoma font by default, which does not. (Another issue is that the fixed letter spacing font, such as used in the status line, may not contain all the characters that the font of the main text contains.) To write something with truly outré characters is therefore a little chancy: users would have to be told quite carefully what interpreter and font to use to play it.
|
||||
|
||||
The "Unicode Character Names" extension, which is pre-installed in the standard distribution of Inform, defines names for the Latin, Greek, Cyrillic, Hebrew and Braille alphabets, together with currency and miscellaneous other symbols, including some for drawing boxes and arrows. It is only optionally installed because even this is quite large: but in case it should still prove inadequate, an alternative can be used:
|
||||
At one time, Inform could only use named Unicode values in a story which had first included an extension:
|
||||
|
||||
{*}Include Unicode Full Character Names by Graham Nelson.
|
||||
Include Unicode Character Names by Graham Nelson.
|
||||
Include Unicode Full Character Names by Graham Nelson.
|
||||
|
||||
This includes all 12,997 named characters in the 16-bit range of the Unicode 4.1 standard: it is the size of a small novel and its inclusion will slow Inform down. But if you want to experiment with Arabic, ecclesiastical Georgian, Cherokee, Tibetan, Syriac, the International Phonetic Alphabet, hexagrams or the unified Canadian aboriginal syllabics, "Unicode Full Character Names" (again built into Inform) is the extension for you.
|
||||
This is no longer the case: no such inclusion need now be made, and indeed, those extensions have been removed from Inform as redundant.
|
||||
|
||||
[x] Displaying quotations
|
||||
|
||||
|
@ -18058,20 +18061,20 @@ A third way to define an adjective, which should be used only if speed is except
|
|||
|
||||
The escape "*1" is expanded to the value on which the adjective is being tested. (This is usually faster than calling a routine, but in case of side-effects, the "*1" should occur only once in the condition, just as with a C macro.) To repeat: if in doubt, use the I6 routine method above.
|
||||
|
||||
[x] Naming Unicode characters {PM_UnicodeAlready} {PM_UnicodeNonLiteral} {PM_UnicodeOutOfRange}
|
||||
[x] Naming Unicode characters
|
||||
|
||||
^^{characters (letters): Unicode (arbitrary symbols): defining new names for}
|
||||
^^{translates as...+assert+: Unicode characters}
|
||||
^^{Unicode Character Names / Full Character Names+ext+} ^^{extensions: specific extensions: Unicode Character Names}
|
||||
^^{extensions: specific extensions: Unicode Full Character Names}
|
||||
|
||||
Inform allows the Unicode characters to be identified either with a decimal number or by name, but it has none of the character names built-in, and for efficiency reasons it only learns them when necessary.
|
||||
|
||||
Users normally teach these names to Inform by including one of the extensions "Unicode Character Names" or "Unicode Full Character Names", which consist of many hundreds of sentences like so:
|
||||
At one time Inform allowed names to be given to Unicode character values with
|
||||
sentences like so:
|
||||
|
||||
anticlockwise open circle arrow translates into Unicode as 8634.
|
||||
|
||||
Nothing restricts this usage to those extensions.
|
||||
These sentences now throw problem messages, and instead Inform allows exactly
|
||||
those names in the Unicode standard.
|
||||
|
||||
[x] Overriding definitions in kits {PM_BadI6Inclusion} {PM_BeforeTheLibrary} {PM_WhenDefiningUnknown} {PM_IncludeInsteadOf}
|
||||
|
||||
|
|
|
@ -166,8 +166,6 @@ Other extensions shipped with Inform are not presented as webs, but as single fi
|
|||
{extension author: Graham Nelson title: English Language}
|
||||
{extension author: Graham Nelson title: Metric Units}
|
||||
{extension author: Graham Nelson title: Rideable Vehicles}
|
||||
{extension author: Graham Nelson title: Unicode Character Names}
|
||||
{extension author: Graham Nelson title: Unicode Full Character Names}
|
||||
|
||||
### Website templates and interpreters shipped with Inform
|
||||
|
||||
|
|
|
@ -23,6 +23,7 @@ but they're just plain old files, and are not managed by Inbuild as "copies".
|
|||
@e EXTENSION_DOCUMENTATION_MODEL_IRES
|
||||
@e RESOURCE_JSON_REQS_IRES
|
||||
@e REGISTRY_JSON_REQS_IRES
|
||||
@e UNICODE_DATA_IRES
|
||||
|
||||
=
|
||||
filename *InstalledFiles::filename(int ires) {
|
||||
|
@ -44,6 +45,8 @@ filename *InstalledFiles::filename(int ires) {
|
|||
return Filenames::in(misc, I"resource.jsonr");
|
||||
case REGISTRY_JSON_REQS_IRES:
|
||||
return Filenames::in(misc, I"registry.jsonr");
|
||||
case UNICODE_DATA_IRES:
|
||||
return Filenames::in(misc, I"UnicodeData.txt");
|
||||
|
||||
case CBLORB_REPORT_MODEL_IRES:
|
||||
return InstalledFiles::varied_by_platform(models, I"CblorbModel.html");
|
||||
|
|