Dyalog’08
Migrating to Unicode
Morten Kromberg
Workshop at Dyalog’08 - Elsinore
Agenda
• What is Unicode?
• V.12 Design Goals
• Key Unicode Features
• Language Differences
– ⎕DR, ⍋ of char data
– Space & Performance
• ”Interop”: Classic vs Unicode
– WSs & Component Files
– TCP Sockets & Conga
– External Vars, Mapped Files
– Own DLLs and APs
• Native Files
– Unicode Text Files (UTF-8)
• External Interfaces
– COM/OLE, Microsoft.NET
– ODBC / SQAPL
– ⎕NA: A & W win32 calls
• Source Code Management
– SALT, SubVersion, Diff Tools
• Planning Migrations
What is Unicode?
Wikipedia: An industry standard allowing computers to
consistently represent and manipulate text expressed
in any of the world's writing systems.
• It assigns a number, or code point, to each of
approximately 100,000 characters
– Including the APL character set.
• The first version of the standard appeared in 1991,
support is now becoming “common” on all platforms
Why do we want Unicode?
• Obviously: It allows us to write applications which
use text from all the world’s written languages…
• Less obviously, but perhaps more important in the
short term:
– APL no longer needs its own character set (“Atomic Vector”)
– Characters no longer need to be translated on the way in
and out of APL
– APL Source Code can be stored in “ordinary” text files and
be handled by “standard” management tools
What is Unicode in practice?
Char  Name                                HEX    DEC     UTF-8
A     Latin capital letter A              00041  65      65
Æ     Latin capital letter AE             000C6  198     195 134
α     Greek small letter alpha            003B1  945     206 177
ؤ     Arabic letter waw with hamza above  00624  1572    216 164
⍺     APL functional symbol alpha         0237A  9082    226 141 186
𠀁    CJK ideograph extension B, second   20001  131073  240 160 128 129
• Most often, when someone tells you the data
”is Unicode”, they mean ”UTF-8 encoded”.
Use Google...
Wikipedia too ...
Encodings
• ”Unicode” assigns unique numbers to characters.
• Encodings are ways to represent these numbers on file.
Encoding  Description
UCS-4     4 bytes per character (= Dyalog ⎕DR type 320). Often used as internal representation on Unix systems.
UCS-2     2 bytes per character (= type 160). The internal format for ”wide” chars under Windows until Win2000.
UTF-8     THE most popular encoding for text files. Identical to ASCII for range 0-127 (= good for Americans). 2 bytes/char from 128-2047, 3 bytes 2048-65535, 4 bytes after that. The only encoding which is independent of ”endian-ness”.
UTF-16    Identical to UCS-2 for most of first plane, but can encode all characters. Replaced UCS-2 on Windows after Win2000.
UCS (Universal Character Set) encodings have a fixed width,
UTF (Unicode Transformation Format) encodings are variable width.
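The UTF-8 byte ranges quoted for each code point range can be checked with one line of APL. This is a minimal sketch: the name UTF8Len is hypothetical, and the argument is the list of code points from the ”What is Unicode in practice?” table.

```apl
      ⍝ Hypothetical helper: number of bytes UTF-8 needs for each code point
      UTF8Len←{1+(⍵>127)+(⍵>2047)+⍵>65535}
      UTF8Len 65 198 9082 131073      ⍝ A Æ ⍺ 𠀁
1 2 3 4
```

The counts agree with the UTF-8 column of that table (1, 2, 3 and 4 bytes).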
Version 12.0 Design Goals
• To allow users to develop Unicode applications
(containing all the world’s symbols)
• To make the Dyalog IDE a Unicode application
– No more ”translate tables”!
• Avoid having to explain ⎕AV to future generations
– Only one ”kind” of characters
• Design should encourage migration
– Controlled migration with ”interop” between old & new apps
– No ”Big Bang” data conversion events
– Classic & Unicode editions allow ”parallel runs”
Unicode vs Classic
• Unicode Edition:
– Character data is defined as Unicode code points
– No translation of data as it moves in & out of APL
• Classic Edition:
– Character data is defined as indices into ⎕AV
– Translate tables used for keyboard, display and file I/O
• Classic will be available so long as a single major
customer has not been able to migrate
– The price may increase at some point
Key Unicode Features (1)
• New Character Data Types 80, 160, 320:
1-, 2-, 4-byte representations of Code Points.
      ⎕DR 'Hello'
80
      ⎕DR '{⍺+⍵}'
160
      ⎕DR '𠀁𠀂𠀃'
320
• NB: One character = one array element!
Key Unicode Features (2)
• Monadic ⎕UCS converts to and from code
points (self inverse):
      ⎕UCS 'Hello'
72 101 108 108 111
      ⎕UCS '{⍺+⍵}'
123 9082 43 9077 125
      ⎕UCS (2*17)+⍳3
𠀁𠀂𠀃
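Because monadic ⎕UCS is its own inverse, applying it twice returns the original text, and arithmetic on code points can be done in between. A small sketch (the case-shift trick is only an illustration and is valid for the letters A-Z only):

```apl
      ⎕UCS ⎕UCS 'Hello'          ⍝ characters → code points → characters
Hello
      ⎕UCS 32+⎕UCS 'HELLO'       ⍝ add 32 to each code point: A-Z becomes a-z
hello
```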
Key Unicode Features (3)
• Dyadic ⎕UCS encodes and decodes data as UTF-8,
UTF-16 or UTF-32:
      'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
      'UTF-8' ⎕UCS 240 160 128 129, 240 160 128 130, 240 160 128 131
𠀁𠀂𠀃
      'UTF-16' ⎕UCS '𠀁𠀂𠀃'
55360 56321 55360 56322 55360 56323
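Dyadic ⎕UCS works in both directions: a character argument is encoded to integers, an integer argument is decoded to characters. A minimal round-trip sketch (the variable name bytes is arbitrary):

```apl
      bytes←'UTF-8' ⎕UCS '{⍺+⍵}'       ⍝ encode: characters → UTF-8 byte values
      bytes
123 226 141 186 43 226 141 181 125
      '{⍺+⍵}'≡'UTF-8' ⎕UCS bytes       ⍝ decode: byte values → characters
1
```

The two APL symbols take three bytes each, so five characters become nine bytes.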
Demo 1 ...
(key features)
Language Differences
• If you are only using APL workspaces,
and component files, most code from
earlier versions will just load & run
• Potential problems are:
– Monadic ⍋ (only real language difference)
– ⎕DR to test for character data
– Dyadic use of ⎕DR to ”cast” data
– Space usage (char arrays can be larger)
Monadic ⍋
• Due to differences in the internal representation,
grade up (⍋) without a collation sequence may return
different results:
Classic:
      ⍋'aA'
1 2
      ⎕AV⍳'aA'
18 66
Unicode:
      ⍋'aA'
2 1
      ⎕UCS 'aA'
97 65
• Give ⍋ a left argument of ⎕AV to maintain the current
behaviour
• In many cases where ⍋ is used monadically, the exact order does not matter
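The ⎕AV left argument can be sketched as follows. This assumes the default ⎕AV (where, as shown above, 'a' is at position 18 and 'A' at position 66), and the results are the same in both editions:

```apl
      ⎕AV⍋'aA'                   ⍝ collate by position in ⎕AV, as Classic always did
1 2
      {⍵[⎕AV⍋⍵]}'bBaA'           ⍝ sort: lower case precedes upper case in ⎕AV
abAB
```

Under the Unicode Edition, a monadic grade of the same data would instead order by code point, which puts upper case first.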
Testing for Character Data
• This no longer works as expected:
82=⎕DR X
• Dyalog recommends:
(10|⎕DR ⍵)∊0 2
– The latter is correct in all versions
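The recommended test can be wrapped in a small function so that application code never mentions a specific type number. A minimal sketch; the name IsChar is hypothetical:

```apl
      ⍝ 1 if the argument is a simple character array: type 82 in Classic,
      ⍝ or types 80, 160 and 320 in the Unicode Edition
      IsChar←{(10|⎕DR ⍵)∊0 2}
      IsChar 'Hello'
1
      IsChar '{⍺+⍵}'
1
      IsChar 1 2 3
0
```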
Dyadic ⎕DR for ”Casting”
• Classic (and previous versions):
      83 ⎕DR '⍋'      ⍝ ⎕AV[⎕IO+198]
¯109                  ⍝ Via APL+Win tables
• Unicode:
      83 ⎕DR '⍋'      ⍝ ⎕UCS 9035
75 35                 ⍝ 9035 = 256⊥⌽75 35
• The internal representation is different, and Unicode
does NO TRANSLATION
• Code which (e.g.) reads characters from native files
and then ”casts” to number using ⎕DR needs work
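One possible way to rework such code in the Unicode Edition is to use ⎕UCS instead of a cast, since it returns code points independently of the internal representation. A sketch, assuming the text consists of one-byte characters (code points 0-255):

```apl
      ⎕UCS 'AÆ'                      ⍝ code points, i.e. unsigned byte values
65 198
      {⍵-256×⍵>127}⎕UCS 'AÆ'         ⍝ the same values as signed bytes
65 ¯58
```

The second expression produces the signed form that older code would have obtained from an 83 ⎕DR cast of one-byte data.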
More on ⎕DR ... (and ⎕UCS)
• Unicode Edition still recognises 82 as a left argument:
      82 ⎕DR ¯109
⍋
• This returns the same character as in Classic. But:
      ⎕DR 82 ⎕DR ¯109      ⍝ Type 82 cannot exist in Unicode
160
• Conversely, ⎕UCS exists in Classic:
      ⎕UCS 9035            ⍝ But must return elements of ⎕AV
⍋
      ⎕UCS 180             ⍝ Cannot convert to type 82
TRANSLATION ERROR
Space and Time
• Character data will require 2 bytes per element in the
Unicode Edition, if it contains APL symbols. No
existing APL arrays can need 4 bytes per element.
• Primitives which manipulate or search this data may
run more slowly (more data to sift through).
• Comments and character constants in code, and the
script form of namespaces and classes, are also
affected
Time and Space
• When copying functions between Classic and Unicode, the
format needs to be converted – this can be expensive.
• The same applies when reading a ⎕OR “across the line”.
• It is not recommended to dynamically import functions across
the Classic/Unicode boundary in production applications.
• Some VERY LARGE functions which could fix in v11.0 may not
fix in the Unicode Edition: Lists of names and constants in a
function share space with comments.
– Proposal to relax all limits on functions may be executed for version 12.1
Unicode vs Classic
• Use the Unicode Edition if:
– You want to develop new applications
– You need to manage characters not in ⎕AV now.
• Use the Classic Edition if:
– You need other v12+ enhancements, but are not ready to
convert to Unicode yet
– Classic is upwards compatible with v11.0 (as usual)
• UE and CE are maintained from single source, and
are ”identical” except for character arrays.
• Start planning your migration now! (please!)
So you want to migrate soon...
• If you ”only use APL” (workspaces, component files,
sockets), applications SHOULD just load & run
• If you
– Fell for the temptation to use any external tools or storage
media as part of your application 
– Wrote your own APs or DLLs
– Or want to start using data not in ⎕AV
... you may have a little work to do. Let’s take a look!
”Interop”
• Unicode and Classic editions are designed to inter-operate
seamlessly – also with v11 & v10.1
• 12.0 Classic can read and translate Unicode character data
found in files, workspaces and on TCP sockets
• Unicode editions will translate data to type 82 when using TCP
Sockets and Component files flagged as non-Unicode (for
interop with v11 & v10.1)
• If Unicode data contains characters not in ⎕AV
=> TRANSLATION ERROR
• Unicode editions still recognise 82 as a valid argument to ⎕DR
and native file functions, and are able to map data in old native
files to ”the same character”.
”Interop”
• The intention is that users should be able to
perform controlled experiments when
migrating to Unicode
• No ”Big Bang” data conversion events; old
files and workspaces can still be read
• We hope that users will ”reciprocate” by
moving as quickly as possible; it is as easy as
we could make it!
Workspaces
• Classic and Unicode editions can load each other’s workspaces, but:
– Classic cannot load (or COPY from) a workspace containing characters not
in ⎕AV (TRANSLATION ERROR)
• The contents of ⎕AV are defined by ⎕AVU, a list of 256 Unicode Code
Points:
      ⎕AV[97+⍳26] ⍝ By default in v12.0, "Dyalog Alt"
ÁÂÃÇÈÊËÌÍÎÏÐÒÓÔÕÙÚÛÝþãìðòõ
      ⎕AVU[97+⍳26]←9397+⍳26 ⍝ Underscored alphabet (sort of)
      ⎕AV[97+⍳26] ⍝ Now we have "Dyalog Std" mapping
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• When )COPYing from a pre-v12 workspace, ⎕AVU in the target namespace
decides how incoming character data is translated. So code written using Alt &
Std can be merged and maintain the original looks.
More on ⎕AVU
• The Dyalog Std font is still used in some older (”anglo”) applications
• Dyalog Alt is used across Western Europe
• Some countries use fonts created by local distributors:
)copy avu Russian.⎕AVU
C:\...avu saved Fri Jun 27 10:00:52 2008
3 50⍴65↓⎕AV
ABCDEFGHIJKLMNOPQRSTUVWXYZАБВГД⍙ЕЖЗИЙКЛМНОПРСТУФХЦ
ЧШЩЪЫЬЭЮ{€}⊣⌷¨Яабв⍨гдежзийклмнопрстуфхцч[/⌿\⍀<≤=≥>
≠∨∧-+÷×?∊⍴~↑↓⍳○*⌈⌊∇∘(⊂⊃∩∪⊥⊤|;,⍱⍲⍒⍋⍉⌽⊖⍟⌹!⍕⍎⍫⍪≡≢шщъы
• The translate table is also used when reading component files and APL
data arriving on TCP Sockets
• It has namespace scope, so classes or namespaces can be defined to
read data from Classic systems using different languages if necessary
Underscores Must Die!
• There is no Underscored alphabet in Unicode. Underscoring is a
form of ”emphasis” (like bold or italic). The underscored alphabet is
the ONLY incompatibility with the rest of the world and should be
phased OUT.
• The APL385 Unicode font incorrectly displays underscores for
code points 9398-9423 (decimal). The positions should really
display as Ⓐ..Ⓩ.
• (Don’t ask why circled alphabetics ARE in Unicode, while
underscores are not – but Dyalog decided to map underscores to
this range)
⎕AV: Just another variable
• In the Unicode Edition, the Atomic Vector is only used to define
how to inter-operate with Classic systems. Only characters in
⎕AV can be shared. Assuming the default (Alt) setting:
'ÁⒶ'∊⎕AV
1 0
• System variable ⎕Ⓐ (name now displays as ⎕Á) should no
longer be used. It continues to exist and returns ⎕AV[97+⍳26]
Chars Allowed in Names
• The list has not been extended, the following are allowed:
0123456789 (but not as the 1st character in a name)
ABCDEFGHIJKLMNOPQRSTUVWXYZ_
abcdefghijklmnopqrstuvwxyz
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
àáâãäåæçèéêëìíîïðñòóôõöøùúûüþ
∆⍙ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• In a standard font, underscores display as Ⓐ to Ⓩ
• In Unicode, all of the above can now be used simultaneously
(previously, the available set depended on whether the Alt or
Std font was selected). Russian letters are NOT allowed.
Component File Interop
• Like workspaces, Component Files can be shared
between Classic and Unicode editions.
• The same restriction applies: Classic cannot read
arrays containing characters not in ⎕AV.
• Files can be marked as non-Unicode, in which case
Unicode cannot write characters not in ⎕AV.
– All ”small” (32-bit) component files are non-Unicode
• For ordinary APL arrays (no ⎕ORs), the Unicode
edition can share files with old versions of APL too.
File Properties
• New system function ⎕FPROPS allows you to control
whether a file may contain Unicode data:
      'c:\temp\smallfile' ⎕FCREATE 32 32
      'EJSU' ⎕FPROPS 32       ⍝ Endian, Journaled, Size, Unicode
0 0 32 0
      'c:\temp\bigfile' ⎕FCREATE 64 64
      'EJSU' ⎕FPROPS 64
0 0 64 1
• Size defaults to 64 from v12.0 (new startup flag –F32/-F64)
• Small address size (32-bit) files are limited to 4GB in size and
can NOT have the Unicode bit set
• Setting Journaling on prevents sharing with v11.0 or earlier
Translation Error on Write
• Unicode edition can write to non-Unicode component files:
      '{⍺+⍵}' ⎕FAPPEND 32      ⍝ ∧/'{⍺+⍵}'∊⎕AV – fine!
      'U' 0 ⎕FPROPS 64         ⍝ Switch Unicode OFF
      '𠀁𠀂𠀃' ⎕FAPPEND 64      ⍝ Chars not in ⎕AV
TRANSLATION ERROR
      'U' 1 ⎕FPROPS 32         ⍝ Not allowed for small files
TRANSLATION ERROR
• If non-Unicode files do not contain namespaces or ⎕ORs, v10.1
and v11.0 can use them
• Note: Large files (64-bit) cannot be used with versions 10.0 or
earlier.
TCP Socket / Conga Interop
• TCPSocket objects have an Encoding property:
Encoding  Style  Meaning
None      Char   No translation, characters must be in range 0-255.
UTF-8     Char   To UTF-8 on send, from UTF-8 on receive
Classic   APL    Chars transmitted encoded as elements of ⎕AV
Unicode   APL    Types 80, 160 or 320 used as required
• The default is None for Char, and Classic for APL
• APL sockets are non-Unicode by default to avoid crashing down-version APL interpreters receiving Unicode data
• Conga always sends data in ”native” form; receive will fail with a
TRANSLATION ERROR if data cannot be represented
External Variables
• External Variables are implemented as small span component
files (32-bit files) – and can thus NOT contain Unicode data:
      'c:\temp\xvar' ⎕XT'x'
      x
Hello World
      x←'𠀁𠀂𠀃'
TRANSLATION ERROR
• External Variables should be seen as a ”deprecated” feature:
You will still be able to use existing external variables, but
should plan to convert to component files or mapped files at
your convenience.
Mapped Files
• Like external variables, the use of APL mapped files (containing APL
arrays with header information) should be seen as a deprecated feature.
– Convert to using other mechanisms at your earliest convenience.
• Support for RAW mapped files (where type information is provided when
mapping) remains core functionality (and will probably get more
important in a world of multicore machines):
      32↓102↑80 ¯1 ⎕MAP'c:\Program Files\ComfortKeyboard\changes.txt'
Added new interface languages: Latvian, Brazilian Portuguese, Italian.
• Type 82 is NOT supported in the Unicode Edition: Mapped variables are
”in the workspace” and cannot be translated on access.
• To read a raw file written using data type 82, map with data type 83 and
extract the characters by indexing into ⎕AVU.
(Own) DLLs and APs
• The format for passing APL arrays to Libraries and Auxiliary
Processors is unchanged, except that a Unicode Edition will pass
character arrays of type 80, 160 or 320
• Dyalog-provided libraries have been upgraded. A number of old APs
like PREFECT are no longer shipped, but v11 versions will continue to
work fine with the Classic Edition.
• If you have written your own APs or DLLs which handle character
data, these need to be updated to deal with new data types.
• You can return any of the Classic or Unicode character types, they will
be translated (subject to the usual TRANSLATION ERROR limitations).
Native Files
• Unicode Edition also still supports type 82, so that old files containing
APL characters can be used. They map to the ”same characters” but with a different internal representation:
V11:
      'c:\temp\plus' ⎕NCREATE ¯1
      '{⍺+⍵}' ⎕NAPPEND ¯1
V12:
      ⎕DR ⎕←⎕NREAD ¯1 82 5 0
{⍺+⍵}
160
Native Files & Unicode
• Unicode Edition supports new data types 80, 160, 320 – reading or
writing 1, 2 or 4 bytes at a time (file is UCS-1, -2 or -4 encoded).
• Code Change Possibly Required: The DEFAULT TYPE when
appending character arrays is now 80 (was 82):
      'plus:' ⎕NAPPEND ¯2      ⍝ Type 80 (all ANSI)
      '{⍺+⍵}' ⎕NAPPEND ¯1      ⍝ Type 160 (APL chars)
DOMAIN ERROR                   ⍝ Data cannot be narrowed
• Early Beta versions of 12.0 used the type of the left argument, but this
led to variable numbers of bytes being used when writing depending
on the content of an array (160 if a non-ANSI character included).
• If you need to write text containing APL to a native file, use type 160 –
or perhaps better, use UTF-8!
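Writing with an explicit type 160 might look like the sketch below. The file name is hypothetical, the tie number is chosen by the interpreter (0 as right argument), and the file must not already exist:

```apl
      tn←'c:\temp\apl160.txt' ⎕NCREATE 0     ⍝ hypothetical file name
      '{⍺+⍵}' ⎕NAPPEND tn 160                ⍝ explicit type 160: 2 bytes per character
      ⎕NSIZE tn                              ⍝ 5 characters occupy 10 bytes
10
      ⎕NREAD tn 160 5 0                      ⍝ read 5 two-byte characters from offset 0
{⍺+⍵}
      ⎕NUNTIE tn
```

Stating the type on every append keeps the number of bytes per character fixed, whatever the content of the array being written.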
Native Files & UTF-8
• The most common way to store Unicode data in text files is to encode it
using UTF-8: This is a format understood by ”most” web applications
and other Unicode-enabled applications.
      text←'plus←{⍺+⍵}'
      'UTF-8' ⎕UCS 'plus'
112 108 117 115
      'c:\temp\plus.txt' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS text) ⎕NAPPEND ¯1
      ⎕CMD 'notepad c:\temp\plus.txt' 'normal'
• Windows Notepad is able to detect that the file is UTF-8 encoded and
displays the text correctly.
• The monadic ⎕UCS on the left converts integers in the range 0-255 into
one-byte Unicode characters before appending. Appended directly, integers
above 127 would become type 163 (2 bytes per element).
Native Files & UTF-8
• UCS-2 (2 bytes per character) is supported by many Microsoft apps
(like Visual Studio). UCS-2 was the standard until Windows 2000 – now
replaced by UTF-16, which is identical to UCS-2 for most data, but
expands to 4 bytes when required.
• Applications need to know which encoding has been used. Two
common methods of indicating this are ”Byte Order Marks” at the
beginning of the file, and (for web pages) HTTP tags.
Byte Order Mark
• By convention, the first few bytes of text files are sometimes (but not
always) an encoding of U+FEFF, the ”Byte Order Mark”, also known as
”Zero width no-break space”.
• This convention allows applications to ”guess” the encoding used:
1st bytes are...  Encoding is therefore probably
EF BB BF          UTF-8
FF FE             UTF-16 or UCS-2, written by little endian CPU (Intel)
FE FF             UTF-16 or UCS-2, big endian
FF FE 00 00       UTF-32 / UCS-4, little endian
00 00 FE FF       UTF-32 / UCS-4, big endian
• The convention is more common under Windows than Unix/Linux.
Sometimes writing the BOM makes things worse...
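The table of signatures translates directly into a lookup function. A minimal sketch: the name GuessEncoding is hypothetical, and the argument is assumed to be the first bytes of the file as integers in the range 0-255. The 4-byte signatures are tested first because FF FE 00 00 also begins with FF FE.

```apl
      GuessEncoding←{
          b←4↑⍵                             ⍝ first four bytes, zero-padded if fewer
          b≡255 254 0 0:'UTF-32 LE'
          b≡0 0 254 255:'UTF-32 BE'
          (3↑b)≡239 187 191:'UTF-8'
          (2↑b)≡255 254:'UTF-16 LE'
          (2↑b)≡254 255:'UTF-16 BE'
          'unknown'                         ⍝ no signature: could be ANSI or BOM-less UTF-8
      }
      GuessEncoding 239 187 191 112 108
UTF-8
      GuessEncoding 255 254 112 0
UTF-16 LE
```

It remains a guess: because of the zero padding, a UTF-16 file which contains nothing but its BOM (or which starts with a NUL character) is reported as UTF-32 LE.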
Reading Text Files
∇ Chars←ReadFile name;nid;signature;nums
  ⍝ Read ANSI or Unicode character file (Windows)
  nid←name ⎕NTIE 0
  signature←3↑⎕NREAD nid 83 3 0
  :If signature≡¯17 ¯69 ¯65                ⍝ UTF-8 (EF BB BF)
      Chars←⎕NREAD nid 80(¯3+⎕NSIZE nid)3
      Chars←'UTF-8' ⎕UCS ⎕UCS Chars
  :ElseIf (2↑signature)≡¯1 ¯2              ⍝ LittleEnd UTF-16 (FF FE)
      Chars←⎕NREAD nid 160(¯1+⎕NSIZE nid)2
  :Else                                    ⍝ ANSI
      Chars←⎕NREAD nid 80(⎕NSIZE nid)0
  :EndIf
  ⎕NUNTIE nid
∇
Writing Text Files
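The code for this slide is not part of the extracted text. As a stand-in, here is a minimal sketch of a writer, using only the calls shown on the other native file slides. The name WriteUTF8 is hypothetical; writing the BOM is optional, and the target file is assumed not to exist yet.

```apl
∇ {ok}←text WriteUTF8 name;nid;bytes
  ⍝ Sketch: write the character vector text to a new file as UTF-8 with a BOM
  nid←name ⎕NCREATE 0                      ⍝ 0: let the interpreter choose a tie number
  bytes←239 187 191,'UTF-8' ⎕UCS text      ⍝ EF BB BF signature, then the encoded text
  (⎕UCS bytes) ⎕NAPPEND nid 80             ⍝ one-byte characters: one byte per element
  ⎕NUNTIE nid
  ok←1
∇
```

A call would look like 'plus←{⍺+⍵}' WriteUTF8 'c:\temp\plus.txt'.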
Writing a UTF-8 Web Page
      html←'<html>',NL,' <head>',NL
      html,←' <meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      html,←' </head>',NL,'<body>',NL
      html,←' <font face="APL385 Unicode">'
      html,←'plus←{⍺+⍵}</font>',NL
      html,←'</body>',NL,'</html>',NL
      'c:\temp\plus.htm'⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS html) ⎕NAPPEND ¯1
      ⎕NUNTIE ¯1
      ⎕CMD 'iexplore c:\temp\plus.htm' ''
Web Page: Results
UTF-8 Files with .NET
      apltxt←⎕SE.SALT.New 'C:\..\UTF8File' 'c:\temp\apl.txt'
      apltxt.Text
Compute average in APL:
avg←{(+/⍵)÷⍴⍵}
      apltxt.Text,←⊂'⍝ Morten was here'
      System.Text.Encoding.⎕NL ¯2
ASCII BigEndianUnicode Default Unicode UTF32 UTF7 UTF8
External Interfaces: COM/.NET
• COM/OLE, Microsoft.Net: No problem
– Have been translating chars to UCS-2/UTF-16 ”always”
– Translation code removed in v12 Unicode 
• We already saw it in action:
↑System.IO.File.ReadAllLines ⊂'c:\temp\apl.txt'
Compute average in APL:
avg←{(+/⍵)÷⍴⍵}
SQAPL / ODBC & Unicode
(not all results displayed in the following)
      SQA.Connect 'B' 'MS SQL Server' 'pass' 'user'
0
      SQA.Columns 'B' 'idioms'
COLUMN_NAME  ..  DATA_TYPE  TYPE_NAME     COLUMN_SIZE
id           ..          4  int identity           10
exp          ..         ¯9  nvarchar              400
      ⎕←data←3 1⊃SQA.Do 'B' 'select * from idioms'
1 {(+/⍵)÷⍴⍵}
2 {⍵/⍳⍴⍵}
3 {(<\⍵)⍳1}
      data[;2]←{⎕UCS 'UTF-8' ⎕UCS ⍵}¨data[;2] ⍝ Make UTF8
SQAPL Example (continued)
      SQA.Do 'B' 'alter table idioms add utf8exp varbinary(100)'
      SQA.Prepare 'B.U1' 'update idioms set utf8exp=:<X20: where id=:<I:' ('Bulk' 20)
      SQA.X 'B.U1' (⌽data) ⍝ Store UTF8
1
2
3
1
      ⎕←data←3 1⊃SQA.Do 'B' 'select id,exp,utf8exp from idioms'
{(+/⍵)÷⍴⍵}  {(+/â•µ)Ã·â•´â•µ}
{⍵/⍳⍴⍵}     {â•µ/â•³â•´â•µ}
{(<\⍵)⍳1}   {(<\â•µ)â•³1}
      data[;2]≡¨{'UTF-8' ⎕UCS (⎕UCS ⍵)~0}¨data[;3] ⍝ It works!
1 1
ODBC / SQAPL Summary
• SQAPL 6.0 supports ODBC Unicode data types:
ODBC Type     SQAPL Type  Description
WCHAR         U           ”Wide” fixed-length string
WVARCHAR      W           ”Wide” variable-length
WLONGVARCHAR  Q           ”Wide” unlimited-length
• These can be used in the same way as the single-byte types. In most
cases, the choice is automatic (as we have seen).
• Note: The above applies to databases which have Unicode data types.
However, Unicode data is often stored in single-byte types, UTF-8
encoded.
• Most of the work will be understanding how to store Unicode in your
database – and converting the data (see your Database Manual).
External Interfaces: ⎕NA
• In Classic & previous editions, parameter type C meant
untranslated bytes and T meant ”text”, translated to ANSI.
• In Unicode, both are untranslated.
• T without a width specification now means ”wide characters
according to the host convention”
• Thus: T means T1 in Classic, T2 in Unicode for Windows, and
T4 under Unicode for Unix/Linux
• This means that the use of type T (<0T, >0T, =T) should be
portable across Classic/Unicode systems
• Some (typically Unix/Linux) system calls expect data to be UTF-8 encoded: You must use dyadic ⎕UCS to do the translation.
• Future extensions to ⎕NA may provide UTF-8 encoding.
Selection of A or W Functions
• Under Windows, Win32 library calls which handle text are
generally available in two variants:
– An ANSI (narrow) version with a name ending in A
– a Unicode (wide) version with a name ending in W
• For example, the function to display a message box is available
as MessageBoxA and MessageBoxW.
• If you specify the character * at the end of a name, this will be
replaced by A in Classic and W in the Unicode Edition.
• The intention is to allow you to write code which will work now
under Classic and continue to work under Unicode – to facilitate
parallel code testing and a controlled migration.
Portable ⎕NA Example
• The following function is portable between Classic and Unicode:
∇ ok←title MsgBox msg;MessageBox
  ⎕NA 'I user32|MessageBox* I <0T <0T I'
  ok←1=MessageBox 0 msg title 1 ⍝ 1=OK, 2=Cancel.
∇
• The function MessageBoxA will be selected by Classic,
MessageBoxW by Unicode.
• <0T will mean 1-byte (translated) text under Classic, and 2-byte
(untranslated) text under Unicode
– Strictly speaking, text should be translated to UTF-16 in Classic,
but this is only required for ”a few” special chars
APL Source in Unicode Files
• SALT (Simple APL Library Toolkit) supports storage of
functions, namespaces and classes in UTF-8 files with a .dyalog
extension.
• You can also very easily write your own storage mechanism
using Unicode text files. Under .Net it is trivial:
Save:
      System.IO.File.WriteAllText 'c:\temp\foo.txt' (⎕VR 'foo') System.Text.Encoding.UTF8
Load:
      ⎕FX System.IO.File.ReadAllText ⊂'c:\temp\foo.txt'
• Without .Net it requires a wee bit more work (as we have seen
earlier)
Source Code Management
• Storing APL source in Unicode text files may seem less
convenient to the seasoned APL programmer, but there are very
significant advantages:
• High quality tools (both free and ”commercial”) built for other
languages can be used to edit, compare, manage source, and
build systems – without further ado
• Not only does this make it easier to position APL as a tool for
”professional” software development, many of these tools are
actually useful (there are some smart people ”out there”)
• Young developers joining your APL team will already be familiar
with these tools and feel ”at home” more quickly
• The quality of life of the APL developer need not be sacrificed!
Demo of Source Code Mgt
Source Code Mgt Demo
• All tools shown here downloaded from internet, none
of them knew about APL in any way.
Demo: Working with MyApp
Keyboarding
• Discuss IME vs new Keyboards
• Demo new Console Unix/Linux APLs
Migration Check List
• Are you migrating in order to simplify and stay current, or
because you want to support ”foreign” text in your application?
– Probably, you should do the former first (or at least experiment with
it), before trying the latter
• For the former, you only need to make sure that your interfaces
to external systems (native files, databases etc) work the same
way as before
– You may need to add checks to prevent the inadvertent entry of Unicode
characters that your external interfaces cannot handle
• For the latter, you need to be sure that external systems ALSO
support Unicode, and how they want to exchange data with your
application
Think about ...
• (Dyadic) ⎕DR
• Native Files
– Need non-⎕AV/ANSI data
– Convert to UTF-8?
• Monadic ⍋ of char data
• APL style TCP Sockets
• Interop required with earlier versions?
• External Vars
• Mapped Files
• Own DLLs and APs
• Win32 or other system calls via ⎕NA
• Underscores(!)
• Switching to SALT / SubVersion?
Suggested Strategy
• Migrate to v12 Classic, write code which
works in both Classic & Unicode.
• Wait until entire user base upgraded to v12.
• Move application to Unicode Edition.
• Suggested timeframe for a large application
with many interfaces might be 2-4 years.
• Start thinking now!