The Lowdown on ASCII, ANSI, and Unicode
by Bob Zale
You must understand Unicode. At least the basics, anyway. Every
programmer needs to understand character sets: ASCII, ANSI, OEM, and
UNICODE. Without that understanding, you'll soon find yourself lost.
These are character sets used on our beloved computers, and they're
very important to programmers. Character sets exist to associate a
number (or even a multi-byte set of two or more numbers) with a
particular character: that could be a letter, a digit, punctuation,
or even a special symbol. All of these character sets provide the
same functionality, but they map many of the number-to-character
relationships very differently.
Exactly what do we mean by a multi-byte character? You may not have
encountered it before. That's common, but it's a fairly simple
concept. A byte can hold just 256 different values. However, there
are many more than 256 possible characters needed. So, in some cases,
it's been necessary to use multiple numbers, in a series, to represent
some of these characters. Just keep in mind that characters may need
1 byte, 2 bytes, or even more, to be accurately represented.
A good compiler offers you ANSI strings. This has been the standard
for many years. A better compiler lets you choose between ANSI and
UNICODE, but only one. If you want Unicode, you can't keep binary
values in a string. It simply won't work. A great compiler, like
PowerBASIC, supports all of them, in the same program, transparently.
One variable with ANSI. Another with UNICODE. Mix and match any
way you choose with PowerStrings. All the messy details, and even
the needed translations, are handled automatically by the compiler.
But more about that later.
ASCII characters
This is the original. The first character set to appear on personal
computers. It was very simple, as long as you only needed plain old
American English characters. No accents. No symbols. No drawing.
No international characters. We used the numbers from 32 to 126
to represent what was needed. A blank space was 32, the letter "A"
was 65, lower case "c" was 99. The entire set of characters could
be stored in 7 bits. Very convenient for an 8-bit computer. Numbers
below 32, plus 127 (delete), were reserved for control codes. They
were commonly known as non-printing, since they weren't associated
with a character. The number 13 was a carriage return, 10 a line
feed, 9 was a tab. While ASCII was certainly limited, it remains the
basis for all the other character sets to follow. ASCII characters
are unchanged in the other character sets.
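The code numbers quoted above are easy to check for yourself. Here is
a minimal sketch in Python (used only for illustration, it is not
PowerBASIC); the variable names are invented for the example.

```python
# ASCII code numbers quoted in the text, checked with Python's built-ins.
codes = {ch: ord(ch) for ch in (" ", "A", "c")}
print(codes)

# The control codes mentioned in the text: tab, line feed, carriage return.
controls = {
    "tab": ord("\t"),
    "line feed": ord("\n"),
    "carriage return": ord("\r"),
}
print(controls)

# Every ASCII code fits in 7 bits, so no encoded byte reaches 128.
ascii_bytes = "Hello, World!".encode("ascii")
print(max(ascii_bytes) < 128)
```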
OEM characters
It wasn't long before all the folks realized that bytes had one more
bit. The characters from 128 to 255 were a great temptation. IBM
added international characters, line drawing symbols, accents, and
even more. WordStar used them for formatting. Who knows what else.
Of course, when people outside the U.S. got involved, they needed to
support their own characters. The IBM set didn't work, as it was
just too small. There were lots of different ideas about what should
be supported, but every region of the planet knew that their scheme
was the best. We had lots of incompatible, even incomprehensible
text. These are OEM character sets. However, to this day, the
original IBM character set is universally known as the OEM character
set.
ANSI characters
After a time, some resolution was found in the ANSI standard. Pretty
much everyone agreed on the lower 128 characters... they remained the
same as ASCII. But the upper 128? That depended on your region of
the world. Every region had their own "code page", depending upon
the characters in their language. You couldn't use two code pages
simultaneously, so characters remained equally incompatible. One
character code above 127 might represent something totally different
in each of the dozens of code pages.
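To see how one code above 127 changes meaning from one code page to
the next, here is a small Python sketch (Python is used only for
illustration). The three code pages are arbitrary picks for the
example: cp437 is the original IBM PC set, cp1252 is Western European
Windows, and cp1251 is Cyrillic Windows.

```python
raw = bytes([0xE9])   # a single byte with a value above 127

# Decode the very same byte under three different code pages.
decoded = {cp: raw.decode(cp) for cp in ("cp437", "cp1252", "cp1251")}

for code_page, character in decoded.items():
    # Show the resulting character and its Unicode code number.
    print(code_page, character, hex(ord(character)))
```

Same byte, three unrelated characters: a Greek theta, an accented
"e", and a Cyrillic letter.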
Of course, that wasn't the only issue. Some Asian alphabets have
thousands of characters. That would never work in an 8-bit byte.
So, the "multi-byte" character was born. Identify a character by a
specific set of 2, 3, or even more bytes. While this gave us much
more capacity, it was difficult to use for the programmer. It was
easy to move forward, character by character, while scanning a
string. But what if you had to back up? Was that preceding code
a one-byte character? Or was it part of a multi-byte set of codes?
All very confusing.
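A concrete case of that confusion, sketched in Python (used only for
illustration). The example assumes Windows code page 932, the Japanese
multi-byte code page, which Python calls "cp932". The character chosen
is simply one whose second byte happens to collide with an ASCII code.

```python
text = "\u30bd"                   # the katakana character "so"
encoded = text.encode("cp932")    # two bytes in code page 932
print([hex(b) for b in encoded])

# The second (trail) byte has the same value as an ASCII backslash.
backslash = "\\".encode("cp932")
trail_looks_like_backslash = encoded[-1:] == backslash
print(trail_looks_like_backslash)
```

Looking at that last byte by itself, you cannot tell a backslash from
the tail end of a two-byte character. You have to know what came
before it.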
UNICODE characters
The creation of Unicode was an exhaustive effort to build a single
character set which could represent every character in any language
with a unique code number. This was sorely needed in this age of
the Internet to avoid mass confusion. Although not widely known,
there are several forms of Unicode. A very common form is known as
UTF-16, because it is built from 16-bit units. Each unit is a 2-byte
value in the range of 0 to 65535. This form is used by PowerBASIC,
Microsoft, and other compiler publishers. Most characters of most
languages can be defined with a single unit, so it's a very
convenient form for us to use. The rarer characters beyond that range
are stored as two units in a row (a "surrogate pair"), using reserved
values that never stand for a character on their own. It's a great
tool for programmers, because every code number represents just one
unique character. Ambiguity is over.
Just as before, the lowest 128 values are inherited directly from
ASCII. They're just extended to a 2-byte representation. The
letter "Z" in ASCII is represented by the byte &H5A. In Unicode,
it's represented by the word &H005A. Very straightforward.
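The "Z" example can be reproduced in a few lines of Python (used only
for illustration). The sketch assumes the low-byte-first order that is
described in the next paragraph, which Python names "utf-16-le".

```python
z_ascii = "Z".encode("ascii")        # one byte
z_utf16 = "Z".encode("utf-16-le")    # one 16-bit word, low byte first

print(z_ascii.hex())                 # the single ASCII byte
print(z_utf16.hex())                 # the two bytes as stored in memory

# Reassemble the two stored bytes into the 16-bit word value.
word = int.from_bytes(z_utf16, "little")
word_text = f"{word:04X}"
print(word_text)
```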
When a file is created with Unicode characters, you'll find that it
is sometimes identified by a "Byte-Order Mark" (BOM). If the first
two bytes are &HFF, then &HFE, the file format is Unicode with
"Little Endian" encoding. The low-order byte of each word precedes
the high-order byte. This is the format used by all Intel CPUs,
PowerBASIC, and the vast majority of other sources. However, if
you encounter a Byte-Order Mark of &HFE, then &HFF, the encoding
method is "Big Endian", and the byte order is reversed. It would
be nice if every Unicode text file contained a Byte-Order Mark, but
that's just not the case. Don't ever count on its presence.
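Checking for a Byte-Order Mark takes only a few lines. This is a
minimal sketch in Python (used only for illustration); the function
name sniff_utf16_bom is made up for the example, and it looks only for
the two UTF-16 marks described above, nothing else.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None when no UTF-16 BOM is found."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF, then FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE, then FF
        return "utf-16-be"
    return None                                # no mark: the caller must guess

little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
plain = "Hi".encode("utf-16-le")               # same text, but no BOM at all

print(sniff_utf16_bom(little), sniff_utf16_bom(big), sniff_utf16_bom(plain))
```

The third case is the one to plan for: perfectly good Unicode text
with no mark at all.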
As you probably guessed, 65536 values aren't enough for every
character ever written. Unicode defines code numbers all the way up
to &H10FFFF, well over a million of them. For that reason, there's
also a UTF-32 form, with each character defined as a 4-byte DWord.
While this form holds any character in a single value, it's also
very wasteful. UTF-32 is rarely used, and merits just a passing
mention at this time.
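How wasteful? A quick Python sketch (used only for illustration)
compares the storage needed for the same five-letter word in three
forms. The codec names are Python's own; the "-le" variants are chosen
so that no Byte-Order Mark is added to the count.

```python
text = "Hello"

# Bytes needed for the same text in each form.
sizes = {enc: len(text.encode(enc)) for enc in ("ascii", "utf-16-le", "utf-32-le")}
print(sizes)
```

For plain English text, UTF-32 takes four times the space of ASCII,
and three of every four bytes are zero.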
UTF-8 UNICODE characters
And then came the Internet. Massive amounts of text, with great
pressure to present a web page quickly. All those extra zeros on
every UTF-16 character were called a huge waste of bandwidth and
time, too. How can we speed it all up? By making a compromise
between ANSI and UNICODE. UTF-8 is an all out effort to minimize
the size of the text which must be served up on a web page. In
that context, text size is the #1 issue. In UTF-8, each character
from 0 to 127 is stored as a single byte. Characters encoded above
127 are stored as a set of 2 to 4 bytes (the original design allowed
up to 6). The lower the character code, the smaller the byte count,
so the most used characters tend to be the shortest.
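The variable byte count is easy to observe. In this Python sketch
(used only for illustration), the four sample characters are arbitrary
picks: a plain letter, an accented letter, the euro sign, and an emoji.

```python
# Written as escapes so the source file itself stays plain ASCII.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of UTF-8 bytes needed for each sample character.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)
```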
Just as before, UTF-8 also inherits the ASCII values. If you're
working in American English only, you won't even notice a change.
The inventors made some other nice changes as well. For multi-byte
characters, they used unique values for the lead-in byte, and
other unique values for the following bytes. This allows you to
step through a string, in either direction, with absolutely no
ambiguity. A huge improvement in the overall scheme of things.
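Here is how backing up through UTF-8 can work, as a minimal Python
sketch (used only for illustration). It relies on one fact: every
following byte has the bit pattern 10xxxxxx, and no lead-in byte does.
The function names are invented for the example, and the sketch
assumes the data is valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 following bytes, which always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def previous_char_start(data: bytes, index: int) -> int:
    """Return the position where the character just before `index` begins."""
    index -= 1
    while index > 0 and is_continuation(data[index]):
        index -= 1          # still inside a character: keep backing up
    return index

# One 1-byte, one 2-byte, and one 3-byte character, six bytes in all.
data = "a\u00e9\u20ac".encode("utf-8")

# Walk the string backward, recording where each character starts.
pos = len(data)
starts = []
while pos > 0:
    pos = previous_char_start(data, pos)
    starts.append(pos)
print(starts)
```

No table lookups and no scanning from the front of the string. One
bit test per byte is enough.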
While UTF-8 is most closely tied to the Internet, PowerBASIC
includes simple, easy-to-use functions for quick translation of
UTF-8 Unicode to and from every other character set. A big boost
for your Internet-aware applications.
PowerStrings make it so simple
The new PowerBASIC offers PowerStrings. They actually show signs
of intrinsic intelligence. They know the form of the characters
they hold. They know if they're Unicode... They know if they're
ANSI... They know if they're OEM. And they act accordingly. If
a conversion is needed, it's all automatic. Totally automatic.
Concatenate ANSI with UNICODE? Sure! It's all automatic and
totally transparent. For example, suppose you have an ANSI string
as a$, and a Unicode string as u$$. You wish to concatenate them,
storing the result as b$$. It's easy. No different than before.
b$$ = a$ + u$$
PowerBASIC automatically converts a$ to Unicode format, appends
u$$ to it, then stores the result in the variable b$$. It's just
that simple. But is it fast? Of course! As always, PowerBASIC
leads the way in performance. It's very special. Just try the
execution of an INSTR() function against any other. Unicode or
ANSI. As with every time sensitive function, PowerBASIC keeps
two versions handy. One for OEM or ANSI, another for Unicode.
Each of them is built with explicit, hand crafted machine code.
When it's time to create your EXE, PowerBASIC includes only the
one which best suits your code.
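For readers curious about what the automatic conversion in the
b$$ = a$ + u$$ example involves, here are the same steps spelled out
by hand in Python (used only for illustration). This is not
PowerBASIC's implementation, just a sketch of the idea. It assumes the
ANSI text uses code page 1252; the variable names are invented.

```python
# Stands in for the ANSI string a$: raw bytes in an assumed code page.
ansi_bytes = "caf\u00e9 ".encode("cp1252")

# Stands in for the Unicode string u$$.
unicode_text = "\u4e16\u754c"

# Step 1: convert the ANSI bytes to Unicode. Step 2: append.
combined = ansi_bytes.decode("cp1252") + unicode_text
print(len(ansi_bytes), len(unicode_text), len(combined))
```

The explicit decode step in the middle is the part the text describes
PowerBASIC as doing for you.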
The POWER changes everything
Unicode is important today. It will be pervasive tomorrow.
Don't be left behind, or it may not be possible to catch up.
PowerBASIC makes Unicode easy, so you typically don't need to
give it much thought at all. Use that fact to your advantage...
then spend the time you saved for other important issues. If
you don't prepare today, you could face real problems later on.
Plan now, or forever hold your peace.