KOI8-U

KOI8-U is an 8-bit character encoding designed to cover Ukrainian, which uses the Cyrillic alphabet. It is based on KOI8-R, which covers Russian and Bulgarian, but replaces eight box-drawing characters with the four Ukrainian letters Ґ, Є, І, and Ї, each in upper and lower case.

KOI8 remains much more commonly used than ISO 8859-5, which never gained wide adoption. Another common Cyrillic character encoding is Windows-1251. Both may eventually give way to Unicode.

In Russian, KOI8 stands for Kod Obmena Informatsiey, 8 bit (Код Обмена Информацией, 8 бит), which means "Code for Information Exchange, 8 bit".

The KOI8 character sets have the property that the Russian Cyrillic letters are in pseudo-Roman order rather than the natural Cyrillic alphabetical order as in ISO 8859-5. Although this may seem unnatural, it has the useful property that if the 8th bit is stripped, the text can still be read (or at least deciphered) in case-reversed transliteration on an ordinary ASCII terminal. For instance, "Русский Текст" in KOI8-U becomes rUSSKIJ tEKST ("Russian Text") if the 8th bit is stripped.
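The bit-stripping property can be checked with a short Python sketch. It assumes Python's built-in `koi8_u` codec; the helper name `strip_high_bit` is made up for this illustration.

```python
def strip_high_bit(text: str, encoding: str = "koi8_u") -> str:
    """Encode text, clear the 8th bit of every byte, and read the result as ASCII."""
    raw = text.encode(encoding)
    stripped = bytes(b & 0x7F for b in raw)  # keep only the low 7 bits
    return stripped.decode("ascii")


print(strip_high_bit("Русский Текст"))  # rUSSKIJ tEKST
```

The property holds for the letters in rows C through F of the code page. The Ukrainian-specific letters sit in rows A and B, so they fall onto ASCII punctuation instead: є at 0xA4, for example, becomes 0x24 ($).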

Codepage layout

KOI8-U
	x0	x1	x2	x3	x4	x5	x6	x7	x8	x9	xA	xB	xC	xD	xE	xF
0x	unused
1x	unused
2x	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4x	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5x	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
6x	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7x	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~
8x	─	│	┌	┐	└	┘	├	┤	┬	┴	┼	▀	▄	█	▌	▐
9x	░	▒	▓	⌠	■	∙	√	≈	≤	≥	NBSP	⌡	°	²	·	÷
Ax	═	║	╒	ё	є	╔	і	ї	╗	╘	╙	╚	╛	ґ	╝	╞
Bx	╟	╠	╡	Ё	Є	╣	І	Ї	╦	╧	╨	╩	╪	Ґ	╬	©
Cx	ю	а	б	ц	д	е	ф	г	х	и	й	к	л	м	н	о
Dx	п	я	р	с	т	у	ж	в	ь	ы	з	ш	э	щ	ч	ъ
Ex	Ю	А	Б	Ц	Д	Е	Ф	Г	Х	И	Й	К	Л	М	Н	О
Fx	П	Я	Р	С	Т	У	Ж	В	Ь	Ы	З	Ш	Э	Щ	Ч	Ъ

In the table above, 0x20 is the regular SPACE character and 0x9A is the NO-BREAK SPACE.

KOI8-U differs from KOI8-R in eight positions: 0xA4, 0xA6, 0xA7, and 0xAD (lower case), and 0xB4, 0xB6, 0xB7, and 0xBD (upper case). These positions hold the Ukrainian letters that do not exist in Russian; in KOI8-R they hold box-drawing characters.
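As a rough check, the differing positions can be found by decoding every byte value with both codecs. This sketch assumes the `koi8_r` and `koi8_u` codecs bundled with Python; the function name `differing_bytes` is hypothetical, and the result reflects Python's codec tables rather than any one specification.

```python
def differing_bytes(enc_a: str = "koi8_r", enc_b: str = "koi8_u") -> list[int]:
    """Return the byte values that decode to different characters in the two encodings."""
    return [
        b for b in range(0x100)
        if bytes([b]).decode(enc_a) != bytes([b]).decode(enc_b)
    ]


for b in differing_bytes():
    r = bytes([b]).decode("koi8_r")
    u = bytes([b]).decode("koi8_u")
    # Print code points only, so the output is plain ASCII.
    print(f"0x{b:02X}: KOI8-R U+{ord(r):04X}, KOI8-U U+{ord(u):04X}")
```

With Python's tables this should list the eight positions named above, pairing a box-drawing code point in KOI8-R with a Cyrillic one in KOI8-U.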

Although RFC 2319 says that character 0x95 should be U+2219 (∙), it may also be mapped to U+2022 (•) to match the bullet character in Windows-1251.

Some references have a typo and incorrectly state that character 0xB4 is U+0403 rather than the correct U+0404. This typo is present in Appendix A of RFC 2319; the table in the main text of the RFC gives the correct mapping.
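One way to see how a given implementation resolves these two positions is to decode the bytes and look up the resulting code points. The sketch below uses Python's `koi8_u` codec and the standard `unicodedata` module; the helper `describe` is a made-up name, and the output shows only what that one implementation does.

```python
import unicodedata


def describe(byte: int, encoding: str = "koi8_u") -> tuple[int, str]:
    """Decode a single byte and return its code point and Unicode character name."""
    ch = bytes([byte]).decode(encoding)
    return ord(ch), unicodedata.name(ch)


for byte in (0x95, 0xB4):
    code_point, name = describe(byte)
    print(f"0x{byte:02X} -> U+{code_point:04X} {name}")
```

For 0xB4 the expected answer is U+0404, CYRILLIC CAPITAL LETTER UKRAINIAN IE; U+0403 is the unrelated letter GJE.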