DjVu

From Wikipedia, the free encyclopedia


DjVu
File name extension	`.djvu, .djv`
Internet media type	`image/vnd.djvu`
Type code	DJVU
Developed by	AT&T Research
Type of format	Image file formats

DjVu (pronounced déjà vu) is a computer file format designed primarily to store scanned images, especially those containing text and line drawings. It uses technologies such as separation of the image into text and background/image layers, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as an alternative to PDF, as it produces smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 KB, black-and-white technical papers compress to 15–40 KB, and ancient manuscripts compress to around 100 KB; all of these are significantly smaller than the typical 500 KB required for a satisfactory JPEG image. Like PDF, DjVu can contain an OCRed text layer, making it easy to perform cut-and-paste and text search operations.
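The figures quoted above can be turned into size ratios with a few lines of arithmetic. The following is only an illustrative sketch: the dictionary keys and the helper name `size_ratio` are made up for this example, and the numbers are the developers' reported figures, not independent measurements.

```python
# Typical size quoted above for a satisfactory JPEG page, in KB.
JPEG_KB = 500

# (low, high) DjVu page sizes in KB, as reported by the DjVu developers.
djvu_sizes_kb = {
    "color magazine page": (40, 70),
    "black-and-white technical paper": (15, 40),
    "ancient manuscript": (100, 100),
}


def size_ratio(reference_kb: float, compressed_kb: float) -> float:
    """Return how many times smaller the compressed page is."""
    return reference_kb / compressed_kb


# For each page type: (smallest ratio, largest ratio) relative to JPEG.
ratios = {
    kind: (size_ratio(JPEG_KB, high), size_ratio(JPEG_KB, low))
    for kind, (low, high) in djvu_sizes_kb.items()
}

for kind, (worst, best) in ratios.items():
    print(f"{kind}: {worst:.1f}x to {best:.1f}x smaller than JPEG")
```

Under these reported figures, the DjVu pages come out between 5 and roughly 33 times smaller than the JPEG reference.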

1 History
2 Comparison with PDF
- 2.1 Relative advantages
3 Other compression methods
4 External links

History

The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Laboratories in 1996. DjVu is a free file format: both the file format specification and the source code for the reference library are published. The ownership rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T and LizardTech. The original authors maintain a GPLed implementation named "DjVuLibre".

DjVu divides a single image into several different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bitonal image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2).

The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page where it occurs.
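The shape-dictionary idea behind JB2 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the JB2 algorithm itself: the names `pixel_difference` and `encode_page`, the `tolerance` parameter, and the representation of bitmaps as tuples of '0'/'1' strings are all hypothetical, and the real encoder additionally segments the page into shapes and compresses the dictionary and positions with arithmetic coding.

```python
from typing import List, Tuple

Bitmap = Tuple[str, ...]  # rows of '0'/'1' characters, all the same width


def pixel_difference(a: Bitmap, b: Bitmap) -> int:
    """Count differing pixels; return -1 if the bitmaps differ in size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return -1
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode_page(
    glyphs: List[Tuple[int, int, Bitmap]], tolerance: int = 1
) -> Tuple[List[Bitmap], List[Tuple[int, int, int]]]:
    """Split a page into a shape dictionary and a list of placements.

    glyphs: (x, y, bitmap) for each shape found on the page.
    Returns (dictionary, placements); each placement is (shape_index, x, y).
    """
    dictionary: List[Bitmap] = []
    placements: List[Tuple[int, int, int]] = []
    for x, y, bitmap in glyphs:
        for index, shape in enumerate(dictionary):
            diff = pixel_difference(shape, bitmap)
            if 0 <= diff <= tolerance:
                break  # reuse the nearly identical stored shape
        else:
            # No stored shape is close enough: store this one as new.
            dictionary.append(bitmap)
            index = len(dictionary) - 1
        placements.append((index, x, y))
    return dictionary, placements


# Three occurrences of an "e"-like shape (one with a noisy pixel) and one "T".
E = ("0110", "1001", "1111", "1000", "0111")
E_NOISY = ("0110", "1001", "1111", "1000", "0110")
T = ("1111", "0110", "0110", "0110", "0110")

page = [(10, 20, E), (30, 20, T), (50, 20, E_NOISY), (70, 20, E)]
dictionary, placements = encode_page(page)
print(len(dictionary), "unique shapes for", len(placements), "placements")
```

In this sketch the four shapes on the page collapse to two dictionary entries, which is the source of the savings: the noisy "e" is replaced by the stored one, which is also why this kind of matching is lossy.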

In 2002 the DjVu file format was chosen by the Internet Archive as the format in which its Million Book Project provides scanned public domain books online (along with TIFF and PDF).

The DjVu format will be used by the One Laptop per Child project to supply existing paper books easily in an eBook format. The advantages of DjVu are that it is highly compressed and that it does not require any font support. [1]

Comparison with PDF

The primary difference between DjVu and PDF is that DjVu is a raster format, whereas PDF is primarily a vector format. This difference between the two formats has several consequences:

- The maximum resolution of a DjVu file must be specified when the file is created. On the other hand, a vector image represented by a PDF file can usually be magnified to an arbitrary resolution without loss of quality. Even when read at normal resolution, vector graphics allow the rendering program to apply readability-enhancement tools such as Adobe CoolType, which dramatically improve the ease of on-screen reading.
- DjVu files render characters as images, without using fonts. PDF files usually render characters using fonts. Many PDF files do not embed the full representation of the necessary fonts, but simply specify their names and properties. The PDF viewer uses the exact same font if it is available. Otherwise, it transforms an available font to compute an approximation with the metrics of the desired font.

All this suggests that in the long run vector graphics will become the format of choice for the production of text documents by typesetting. On the other hand, for scanned media the following two options exist:

- One may use an optical character recognition engine to replace the raster representation of the scanned image with a vector font. Naturally, this works well only when the medium is composed primarily of text, and even then one has to deal with the errors inherent in such a substitution.
- Another option is to store the scanned image of the medium as is. Both PDF and DjVu allow for efficient storage of this type, with some differences which are described below.

Roughly speaking, printed media content is a mix of text and graphics. To store the various types of scanned media, both PDF and DjVu employ several codecs. The simplest (and least efficient) way of storing scanned media is to treat both graphics and text as graphics. Historically, this was the first way scanned media was stored in PDF: for color and gray images the JPEG codec was used, while for bitonal (black-and-white) images one of the fax codecs was used, most notably CCITT Group 3 and Group 4. As a result, a typical PDF file size was several hundred kilobytes per page. It was around this time that DjVu was proposed. This new file format essentially combined two new codecs with a very simple file structure:

- The first of the two new codecs was called IW44 (also referred to as C44) and was a drastic improvement over JPEG: it used wavelets and achieved better size/quality ratios by a factor of 2 or more. In response, Adobe later included another wavelet-based image codec, JPEG 2000, in its PDF 1.5 specification.
- The other codec was JB2. It achieved a size of about 10 KB per page at 300 dpi for scanned bitonal text images. Again, Adobe followed by introducing the similar JBIG2 codec in the PDF 1.4 specification.
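To put the 10 KB per page figure in perspective, the sketch below estimates the uncompressed size of a bitonal page at 300 dpi. The page size (US Letter, 8.5 × 11 inches) is an assumption made for this example and is not stated in the text; the function name `raw_bitonal_bytes` is likewise illustrative.

```python
import math


def raw_bitonal_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a 1-bit-per-pixel page, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px / 8)
    return bytes_per_row * height_px


# Assumption: a US Letter page (8.5 x 11 in) scanned at 300 dpi.
raw = raw_bitonal_bytes(8.5, 11, 300)

# The JB2 figure quoted above, taking 1 KB as 1024 bytes.
jb2_bytes = 10 * 1024

ratio = raw / jb2_bytes
print(f"raw: {raw} bytes, JB2: {jb2_bytes} bytes, ratio about {ratio:.0f}:1")
```

Under these assumptions the raw bitmap is a little over 1 MB, so a 10 KB JB2 page corresponds to a compression ratio on the order of 100:1.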

Although Adobe Reader 5.0 was able to render JBIG2-encoded images, the encoder only appeared in Adobe Acrobat 6.0. This, along with other factors, led to the establishment of DjVu as the format of choice for storing scanned documents.

At present, both PDF and DjVu have a similar arsenal for representing highly compressed images. Moreover, the codecs used are essentially the same. The difference for the end user thus comes from the differences between encoders. If one compares the JBIG2 encoder in Adobe Acrobat (in lossy mode) with the on-line service at [2], the general conclusion is that the DjVu file will be smaller, while the PDF file will be of higher quality (that is, more accurate).

Both formats define features that do not address the representation of the document's appearance but aim at creating a document delivery platform. Both DjVu files and PDF files can be enriched with text, a table of contents, hyperlinks, and metadata. PDF goes further by allowing sounds, interactive forms, and JavaScript programs. DjVu defines a protocol to transfer document pages on demand over the Internet. DjVu does not specify a way to certify the authenticity of a document or to define Digital Rights Management policies.

Relative advantages

With PDF documents one can zoom in on vector-based content to an arbitrary depth or print them at an arbitrarily high resolution without introducing quality loss or jaggedness inherent to raster formats. But if a PDF is simply used as a container for non-vector images (such as scans), those images will not gain anything. Also, a vector format can always be converted to a raster format, usually with irrevocable data loss, but the other direction is very difficult.

PDF is most useful when the original source is an electronic document such as a Microsoft Word doc or TeX file. Such documents benefit most from the vector graphics technology that underlies PDF. DjVu files can be marginally smaller but only deliver a high resolution image, possibly enriched with the associated text.

DjVu is very good for image files, and has been optimized especially for scanned text and images. However, PDF could be better if the scanned raster images can be transformed into high quality vector graphics, for instance by applying optical character recognition to the scanned image, identifying the fonts, and carefully proofreading the resulting file. This procedure often costs too much time. Suitable fonts might not be available, or one may want to preserve the original document more exactly, including signatures, marginal comments, paper texture, or other markings. In such cases, DjVu is the better choice.

Other compression methods

At present, the most advanced method for compressing scanned bitonal documents seems to be Cartesian Perceptual Compression. Its size/quality ratios are unmatched by either DjVu or PDF. However, this compression format enjoys limited popularity because it is a closed file format and codec, protected by a US patent.