openreader-devel · OpenReader Development
Re: [ebook-community] Orca design document
PDF as Container - some thoughts

At 12:33 PM 11/10/2005, Jon Noring wrote:
Btw, Leonard, can you give a brief summary as to the structure of the
PDF wrapper, at least how it compares with ZIP and a MIME-based
approach? In some ways, the PDF format had to consider issues similar
to what OR has to consider, such as user agent application access and
rendering of the internal content, including over networks.

Here is a VERY quick synopsis of my "PDF Internals" class that I offer to customers.

File Format
-----------
PDF files consist of four parts - Header, Body, XRef and Trailer (in that order).
Header - identifies the file as being a PDF, and offers a CLUE to the version
Body - the "objects"
XRef - the "object index", where to find each indirect object in the PDF
Trailer - references to key "objects" in the PDF (root/Catalog, Encryption, etc.)

The normal way to process a PDF is to read the header, then the trailer, then the XRef and then, based on the information in the Trailer and the XRef, read each object from the Body "on demand".  This requires that the PDF data be available in a format that enables random access AND that the end of the file be accessible (since you need the trailer & xref!).  This is just fine for local file-based viewing, but (as you might imagine) SUCKS for web-based viewing, since it means that the ENTIRE PDF must be downloaded before anything can be displayed.

To solve this, Adobe introduced (in Acrobat 3, PDF 1.2) a feature called Linearization ("Fast Web View" in the Acrobat UI) in conjunction with proposing an addition (called Byte Serving) to the then-current 1.0 version of the http protocol (which was accepted and is now standard in http/1.1).  Linearization enables a special XRef and Trailer (and all the objects for Page 1) to be placed at the beginning of the document so that it can be read & displayed quickly.  So you get "immediate gratification" - and while you are enjoying that, Acrobat/Reader is doing "speculative downloading" of the other pages in the background.
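
To make the non-linearized read order concrete (header, then trailer, then XRef, then objects "on demand"), here is a minimal Python sketch. It is an illustration only, not how Acrobat or any particular library does it: the helper names build_minimal_pdf and read_pdf are made up for this example, the file has a single classic XRef section with no encryption or incremental updates, and the regular expressions are far too naive for real-world PDFs.

```python
import re

def build_minimal_pdf() -> bytes:
    """Assemble a tiny file in Header / Body / XRef / Trailer order (hypothetical helper)."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                    # Header
    offsets = []
    for num, body in enumerate(objects, start=1):     # Body
        offsets.append(len(out))                      # remember where each object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)       # XRef
    out += b"0000000000 65535 f \n"                   # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos     # pointer back to the XRef
    return bytes(out)

def read_pdf(data: bytes):
    """Read header, trailer and XRef; return (version, root object number, loader)."""
    # 1. Header: only a clue to the version.
    version = data[:data.index(b"\n")][len(b"%PDF-"):].decode("ascii")
    # 2. Trailer: found by looking at the END of the file.
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    xref_pos = int(match.group(1))
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[xref_pos:])
    root_num = int(root.group(1))
    # 3. XRef: object number -> byte offset.
    lines = data[xref_pos:].split(b"\n")
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":                              # "n" marks an in-use object
            table[first + i] = int(offset)

    # 4. Objects are fetched only when asked for, via random access.
    def load(num: int) -> bytes:
        start = table[num]
        end = data.index(b"endobj", start)
        return data[start:end].split(b"obj\n", 1)[1].strip()

    return version, root_num, load
```

Usage: `version, root_num, load = read_pdf(build_minimal_pdf())`, then `load(root_num)` returns the Catalog dictionary. Note that step 2 needs the last bytes of the file before anything else can happen, which is exactly the problem for web-based viewing described above.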

"Objects" in a PDF come in a variety of types (9) - including simple scalar types (eg. Integer, Float, String), collection types (Dictionary and Array) and the Stream (for large data blocks).  These base types are combined in pre-defined combinations (according to the PDF Reference) to make "higher order objects" such as Pages, Bookmarks, Annotations, etc.  "Objects" can reference other objects - thus enabling a fine grained sharing/reuse mechanism.

A PDF producer isn't limited to the pre-defined "higher level objects" - you can always define your own (but not your own 'core types') AND you can also extend the pre-defined ones as well.  For example, many PDF tools vendors add extra stuff to PDFs that they've processed so that what took place can be recovered later for "tracking purposes".   This enables one to use PDF for all sorts of uses that Adobe never envisioned...
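
As an illustration of how the base types combine into "higher order objects", the following Python sketch maps Python values onto PDF syntax. The Name and Ref classes and the to_pdf function are hypothetical helpers invented for this example; Streams are left out, and the string escaping only covers backslashes and parentheses.

```python
class Name(str):
    """A PDF name such as /Type (hypothetical helper class)."""

class Ref:
    """An indirect reference to another object, e.g. '2 0 R' (hypothetical helper class)."""
    def __init__(self, num: int, gen: int = 0):
        self.num, self.gen = num, gen

def to_pdf(value) -> str:
    """Serialize a Python value using the PDF base-type syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):                 # must be tested before int
        return "true" if value else "false"
    if isinstance(value, Name):                 # must be tested before str
        return "/" + value
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):                # PDF reals have no exponent notation
        return f"{value:.4f}".rstrip("0").rstrip(".")
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    if isinstance(value, (list, tuple)):        # Array
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):                 # Dictionary: keys are always names
        items = " ".join(f"{to_pdf(Name(k))} {to_pdf(v)}" for k, v in value.items())
        return f"<< {items} >>"
    raise TypeError(f"no PDF base type for {type(value).__name__}")

# A "higher order" Page object is just a Dictionary with pre-defined keys.
# The /XX_TrackingInfo key is a made-up example of a vendor-specific extension.
page = {
    "Type": Name("Page"),
    "Parent": Ref(2),
    "MediaBox": [0, 0, 612, 792],
    "XX_TrackingInfo": "processed by (some tool)",
}
```

Calling `to_pdf(page)` yields a dictionary whose /Parent entry points at another object by reference, which is the sharing/reuse mechanism, while the extra private key rides along without disturbing the pre-defined ones.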

Compression/Encryption
----------------------
Compression, historically, was only allowed for data contained in a Stream object (page content, images, etc.) - but Acrobat 6 (PDF 1.5) introduced compression of any collection of "objects" - thus enabling even smaller PDF documents (but breaking backwards compatibility for the first time since Acrobat 2).  Every major (and a few minor) lossless and lossy compression method is supported, including Flate (ZIP), JPEG, CCITT G3/G4, JBIG2, and JPEG2000.
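
For the classic case (compression of the data inside a single Stream object), a rough Python sketch using the standard zlib module could look like this. The function names are invented for the example, only the Flate filter is shown, and the parsing assumes /Length is a direct integer rather than an indirect reference.

```python
import re
import zlib

def make_flate_stream(payload: bytes) -> bytes:
    """Wrap payload in a Stream object body compressed with the Flate filter."""
    compressed = zlib.compress(payload)
    head = b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    return head + compressed + b"\nendstream"

def read_flate_stream(obj: bytes) -> bytes:
    """Recover the original payload from a body built by make_flate_stream."""
    head, rest = obj.split(b"stream\n", 1)
    # /Length says how many bytes of compressed data follow the "stream" keyword.
    length = int(re.search(rb"/Length\s+(\d+)", head).group(1))
    return zlib.decompress(rest[:length])
```

Page content compresses well because the same operators repeat over and over; image data would normally use one of the image-specific filters (JPEG, CCITT, JBIG2, JPEG2000) instead.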

Encryption is supported in both symmetric (RC4 & AES) and asymmetric (RSA) varieties and can either be over all the objects or just a subset thereof (eg. encrypted PDF with plaintext metadata).  It is usually accompanied by a set of "digital rights".

Digital Signatures are supported using standard RSA/X.509 technologies for either the entire PDF, one or more "byte ranges" within the document, or a collection of objects.  DigSigs may be visible on the page(s) or invisible and may also include a set of "modification rights" which define what operations are safe to perform on the document w/o the signature being invalidated.
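
The "byte ranges" idea can be sketched in a few lines of Python: the digest is computed over the listed ranges only, so the gap between them (where the signature value itself is stored) does not affect the result. This shows just the hashing step under simplifying assumptions; the function name is made up, SHA-256 is chosen for the example rather than taken from the text, and the RSA/X.509 signing of the digest is not shown.

```python
import hashlib

def digest_byte_ranges(data: bytes, byte_range: list) -> str:
    """Hash only the bytes covered by [offset, length, offset, length, ...]."""
    if len(byte_range) % 2:
        raise ValueError("byte_range must hold offset/length pairs")
    h = hashlib.sha256()                      # assumption: digest algorithm picked for the sketch
    for i in range(0, len(byte_range), 2):
        offset, length = byte_range[i], byte_range[i + 1]
        h.update(data[offset:offset + length])
    return h.hexdigest()

# Toy "document": 4 signed bytes, a 9-byte hole for the signature, 4 more signed bytes.
document = b"AAAA" + b"<sig-one>" + b"BBBB"
ranges = [0, 4, 13, 4]
```

With this layout, rewriting the 9 bytes in the hole leaves the digest unchanged, while touching any byte inside either range changes it, which is what lets a viewer detect modification of the signed portion.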


Content
-------
Each page of a PDF can have a series of "objects" associated with it that define what gets drawn on the page, using a syntax that is derived from (though different than!) Postscript.  It is all about explicit placement of these objects on the page at defined coordinates.   These content "objects" (which are different than the Body "objects") can be categorized into Text, Paths (vectors) and (Raster) Images.   There is also a mechanism for reusable/shared content (eg. company logo, slide background).
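
To give a feel for the three content categories and the explicit placement, here is a small Python sketch that emits page content operators as text. The helper names are invented for the example, the resource names F1 and Logo are assumed to be defined in the page's resources, and the text helper assumes the string contains no parentheses or backslashes.

```python
def text_op(x: int, y: int, font: str, size: int, text: str) -> str:
    """Text: select a font, move to (x, y), show the string."""
    return f"BT /{font} {size} Tf {x} {y} Td ({text}) Tj ET"

def rect_op(x: int, y: int, w: int, h: int) -> str:
    """Path: append a rectangle and stroke it."""
    return f"{x} {y} {w} {h} re S"

def xobject_op(name: str, x: int, y: int, w: int, h: int) -> str:
    """Image or other shared content: scale/translate, then paint the named XObject."""
    return f"q {w} 0 0 {h} {x} {y} cm /{name} Do Q"

# One page's content: a heading, a box, and a logo placed at fixed coordinates.
content = "\n".join([
    text_op(72, 720, "F1", 12, "Hello"),
    rect_op(72, 600, 200, 100),
    xobject_op("Logo", 400, 700, 96, 48),
])
```

Because the logo is painted by name, the same XObject can be referenced from every page while being stored only once in the Body, which is the reuse mechanism mentioned above.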

Content can also be associated with "structure/tagging", which then applies semantic representations on the visual content.  "This is a paragraph", "This is a Table Header", "This is a Figure", etc.   This structure is useful for many things including enabling reflow, providing accessibility to the content, etc.

There is also a set of body "objects" that may appear (visually) on a page, but live on a special "layer" above standard content - this includes hyperlinks, markup & comment annotations, multimedia and Forms.

PDF also has standard "objects" for embedded files, sounds, movies/animations, etc.



So what does all this mean?!?
-----------------------------
It means that PDF is an excellent container for a variety of content, since it has native "objects" for handling the myriad of content that could be found in the documents of today - be they eBooks, journal articles, print publishing, etc.

It has pre-defined support for high quality text, raster and vector rendering along with markup & comment annotations.  It has support for rich multimedia in a variety of formats.  It supports structure/tagging to enable accessibility and even reflow.

It is also an open international standard with a VERY well deployed free reader already in existence and is the basis for other standards such as PDF/A.   There are NUMEROUS open source (and commercial) libraries available for reading, writing and manipulating PDF documents in your choice of programming language (C/C++, .NET, Java, Perl, Python, Ruby, etc.) on your favorite OS platform(s).



How's that, Jon?


Leonard
P.S. Although I think that PDF would make an EXCELLENT "container" for OEP/OR, there are certainly valid arguments for the ZIP+XML solution as well.

---------------------------------------------------------------------------
Leonard Rosenthol                            < mailto:leonardr@...>
Chief Technical Officer                      < http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                             215-938-0880 (fax)


Thu Nov 10, 2005 10:02 pm

pdfsage