PDF File Analyzer With C# Parsing Classes

Posted by oschina on 2014/03/11 06:41 (18 paragraphs in total; translation completed 04-07)

1. Introduction

This project allows you to read and parse a PDF file and display its internal structure. The PDF file specification document is available from Adobe. This project is based on “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7 November 2006”. It is an intimidating 1310-page document. This article provides a concise overview of the specification. The associated project defines C# classes for reading and parsing a PDF file. To test these classes, the attached test program PdfFileAnalyzer allows you to read a PDF file, analyze it, and display and save the result. The program breaks the PDF file into individual page descriptions, fonts, images and other objects. Two types of PDF files are not supported by this program: encrypted files and multi-generation files.

Version 1.1 of this program allows programmers in regions that use a comma as the decimal separator to compile and run the program.

Version 1.2 fixes a problem related to reading PDF documents with cross-reference streams. In versions prior to 1.2 the program would terminate with a duplicate object number error.

If you are interested in incorporating a PDF file writer into your application, please read the "PDF File Writer C# Class Library" article.

2. Overview

The PDF file is structured to allow Adobe Acrobat to display and print each page on a variety of screens and printers. If you open the file with a binary editor you will see that most of the file is unreadable. The small sections that are readable look like:

1 0 obj
<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R
/StructTreeRoot 10 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj 
4 0 obj
<</Contents 5 0 R/Group <</CS/DeviceRGB /S/Transparency /Type/Group>>
/MediaBox[0 0 612 792] /Parent 2 0 R
/Resources <</Font <</F1 6 0 R /F2 8 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>
/StructParents 0/Tabs/S/Type/Page>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 2319>>
stream
. . .
endstream
endobj

At first glance, the file is made of objects enclosed between “n 0 obj” and “endobj” keywords. The PDF term for these is indirect objects. The two numbers before “obj” are the object number and the generation number. Items enclosed within double angle brackets <<>> are dictionaries. Items enclosed within square brackets [] are arrays. Items starting with a slash / are parameter names (for example /Pages).

In the example above, the first item “1 0 obj” is the document catalog, or root object. The catalog dictionary has an entry “/Pages 2 0 R”. This is a reference to the object that defines the tree of pages. In this case, object number 2 refers to a single page, “/Kids[4 0 R]”, so this is a one-page document. Object number 4 is the only page definition. The page size is 612 by 792 points, in other words 8.5 by 11 inches (one inch is 72 points). The page uses two fonts, F1 and F2, defined in objects 6 and 8. The page contents are described in object number 5.

Object number 5 has a stream that describes the painting of the page. In the example, “. . .” is a placeholder for this description. If you look at the PDF file with a binary editor, the stream looks like a long block of unreadable random bytes. The reason is that you are looking at compressed data. The stream is compressed with the ZLib deflate method. This is specified in the dictionary by “/Filter /FlateDecode”. The compressed stream is 2319 bytes long. If you decompress the stream, the first few items will look something like this:

q
37.08 56.424 537.84 679.18 re
W* n
/P <</MCID 0>> BDC 0.753 g
36.6 465.43 537.96 24.84 re
f*
EMC  /P <</MCID 1/Lang (x-none)>> BDC BT
/F1 18 Tf
1 0 0 1 39.6 718.8 Tm
0 g
0 G
[(GRA)29(NOTECH LI)-3(MIT)-4(ED)] TJ
ET

This is a small sample of the page description language. In this example “re” stands for rectangle. The four numbers before it are the position and size: “X Y Width Height”.
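The decompression step mentioned above can be sketched in a few lines of C#. This is only an illustration, not the PdfFileAnalyzer code: the class name FlateSketch and the method Inflate are hypothetical. It assumes the stream body is in ZLib format (a 2-byte header, deflate data, and a 4-byte Adler-32 checksum) and relies on the standard DeflateStream class, which expects the raw deflate data without the header.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class FlateSketch
{
    // Inflate the body of a /FlateDecode stream (ZLib format).
    public static byte[] Inflate(byte[] zlibData)
    {
        if (zlibData == null || zlibData.Length < 2)
            throw new ArgumentException("Not a ZLib stream");

        // Skip the 2-byte ZLib header; DeflateStream reads raw deflate data.
        // The trailing Adler-32 checksum is not verified in this sketch.
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

For a page contents stream, converting the returned bytes with Encoding.ASCII.GetString should give text similar to the sample above.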

This simplified example demonstrates the general idea behind PDF files. You start with a root object that points to a hierarchy of pages. Each page defines resources such as fonts, images and contents streams. Contents streams are made of the operators and arguments required to paint the pages. The PdfFileAnalyzer produces an object summary file. This file contains all the objects without the streams. Each stream is decoded and saved as a separate file. Page descriptions are saved as text files. Image streams are saved as .jpg or .bmp files. Font streams are saved as .ttf files. Other binary streams are saved as .bin files. Text streams are saved as .txt files. Page descriptions go through another parsing pass that translates the cryptic operator codes (one to three characters) into pseudo C# source. As an example, the page description above is translated to:

SaveGraphicsState(); // q
Rectangle(37.08, 56.424, 537.84, 679.18); // re
ClippingPathEvenOddRule(); // W*
NoPaint(); // n
BeginMarkedContentPropList("/P", "<</MCID 0>>"); // BDC
GrayLevelForNonStroking(0.753); // g
Rectangle(36.6, 465.43, 537.96, 24.84); // re
FillEvenOddRule(); // f*
EndMarkedContent(); // EMC
BeginMarkedContentPropList("/P", "<</Lang(x-none)/MCID 1>>"); // BDC
BeginText(); // BT
SelectFontAndSize("/F1", 18); // Tf
TextMatrix(1, 0, 0, 1, 39.6, 718.8); // Tm
GrayLevelForNonStroking(0); // g
GrayLevelForStroking(0); // G
ShowTextWithGlyphPos("[(GRA)29(NOTECH LI)-3(MIT)-4(ED)]"); // TJ
EndTextObject(); // ET
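The translation shown above can be approximated with a lookup table from operator to method name. The sketch below is an assumption about how such a step might look, not the program's actual code. The class OperatorTranslatorSketch and the "Unknown" fallback are hypothetical, the table covers only a few of the operators from the sample, and the quoting of name and string operands seen above is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class OperatorTranslatorSketch
{
    // Operator-to-name table, limited to operators that appear in the sample.
    private static readonly Dictionary<string, string> Names =
        new Dictionary<string, string>
        {
            { "q", "SaveGraphicsState" },
            { "re", "Rectangle" },
            { "W*", "ClippingPathEvenOddRule" },
            { "n", "NoPaint" },
            { "g", "GrayLevelForNonStroking" },
            { "G", "GrayLevelForStroking" },
            { "f*", "FillEvenOddRule" },
            { "BT", "BeginText" },
            { "ET", "EndTextObject" },
            { "Tm", "TextMatrix" },
        };

    // Operands come first in the page description; the operator follows them.
    public static string Translate(string op, params string[] operands)
    {
        string name;
        if (!Names.TryGetValue(op, out name))
            name = "Unknown"; // fallback chosen for this sketch only
        return name + "(" + string.Join(", ", operands) + "); // " + op;
    }
}
```

Calling Translate("re", "37.08", "56.424", "537.84", "679.18") builds the second line of the listing above.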

The remainder of this article goes into the PDF file structure and the parsing process in more detail. The following sections cover: object definitions, file structure, file parsing, file reading, and using the PdfFileAnalyzer program.

3. Disclaimer

The PdfFileAnalyzer works with most PDF files. This was my experience scanning many of the PDF files on my own system. However, the program does not support encrypted files or multi-generation files (files where the second number before obj is not zero). The number of features in the PDF specification is very large. It is not possible for a single developer to test all of them systematically. If the program throws an exception during file analysis, an error message is displayed showing the source code module name and line number.

4. Object Definitions

A PDF file is made of objects. Each PDF object has a corresponding class in the PdfFileAnalyzer project. All of these object classes are derived from the PdfBase class. The source code for the object class definitions is in BasicObjects.cs. The exact definition of PDF objects is available in chapter 3 of Adobe's PDF specification.

4.1. Basic Objects

  • Boolean object is implemented by the PdfBoolean class. The PDF definition of Boolean is the same as in C#.

  • Integer object is implemented by the PdfInt class. The PDF definition is the same as Int32 in C#.

  • Real number object is implemented by the PdfReal class. The PDF definition is the same as Single in C#.

  • String object is implemented by the PdfStr class. The PDF definition is different from C#. A PDF string is made of bytes, not characters. It is enclosed in parentheses (). The PdfFileAnalyzer saves the PDF string in a C# string, including the parentheses. A PDF string is useful for ASCII encoding.

  • Hexadecimal string object is implemented by the PdfHex class. It is a string of bytes written as two hex digits per byte and enclosed within angle brackets <>. The PdfFileAnalyzer saves the PDF hex string in a C# string, including the angle brackets. For PDF readers, the string and the hex string objects serve the same purpose. The string (AB) is equivalent to <4142>. A PDF hex string is useful for any encoding.

  • Name object is implemented by the PdfName class. A name object is made of a forward slash followed by a sequence of characters, for example /Width. Name objects are used as parameter names. The PdfFileAnalyzer saves the name object in a C# string, including the leading /.

  • Null object is implemented by the PdfNull class. The PDF definition of null is basically the same as in C#.
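To illustrate the equivalence of (AB) and <4142>, here is a minimal sketch that converts a hex string object to bytes. The class HexStringSketch is hypothetical and is not the PdfHex class. It follows two rules from the PDF specification: white space between digits is ignored, and a missing final digit is treated as 0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class HexStringSketch
{
    // Value of one hex digit, or -1 if the character is not a hex digit.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    // Convert a hex string object such as "<4142>" to its bytes.
    public static byte[] Decode(string hex)
    {
        if (hex == null || hex.Length < 2 || hex[0] != '<' || hex[hex.Length - 1] != '>')
            throw new FormatException("Hex string must be enclosed in <>");

        var bytes = new List<byte>();
        int high = -1; // pending high-order digit, -1 when none
        for (int i = 1; i < hex.Length - 1; i++)
        {
            char c = hex[i];
            // PDF white space inside a hex string is ignored.
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0')
                continue;
            int digit = HexValue(c);
            if (digit < 0)
                throw new FormatException("Invalid hex digit: " + c);
            if (high < 0) { high = digit; }
            else { bytes.Add((byte)(high * 16 + digit)); high = -1; }
        }
        // An odd number of digits: the missing last digit is taken as 0.
        if (high >= 0) bytes.Add((byte)(high * 16));
        return bytes.ToArray();
    }
}
```

Decode("<4142>") returns the two bytes 0x41 and 0x42, the ASCII codes of A and B.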

4.2. Compound Objects

  • Array object is implemented by the PdfArray class. A PDF array is a collection of objects enclosed within square brackets []. The objects of one array can be a mix of any type except stream. The PdfFileAnalyzer saves the objects in a C# array of PdfBase. Since all object classes are derived from PdfBase, there is no problem saving a mix of object types within this array. When an array object is converted to a string (ToString() method), the program adds the leading and trailing square brackets. An array can be empty. Example of an array with six objects: [120 9.56 true null (string) <414243>].

  • Dictionary object is implemented by the PdfDict class. A PDF dictionary is a collection of key and value pairs enclosed within double angle brackets <<>>. The dictionary key is a name object and the value is any object except stream. The PdfFileAnalyzer saves one key and value pair in the PdfPair class. The key is a C# string and the value is a PdfBase. The PdfDict class has an array of PdfPair objects. A dictionary is accessed by key, so pair ordering is not important. PdfFileAnalyzer sorts the pairs by key. Example of a dictionary with three pairs: <</CropBox [0 0 612 792] /Rotate 0 /Type /Page>>.

  • Stream object is implemented by PdfStream. Streams are used to hold page descriptions, images and fonts. A PDF stream is made of two parts: a dictionary and a stream of bytes. The dictionary defines the stream parameters. One of the stream dictionary entries is /Filter. The PDF specification defines 10 types of filters. PdfFileAnalyzer supports 4 filters. These 4 filters are the only ones I found to be in general use. The compression filter FlateDecode is the filter most used by current PDF writers. FlateDecode is ZLib deflate decompression. The LZWDecode compression filter was used a few years ago. In order to read older PDF files, this program supports this filter. The ASCII85Decode filter converts printable ASCII to binary. The DCTDecode filter is JPEG image compression. The PdfFileAnalyzer implements decoding for the first three. The DCTDecode stream is saved as is with the file extension .jpg. It is an image file that can be viewed.

  • Object stream was introduced in PDF 1.5. It is a stream that contains multiple indirect objects (described below). The stream objects described above are compressed one stream at a time. An object stream compresses all the objects it contains as one compressed section.

  • Cross-reference stream was introduced in PDF 1.5. It is a stream that contains the cross-reference table described later in the article.

  • Inline image object is implemented by PdfInlineImage. It is a stream within a stream. An inline image is part of the page description language. It is made of three operators: BI (begin image), ID (image data) and EI (end image). The area between BI and ID is an image dictionary and the area between ID and EI is the image data.
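The relationship between the base class, arrays and dictionaries can be sketched as follows. These are not the classes from BasicObjects.cs. All names starting with Sketch are hypothetical, and a SortedDictionary stands in for the sorted array of PdfPair described above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical stand-ins for PdfBase and some derived classes.
public abstract class SketchBase { }

public sealed class SketchInt : SketchBase
{
    public readonly int Value;
    public SketchInt(int value) { Value = value; }
    public override string ToString() { return Value.ToString(CultureInfo.InvariantCulture); }
}

public sealed class SketchName : SketchBase
{
    public readonly string Name; // stored with the leading '/'
    public SketchName(string name) { Name = name; }
    public override string ToString() { return Name; }
}

public sealed class SketchArray : SketchBase
{
    // A mix of object types fits because every item derives from SketchBase.
    public readonly SketchBase[] Items;
    public SketchArray(params SketchBase[] items) { Items = items; }
    public override string ToString()
    {
        return "[" + string.Join(" ", Items.Select(i => i.ToString())) + "]";
    }
}

public sealed class SketchDict : SketchBase
{
    // Keys are kept in ordinal order, so the output is sorted by key.
    private readonly SortedDictionary<string, SketchBase> pairs =
        new SortedDictionary<string, SketchBase>(StringComparer.Ordinal);

    public void Add(string key, SketchBase value) { pairs[key] = value; }

    // Returns null when the key is not present.
    public SketchBase Find(string key)
    {
        SketchBase value;
        return pairs.TryGetValue(key, out value) ? value : null;
    }

    public override string ToString()
    {
        return "<<" + string.Join(" ", pairs.Select(p => p.Key + " " + p.Value)) + ">>";
    }
}
```

Adding /Type, /Rotate and /CropBox in any order and calling ToString() gives the pairs back sorted by key, as in the dictionary example above.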

4.3. Indirect Objects

  • Indirect object is implemented by PdfIndirectObject. It is the main building block of a PDF document. An indirect object is any object enclosed between “n 0 obj” and “endobj”. Other objects can refer to an indirect object by specifying “n 0 R”. The “n” is the object number. The “0” is the generation number. This program does not support generation numbers other than 0. The PDF specification allows other numbers. The idea behind multiple generations is to allow PDF modifications by keeping the original file and appending changes.

  • Object reference is a way of referring to indirect objects. For example, /Pages 2 0 R is a dictionary entry in the catalog object. It is a pointer to the /Pages object. The pages object is indirect object number 2.
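A reference such as “2 0 R” can be recognized with a short pattern match. The sketch below uses a regular expression for brevity. The class ReferenceSketch is hypothetical, and the real parser works on bytes and tokens rather than on a C# string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class ReferenceSketch
{
    // Object number, generation number and the letter R, separated by white space.
    private static readonly Regex Pattern = new Regex(@"^([0-9]+)\s+([0-9]+)\s+R$");

    public static bool TryParse(string text, out int objectNumber, out int generation)
    {
        objectNumber = 0;
        generation = 0;
        if (text == null) return false;
        Match m = Pattern.Match(text.Trim());
        if (!m.Success) return false;
        return int.TryParse(m.Groups[1].Value, out objectNumber)
            && int.TryParse(m.Groups[2].Value, out generation);
    }
}
```

A caller that follows the restriction described above would reject any reference whose generation number is not 0.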

4.4. Operators and keywords

  • Operators and keywords are not considered PDF objects. However, the PdfFileAnalyzer program has PdfOp and PdfKeyword classes that are derived from PdfBase. During the parsing process the parser creates a PdfOp or a PdfKeyword for each valid sequence of characters. Appendix A, Operator Summary, of Adobe's PDF specification lists all 73 operators. Here are some examples of operators: BT (begin text object), G (set gray level for stroking operations), m (move to), re (rectangle) and Tc (set character spacing). Examples of keywords: stream, obj, endobj, xref.

5. File Structure

A PDF file is made of four parts: header, body, cross-reference and trailer signature.

  • Header: The header is the file signature. It must be %PDF-1.x where x is 0 to 7.

  • Body: The body area contains all the indirect objects.

  • Cross-reference: The cross-reference is a table of file position pointers to all indirect objects. There are two types of cross-reference tables. The original style is made of ASCII characters. The new style is a stream within an indirect object, with the information encoded as binary numbers. At the end of the cross-reference table there is a trailer dictionary. A file can have more than one cross-reference area.

  • Trailer signature: The trailer signature is made of the keyword “startxref”, the byte offset of the last cross-reference table, and the end signature %%EOF. Please note: the trailer dictionary is part of the cross-reference area.
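The trailer signature described above can be located by searching the tail of the file. The following sketch is an assumption about one way to do it, not the program's code. The class TrailerSketch is hypothetical, and the 1024-byte search window is an arbitrary choice.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class TrailerSketch
{
    // Returns the byte offset that follows "startxref", or -1 if it is not found.
    public static long FindStartXref(byte[] file)
    {
        // Only the tail of the file is searched; the window size is arbitrary.
        int start = Math.Max(0, file.Length - 1024);
        string tail = Encoding.ASCII.GetString(file, start, file.Length - start);

        int eof = tail.LastIndexOf("%%EOF", StringComparison.Ordinal);
        if (eof < 0) return -1;

        // Look for the keyword in the text that precedes the end signature.
        string beforeEof = tail.Substring(0, eof);
        int key = beforeEof.LastIndexOf("startxref", StringComparison.Ordinal);
        if (key < 0) return -1;

        // The offset is the decimal number between the keyword and %%EOF.
        string number = beforeEof.Substring(key + "startxref".Length).Trim();
        long offset;
        return long.TryParse(number, NumberStyles.None, CultureInfo.InvariantCulture, out offset)
            ? offset
            : -1;
    }
}
```

The returned value is the file position where reading of the last cross-reference table would begin.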

6. File Parsing

The PDF file is a sequence of bytes. Some of the bytes have special meaning.

White space is defined as: null, tab, line feed, form feed, carriage return and space.

Delimiters are defined as: (, ), <, >, [, ], {, }, /, %, and white space characters.
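The two definitions above translate directly into byte tests. This is a minimal sketch, and the class CharClassSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class CharClassSketch
{
    // White space: null, tab, line feed, form feed, carriage return, space.
    public static bool IsWhiteSpace(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Delimiters: ( ) < > [ ] { } / % and all white space characters.
    public static bool IsDelimiter(byte b)
    {
        if (IsWhiteSpace(b)) return true;
        switch ((char)b)
        {
            case '(': case ')':
            case '<': case '>':
            case '[': case ']':
            case '{': case '}':
            case '/': case '%':
                return true;
            default:
                return false;
        }
    }
}
```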

File parsing is done by the PdfParser class. To start the parsing process the program sets the file position to the area to be parsed. ParseNextItem() is the method that extracts the next object.

The parser skips white space and comments. If the next byte is “(” the object is a string. If the next byte is “[” the object is an array. If the next two bytes are “<<” the object is a dictionary. If the next byte is a single “<” the object is a hex string. If the next byte is “/” the object is a name. If the next byte is none of the above, the parser accumulates the following bytes until a delimiter is found. The delimiter is not part of the current token. The token can be an integer, a real number, an operator or a keyword. In the case of an integer, the program searches further for an object reference “n 0 R” or an indirect object “n 0 obj”, where n is the integer. The value returned from ParseNextItem() is the appropriate object as per section 4, Object Definitions. The object is returned as a PdfBase class.

In the case of an array or a dictionary, the program calls ParseNextItem() recursively to parse the internal objects of the array or dictionary.
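The dispatch on the first byte can be sketched as a small classification function. This is not ParseNextItem() itself. The enum NextItemKind and the class DispatchSketch are hypothetical, and the sketch stops at deciding what kind of object follows without building it.

```csharp
using System;

// Hypothetical types, not part of the PdfFileAnalyzer project.
public enum NextItemKind { EndOfData, StringObject, ArrayObject, Dictionary, HexString, Name, Token }

public static class DispatchSketch
{
    private static bool IsWhiteSpace(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Decide what kind of item starts at or after position pos.
    public static NextItemKind Classify(byte[] data, int pos)
    {
        // Skip white space and comments; a comment runs from % to end of line.
        while (pos < data.Length)
        {
            if (IsWhiteSpace(data[pos]))
            {
                pos++;
            }
            else if (data[pos] == (byte)'%')
            {
                while (pos < data.Length && data[pos] != 10 && data[pos] != 13) pos++;
            }
            else
            {
                break;
            }
        }
        if (pos >= data.Length) return NextItemKind.EndOfData;

        switch ((char)data[pos])
        {
            case '(': return NextItemKind.StringObject;
            case '[': return NextItemKind.ArrayObject;
            case '/': return NextItemKind.Name;
            case '<':
                // "<<" starts a dictionary; a single "<" starts a hex string.
                bool isDictionary = pos + 1 < data.Length && data[pos + 1] == (byte)'<';
                return isDictionary ? NextItemKind.Dictionary : NextItemKind.HexString;
            default:
                // Integer, real number, operator or keyword.
                return NextItemKind.Token;
        }
    }
}
```

A full parser would continue from here: for ArrayObject and Dictionary it would call itself for each contained object until the closing bracket is reached.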

7. File Reading

PdfDocument class is the main class of PDF file analysis. The entry method is ReadPdfFile(String FileName). The program opens the PDF file for binary reading (one byte at a time).

File analysis starts with checking the header signature %PDF-1.x, where x is 0 to 7, and the trailer end signature %%EOF. One would think that all PDF writers put the header at position zero of the file and the trailer at the very end of the file. Unfortunately this is not the case. The program has to search for these two signatures at the two ends of the file. If the header signature is not at position zero, the file position pointers of all indirect objects have to be adjusted.

Just before the trailer signature there is a pointer to the start of the last cross-reference table.
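The header search and the pointer adjustment mentioned above might look like the sketch below. The class HeaderSketch is hypothetical and the 1024-byte search window is arbitrary. The adjustment rule (adding the header position to each stored pointer) is an assumption about what the adjustment means, not something the article states explicitly.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the PdfFileAnalyzer project.
public static class HeaderSketch
{
    // Returns the position of "%PDF-1.x" (x is 0 to 7), or -1 if it is not found.
    public static int FindHeader(byte[] file)
    {
        int limit = Math.Min(file.Length, 1024); // arbitrary search window
        string head = Encoding.ASCII.GetString(file, 0, limit);

        int pos = head.IndexOf("%PDF-1.", StringComparison.Ordinal);
        while (pos >= 0)
        {
            int digitIndex = pos + 7; // index of the x in %PDF-1.x
            if (digitIndex < head.Length && head[digitIndex] >= '0' && head[digitIndex] <= '7')
                return pos;
            pos = head.IndexOf("%PDF-1.", pos + 1, StringComparison.Ordinal);
        }
        return -1;
    }

    // Assumption: stored pointers are relative to the header position.
    public static long AdjustOffset(long storedOffset, int headerPosition)
    {
        return storedOffset + headerPosition;
    }
}
```

With a header found at position 5, a cross-reference pointer of 100 would be read at file position 105 under this assumption.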
