1. Introduction

This project lets you read and parse a PDF file and display its internal structure. The PDF file specification is available from Adobe. This project is based on “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7, November 2006”. It is an intimidating 1310-page document. This article provides a concise overview of the specification. The associated project defines C# classes for reading and parsing a PDF file. To test these classes, the attached test program PdfFileAnalyzer lets you read a PDF file, analyze it, and display and save the result. The program breaks the PDF file into individual page descriptions, fonts, images and other objects. Two types of PDF files are not supported by this program: encrypted files and multi-generation files.

Version 1.1 of this program allows programmers in regions of the world that use a comma as the decimal separator to compile and run the program.

Version 1.2 fixes a problem related to reading PDF documents with cross-reference streams. In versions prior to 1.2 the program would terminate with a duplicate object number error.

If you are interested in incorporating a PDF file writer into your application, please read the "PDF File Writer C# Class Library" article.

2. Overview

The PDF file is structured to allow Adobe Acrobat to display and print each page on a variety of screens and printers. If you open the file with a binary editor you will see that most of the file is unreadable. The small sections that are readable look like:

1 0 obj
<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R
/StructTreeRoot 10 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj 
4 0 obj
<</Contents 5 0 R/Group <</CS/DeviceRGB /S/Transparency /Type/Group>>
/MediaBox[0 0 612 792] /Parent 2 0 R
/Resources <</Font <</F1 6 0 R /F2 8 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>
/StructParents 0/Tabs/S/Type/Page>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 2319>>
stream
. . .
endstream
endobj

The first impression is that the file is made of objects nested between “n 0 obj” and “endobj” keywords. The PDF term is indirect objects. The numbers before “obj” are the object number and the generation number. Items enclosed within double angle brackets <<>> are dictionaries. Items enclosed within square brackets [] are arrays. Items starting with a slash / are parameter names (e.g. /Pages).

In the example above the first item “1 0 obj” is the document catalog, or root object. The catalog has in its dictionary an item “/Pages 2 0 R”. This is a reference to the object that defines the tree of pages. In this case, object number 2 has a reference to one page, “/Kids[4 0 R]”, so this is a one-page document. Object number 4 is the only page definition. The page size is 612 by 792 points, in other words 8.5” by 11” (1” is 72 points, so 612 / 72 = 8.5 and 792 / 72 = 11). The page uses two fonts, F1 and F2, defined in objects 6 and 8. The page contents are described in object number 5.

Object number 5 has a stream that describes the painting of the page. In the example, “. . .” is a placeholder for this description. If you look at the PDF file with a binary editor, the stream looks like a long block of unreadable random bytes. The reason is that you are looking at compressed data. The stream is compressed with the ZLib deflate method, as specified in the dictionary by “/Filter/FlateDecode”. The compressed stream is 2319 bytes long. If you decompress the stream, the first few items will look something like this:

q
37.08 56.424 537.84 679.18 re
W* n
/P <</MCID 0>> BDC 0.753 g
36.6 465.43 537.96 24.84 re
f*
EMC  /P <</MCID 1/Lang (x-none)>> BDC BT
/F1 18 Tf
1 0 0 1 39.6 718.8 Tm
0 g
0 G
[(GRA)29(NOTECH LI)-3(MIT)-4(ED)] TJ
ET

This is a small sample of the page description language. In this example “re” stands for rectangle. The four numbers before it are the position and size: “X Y Width Height”.
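
To see why a FlateDecode stream looks like random bytes in a binary editor, here is a minimal sketch in Python (an independent illustration, not the project's C# code). It compresses a fragment of the page description above with the standard zlib module, which stands in for what a PDF writer does, and then inflates it again. All variable names are illustrative.

```python
import zlib

# A fragment of the page description shown above.
content = b"q\n37.08 56.424 537.84 679.18 re\nW* n\n"

# Roughly what a PDF writer stores between "stream" and "endstream"
# when the stream dictionary says /Filter/FlateDecode.
stored = zlib.compress(content)

# What a reader does to recover the page description.
restored = zlib.decompress(stored)

# "re" takes the four numbers before it as X, Y, Width, Height.
second_line = restored.decode("ascii").splitlines()[1]
x, y, width, height = (float(text) for text in second_line.split()[:4])
print(stored[:8])
print(x, y, width, height)
```

The first print shows the kind of bytes a binary editor would display; the second shows the rectangle arguments recovered from the decompressed text.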

This simplified example demonstrates the general idea behind PDF files. You start with a root object that points to a hierarchy of pages. Each page defines resources such as fonts, images and contents streams. Contents streams are made of the operators and arguments required to paint the pages. The PdfFileAnalyzer produces an object summary file. This file contains all the objects without the streams. Each stream is decoded and saved as a separate file. Page descriptions are saved as text files. Image streams are saved as .jpg or .bmp files. Font streams are saved as .ttf files. Other binary streams are saved as .bin files. Text streams are saved as .txt files. Page descriptions go through another parsing process that translates the cryptic one- or two-letter codes into pseudo C# source. As an example, the page description above is translated to:

SaveGraphicsState(); // q
Rectangle(37.08, 56.424, 537.84, 679.18); // re
ClippingPathEvenOddRule(); // W*
NoPaint(); // n
BeginMarkedContentPropList("/P", "<</MCID 0>>"); // BDC
GrayLevelForNonStroking(0.753); // g
Rectangle(36.6, 465.43, 537.96, 24.84); // re
FillEvenOddRule(); // f*
EndMarkedContent(); // EMC
BeginMarkedContentPropList("/P", "<</Lang(x-none)/MCID 1>>"); // BDC
BeginText(); // BT
SelectFontAndSize("/F1", 18); // Tf
TextMatrix(1, 0, 0, 1, 39.6, 718.8); // Tm
GrayLevelForNonStroking(0); // g
GrayLevelForStroking(0); // G
ShowTextWithGlyphPos("[(GRA)29(NOTECH LI)-3(MIT)-4(ED)]"); // TJ
EndTextObject(); // ET
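
The translation relies on the fact that, in a page description, the arguments come first and the operator comes last. The following Python sketch (an illustration only, not the project's C# code) shows the idea with a small, hand-picked subset of operators. The table and the function name `translate` are assumptions made for this example, and the whitespace splitting only works for numeric arguments, not for strings, arrays or dictionaries.

```python
# Hypothetical subset of the operator table; the method names follow the listing above.
OPERATOR_NAMES = {
    "q": "SaveGraphicsState",
    "re": "Rectangle",
    "W*": "ClippingPathEvenOddRule",
    "n": "NoPaint",
    "g": "GrayLevelForNonStroking",
    "G": "GrayLevelForStroking",
    "f*": "FillEvenOddRule",
}


def translate(content: str) -> list[str]:
    """Turn whitespace-separated arguments and operators into pseudo calls."""
    calls, arguments = [], []
    for token in content.split():
        if token in OPERATOR_NAMES:
            # Arguments precede the operator, so an operator flushes the list.
            joined = ", ".join(arguments)
            calls.append(f"{OPERATOR_NAMES[token]}({joined}); // {token}")
            arguments = []
        else:
            arguments.append(token)
    return calls


for line in translate("q 37.08 56.424 537.84 679.18 re W* n 0.753 g"):
    print(line)
```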

The remaining part of this article goes into the PDF file structure and the parsing process in more detail. The following sections cover: object definitions, file structure, file parsing, file reading, and using the PdfFileAnalyzer program.

3. Disclaimer

The PdfFileAnalyzer will work with most PDF files. This was my experience scanning many of the PDF files on my own system. However, the program does not support encrypted files or multi-generation files (files in which the second number before obj is not zero). The number of features available in the PDF specification is very large, and it is not possible for a single developer to systematically test all of them. If the program throws an exception during file analysis, an error message is displayed showing the source code module name and line number.

4. Object Definitions

A PDF file is made of objects. Each PDF object has a corresponding class in the PdfFileAnalyzer project. All of these object classes are derived from the PdfBase class. The source code for the object class definitions is in BasicObjects.cs. The exact definition of PDF objects is available in chapter 3 of Adobe's PDF specification.

4.1. Basic Objects

  • Boolean object is implemented by the PdfBoolean class. The PDF definition of Boolean is the same as in C#.

  • Integer object is implemented by the PdfInt class. The PDF definition is the same as Int32 in C#.

  • Real number object is implemented by the PdfReal class. The PDF definition is the same as Single in C#.

  • String object is implemented by the PdfStr class. The PDF definition is different from C#: a string is made of bytes, not characters. It is enclosed in parentheses (). The PdfFileAnalyzer saves the PDF string in a C# string including the parentheses. PDF string is useful for ASCII encoding.

  • Hexadecimal string object is implemented by the PdfHex class. It is a string of characters defined by two hex digits per byte and enclosed within angle brackets <>. The PdfFileAnalyzer saves the PDF hex string in a C# string including the angle brackets. For PDF readers the string and the hex string objects serve the same purpose. The string (AB) is equivalent to <4142>. PDF hex string is useful for any encoding.

  • Name object is implemented by the PdfName class. A name object is made of a forward slash followed by a sequence of characters, for example /Width. Name objects are used as parameter names. The PdfFileAnalyzer saves the name object in a C# string including the leading /.

  • Null object is implemented by the PdfNull class. The PDF definition of null is basically the same as in C#.
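
The equivalence of (AB) and <4142> can be checked with a few lines of Python (an illustration only, not the project's C# code). Both helper functions are hypothetical names introduced for this example, and the literal-string helper assumes plain ASCII content with no escape sequences.

```python
def pdf_literal_to_bytes(token: str) -> bytes:
    """Decode "(...)"; assumes ASCII content and no backslash escapes."""
    if not (token.startswith("(") and token.endswith(")")):
        raise ValueError("not a literal string")
    return token[1:-1].encode("ascii")


def pdf_hex_to_bytes(token: str) -> bytes:
    """Decode "<...>": two hex digits define one byte."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hex string")
    return bytes.fromhex(token[1:-1])


print(pdf_literal_to_bytes("(AB)") == pdf_hex_to_bytes("<4142>"))
```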

4.2. Compound Objects

  • Array object is implemented by the PdfArray class. A PDF array is a collection of objects enclosed within square brackets []. The objects of one array can be a mix of any type except stream. The PdfFileAnalyzer saves the objects in a C# array of PdfBase. Since all object classes are derived from PdfBase, there is no problem saving a mix of object types within this array. When an array object is converted to a string (ToString() method), the program adds the leading and trailing square brackets. An array can be empty. Example of an array with six objects: [120 9.56 true null (string) <414243>].

  • Dictionary object is implemented by the PdfDict class. A PDF dictionary is a collection of key and value pairs enclosed within double angle brackets <<>>. The dictionary key is a name object and the value is any object except stream. The PdfFileAnalyzer saves one key value pair in the PdfPair class. The key is a C# string and the value is a PdfBase. The PdfDict class has an array of PdfPair classes. A dictionary is accessed by key, so pair ordering is not important. PdfFileAnalyzer sorts the pairs by key. Example of a dictionary with three pairs: <</CropBox [0 0 612 792] /Rotate 0 /Type /Page>>.

  • Stream object is implemented by PdfStream. Streams are used to hold page descriptions, images and fonts. A PDF stream is made of two parts: a dictionary and a stream of bytes. The dictionary defines the stream parameters. One of the stream dictionary entries is /Filter. The PDF specification defines 10 types of filters. PdfFileAnalyzer supports 4 filters. These 4 filters are the only ones I found to be in general use. The compression filter FlateDecode is the filter most used by current PDF writers. FlateDecode supports ZLib deflate decompression. The LZWDecode compression filter was used a few years ago. In order to read older PDF files, this program supports this filter. The ASCII85Decode filter converts printable ASCII to binary. DCTDecode is used for JPEG image compression. The PdfFileAnalyzer implements decoding for the first three. The DCTDecode stream is saved as is with file extension .jpg. It is an image file that can be viewed.

  • Object stream was introduced in PDF 1.5. It is a stream that contains multiple indirect objects (described below). Stream objects described above are compressed one stream at a time. An object stream compresses all the included objects in one compressed section.

  • Cross-reference stream was introduced in PDF 1.5. It is a stream that contains the cross-reference table described later in the article.

  • Inline image object is implemented by PdfInlineImage. It is a stream within a stream. An inline image is part of the page description language. It is made of three operators: BI (begin image), ID (image data) and EI (end image). The area between BI and ID is an image dictionary and the area between ID and EI is the image data.
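
The role of the /Filter entry can be sketched as a small dispatch on the filter name. The Python example below is an illustration under stated assumptions, not the project's C# code: `decode_stream` is a hypothetical function, LZWDecode is left out because the Python standard library has no decoder for it, and a list is accepted because a /Filter entry may name more than one filter.

```python
import base64
import zlib


def decode_stream(data: bytes, filters: list[str]) -> bytes:
    """Apply the named filters in order; unknown names raise ValueError."""
    for name in filters:
        if name == "/FlateDecode":
            data = zlib.decompress(data)
        elif name == "/ASCII85Decode":
            # PDF terminates ASCII85 data with "~>"; drop it before decoding.
            body = data.strip()
            if body.endswith(b"~>"):
                body = body[:-2]
            data = base64.a85decode(body)
        elif name == "/DCTDecode":
            pass  # JPEG data is kept as is and can be written to a .jpg file
        else:
            raise ValueError(f"unsupported filter {name}")
    return data


# Build a test stream the way a writer might: deflate first, then ASCII85.
page = b"BT /F1 18 Tf 1 0 0 1 39.6 718.8 Tm ET"
stored = base64.a85encode(zlib.compress(page)) + b"~>"
print(decode_stream(stored, ["/ASCII85Decode", "/FlateDecode"]) == page)
```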

4.3. Indirect Objects

  • Indirect object is implemented by PdfIndirectObject. It is the main building block of a PDF document. An indirect object is any object encased between “n 0 obj” and “endobj”. Other objects can refer to an indirect object by specifying “n 0 R”. The “n” is the object number. The “0” is the generation number. This program does not support generation numbers other than 0, although the PDF specification allows them. The idea behind multiple generations is to allow PDF modifications by keeping the original file and appending changes.

  • Object reference is a way of referring to indirect objects. For example, /Pages 2 0 R is a dictionary entry in the catalog object. It is a pointer to the pages object, which is indirect object number 2.
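
As a rough illustration of the two notations, the Python sketch below (not the project's C# code) lists the indirect objects defined in a small sample and the references made to other objects. The regular expressions are a naive assumption for this example: a real parser must tokenize the file, because the same byte patterns can occur inside strings and streams.

```python
import re

sample = b"""1 0 obj
<</Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj
"""

# "n g obj" opens an indirect object; "n g R" refers to one.
OBJECT_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

defined = [(int(n), int(g)) for n, g in OBJECT_HEADER.findall(sample)]
referenced = [(int(n), int(g)) for n, g in REFERENCE.findall(sample)]
print("defined:", defined)
print("referenced:", referenced)
```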

4.4. Operators and keywords

  • Operators and keywords are not considered PDF objects. However, the PdfFileAnalyzer program has PdfOp and PdfKeyword classes that are derived from PdfBase. During the parsing process the parser creates a PdfOp or a PdfKeyword for each valid sequence of characters. Appendix A, Operator Summary, of Adobe's PDF file specification lists all the operators. The list is made of 73 operators. Here are some examples of operators: BT (begin text object), G (set gray level for stroking operations), m (move to), re (rectangle) and Tc (set character spacing). Examples of keywords: stream, obj, endobj, xref.

5. File Structure

A PDF file is made of four parts: header, body, cross-reference and trailer signature.

  • Header: The header is the file signature. It must be %PDF-1.x where x is 0 to 7.

  • Body: The body area contains all the indirect objects.

  • Cross-reference: The cross-reference is a table of file position pointers to all indirect objects. There are two types of cross-reference tables. The original style is made of ASCII characters. The new style is a stream within an indirect object, with the information encoded as binary numbers. At the end of the cross-reference table there is a trailer dictionary. A file can have more than one cross-reference area.

  • Trailer signature: The trailer signature is made of the keyword “startxref”, the byte offset to the last cross-reference table, and the end signature %%EOF. Please note: the trailer dictionary is part of the cross-reference area.
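
The trailer signature is simple enough to read with a few lines of code. The Python sketch below (an illustration only, not the project's C# code) searches backwards from the end of the data for %%EOF and startxref and returns the offset stored between them. The function name `read_startxref` is an assumption made for this example.

```python
def read_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last "startxref" keyword."""
    end_signature = data.rfind(b"%%EOF")
    if end_signature < 0:
        raise ValueError("end signature %%EOF not found")
    keyword = data.rfind(b"startxref", 0, end_signature)
    if keyword < 0:
        raise ValueError("startxref keyword not found")
    # The offset is the decimal number between the keyword and %%EOF.
    fields = data[keyword + len(b"startxref"):end_signature].split()
    if not fields:
        raise ValueError("no offset after startxref")
    return int(fields[0])


tail = b"trailer\n<</Size 7/Root 1 0 R>>\nstartxref\n1234\n%%EOF\n"
print(read_startxref(tail))
```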

6. File Parsing

The PDF file is a sequence of bytes. Some of the bytes have special meaning.

White space is defined as: null, tab, line feed, form feed, carriage return and space.

Delimiters are defined as: (, ), <, >, [, ], {, }, /, %, and white space characters.

File parsing is done by the PdfParser class. To start the parsing process the program sets the file position to the area to be parsed. ParseNextItem() is the method that extracts the next object.

The parser skips white space and comments. If the next byte is “(” the object is a string. If the next byte is “[” the object is an array. If the next two bytes are “<<” the object is a dictionary. If the next byte is “<” the object is a hex string. If the next byte is “/” the object is a name. If the next byte is none of the above, the parser accumulates the following bytes until a delimiter is found. The delimiter is not part of the current token. The token can be an integer, a real number, an operator or a keyword. In the case of an integer, the program searches further for an object reference “n 0 R” or an indirect object “n 0 obj”, where n is the integer. The value returned from ParseNextItem() is the appropriate object as per section 4, Object Definitions. The object is returned as a PdfBase.

In the case of an array or a dictionary, the program calls ParseNextItem() recursively to parse the internal objects of the array or dictionary.
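
The decision sequence described above can be sketched as a small recursive parser. The Python code below is an independent illustration, not the project's PdfParser: the class and method names are assumptions, values are returned as plain Python objects instead of PdfBase classes, and only the “n g R” look-ahead is implemented (the “n 0 obj” case is left out to keep the example short).

```python
import re
from typing import NamedTuple

WHITE_SPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%" + WHITE_SPACE
INTEGER = re.compile(rb"[+-]?\d+")
REAL = re.compile(rb"[+-]?(\d+\.\d*|\.\d+)")
CONSTANTS = {b"true": True, b"false": False, b"null": None}

class Reference(NamedTuple):
    number: int
    generation: int

class Parser:
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def skip(self) -> None:
        """Skip white space and % comments."""
        while self.pos < len(self.data):
            if self.data[self.pos] in WHITE_SPACE:
                self.pos += 1
            elif self.data[self.pos] == ord("%"):
                while self.pos < len(self.data) and self.data[self.pos] not in b"\r\n":
                    self.pos += 1
            else:
                return

    def run(self) -> bytes:
        """Accumulate bytes up to, but not including, the next delimiter."""
        start = self.pos
        while self.pos < len(self.data) and self.data[self.pos] not in DELIMITERS:
            self.pos += 1
        return self.data[start:self.pos]

    def closes(self, marker: bytes) -> bool:
        """Consume the closing marker of an array or dictionary if it is next."""
        self.skip()
        if self.pos >= len(self.data):
            raise ValueError("unexpected end of data")
        if not self.data.startswith(marker, self.pos):
            return False
        self.pos += len(marker)
        return True

    def parse_next_item(self):
        self.skip()
        data, start = self.data, self.pos
        if data.startswith(b"<<", start):  # dictionary: recurse on key and value
            self.pos += 2
            pairs = {}
            while not self.closes(b">>"):
                key = self.parse_next_item()
                pairs[key] = self.parse_next_item()
            return pairs
        if data.startswith(b"[", start):  # array: recurse on each element
            self.pos += 1
            items = []
            while not self.closes(b"]"):
                items.append(self.parse_next_item())
            return items
        if data.startswith(b"(", start):  # string, kept with its parentheses
            depth = 0
            while self.pos < len(data):
                byte = data[self.pos]
                self.pos += 2 if byte == ord("\\") else 1  # skip an escaped byte
                depth += (byte == ord("(")) - (byte == ord(")"))
                if depth == 0:
                    break
            return data[start:self.pos].decode("latin-1")
        if data.startswith(b"<", start):  # hex string, kept with its brackets
            end = data.find(b">", start)
            if end < 0:
                raise ValueError("unterminated hex string")
            self.pos = end + 1
            return data[start:self.pos].decode("latin-1")
        if data.startswith(b"/", start):  # name, kept with its leading slash
            self.pos += 1
            return "/" + self.run().decode("latin-1")
        token = self.run()
        if not token:
            raise ValueError(f"unexpected byte at position {self.pos}")
        if token in CONSTANTS:
            return CONSTANTS[token]
        if REAL.fullmatch(token):
            return float(token.decode("ascii"))
        if not INTEGER.fullmatch(token):
            return token.decode("latin-1")  # operator or keyword
        # An integer may start an object reference "n g R": look ahead.
        after_integer = self.pos
        self.skip()
        generation = self.run()
        self.skip()
        keyword = self.run()
        if INTEGER.fullmatch(generation) and keyword == b"R":
            return Reference(int(token), int(generation))
        self.pos = after_integer
        return int(token)

item = Parser(b"<</Kids[4 0 R] /Count 1 /Type/Pages>>").parse_next_item()
print(item)
```

Restoring the position when the look-ahead fails is what lets “/Count 1 /Type” come back as the plain integer 1 rather than consuming the name that follows.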

7. File Reading

PdfDocument class is the main class of PDF file analysis. The entry method is ReadPdfFile(String FileName). The program opens the PDF file for binary reading (one byte at a time).

File analysis starts with checking the header signature %PDF-1.x, where x is 0 to 7, and the trailer end signature %%EOF. One would think that all PDF writers would put the header at position zero of the file and the trailer at the very end of the file. Unfortunately, this is not the case. The program has to search for these two signatures at the two ends of the file. If the header signature is not at position zero, the file position pointers of all indirect objects have to be adjusted.

Just before the trailer signature there is a pointer to the start of the last cross-reference table.
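
The header search and the pointer adjustment can be sketched as follows. This Python example is an illustration only, not the project's C# code. The function names and the 1024-byte search window are assumptions, and adding the header position to each stored pointer is one plausible reading of the adjustment described above.

```python
import re

# Header signature %PDF-1.x where x is 0 to 7.
HEADER = re.compile(rb"%PDF-1\.[0-7]")


def find_header(data: bytes, search_limit: int = 1024) -> int:
    """Return the position of the header signature near the start of the data."""
    match = HEADER.search(data, 0, search_limit)
    if match is None:
        raise ValueError("header signature %PDF-1.x not found")
    return match.start()


def adjust_pointer(stored_pointer: int, header_position: int) -> int:
    """Shift a stored file position when extra bytes precede the header."""
    return stored_pointer + header_position


data = b"junk\r\n%PDF-1.4\n1 0 obj\n<</Type/Catalog>>\nendobj\n"
header_position = find_header(data)
# A pointer of 9 counts from the header, so it lands on "1 0 obj" after adjustment.
position = adjust_pointer(9, header_position)
print(header_position, data[position:position + 7])
```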
