A lot of invisible data lies dormant in PDF files


by Christiaan Colen

PDF (Portable Document Format) is the electronic document format developed and promoted by Adobe Systems. Because a PDF can be displayed and printed with the same layout in any environment, and can embed links and annotations, it remains a popular format even in 2018, 25 years after its release. PSPDFKit, which develops software for viewing and editing PDF files, explains what kinds of data are embedded in a PDF file besides the document's content.

What's Hiding in Your PDF? | Inside PSPDFKit
https://pspdfkit.com/blog/2018/whats-hiding-in-your-pdf/


◆ 1: Information metadata
Since PDF 1.0, released in 1993, a PDF file can record the author, creation date, creator, and producer. From PDF 1.1 onward, the title, subject, keywords, and last-modified date and time can be recorded as well. This makes it easier to find a specific file among many. However, because this information can be edited later, it is not necessarily reliable. Note also that two files can have exactly the same document content even though their information metadata differs.
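As a rough illustration of how visible these fields are, the sketch below pulls them out of raw bytes with a regular expression. The sample bytes, the names `SAMPLE`, `INFO_KEYS`, and `read_info`, and all the values are made up for this example. It assumes the entries are stored as plain, uncompressed literal strings, which is not always the case in real files, so a proper PDF library is the better choice for real work.

```python
import re

# Hypothetical, hand-written fragment of an uncompressed Info dictionary.
SAMPLE = (
    b"1 0 obj\n"
    b"<< /Title (Quarterly Report) /Author (Jane Doe)\n"
    b"   /Creator (ExampleWriter) /Producer (ExamplePDFLib 1.0)\n"
    b"   /CreationDate (D:20180101120000Z) /ModDate (D:20180301093000Z) >>\n"
    b"endobj\n"
)

# Keys of the information metadata described above.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords",
             "Creator", "Producer", "CreationDate", "ModDate")


def read_info(raw: bytes) -> dict:
    """Naively collect Info entries that are stored as literal strings."""
    found = {}
    for key in INFO_KEYS:
        # Matches "/Key (value)"; nested or escaped parentheses are not handled.
        match = re.search(rb"/" + key.encode() + rb"\s*\(([^()]*)\)", raw)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


info = read_info(SAMPLE)
for key, value in info.items():
    print(f"{key}: {value}")
```

Because the values are ordinary text in the file, anyone with an editor can change them, which is why they should not be treated as proof of authorship.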



◆ 2: Extended information metadata
In the PDF ISO standard, metadata can be stored in a stream in XMP format. In addition to the information metadata above, this makes it possible to represent arbitrary data types as metadata. Although this metadata is not rendered by viewing software, it may be analyzed by file management systems.
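XMP is XML, so a metadata stream that is stored uncompressed can be located and parsed with standard tools. The following is a minimal sketch under that assumption. The sample packet, the property values, and the helper name `read_xmp` are invented for illustration, and the stream dictionary is abbreviated.

```python
import re
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Hypothetical XMP packet as it might appear inside a PDF metadata stream.
# The stream dictionary is abbreviated (no /Length entry) for brevity.
SAMPLE = b"""5 0 obj
<< /Type /Metadata /Subtype /XML >>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <xmp:CreatorTool>ExampleWriter</xmp:CreatorTool>
   <pdf:Producer>ExamplePDFLib 1.0</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
"""


def read_xmp(raw: bytes) -> dict:
    """Return simple text properties from the first XMP packet found."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", raw, re.DOTALL)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    props = {}
    for description in root.iter(RDF + "Description"):
        for child in description:
            # Tags look like "{namespace-uri}LocalName"; keep the local name.
            local_name = child.tag.split("}", 1)[1]
            props[local_name] = (child.text or "").strip()
    return props


props = read_xmp(SAMPLE)
print(props)
```

This only reads flat text properties. Structured XMP values (lists, language alternatives) and compressed streams would need more handling than this sketch provides.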

◆ 3: Object metadata
Metadata stored in a stream is not limited to the document as a whole. For example, an individual image can carry its own information in XMP format. Adobe officially distributes an SDK for reading and modifying XMP metadata in files.

◆ 4: Incremental saving and updates
The PDF ISO standard includes the concept of incremental saving. An incremental save writes only the parts of the PDF file that have changed. This is especially useful when a document is edited on the fly, because it minimizes the work done by background auto-saving.

However, with incremental saving, sensitive or erroneous information that appears to have been deleted actually remains in the file, which can cause trouble. For that reason, PSPDFKit recommends saving by rewriting the whole file (a complete save). This purges the old objects, so earlier versions of data such as PDF form values can no longer be recovered from the file.
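The effect can be sketched with a toy example. Each save normally ends with a `%%EOF` marker, and an incremental update appends a new section after the previous one instead of overwriting it. The byte strings below are heavily simplified and are not valid PDF (the cross-reference tables are elided), and the names `REVISION_1`, `INCREMENTAL_UPDATE`, `count_revisions`, and `earlier_revision` are hypothetical.

```python
# Hypothetical first revision containing a sensitive string in object 4.
REVISION_1 = (
    b"%PDF-1.7\n"
    b"4 0 obj\n(Secret: project budget is 1,000,000)\nendobj\n"
    b"xref\n...\ntrailer\n...\nstartxref\n0\n%%EOF\n"
)

# Hypothetical incremental update that redefines object 4.
INCREMENTAL_UPDATE = (
    b"4 0 obj\n(Redacted)\nendobj\n"
    b"xref\n...\ntrailer\n...\nstartxref\n0\n%%EOF\n"
)

# An incremental save appends the update; nothing earlier is removed.
saved_file = REVISION_1 + INCREMENTAL_UPDATE


def count_revisions(raw: bytes) -> int:
    """Count %%EOF markers as a rough estimate of the number of saves."""
    return raw.count(b"%%EOF")


def earlier_revision(raw: bytes) -> bytes:
    """Return the bytes up to the second-to-last %%EOF marker, if any."""
    marker = b"%%EOF"
    last = raw.rfind(marker)
    previous = raw.rfind(marker, 0, last)
    if previous == -1:
        return raw
    return raw[: previous + len(marker)]


print(count_revisions(saved_file))          # number of %%EOF markers
print(b"Secret" in saved_file)              # old text is still in the file
print(b"Secret" in earlier_revision(saved_file))
```

Counting markers is only a heuristic: the marker bytes could also occur inside stream data, and some files (linearized ones, for example) may contain more than one marker without ever having been incrementally updated.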



◆ 5: PDF comments
In programming, explanations and remarks are often left alongside the code as comments to make the source easier to understand. PDF files work similarly: a comment can be added to the file with "%". As a result, if you open a PDF file in a text editor instead of a PDF renderer, you may see comments left by the producer. When the file is read by PDF viewing software, everything from "%" to the end of the line is ignored, so the file is displayed correctly and the comment does not appear anywhere in the rendered document.
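A simple way to see such comments is to scan the raw bytes line by line. The sketch below uses a made-up sample and a hypothetical helper named `find_comments`. It deliberately ignores the fact that a "%" inside a literal string or inside stream data is not a comment, so on real files it will produce false positives.

```python
# Hypothetical, simplified PDF bytes with two producer comments.
SAMPLE = (
    b"%PDF-1.7\n"
    b"% Generated by the build server, do not edit by hand\n"
    b"1 0 obj\n"
    b"<< /Type /Catalog /Pages 2 0 R >>  % root object\n"
    b"endobj\n"
    b"%%EOF\n"
)


def find_comments(raw: bytes) -> list:
    """Return text from each '%' to end of line, skipping structural markers."""
    comments = []
    for line in raw.splitlines():
        position = line.find(b"%")
        if position == -1:
            continue
        text = line[position:].decode("latin-1")
        # The header and end-of-file marker use comment syntax but are
        # part of the file structure, so they are not reported here.
        if text.startswith("%PDF-") or text.startswith("%%EOF"):
            continue
        comments.append(text)
    return comments


comments = find_comments(SAMPLE)
for comment in comments:
    print(comment)
```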

Even when a PDF file looks like an ordinary document at first glance, various kinds of invisible data are lying dormant in the background. In addition, because PDF supports JavaScript, PSPDFKit says the possibilities are endless. PSPDFKit also urges people to be careful about what information may be lurking in a PDF file, and in particular to use digital signatures to confirm who created a document that contains confidential information.

in Software, Posted by log1i_yk