PDF is an unusual and awkward hybrid format containing both textual data (complete with comments!) and binary data. Thanks to this and other dubious design choices, it's far easier to write a PDF than to read one. One of the most memorable examples is indirect object references: when reading an array, you can only distinguish them from integers once you've already read and parsed two integers and then seen the letter 'R' (e.g. "5 0 R"). A trivial transposition that puts the 'R' first (e.g. "R 5 0") would have simplified parsing greatly, since then the first character is enough to tell you what comes next. I remember having to deal with lots of trivial-yet-easily-fixable annoyances like this when I worked on some PDF parsing code.
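A toy sketch of what that lookahead forces on a parser (hypothetical helper working on whitespace-split tokens only; real PDF lexing is much messier, and arrays can hold other object types): integers must be buffered, because any two of them might retroactively turn out to be an indirect reference once an 'R' appears.

```python
def parse_array_body(tokens):
    """Parse the body of a toy PDF array, e.g. "1 2 5 0 R 3".split()."""
    out = []
    pending = []  # integers that might still become "num gen R"
    for tok in tokens:
        if tok == 'R':
            # the two most recently seen integers were actually a reference
            gen = pending.pop()
            num = pending.pop()
            out.append(('ref', num, gen))
        else:
            pending.append(int(tok))
            # only the last two integers can still be claimed by an 'R',
            # so anything older is definitely a plain integer
            while len(pending) > 2:
                out.append(pending.pop(0))
    out.extend(pending)  # leftovers at the end are plain integers
    return out
```

Had the syntax been "R 5 0" instead, the first token would announce the reference up front and no buffering would be needed.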
This comes straight from the PostScript ancestry and the Forth-style stack, right? Are there problems with implementing this processing Forth-style, so that you parse words, push them on the stack, and pop two operands when you see 'R'?
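For the content-stream side of PDF, that stack style works directly. A minimal sketch (the operator names 'm', 'l', 're' are real content-stream operators, but the arity table and function are illustrative): operands pile up on a stack and each operator pops its arguments, in the PostScript tradition. The wrinkle with references inside arrays is that an integer you've pushed may need to be emitted as a plain value if no 'R' ever claims it, so you can't finalize it until two more tokens have gone by.

```python
def run_content_stream(tokens):
    """Postfix-evaluate a toy PDF content stream, e.g. "10 20 m 30 40 l".split()."""
    stack, calls = [], []
    arity = {'m': 2, 'l': 2, 're': 4}  # moveto, lineto, rectangle
    for tok in tokens:
        if tok in arity:
            n = arity[tok]
            args = stack[-n:]   # operator pops its operands off the stack
            del stack[-n:]
            calls.append((tok, args))
        else:
            stack.append(float(tok))  # operand: push and wait for an operator
    return calls
```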
It’s easiest to see PDF as what it is: a textual format that evolved to allow binary blobs inside it.
You can write a PDF in Notepad, or any other text editor, and if you know your PostScript it isn’t that bad of an experience, as long as you forget about modern features, except for that cross-reference table at the end of the file.
While PDF uses different syntax, it is conceptually similar to the IFF file format from 1985 (whose basic design is still used in formats like PNG and AIFF) in being a container for multiple types of data. PDF is a bit more focused on being an object-based file format specifically, but in an alternate universe PDF could easily have been made to fit into IFF chunks.
I see a lot of complaints about the PDF format here; what's the alternative, though? Is there another format that achieves the same goals which people should rally around?
Just curious given PDF's ubiquity.
(For the record, I had to deal with text encodings in PDFs, and yes, it was a pain)
PDF needs a "v2" format that simply drops backward compatibility with everything that predates UTF-8, along with open converter tools to that new format.
Support only one type of font embedding, with a single encoding (make everything UTF-8): it won't change anything noticeable about file size, and it will at least greatly help with parsing text.
I've only dealt with text parsing, so that's the only easy improvement I see, but I'm pretty sure applying the same logic to graphical content should be possible.
Let's just say I was surprised to see something like this on the HN front page. I expected a high-quality PDF reader in Rust, and having implemented a PDF parser and a TrueType parser myself, I was really eager to see what the same code looked like.
Honestly, I won't be opening an issue that simply asks "is this PDF library aiming to be compliant with the PDF specification regarding text encoding?", because it would feel like insulting the owner on his own project page.
Better to let him discover for himself what a mess the PDF specification is than to point out gaps in the project...