Lopdf: Rust library for PDF document manipulation

userbinator · on April 7, 2019

PDF is an unusual and awkward hybrid format containing both textual (complete with comments!) and binary data. Thanks to this and other dubious design choices, it's far easier to write a PDF than to read one. One of the most memorable examples is indirect object references; when reading an array, you can only distinguish them from integers once you've already read and parsed 2 integers and then seen the letter 'R' (e.g. "5 0 R"). A trivial transposition that puts the 'R' first (e.g. "R 5 0") would've simplified parsing greatly, since then the first character is enough to know what comes next. I remember having to deal with lots of trivial, easily fixable annoyances like this when I worked on some PDF parsing code.

fulafel · on April 8, 2019

This comes straight from the PostScript ancestry and the Forth-style stack, right? Are there problems associated with implementing this processing Forth-style, so you parse words and put them on the stack, and pop 2 operands when you see 'R'?
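
A minimal sketch of that Forth-style approach, for illustration only. The `Obj` type and the `parse_array_body` function are hypothetical names, not lopdf's API, and the sketch assumes an array body that holds nothing but integers and indirect references:

```rust
use std::convert::TryFrom;

// Hypothetical object type for illustration; not lopdf's API.
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Int(i64),
    Ref { id: u32, generation: u16 },
}

/// Parse the whitespace-separated body of an array that holds only
/// integers and indirect references, e.g. "1 5 0 R 7".
fn parse_array_body(src: &str) -> Result<Vec<Obj>, String> {
    let mut stack: Vec<Obj> = Vec::new();
    for word in src.split_whitespace() {
        if word == "R" {
            // Operands were pushed as "id generation", so pop in reverse order.
            let generation = stack.pop();
            let id = stack.pop();
            match (id, generation) {
                (Some(Obj::Int(id)), Some(Obj::Int(generation))) => {
                    let id = u32::try_from(id)
                        .map_err(|_| format!("bad object number {}", id))?;
                    let generation = u16::try_from(generation)
                        .map_err(|_| format!("bad generation number {}", generation))?;
                    stack.push(Obj::Ref { id, generation });
                }
                _ => return Err("'R' needs two integers before it".to_string()),
            }
        } else {
            // Anything else must be an integer in this simplified grammar.
            let n: i64 = word
                .parse()
                .map_err(|_| format!("unexpected token {:?}", word))?;
            stack.push(Obj::Int(n));
        }
    }
    Ok(stack)
}
```

Under these assumptions, `parse_array_body("1 5 0 R 7")` yields an integer, a reference to object 5 generation 0, and another integer. The stack approach works, but every integer stays provisional until the following tokens have been read, which is the lookahead cost described above.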

Someone · on April 8, 2019

It’s easiest to see it as what it is: a textual format that evolved allowing binary blobs inside it.

You can write a PDF in Notepad, or any text editor, and, if you know your PostScript, it isn’t that bad of an experience if you forget about modern features, except for that table of contents (the cross-reference table) at the end of the file.

See https://brendanzagaeski.appspot.com/0004.html

userbinator · on April 8, 2019

It's a textual format... with byte offsets. That's the most unusual part, and it's been there since the beginning (PDF 1.0).
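
To make the byte-offset point concrete, here is a rough sketch of how a writer might record offsets while assembling a file. The `build_pdf` function is a hypothetical helper, not part of lopdf; it assumes the objects are already serialized as text, that every object has generation 0, and that object 1 is the document catalog:

```rust
/// Assemble a PDF-style file from already-serialized object bodies and
/// return the bytes together with the byte offset of each object.
/// Hypothetical helper for illustration; assumes object 1 is the catalog.
fn build_pdf(objects: &[&str]) -> (Vec<u8>, Vec<usize>) {
    let mut out: Vec<u8> = Vec::new();
    let mut offsets: Vec<usize> = Vec::new();

    out.extend_from_slice(b"%PDF-1.4\n");

    for (i, body) in objects.iter().enumerate() {
        // The cross-reference table needs the offset of the "N 0 obj" line.
        offsets.push(out.len());
        out.extend_from_slice(format!("{} 0 obj\n{}\nendobj\n", i + 1, body).as_bytes());
    }

    // The trailer points at the byte offset of the table itself.
    let xref_start = out.len();
    out.extend_from_slice(format!("xref\n0 {}\n", objects.len() + 1).as_bytes());
    // Entry 0 is the head of the free list; each entry is exactly 20 bytes.
    out.extend_from_slice(b"0000000000 65535 f \n");
    for off in &offsets {
        out.extend_from_slice(format!("{:010} 00000 n \n", off).as_bytes());
    }

    out.extend_from_slice(
        format!(
            "trailer\n<< /Size {} /Root 1 0 R >>\nstartxref\n{}\n%%EOF\n",
            objects.len() + 1,
            xref_start
        )
        .as_bytes(),
    );

    (out, offsets)
}
```

Because the table stores absolute positions, inserting or deleting a single byte in an earlier object invalidates every later entry, which is why hand-editing in a text editor gets awkward at exactly this point.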

peapicker · on April 8, 2019

While PDF uses different syntax, it is conceptually similar to the IFF file format from 1985 (the basis of which is still used in some file formats like PNG and AIFF) in being a format that can contain multiple types of data. PDF is a little bit more focused on being an object-based file format specifically, but one could have made PDF fit into IFF boxes easily in an alternative universe.

muizelaar · on April 7, 2019

Here's the beginnings of a PDF viewer built with lopdf and WebRender: https://github.com/srijs/rpdf/commits/master

netghost · on April 8, 2019

I see a lot of complaints about the PDF format here, what's the alternative though? Is there another format that achieves the same goals which people should rally around? Just curious given PDF's ubiquity.

(For the record, I had to deal with text encodings in PDFs, and yes, it was a pain)

bsaul · on April 8, 2019

PDF needs to have a « v2 » format that simply drops backward compatibility for everything that predates UTF-8, along with open tools to convert to this new format.

Only support one type of font embedding, with a single encoding (make everything UTF-8): it won’t change anything noticeable regarding the file size, and will greatly help with text parsing at least.

I’ve only dealt with text parsing, so that’s the only easy improvement I see, but I’m pretty sure the same logic could be applied to graphic content.

fulafel · on April 8, 2019

There's PDF/A.

bsaul · on April 8, 2019

Thanks for the info, I'd never heard of that. It's very refreshing to see that people are indeed trying to update the standard to remove the cruft.

Sidenote: it seems the PDF/A format is open, yet the specification is kept behind a paywall?

meruru · on April 8, 2019

Yes, DjVu: https://en.wikipedia.org/wiki/DjVu

statingobvious8 · on April 8, 2019

Basic HTML5 with inline CSS and images embedded via data URIs? Browsers are just as ubiquitous as PDF readers.

timClicks · on April 7, 2019

OT perhaps but does anyone know a readable reference for PDF opcodes? The example code feels somewhat opaque:

  let content = Content {
   operations: vec![
    Operation::new("BT", vec![]),
    Operation::new("Tf", vec!["F1".into(), 48.into()]),
    Operation::new("Td", vec![100.into(), 600.into()]),
    Operation::new("Tj", vec![Object::string_literal("Hello World!")]),
    Operation::new("ET", vec![]),
   ],
  };

mkl · on April 7, 2019

The specification itself [1] is mostly pretty readable. For the opcodes, you want Appendix A.

[1] "PDF Reference, Sixth Edition, version 1.7" at https://www.adobe.com/devnet/pdf/pdf_reference_archive.html
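
As a rough illustration of what those five operations mean, the sketch below renders them into content-stream text. The `Operand` and `Op` types and the `render` function are hypothetical stand-ins, not lopdf's types; the sketch assumes `F1` is a font resource name and skips string escaping:

```rust
// Hypothetical stand-ins for illustration; lopdf's own types differ.
enum Operand {
    Name(&'static str),
    Int(i64),
    Str(&'static str),
}

struct Op {
    operator: &'static str,
    operands: Vec<Operand>,
}

/// Render operations as content-stream text. Content streams are
/// postfix: the operands come first, then the operator.
fn render(ops: &[Op]) -> String {
    let mut out = String::new();
    for op in ops {
        for operand in &op.operands {
            match operand {
                Operand::Name(n) => out.push_str(&format!("/{} ", n)),
                Operand::Int(i) => out.push_str(&format!("{} ", i)),
                // Simplification: no escaping of '(', ')' or '\' in the string.
                Operand::Str(s) => out.push_str(&format!("({}) ", s)),
            }
        }
        out.push_str(op.operator);
        out.push('\n');
    }
    out
}

/// The same five operations as the quoted example, with their meanings.
fn hello_world() -> Vec<Op> {
    vec![
        // BT: begin a text object.
        Op { operator: "BT", operands: vec![] },
        // Tf: select font resource F1 at size 48.
        Op { operator: "Tf", operands: vec![Operand::Name("F1"), Operand::Int(48)] },
        // Td: move the text position to x = 100, y = 600.
        Op { operator: "Td", operands: vec![Operand::Int(100), Operand::Int(600)] },
        // Tj: show a string at the current text position.
        Op { operator: "Tj", operands: vec![Operand::Str("Hello World!")] },
        // ET: end the text object.
        Op { operator: "ET", operands: vec![] },
    ]
}
```

Calling `render(&hello_world())` produces one line per operation, such as `/F1 48 Tf` and `(Hello World!) Tj`, which is the form in which the operators appear in the specification's tables.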

fourier_mode · on April 7, 2019

An example of the resulting generated PDF would've been great.

propter_hoc · on April 8, 2019

How does this compare to PDFBox (or the free versions of iText)?

fxfan · on April 8, 2019

You cannot even extract (or search for) non-ASCII strings if the PDF file was created in a certain way. That's so 1980.

PDF is not exactly guaranteed to be portable, and it is a format that needs to be dumped by anyone who cares about global portability.

afiori · on April 8, 2019

Still, for all its negative sides, it is the best approximation of a portable and reliable "just works" document format.

bsaul · on April 7, 2019

[flagged]

WhatIsDukkha · on April 7, 2019

https://github.com/J-F-Liu/lopdf/issues

I don't see an issue open?

Maybe type something like this, but a bit more politely, into the GitHub issues?

Be more constructive than a drive-by on HN?

bsaul · on April 7, 2019

Let's say I was surprised to see something like that on the HN front page. I expected a high-quality PDF reader in Rust, and having implemented a PDF parser + TrueType parser myself, I was really eager to see how the same code looked.

Honestly, I won't be opening an issue that simply says "is this PDF lib aiming to be compliant with the PDF specification regarding text encoding?", because it would be like insulting the owner on his own project page.

Better to let him discover for himself what a mess the PDF specification is than to point at gaps in the project...

tdhz77 · on April 7, 2019

*Is this compliant with the PDF spec?