Hacker News
Lopdf: Rust library for PDF document manipulation (github.com/j-f-liu)
160 points by adamnemecek on April 7, 2019 | 21 comments


PDF is an unusual and awkward hybrid format containing both textual (complete with comments!) and binary data. Thanks to this and other dubious design choices, it's far easier to write a PDF than to read one. One of the most memorable examples is indirect object references; when reading an array, you can only distinguish them from integers once you've already read and parsed 2 integers, and then see the letter 'R' (e.g. "5 0 R"). A trivial transposition that puts the 'R' first (e.g. "R 5 0") would've simplified parsing greatly, since then the first character is enough to know what comes next. I remember having to deal with lots of trivial-yet-easily-fixable annoyances like this when I worked on some PDF parsing code.


This comes straight from the PostScript ancestry and the Forth-style stack, right? Are there problems with implementing this processing Forth-style, so that you parse words, push them on the stack, and pop 2 operands when you see 'R'?
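To make the stack idea concrete, here is a minimal sketch in Rust. It assumes a heavily simplified array body that contains only integers and the `R` keyword; real PDF arrays also hold names, strings, nested arrays and dictionaries. The `Obj` type and `parse_array_body` function are hypothetical names for illustration and are not lopdf's API.

```rust
/// Objects that can appear in this simplified array body.
#[derive(Debug, PartialEq)]
enum Obj {
    Int(i64),
    Ref { id: u32, generation: u16 },
}

/// Parse whitespace-separated tokens Forth-style: integers are pushed,
/// and `R` pops the two most recent integers and pushes a reference.
fn parse_array_body(src: &str) -> Result<Vec<Obj>, String> {
    let mut stack: Vec<Obj> = Vec::new();
    for tok in src.split_whitespace() {
        if tok == "R" {
            // Operands were pushed in order, so the generation is on top.
            let generation = stack.pop();
            let id = stack.pop();
            match (id, generation) {
                (Some(Obj::Int(id)), Some(Obj::Int(generation)))
                    if (1..=u32::MAX as i64).contains(&id)
                        && (0..=u16::MAX as i64).contains(&generation) =>
                {
                    stack.push(Obj::Ref {
                        id: id as u32,
                        generation: generation as u16,
                    });
                }
                _ => {
                    return Err(
                        "`R` must follow an object number and a generation number".to_string(),
                    )
                }
            }
        } else {
            let n: i64 = tok
                .parse()
                .map_err(|_| format!("unexpected token `{}`", tok))?;
            stack.push(Obj::Int(n));
        }
    }
    Ok(stack)
}
```

With this approach, `parse_array_body("1 5 0 R 2")` yields an integer, a reference to object 5 generation 0, and another integer. No lookahead is needed, at the cost of briefly holding two integers that later turn out to be parts of a reference.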


It’s easiest to see it as what it is: a textual format that evolved allowing binary blobs inside it.

You can write a PDF in Notepad, or any text editor, and, if you know your PostScript, it isn’t that bad of an experience if you forget about modern features, except for that table of contents (the cross-reference table) at the end of the file.

See https://brendanzagaeski.appspot.com/0004.html


It's a textual format... with byte offsets. That's the most unusual part, and it's been there since the beginning (PDF 1.0).
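As a rough illustration of why byte offsets make hand-editing awkward, the sketch below writes a few object bodies and then emits a classic cross-reference table pointing at them. `build_pdf` is a hypothetical helper, the header version `%PDF-1.4` is an arbitrary choice, and the object bodies passed in are assumed to be valid PDF syntax with object 1 as the catalog.

```rust
/// Serialize object bodies into a minimal PDF skeleton, recording byte offsets.
fn build_pdf(objects: &[&str]) -> Vec<u8> {
    let mut out: Vec<u8> = Vec::new();
    out.extend_from_slice(b"%PDF-1.4\n");

    // Byte offset of each "N 0 obj" line, measured from the start of the file.
    let mut offsets = Vec::with_capacity(objects.len());
    for (i, body) in objects.iter().enumerate() {
        offsets.push(out.len());
        out.extend_from_slice(format!("{} 0 obj\n{}\nendobj\n", i + 1, body).as_bytes());
    }

    // Cross-reference table: one fixed-width, 20-byte entry per object.
    let xref_start = out.len();
    out.extend_from_slice(format!("xref\n0 {}\n", objects.len() + 1).as_bytes());
    out.extend_from_slice(b"0000000000 65535 f \n"); // entry 0 heads the free list
    for off in &offsets {
        out.extend_from_slice(format!("{:010} 00000 n \n", off).as_bytes());
    }

    // The trailer ends with the byte offset of the `xref` keyword itself.
    out.extend_from_slice(
        format!(
            "trailer\n<< /Size {} /Root 1 0 R >>\nstartxref\n{}\n%%EOF\n",
            objects.len() + 1,
            xref_start
        )
        .as_bytes(),
    );
    out
}
```

Inserting or deleting a single byte in any object shifts every later offset, which is why a file edited by hand usually needs its table regenerated (or repaired by a tolerant reader).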


While PDF uses different syntax, it is conceptually similar to the IFF file format from 1985 (the basis of which is still used in some file formats like PNG and AIFF) in being a format that can contain multiple types of data. PDF is a little bit more focused on being an object-based file format in specific, but one could have made PDF fit into IFF boxes easily in an alternative universe.
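For comparison, an IFF-style container is simple enough to walk in a few lines. This is a sketch only: it assumes the basic chunk layout of a four-byte type ID, a four-byte big-endian length, the payload, and a pad byte after odd-length payloads. It does not handle nested group chunks such as `FORM`, and the `Chunk` and `read_chunks` names are made up for illustration.

```rust
/// One IFF-style chunk: a four-byte type ID and its payload.
#[derive(Debug, PartialEq)]
struct Chunk<'a> {
    id: [u8; 4],
    data: &'a [u8],
}

/// Split a byte buffer into a flat list of IFF-style chunks.
fn read_chunks(mut buf: &[u8]) -> Result<Vec<Chunk<'_>>, String> {
    let mut chunks = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 8 {
            return Err("truncated chunk header".to_string());
        }
        let id = [buf[0], buf[1], buf[2], buf[3]];
        let len = u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]) as usize;
        let rest = &buf[8..];
        if rest.len() < len {
            return Err("truncated chunk data".to_string());
        }
        chunks.push(Chunk { id, data: &rest[..len] });
        // Odd-length payloads are followed by one pad byte; tolerate a
        // missing pad byte at the very end of the buffer.
        let padded = len + (len % 2);
        buf = &rest[padded.min(rest.len())..];
    }
    Ok(chunks)
}
```

The contrast with PDF is that every chunk announces its own length up front, so a reader can skip unknown chunk types without parsing them or consulting an offset table.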


Here's the beginnings of a pdf viewer built with lopdf and WebRender: https://github.com/srijs/rpdf/commits/master


I see a lot of complaints about the PDF format here, what's the alternative though? Is there another format that achieves the same goals which people should rally around? Just curious given PDF's ubiquity.

(For the record, I had to deal with text encodings in PDFs, and yes, it was a pain)


PDF needs a "v2" format that simply drops backward compatibility for everything that predates UTF-8, along with open tools to convert existing files to the new format.

Only support one type of font embedding, with a single encoding (make everything UTF-8): it won’t change anything noticeable regarding the file size, and will greatly help with parsing text, at least.

I’ve only dealt with text parsing, so that’s the only easy improvement I see, but I’m pretty sure following the same logic for graphic content should be possible.


There's PDF/A.


Thanks for the info, I'd never heard of that. It's very refreshing to see that people are indeed trying to update the standard to remove the cruft.

Side note: it seems the PDF/A format is open, yet the specification is kept behind a paywall?



Basic HTML5 with inline CSS and images embedded via data URIs? Browsers are just as ubiquitous as PDF readers.


OT perhaps but does anyone know a readable reference for PDF opcodes? The example code feels somewhat opaque:

  let content = Content {
      operations: vec![
          Operation::new("BT", vec![]),
          Operation::new("Tf", vec!["F1".into(), 48.into()]),
          Operation::new("Td", vec![100.into(), 600.into()]),
          Operation::new("Tj", vec![Object::string_literal("Hello World!")]),
          Operation::new("ET", vec![]),
      ],
  };


The specification itself [1] is mostly pretty readable. For the opcodes, you want Appendix A.

[1] "PDF Reference, Sixth Edition, version 1.7" at https://www.adobe.com/devnet/pdf/pdf_reference_archive.html
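For the five operators in that example, a small sketch may help show what ends up in the content stream. PDF content streams are postfix: operands come first and the operator name last. The `Operand` type and both functions below are hypothetical stand-ins written for this illustration, not lopdf's `Object` or `Operation`, and the exact bytes lopdf emits may differ in whitespace.

```rust
/// Hypothetical operand type for this sketch (not lopdf's `Object`).
enum Operand {
    Name(&'static str),
    Int(i64),
    Text(&'static str),
}

/// Render one operation in postfix order: operands first, operator last.
fn encode_op(operator: &str, operands: &[Operand]) -> String {
    let mut parts: Vec<String> = operands
        .iter()
        .map(|o| match o {
            Operand::Name(n) => format!("/{}", n),
            Operand::Int(i) => i.to_string(),
            // Literal strings go in parentheses; backslashes and
            // parentheses inside them are escaped with a backslash.
            Operand::Text(s) => {
                let escaped = s
                    .replace('\\', "\\\\")
                    .replace('(', "\\(")
                    .replace(')', "\\)");
                format!("({})", escaped)
            }
        })
        .collect();
    parts.push(operator.to_string());
    parts.join(" ")
}

/// The same five operations as the quoted example, one per line.
fn hello_world_stream() -> String {
    [
        encode_op("BT", &[]), // begin a text object
        encode_op("Tf", &[Operand::Name("F1"), Operand::Int(48)]), // select font F1 at size 48
        encode_op("Td", &[Operand::Int(100), Operand::Int(600)]), // move the text position
        encode_op("Tj", &[Operand::Text("Hello World!")]), // show the string
        encode_op("ET", &[]), // end the text object
    ]
    .join("\n")
}
```

Read this way, the example amounts to "start text, pick font F1 at 48 units, move to (100, 600), draw the string, end text". `F1` is a resource name that must be defined in the page's font resources.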


An example of the resulting generated PDF would've been great.


How does this compare to pdfbox (or the free versions of itext)?


You cannot even extract (or search for) non-ASCII strings if the PDF file was created in a certain way. That's so 1980.

PDF is not exactly guaranteed to be portable, and it is a format that needs to be dumped by anyone who cares about global portability.


Still, for all its negative sides, it is the best approximation of a portable and reliable "just works" document format.


[flagged]


https://github.com/J-F-Liu/lopdf/issues

I don't see an issue open?

Maybe type something like this, but maybe a bit more polite, into the github issues?

Be more constructive than a drive-by on HN?


Let's say I was surprised to see something like that on the HN front page. I expected a high-quality PDF reader in Rust, and having implemented a PDF parser and a TrueType parser myself, I was really eager to see how the same code looked.

Honestly, I won't be opening an issue that simply says "is this PDF lib aiming to be compliant with the PDF specification regarding text encoding?", because it would be like insulting the owner on their own project page.

Better to let him discover for himself what a mess the PDF specification is than to point at gaps in the project...


*Is this compliant with the PDF spec?



