Every so often I come across a PDF document that has been protected to prevent copying. You probably won’t notice this for a while until your copy and paste fails a few times. Finally you will see that the Edit -> Copy menu in Adobe Reader has been disabled/grayed out. Other readers like Okular show a more meaningful “Copying forbidden by DRM” message to alert you to what is going on.
The idea behind this feature is to prevent copying, but it is completely pointless because of the “if you can see it, you can copy it” principle, more commonly referred to as the analogue hole. A determined thief can photograph every page, bypassing the content protection completely. Sure, it’s not a digital copy of the text, but they still have a digital image of the text and, more importantly, they have the confidential information you were trying to protect.
So all this protection achieves is forcing legitimate users to type text extracts instead of copying them. This will of course lead to transcription errors, which will cause confusion down the line when your customer tries to order part number XI20458-AA7 when it is listed as X12O4513B-AA7 in the document. Can you spot the three differences?
For the moment, the solution is to convert the PDF to text using the pdftotext tool from the poppler-utils package on Kubuntu.
$ aptitude show poppler-utils
Package: poppler-utils
[snip]
Description: PDF utilitites (based on libpoppler)
 This package contains pdftops (PDF to PostScript converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), pdftohtml (PDF to HTML converter), pdftotext (PDF to text converter), and pdffonts (PDF font analyzer).
To install on Debian-based systems:
$ sudo apt-get install poppler-utils
Once this is installed, you can simply run the command:
$ pdftotext bla.pdf
This will output a file called bla.txt with all the text from the document. Oddly enough running
results in the error “Error: Copying of text from this document is not allowed.”, so it remains to be seen how long it will be before pdftotext is patched to remove this feature.
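If you want to know up front whether a PDF carries the copy restriction, pdfinfo from the same package should report it: as far as I know, its "Encrypted:" line lists the permissions, including copy:no. What follows is only a rough sketch built on that assumption. The function names (copy_restricted, txt_name, extract_text) are my own invention, not part of poppler-utils, and the output file name is passed to pdftotext explicitly rather than relying on its default naming.

```shell
#!/bin/sh
# Sketch only: assumes pdfinfo and pdftotext (poppler-utils) are on the PATH.
# All function names below are hypothetical helpers, not poppler commands.

# copy_restricted: reads pdfinfo output on stdin and succeeds if the
# "Encrypted:" line reports copy:no (assumed output format).
copy_restricted() {
    grep -q '^Encrypted:.*copy:no'
}

# txt_name: build the output name we want, e.g. bla.pdf -> bla.txt.
# A name without a .pdf suffix simply gets .txt appended.
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# extract_text: warn if the PDF is flagged copy:no, then convert it.
extract_text() {
    pdf=$1
    out=$(txt_name "$pdf")
    if pdfinfo "$pdf" | copy_restricted; then
        echo "note: $pdf is flagged copy:no" >&2
    fi
    # The output file is given explicitly as the second argument.
    pdftotext "$pdf" "$out" && echo "wrote $out"
}
```

After sourcing this into your shell, running extract_text bla.pdf would print the note for a protected document and then attempt the conversion to bla.txt. Whether the conversion itself succeeds depends on how your build of pdftotext treats the restriction.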