Can’t copy from a PDF file.

Every so often I come across a PDF document that has been protected to prevent copying. You probably won’t notice this until copy and paste has failed a few times. Eventually you will see that the Edit -> Copy menu in Adobe Reader is disabled (grayed out). Other readers, like Okular, show a more meaningful “Copying forbidden by DRM” message to tell you what is going on.
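A quicker way to spot the restriction is to ask pdfinfo, assuming the poppler-utils package (introduced further down) is installed. Its "Encrypted:" line lists the permission flags, and "copy:no" marks a copy-protected file. This is only a sketch: the helper name pdf_copy_restricted is made up for illustration, and the exact output format may differ between poppler versions.

```shell
# pdf_copy_restricted is a hypothetical helper name, not part of poppler.
# It succeeds (exit status 0) when pdfinfo reports that copying is disallowed,
# e.g. a line like: "Encrypted:  yes (print:yes copy:no change:no ...)".
pdf_copy_restricted() {
    pdfinfo "$1" 2>/dev/null | grep -q '^Encrypted:.*copy:no'
}

# Usage:
#   if pdf_copy_restricted bla.pdf; then
#       echo "bla.pdf has the copy restriction set"
#   fi
```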

The logic behind this feature is to prevent copying, but it is completely pointless because of the “if you can see it, you can copy it” principle, more commonly referred to as the analogue hole. A determined thief can photograph every page, bypassing the content protection completely. Sure, it’s not a digital copy of the text, but they still have a digital image of the text and, more importantly, they have the confidential information you were trying to protect.

So all this protection achieves is to force legitimate users to type text extracts instead of copying them. This will of course lead to transcription errors, which will cause confusion down the line when your customer tries to order part number XI20458-AA7 when it is listed as X12O4513B-AA7 in the document. Can you spot the three differences?

For the moment, the solution is to convert the PDF to text using the pdftotext tool from the poppler-utils package on Kubuntu.

$ aptitude show poppler-utils
Package: poppler-utils
[snip]
Description: PDF utilitites (based on libpoppler)
 This package contains pdftops (PDF to PostScript converter), pdfinfo (PDF
 document information extractor), pdfimages (PDF image extractor), pdftohtml
 (PDF to HTML converter), pdftotext (PDF to text converter), and pdffonts
 (PDF font analyzer).

To install on Debian-based GNU/Linux systems:

$ sudo apt-get install poppler-utils

Once this is installed you can simply run the command

$ pdftotext bla.pdf

This will output a file called bla.txt with all the text from the document. Oddly enough running

$ pdftohtml bla.pdf

results in the error “Error: Copying of text from this document is not allowed.”, so it remains to be seen how long it will be before pdftotext is patched to refuse as well.
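For a whole directory of PDFs, the single command above can be wrapped in a loop. The following is a minimal sketch, assuming pdftotext is on the PATH; pdfs_to_text is a made-up helper name, and the -layout option (which asks pdftotext to keep the original physical layout) is optional. Whether a protected file is converted or refused depends on the pdftotext build in use.

```shell
# pdfs_to_text is a hypothetical helper name, not part of poppler-utils.
# Converts every *.pdf in the given directory (default: current directory)
# to a .txt file with the same base name, next to the original.
pdfs_to_text() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # glob matched nothing
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}

# Usage:
#   pdfs_to_text ~/Documents/manuals
```

Naming the output file explicitly keeps each .txt file beside its PDF regardless of the directory the function is called from.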


5 Responses to Can’t copy from a PDF file.

  1. Jurn says:

    I use MS Office’s OneNote. Insert / Screen Clipping / then right-click on the clipping, and select “copy text from picture”. It does OCR from small and non-system fonts very well.

  2. ken_fallon says:

    Hi Jurn,

    Will it allow you to copy from a ‘protected’ document ?

    Ken

  3. reine says:

    Nice HPR on cron, Ken, keep on trucking!

  4. tomtom says:

    Use Okular. (on Linux)
    If you get “Copying forbidden by DRM”, simply go to “Settings” -> “Configure Okular” and in the “General Options”, “Program Feature” disable “Obey DRM limitations”.

  5. Ray says:

    tomtom, thanks for the tip! works perfect. DRM crap pisses me off!
