ACA Blog

Better PDF previews in Liferay without ImageMagick

Jan Eerdekens

The problem

While doing some work for a client I ran into issues with the preview generation for some PDFs. Liferay did generate a preview image for each page of the document, but the text on it was sometimes strange, to say the least. In the table below you can see a screenshot of the first page of the problematic PDF, the preview that is generated by Liferay, and the preview we’re able to generate after our hack:

[Screenshots of the first page of “Normenboek zonder architect”: actual PDF, before hack, after hack]

The biggest problem is the font, but the background is also a bit screwy. My initial thought was that the PDF might be using some special font(s) without (correctly) embedding them. So my first inclination was to see how to add fonts to the system that could be picked up by whatever Liferay was using to generate the preview images.

By default Liferay generates previews of PDF files with a pure Java library called PDFBox. There’s also the option of using an OS native install of ImageMagick (and Ghostscript), but that would require at least an additional 2 GB of memory outside of the JVM allocation. As this wasn’t an option in this case, I first looked into the font option. While this does seem to be possible in PDFBox, by editing the PDFBox_External_Fonts.properties file that can be found inside the JAR and adding the additional fonts, I couldn’t quite get it to work: instead of strange/wrong characters I now got no characters at all.

After some more Googling it seems that the PDFBox that Liferay 6.2 uses, which is version 1.8.2, is known for having a lot of font issues. Most of these seem to be better/fixed in the 2.0.0 version… sadly enough this version hasn’t been released yet. But in cases like this you sometimes need to take a page out of Ayrton Senna’s book and push the limit a bit:


On a given day, a given circumstance, you think you have a limit. And you then go for this limit and you touch this limit, and you think, ‘Okay, this is the limit’. And so you touch this limit, something happens and you suddenly can go a little bit further. With your mind power, your determination, your instinct, and the experience as well, you can fly very high.

— Ayrton Senna

The solution

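As you might expect, you can’t just drop in a snapshot version of the PDFBox 2.0.0 JAR and expect everything to be solved. It doesn’t quite work like that, for three reasons: PDFBox consists of multiple JARs (pdfbox, fontbox and jempbox), some classes moved to different packages, and the API changed significantly between versions.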

While there are 3 JARs in Liferay’s lib directory that are part of PDFBox, only two of them, pdfbox and fontbox, are actually used in the preview generation and will need to be switched out. The jempbox JAR is used for PDF XMP metadata extraction, but can be left in its original state (in the 2.0.0 version jempbox has been renamed to xmpbox). If you take a 2.0.0-SNAPSHOT build of these two JARs, use them to replace the ones in the Liferay WEB-INF/lib directory and restart, you’ll run into the other two problems described below.

From the stack trace you get, you’ll see that Liferay uses PDFBox not only for preview and thumbnail generation but also, via Apache Tika, for text extraction. Tika uses PDFBox’s PDFTextStripper class (and some auxiliary ones) for this, which were moved from the package org.apache.pdfbox.util to org.apache.pdfbox.text in PDFBox 2.0.0. Because we don’t want to patch Tika as well, we’ll just move those classes back to their original package and call it a day.
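One way to check that a repackaged JAR exposes the class under the name Tika expects is to try loading it under both package names. This is only a sketch: the class PackageCheck and the helper firstLoadable are made-up names for illustration, and it has to be run with the patched JARs on the classpath to say anything useful.

```java
import java.util.Optional;

public class PackageCheck {

    /** Returns the first class name in the list that can be loaded, if any. */
    public static Optional<String> firstLoadable(String... classNames) {
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, don't run its static initializers
                Class.forName(name, false, PackageCheck.class.getClassLoader());
                return Optional.of(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // not available under this name, try the next candidate
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> found = firstLoadable(
            "org.apache.pdfbox.util.PDFTextStripper",   // package Tika expects (PDFBox 1.8.x)
            "org.apache.pdfbox.text.PDFTextStripper");  // package used by PDFBox 2.0.0
        System.out.println(found.orElse("PDFTextStripper not found on the classpath"));
    }
}
```

If the first name is printed, the classes are back where Tika looks for them; if the second one is printed, the JAR still uses the 2.0.0 package layout.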

This brings us to our last, and biggest, problem: there have been some significant code changes/refactorings in PDFBox between version 1.8.2 and the snapshot we’d like to use. The first change we run into is related to the previous problem. In version 1.8.2 the PDFTextStripper class had a method setForceParsing(boolean), which isn’t present anymore in 2.0.0, but which we’ll just add back with an empty implementation:
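A minimal sketch of such a no-op method. To keep it compilable on its own it is shown on a hypothetical stand-in class, PatchedTextStripper, rather than on the real PDFBox source; the hasMethod helper is also only there for illustration and mimics the kind of method lookup that was failing.

```java
public class ForceParsingShim {

    /** Hypothetical stand-in for the patched PDFTextStripper; not the real PDFBox class. */
    public static class PatchedTextStripper {

        /**
         * Re-added only so that callers built against PDFBox 1.8.x can still
         * find the method. Intentionally does nothing with its argument.
         */
        public void setForceParsing(boolean forceParsing) {
            // no-op
        }
    }

    /** Returns true if the class has a public method with this name and parameter types. */
    public static boolean hasMethod(Class<?> type, String name, Class<?>... params) {
        try {
            type.getMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        new PatchedTextStripper().setForceParsing(true); // links and returns without doing anything
        System.out.println(hasMethod(PatchedTextStripper.class, "setForceParsing", boolean.class));
    }
}
```

In the real patch the same empty method body would go into the PDFTextStripper source of the snapshot before rebuilding the JAR.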

While it is a bit strange to solve a NoSuchMethodException like this, it doesn’t seem to have been a critical part, because afterwards the text extraction worked again like before. This means we can finally get to the important part: fixing the API incompatibilities in Liferay’s PDF preview generation code, which lives in the LiferayPDFBoxConverter class. Due to some refactorings in PDFBox this class won’t find the getAllPages() method on PDDocumentCatalog anymore. To fix this you’ll need to take the source of this class, modify it, and then replace the original class, located in Liferay’s portal-impl.jar, with your modified one. We did this using a fancy JAR/WAR overlay system that we use in our build/deploy system (which we’ll cover in a blog post someday), but there are of course other ways to do this, such as manually patching the JAR or using an extlet.

When you add the source of the LiferayPDFBoxConverter class to a simple project you’ll also see that some other stuff won’t compile because of missing/changed methods. For this we’ll need to make some changes to the generateImagesPB() methods so they look like the ones below:
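The general shape of the change is to stop iterating over the page list from getAllPages() and render each page by index instead. The sketch below illustrates only that loop and is not Liferay’s actual generateImagesPB() code: PreviewLoopSketch, PageRenderer and generatePreviews are hypothetical names, and the renderer is hidden behind an interface so the example runs without PDFBox. The PDFBox calls mentioned in the comments (PDFRenderer.renderImageWithDPI, PDDocument.getNumberOfPages) are from the released 2.0 API and are an assumption here; a given snapshot build may differ.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

public class PreviewLoopSketch {

    /** Hypothetical abstraction for "render page N at this DPI"; not a Liferay or PDFBox type. */
    public interface PageRenderer {
        BufferedImage render(int pageIndex, float dpi) throws IOException;
    }

    /**
     * Renders pages 0..pageCount-1 and writes each one as a PNG file
     * named preview-1.png, preview-2.png, ... Returns the number of files written.
     */
    public static int generatePreviews(
            PageRenderer renderer, int pageCount, float dpi, File outputDir)
            throws IOException {

        int written = 0;
        for (int i = 0; i < pageCount; i++) {
            BufferedImage image = renderer.render(i, dpi);
            File target = new File(outputDir, "preview-" + (i + 1) + ".png");
            if (ImageIO.write(image, "png", target)) {
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Dummy renderer so the sketch runs without PDFBox. With PDFBox 2.0 it could be:
        //   PDFRenderer pdfRenderer = new PDFRenderer(document);
        //   (i, dpi) -> pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB)
        // with pageCount taken from document.getNumberOfPages().
        PageRenderer dummy = (i, dpi) -> new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        File dir = Files.createTempDirectory("previews").toFile();
        System.out.println(generatePreviews(dummy, 3, 72f, dir) + " previews written to " + dir);
    }
}
```

Keeping the renderer behind a small interface also makes it easy to compare the output of the old and the new PDFBox on the same document.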

With this modified class in place, together with the updated and tweaked PDFBox JARs, the PDF preview generation (and text extraction) should work again and produce far better results than before. To make life easier for developers who, like me, like to live on the edge, here’s some helpful code:

Blog written by Jan Eerdekens

Liferay Expert at ACA IT-Solutions


Belgian Java and Liferay developer with some weird interests, eternal complainer, atheist, skeptic and geocacher.