Engineers to the rescue: When your public employment service makes it impossible to apply for jobs 

The background

InfoJobs.net is Adevinta’s job board in Spain and Italy. A place where companies can hire the right talent, and where people can find their next job.

InfoJobs, much like any other marketplace in Adevinta, makes use of YAMS (Your Adevinta Media Service) to store and retrieve media assets. But, unlike other marketplaces which mainly store images, InfoJobs stores, transforms and retrieves mostly documents (résumés, CVs, cover letters, etc.).

YAMS offers users the possibility to upload a document (in basically any format) and transform it into a PDF. In order to do so, YAMS parses these objects by means of two key projects:

1. LibreOffice (open source)

2. UniPDF (commercial)

The first is used to transform any document (that is not already a PDF) into a PDF, and the second is used to perform extra transformations on the output PDF, for example, applying a watermark. After the input document has been transformed, it is delivered to the requesting users via a CDN.

We are the team responsible for creating and maintaining the YAMS service, the Edge team, and this is our story of customer obsession, collaboration and solidarity.

The issue

One Monday morning, we received a message from one of InfoJobs' engineers about YAMS returning errors (HTTP 500) while trying to fetch certain PDF documents.

We examined our log files and found that a bunch of PDF files were causing transformation failures with a never-seen-before error from UniPDF.

The team dug deeper and spotted the issue: these files, uploaded by some InfoJobs users, looked like corrupted PDFs. Because they were not formally valid, UniPDF failed to parse them, throwing an error. Our routines caught this error but, instead of classifying it as a file format error (HTTP 4xx), treated it as a failure inside YAMS itself (HTTP 500).

We have seen plenty of examples of corrupted and invalid objects being uploaded to YAMS, so we did what we usually do. In these cases the right course of action is to instruct YAMS to:

1. Flag these files as corrupted.

2. Return the correct error code.

Then we retrieved the offending documents and added them to the YAMS test suite. This is part of our standard way of working, as it helps to make sure that we will keep returning the expected error code when similarly broken files are encountered in the future.
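The classification step described above can be sketched roughly as follows. This is a minimal illustration, not YAMS code: `PdfParseError`, `TransformResult` and `transform` are hypothetical names, and 422 is only one possible choice of 4xx status, since the exact code YAMS returns is not stated here.

```python
from dataclasses import dataclass
from typing import Callable


class PdfParseError(Exception):
    """Hypothetical error raised when the PDF parser rejects a file."""


@dataclass
class TransformResult:
    status: int       # HTTP status returned to the client
    corrupted: bool   # whether the stored object gets flagged as corrupted


def transform(parse: Callable[[bytes], object], data: bytes) -> TransformResult:
    """Run `parse` on the uploaded bytes and classify the outcome."""
    try:
        parse(data)
    except PdfParseError:
        # The input is at fault: flag it and answer with a client error.
        # 422 is an assumption; any suitable 4xx code would do.
        return TransformResult(status=422, corrupted=True)
    except Exception:
        # Anything else is treated as a fault in the service itself.
        return TransformResult(status=500, corrupted=False)
    return TransformResult(status=200, corrupted=False)
```

A regression test for a newly discovered broken file would then feed its bytes through `transform` and assert that the status stays in the 4xx range.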

The doubt

Normally that would have been the end of it, had we not noticed that all the offending files were generated by the “Servicio Canario de Empleo” (Canary Islands Employment Service – SCE) and that they were all résumés of people looking for employment. This realisation prompted us to investigate a little further.

We asked ourselves: is it a coincidence that all the files causing this new error have been generated by the same organisation? Could it be that these files are not actually corrupted, but instead conflicting with the way UniPDF does its parsing? Is there something we can do for these files to be processed correctly? Can we lend a hand to the less tech-savvy of InfoJobs’ user base?

So we tried a few things: we updated the UniPDF library to the latest version and reprocessed the broken files, but we still got the same error. Then we tested those files with other tools (e.g. qpdf, pdfinfo), but the validity check failed each time.
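For readers who want to script this kind of check, here is a small sketch that wraps `qpdf --check`. It assumes qpdf is installed and on the `PATH`, and it relies on qpdf's documented exit codes (0 for no problems, 2 for errors, 3 for warnings only). The function names and verdict strings are our own illustration, not part of any tool.

```python
import subprocess

# Exit codes documented by qpdf: 0 = clean, 2 = errors, 3 = warnings only.
QPDF_VERDICTS = {0: "valid", 2: "errors", 3: "warnings"}


def verdict_from_exit_code(code: int) -> str:
    """Translate a qpdf exit code into a short, human-readable verdict."""
    return QPDF_VERDICTS.get(code, "unknown")


def check_pdf(path: str) -> str:
    """Run `qpdf --check` on `path` and summarise the result.

    Raises FileNotFoundError if the qpdf binary is not installed.
    """
    proc = subprocess.run(
        ["qpdf", "--check", path],
        capture_output=True,
        text=True,
    )
    return verdict_from_exit_code(proc.returncode)
```

Calling `check_pdf("resume.pdf")` (a placeholder file name) returns one of the verdict strings, which makes it easy to run the same check over a whole folder of suspicious uploads.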

In a last-ditch attempt, we tried to open the files with different viewers (e.g. Adobe Acrobat Reader, macOS Preview, LibreOffice Draw, Google Chrome, Mozilla Firefox) and, to our great surprise, they were all able to display the PDFs.

At this point, we had to get to the bottom of the issue.

The communication

The next morning, during our daily sync meeting, it was clear that we had two options: talk to the SCE or ask the UniPDF people. So we decided to do… both.

While some of us contacted the technical office of the Canary Islands local government, others wrote a support request to UniPDF.

The people from the Canary Islands Employment Service were kind enough to provide us with a PDF generated by their systems that didn’t contain any personal information (as opposed to the files uploaded by the users via InfoJobs that contained all sorts of sensitive data). We attached this file to our UniPDF support request, hoping we could leverage their expertise to make sense of this inexplicable behaviour.

At the same time, we started our own internal investigation.

The rabbit hole

We obtained the latest version of the PDF 1.x standard (ISO 32000-1) and started learning about it. At the same time, we started taking apart the “broken” PDFs produced by the SCE. It did not take long before we realised that none of the SCE documents complied with the PDF standard.

Specifically, we found out that they were the result of two completely independent PDFs being appended one after the other inside the same file.

The first PDF is the original document (a person’s résumé), while the second is the same document again but with an additional signature footer explaining how to verify its authenticity.

This finding would explain why UniPDF was unable to parse the files. Why other readers were able to visualise the files successfully remained a mystery.
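A rough way to spot this kind of file is to look for a second `%PDF-` header that appears after the first `%%EOF` marker. The sketch below is only a heuristic written for illustration, not the check we or UniPDF actually run: the function names are hypothetical, and a file that embeds another PDF in an uncompressed stream could trigger a false positive.

```python
import re

# A PDF starts with a header such as "%PDF-1.4" and ends with "%%EOF".
HEADER = re.compile(rb"%PDF-\d\.\d")
EOF_MARKER = b"%%EOF"


def pdf_header_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every PDF header found in `data`."""
    return [m.start() for m in HEADER.finditer(data)]


def looks_concatenated(data: bytes) -> bool:
    """True when another PDF header appears after the first %%EOF marker.

    Several %%EOF markers alone are not suspicious: incremental updates
    legitimately append new sections, each ending with its own %%EOF.
    """
    offsets = pdf_header_offsets(data)
    if len(offsets) < 2:
        return False
    first_eof = data.find(EOF_MARKER, offsets[0])
    return first_eof != -1 and any(off > first_eof for off in offsets[1:])
```

Keying on the header rather than on `%%EOF` is deliberate, because only the former distinguishes two independent documents from a single document that has been updated in place.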

In the meantime, we received a reply from the UniPDF technical support team, which basically highlighted the same issues we had already found.

The analysis

We analysed the test PDF generated by the SCE to understand why our system was unable to handle it correctly. Our investigation revealed that the file was composed of two PDFs, joined one after the other.

As this is incompatible with the PDF standard, the tools we used to check the documents' integrity reported errors when attempting to validate them.

Validity check with qpdf
