A guest post by Vlad Patryshev, who was born in Russia, has an education in math (including the now-popular category theory), and works at HealthExpense.com. A functional programmer and a Scala fan since 2008, he is now an organizer of the Scala BASE and Bay Area Categories and Types meetups.

I will start this post by describing a problem I’ve been having, and along the way I will share the solution I arrived at. I need to download content from certain websites, and the content includes PDF files. These sites use the HTTPS protocol, and I use Selenium to control the browser. Unfortunately, HTTPS makes it impossible to intercept the content in a proxy, so I have to extract the content once it is already in the browser.

Recent changes in Mozilla introduced pdf.js, a beautiful, well-written, very powerful script that renders PDF content right in the browser. The content is represented as a sequence of sophisticated <div> sections, where style specifies location, rotation, scale, and size. This sequence of <div> is embedded into an HTML document that has buttons for printing, downloading, resizing and more. If you use a Mozilla browser, you are no doubt familiar with this solution.

Selenium allows you to retrieve the page content (the content of elements) or execute any JavaScript code, so you can easily retrieve the section with the rendered document. Unfortunately, that section is not the PDF: the original binary PDF content is hidden, and that is what I want to extract.

The solution to my problem can be found in Scala, the language that is turning into a Swiss army knife these days.

Interaction with the Browser

You can tell the browser to load a page, and you can run any JavaScript inside the current browser window. But it takes time to load a page. Unfortunately, browser-side reactive JavaScript programming and our client-side programming style don’t interoperate well, so we have to wait at the border for a condition to become true.
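Waiting at the border usually comes down to polling. The following is a minimal sketch of such a wait, not the code used in this post: the names waitFor, timeoutMillis, and stepMillis are hypothetical, and the standard Either stands in for the Result type described further on.

```scala
object Polling {
  /** Re-checks `condition` every `stepMillis` until it holds or `timeoutMillis` has elapsed.
    * Returns Right(()) when the condition became true, Left(message) on timeout. */
  def waitFor(what: String, timeoutMillis: Long, stepMillis: Long = 100)
             (condition: => Boolean): Either[String, Unit] = {
    val deadline = System.currentTimeMillis + timeoutMillis
    var ok = condition
    while (!ok && System.currentTimeMillis < deadline) {
      Thread.sleep(stepMillis)
      ok = condition // by-name parameter: evaluated again on every pass
    }
    if (ok) Right(()) else Left(s"Timed out after $timeoutMillis ms waiting for $what")
  }
}
```

With a real browser, the condition would be a call that checks whether the expected element is already present on the page.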

When JavaScript executes on the browser side, it returns an object that can be a String, a number, or an array of such objects. Here is an example:

This returns an array with two elements (Array("pwd", password)). Well, not exactly. What if the element is not there? An exception will be thrown in JavaScript; runJS intercepts it and transforms it into an error report. So runJS actually returns an instance of Result[Object], whose value is either Good(value: Object) or Bad(error: String). (Read more about the Result class).
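To make the Good/Bad idea concrete, here is a simplified sketch. The Result type below is a stand-in written for illustration, not the actual Result class, and the execute parameter is a hypothetical placeholder for the Selenium call that runs a script in the browser, so that the example works without one.

```scala
object ResultSketch {
  // Simplified stand-in for the Result class mentioned in the text.
  sealed trait Result[+T]
  final case class Good[T](value: T) extends Result[T]
  final case class Bad(error: String) extends Result[Nothing]

  // `execute` is an assumption: a plain function standing in for the browser.
  // Any exception it throws is turned into a Bad with an error report.
  def runJS(script: String)(execute: String => Object): Result[Object] =
    try Good(execute(script))
    catch { case e: Exception => Bad(s"Failed to run <<$script>>: ${e.getMessage}") }
}
```

A caller then pattern-matches on Good or Bad instead of wrapping every call in try/catch.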

Working with untyped data coming from JavaScript can be a mental challenge for a programmer used to strict typing. Fortunately, in our specific case, we can limit ourselves to just strings.

Still, when something in the browser takes time and requires a continuation-passing style, we don’t have much of a choice; we just patiently check if the result is available.

Extracting the Document, JavaScript Side

Now we can assume pdf.js has our document, but it won’t give it away easily. There is no “getter” method; the code is written in modern continuation-passing style. So, we have to pass some continuations.

What is happening here: we declare a variable, extractedPdf, into which the document, in the form of a byte buffer, will be stored. Then we (in extractPdfContent) get the data Promise (read a good explanation of Promise, in Scala) and tell the Promise that, when data is available, it should call our “good news listener,” or, on error, call our “bad news listener.”
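The browser-side code is JavaScript, but the same shape (one variable, one "good news" listener, one "bad news" listener) can be sketched in Scala with the standard Future API. This is only an analogy under stated assumptions: extracted and extractContent are hypothetical names, and nothing here talks to pdf.js.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object PromiseSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Plays the role of the variable the listeners fill in:
  // None until the data (or an error message) has arrived.
  @volatile var extracted: Option[Either[String, Array[Byte]]] = None

  // Registers both listeners and returns immediately;
  // the result shows up in `extracted` some time later.
  def extractContent(data: Future[Array[Byte]]): Unit =
    data.onComplete {
      case Success(bytes) => extracted = Some(Right(bytes))           // good news
      case Failure(err)   => extracted = Some(Left(err.getMessage))   // bad news
    }
}
```

Note that extractContent returns nothing useful, which is why the caller later has to check whether the variable has been filled in.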

We don’t know when pdf.js will finish its job of extracting the PDF. We can only guess, and in practice that is what we do.

When the PDF bytes are available, we have to turn them into a form that can be passed back. Base64 would be nice, but it’s too much trouble, so let’s just convert the bytes to hex digits. For this we have the following piece of JavaScript:

That’s all we need in JavaScript. Now let’s tackle the client side of the story.

Extracting the Document, Scala Side

In our client code we need to do the following steps:

  • load the page with the document (e.g.
  • wait until the document actually loads
  • tell the browser to give us the binary data
  • make sure we have the binary
  • retrieve the binary
  • store its bytes in a file
  • profit

In this example, we also want to see how pdf.js rendered the file, that is, how it turned the PDF into the sequence of <div>s mentioned at the beginning of this text.

Some of these actions return values, and some do not; all of these actions may fail. And we, Scala programmers, do not want our code to look like this:

No. In Scala you can do better than this.

The whole code looks like this (https://gist.github.com/vpatryshev/7076235#file-gistfile1-scala):

So, what exactly is happening in this code?

Line 1. The function is called downloadPDF, and it returns Result[(File, String)] – that is, either a pair (file, htmlContent), wrapped in Good, or an instance of Bad with the list of errors.

Line 2. We call a method that loads a page at a given URL. This method returns Result, and Result has a method called andThen. This is very similar to a semicolon: ignore the previous result and proceed, or, on error, just stop here. So, if the page did not load, we return Bad with an explanation, and that’s it.
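The semicolon-like behavior of andThen can be sketched as follows. The Result type here is a simplified stand-in, not the actual class; the by-name parameter is an assumption made so that the next step does not run at all after a failure.

```scala
object AndThenSketch {
  sealed trait Result[+T] {
    // Like a semicolon: drop this value and evaluate `next`,
    // unless this step already failed, in which case `next` never runs.
    def andThen[U](next: => Result[U]): Result[U] = this match {
      case Good(_)  => next
      case bad: Bad => bad
    }
  }
  final case class Good[T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]
}
```

Chaining several steps with andThen therefore yields the result of the last step, or the first Bad encountered.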

Line 3. We wait until the specified element shows up in the document. On failure, we stop here, and on success we call andThen.

Line 4. We call extractPdfContent() in JavaScript; this method does not return anything meaningful, but it prompts pdf.js to give us our bytes. On failure we just stop here; on success we proceed.

Line 5. We don’t know how much time pdf.js will spend giving us the bytes it already has. It runs in a separate thread, and there’s no reason to believe yield() would be enough, so we generously give it a second. We are on the receiving side of the data flow, and we are not reactive, so we must be patient.

Line 6. Now we call intBuf2hex, retrieving the bytes pdf.js gave us and converting them to a string. Since the result of runJS is just Result[Object], on success we have to convert the value to String. The string is actually a sequence of ASCII hex digits, and it is stored inside Good (in case of errors we have another Bad).

Line 6. We call flatMap on extracted because what we do inside may result in a failure, so it’s a Result inside Result – it needs flattening.
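The difference between map and flatMap here can be shown with a small sketch. Result is again a simplified stand-in, and checkHex is a hypothetical step that can fail on its own, used only to produce a Result inside a Result.

```scala
object FlatMapSketch {
  sealed trait Result[+T] {
    def map[U](f: T => U): Result[U] = this match {
      case Good(v)  => Good(f(v))
      case bad: Bad => bad
    }
    // Same as map, but `f` itself returns a Result, so no extra wrapping is added.
    def flatMap[U](f: T => Result[U]): Result[U] = this match {
      case Good(v)  => f(v)
      case bad: Bad => bad
    }
  }
  final case class Good[T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]

  // Hypothetical failing step: accepts only non-empty, even-length strings of hex digits.
  def checkHex(s: String): Result[String] =
    if (s.nonEmpty && s.length % 2 == 0 && s.forall(c => Character.digit(c, 16) >= 0)) Good(s)
    else Bad(List(s"Not a hex string: '$s'"))
}
```

With map, the outcome of checkHex would be wrapped as Good(Good(...)); flatMap returns the inner Result directly.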

Line 7. Given a string with hexes, we first decode it into an array of bytes, and then send the result to a new file. Where do these functions, decodeHex and #> come from? Let’s digress.

I have a function that, by the pattern known as ‘pimp my library’, appends some methods to class String.

This means, every time I write someString.decodeHex, the compiler finds an implicit that builds a class that has a method, decodeHex – in our case an anonymous class provided by powerString().
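As an illustration of the pattern, here is a minimal sketch that adds decodeHex to String. It uses an implicit class instead of the implicit conversion function powerString() described above; the class name PowerString and the validation details are assumptions made for this example.

```scala
object StringOps {
  implicit class PowerString(s: String) {
    /** Decodes pairs of hex digits into bytes.
      * Throws IllegalArgumentException if the input is not valid hex. */
    def decodeHex: Array[Byte] = {
      require(s.length % 2 == 0, s"Odd number of hex digits: ${s.length}")
      s.grouped(2).map { pair =>
        val hi = Character.digit(pair(0), 16)
        val lo = Character.digit(pair(1), 16)
        require(hi >= 0 && lo >= 0, s"Not a hex pair: $pair")
        ((hi << 4) | lo).toByte
      }.toArray
    }
  }
}
```

After import StringOps._, an expression such as "25504446".decodeHex compiles even though String itself has no such method.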

Similarly, I have another pimping implicit:

Having a byte array, we can apply a method #> that dumps the contents to a file, returning a Result[File]. Result.attempt() catches exceptions and transforms them to Bad(errors).
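A sketch of how #> and Result.attempt might fit together is shown below. Both are simplified stand-ins for the real ones, the wrapper class name PowerBytes is hypothetical, and the file is written with java.nio.file.Files, which is an assumption about the implementation.

```scala
import java.io.File
import java.nio.file.Files

object FileOps {
  sealed trait Result[+T]
  final case class Good[T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]

  object Result {
    // Evaluates `op`; any exception becomes a Bad carrying its description.
    def attempt[T](op: => T): Result[T] =
      try Good(op) catch { case e: Exception => Bad(List(e.toString)) }
  }

  implicit class PowerBytes(bytes: Array[Byte]) {
    /** Writes the bytes to `file`; returns Good(file), or Bad if writing fails. */
    def #>(file: File): Result[File] =
      Result.attempt { Files.write(file.toPath, bytes); file }
  }
}
```

Because #> never throws, an expression like bytes #> file can be used directly inside flatMap.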

So, now you see that lines 6-7 produce a value of type Result[File], Good or Bad.

Line 8. So far each step depended upon the previous one, and if we had an error, we did not bother to call anything. In this line, though, we extract the HTML (produced by rendering the PDF), whatever the result of the previous operations.

Line 9. pdf <*> html – this is called (in math) a tensor product, and, since Result is an applicative functor, it combines two results into one, giving either a list of errors or a tuple (File, String) wrapped in Good.
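The way <*> combines two independent results can be sketched like this. Result is once more a simplified stand-in; in particular, concatenating the two error lists is an assumption about how errors are accumulated.

```scala
object ProductSketch {
  sealed trait Result[+T] {
    // Pairs two independent results; if either side failed, all errors are kept.
    def <*>[U](other: Result[U]): Result[(T, U)] = (this, other) match {
      case (Good(a), Good(b)) => Good((a, b))
      case (Bad(e1), Bad(e2)) => Bad(e1 ++ e2)
      case (bad: Bad, _)      => bad
      case (_, bad: Bad)      => bad
    }
  }
  final case class Good[T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]
}
```

Unlike andThen, neither side is skipped: both results already exist, and the operator only decides how to merge them.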

Conclusion

Summing up, in these nine lines shown earlier, we pull a PDF from somewhere, dump it into a file, and obtain both the file and the rendered HTML, taking care of all possible errors while doing it.

This is the power of functional programming, and this is the power of Scala (and JavaScript, too).

See below for some Scala, PDF, and JavaScript resources from Safari Books Online.

Read these titles on Safari Books Online

Developing with PDF helps you understand how to work with PDF to construct your own documents, troubleshoot problems, and even build your own tools.
Scala Cookbook helps you save time and trouble when using Scala to build object-oriented, functional, and concurrent applications. With more than 250 ready-to-use recipes and 700 code examples, this comprehensive cookbook covers the most common problems you’ll encounter when using the Scala language, libraries, and tools.
Maintainable JavaScript helps you learn how to write maintainable JavaScript code that other team members can easily understand, adapt, and extend. Author Nicholas Zakas assembled this collection of best practices as a front-end tech leader at Yahoo!, after completing his own journey from solo hacker to team player.

About the author

Vlad Patryshev was born in Russia and has an education in math (including the now-popular category theory). In the US since 1998, he worked at Borland (JBuilder team), then on various teams at Google, where his 20% project, an onscreen keyboard, is now available on various Google pages. He then worked at several Bay Area startups and now works at HealthExpense.com. A functional programmer and a Scala fan since 2008, he is now an organizer of the Scala BASE and Bay Area Categories and Types meetups. As a hobby, he rides his road bike, builds decks, and drives around the US (Key West to the Polar Circle in Alaska). He can be reached at @vpatryshev.
