Cover by Donald Bruce Stewart, Bryan O'Sullivan, John Goerzen

Warming Up: Portably Splitting Lines of Text

Haskell provides a built-in function, lines, that lets us split a text string on line boundaries. It returns a list of strings with line termination characters omitted:

ghci> :type lines
lines :: String -> [String]
ghci> lines "line 1\nline 2"
["line 1","line 2"]
ghci> lines "foo\n\nbar\n"
["foo","","bar"]

While lines looks useful, it relies on us reading a file in text mode in order to work. Text mode is a feature common to many programming languages; it provides special behavior when we read and write files on Windows. When we read a file in text mode, the file I/O library translates the line-ending sequence "\r\n" (carriage return followed by newline) to "\n" (newline alone), and it does the reverse when we write a file. On Unix-like systems, text mode performs no translation. (Both readFile and writeFile operate in text mode.) Because of this difference, if we read a file on one platform that was written on the other, the line endings are likely to become a mess:

ghci> lines "a\r\nb"
["a\r","b"]

The lines function splits only on newline characters, leaving carriage returns dangling at the ends of lines. If we read a Windows-generated text file on a Linux or Unix box, we’ll get trailing carriage returns at the end of each line.

We have comfortably used Python’s universal newline support for years; this transparently handles Unix and Windows line-ending conventions for us. We would like to provide something similar in Haskell.

Since we are still early in our career of reading Haskell code, we will discuss our Haskell implementation in some detail:

-- file: ch04/SplitLines.hs
splitLines :: String -> [String]

Our function’s type signature indicates that it accepts a single string, the contents of a file with some unknown line-ending convention. It returns a list of strings, one for each line of the file:

-- file: ch04/SplitLines.hs
splitLines [] = []
splitLines cs =
    let (pre, suf) = break isLineTerminator cs
    in  pre : case suf of 
                ('\r':'\n':rest) -> splitLines rest
                ('\r':rest)      -> splitLines rest
                ('\n':rest)      -> splitLines rest
                _                -> []

isLineTerminator :: Char -> Bool
isLineTerminator c = c == '\r' || c == '\n'

Before we dive into detail, notice first how we organized our code. We presented the important pieces of code first, keeping the definition of isLineTerminator until later. Because we have given the helper function a readable name, we can guess what it does even before we’ve read it, which makes the code smoother to read.

The Prelude defines a function named break that we can use to partition a list into two parts. It takes a function as its first parameter. That function must examine an element of the list and return a Bool to indicate whether to break the list at that point. The break function returns a pair, which consists of the sublist consumed before the predicate returned True (the prefix) and the rest of the list (the suffix):

ghci> break odd [2,4,5,6,8]
([2,4],[5,6,8])
ghci> :module +Data.Char
ghci> break isUpper "isUpper"
("is","Upper")

Since we only need to match a single carriage return or newline at a time, examining each element of the list one by one is good enough for our needs.
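
To make the behavior of break more concrete, here is a minimal sketch of how such a function could be written by hand. The name myBreak is our own, introduced purely for illustration; this is not the Prelude's actual source, only a definition that should behave the same way on finite lists:

```haskell
-- myBreak is a hypothetical stand-in for the Prelude's break,
-- written out by hand for illustration.
myBreak :: (a -> Bool) -> [a] -> ([a], [a])
myBreak _ [] = ([], [])
myBreak p xs@(x:rest)
    | p x       = ([], xs)        -- stop here: x becomes the head of the suffix
    | otherwise = (x : pre, suf)  -- keep x in the prefix and carry on
    where (pre, suf) = myBreak p rest
```

Applying myBreak odd to [2,4,5,6,8] gives ([2,4],[5,6,8]), the same pair that break returned above. Notice that the element that satisfies the predicate stays at the head of the suffix, which is why splitLines has to strip the line terminator itself.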

The first equation of splitLines indicates that if we match an empty string, we have no further work to do.

In the second equation, we first apply break to our input string. The prefix is the substring before a line terminator, and the suffix is the remainder of the string. The suffix will include the line terminator, if any is present.

The pre : expression tells us that we should add the pre value to the front of the list of lines. We then use a case expression to inspect the suffix, so we can decide what to do next. The result of the case expression will be used as the second argument to the (:) list constructor.

The first pattern matches a string that begins with a carriage return, followed by a newline. The variable rest is bound to the remainder of the string. The other patterns are similar, so they ought to be easy to follow.
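
One way to see that the case expression is simply the second argument to (:) is to give it a name. The sketch below restates the same function with the case expression bound to a variable. The names splitLinesNamed and remaining are ours, introduced only for illustration, and the behavior is intended to match that of splitLines:

```haskell
-- A restatement of splitLines with the case expression named.
-- splitLinesNamed and remaining are hypothetical names.
splitLinesNamed :: String -> [String]
splitLinesNamed [] = []
splitLinesNamed cs = pre : remaining
  where
    (pre, suf) = break isLineTerminator cs
    -- remaining is the list of lines that follow the first one
    remaining = case suf of
        ('\r':'\n':rest) -> splitLinesNamed rest
        ('\r':rest)      -> splitLinesNamed rest
        ('\n':rest)      -> splitLinesNamed rest
        _                -> []

isLineTerminator :: Char -> Bool
isLineTerminator c = c == '\r' || c == '\n'
```

The order of the patterns matters. If the lone '\r' pattern came first, a "\r\n" pair would match it, and the leftover '\n' would then be treated as the end of a second, empty line.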

A prose description of a Haskell function isn’t necessarily easy to follow. We can gain a better understanding by stepping into ghci and observing the behavior of the function in different circumstances.

Let’s start by partitioning a string that doesn’t contain any line terminators:

ghci> splitLines "foo"
["foo"]

Here, our application of break never finds a line terminator, so the suffix it returns is empty:

ghci> break isLineTerminator "foo"
("foo","")

The case expression in splitLines must thus be matching on the fourth branch, and we’re finished. What about a slightly more interesting case?

ghci> splitLines "foo\r\nbar"
["foo","bar"]

Our first application of break gives us a nonempty suffix:

ghci> break isLineTerminator "foo\r\nbar"
("foo","\r\nbar")

Because the suffix begins with a carriage return followed by a newline, we match on the first branch of the case expression. This gives us pre bound to "foo", suf bound to "\r\nbar", and rest bound to "bar". We apply splitLines recursively, this time on "bar" alone:

ghci> splitLines "bar"
["bar"]

The result is that we construct a list whose head is "foo" and whose tail is ["bar"]:

ghci> "foo" : ["bar"]
["foo","bar"]
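
The walkthrough above exercises only the first and fourth branches of the case expression. As a sketch, the checks below run the same definition on inputs that reach the remaining branches. The names branchExamples and allBranchesAgree and the sample strings are ours, and the expected results were worked out by hand from the definition:

```haskell
splitLines :: String -> [String]
splitLines [] = []
splitLines cs =
    let (pre, suf) = break isLineTerminator cs
    in  pre : case suf of
                ('\r':'\n':rest) -> splitLines rest
                ('\r':rest)      -> splitLines rest
                ('\n':rest)      -> splitLines rest
                _                -> []

isLineTerminator :: Char -> Bool
isLineTerminator c = c == '\r' || c == '\n'

-- Each pair is (input, expected result). Hypothetical sample data.
branchExamples :: [(String, [String])]
branchExamples =
    [ ("a\rb",         ["a","b"])          -- lone carriage return
    , ("a\nb",         ["a","b"])          -- lone newline
    , ("a\rb\nc\r\nd", ["a","b","c","d"])  -- mixed conventions
    , ("foo\n\nbar",   ["foo","","bar"])   -- a blank line is kept
    , ("foo\n",        ["foo"])            -- a trailing terminator adds no line
    ]

-- True when splitLines produces the expected result for every example.
allBranchesAgree :: Bool
allBranchesAgree =
    all (\(input, expected) -> splitLines input == expected) branchExamples
```

Evaluating allBranchesAgree in ghci should give True.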

This sort of experimenting with ghci is a helpful way to understand and debug the behavior of a piece of code. It has an even more important benefit that is almost accidental in nature. It can be tricky to test complicated code from ghci, so we will tend to write smaller functions, which can further help the readability of our code.

This style of creating and reusing small, powerful pieces of code is a fundamental part of functional programming.

A Line-Ending Conversion Program

Let’s hook our splitLines function into the little framework that we wrote earlier. Make a copy of the InteractWith.hs source file; let’s call the new file FixLines.hs. Add the splitLines function to the new source file. Since our function must produce a single String, we must stitch the list of lines back together. The Prelude provides an unlines function that concatenates a list of strings, adding a newline to the end of each:

-- file: ch04/SplitLines.hs
fixLines :: String -> String
fixLines input = unlines (splitLines input)

If we replace the id function with fixLines, we can compile an executable that will convert a text file to our system’s native line ending:

$ ghc --make FixLines
[1 of 1] Compiling Main             ( FixLines.hs, FixLines.o )
Linking FixLines ...
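
For reference, the core of FixLines.hs might look roughly like the sketch below. InteractWith.hs is not reproduced in this section, so the interactWith helper here is only an assumption about its general shape (read one file, apply a function, write another), and runFixLines is a name of our own:

```haskell
-- Sketch of FixLines.hs. interactWith and runFixLines are assumed
-- shapes for the framework code, not copied from InteractWith.hs.

-- Read inputFile, transform its contents, and write them to outputFile.
interactWith :: (String -> String) -> FilePath -> FilePath -> IO ()
interactWith function inputFile outputFile = do
    input <- readFile inputFile
    writeFile outputFile (function input)

-- Dispatch on the list of command-line arguments.
runFixLines :: [String] -> IO ()
runFixLines [input, output] = interactWith fixLines input output
runFixLines _               = putStrLn "error: exactly two arguments needed"

fixLines :: String -> String
fixLines input = unlines (splitLines input)

splitLines :: String -> [String]
splitLines [] = []
splitLines cs =
    let (pre, suf) = break isLineTerminator cs
    in  pre : case suf of
                ('\r':'\n':rest) -> splitLines rest
                ('\r':rest)      -> splitLines rest
                ('\n':rest)      -> splitLines rest
                _                -> []

isLineTerminator :: Char -> Bool
isLineTerminator c = c == '\r' || c == '\n'
```

To turn this into a program, import getArgs from System.Environment and define main = getArgs >>= runFixLines. In memory, fixLines always produces "\n" endings (fixLines "a\r\nb" is "a\nb\n"); it is writeFile, operating in text mode, that turns these into the platform's native convention.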

If you are on a Windows system, find and download a text file that was created on a Unix system (for example, gpl-3.0.txt [http://www.gnu.org/licenses/gpl-3.0.txt]). Open it in the standard Notepad text editor. The lines should all run together, making the file almost unreadable. Process the file using the FixLines command you just created, and open the output file in Notepad. The line endings should now be fixed up.

On Unix-like systems, the standard pagers and editors hide Windows line endings, making it more difficult to verify that FixLines is actually eliminating them. Here are a few commands that should help:

$ file gpl-3.0.txt
gpl-3.0.txt: ASCII English text
$ unix2dos gpl-3.0.txt
unix2dos: converting file gpl-3.0.txt to DOS format ...
$ file gpl-3.0.txt
gpl-3.0.txt: ASCII English text, with CRLF line terminators
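
A similar check can be made from Haskell. As a small sketch, the function below (countCarriageReturns is a name of our own) counts the carriage returns in a string. On a Unix-like system, where text mode performs no translation, we would expect it to report zero for the output of FixLines:

```haskell
-- Count the carriage return characters in a string.
-- countCarriageReturns is a hypothetical helper, not a library function.
countCarriageReturns :: String -> Int
countCarriageReturns = length . filter (== '\r')
```

In ghci on a Unix-like system, something like fmap countCarriageReturns (readFile "gpl-3.0.txt") should report a nonzero count for the file converted by unix2dos above.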
