Ruby Cookbook by Leonard Richardson, Lucas Carlson

Chapter 1. Strings

Ruby is a programmer-friendly language. If you are already familiar with object-oriented programming, Ruby should quickly become second nature. If you've struggled with learning object-oriented programming or are not familiar with it, Ruby should make more sense to you than other object-oriented languages because Ruby's methods are consistently named, concise, and generally act the way you expect.

Throughout this book, we demonstrate concepts through interactive Ruby sessions. Strings are a good place to start because not only are they a useful data type, they're easy to create and use. They provide a simple introduction to Ruby, a point of comparison between Ruby and other languages you might know, and an approachable way to introduce important Ruby concepts like duck typing (see Recipe 1.12), open classes (demonstrated in Recipe 1.10), symbols (Recipe 1.7), and even Ruby gems (Recipe 1.20).

If you use Mac OS X or a Unix environment with Ruby installed, go to your command line right now and type irb. If you're using Windows, you can download and install the One-Click Installer from http://rubyforge.org/projects/rubyinstaller/, and do the same from a command prompt (you can also run the fxri program, if that's more comfortable for you). You've now entered an interactive Ruby shell, and you can follow along with the code samples in most of this book's recipes.

Strings in Ruby are much like strings in other dynamic languages such as Perl, Python, and PHP. They're not very different from strings in Java and C. Ruby strings are dynamic, mutable, and flexible. Get started with strings by typing this line into your interactive Ruby session:

	string = "My first string"

You should see some output that looks like this:

	=> "My first string"

You typed in a Ruby expression that created a string "My first string", and assigned it to the variable string. The value of that expression is just the new value of string, which is what your interactive Ruby session printed out on the right side of the arrow. Throughout this book, we'll represent this kind of interaction in the following form:[1]

	string = "My first string"                 # => "My first string"

In Ruby, everything that can be assigned to a variable is an object. Here, the variable string points to an object of class String. That class defines over a hundred built-in methods: named pieces of code that examine and manipulate the string. We'll explore some of these throughout the chapter, and indeed the entire book. Let's try out one now: String#length, which returns the number of bytes in a string. Here's a Ruby method call:

	string.length                              # => 15

Many programming languages make you put parentheses after a method call:

	string.length()                            # => 15

In Ruby, parentheses are almost always optional. They're especially optional in this case, since we're not passing any arguments into String#length. If you're passing arguments into a method, it's often more readable to enclose the argument list in parentheses:

	string.count 'i'                           # => 2 # "i" occurs twice.
	string.count('i')                          # => 2

The return value of a method call is itself an object. In the case of String#length, the return value is the number 15, an instance of the Fixnum class. We can call a method on this object as well:

	string.length.next                         # => 16

Let's take a more complicated case: a string that contains non-ASCII characters. This string contains the French phrase "il était une fois," encoded as UTF-8:[2]

	french_string = "il \xc3\xa9tait une fois"   # => "il \303\251tait une fois"

Many programming languages (notably Java) treat a string as a series of characters. Ruby treats a string as a series of bytes. The French string contains 14 letters and 3 spaces, so you might think Ruby would say the length of the string is 17. But one of the letters (the e with acute accent) is represented as two bytes, and that's what Ruby counts:

	french_string.length                       # => 18

For more on handling different encodings, see Recipe 1.14 and Recipe 11.12. For more on this specific problem, see Recipe 1.8.
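If you want the byte count and the character count side by side, the following sketch is one way to get both. It assumes an interpreter that provides String#bytesize (the earliest 1.8 releases do not), and it uses String#unpack with the "U*" directive to decode the bytes as UTF-8. The variable names byte_count and char_count are ours.

```ruby
# Assumes String#bytesize is available in your Ruby.
french_string = "il \xc3\xa9tait une fois"

# bytesize counts raw bytes, so the two-byte e-acute contributes 2.
byte_count = french_string.bytesize               # => 18

# unpack("U*") decodes the bytes as UTF-8 and returns one code point
# per character, so the array length is the character count.
char_count = french_string.unpack("U*").length    # => 17
```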

You can represent special characters in strings (like the binary data in the French string) with string escaping. Ruby does different types of string escaping depending on how you create the string. When you enclose a string in double quotes, you can encode binary data into the string (as in the French example above), and you can encode newlines with the code "\n", as in other programming languages:

	puts "This string\ncontains a newline"
	# This string
	# contains a newline

When you enclose a string in single quotes, the only special codes you can use are "\'" to get a literal single quote, and "\\" to get a literal backslash:

	puts 'it may look like this string contains a newline\nbut it doesn\'t'
	# it may look like this string contains a newline\nbut it doesn't

	puts 'Here is a backslash: \\'
	# Here is a backslash: \

This is covered in more detail in Recipe 1.5. Also see Recipes 1.2 and 1.3 for more examples of the more spectacular substitutions double-quoted strings can do.
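A quick way to see the difference between the two quoting styles is to compare string lengths. This is a small sketch; the variable names are ours:

```ruby
double_quoted = "a\nb"    # \n becomes a single newline character
single_quoted = 'a\nb'    # the backslash and the n stay as two characters

double_quoted.length                   # => 3
single_quoted.length                   # => 4

# The single-quoted string equals a double-quoted one with an escaped backslash.
single_quoted == "a\\nb"               # => true

# \' is the one escape (besides \\) that single quotes do honor.
'it\'s'.length                         # => 4
```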

Another useful way to initialize strings is with the "here document" style:

	long_string = <<EOF
	Here is a long string
	With many paragraphs
	EOF
	# => "Here is a long string\nWith many paragraphs\n"

	puts long_string
	# Here is a long string
	# With many paragraphs
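A here document with a bare delimiter, like the EOF above, is processed the same way as a double-quoted string, so substitutions work inside it. A minimal sketch, with variable names of our own choosing:

```ruby
name = "Ruby"

# The #{...} expressions are evaluated, just as in a double-quoted string.
greeting = <<EOF
Hello, #{name}
Two plus two is #{2 + 2}
EOF

greeting   # => "Hello, Ruby\nTwo plus two is 4\n"
```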

Like most of Ruby's built-in classes, Ruby's strings define the same functionality in several different ways, so that you can use the idiom you prefer. Say you want to get a substring of a larger string (as in Recipe 1.13). If you're an object-oriented programming purist, you can use the String#slice method:

	string                                     # => "My first string"
	string.slice(3, 5)                         # => "first"

But if you're coming from C, and you think of a string as an array of bytes, Ruby can accommodate you. Selecting a single byte from a string returns that byte as a number.

	string[3].chr + string[4].chr + string[5].chr + string[6].chr + string[7].chr
	# => "first"

And if you come from Python, and you like that language's slice notation, you can just as easily chop up the string that way:

	string[3, 5]                              # => "first"
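To convince yourself that these idioms are interchangeable, you can compare their results directly. The sketch below also uses a range inside the brackets, which is one more spelling of the same operation; the variable names are ours:

```ruby
string = "My first string"

by_slice = string.slice(3, 5)    # start at index 3, take 5 characters
by_index = string[3, 5]          # same arguments, bracket syntax
by_range = string[3..7]          # inclusive range of indexes 3 through 7

by_slice                         # => "first"
by_slice == by_index             # => true
by_index == by_range             # => true
```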

Unlike in most programming languages, Ruby strings are mutable: you can change them after they are declared. Below we see the difference between the methods String#upcase and String#upcase!:

	string.upcase                             # => "MY FIRST STRING"
	string                                    # => "My first string"
	string.upcase!                            # => "MY FIRST STRING"
	string                                    # => "MY FIRST STRING"

This is one of Ruby's syntactical conventions. "Dangerous" methods (generally those that modify their object in place) usually have an exclamation mark at the end of their name. Another syntactical convention is that predicates, methods that return a true/false value, have a question mark at the end of their name (as in some varieties of Lisp):

	string.empty?                             # => false
	string.include? 'MY'                      # => true

This use of English punctuation to provide the programmer with information is an example of Matz's design philosophy: that Ruby is a language primarily for humans to read and write, and secondarily for computers to interpret.
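The exclamation mark convention is easier to trust once you see which object each method hands back. This sketch uses Object#equal? to test object identity, and String.new so the string can be modified even where string literals are frozen; the variable names are ours:

```ruby
original = String.new("My first string")

copy = original.upcase            # builds and returns a new string
original                          # => "My first string"
copy.equal?(original)             # => false

result = original.upcase!         # changes the receiver in place
result.equal?(original)           # => true
original                          # => "MY FIRST STRING"

# When there is nothing to change, upcase! returns nil, not the string.
original.upcase!                  # => nil
```

Because of that nil, it's safer not to chain further calls onto the return value of a method like upcase!.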

An interactive Ruby session is an indispensable tool for learning and experimenting with these methods. Again, we encourage you to type the sample code shown in these recipes into an irb or fxri session, and try to build upon the examples as your knowledge of Ruby grows.

Here are some extra resources for using strings in Ruby:

  • You can get information about any built-in Ruby method with the ri command; for instance, to see more about the String#upcase! method, issue the command ri "String#upcase!" from the command line.

  • "why the lucky stiff" has written an excellent introduction to installing Ruby, and using irb and ri: http://poignantguide.net/ruby/expansion-pak-1.html

  • For more information about the design philosophy behind Ruby, read an interview with Yukihiro "Matz" Matsumoto, creator of Ruby: http://www.artima.com/intv/ruby.html

1.1. Building a String from Parts

Problem

You want to iterate over a data structure, building a string from it as you do.

Solution

There are two efficient solutions. The simplest solution is to start with an empty string, and repeatedly append substrings onto it with the << operator:

	hash = { "key1" => "val1", "key2" => "val2" }
	string = ""
	hash.each { |k,v| string << "#{k} is #{v}\n" }
	puts string
	# key1 is val1
	# key2 is val2

This variant of the simple solution is slightly more efficient, but harder to read:

	string = ""
	hash.each { |k,v| string << k << " is " << v << "\n" }

If your data structure is an array, or easily transformed into an array, it's usually more efficient to use Array#join:

	puts hash.keys.join("\n") + "\n"
	# key1
	# key2
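To produce the same "key is val" output as the first solution with Array#join, you can first turn the hash into an array of lines. This is a sketch, not the only way to do it; it assumes a Ruby whose hashes iterate in insertion order (1.9 and later), and the name lines is ours:

```ruby
hash = { "key1" => "val1", "key2" => "val2" }

# Hash#map yields each key/value pair and collects the results in an array.
lines = hash.map { |k, v| "#{k} is #{v}" }

# join puts "\n" between the lines; the final newline is added by hand.
string = lines.join("\n") + "\n"
string   # => "key1 is val1\nkey2 is val2\n"
```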

Discussion

In languages like Python and Java, it's very inefficient to build a string by starting with an empty string and adding each substring onto the end. In those languages, strings are immutable, so adding one string to another builds an entirely new string. Doing this multiple times creates a huge number of intermediary strings, each of which is only used as a stepping stone to the next string. This wastes time and memory.

In those languages, the most efficient way to build a string is always to put the substrings into an array or another mutable data structure, one that expands dynamically rather than by implicitly creating entirely new objects. Once you're done processing the substrings, you get a single string with the equivalent of Ruby's Array#join. In Java, this is the purpose of the StringBuffer class.

In Ruby, though, strings are just as mutable as arrays. Just like arrays, they can expand as needed, without using much time or memory. The fastest solution to this problem in Ruby is usually to forgo a holding array and tack the substrings directly onto a base string. Sometimes using Array#join is faster, but it's usually pretty close, and the << construction is generally easier to understand.

If efficiency is important to you, don't build a new string when you can append items onto an existing string. Constructs like str << 'a' + 'b' or str << "#{var1} #{var2}" create new strings that are immediately subsumed into the larger string. This is exactly what you're trying to avoid. Use str << var1 << ' ' << var2 instead.
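You can watch the difference between appending and rebuilding by keeping a second reference to the string. In this sketch, << changes the one object both variables point to, while += creates a new object and leaves the old one behind. The variable names are ours, and String.new is used so the string is modifiable even where literals are frozen:

```ruby
buffer = String.new("abc")
original = buffer                  # second reference to the same object

buffer << "def"                    # appends in place
buffer.equal?(original)            # => true
original                           # => "abcdef"

buffer += "ghi"                    # builds a new string, rebinds buffer
buffer.equal?(original)            # => false
buffer                             # => "abcdefghi"
original                           # => "abcdef"
```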

On the other hand, you shouldn't modify strings that aren't yours. Sometimes safety requires that you create a new string. When you define a method that takes a string as an argument, you shouldn't modify that string by appending other strings onto it, unless that's really the point of the method (and unless the method's name ends in an exclamation point, so that callers know it modifies objects in place).
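Here is one way that advice might look in practice. Both methods are hypothetical helpers written for this illustration: with_suffix appends to a copy made with dup, and add_suffix! advertises with its name that it changes its argument.

```ruby
# Hypothetical helper: leaves the caller's string alone.
def with_suffix(str, suffix)
  str.dup << suffix          # append to a copy, return the copy
end

# Hypothetical helper: the "!" warns callers that str is modified.
def add_suffix!(str, suffix)
  str << suffix
end

name = String.new("report")
safe = with_suffix(name, ".txt")   # => "report.txt"
name                               # => "report"

add_suffix!(name, ".txt")          # => "report.txt"
name                               # => "report.txt"
```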

Another caveat: Array#join does not work precisely the same way as repeated appends to a string. Array#join accepts a separator string that it inserts between every two elements of the array. Unlike a simple string-building iteration over an array, it will not insert the separator string after the last element in the array. This example illustrates the difference:

	data = ['1', '2', '3']
	s = ''
	data.each { |x| s << x << ' and a '}
	s                                             # => "1 and a 2 and a 3 and a "
	data.join(' and a ')                          # => "1 and a 2 and a 3"

To simulate the behavior of Array#join across an iteration, you can use Enumerable#each_with_index and omit the separator on the last index. This only works if you know how long the Enumerable is going to be:

	s = ""
	data.each_with_index { |x, i| s << x; s << "|" if i < data.length-1 }
	s                                             # => "1|2|3"
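If you don't know the length in advance, one alternative is to put the separator before every element except the first, which needs nothing more than each. This is a sketch of that idea rather than a standard idiom; the flag variable first is ours:

```ruby
data = ['1', '2', '3']

s = String.new("")           # modifiable even where literals are frozen
first = true
data.each do |x|
  s << "|" unless first      # separator goes before every element but the first
  s << x
  first = false
end

s                            # => "1|2|3"
```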


[1] Yes, this was covered in the Preface, but not everyone reads the Preface.

[2] "\xc3\xa9" is a Ruby string representation of the UTF-8 encoding of the Unicode character é.
