Cover by Leonard Richardson, Lucas Carlson

1.5. Representing Unprintable Characters

Problem

You need to refer to a control character, a strange UTF-8 character, or some other character that's not on your keyboard.

Solution

Ruby gives you a number of escaping mechanisms to refer to unprintable characters. By using one of these mechanisms within a double-quoted string, you can put any binary character into the string.

You can reference any binary character by encoding its octal representation into the format "\000", or its hexadecimal representation into the format "\x00".

	octal = "\000\001\010\020"
	octal.each_byte { |x| puts x }
	# 0
	# 1
	# 8
	# 16

	hexadecimal = "\x00\x01\x10\x20"
	hexadecimal.each_byte { |x| puts x }
	# 0
	# 1
	# 16
	# 32
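
The two notations are interchangeable, which a quick check can confirm. This is a minimal sketch that assumes Ruby 2.0 or later (where String#bytes returns an array); the variable name same is only illustrative.

```ruby
# One byte value, three spellings: octal escape, hexadecimal escape, literal character.
same = "\101\x41A"
p same.bytes          # prints [65, 65, 65]

# Integer#chr goes the other way, from a byte value to a one-byte string.
p 65.chr              # prints "A"

# Integer#to_s(base) and String#to_i(base) convert between the notations.
p 16.to_s(8)          # prints "20", the digits used in "\020"
p "20".to_i(16)       # prints 32, the byte produced by "\x20"
```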

This makes it possible to represent UTF-8 characters even when you can't type them or display them in your terminal. Try running this program, and then opening the generated file smiley.html in your web browser:

	open('smiley.html', 'wb') do |f|
	  f << '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
	  f << "\xe2\x98\xBA"
	end
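
If you are running Ruby 1.9 or later (an assumption; the recipe itself predates it), the "\u" escape offers another route to the same three bytes by naming the Unicode code point instead of its UTF-8 encoding. A small sketch, with smiley as an illustrative variable name:

```ruby
smiley = "\u263A"                        # code point U+263A, WHITE SMILING FACE
p smiley.bytes.map { |b| "%02x" % b }    # prints ["e2", "98", "ba"]
p smiley.bytes == "\xe2\x98\xBA".bytes   # prints true
p smiley.length                          # prints 1 (one character)
p smiley.bytesize                        # prints 3 (three bytes)
```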

The most common unprintable characters (such as newline) have special mnemonic aliases consisting of a backslash and a letter.

	"\a" == "\x07" # => true # ASCII 0x07 = BEL (Sound system bell)
	"\b" == "\x08" # => true # ASCII 0x08 = BS (Backspace)
	"\e" == "\x1b" # => true # ASCII 0x1B = ESC (Escape)
	"\f" == "\x0c" # => true # ASCII 0x0C = FF (Form feed)
	"\n" == "\x0a" # => true # ASCII 0x0A = LF (Newline/line feed)
	"\r" == "\x0d" # => true # ASCII 0x0D = CR (Carriage return)
	"\t" == "\x09" # => true # ASCII 0x09 = HT (Tab/horizontal tab)
	"\v" == "\x0b" # => true # ASCII 0x0B = VT (Vertical tab)

Discussion

Ruby stores a string as a sequence of bytes. It makes no difference whether those bytes are printable ASCII characters, binary characters, or a mix of the two.

When Ruby prints out a human-readable string representation of a binary character, it uses the character's \xxx octal representation. Characters with special backslash-letter mnemonics (such as \n) are printed as the mnemonic. Printable characters are output as their printable representation, even if another representation was used to create the string.

	"\x10\x11\xfe\xff"             # => "\020\021\376\377"
	"\x48\145\x6c\x6c\157\x0a"     # => "Hello\n"
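
The octal output shown above is what older interpreters print. On Ruby 2.0 and later (an assumption about your interpreter), String#inspect uses hexadecimal escapes for a binary string instead, while mnemonics and printable characters behave as described. A hedged sketch, with raw and hello as illustrative names:

```ruby
raw = "\x10\x11\xfe\xff".b              # String#b: a copy tagged as binary (ASCII-8BIT)
puts raw.inspect                        # prints "\x10\x11\xFE\xFF"

hello = "\x48\145\x6c\x6c\157\x0a"      # a mix of hexadecimal and octal escapes
puts hello.inspect                      # prints "Hello\n"
```

Strings in other encodings, such as UTF-8, may be displayed with different escapes, so treat the exact output as interpreter-dependent.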

To avoid confusion with the mnemonic characters, a literal backslash in a string is represented by two backslashes. For instance, the two-character string consisting of a backslash and the 14th letter of the alphabet is represented as "\\n".

	"\\".size                      # => 1
	"\\" == "\x5c"                 # => true
	"\\n"[0] == ?\\                # => true
	"\\n"[1] == ?n                 # => true
	"\\n" =~ /\n/                  # => nil
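
The same doubling applies inside a regular expression when you want to match the backslash itself. A short sketch, assuming Ruby 2.0 or later for String#bytes; two_chars is an illustrative name:

```ruby
two_chars = "\\n"
p two_chars.bytes           # prints [92, 110]: a backslash, then the letter n
p(two_chars =~ /\\n/)       # prints 0: /\\n/ matches a backslash followed by n
p("\n" =~ /\\n/)            # prints nil: a real newline contains no backslash
p two_chars == '\n'         # prints true: single quotes leave \n uninterpreted
```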

Ruby also provides special shortcuts for representing keyboard sequences like Control-C. "\C-x" represents the sequence you get by holding down the Control key and hitting the x key, and "\M-x" represents the sequence you get by holding down the Alt (or Meta) key and hitting the x key:

	"\C-a\C-b\C-c"                 # => "\001\002\003"
	"\M-a\M-b\M-c"                 # => "\341\342\343"
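
For ASCII letters, these shortcuts amount to simple bit arithmetic on the letter's byte value: Control keeps only the low five bits, and Meta sets the high bit. The sketch below checks that relationship; it assumes Ruby 2.0 or later, and ctrl and meta are illustrative names.

```ruby
ctrl = "\C-a\C-b\C-c".bytes
meta = "\M-a\M-b\M-c".bytes
p ctrl                                        # prints [1, 2, 3]
p meta                                        # prints [225, 226, 227]

# Control masks the letter down to its low five bits; Meta sets bit 7.
p ctrl == "abc".bytes.map { |b| b & 0x1f }    # prints true
p meta == "abc".bytes.map { |b| b | 0x80 }    # prints true

# The same Meta bytes in octal, as written in the output above.
p meta.map { |b| "%o" % b }                   # prints ["341", "342", "343"]
```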

Shorthand representations of binary characters can be used whenever Ruby expects a character. For instance, you can get the decimal byte number of a special character by prefixing it with ?, and you can use shorthand representations in regular expression character ranges.

	?\C-a                                    # => 1
	?\M-z                                    # => 250

	contains_control_chars = /[\C-a-\C-^]/
	'Foobar' =~ contains_control_chars       # => nil
	"Foo\C-zbar" =~ contains_control_chars   # => 3

	contains_upper_chars = /[\x80-\xff]/
	'Foobar' =~ contains_upper_chars         # => nil
	"Foo\212bar" =~ contains_upper_chars     # => 3
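
One caveat if you are on Ruby 1.9 or later (an assumption about your interpreter): the ? prefix yields a one-character string there, not an integer, so you need String#ord or String#bytes to get the number. The sketch below shows that, plus two ways to spell a control-character range in a regular expression; the variable names are illustrative.

```ruby
ctrl_a = ?\C-a
meta_z = ?\M-z
p ctrl_a.class                      # prints String
p ctrl_a.ord                        # prints 1
p meta_z.bytes                      # prints [250]

# Bytes 0x01 through 0x1e cover Control-A through Control-^.
p("Foo\C-zbar" =~ /[\x01-\x1e]/)    # prints 3
# The POSIX bracket class [[:cntrl:]] matches any control character.
p("Foo\C-zbar" =~ /[[:cntrl:]]/)    # prints 3
p("Foobar" =~ /[[:cntrl:]]/)        # prints nil
```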

Here's a sinister application that scans logged keystrokes for special characters:

	def snoop_on_keylog(input)
	  input.each_byte do |b|
	    case b
	      when ?\C-c; puts 'Control-C: stopped a process?'
	      when ?\C-z; puts 'Control-Z: suspended a process?'
	      when ?\n; puts 'Newline.'
	      when ?\M-x; puts 'Meta-x: using Emacs?'
	    end
	  end
	end

	snoop_on_keylog("ls -ltR\003emacsHello\012\370rot13-other-window\012\032")
	# Control-C: stopped a process?
	# Newline.
	# Meta-x: using Emacs?
	# Newline.
	# Control-Z: suspended a process?
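
On Ruby 1.9 or later, each_byte still yields integers but ?\C-c is a string, so the when clauses above would never match. One possible adaptation, offered as a sketch and not as the recipe's own code, looks up each byte in a table keyed by integer values. The method name describe_keylog and the labels table are hypothetical.

```ruby
# Hypothetical variant of snoop_on_keylog for Ruby 1.9+.
# Returns the messages as an array instead of printing them directly.
def describe_keylog(input)
  # Keys are Integer byte values derived from the escape sequences.
  labels = {
    "\C-c".ord         => 'Control-C: stopped a process?',
    "\C-z".ord         => 'Control-Z: suspended a process?',
    "\n".ord           => 'Newline.',
    "\M-x".bytes.first => 'Meta-x: using Emacs?'
  }
  # Bytes with no entry map to nil, which compact removes.
  input.each_byte.map { |b| labels[b] }.compact
end

puts describe_keylog("ls -ltR\003emacsHello\012\370rot13-other-window\012\032")
# Control-C: stopped a process?
# Newline.
# Meta-x: using Emacs?
# Newline.
# Control-Z: suspended a process?
```

Returning an array keeps the lookup separate from the printing, which makes the method easier to check.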

Special characters are only interpreted in strings delimited by double quotes, or strings created with %{} or %Q{}. They are not interpreted in strings delimited by single quotes, or strings created with %q{}, where only an escaped backslash or an escaped delimiter is recognized. You can take advantage of this feature when you need to display special characters to the end-user, or create a string containing a lot of backslashes.

	puts "foo\tbar"
	# foo     bar
	puts %{foo\tbar}
	# foo     bar
	puts %Q{foo\tbar}
	# foo     bar

	puts 'foo\tbar'
	# foo\tbar
	puts %q{foo\tbar}
	# foo\tbar
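
Comparing lengths makes the difference concrete. This small sketch also shows the two escapes that single-quoted strings do honor, an escaped quote and an escaped backslash:

```ruby
p "foo\tbar".length       # prints 7: \t is a single tab character
p 'foo\tbar'.length       # prints 8: the backslash and the t are separate characters
p 'it\'s'                 # prints "it's"
p 'a\\b'.length           # prints 3: a, one backslash, b
p(%q{a\\b} == 'a\\b')     # prints true: %q{} follows the single-quote rules
```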

If you come to Ruby from Python, this feature can take advantage of you, making you wonder why the special characters in your single-quoted strings aren't treated as special. If you need to create a string with special characters and a lot of embedded double quotes, use the %{} construct.
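
As a brief illustration of that advice, the sketch below builds the same string twice, once with escaped double quotes and once with %{}, and then uses %q{} for a string full of backslashes. The variable names and sample strings are made up for the example.

```ruby
quoted  = "<a href=\"/docs\" title=\"Docs\">docs</a>\n"
percent = %{<a href="/docs" title="Docs">docs</a>\n}
p quoted == percent                  # prints true

# %q{} leaves \n and \t exactly as typed, which suits backslash-heavy text.
windows_path = %q{C:\new\table}
p windows_path.length                # prints 12
```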