Python 2.6 Text Processing Beginner's Guide

Book Description

With a basic knowledge of Python, you can take on time-saving text processing tasks. This book introduces the main techniques and teaches through practical examples and clear explanations.

  • The easiest way to learn text processing with Python

  • Deals with the most important textual data formats you will encounter

  • Learn to use the most popular text processing libraries available for Python

  • Packed with examples to guide you through

In Detail

    For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python.

    Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

    The book gets you going with a quick look at some data formats and at installing the supporting libraries and components, so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text formats such as XML, HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally, you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.


    Table of Contents

    1. Python 2.6 Text Processing
      1. Python 2.6 Text Processing
      2. Credits
      3. About the Author
      4. About the Reviewer
      5. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
        2. Why Subscribe?
        3. Free Access for Packt account holders
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Time for action - heading
          1. What just happened?
          2. Pop Quiz - heading
          3. Have a go hero - heading
        6. Reader feedback
        7. Customer support
          1. Errata
          2. Piracy
          3. Questions
      7. Chapter 1: Getting Started
        1. Categorizing types of text data
          1. Providing information through markup
          2. Meaning through structured formats
          3. Understanding freeform content
        2. Ensuring you have Python installed
          1. Providing support for Python 3
        3. Implementing a simple cipher
        4. Time for action - implementing a ROT13 encoder
          1. What just happened?
          2. Have a go hero - more translation work
          3. Processing structured markup with a filter
        5. Time for action - processing as a filter
          1. What just happened?
        6. Time for action - skipping over markup tags
          1. What just happened?
          2. State machines
          3. Pop Quiz - ROT 13 processing
          4. Have a go hero - support multiple input channels
        7. Supporting third-party modules
          1. Packaging in a nutshell
        8. Time for action - installing SetupTools
          1. What just happened?
        9. Running a virtual environment
          1. Configuring virtualenv
        10. Time for action - configuring a virtual environment
          1. What just happened?
          2. Have a go hero - install your own environment
        11. Where to get help?
        12. Summary
      8. Chapter 2: Working with the IO System
        1. Parsing web server logs
        2. Time for action - generating transfer statistics
          1. What just happened?
        3. Using objects interchangeably
        4. Time for action - introducing a new log format
          1. What just happened?
          2. Have a go hero - creating a new processing class
        5. Accessing files directly
        6. Time for action - accessing files directly
          1. What just happened?
          2. Context managers
          3. Handling other file types
        7. Time for action - handling compressed files
          1. What just happened?
          2. Implementing file-like objects
            1. File object methods
              1. close
              2. fileno
              3. flush
              4. read
              5. readline
              6. readlines
              7. seek
              8. tell
              9. write
              10. writelines
            2. Enabling universal newlines
        8. Accessing multiple files
        9. Time for action - spell-checking HTML content
          1. What just happened?
          2. Simplifying multiple file access
            1. Inplace filtering
          3. Pop Quiz - file-like objects
        10. Accessing remote files
        11. Time for action - spell-checking live HTML pages
          1. What just happened?
          2. Have a go hero - access web logs remotely
          3. Error handling
        12. Time for action - handling urllib2 errors
          1. What just happened?
        13. Handling string IO instances
        14. Understanding IO in Python 3
        15. Summary
      9. Chapter 3: Python String Services
        1. Understanding the basics of the string object
          1. Defining strings
        2. Time for action - employee management
          1. What just happened?
          2. Building non-literal strings
          3. Pop Quiz - string literals
        3. String formatting
        4. Time for action - customizing log processor output
          1. What just happened?
          2. Percent (modulo) formatting
            1. Mapping key
            2. Conversion flags
            3. Minimum width
            4. Precision
            5. Width
            6. Conversion type
              1. Using string special methods
          3. Have a go hero - make log processing more readable
          4. Using the format method approach
        5. Time for action - adding status code data
          1. What just happened?
          2. Making use of conversion specifiers
              1. Fill
              2. Align
              3. Sign
              4. Width
              5. Precision
              6. Type
          3. Have a go hero - updating the file size check to use the format method
        6. Creating templates
        7. Time for action - displaying warnings on malformed lines
          1. What just happened?
          2. Template syntax
          3. Rendering a template
          4. Pop Quiz - string formatting
        8. Calling string object methods
        9. Time for action - simple manipulation with string methods
          1. What just happened?
          2. Aligning text
          3. Detecting character classes
          4. Casing
          5. Searching strings
          6. Dealing with lists of strings
            1. Treating strings as sequences
          7. Have a go hero - dive into the string object
        10. Summary
      10. Chapter 4: Text Processing Using the Standard Library
        1. Reading CSV data
        2. Time for action - processing Excel formats
          1. What just happened?
        3. Time for action - CSV and formulas
          1. What just happened?
          2. Reading non-Excel data
        4. Time for action - processing custom CSV formats
          1. What just happened?
        5. Writing CSV data
        6. Time for action - creating a spreadsheet of UNIX users
          1. What just happened?
          2. Pop Quiz - CSV handling
          3. Have a go hero - detecting CSV dialects
        7. Modifying application configuration files
        8. Time for action - adding basic configuration read support
          1. What just happened?
          2. Using value interpolation
        9. Time for action - relying on configuration value interpolation
          1. What just happened?
          2. Handling default options
        10. Time for action - configuration defaults
          1. What just happened?
          2. Have a go hero - overriding configuration options
        11. Writing configuration data
        12. Time for action - generating a configuration file
          1. What just happened?
          2. Have a go hero - clearing configuration defaults
        13. Reconfiguring our source
          1. A note on Python 3
        14. Time for action - creating an egg-based package
          1. What just happened?
          2. Understanding the setup.py file
          3. Have a go hero - building some eggs!
        15. Working with JSON
        16. Time for action - writing JSON data
          1. What just happened?
          2. Encoding data
          3. Decoding data
          4. Pop Quiz - JSON formatting
          5. Have a go hero - translating strings to integers
        17. Summary
      11. Chapter 5: Regular Expressions
        1. Simple string matching
        2. Time for action - testing an HTTP URL
          1. What just happened?
          2. Understanding the match function
          3. Learning basic syntax
            1. Detecting repetition
            2. Specifying character sets and classes
            3. Applying anchors to restrict matches
          4. Wrapping it up
          5. Have a go hero - tidying up our URL test
        3. Advanced pattern matching
          1. Grouping
        4. Time for action - regular expression grouping
          1. What just happened?
          2. Have a go hero - updating our stats processor to use named groups
          3. Using greedy versus non-greedy operators
          4. Assertions
            1. Performing an 'or' operation
          5. Pop Quiz - regular expressions
        5. Implementing Python-specific elements
          1. Other search functions
            1. search
            2. findall and finditer
            3. split
            4. sub
          2. Compiled expression objects
            1. Dealing with performance issues
          3. Parser flags
          4. Unicode regular expressions
          5. The match object
            1. Processing BIND zone files
        6. Time for action - reading DNS records
          1. What just happened?
          2. Have a go hero - adding support for $ORIGIN
          3. Pop Quiz - understanding the Pythonisms
        7. Summary
      12. Chapter 6: Structured Markup
        1. XML data
        2. SAX processing
        3. Time for action - event-driven processing
          1. What just happened?
          2. Incremental processing
        4. Time for action - driving incremental processing
          1. What just happened?
          2. Building an application
        5. Time for action - creating a dungeon adventure game
          1. What just happened?
          2. Pop Quiz - SAX processing
          3. Have a go hero - adding gold
        6. The Document Object Model
          1. xml.dom.minidom
        7. Time for action - updating our game to use DOM processing
          1. What just happened?
          2. Have a go hero - cleaning up the dungeon a bit
          3. Creating and modifying documents programmatically
          4. Have a go hero - adding multiple dungeons
        8. XPath
          1. Accessing XML data using ElementTree
        9. Time for action - using XPath in our adventure
          1. What just happened?
        10. Reading HTML
        11. Time for action - displaying links in an HTML page
          1. What just happened?
          2. BeautifulSoup
          3. Have a go hero - updating link extractor to use BeautifulSoup
        12. Summary
      13. Chapter 7: Creating Templates
        1. Time for action - installing Mako
          1. What just happened?
        2. Basic Mako usage
        3. Time for action - loading a simple Mako template
          1. What just happened?
          2. Generating a template context
          3. Have a go hero - understanding context internals
          4. Managing execution with control structures
          5. Including Python code
        4. Time for action - reformatting the date with Python code
          1. What just happened?
          2. Adding functionality with tags
            1. Rendering files with %include
            2. Generating multiline comments with %doc
            3. Documenting Mako with %text
            4. Defining functions with %def
        5. Time for action - defining Mako def tags
          1. What just happened?
          2. Have a go hero - formatting whitespace
          3. Importing %def sections using %namespace
        6. Time for action - converting mail message to use namespaces
          1. What just happened?
              1. Selectively importing def blocks
          2. Filtering output
            1. Expression filters
            2. Filtering the output of %def blocks
            3. Setting default filters
        7. Inheriting from base templates
        8. Time for action - updating base template
          1. What just happened?
          2. Growing the inheritance chain
        9. Time for action - adding another inheritance layer
          1. What just happened?
          2. Inheriting attributes
          3. Pop Quiz - inheriting from templates
        10. Customizing
          1. Custom tags
        11. Time for action - creating custom Mako tags
          1. What just happened?
          2. Customizing filters
        12. Overviewing alternative approaches
        13. Summary
      14. Chapter 8: Understanding Encodings and i18n
        1. Understanding basic character encodings
          1. ASCII
            1. Limitations of ASCII
          2. KOI8-R
        2. Unicode
          1. Using Unicode with Python 3
          2. Understanding Unicode
            1. Design goals
              1. Universality
              2. Efficiency
              3. Characters, not glyphs
              4. Semantics
              5. Plain text
              6. Logical order
              7. Unification
              8. Dynamic composition
              9. Stability
              10. Convertibility
          3. Organizational structure
          4. Backwards compatibility
          5. Encoding
            1. UTF-32
            2. UTF-8
          6. Pop Quiz - character encodings
        3. Encodings in Python
        4. Time for action - manually decoding
          1. What just happened?
          2. Reading Unicode
          3. Writing Unicode strings
        5. Time for action - copying Unicode data
          1. What just happened?
        6. Time for action - fixing our copy application
          1. What just happened?
          2. Pop Quiz - Python encodings
          3. Have a go hero - other encodings
        7. The codecs module
        8. Time for action - changing encodings
          1. What just happened?
          2. Have a go hero - translating it back
        9. Adopting good practices
        10. Internationalization and Localization
          1. Preparing an application for translation
        11. Time for action - preparing for multiple languages
          1. What just happened?
        12. Time for action - providing translations
          1. What just happened?
          2. Looking for more information on internationalization
          3. Pop Quiz - internationalization
        13. Summary
      15. Chapter 9: Advanced Output Formats
        1. Dealing with PDF files using PLATYPUS
        2. Time for action - installing ReportLab
          1. What just happened?
          2. Generating PDF documents
        3. Time for action - writing PDF with basic layout and style
          1. What just happened?
          2. Have a go hero - drawing a logo
        4. Writing native Excel data
        5. Time for action - installing xlwt
          1. What just happened?
          2. Building XLS documents
        6. Time for action - generating XLS data
          1. What just happened?
          2. Pop Quiz - creating XLS documents
        7. Working with OpenDocument files
        8. Time for action - installing ODFPy
          1. What just happened?
          2. Building an ODT generator
        9. Time for action - generating ODT data
          1. What just happened?
          2. Have a go hero - understanding ODF XML files
        10. Summary
      16. Chapter 10: Advanced Parsing and Grammars
        1. Defining a language syntax
          1. Specifying grammar with Backus-Naur Form
          2. Grammar-driven parsing
        2. PyParsing
        3. Time for action - installing PyParsing
          1. What just happened?
        4. Time for action - implementing a calculator
          1. What just happened?
          2. Parse actions
        5. Time for action - handling type translations
          1. What just happened?
          2. Have a go hero - using events to lookup operators
          3. Suppressing parts of a match
        6. Time for action - suppressing portions of a match
          1. What just happened?
              1. Understanding BIND configuration format
              2. Implementing parser
              3. PyParsing objects
              4. And
              5. CharsNotIn
              6. Combine
              7. FollowedBy
              8. Keyword
              9. Literal
              10. MatchFirst
              11. NotAny
              12. OneOrMore, ZeroOrMore
              13. Regex
              14. StringStart, StringEnd
              15. White
                1. Debugging
          2. Have a go hero - extending our configuration file parser
        7. Processing data using the Natural Language Toolkit
        8. Time for action - installing NLTK
          1. What just happened?
          2. NLTK processing examples
            1. Removing stems
            2. Discovering collocations
        9. Summary
      17. Chapter 11: Searching and Indexing
        1. Understanding search complexity
        2. Time for action - implementing a linear search
          1. What just happened?
          2. Have a go hero - understanding why this is bad
        3. Text indexing
        4. Time for action - installing Nucular
          1. What just happened?
          2. An introduction to Nucular
        5. Time for action - full text indexing
          1. What just happened?
        6. Time for action - measuring index benefit
          1. What just happened?
          2. Scripts provided by Nucular
          3. Using XML files
          4. Advanced Nucular features
        7. Time for action - field-qualified indexes
          1. What just happened?
          2. Performing an enhanced search
        8. Time for action - performing advanced Nucular queries
          1. What just happened?
          2. Pop Quiz - introduction to Nucular
        9. Indexing and searching other data
        10. Time for action - indexing Open Office documents
          1. What just happened?
        11. Other index systems
          1. Apache Lucene
          2. ZODB and zc.catalog
          3. SQL text indexing
        12. Summary
      18. Appendix A: Looking for Additional Resources
        1. Python resources
          1. Unofficial documentation
          2. Python enhancement proposals
          3. Self-documenting
            1. Using other documentation tools
          4. Community resources
            1. Following groups and mailing lists
            2. Finding a users' group
            3. Attending a local Python conference
        2. Honorable mention
          1. Lucene and Solr
          2. Generating C-based parsers with GNU Bison
          3. Apache Tika
        3. Getting started with Python 3
          1. Major language changes
            1. Print is now a function
            2. Catching exceptions
            3. Using metaclasses
            4. New reserved words
            5. Major library changes
            6. Changes to list comprehensions
          2. Migrating to Python 3
        4. Time for action - using 2to3 to move to Python 3
          1. What just happened?
        5. Summary
      19. Appendix B: Pop Quiz - Answers
        1. Chapter 1: Getting Started
          1. ROT 13 Processing Answers
        2. Chapter 2: Working with the IO System
          1. File-like objects
        3. Chapter 3: Python String Services
          1. String literals
          2. String formatting
        4. Chapter 4: Text Processing Using the Standard Library
          1. CSV handling
          2. JSON formatting
        5. Chapter 5: Regular Expressions
          1. Regular expressions
          2. Understanding the Pythonisms
        6. Chapter 6: Structured Markup
          1. SAX processing
        7. Chapter 7: Creating Templates
          1. Template inheritance
        8. Chapter 8: Understanding Encodings and i18n
          1. Character encodings
          2. Python encodings
          3. Internationalization
        9. Chapter 9: Advanced Output Formats
          1. Creating XLS documents
        10. Chapter 11: Searching and Indexing
          1. Introduction to Nucular