Book description
Text Processing in Python is an example-driven, hands-on tutorial that carefully teaches programmers how to accomplish numerous text processing tasks using the Python language. Filled with concrete examples, this book provides efficient and effective solutions to specific text processing problems and practical strategies for dealing with all types of text processing challenges.
Text Processing in Python begins with an introduction to text processing and contains a quick Python tutorial to get you up to speed. It then delves into essential text processing subject areas, including string operations, regular expressions, parsers and state machines, and Internet tools and techniques. Appendixes cover such important topics as data compression and Unicode. A comprehensive index and plentiful cross-referencing offer easy access to available information. In addition, exercises throughout the book provide readers with further opportunity to hone their skills either on their own or in the classroom. A companion Web site (http://gnosis.cx/TPiP) contains source code and examples from the book.
Here is some of what you will find in this book:
When do I use formal parsers to process structured and semi-structured data? Page 257
How do I work with full text indexing? Page 199
What patterns in text can be expressed using regular expressions? Page 204
How do I find a URL or an email address in text? Page 228
How do I process a report with a concrete state machine? Page 274
How do I parse, create, and manipulate internet formats? Page 345
How do I handle lossless and lossy compression? Page 454
How do I find codepoints in Unicode? Page 465
Table of contents
- Copyright
- Preface
- Acknowledgments
-
1. Python Basics
-
1.1. Techniques and Patterns
- 1.1.1. Utilizing Higher-Order Functions in Text Processing
- 1.1.2. Exercise: More on combinatorial functions
- 1.1.3. Specializing Python Datatypes
-
1.1.4. Base Classes for Datatypes
- METHODS
- BUILT-IN FUNCTIONS
-
METHODS AND ATTRIBUTES
- FILE.close()
- FILE.closed
- FILE.fileno()
- FILE.flush()
- FILE.isatty()
- FILE.mode
- FILE.name
- FILE.read([size=sys.maxint])
- FILE.readline([size=sys.maxint])
- FILE.readlines([size=sys.maxint])
- FILE.seek(offset [,whence=0])
- FILE.tell()
- FILE.truncate([size=0])
- FILE.write(s)
- FILE.writelines(lines)
- FILE.xreadlines()
-
METHODS
- int.__and__(self, other), int.__rand__(self, other)
- int.__hex__(self)
- int.__invert__(self)
- int.__lshift__(self, other), int.__rlshift__(self, other)
- int.__oct__(self)
- int.__or__(self, other), int.__ror__(self, other)
- int.__rshift__(self, other), int.__rrshift__(self, other)
- int.__xor__(self, other), int.__rxor__(self, other)
- DIGRESSION
- CAPABILITIES
-
METHODS
- float.__abs__(self)
- float.__add__(self, other), float.__radd__(self, other)
- float.__cmp__(self, other)
- float.__div__(self, other), float.__rdiv__(self, other)
- float.__divmod__(self, other), float.__rdivmod__(self, other)
- float.__floordiv__(self, other), float.__rfloordiv__(self, other)
- float.__mod__(self, other), float.__rmod__(self, other)
- float.__mul__(self, other), float.__rmul__(self, other)
- float.__neg__(self)
- float.__pow__(self, other), float.__rpow__(self, other)
- float.__sub__(self, other), float.__rsub__(self, other)
- float.__truediv__(self, other), float.__rtruediv__(self, other)
- METHODS
-
METHODS
- dict.__cmp__(self, other), UserDict.UserDict.__cmp__(self, other)
- dict.__contains__(self, x), UserDict.UserDict.__contains__(self, x)
- dict.__delitem__(self, x), UserDict.UserDict.__delitem__(self, x)
- dict.__getitem__(self, x), UserDict.UserDict.__getitem__(self, x)
- dict.__len__(self), UserDict.UserDict.__len__(self)
- dict.__setitem__(self, key, val), UserDict.UserDict.__setitem__(self, key, val)
- dict.clear(self), UserDict.UserDict.clear(self)
- dict.copy(self), UserDict.UserDict.copy(self)
- dict.get(self, key [,default=None]), UserDict.UserDict.get(self, key [,default=None])
- dict.has_key(self, key), UserDict.UserDict.has_key(self, key)
- dict.items(self), UserDict.UserDict.items(self), dict.iteritems(self), UserDict.UserDict.iteritems(self)
- dict.keys(self), UserDict.UserDict.keys(self), dict.iterkeys(self), UserDict.UserDict.iterkeys(self)
- dict.popitem(self), UserDict.UserDict.popitem(self)
- dict.setdefault(self, key [,default=None]), UserDict.UserDict.setdefault(self, key [,default=None])
- dict.update(self, other), UserDict.UserDict.update(self, other)
- dict.values(self), UserDict.UserDict.values(self), dict.itervalues(self), UserDict.UserDict.itervalues(self)
-
METHODS
- list.__add__(self, other), UserList.UserList.__add__(self, other), tuple.__add__(self, other), list.__iadd__(self, other), UserList.UserList.__iadd__(self, other)
- list.__contains__(self, x), UserList.UserList.__contains__(self, x), tuple.__contains__(self, x)
- list.__delitem__(self, x), UserList.UserList.__delitem__(self, x)
- list.__delslice__(self, start, end), UserList.UserList.__delslice__(self, start, end)
- list.__getitem__(self, pos), UserList.UserList.__getitem__(self, pos), tuple.__getitem__(self, pos)
- list.__getslice__(self, start, end), UserList.UserList.__getslice__(self, start, end), tuple.__getslice__(self, start, end)
- list.__hash__(self), UserList.UserList.__hash__(self), tuple.__hash__(self)
- list.__len__(self), UserList.UserList.__len__(self), tuple.__len__(self)
- list.__mul__(self, num), UserList.UserList.__mul__(self, num), tuple.__mul__(self, num), list.__rmul__(self, num), UserList.UserList.__rmul__(self, num), tuple.__rmul__(self, num), list.__imul__(self, num), UserList.UserList.__imul__(self, num)
- list.__setitem__(self, pos, val), UserList.UserList.__setitem__(self, pos, val)
- list.__setslice__(self, start, end, other), UserList.UserList.__setslice__(self, start, end, other)
- list.append(self, item), UserList.UserList.append(self, item)
- list.count(self, item), UserList.UserList.count(self, item)
- list.extend(self, seq), UserList.UserList.extend(self, seq)
- list.index(self, item), UserList.UserList.index(self, item)
- list.insert(self, pos, item), UserList.UserList.insert(self, pos, item)
- list.pop(self [,pos=-1]), UserList.UserList.pop(self [,pos=-1])
- list.remove(self, item), UserList.UserList.remove(self, item)
- list.reverse(self), UserList.UserList.reverse(self)
- list.sort(self [,cmpfunc]), UserList.UserList.sort(self [,cmpfunc])
- METHODS
- 1.1.5. Exercise: Filling out the forms (or deciding not to)
- 1.1.6. Problem: Working with lines from a large file
-
1.2. Standard Modules
-
1.2.1. Working with the Python Interpreter
- FUNCTIONS
- FUNCTIONS
- ATTRIBUTES
- FUNCTIONS
- BUILT-IN
-
CONSTANTS
- types.BuiltinFunctionType, types.BuiltinMethodType
- types.BufferType
- types.ClassType
- types.CodeType
- types.ComplexType
- types.DictType, types.DictionaryType
- types.EllipsisType
- types.FileType
- types.FloatType
- types.FrameType
- types.FunctionType, types.LambdaType
- types.GeneratorType
- types.InstanceType
- types.IntType
- types.ListType
- types.LongType
- types.MethodType, types.UnboundMethodType
- types.ModuleType
- types.NoneType
- types.StringType
- types.TracebackType
- types.TupleType
- types.UnicodeType
- types.SliceType
- types.StringTypes
- types.TypeType
- types.XRangeType
-
1.2.2. Working with the Local Filesystem
- FUNCTIONS
- FUNCTIONS
- CLASSES
-
METHODS AND ATTRIBUTES
- filecmp.dircmp.report()
- filecmp.dircmp.report_partial_closure()
- filecmp.dircmp.report_full_closure()
- filecmp.dircmp.left_list
- filecmp.dircmp.right_list
- filecmp.dircmp.common
- filecmp.dircmp.left_only
- filecmp.dircmp.right_only
- filecmp.dircmp.common_dirs
- filecmp.dircmp.common_files
- filecmp.dircmp.common_funny
- filecmp.dircmp.same_files
- filecmp.dircmp.diff_files
- filecmp.dircmp.funny_files
- filecmp.dircmp.subdirs
- FUNCTIONS
- CLASSES
- FUNCTIONS
- FUNCTIONS
-
FUNCTIONS
- os.path.abspath(pathname)
- os.path.basename(pathname)
- os.path.commonprefix(pathlist)
- os.path.dirname(pathname)
- os.path.exists(pathname)
- os.path.expanduser(pathname)
- os.path.expandvars(pathname)
- os.path.getatime(pathname)
- os.path.getmtime(pathname)
- os.path.getsize(pathname)
- os.path.isabs(pathname)
- os.path.isdir(pathname)
- os.path.isfile(pathname)
- os.path.islink(pathname)
- os.path.ismount(pathname)
- os.path.join(path1 [,path2 [...]])
- os.path.normcase(pathname)
- os.path.normpath(pathname)
- os.path.realpath(pathname)
- os.path.samefile(pathname1, pathname2)
- os.path.sameopenfile(fp1, fp2)
- os.path.split(pathname)
- os.path.splitdrive(pathname)
- os.path.walk(pathname, visitfunc, arg)
- FUNCTIONS
- FUNCTIONS
- CONSTANTS
- FUNCTIONS
- FUNCTIONS
-
1.2.3. Running External Commands and Accessing OS Features
- FUNCTIONS
-
FUNCTIONS
- os.access(pathname, operation)
- os.chdir(pathname)
- os.chmod(pathname, mode)
- os.chown(pathname, uid, gid)
- os.chroot(pathname)
- os.getcwd()
- os.getenv(var [,value=None])
- os.getpid()
- os.kill(pid, sig)
- os.link(src, dst)
- os.listdir(pathname)
- os.lstat(pathname)
- os.mkdir(pathname [,mode=0777])
- os.makedirs(pathname [,mode=0777])
- os.mkfifo(pathname [,mode=0666])
- os.nice(increment)
- os.popen(cmd [,mode='r' [,bufsize]])
- os.popen2(cmd [,mode [,bufsize]])
- os.popen3(cmd [,mode [,bufsize]])
- os.popen4(cmd [,mode [,bufsize]])
- os.putenv(var, value)
- os.readlink(linkname)
- os.remove(filename)
- os.removedirs(pathname)
- os.rename(src, dst)
- os.renames(src, dst)
- os.rmdir(pathname)
- os.startfile(path)
- os.stat(pathname)
- os.strerror(code)
- os.symlink(src, dst)
- os.system(cmd)
- os.tempnam([dir [,prefix]])
- os.tmpfile()
- os.uname()
- os.unlink(filename)
- os.utime(pathname, times)
- CONSTANTS AND ATTRIBUTES
-
1.2.4. Special Data Values and Formats
-
FUNCTIONS
- random.betavariate(alpha, beta)
- random.choice(seq)
- random.cunifvariate(mean, arc)
- random.expovariate(lambda_)
- random.gamma(alpha, beta)
- random.gauss(mu, sigma)
- random.lognormvariate(mu, sigma)
- random.normalvariate(mu, sigma)
- random.paretovariate(alpha)
- random.random()
- random.randrange([start=0,] stop [,step=1])
- random.seed([x=time.time()])
- random.shuffle(seq [,random=random.random])
- random.uniform(min, max)
- random.vonmisesvariate(mu, kappa)
- random.weibullvariate(alpha, beta)
- FUNCTIONS
- CONSTANTS AND ATTRIBUTES
-
FUNCTIONS
- time.asctime([tuple=time.localtime()])
- time.clock()
- time.ctime([seconds=time.time()])
- time.gmtime([seconds=time.time()])
- time.localtime([seconds=time.time()])
- time.mktime(tuple)
- time.sleep(seconds)
- time.strftime(format [,tuple=time.localtime()])
- time.strptime(s [,format='%a %b %d %H:%M:%S %Y'])
- time.time()
-
FUNCTIONS
-
1.3. Other Modules in the Standard Library
- __builtin__
- 1.3.1. Serializing and Storing Python Objects
-
1.3.2. Platform-Specific Operations
- _winreg
- AE
- aepack
- aetypes
- applesingle
- buildtools
- calendar
- Carbon.AE, Carbon.App, Carbon.CF, Carbon.Cm, Carbon.Ctl, Carbon.Dlg, Carbon.Evt, Carbon.Fm, Carbon.Help, Carbon.List, Carbon.Menu, Carbon.Mlte, Carbon.Qd, Carbon.Qdoffs, Carbon.Qt, Carbon.Res, Carbon.Scrap, Carbon.Snd, Carbon.TE, Carbon.Win
- cd
- cfmfile
- ColorPicker
- ctb
- dl
- EasyDialogs
- fcntl
- findertools
- fl, FL, flp
- fm, FM
- fpectl
- FrameWork, MiniAEFrame
- gettext
- grp
- locale
- mac, macerrors, macpath
- macfs, macfsn, macostools
- MacOS
- macresource
- macspeech
- mactty
- mkcwproject
- msvcrt
- Nav
- nis
- pipes
- PixMapWrapper
- posix, posixfile
- preferences
- pty
- pwd
- pythonprefs
- py_resource
- quietconsole
- resource
- syslog
- tty, termios, TERMIOS
- W
- waste
- winsound
- xdrlib
- 1.3.3. Working with Multimedia Formats
-
1.3.4. Miscellaneous Other Modules
- array
- atexit
- BaseHTTPServer, SimpleHTTPServer, SimpleXMLRPCServer, CGIHTTPServer
- Bastion
- bisect
- cmath
- cmd
- code
- codeop
- compileall
- compiler, compiler.ast, compiler.visitor
- copy_reg
- curses, curses.ascii, curses.panel, curses.textpad, curses.wrapper
- dircache
- dis
- distutils
- doctest
- errno
- fpformat
- gc
- getpass
- imp
- inspect
- keyword
- math
- mutex
- new
- pdb
- popen2
- profile
- pstats
- pyclbr
- pydoc
- py_compile
- Queue
- readline, rlcompleter
- rexec
- sched
- signal
- site, user
- statcache
- statvfs
- thread, threading
- Tkinter, ScrolledText, Tix, turtle
- traceback
- unittest
- warnings
- weakref
- whrandom
-
2. Basic String Operations
-
2.1. Some Common Tasks
- 2.1.1. Problem: Quickly sorting lines on custom criteria
- 2.1.2. Problem: Reformatting paragraphs of text
- 2.1.3. Problem: Column statistics for delimited or flat-record files
- 2.1.4. Problem: Counting characters, words, lines, and paragraphs
- 2.1.5. Problem: Transmitting binary data as ASCII
- 2.1.6. Problem: Creating word or letter histograms
- 2.1.7. Problem: Reading a file backwards by record, line, or paragraph
-
2.2. Standard Modules
-
2.2.1. Basic String Transformations
- CONSTANTS
-
FUNCTIONS
- string.atof(s=...)
- string.atoi(s=...[,base=10])
- string.atol(s=...[,base=10])
- string.capitalize(s=...), "".capitalize()
- string.capwords(s=...), "".title()
- string.center(s=..., width=...), "".center(width)
- string.count(s, sub [,start [,end]]), "".count(sub [,start [,end]])
- "".endswith(suffix [,start [,end]])
- string.expandtabs(s=...[,tabsize=8]), "".expandtabs([,tabsize=8])
- string.find(s, sub [,start [,end]]), "".find(sub [,start [,end]])
- string.index(s, sub [,start [,end]]), "".index(sub [,start [,end]])
- "".isalpha()
- "".isalnum()
- "".isdigit()
- "".islower()
- "".isspace()
- "".istitle()
- "".isupper()
- string.join(words=...[,sep=" "]), "".join(words)
- string.joinfields(...)
- string.ljust(s=..., width=...), "".ljust(width)
- string.lower(s=...), "".lower()
- string.lstrip(s=...), "".lstrip([chars=string.whitespace])
- string.maketrans(from, to)
- string.replace(s=..., old=..., new=...[,maxsplit=...]), "".replace(old, new [,maxsplit])
- string.rfind(s, sub [,start [,end]]), "".rfind(sub [,start [,end]])
- string.rindex(s, sub [,start [,end]]), "".rindex(sub [,start [,end]])
- string.rjust(s=..., width=...), "".rjust(width)
- string.rstrip(s=...), "".rstrip([chars=string.whitespace])
- string.split(s=...[,sep=...[,maxsplit=...]]), "".split([,sep [,maxsplit]])
- string.splitfields(...)
- "".splitlines([keepends=0])
- "".startswith(prefix [,start [,end]])
- string.strip(s=...), "".strip([chars=string.whitespace])
- string.swapcase(s=...), "".swapcase()
- string.translate(s=..., table=...[,deletechars=""]), "".translate(table [,deletechars=""])
- string.upper(s=...), "".upper()
- string.zfill(s=..., width=...)
-
2.2.2. Strings as Files, and Files as Strings
- CLASSES
-
METHODS
- mmap.mmap.close()
- mmap.mmap.find(sub [,pos])
- mmap.mmap.flush([offset, size])
- mmap.mmap.move(target, source, length)
- mmap.mmap.read(num)
- mmap.mmap.read_byte()
- mmap.mmap.readline()
- mmap.mmap.resize(newsize)
- mmap.mmap.seek(offset [,mode])
- mmap.mmap.size()
- mmap.mmap.tell()
- mmap.mmap.write(s)
- mmap.mmap.write_byte(c)
- CONSTANTS
- CLASSES
-
METHODS
- StringIO.StringIO.close(), cStringIO.StringIO.close()
- StringIO.StringIO.flush(), cStringIO.StringIO.flush()
- StringIO.StringIO.getvalue(), cStringIO.StringIO.getvalue()
- StringIO.StringIO.isatty(), cStringIO.StringIO.isatty()
- StringIO.StringIO.read([num]), cStringIO.StringIO.read([num])
- StringIO.StringIO.readline([length=...]), cStringIO.StringIO.readline([length])
- StringIO.StringIO.readlines([sizehint=...]), cStringIO.StringIO.readlines([sizehint])
- cStringIO.StringIO.reset()
- StringIO.StringIO.seek(offset [,mode=0]), cStringIO.StringIO.seek(offset [,mode])
- StringIO.StringIO.tell(), cStringIO.StringIO.tell()
- StringIO.StringIO.truncate([len=0]), cStringIO.StringIO.truncate([len])
- StringIO.StringIO.write(s=...), cStringIO.StringIO.write(s)
- StringIO.StringIO.writelines(list=...), cStringIO.StringIO.writelines(list)
-
2.2.3. Converting Between Binary and ASCII
- FUNCTIONS
-
FUNCTIONS
- binascii.a2b_base64(s)
- binascii.a2b_hex(s)
- binascii.a2b_hqx(s)
- binascii.a2b_qp(s [,header=0])
- binascii.a2b_uu(s)
- binascii.b2a_base64(s)
- binascii.b2a_hex(s)
- binascii.b2a_hqx(s)
- binascii.b2a_qp(s [,quotetabs=0 [,istext=1 [,header=0]]])
- binascii.b2a_uu(s)
- binascii.crc32(s [,crc])
- binascii.crc_hqx(s, crc)
- binascii.hexlify(s)
- binascii.rlecode_hqx(s)
- binascii.rledecode_hqx(s)
- binascii.unhexlify(s)
- EXCEPTIONS
- FUNCTIONS
- CLASSES
- FUNCTIONS
- FUNCTIONS
- 2.2.4. Cryptography
-
2.2.5. Compression
- CLASSES
- METHODS AND ATTRIBUTES
- CONSTANTS
- FUNCTIONS
- CLASSES
-
METHODS AND ATTRIBUTES
- zipfile.ZipFile.close()
- zipfile.ZipFile.getinfo(name=...)
- zipfile.ZipFile.infolist()
- zipfile.ZipFile.namelist()
- zipfile.ZipFile.printdir()
- zipfile.ZipFile.read(name=...)
- zipfile.ZipFile.testzip()
- zipfile.ZipFile.write(filename=...[,arcname=...[,compress_type=...]])
- zipfile.ZipFile.writestr(zinfo=..., bytes=...)
- zipfile.ZipFile.NameToInfo
- zipfile.ZipFile.compression
- zipfile.ZipFile.debug = 0
- zipfile.ZipFile.filelist
- zipfile.ZipFile.filename
- zipfile.ZipFile.fp
- zipfile.ZipFile.mode
- zipfile.ZipFile.start_dir
- zipfile.ZipInfo.CRC
- zipfile.ZipInfo.comment
- zipfile.ZipInfo.compress_size
- zipfile.ZipInfo.compress_type
- zipfile.ZipInfo.create_system
- zipfile.ZipInfo.create_version
- zipfile.ZipInfo.date_time
- zipfile.ZipInfo.external_attr
- zipfile.ZipInfo.extract_version
- zipfile.ZipInfo.file_offset
- zipfile.ZipInfo.file_size
- zipfile.ZipInfo.filename
- zipfile.ZipInfo.header_offset
- zipfile.ZipInfo.volume
- EXCEPTIONS
- CONSTANTS
- FUNCTIONS
- CLASS FACTORIES
- METHODS AND ATTRIBUTES
- EXCEPTIONS
-
2.2.6. Unicode
- ascii, us-ascii
- base64
- latin-1, iso-8859-1
- quopri
- rot13
- utf-7
- utf-8
- utf-16
- utf-16-le
- utf-16-be
- unicode-escape
- raw-unicode-escape
- strict
- ignore
- replace
- u"".encode([enc [,errmode]]), "".encode([enc [,errmode]])
- unicode(s [,enc [,errmode]])
- unichr(cp)
- codecs.open(filename=...[,mode='rb' [,encoding=...[,errors='strict' [,buffering=1]]]])
- codecs.EncodedFile(file=..., data_encoding=...[,file_encoding=...[,errors='strict']])
-
FUNCTIONS
- unicodedata.bidirectional(unichr)
- unicodedata.category(unichr)
- unicodedata.combining(unichr)
- unicodedata.decimal(unichr [,default])
- unicodedata.decomposition(unichr)
- unicodedata.digit(unichr [,default])
- unicodedata.lookup(name)
- unicodedata.mirrored(unichr)
- unicodedata.name(unichr)
- unicodedata.numeric(unichr [,default])
- 2.3. Solving Problems
-
3. Regular Expressions
- 3.1. A Regular Expression Tutorial
-
3.2. Some Common Tasks
- 3.2.1. Problem: Making a text block flush left
- 3.2.2. Problem: Summarizing command-line option documentation
- 3.2.3. Problem: Detecting duplicate words
- 3.2.4. Problem: Checking for server errors
- 3.2.5. Problem: Reading lines with continuation characters
- 3.2.6. Problem: Identifying URLs and email addresses in texts
- 3.2.7. Problem: Pretty-printing numbers
-
3.3. Standard Modules
- 3.3.1. Versions and Optimizations
- 3.3.2. Simple Pattern Matching
-
3.3.3. Regular Expression Modules
- FUNCTIONS
- PATTERN SUMMARY
-
ATOMIC OPERATORS
- Plain symbol
- Escape: "\"
- Grouping operators: "(", ")"
- Backreference: "\d", "\dd"
- Character classes: "[", "]"
- Digit character class: "\d"
- Non-digit character class: "\D"
- Alphanumeric character class: "\w"
- Non-alphanumeric character class: "\W"
- Whitespace character class: "\s"
- Non-whitespace character class: "\S"
- Wildcard character: "."
- Beginning of line: "^"
- Beginning of string: "\A"
- End of line: "$"
- End of string: "\Z"
- Word boundary: "\b"
- Non-word boundary: "\B"
- Alternation operator: "|"
-
QUANTIFIERS
- Universal quantifier: "*"
- Non-greedy universal quantifier: "*?"
- Existential quantifier: "+"
- Non-greedy existential quantifier: "+?"
- Potentiality quantifier: "?"
- Non-greedy potentiality quantifier: "??"
- Exact numeric quantifier: "{num}"
- Lower-bound quantifier: "{min,}"
- Bounded numeric quantifier: "{min,max}"
- Non-greedy bounded quantifier: "{min,max}?"
-
GROUP-LIKE PATTERNS
- Pattern modifiers: "(?Limsux)"
- Comments: "(?#...)"
- Non-backreferenced atom: "(?:...)"
- Positive Lookahead assertion: "(?=...)"
- Negative Lookahead assertion: "(?!...)"
- Positive Lookbehind assertion: "(?<=...)"
- Negative Lookbehind assertion: "(?<!...)"
- Named group identifier: "(?P<name>)"
- Named group backreference: "(?P=name)"
- CONSTANTS
- FUNCTIONS
- CLASS FACTORIES
-
METHODS AND ATTRIBUTES
- re.compile.findall(s)
- re.compile.flags
- re.compile.groupindex
- re.compile.match(s [,start [,end]])
- re.compile.pattern
- re.compile.search(s [,start [,end]])
- re.compile.split(s [,maxsplit])
- re.compile.sub(repl, s [,count=0])
- re.compile.subn()
- re.match.end([group]), re.search.end([group])
- re.match.endpos, re.search.endpos
- re.match.expand(template), re.search.expand(template)
- re.match.group([group [,...]]), re.search.group([group [,...]])
- re.match.groupdict([defval]), re.search.groupdict([defval])
- re.match.groups([defval]), re.search.groups([defval])
- re.match.lastgroup, re.search.lastgroup
- re.match.lastindex, re.search.lastindex
- re.match.pos, re.search.pos
- re.match.re, re.search.re
- re.match.span([group]), re.search.span([group])
- re.match.start([group]), re.search.start([group])
- re.match.string, re.search.string
- EXCEPTIONS
-
4. Parsers and State Machines
- 4.1. An Introduction to Parsers
-
4.2. An Introduction to State Machines
- 4.2.1. Understanding State Machines
- 4.2.2. Text Processing State Machines
- 4.2.3. When Not to Use a State Machine
- 4.2.4. When to Use a State Machine
- 4.2.5. An Abstract State Machine Class
- 4.2.6. Processing a Report with a Concrete State Machine
- 4.2.7. Subgraphs and State Reuse
- 4.2.8. Exercise: Finding other solutions
-
4.3. Parser Libraries for Python
- 4.3.1. Specialized Parsers in the Standard Library
-
4.3.2. Low-Level State Machine Parsing
- BENCHMARKS
- DEBUGGING A TAG TABLE
-
CONSTANTS
- mx.TextTools.a2z, mx.TextTools.a2z_set
- mx.TextTools.A2Z, mx.TextTools.A2Z_set
- mx.TextTools.umlaute, mx.TextTools.umlaute_set
- mx.TextTools.Umlaute, mx.TextTools.Umlaute_set
- mx.TextTools.alpha, mx.TextTools.alpha_set
- mx.TextTools.german_alpha, mx.TextTools.german_alpha_set
- mx.TextTools.number, mx.TextTools.number_set
- mx.TextTools.alphanumeric, mx.TextTools.alphanumeric_set
- mx.TextTools.white, mx.TextTools.white_set
- mx.TextTools.newline, mx.TextTools.newline_set
- mx.TextTools.formfeed, mx.TextTools.formfeed_set
- mx.TextTools.whitespace, mx.TextTools.whitespace_set
- mx.TextTools.any, mx.TextTools.any_set
- COMMANDS
- UNCONDITIONAL COMMANDS
- MATCHING PARTICULAR CHARACTERS
- MATCHING SEQUENCES
- COMPOUND MATCHES
- MODIFIERS
- CLASSES
-
METHODS AND ATTRIBUTES
- mx.TextTools.BMS.search(s [,start [,end]]), mx.TextTools.FS.search(s [,start [,end]]), mx.TextTools.TextSearch.search(s [,start [,end]])
- mx.TextTools.BMS.find(s [,start [,end]]), mx.TextTools.FS.find(s [,start [,end]]), mx.TextTools.TextSearch.find(s [,start [,end]])
- mx.TextTools.BMS.findall(s [,start [,end]]), mx.TextTools.FS.findall(s [,start [,end]]), mx.TextTools.TextSearch.findall(s [,start [,end]])
- mx.TextTools.BMS.match, mx.TextTools.FS.match, mx.TextTools.TextSearch.match
- mx.TextTools.BMS.translate, mx.TextTools.FS.translate, mx.TextTools.TextSearch.translate
- mx.TextTools.CharSet.contains(c)
- mx.TextTools.CharSet.search(s [,direction [,start=0 [,stop=len(s)]]])
- mx.TextTools.CharSet.match(s [,direction [,start=0 [,stop=len(s)]]])
- mx.TextTools.CharSet.split(s [,start=0 [,stop=len(text)]])
- mx.TextTools.CharSet.splitx(s [,start=0 [,stop=len(text)]])
- mx.TextTools.CharSet.strip(s [,where=0 [,start=0 [,stop=len(s)]]])
- FUNCTIONS
-
UTILITY FUNCTIONS
- mx.TextTools.charsplit(s, char, [start [,end]])
- mx.TextTools.collapse(s, sep=' ')
- mx.TextTools.countlines(s)
- mx.TextTools.find(s, search_obj [,start [,end]])
- mx.TextTools.findall(s, search_obj [,start [,end]])
- mx.TextTools.hex2str(hexstr)
- mx.TextTools.is_whitespace(s [,start [,end]])
- mx.TextTools.isascii(s)
- mx.TextTools.join(joinlist [,sep='' [,start [,end]]])
- mx.TextTools.lower(s)
- mx.TextTools.prefix(s, prefixes [,start [,stop [,translate]]])
- mx.TextTools.multireplace(s, replacements [,start [,stop]])
- mx.TextTools.replace(s, old, new [,start [,stop]])
- mx.TextTools.setfind(s, set [,start [,end]])
- mx.TextTools.setsplit(s, set [,start [,stop]])
- mx.TextTools.setsplitx(text, set [,start=0, stop=len(text)])
- mx.TextTools.splitat(s, char, [n=1 [,start [,end]]])
- mx.TextTools.splitlines(s)
- mx.TextTools.splitwords(s)
- mx.TextTools.str2hex(s)
- mx.TextTools.suffix(s, suffixes [,start [,stop [,translate]]])
- mx.TextTools.upper(s)
- 4.3.3. High-Level EBNF Parsing
- 4.3.4. High-Level Programmatic Parsing
-
5. Internet Tools and Techniques
-
5.1. Working with Email and Newsgroups
-
5.1.1. Manipulating and Creating Message Texts
-
CLASSES
- email.MIMEBase.MIMEBase(maintype, subtype, **params)
- email.MIMENonMultipart.MIMENonMultipart(maintype, subtype, **params)
- email.MIMEMultipart.MIMEMultipart([subtype='mixed' [,boundary [,*subparts [,**params]]]])
- email.MIMEAudio.MIMEAudio(audiodata [,subtype [,encoder [,**params]]])
- email.MIMEImage.MIMEImage(imagedata [,subtype [,encoder [,**params]]])
- email.MIMEText.MIMEText(text [,subtype [,charset]])
- FUNCTIONS
- FUNCTIONS
- CLASSES
- METHODS
- CLASSES
- METHODS
- FUNCTIONS
- FUNCTIONS
- CLASSES
-
METHODS AND ATTRIBUTES
- email.Message.Message.add_header(field, value [,**params])
- email.Message.Message.as_string([unixfrom=0])
- email.Message.Message.attach(mess)
- email.Message.Message.del_param(param [,header='Content-Type' [,requote=1]])
- email.Message.Message.epilogue
- email.Message.Message.get_all(field [,failobj=None])
- email.Message.Message.get_boundary([failobj=None])
- email.Message.Message.get_charsets([failobj=None])
- email.Message.Message.get_content_charset([failobj=None])
- email.Message.Message.get_content_maintype()
- email.Message.Message.get_content_subtype()
- email.Message.Message.get_content_type()
- email.Message.Message.get_default_type()
- email.Message.Message.get_filename([failobj=None])
- email.Message.Message.get_param(param [,failobj [,header=...[,unquote=1]]])
- email.Message.Message.get_params([,failobj=None [,header=...[,unquote=1]]])
- email.Message.Message.get_payload([i [,decode=0]])
- email.Message.Message.get_unixfrom()
- email.Message.Message.is_multipart()
- email.Message.Message.preamble
- email.Message.Message.replace_header(field, value)
- email.Message.Message.set_boundary(s)
- email.Message.Message.set_default_type(ctype)
- email.Message.Message.set_param(param, value [,header='Content-Type' [,requote=1 [,charset [,language]]]])
- email.Message.Message.set_payload(payload [,charset=None])
- email.Message.Message.set_type(ctype [,header='Content-Type' [,requote=1]])
- email.Message.Message.set_unixfrom(s)
- email.Message.Message.walk()
- CLASSES
- METHODS
-
FUNCTIONS
- email.Utils.decode_rfc2231(s)
- email.Utils.encode_rfc2231(s [,charset [,language]])
- email.Utils.formataddr(pair)
- email.Utils.formatdate([timeval [,localtime=0]])
- email.Utils.getaddresses(addresses)
- email.Utils.make_msgid([seed])
- email.Utils.mktime_tz(tuple)
- email.Utils.parseaddr(address)
- email.Utils.parsedate(datestr)
- email.Utils.parsedate_tz(datestr)
- email.Utils.quote(s)
- email.Utils.unquote(s)
-
CLASSES
-
5.1.2. Communicating with Mail Servers
- CLASSES
-
METHODS
- imaplib.IMAP4.close()
- imaplib.IMAP4.expunge()
- imaplib.IMAP4.fetch(message_set, message_parts)
- imaplib.IMAP4.list([dirname='' [,pattern='*']])
- imaplib.IMAP4.login(user, passwd)
- imaplib.IMAP4.logout()
- imaplib.IMAP4.search(charset, criterion1 [,criterion2 [,...]])
- imaplib.IMAP4.select([mbox='INBOX' [,readonly=0]])
- CLASSES
- METHODS
- CLASSES
- METHODS
-
5.1.3. Message Collections and Message Parts
-
CLASSES
- mailbox.UnixMailbox(file [,factory=rfc822.Message])
- mailbox.PortableUnixMailbox(file [,factory=rfc822.Message])
- mailbox.BabylMailbox(file [,factory=rfc822.Message])
- mailbox.MmdfMailbox(file [,factory=rfc822.Message])
- mailbox.MHMailbox(dirname [,factory=rfc822.Message])
- mailbox.Maildir(dirname [,factory=rfc822.Message])
- FUNCTIONS
- ATTRIBUTES
-
CLASSES
-
5.2. World Wide Web Applications
- 5.2.1. Common Gateway Interface
-
5.2.2. Parsing, Creating, and Manipulating HTML Documents
- ATTRIBUTES
- CLASSES
-
METHODS AND ATTRIBUTES
- HTMLParser.HTMLParser.close()
- HTMLParser.HTMLParser.feed(data)
- HTMLParser.HTMLParser.getpos()
- HTMLParser.HTMLParser.handle_charref(name)
- HTMLParser.HTMLParser.handle_comment(data)
- HTMLParser.HTMLParser.handle_data(data)
- HTMLParser.HTMLParser.handle_decl(data)
- HTMLParser.HTMLParser.handle_endtag(tag)
- HTMLParser.HTMLParser.handle_entityref(name)
- HTMLParser.HTMLParser.handle_pi(data)
- HTMLParser.HTMLParser.handle_startendtag(tag, attrs)
- HTMLParser.HTMLParser.handle_starttag(tag, attrs)
- HTMLParser.HTMLParser.lasttag
- HTMLParser.HTMLParser.reset()
-
5.2.3. Accessing Internet Resources
- FUNCTIONS
- CLASSES
-
METHODS AND ATTRIBUTES
- urllib.FancyURLopener.get_user_passwd(host, realm)
- urllib.URLopener.open(url [,data]), urllib.FancyURLopener.open(url [,data])
- urllib.URLopener.open_unknown(url [,data]), urllib.FancyURLopener.open_unknown(url [,data])
- urllib.FancyURLopener.prompt_user_passwd(host, realm)
- urllib.URLopener.retrieve(url [,fname [,reporthook [,data]]]), urllib.FancyURLopener.retrieve(url [,fname [,reporthook [,data]]])
- urllib.URLopener.version, urllib.FancyURLopener.version
- FUNCTIONS
- 5.3. Synopses of Other Internet Modules
- 5.4. Understanding XML
-
A. A Selective and Impressionistic Short Review of Python
- A.1. What Kind of Language Is Python?
- A.2. Namespaces and Bindings
- A.3. Datatypes
-
A.4. Flow Control
- A.4.1. if/then/else Statements
- A.4.2. Boolean Shortcutting
- A.4.3. for/continue/break Statements
- A.4.4. map(), filter(), reduce(), and List Comprehensions
- A.4.5. while/else/continue/break Statements
- A.4.6. Functions, Simple Generators, and the yield Statement
- A.4.7. Raising and Catching Exceptions
- A.4.8. Data as Code
- A.5. Functional Programming
- B. A Data Compression Primer
- C. Understanding Unicode
- D. A State Machine for Adding Markup to Text
- E. Glossary
Product information
- Title: Text Processing in Python
- Author(s):
- Release date: June 2003
- Publisher(s): Addison-Wesley Professional
- ISBN: None