Python and HDF5

Book description

Gain hands-on experience with HDF5 for storing scientific data in Python. This practical guide quickly gets you up to speed on the details, best practices, and pitfalls of using HDF5 to archive and share numerical datasets ranging in size from gigabytes to terabytes.

Through real-world examples and practical exercises, you’ll explore topics such as scientific datasets, hierarchically organized groups, user-defined metadata, and interoperable files. The examples work with both Python 2 and Python 3. If you’re familiar with the basics of Python data analysis, this is an ideal introduction to HDF5.

  • Get set up with HDF5 tools and create your first HDF5 file
  • Work with datasets by learning how to use the HDF5 Dataset object
  • Understand advanced features like dataset chunking and compression
  • Learn how to work with HDF5’s hierarchical structure, using groups
  • Create self-describing files by adding metadata with HDF5 attributes
  • Take advantage of HDF5’s type system to create interoperable files
  • Express relationships among data with references, named types, and dimension scales
  • Discover how Python mechanisms for writing parallel code interact with HDF5

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Introduction
    1. Python and HDF5
      1. Organizing Data and Metadata
      2. Coping with Large Data Volumes
    2. What Exactly Is HDF5?
      1. HDF5: The File
      2. HDF5: The Library
      3. HDF5: The Ecosystem
  3. 2. Getting Started
    1. HDF5 Basics
    2. Setting Up
      1. Python 2 or Python 3?
      2. Code Examples
      3. NumPy
      4. HDF5 and h5py
      5. IPython
      6. Timing and Optimization
    3. The HDF5 Tools
      1. HDFView
      2. ViTables
      3. Command Line Tools
    4. Your First HDF5 File
      1. Use as a Context Manager
      2. File Drivers
        1. core driver
        2. family driver
        3. mpio driver
      3. The User Block
  4. 3. Working with Datasets
    1. Dataset Basics
      1. Type and Shape
      2. Reading and Writing
      3. Creating Empty Datasets
      4. Saving Space with Explicit Storage Types
      5. Automatic Type Conversion and Direct Reads
      6. Reading with astype
      7. Reshaping an Existing Array
      8. Fill Values
    2. Reading and Writing Data
      1. Using Slicing Effectively
      2. Start-Stop-Step Indexing
      3. Multidimensional and Scalar Slicing
      4. Boolean Indexing
      5. Coordinate Lists
      6. Automatic Broadcasting
      7. Reading Directly into an Existing Array
      8. A Note on Data Types
    3. Resizing Datasets
      1. Creating Resizable Datasets
      2. Data Shuffling with resize
      3. When and How to Use resize
  5. 4. How Chunking and Compression Can Help You
    1. Contiguous Storage
    2. Chunked Storage
    3. Setting the Chunk Shape
      1. Auto-Chunking
      2. Manually Picking a Shape
    4. Performance Example: Resizable Datasets
    5. Filters and Compression
      1. The Filter Pipeline
      2. Compression Filters
      3. GZIP/DEFLATE Compression
      4. SZIP Compression
      5. LZF Compression
      6. Performance
    6. Other Filters
      1. SHUFFLE Filter
      2. FLETCHER32 Filter
    7. Third-Party Filters
  6. 5. Groups, Links, and Iteration: The “H” in HDF5
    1. The Root Group and Subgroups
    2. Group Basics
      1. Dictionary-Style Access
      2. Special Properties
    3. Working with Links
      1. Hard Links
      2. Free Space and Repacking
      3. Soft Links
      4. External Links
      5. A Note on Object Names
      6. Using get to Determine Object Types
      7. Using require to Simplify Your Application
    4. Iteration and Containership
      1. How Groups Are Actually Stored
      2. Dictionary-Style Iteration
      3. Containership Testing
    5. Multilevel Iteration with the Visitor Pattern
      1. Visit by Name
      2. Multiple Links and visit
      3. Visiting Items
      4. Canceling Iteration: A Simple Search Mechanism
    6. Copying Objects
      1. Single-File Copying
    7. Object Comparison and Hashing
  7. 6. Storing Metadata with Attributes
    1. Attribute Basics
      1. Type Guessing
      2. Strings and File Compatibility
      3. Python Objects
      4. Explicit Typing
    2. Real-World Example: Accelerator Particle Database
      1. Application Format on Top of HDF5
      2. Analyzing the Data
  8. 7. More About Types
    1. The HDF5 Type System
    2. Integers and Floats
    3. Fixed-Length Strings
    4. Variable-Length Strings
      1. The vlen String Data Type
      2. Working with vlen String Datasets
      3. Byte Versus Unicode Strings
      4. Using Unicode Strings
      5. Don’t Store Binary Data in Strings!
      6. Future-Proofing Your Python 2 Application
    5. Compound Types
    6. Complex Numbers
    7. Enumerated Types
    8. Booleans
    9. The array Type
    10. Opaque Types
    11. Dates and Times
  9. 8. Organizing Data with References, Types, and Dimension Scales
    1. Object References
      1. Creating and Resolving References
      2. References as “Unbreakable” Links
      3. References as Data
    2. Region References
      1. Creating Region References and Reading
      2. Fancy Indexing
      3. Finding Datasets with Region References
    3. Named Types
      1. The Datatype Object
      2. Linking to Named Types
      3. Managing Named Types
    4. Dimension Scales
      1. Creating Dimension Scales
      2. Attaching Scales to a Dataset
  10. 9. Concurrency: Parallel HDF5, Threading, and Multiprocessing
    1. Python Parallel Basics
    2. Threading
    3. Multiprocessing
    4. MPI and Parallel HDF5
      1. A Very Quick Introduction to MPI
      2. MPI-Based HDF5 Program
      3. Collective Versus Independent Operations
      4. Atomicity Gotchas
  11. 10. Next Steps
    1. Asking for Help
    2. Contributing
  12. Index
  13. About the Author
  14. Colophon
  15. Copyright

Product information

  • Title: Python and HDF5
  • Author(s): Andrew Collette
  • Release date: November 2013
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449367831