Practical Text Mining with Perl

Book Description

Provides readers with the methods, algorithms, and means to perform text mining tasks.

This book is devoted to the fundamentals of text mining using Perl, an open-source programming language that is freely available online (www.perl.org). It covers text mining ideas from several perspectives (statistics, data mining, linguistics, and information retrieval) and gives readers the means to complete text mining tasks on their own.

The book begins with an introduction to regular expressions (a text pattern methodology) and quantitative text summaries, both of which are fundamental tools for analyzing text. It then builds on this foundation to explore:

  • Probability and texts, including the bag-of-words model

  • Information retrieval techniques such as the TF-IDF similarity measure

  • Concordance lines and corpus linguistics

  • Multivariate techniques such as correlation, principal components analysis, and clustering

  • Additional topics such as Perl modules, analyzing German text, and permutation tests

Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format.

Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.

Table of Contents

    1. COVER
    2. SERIES TITLE
    3. TITLE
    4. COPYRIGHT PAGE
    5. DEDICATION
    6. LIST OF FIGURES
    7. LIST OF TABLES
    8. PREFACE
    9. ACKNOWLEDGMENTS
    10. CHAPTER 1: INTRODUCTION
      1. 1.1 OVERVIEW OF THIS BOOK
      2. 1.2 TEXT MINING AND RELATED FIELDS
      3. 1.3 ADVICE FOR READING THIS BOOK
    11. CHAPTER 2: TEXT PATTERNS
      1. 2.1 INTRODUCTION
      2. 2.2 REGULAR EXPRESSIONS
      3. 2.3 FINDING WORDS IN A TEXT
      4. 2.4 DECOMPOSING POE’S “THE TELL-TALE HEART” INTO WORDS
      5. 2.5 A SIMPLE CONCORDANCE
      6. 2.6 FIRST ATTEMPT AT EXTRACTING SENTENCES
      7. 2.7 REGEX ODDS AND ENDS
      8. 2.8 REFERENCES
      9. PROBLEMS
    12. CHAPTER 3: QUANTITATIVE TEXT SUMMARIES
      1. 3.1 INTRODUCTION
      2. 3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL
      3. 3.3 ARRAYS AND CONTEXT IN PERL
      4. 3.4 WORD LENGTHS IN POE’S “THE TELL-TALE HEART”
      5. 3.5 ARRAYS AND FUNCTIONS
      6. 3.6 HASHES
      7. 3.7 TWO TEXT APPLICATIONS
      8. 3.8 COMPLEX DATA STRUCTURES
      9. 3.9 REFERENCES
      10. 3.10 FIRST TRANSITION
      11. PROBLEMS
    13. CHAPTER 4: PROBABILITY AND TEXT SAMPLING
      1. 4.1 INTRODUCTION
      2. 4.2 PROBABILITY
      3. 4.3 CONDITIONAL PROBABILITY
      4. 4.4 MEAN AND VARIANCE OF RANDOM VARIABLES
      5. 4.5 THE BAG-OF-WORDS MODEL FOR POE’S “THE BLACK CAT”
      6. 4.6 THE EFFECT OF SAMPLE SIZE
      7. 4.7 REFERENCES
      8. PROBLEMS
    14. CHAPTER 5: APPLYING INFORMATION RETRIEVAL TO TEXT MINING
      1. 5.1 INTRODUCTION
      2. 5.2 COUNTING LETTERS AND WORDS
      3. 5.3 TEXT COUNTS AND VECTORS
      4. 5.4 THE TERM-DOCUMENT MATRIX APPLIED TO POE
      5. 5.5 MATRIX MULTIPLICATION
      6. 5.6 FUNCTIONS OF COUNTS
      7. 5.7 DOCUMENT SIMILARITY
      8. 5.8 REFERENCES
      9. PROBLEMS
    15. CHAPTER 6: CONCORDANCE LINES AND CORPUS LINGUISTICS
      1. 6.1 INTRODUCTION
      2. 6.2 SAMPLING
      3. 6.3 CORPUS AS BASELINE
      4. 6.4 CONCORDANCING
      5. 6.5 COLLOCATIONS AND CONCORDANCE LINES
      6. 6.6 APPLICATIONS WITH REFERENCES
      7. 6.7 SECOND TRANSITION
      8. PROBLEMS
    16. CHAPTER 7: MULTIVARIATE TECHNIQUES WITH TEXT
      1. 7.1 INTRODUCTION
      2. 7.2 BASIC STATISTICS
      3. 7.3 BASIC LINEAR ALGEBRA
      4. 7.4 PRINCIPAL COMPONENTS ANALYSIS
      5. 7.5 TEXT APPLICATIONS
      6. 7.6 APPLICATIONS AND REFERENCES
      7. PROBLEMS
    17. CHAPTER 8: TEXT CLUSTERING
      1. 8.1 INTRODUCTION
      2. 8.2 CLUSTERING
      3. 8.3 A NOTE ON CLASSIFICATION
      4. 8.4 REFERENCES
      5. 8.5 LAST TRANSITION
      6. PROBLEMS
    18. CHAPTER 9: A SAMPLE OF ADDITIONAL TOPICS
      1. 9.1 INTRODUCTION
      2. 9.2 PERL MODULES
      3. 9.3 OTHER LANGUAGES: ANALYZING GOETHE IN GERMAN
      4. 9.4 PERMUTATION TESTS
      5. 9.5 REFERENCES
    19. APPENDIX A: OVERVIEW OF PERL FOR TEXT MINING
      1. A.1 BASIC DATA STRUCTURES
      2. A.2 OPERATORS
      3. A.3 BRANCHING AND LOOPING
      4. A.4 A FEW PERL FUNCTIONS
      5. A.5 INTRODUCTION TO REGULAR EXPRESSIONS
    20. APPENDIX B: SUMMARY OF R USED IN THIS BOOK
      1. B.1 BASICS OF R
      2. B.2 THIS BOOK’S R CODE
    21. REFERENCES
    22. INDEX