Working with Text

Book Description

What is text mining, and how can it be used? What relevance do these methods have to everyday work in information science and the digital humanities? How does one develop competences in text mining? Working with Text provides a series of cross-disciplinary perspectives on text mining and its applications. As text mining raises legal and ethical issues, the legal background of text mining and the responsibilities of the engineer are discussed in this book. Chapters provide an introduction to the use of the popular GATE text mining package with data drawn from social media, the use of text mining to support semantic search, the development of an authority system to support content tagging, and recent techniques in automatic language identification. Focused studies describe text mining on historical texts, automated indexing using constrained vocabularies, and the use of natural language processing to explore the climate science literature. Interviews are included that offer a glimpse into the real-life experience of working within commercial and academic text mining.



  • Introduces text analysis and text mining tools
  • Provides a comprehensive overview of costs and benefits
  • Introduces the topic, making it accessible to a general audience in a variety of fields, including examples from biology, chemistry, sociology, and criminology

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Contributors
  6. Preface
  7. Acknowledgements
  8. Chapter 1: Working with Text
    1. 1.1 Introduction: Portraits of the Past
    2. 1.2 The Reading Robot
    3. 1.3 From Data to Text Mining
    4. 1.4 Definitions of Text Mining
    5. 1.5 Exploring the Disciplinary Neighbourhood
    6. 1.6 Prerequisites for Text Mining
    7. 1.7 Learning Minecraft: What Makes a Text Miner?
    8. 1.8 Contemporary Attitudes to Text Mining
    9. 1.9 Conclusions
  9. Chapter 2: A Day at Work (with Text): A Brief Introduction
    1. Abstract
    2. 2.1 Introduction
    3. 2.2 Encouraging an Interest in Text Mining
    4. 2.3 Legal and Ethical Aspects of Text Mining
    5. 2.4 Manual Annotation: Preparing for Evaluation
    6. 2.5 Common Text Mining Tasks
    7. 2.6 Basic Corpus Analysis
    8. 2.7 Preprocessing a Text
    9. 2.8 Extracting Features from a Text
    10. 2.9 Information Extraction
    11. 2.10 Applications of Indexing and Metadata Extraction
    12. 2.11 Extraction of Subjective Views
    13. 2.12 Build, Customise or Apply? Choosing an Appropriate Implementation
    14. 2.13 Evaluation
    15. 2.14 The Role of Visualisation in Text Mining
    16. 2.15 Visualisation Tools and Frameworks
    17. 2.16 Conclusions
  10. Chapter 3: If You Find Yourself in a Hole, Stop Digging: Legal and Ethical Issues of Text/Data Mining in Research
    1. Abstract
    2. 3.1 Introduction
    3. 3.2 Key Legal Issues in Data Mining
    4. 3.3 Ethics
    5. 3.4 Conclusions: Working on the Borders of Law and Ethics
  11. Chapter 4: Responsible Content Mining
    1. Abstract
    2. 4.1 Introduction to Content Mining
    3. 4.2 Obtaining Permission to Content Mine
    4. 4.3 Responsible Crawling
    5. 4.4 Publication of Results
    6. 4.5 Citation and Acknowledgement
    7. 4.6 Proposed Best Practise Guidelines for Content Mining
  12. Chapter 5: Text Mining for Semantic Search in Europe PubMed Central Labs
    1. Abstract
    2. 5.1 Introduction
    3. 5.2 Previous Work
    4. 5.3 Design and Implementation
    5. 5.4 Performance and Critique
    6. 5.5 Conclusions
    7. 5.6 Availability
    8. Appendix: Resources Used for Indexing
  13. Chapter 6: Extracting Information from Social Media with GATE
    1. Abstract
    2. Acknowledgements
    3. 6.1 Introduction
    4. 6.2 Social Media Streams: Characteristics, Challenges and Opportunities
    5. 6.3 The GATE Family of Text Mining Tools: An Overview
    6. 6.4 Information Extraction: An Overview
    7. 6.5 IE from Social Media with GATE
    8. 6.6 Conclusion and Future Work
  14. Chapter 7: Newton: Building an Authority-Driven Company Tagging and Resolution System
    1. Abstract
    2. Acknowledgements
    3. 7.1 Introduction
    4. 7.2 Related Work
    5. 7.3 System Overview
    6. 7.4 Learning Company Name Links
    7. 7.5 System Development
    8. 7.6 Conclusions
  15. Chapter 8: Automatic Language Identification
    1. Abstract
    2. Acknowledgements
    3. 8.1 Introduction
    4. 8.2 Historical Overview
    5. 8.3 Computational Techniques
    6. 8.4 Applications and Related Tasks
    7. 8.5 Conclusion
  16. Chapter 9: User-Driven Text Mining of Historical Text
    1. Abstract
    2. Acknowledgements
    3. 9.1 Related Work on Text Mining Historical Documents
    4. 9.2 The Trading Consequences System
    5. 9.3 Data Collections
    6. 9.4 Challenges of Processing Digitised Historical Text
    7. 9.5 Text Mining Component
    8. 9.6 User-Driven Text Mining
    9. 9.7 Conclusion
  17. Chapter 10: Automatic Text Indexing with SKOS Vocabularies in HIVE
    1. Abstract
    2. Acknowledgements
    3. 10.1 Introduction
    4. 10.2 Automatic Indexing with Machine Learning
    5. 10.3 Algorithms for Text Data Mining: KEA, KEA++ and MAUI
    6. 10.4 Algorithm Training and Workflow
    7. 10.5 The HIVE System
    8. 10.6 Text Mining for Documents Indexing Using SKOS Vocabularies in HIVE
    9. 10.7 Conclusions
  18. Chapter 11: The PIMMS Project and Natural Language Processing for Climate Science: Extending the ChemicalTagger Natural Language Processing Tool with Climate Science Controlled Vocabularies
    1. Abstract
    2. Acknowledgements
    3. 11.1 Introduction
    4. 11.2 Methodology
    5. 11.3 Results
    6. 11.4 Overall Conclusions and Suggestions for Further Work
  19. Chapter 12: Building Better Mousetraps: A Linguist in NLP
  20. Chapter 13: Raúl Garreta, Co-founder of Tryolabs.com, Tells Emma Tonkin About the Journey from Software Engineering Graduate to Startup Entrepreneur
  21. Appendix A: Resources for Text Mining
    1. A.1 Introduction
    2. A.2 Text Mining Software and Libraries
    3. A.3 Text Mining Frameworks and Packages
    4. A.4 Web Mining Packages
    5. A.5 Data Mining Packages
    6. A.6 A Selection of Components and Packages
    7. A.7 Web Interfaces for Text Mining
    8. A.8 Distribution and Scaling
  22. Appendix B: Databases and Vocabularies
    1. B.1 Sample Data Sets
    2. B.2 Datasets primarily used for text categorization
    3. Sources
    4. Uses
    5. B.3 Useful Tertiary Data Sets
    6. Sources
  23. Appendix C: Visualisation Tools and Resources
    1. C.1 D3 – Data Driven Documents
    2. C.2 Processing and Processing.js
    3. C.3 Map Display
    4. C.4 Command Line Visualisation Tools
    5. C.5 Graphical Tools
    6. C.6 Geographic Data Sets
  24. Appendix D: Learning Opportunities
    1. D.1 United Kingdom
    2. D.2 Ireland
    3. D.3 Sweden
    4. D.4 France
    5. D.5 United States
    6. D.6 Short Courses, Training Courses and MOOCs
  25. Index