Emerging Technologies of Text Mining: Techniques and Applications

Book Description

"Massive amounts of textual data make up most organizations' stored information. Therefore, there is increasingly high demand for a comprehensive resource providing practical hands-on knowledge for real-world applications.

Emerging Technologies of Text Mining: Techniques and Applications provides the most recent technical information related to the computational models of the text mining process, discussing techniques within the realms of classification, association analysis, information extraction, and clustering. Offering an innovative approach to the utilization of textual information mining to maximize competitive advantage, Emerging Technologies of Text Mining: Techniques and Applications will provide libraries with the defining reference on this topic."

Table of Contents

  1. Copyright
  2. Foreword
  3. Preface
  4. Acknowledgment
  5. I. Information Extraction: Methodologies and Applications
    1. ABSTRACT
    2. INTRODUCTION
    3. METHODOLOGIES
      1. Rule Learning-Based Extraction Methods
        1. Dictionary-Based Method
        2. Rule-Based Method
          1. (LP)2
          2. iASA
        3. Wrapper Induction
          1. WIEN
          2. BWI
      2. Classification-Based Extraction Methods
        1. Classification Model
        2. Boundary Detection Using Classification Model
        3. Enhancing IE by a Two-Level Boundary Classification Model
        4. Enhancing IE by Unbalance Classification Model
        5. Sequential Labeling-Based Extraction Methods
        6. Generative Model
          1. Limitations of Generative Models
        7. Discriminative Models
          1. Maximum Entropy Markov Models (MEMMs)
          2. Label Bias Problem
          3. Conditional Random Fields (CRFs)
      3. Sequential Labeling-Based Extraction Methods
        1. NonLinear Conditional Random Fields
        2. Conditional Random Fields for Relational Learning
        3. 2D CRFs for Web Information Extraction
        4. Dynamic CRFs
        5. Tree-Structure CRFs for Information Extraction
    4. APPLICATIONS
      1. Information Extraction in Digital Libraries
      2. Information Extraction from E-Mails
      3. Person Profile Extraction
      4. Table Extraction Using Conditional Random Fields
      5. Shallow Parsing with Conditional Random Fields
    5. FUTURE RESEARCH DIRECTIONS
    6. CONCLUSION
    7. ACKNOWLEDGMENT
    8. REFERENCES
      1. Additional Reading
    9. ENDNOTE
  6. II. Creating Strategic Information for Organizations with Structured Text
    1. ABSTRACT
    2. INTRODUCTION
    3. INTELLIGENCE VS. ESPIONAGE
    4. A SYSTEMIC APPROACH
    5. COLLECTION AND STORAGE: DATA WAREHOUSE
    6. TREATMENT AND PREPARATION: CONDITION OF QUALITY
    7. DATA MINING
    8. SOME INTELLIGENT APPLICATIONS
      1. Relationship with Clients
      2. Credit and Insolvency
      3. Wal-Mart "Beer and Diapers"
      4. Procter & Gamble Sales Data
      5. Sky Survey Cataloging
    9. TEXT MINING
      1. Practical Examples
        1. The Corpus of Data to be Treated
        2. The Analyses Performed
    10. INFORMATION DIFFUSION TO THE DECISION MAKER
    11. FUTURE TRENDS
    12. CONCLUSION
    13. REFERENCES
      1. Additional Reading
    14. APPENDIX 1
  7. III. Automatic NLP for Competitive Intelligence
    1. ABSTRACT
    2. COMPETITIVE INTELLIGENCE
    3. SUPPORTING FUNCTIONALITIES TO CI
    4. TEXT MINING
    5. IMPORTANCE OF LINGUISTIC KNOWLEDGE
    6. NATURAL LANGUAGE PROCESSING
    7. NLP MODEL
      1. Automatic Acquisition
      2. The Lexicon
      3. About the Delimitation of a Lexical Unit
      4. Ontology
      5. Precision and Recall
      6. NLP Techniques
        1. Tokenization
        2. Normalization
        3. Multiword Expression
        4. Sentence Boundary
        5. Part-of-Speech Tagging
        6. Phrase Recognition
        7. Named Entity Recognition
        8. Named Entity Classification
        9. Parsing
        10. Coreference
          1. Acronyms, Initials and Abbreviations
          2. Truncated Names
          3. Pronominal Anaphora
          4. Synonyms
          5. Misspelling
      7. Word Sense Discrimination
      8. Automatic Detection of Synonyms
      9. CI Functionalities
        1. Filtering
        2. Event Alert
        3. Semantic Search
    8. CASE STUDY
    9. CONCLUSION
    10. FUTURE RESEARCH DIRECTIONS
    11. REFERENCES
      1. Additional Reading
    12. ENDNOTES
  8. IV. Mining Profiles and Definitions with Natural Language Processing
    1. ABSTRACT
    2. INTRODUCTION
    3. NATURAL LANGUAGE PROCESSING TOOLS
      1. Text Processing Tools
      2. Semantic Annotation
      3. Coreference Resolution
    4. SYNTACTIC AND SEMANTIC ANALYSIS
    5. SUMMARISATION TOOLKIT
    6. CASE STUDIES
      1. Definitional Question Answering
      2. Our Approach
      3. Linguistic Patterns
      4. Secondary Terms
      5. Identifying Definitions in Texts
      6. Profile-Based Summarisation
      7. Sentence Extraction System
      8. Preprocessing
      9. Content Selection
      10. Greedy Sentence Removal
      11. Evaluation
    7. CONCLUSION
    8. FUTURE RESEARCH DIRECTIONS
    9. ACKNOWLEDGMENT
    10. REFERENCES
      1. Additional Reading
    11. ENDNOTE
  9. V. Deriving Taxonomy from Documents at Sentence Level
    1. ABSTRACT
    2. INTRODUCTION
    3. DOCUMENT MODELING WITH SALIENT SEMANTIC FEATURES
      1. Frequent Word Sequences
      2. Document Model
        1. An Illustration Example
    4. HIERARCHICAL AGGLOMERATIVE CLUSTERING
    5. EXPERIMENTS AND RESULTS
    6. DISCUSSION
    7. CONCLUSION
    8. FUTURE RESEARCH DIRECTIONS
    9. REFERENCES
      1. Additional Readings
  10. VI. Rule Discovery from Textual Data
    1. ABSTRACT
    2. INTRODUCTION
    3. FORMAT OF TEXTUAL DATA
    4. FUZZY DECISION TREE
      1. Format of Fuzzy Decision Tree
      2. Inductive Learning Method
      3. Inference Method
    5. RULE DISCOVERY BASED ON A KEY CONCEPT DICTIONARY
      1. Format of a Key Concept Dictionary
      2. Creation of a Key Concept Dictionary
      3. Acquisition of Rules
    6. RULE DISCOVERY BASED ON A KEY PHRASE PATTERN DICTIONARY
      1. Format of a Key Phrase Pattern Dictionary
      2. Introduction of Center Word Sets
      3. Acquisition of Rules
    7. APPLICATION TASKS
      1. An Analysis System for Daily Business Reports
      2. An E-Mail Analysis System
    8. CONCLUSION
    9. FUTURE RESEARCH DIRECTIONS
    10. REFERENCES
      1. Additional Reading
  11. VII. Exploring Unclassified Texts Using Multiview Semisupervised Learning
    1. ABSTRACT
    2. INTRODUCTION
    3. SEMISUPERVISED LEARNING
    4. SEMISUPERVISED MULTIVIEW ALGORITHMS
    5. CO-TRAINING
    6. CO-EM
    7. CO-TESTING and CO-EMT
      1. Other Algorithms
    8. APPLICATIONS OF CO-TRAINING TO TEXT CLASSIFICATION
      1. Web Pages Classification
      2. E-Mail Classification
      3. Multigram Approach
    9. EXPERIMENTAL RESULTS
      1. Dataset Preprocessing
      2. Experiment 1
      3. Experiment 2
    10. CONCLUSION
    11. FUTURE RESEARCH DIRECTIONS
    12. ACKNOWLEDGMENT
    13. REFERENCES
      1. Additional Reading
  12. VIII. A Multi-Agent Neural Network System for Web Text Mining
    1. ABSTRACT
    2. INTRODUCTION
    3. THE FRAMEWORK OF THE BPNN-BASED INTELLIGENT WEB TEXT MINING
      1. The Framework of BPNN-based Intelligent Web Text Mining
      2. The Main Processes of the BPNN-Based Web Text Mining System
        1. Web Document Search
        2. Web Text Processing
        3. Word Division Processing
        4. Text Feature Representation
        5. Typical Feature Selection
      3. Feature Vector Conversion
      4. BPNN-Based Learning Mechanism
      5. The Limitation of Single BPNN Agent-Based Web Text Mining
    4. MULTI-AGENT BASED WEB TEXT MINING SYSTEM
      1. The Structure of Multi-Agent Based Web Text Mining System
      2. The Implementation of Multi-Agent Web Text Mining System
    5. EXPERIMENT STUDY
      1. Data Description and Experiment Design
      2. Experimental Results
    6. CONCLUSION
    7. FUTURE RESEARCH DIRECTIONS
    8. ACKNOWLEDGMENT
    9. REFERENCES
      1. Additional Reading
  13. IX. Contextualized Clustering in Exploratory Web Search
    1. ABSTRACT
    2. INTRODUCTION
    3. SEARCH AND CLUSTERING
      1. Clustering Search Results
      2. Traditional Clustering Approaches
        1. Hierarchical Clustering
        2. K-Means Clustering
        3. Buckshot and Fractionation Algorithm
      3. Suffix Tree Clustering
        1. Cleaning Documents
        2. Identifying Base Clusters
        3. Combining Base Clusters
      4. Document Snippets
      5. Related Work on Search Result Clustering
    4. ONLINE CLUSTERING IN HOBSearch
      1. Stemming Snippets
      2. Removing Stopwords in Snippets
      3. Labeling Clusters
      4. Ranking Clusters
    5. DISCUSSION
      1. Evaluation Method
      2. Overall Evaluation Results
      3. Clustering Performance
      4. Base Cluster Similarity
      5. Stemming and Tagging Snippets
      6. Label Overlap
      7. Additional Sources for Cluster Generation
    6. TEXT MINING IN WEB SEARCH
    7. CONCLUSION
    8. FUTURE RESEARCH DIRECTIONS
    9. REFERENCES
      1. Additional Reading
  14. X. AntWeb—Web Search Based on Ant Behavior: Approach and Implementation in Case of Interlegis
    1. ABSTRACT
    2. INTRODUCTION
    3. INTERLEGIS: INTEGRATION AND PARTICIPATION FOR BRAZILIAN LEGISLATIVE SOCIETY
    4. ANTWEB'S APPROACH
      1. Basic Theory of Antweb
      2. Goodness: The Mean Value of Access to Page
      3. Related Processes and Algorithms of Antweb
        1. Model to Search More Than One Target Page
        2. Identification of the Target Pages
        3. The Algorithm for Updating Pheromone
        4. The Adaptive Process to Present the Pages
    5. IMPLEMENTATION OF AntWeb IN INTERLEGIS PORTAL
      1. Architecture of Antweb
        1. Defining the Category Subset of Visitors
        2. Classifying a Visitor to a Category
        3. Upgrade the Pheromone for the Identified Visitor
        4. The System Returns the Page Adapted for the User and the Cycle is Initiated Again
      2. Database of Antweb
      3. The Pheromone Updating Module
      4. The Adaptation Page Module
      5. Off-Line Web Mining Module and Online Web Mining Module
    6. CASE STUDY OF ANTWEB IN INTERLEGIS
      1. Off-Line Web Mining
      2. Visiting Simulation Using Antweb
      3. Simulation with the Modification of the Parameters
        1. Simulation Without AntWeb
        2. Simulation With AntWeb
    7. CONCLUSION
    8. FUTURE RESEARCH DIRECTIONS
    9. ACKNOWLEDGMENT
    10. REFERENCES
    11. Additional Reading
  15. XI. Conceptual Clustering of Textual Documents and Some Insights for Knowledge Discovery
    1. ABSTRACT
    2. INTRODUCTION
    3. BACKGROUND
      1. Pattern Representation
      2. Pattern Proximity
      3. Clustering or Grouping
      4. Data Abstraction
      5. Assessment of Output
    4. CONCEPTUAL CLUSTERING OF TEXTUAL DOCUMENTS
      1. Pattern Representation
      2. Pattern Proximity
      3. Clustering Algorithm
      4. Case Study
    5. FUTURE TRENDS
    6. CONCLUSION
    7. FUTURE RESEARCH DIRECTIONS
    8. REFERENCES
    9. Additional Reading
    10. ENDNOTES
  16. XII. A Hierarchical Online Classifier for Patent Categorization
    1. ABSTRACT
    2. INTRODUCTION
    3. TYPICAL SCENARIOS FOR PATENT CLASSIFICATION TASKS
      1. Preclassification
      2. Patent Categorization in Small Offices
      3. Eventual Inventors
      4. Patent Information Providers
      5. Evaluation of PC Tasks
    4. STATE-OF-THE-ART OF PATENT CATEGORIZATION
    5. HIERARCHICAL ONLINE CLASSIFIER
      1. Notation
      2. Schema of Online Classifiers
      3. The Proposed Algorithm of HITEC
        1. Taxonomy Driven Architecture
        2. Calculation of Relevance Score and Weight Updated Schema
        3. Relaxed Greedy Algorithm for Category Activation
        4. Training with Primary and Secondary Categories
        5. Summary of the Algorithm
    6. EXPERIMENTS
      1. Corpora of Patent Applications
      2. Performance Measures
      3. Details of Implementation
        1. Feature Weights
        2. Dimensionality Reduction
      4. Results
        1. WIPO-Alpha Corpus
        2. Espace A/B Corpus
        3. Time Requirement and Availability
    7. CONCLUSION
    8. FUTURE RESEARCH DIRECTIONS
    9. ACKNOWLEDGMENT
    10. REFERENCES
    11. Additional Reading
    12. ENDNOTES
  17. XIII. Text Mining to Define a Validated Model of Hospital Rankings
    1. ABSTRACT
    2. INTRODUCTION
    3. BACKGROUND INTO PATIENT SEVERITY INDICES
      1. Problems with Terminology Definitions
      2. Modeling the Quality of Healthcare Providers
      3. Patient Condition Coding
      4. Compressing Large Numbers of Categorical Variables
      5. Standard Model to Define Patient Severity
      6. Examining Lack of Uniformity in Data Entry
    4. SOLUTIONS USING TEXT ANALYSIS
      1. Expectation Maximization Clustering of Patient Condition Codes
      2. Use of Concept Links
    5. PREDICTIVE MODELING WITH TEXT CLUSTERS
      1. Problem of Outliers in Health Care Provider Reimbursements
      2. Predictive Modeling Using Text Clusters for Outlier Reimbursements
    6. FUTURE RESEARCH DIRECTIONS
    7. REFERENCES
    8. Additional Reading
      1. Text and Data Mining
      2. Definition of Health Care Quality
  18. XIV. An Interpretation Process for Clustering Analysis Based on the Ontology of Language
    1. ABSTRACT
    2. INTRODUCTION
    3. ONTOLOGY OF LANGUAGE
    4. THE KNOWLEDGE BASED ON THE OBSERVER
    5. MENTAL MODELS AND GENERATION OF KNOWLEDGE
      1. Mental Models and Previous Knowledge
      2. Mental Models and Learning
      3. Origin of Mental Models
    6. FUNDAMENTAL LINGUISTIC ACTS
      1. Affirmations
      2. Statements
      3. Assessments
    7. PROCESS OF SUBSTANTIATING ASSESSMENTS
      1. Model of Validation of Assessments
    8. ACTION COORDINATION MODEL
      1. Commitments and Conversation
      2. Commitment Cycle: Its Types, Steps, and Phases
    9. CLUSTERING ANALYSIS MODEL
      1. Decision Making and Creation of Knowledge in Clustering Analysis
      2. Phases of the Clustering Analysis Process
    10. CYCLE OF ACTION COORDINATION IN CLUSTERING ANALYSIS
    11. RESULTS INTERPRETATION IN CLUSTERING ANALYSIS
      1. Model of Assessment Validation in CA
      2. Description of the Assessment Validation Phases
    12. CONCLUSION
    13. FUTURE RESEARCH DIRECTIONS
    14. REFERENCES
    15. Additional Reading
  19. Compilation of References
  20. About the Contributors