IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content

Book Description

With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content and empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics gives enterprises the tools to identify new revenue opportunities, improve customer satisfaction, and detect problems early.

To help you get the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to decommission obsolete and unnecessary content in a timely manner while preserving and using content that has business value.

The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center in configuring and performing content analysis with Content Analytics.

Table of Contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. The team who wrote this book
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Summary of changes
    1. May 2011, Second Edition
  5. Chapter 1. Overview of IBM Content Analytics
    1. 1.1 Business need and the Content Analytics solution
      1. 1.1.1 Business need and problem statement
      2. 1.1.2 The Content Analytics solution
    2. 1.2 History, changes, and what’s new
      1. 1.2.1 Product history
      2. 1.2.2 Product changes
      3. 1.2.3 What’s new in Content Analytics Version 2.2
    3. 1.3 Important concepts and terminology
      1. 1.3.1 Unstructured and structured content
      2. 1.3.2 Text analytics
      3. 1.3.3 Search, discovery, and data mining
      4. 1.3.4 Collections
      5. 1.3.5 Facets
      6. 1.3.6 Frequency
      7. 1.3.7 Correlation
      8. 1.3.8 Deviation
    4. 1.4 Content Analytics architecture
      1. 1.4.1 Main components
      2. 1.4.2 Data flow
      3. 1.4.3 Scalability
      4. 1.4.4 Security
  6. Chapter 2. Application design and preparation
    1. 2.1 Use-case scenarios
      1. 2.1.1 Call center
      2. 2.1.2 Insurance fraud
      3. 2.1.3 Quality assurance
      4. 2.1.4 Content assessment
    2. 2.2 Data considerations
      1. 2.2.1 Content Analytics data model
      2. 2.2.2 Structured and unstructured sources
      3. 2.2.3 Multiple data sources
      4. 2.2.4 Date-sensitive data
      5. 2.2.5 Extracting information from textual data
      6. 2.2.6 The number of collections to use
    3. 2.3 Design guide for building a text analytics collection
      1. 2.3.1 Building a text analytics collection
      2. 2.3.2 A walk through the building process
      3. 2.3.3 Planning for iteration
    4. 2.4 Programming interfaces
      1. 2.4.1 Search and Index API
      2. 2.4.2 REST API
  7. Chapter 3. Understanding content analysis
    1. 3.1 Basic concepts of Content Analytics
      1. 3.1.1 Manual versus automated analysis
      2. 3.1.2 Frequency versus deviation
      3. 3.1.3 Precision versus recall
    2. 3.2 Typical cycle of analysis with Content Analytics
      1. 3.2.1 Setting the objectives of the analysis
      2. 3.2.2 Gathering data
      3. 3.2.3 Analyzing data
      4. 3.2.4 Taking action based on the analysis
      5. 3.2.5 Validating the effect
    3. 3.3 Successful use cases
      1. 3.3.1 Voice of customer
      2. 3.3.2 Analysis of other data
    4. 3.4 Summary
  8. Chapter 4. Installing and configuring IBM Content Analytics
    1. 4.1 Installing Content Analytics
      1. 4.1.1 Process overview
      2. 4.1.2 Confirming the system requirements and supported data sources
      3. 4.1.3 Determining the installation server type and procedure consideration
      4. 4.1.4 Parameters used during the installation
      5. 4.1.5 Installing the agent server
      6. 4.1.6 Installing Content Analytics on a single server
      7. 4.1.7 Running the First Steps program to verify the installation
      8. 4.1.8 Starting the Text Analytics tutorial
    2. 4.2 Administering Content Analytics
      1. 4.2.1 Starting the system
      2. 4.2.2 Accessing the administration console
      3. 4.2.3 Stopping the server
    3. 4.3 Configuring a text analytics collection
      1. 4.3.1 Designing a sample collection
      2. 4.3.2 Creating a text analytics collection
      3. 4.3.3 Defining and configuring a crawler
      4. 4.3.4 Building an index in the text analytics collection
    4. 4.4 Verifying that the collection is available
      1. 4.4.1 Starting the search server for the text analytics collection
      2. 4.4.2 Accessing the collection with the text miner application
    5. 4.5 Deploying the configuration
      1. 4.5.1 Using the esadmin export and import commands
      2. 4.5.2 Using the esbackup and esrestore commands
      3. 4.5.3 Usage guidelines for these commands
  9. Chapter 5. Text miner application: Basic features
    1. 5.1 Overview of the text miner application
      1. 5.1.1 Accessing the text miner application
      2. 5.1.2 Application window layout and functional overview
      3. 5.1.3 Selecting a collection for analysis
      4. 5.1.4 Changing the default behavior by using preferences
    2. 5.2 Search and discovery features
      1. 5.2.1 Limiting the scope of your analysis using facets
      2. 5.2.2 Limiting the scope of your analysis using search operators
      3. 5.2.3 Limiting the scope of your analysis using dates
      4. 5.2.4 Query syntax
      5. 5.2.5 Type ahead
      6. 5.2.6 Saved searches
      7. 5.2.7 Advanced search
    3. 5.3 Query tree
      1. 5.3.1 Accessing the query tree
      2. 5.3.2 Understanding the query tree
      3. 5.3.3 Query tree examples
      4. 5.3.4 Editing the query tree
    4. 5.4 Query builder
      1. 5.4.1 Accessing the query builder
      2. 5.4.2 Features of the Query Builder window
      3. 5.4.3 Using the query builder
      4. 5.4.4 Preferred practice for using the query builder and the query tree
    5. 5.5 Rule-based categories with a query
      1. 5.5.1 Enabling the rule-based categories feature
      2. 5.5.2 Configuring rules for rule-based categories
      3. 5.5.3 Configuring rule-based categories
      4. 5.5.4 Adding the current query as a category rule
    6. 5.6 Common view features
    7. 5.7 Document flagging
      1. 5.7.1 Configuring document flags
      2. 5.7.2 Setting document flags
      3. 5.7.3 Viewing the document values of a flag facet
  10. Chapter 6. Text miner application: Views
    1. 6.1 Views
    2. 6.2 Documents view
      1. 6.2.1 Understanding the Documents view
      2. 6.2.2 Viewing the document contents and facets
      3. 6.2.3 When to use the Documents view
    3. 6.3 Facets view
      1. 6.3.1 Understanding the Facets view
      2. 6.3.2 Using the Facets view
    4. 6.4 Time Series view
      1. 6.4.1 Features in the Time Series view
      2. 6.4.2 Understanding the Time Series view
      3. 6.4.3 Using the Time Series view
    5. 6.5 Trends view
      1. 6.5.1 Features in the Trends view
      2. 6.5.2 Sort criteria
      3. 6.5.3 Understanding the Trends view
      4. 6.5.4 When to use the Trends view
    6. 6.6 Deviations view
      1. 6.6.1 Features in the Deviations view
      2. 6.6.2 Understanding the Deviations view
      3. 6.6.3 Using the Deviations view
    7. 6.7 Facet Pairs view
      1. 6.7.1 Table view
      2. 6.7.2 Grid view
      3. 6.7.3 Bird’s eye view
      4. 6.7.4 Understanding the Facet Pairs view with correlation values
      5. 6.7.5 Using the Facet Pairs view
    8. 6.8 Connections view
      1. 6.8.1 Features in the Connections view
      2. 6.8.2 Understanding the Connections view
      3. 6.8.3 When to use the Connections view
    9. 6.9 Dashboard view
      1. 6.9.1 Configuring the Dashboard
      2. 6.9.2 Viewing the Dashboard
      3. 6.9.3 Working with the Dashboard
      4. 6.9.4 Saving Dashboard charts as images
  11. Chapter 7. Performing content analysis
    1. 7.1 Discovering actionable insight with the text miner application
      1. 7.1.1 The sample data
      2. 7.1.2 Insights without customization
      3. 7.1.3 Considerations about what you want to discover from the data
    2. 7.2 Content analysis scenarios
      1. 7.2.1 Scenario 1: Using a custom dictionary to discover package-related calls
      2. 7.2.2 Scenario 2: Using custom text analysis rules to discover trouble-related calls
      3. 7.2.3 Scenario 3: Discovering the cause of increasing calls
      4. 7.2.4 Conclusion
    3. 7.3 Configuring the Dictionary Lookup annotator
      1. 7.3.1 When to use the Dictionary Lookup annotator
      2. 7.3.2 Configuring custom user dictionaries
      3. 7.3.3 Migrating the Content Analyzer dictionaries
      4. 7.3.4 Validation and maintenance
    4. 7.4 Configuring the Pattern Matcher annotator
      1. 7.4.1 When to use the Pattern Matcher annotator
      2. 7.4.2 Configuring custom text analysis rules
      3. 7.4.3 Migrating the Content Analyzer rules
      4. 7.4.4 Designing the custom text analysis rules
      5. 7.4.5 Validation and maintenance
    5. 7.5 Preferred practices
  12. Chapter 8. Discovering insight with terms of interest and document clustering
    1. 8.1 The power of dictionary-driven analytics
      1. 8.1.1 Multiple viewpoints for analyzing the same data
    2. 8.2 Terms of interest
      1. 8.2.1 Basic algorithm for identifying terms of interest
      2. 8.2.2 Limitations in using automatic identification of terms of interest
      3. 8.2.3 Preferred use of terms of interest identified automatically
      4. 8.2.4 Efficient and effective creation of dictionary
    3. 8.3 Document clustering
      1. 8.3.1 Setting up document cluster
      2. 8.3.2 Creating a cluster proposal
      3. 8.3.3 Refining the cluster results
      4. 8.3.4 Deploying clusters to a category
      5. 8.3.5 Working with the cluster results
      6. 8.3.6 Creating and deploying the clustering resource
      7. 8.3.7 Preferred practices
  13. Chapter 9. Content analysis with IBM Classification Module
    1. 9.1 The Classification Module annotator
      1. 9.1.1 When to use the Classification Module annotator
      2. 9.1.2 The Classification Module technology
    2. 9.2 Fine-tuning your analysis with the Classification Module annotator
      1. 9.2.1 Building your collection
      2. 9.2.2 Refining the analysis
      3. 9.2.3 Using a conceptual search for advanced content discovery
    3. 9.3 Creating and deploying the Classification Module resource
      1. 9.3.1 Starting the Classification Module server
      2. 9.3.2 Creating and training the knowledge bases
      3. 9.3.3 Creating a decision plan
      4. 9.3.4 Deploying the knowledge base and decision plan
      5. 9.3.5 Configuring the Classification Module annotator
    4. 9.4 Validation and maintenance of the Classification Module annotator
      1. 9.4.1 Using the Classification Module sample programs
      2. 9.4.2 Classification Module annotator validation techniques
    5. 9.5 Preferred practices
  14. Chapter 10. Importing CSV files, exporting data, and performing deep inspection
    1. 10.1 Importing CSV files
    2. 10.2 Overview of exporting documents and data
      1. 10.2.1 Crawled documents
      2. 10.2.2 Analyzed documents
      3. 10.2.3 Search result documents
      4. 10.2.4 Exported data manifest
    3. 10.3 Location and format of the exported data
      1. 10.3.1 Location of the exported data
      2. 10.3.2 Metadata format
      3. 10.3.3 Binary content format
      4. 10.3.4 Common Analysis Structure format
      5. 10.3.5 Extracted text format
    4. 10.4 Common configuration of the export feature
      1. 10.4.1 Document URI pattern
      2. 10.4.2 Exporting XML attributes and preserving file extensions
      3. 10.4.3 Adding exported documents to the index
      4. 10.4.4 Exporting information about deleted documents
      5. 10.4.5 Scheduling
    5. 10.5 Monitoring export requests
    6. 10.6 Enabling export and sample configurations
      1. 10.6.1 Exporting crawled documents to a file system for Content Collector
      2. 10.6.2 Exporting analyzed documents to a relational database
      3. 10.6.3 Exporting search result documents to the file system for Classification Module
      4. 10.6.4 Exporting search result documents to CSV files
    7. 10.7 Deep inspection
      1. 10.7.1 Location and format of the exported data
      2. 10.7.2 Common configuration
      3. 10.7.3 Enabling deep inspection
      4. 10.7.4 Generating deep inspection reports
      5. 10.7.5 Optional: Scheduling a deep inspection run
      6. 10.7.6 Monitoring the deep inspection requests
      7. 10.7.7 Validating the deep inspection reports generation
    8. 10.8 Creating and deploying a custom plug-in
  15. Chapter 11. Configuring annotators
    1. 11.1 Document processing pipeline and the annotators
      1. 11.1.1 UIMA document processing pipeline
      2. 11.1.2 Language Identification annotator
      3. 11.1.3 Linguistic Analysis annotator
      4. 11.1.4 Named Entity Recognition annotator
      5. 11.1.5 Dictionary Lookup and Pattern Matcher annotators
      6. 11.1.6 Classification Module annotator
    2. 11.2 Custom annotators
      1. 11.2.1 LanguageWare Resource Workbench
      2. 11.2.2 Creating custom annotators using the Apache UIMA SDK
    3. 11.3 Validation
      1. 11.3.1 Real-time NLP
      2. 11.3.2 Advanced techniques
      3. 11.3.3 Summary of validation techniques
  16. Chapter 12. IBM Content Assessment scenario
    1. 12.1 Content Assessment offering
      1. 12.1.1 Concepts and terminology
    2. 12.2 Overview of Content Assessment
      1. 12.2.1 Content decommissioning scenario
      2. 12.2.2 Dynamically analyzing and collecting your content
    3. 12.3 Content Assessment workflow
      1. 12.3.1 Decommissioning content
      2. 12.3.2 Performing dynamic analysis
      3. 12.3.3 Preserving and using business data
    4. 12.4 Records management and email archiving
    5. 12.5 Preferred practices
    6. 12.6 Summary
  17. Chapter 13. Integrating Cognos Business Intelligence
    1. 13.1 Initial setup
      1. 13.1.1 Running the esrepcog command
      2. 13.1.2 Configuring default application user roles
      3. 13.1.3 Configuring database connectivity
    2. 13.2 Generating Cognos BI reports
    3. 13.3 Creating custom Cognos 8 BI reports
      1. 13.3.1 Configuring export options
      2. 13.3.2 Exporting search results
      3. 13.3.3 Loading the exported data model into Cognos
  18. Chapter 14. Customizing and extending the text miner application
    1. 14.1 Customizing the text miner application
      1. 14.1.1 Analytics Customizer
      2. 14.1.2 Modifying the URI link in the Documents view
    2. 14.2 Reasons for extending the text miner application
    3. 14.3 Sample plug-ins for text miner views
    4. 14.4 Customizing the sample text miner plug-in
      1. 14.4.1 Changing the view tab title
      2. 14.4.2 Customizing the plug-in template HTML file
      3. 14.4.3 Customizing the javascript widget
      4. 14.4.4 Updating the style sheet for the plug-in
    5. 14.5 Testing the customized plug-in
  19. Chapter 15. Performance tips
    1. 15.1 General performance guidelines
      1. 15.1.1 Factors that influence the performance of the system
      2. 15.1.2 Variables
    2. 15.2 Tuning the crawler component
      1. 15.2.1 Increasing active crawler threads
      2. 15.2.2 Setting the maximum heap size
    3. 15.3 Tuning the document processor
      1. 15.3.1 Setting the number of document processor threads
      2. 15.3.2 Increasing the maximum heap size
    4. 15.4 Tuning the indexer
      1. 15.4.1 Setting the number of indexer threads
      2. 15.4.2 Specifying the taxonomy cache type
      3. 15.4.3 Increasing the maximum heap size
      4. 15.4.4 Increasing the buffer size
      5. 15.4.5 Increasing the index commit interval
    5. 15.5 Enhancing the search performance
      1. 15.5.1 Increasing the search result cache entries
      2. 15.5.2 Enabling the optional facet index
      3. 15.5.3 Increasing the maximum heap size
      4. 15.5.4 Setting the number of threads for rebuilding optional facet index
    6. 15.6 Scalability
    7. 15.7 Monitoring the system
      1. 15.7.1 General guideline for monitoring the system
      2. 15.7.2 Using the esadmin command utility
      3. 15.7.3 General guidelines for monitoring the operating system
  20. Chapter 16. Hints and tips for troubleshooting
    1. 16.1 Overview of troubleshooting
    2. 16.2 General troubleshooting guidelines
      1. 16.2.1 Understanding the problem
      2. 16.2.2 Understanding the environment
    3. 16.3 Working with the logs in Content Analytics
      1. 16.3.1 The location of the logs
      2. 16.3.2 Understanding the log contents
      3. 16.3.3 The first log to examine
    4. 16.4 Installation and administration-related troubleshooting
    5. 16.5 Text miner application-related troubleshooting
    6. 16.6 Data processing flow-related troubleshooting
    7. 16.7 Export-related troubleshooting
    8. 16.8 Classification Module server-related troubleshooting tips
    9. 16.9 Reporting a problem to the IBM Software Support
    10. 16.10 Advanced troubleshooting topics
      1. 16.10.1 Changing the log level of the system and collection logs
      2. 16.10.2 Generating a javacore or a heapdump for a Java session
  21. Appendix A. Security in IBM Content Analytics
    1. The security concept in Content Analytics
    2. Enabling login security in the embedded application server (Jetty)
    3. Configuring application user roles
    4. Limiting user access to the text analytics collection
  22. Related publications
    1. IBM Redbooks
    2. Other publications
    3. Online resources
    4. Help from IBM
  23. Back cover