
Algorithms of the Intelligent Web

By Haralambos Marmanis and D. Babenko. Published by Manning Publications.
  1. Copyright
  2. Preface
  3. Acknowledgments
    1. H. Marmanis
    2. D. Babenko
  4. About this book
    1. Roadmap
    2. Who should read this book
    3. Code Conventions
    4. Author Online
    5. About the cover illustration
  5. 1. What is the intelligent web?
    1. 1.1. Examples of intelligent web applications
    2. 1.2. Basic elements of intelligent applications
    3. 1.3. What applications can benefit from intelligence?
      1. 1.3.1. Social networking sites
      2. 1.3.2. Mashups
      3. 1.3.3. Portals
      4. 1.3.4. Wikis
      5. 1.3.5. Media-sharing sites
      6. 1.3.6. Online gaming
    4. 1.4. How can I build intelligence in my own application?
      1. 1.4.1. Examine your functionality and your data
      2. 1.4.2. Get more data from the web
    5. 1.5. Machine learning, data mining, and all that
    6. 1.6. Eight fallacies of intelligent applications
      1. 1.6.1. Fallacy #1: Your data is reliable
      2. 1.6.2. Fallacy #2: Inference happens instantaneously
      3. 1.6.3. Fallacy #3: The size of data doesn't matter
      4. 1.6.4. Fallacy #4: Scalability of the solution isn't an issue
      5. 1.6.5. Fallacy #5: Apply the same good library everywhere
      6. 1.6.6. Fallacy #6: The computation time is known
      7. 1.6.7. Fallacy #7: Complicated models are better
      8. 1.6.8. Fallacy #8: There are models without bias
    7. 1.7. Summary
  6. 1.8. References
  7. 2. Searching
    1. 2.1. Searching with Lucene
      1. 2.1.1. Understanding the Lucene code
      2. 2.1.2. Understanding the basic stages of search
    2. 2.2. Why search beyond indexing?
    3. 2.3. Improving search results based on link analysis
      1. 2.3.1. An introduction to PageRank
      2. 2.3.2. Calculating the PageRank vector
      3. 2.3.3. alpha: The effect of teleportation between web pages
      4. 2.3.4. Understanding the power method
      5. 2.3.5. Combining the index scores and the PageRank scores
    4. 2.4. Improving search results based on user clicks
      1. 2.4.1. A first look at user clicks
      2. 2.4.2. Using the NaiveBayes classifier
      3. 2.4.3. Combining Lucene indexing, PageRank, and user clicks
    5. 2.5. Ranking Word, PDF, and other documents without links
      1. 2.5.1. An introduction to DocRank
      2. 2.5.2. The inner workings of DocRank
    6. 2.6. Large-scale implementation issues
    7. 2.7. Is what you got what you want? Precision and recall
    8. 2.8. Summary
    9. 2.9. To do
  8. 2.10. References
  9. 3. Creating suggestions and recommendations
    1. 3.1. An online music store: the basic concepts
      1. 3.1.1. The concepts of distance and similarity
      2. 3.1.2. A closer look at the calculation of similarity
      3. 3.1.3. Which is the best similarity formula?
    2. 3.2. How do recommendation engines work?
      1. 3.2.1. Recommendations based on similar users
      2. 3.2.2. Recommendations based on similar items
      3. 3.2.3. Recommendations based on content
    3. 3.3. Recommending friends, articles, and news stories
      1. 3.3.1. Introducing MyDiggSpace.com
      2. 3.3.2. Finding friends
      3. 3.3.3. The inner workings of DiggDelphi
    4. 3.4. Recommending movies on a site such as Netflix.com
      1. 3.4.1. An introduction to movie datasets and recommenders
      2. 3.4.2. Data normalization and correlation coefficients
    5. 3.5. Large-scale implementation and evaluation issues
    6. 3.6. Summary
    7. 3.7. To Do
  10. 3.8. References
  11. 4. Clustering: grouping things together
    1. 4.1. The need for clustering
      1. 4.1.1. User groups on a website: a case study
      2. 4.1.2. Finding groups with a SQL order by clause
      3. 4.1.3. Finding groups with array sorting
    2. 4.2. An overview of clustering algorithms
      1. 4.2.1. Clustering algorithms based on cluster structure
      2. 4.2.2. Clustering algorithms based on data type and structure
      3. 4.2.3. Clustering algorithms based on data size
    3. 4.3. Link-based algorithms
      1. 4.3.1. The dendrogram: a basic clustering data structure
      2. 4.3.2. A first look at link-based algorithms
      3. 4.3.3. The single-link algorithm
      4. 4.3.4. The average-link algorithm
      5. 4.3.5. The minimum-spanning-tree algorithm
    4. 4.4. The k-means algorithm
      1. 4.4.1. A first look at the k-means algorithm
      2. 4.4.2. The inner workings of k-means
    5. 4.5. Robust Clustering Using Links (ROCK)
      1. 4.5.1. Introducing ROCK
      2. 4.5.2. Why does ROCK rock?
    6. 4.6. DBSCAN
      1. 4.6.1. A first look at density-based algorithms
      2. 4.6.2. The inner workings of DBSCAN
    7. 4.7. Clustering issues in very large datasets
      1. 4.7.1. Computational complexity
      2. 4.7.2. High dimensionality
    8. 4.8. Summary
    9. 4.9. To Do
  12. 4.10. References
  13. 5. Classification: placing things where they belong
    1. 5.1. The need for classification
    2. 5.2. An overview of classifiers
      1. 5.2.1. Structural classification algorithms
      2. 5.2.2. Statistical classification algorithms
      3. 5.2.3. The lifecycle of a classifier
    3. 5.3. Automatic categorization of emails and spam filtering
      1. 5.3.1. NaïveBayes classification
      2. 5.3.2. Rule-based classification
    4. 5.4. Fraud detection with neural networks
      1. 5.4.1. A use case of fraud detection in transactional data
      2. 5.4.2. Neural networks overview
      3. 5.4.3. A neural network fraud detector at work
      4. 5.4.4. The anatomy of the fraud detector neural network
      5. 5.4.5. A base class for building general neural networks
    5. 5.5. Are your results credible?
    6. 5.6. Classification with very large datasets
    7. 5.7. Summary
    8. 5.8. To do
  14. 5.9. References
    1. Classification schemes
    2. Books and articles
  15. 6. Combining classifiers
    1. 6.1. Credit worthiness: a case study for combining classifiers
      1. 6.1.1. A brief description of the data
      2. 6.1.2. Generating artificial data for real problems
    2. 6.2. Credit evaluation with a single classifier
      1. 6.2.1. The naïve Bayes baseline
      2. 6.2.2. The decision tree baseline
      3. 6.2.3. The neural network baseline
    3. 6.3. Comparing multiple classifiers on the same data
      1. 6.3.1. McNemar's test
      2. 6.3.2. The difference of proportions test
      3. 6.3.3. Cochran's Q test and the F test
    4. 6.4. Bagging: bootstrap aggregating
      1. 6.4.1. The bagging classifier at work
      2. 6.4.2. A look under the hood of the bagging classifier
      3. 6.4.3. Classifier ensembles
    5. 6.5. Boosting: an iterative improvement approach
      1. 6.5.1. The boosting classifier at work
      2. 6.5.2. A look under the hood of the boosting classifier
    6. 6.6. Summary
    7. 6.7. To Do
  16. 6.8. References
  17. 7. Putting it all together: an intelligent news portal
    1. 7.1. An overview of the functionality
    2. 7.2. Getting and cleansing content
      1. 7.2.1. Get set. Get ready. Crawl the Web!
      2. 7.2.2. Review of the search prerequisites
      3. 7.2.3. A default set of retrieved and processed news stories
    3. 7.3. Searching for news stories
    4. 7.4. Assigning news categories
      1. 7.4.1. Order matters!
      2. 7.4.2. Classifying with the NewsProcessor class
      3. 7.4.3. Meet the classifier
      4. 7.4.4. Classification strategy: going beyond low-level assignments
    5. 7.5. Building news groups with the NewsProcessor class
      1. 7.5.1. Clustering general news stories
      2. 7.5.2. Clustering news stories within a news category
    6. 7.6. Dynamic content based on the user's ratings
    7. 7.7. Summary
    8. 7.8. To do
  18. 7.9. References
  19. A. Introduction to BeanShell
    1. A.1. What is BeanShell?
    2. A.2. Why use BeanShell?
    3. A.3. Running BeanShell
  20. A.4. References
  21. B. Web crawling
    1. B.1. An overview of crawler components
      1. B.1.1. The stages of crawling
      2. B.1.2. Our simple crawler
      3. B.1.3. Open source web crawlers
  22. B.2. References
  23. C. Mathematical refresher
    1. C.1. Vectors and matrices
    2. C.2. Measuring distances
    3. C.3. Advanced matrix methods
  24. C.4. References
  25. D. Natural language processing
  26. D.1. References
  27. E. Neural networks
  28. E.1. References

Appendix B. Web crawling

This appendix provides an overview of web crawling components, a brief description of the implementation details of the crawler provided with the book, and a short survey of open-source crawlers written in Java.

An overview of crawler components

Web crawlers are used to discover, download, and store content from the Web. As we've seen in chapter 2, a web crawler is just one part of a larger application, such as a search engine.

A typical web crawler has the following components (a minimal code sketch follows the list):

  • A repository module to keep track of all URLs known to the crawler.

  • A document download module that retrieves documents from the Web using a provided set of URLs.

  • A document parsing module that's responsible for extracting the raw content out of a variety of document ...

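To make the component breakdown concrete, here is a minimal, hypothetical sketch in Java (the language used throughout the book). It is not the crawler that ships with the book: the class name SimpleCrawler, the methods fetch and extractText, and the crude tag-stripping "parser" are illustrative assumptions only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleCrawler {

    // URL repository module: the frontier holds URLs waiting to be crawled,
    // and the visited set keeps track of all URLs already processed.
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    public void addSeed(String url) {
        frontier.add(url);
    }

    // Document download module: retrieves the raw document for a given URL.
    private String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Document parsing module: a crude tag-stripper stands in here for a
    // real parser that would handle HTML, PDF, Word, and other formats.
    private String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Crawl loop: take a URL from the repository, download and parse it;
    // a full crawler would also store the content and enqueue extracted links.
    public void crawl(int maxDocs) throws Exception {
        int count = 0;
        while (!frontier.isEmpty() && count < maxDocs) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed this URL
            }
            String raw = fetch(url);
            String text = extractText(raw);
            System.out.println(url + " -> " + text.length() + " characters of text");
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleCrawler crawler = new SimpleCrawler();
        crawler.addSeed("https://example.com/");
        crawler.crawl(1);
    }
}

Keeping the URL repository separate from the download and parsing steps mirrors the component breakdown above and makes it easy to swap in a persistent URL store or a real document parser later.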