Conclusion and Future Work

We described an approach to surfacing content from the Deep Web, thereby making that content accessible through search-engine queries. The most significant requirement of our system is that it be completely automatic (and hence scale to the Web) and that it retrieve content from any domain in any language. Interestingly, these stringent requirements pushed us toward a relatively simple and elegant solution, showing that simplicity is often the key to solving hard problems.

There are many directions for future work on surfacing the Deep Web. In particular, identifying certain recurring patterns in forms could broaden the coverage of our crawl. For example, pairs of fields are often related to each other (e.g., MinPrice and MaxPrice), and entering valid, carefully chosen pairs of values can result in surfacing more pages.
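
To make the MinPrice/MaxPrice example concrete, here is a minimal sketch of how valid pairs might be chosen. It is an illustration under stated assumptions, not a description of our system: the boundary values and the helper names (adjacent_pairs, all_valid_pairs, submission_params) are hypothetical. Filling the two fields independently would yield many invalid submissions (minimum above maximum), whereas pairing adjacent boundaries covers the whole range with non-overlapping queries.

```python
from itertools import combinations

def adjacent_pairs(boundaries):
    """Pair each boundary with the next one, so the ranges tile the interval."""
    ordered = sorted(set(boundaries))
    return list(zip(ordered, ordered[1:]))

def all_valid_pairs(boundaries):
    """Every (low, high) pair with low < high; invalid orderings never appear."""
    ordered = sorted(set(boundaries))
    return list(combinations(ordered, 2))

def submission_params(pairs, low_field="MinPrice", high_field="MaxPrice"):
    """Turn value pairs into form-submission parameters (field names assumed)."""
    return [{low_field: low, high_field: high} for low, high in pairs]

# Hypothetical candidate boundaries for a price range.
boundaries = [1000, 0, 500, 100]
params = submission_params(adjacent_pairs(boundaries))
```

With four boundaries, adjacent pairing needs only three submissions, compared with six for all valid pairs and sixteen for independent filling of the two fields.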
