In Chapter 6, we discussed a number of Kettle features for Web-based extraction. In this chapter, we take a closer look at the Kettle features for working with websites and web services, and at how to handle the data formats they typically use to exchange data.
Before we dive into the details, we explain how web services fit into the context of ETL and data integration, and discuss which concepts and techniques are involved. We also provide an overview of a few data formats that are commonly used to exchange data over the Web. In the remainder of this chapter, we illustrate a few typical scenarios for using Kettle with web services.
The majority of what we typically refer to as "the Web" consists of web pages. Web pages are essentially documents, intended primarily for a human audience, that can be retrieved using the Hypertext Transfer Protocol (HTTP) and are typically coded in HTML. Through a web browser application, users connect to a server that is part of a network (the Internet), which allows them to retrieve web pages from different computers in that network (sites). The pages are coded in some form of hypertext, which simply means that they contain conveniently navigable links to other, related web pages.
HTTP defines how a request from a client (such as a web browser) is transferred over the Internet to finally reach a host that is able to send a response containing a resource. ...
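To make the request/response cycle concrete, the following is a minimal sketch in Python rather than in Kettle itself. It is only an illustration under stated assumptions: the handler class `HelloHandler`, the function `fetch_once`, the path `/index.html`, and the tiny HTML page are all hypothetical names and data invented for this example. The sketch starts a throwaway host on the local machine, sends it a single GET request, and returns the status code, content type, and body of the response.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical host: answers every GET request with a small HTML page."""

    def do_GET(self):
        # The resource sent back to the client: hypertext with one link.
        body = b"<html><body><a href='/next.html'>Next page</a></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def fetch_once():
    """Start a local host, send one GET request, and return the response parts."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The client side: open a connection and send the request.
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/index.html")
        response = conn.getresponse()
        status = response.status
        content_type = response.getheader("Content-Type")
        body = response.read().decode("utf-8")
        conn.close()
        return status, content_type, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, content_type, body = fetch_once()
    print(status, content_type)
    print(body)
```

A web browser performs essentially the same exchange and then renders the returned HTML; a data integration tool would instead parse the response body and pass the extracted values on for further processing.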