Time for action – getting web server data into Hadoop

Let's take a look at the simplest way to copy data from a web server onto HDFS.

  1. Retrieve the text of the NameNode web interface to a local file:
    $ curl localhost:50070 > web.txt
    
  2. Check the file size:
    $ ls -ldh web.txt 
    

    You will receive the following response:

    -rw-r--r-- 1 hadoop hadoop 246 Aug 19 08:53 web.txt
    
  3. Copy the file to HDFS:
    $ hadoop fs -put web.txt web.txt
    
  4. Check the file on HDFS (a quick check that the two copies match follows after these steps):
    $ hadoop fs -ls 
    

    You will receive the following response:

    Found 1 items
    -rw-r--r--   1 hadoop supergroup        246 2012-08-19 08:53 /user/hadoop/web.txt
    
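    As an optional sanity check that is not part of the steps above, you can confirm that the HDFS copy matches the local file by streaming it back and comparing the two. This assumes the same web.txt names used in the previous steps:

    $ hadoop fs -cat web.txt | diff - web.txt && echo "Copies are identical"
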

What just happened?

There shouldn't be anything surprising here. We use the curl utility to retrieve a web page from the embedded web server hosting the NameNode web interface, and then use the hadoop fs -put command to copy the resulting file onto HDFS, where the fs -ls listing confirms it has arrived.
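
If you want to avoid the intermediate local file altogether, the put command can also read from standard input when the source is given as -. The following one-liner is a sketch that assumes the same NameNode web interface address used in step 1; web_direct.txt is just an illustrative destination name:

    $ curl -s localhost:50070 | hadoop fs -put - web_direct.txt

This keeps the data flowing straight from the web server into HDFS, at the cost of losing the local copy you could otherwise inspect or retry from.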
