  • Somasundaram Sekar thinks this is interesting:

The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the...

From Hadoop: The Definitive Guide, 4th Edition

Note

Question: If a client that is not itself a datanode reads the file, might it receive a replica rather than the copy that was originally ingested? Suppose a 1 GB file was stored as 128 MB blocks. When the client reads it, it gets block addresses chosen by proximity to the datanodes. Does that mean the blocks are not guaranteed to come back in the order in which they were ingested?
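
One way to reason about the question is to separate two decisions in the read path quoted above: which block comes next, and which copy of that block to read. The following is a minimal sketch in plain Python, not Hadoop's actual implementation. It assumes that blocks are identified by their byte offset in the file and that proximity only decides which replica of a given block is contacted. The names `BlockLocation`, `split_into_blocks`, `closest_replica`, and `plan_read`, as well as the datanode names and distance values, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size mentioned in the note


@dataclass
class BlockLocation:
    """Hypothetical stand-in for the per-block metadata a client receives."""
    offset: int          # byte offset of the block within the file
    length: int          # number of bytes stored in the block
    replicas: List[str]  # illustrative datanode names holding a copy


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the file from start to end."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def closest_replica(replicas: List[str], distance: Dict[str, int]) -> str:
    """Pick the replica with the smallest (assumed) network distance."""
    return min(replicas, key=lambda node: distance[node])


def plan_read(locations: List[BlockLocation], distance: Dict[str, int]) -> List[Tuple[int, str]]:
    """Order blocks by file offset, then choose the nearest replica per block.

    Proximity changes which datanode serves a block; it does not change
    the position of that block in the file.
    """
    plan = []
    for loc in sorted(locations, key=lambda b: b.offset):
        plan.append((loc.offset, closest_replica(loc.replicas, distance)))
    return plan


if __name__ == "__main__":
    one_gb = 1024 * 1024 * 1024
    # Illustrative distances: smaller means closer to the client.
    distance = {"dn1": 4, "dn2": 0, "dn3": 2}
    locations = [
        BlockLocation(offset, length, ["dn1", "dn2", "dn3"])
        for offset, length in split_into_blocks(one_gb)
    ]
    for offset, node in plan_read(locations, distance):
        print(f"block at offset {offset} read from {node}")
```

Under these assumptions, the 1 GB file splits into eight 128 MB blocks and the plan always walks them in ascending offset order, whichever replica ends up serving each one. Since replicas of a block hold the same bytes, reading from a nearer copy would not alter the content the client sees.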