Using the distributed copy (DistCp)

Distributed copy (DistCp) is a Hadoop utility used to copy data in parallel within and between clusters. It uses Hadoop's MapReduce to perform the copy operation. DistCp is the most widely used data transfer tool in Hadoop clusters. For example:

$ hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest

The preceding command would copy the src folder and all its contents from the cluster managed by namenode1 to the cluster managed by namenode2 as the dest folder. DistCp, by default, does not overwrite the files at the target location and skips copying them if the files already exists. However, files can be forced to be overwritten using the overwrite flag.

There are several options that can be used along with ...

Get Cloudera Administration Handbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.