This is probably the resource with the most room for variance. Several factors help determine the optimal disk size:
- Anticipated size of a single copy of the dataset
- Replication Factor (RF)
- Operational throughput requirements
- Cost of cloud volumes (usually per hour)
- Compaction strategy used on the larger tables
- Whether the size of the dataset will be static, or grow over time
- Whether the application team has an archival strategy
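To see how several of these factors interact, here is a minimal back-of-the-envelope sketch. The function name, its parameters, and the 50% utilization ceiling are all assumptions for illustration rather than figures from this text. The ceiling stands in for the free space that compaction needs, and the right value depends on the compaction strategy in use. Cost, throughput, and archival are left out.

```python
def per_node_disk_gb(dataset_gb, replication_factor, node_count,
                     annual_growth=0.0, years=0, max_disk_utilization=0.5):
    """Rough per-node disk estimate in GB (hypothetical helper).

    dataset_gb           -- size of a single copy of the dataset
    replication_factor   -- number of copies kept across the cluster (RF)
    node_count           -- nodes sharing the replicated data evenly
    annual_growth        -- assumed yearly growth rate (0.0 for a static dataset)
    years                -- planning horizon in years
    max_disk_utilization -- assumed fraction of the disk allowed to fill,
                            leaving the rest as compaction headroom
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 < max_disk_utilization <= 1:
        raise ValueError("max_disk_utilization must be in (0, 1]")

    # Project a single copy of the dataset forward over the planning horizon.
    projected_gb = dataset_gb * (1 + annual_growth) ** years
    # Every copy takes space somewhere in the cluster.
    replicated_gb = projected_gb * replication_factor
    # Assume data is spread evenly across nodes.
    data_per_node_gb = replicated_gb / node_count
    # Size the disk so the data fills only the allowed fraction of it.
    return data_per_node_gb / max_disk_utilization


if __name__ == "__main__":
    # Example inputs are illustrative only.
    print(per_node_disk_gb(1000, 3, 6))
    print(per_node_disk_gb(1000, 3, 6, annual_growth=0.5, years=2))
```

With a static dataset, the estimate changes only with RF, node count, and the headroom you reserve. With growth, the planning horizon quickly becomes the dominant input, which is why an archival strategy matters.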
Let's walk through a little exercise here.
Assume that we need to build a cluster for an application ...