Example of a Sizing Calculation
Consider a scenario where the purchasing details of a customer are stored in the purchase table.
Size of Rows Without Secondary Indexes
Size of Row (in bytes) = 8 + 8 + 10 + 10 + 8 + 10K = 10,044
Estimated Internal Overhead + Primary Index (Long) Overhead per Row = (32 + 27) = 59 bytes
Size of Row Including Overhead = 10,103
Size of All Rows with No Secondary Indexes = 5M x 10,103 = 50.5GB
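As a quick check of this arithmetic, here is a minimal Python sketch of the row-size calculation. It assumes the 10K XML string occupies 10,000 bytes and uses the field widths and overhead constants given above:

# Sizing sketch for the purchase table, using the figures above.
FIELD_BYTES = 8 + 8 + 10 + 10 + 8 + 10_000   # = 10,044 (10K string assumed to be 10,000 bytes)
OVERHEAD_BYTES = 32 + 27                     # internal + primary index (Long) overhead = 59
ROW_BYTES = FIELD_BYTES + OVERHEAD_BYTES     # = 10,103

TOTAL_ROWS = 5_000_000
no_index_bytes = TOTAL_ROWS * ROW_BYTES      # = 50,515,000,000
print(f"All rows, no secondary indexes: {no_index_bytes / 1e9:.1f}GB")   # ~50.5GB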
Size of Rows with Secondary Indexes
Index Overhead per Row = 45 bytes (might vary depending on actual values being indexed)
purchase_id_idx = 5M x (45 + 8) = 0.27GB
customer_full_name_idx = 5M x (45 + 10 + 10) = 0.33GB
Size of Secondary Indexes = 0.6GB
Total Size in Bytes (All Rows + Secondary Indexes) = 51.1GB
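The secondary index sizing can be sketched the same way; the 45-byte per-row overhead and the key widths are the assumed values from the calculation above:

# Secondary index sizing, using the per-row index overhead above.
INDEX_OVERHEAD = 45                 # bytes per row; varies with the values being indexed
TOTAL_ROWS = 5_000_000

purchase_id_idx = TOTAL_ROWS * (INDEX_OVERHEAD + 8)        # 8-byte key          -> 265,000,000 bytes (~0.27GB)
full_name_idx = TOTAL_ROWS * (INDEX_OVERHEAD + 10 + 10)    # two 10-byte strings -> 325,000,000 bytes (~0.33GB)
index_bytes = purchase_id_idx + full_name_idx              # ~0.6GB

no_index_bytes = TOTAL_ROWS * 10_103                       # ~50.5GB, from the previous step
total_bytes = no_index_bytes + index_bytes
print(f"Rows + secondary indexes: {total_bytes / 1e9:.1f}GB")   # ~51.1GB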
Additional Factors That Affect Sizing
The purchase table with five million rows would be estimated to occupy 51.1GB on disk. You must provision additional disk space to account for compactions and other internal activity. Depending on the configuration options, you might need 10% to 100% additional space on disk. The amount depends on whether you opt for better write performance by not using compaction (at the cost of more disk space), or reduce the amount of disk space required by using compaction (at the cost of some write performance). The different compaction levels can reduce the actual disk space required for a typical 10K XML string, with a corresponding reduction in performance. For better performance, apply the 100% Free Space Factor from the Excel spreadsheet, which results in an estimated 51.1GB of additional free disk space needed.
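A short sketch of how the Free Space Factor might be applied, assuming the factor is expressed as a fraction of the data size (1.0 for the 100% case recommended above):

# Applying a Free Space Factor to the on-disk estimate.
data_bytes = 51_105_000_000        # rows plus secondary indexes, from above
free_space_factor = 1.0            # 10% (0.1) to 100% (1.0), depending on configuration

extra_bytes = data_bytes * free_space_factor       # ~51.1GB of additional free space at 100%
provisioned_bytes = data_bytes + extra_bytes
print(f"Provision at least {provisioned_bytes / 1e9:.1f}GB of disk")   # ~102.2GB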
Determining the Allocation of Physical Machines to the Nodes
When mapping the disk space needed to node processes running on the hardware, a general guideline is to start with two copysets. Each copyset would own half the data (approximately 26GB) and would be configured with at least two nodes, where each node holds the full amount of data owned by that copyset. As a result, each copyset maintains two copies of its data for redundancy.
Here is how this would map to four node processes (where each node process would be run on a separate machine):
Copyset1
  Node = 26GB
  Node = 26GB
Copyset2
  Node = 26GB
  Node = 26GB
Apply the Free Space Factor multiplier (Free Space Factor times node data size) to ensure that there is enough free disk space. With the 100% factor, a general guideline is to double these numbers, giving approximately 52GB on each node.
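The copyset layout above can be summarized in a small Python sketch; the doubling for free space follows the 100% Free Space Factor guideline:

# Mapping data to copysets and nodes, per the guideline above.
total_gb = 51.1                          # rows plus secondary indexes
copysets = 2
replicas = 2                             # nodes per copyset, for redundancy

per_copyset_gb = total_gb / copysets     # 25.55, i.e. approximately 26GB owned by each copyset
node_disk_gb = 2 * round(per_copyset_gb) # doubled for free space -> ~52GB of disk per node

for c in range(1, copysets + 1):
    for _ in range(replicas):
        print(f"Copyset{c}: node holds {round(per_copyset_gb)}GB of data, needs {node_disk_gb}GB of disk")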
However, there is no strict guideline for the amount of RAM needed based on the amount of data. Any available RAM is used for caching. In this case, given that nearly all the data should be able to fit in RAM on each node, you could opt for 32GB or 64GB of RAM on each node. Similarly, you could opt for an SSD large enough to hold the data plus free space, such as a 256GB SSD on each machine.
Remember that every node must be capable of holding the amount of data held by the copyset. After determining the size of a node, the next step is to decide how many nodes you want in a copyset. The number of nodes depends on how many replicas of data you want to maintain and how many copysets you want to have in the data grid.
After you have an estimate of the number of copysets and the number of replicas of data, you can use the following formula to determine the total number of nodes in the data grid:
Total number of nodes in the data grid = Number of copysets * number of replicas in a copyset
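Expressed as code, with the example values from this section (two copysets, two replicas each):

# Node-count formula with the example values from this section.
def total_nodes(copysets: int, replicas_per_copyset: int) -> int:
    # Total node processes needed in the data grid.
    return copysets * replicas_per_copyset

print(total_nodes(copysets=2, replicas_per_copyset=2))   # 4 nodes, matching the example above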