Example of a Sizing Calculation
Consider a scenario where customer purchase details are stored in the purchase table. The table holds around 5,000,000 rows with the following schema:
| Field Name | Data Type | Estimated Disk Space Consumed (bytes) |
|---|---|---|
| customer_id (Primary Index) | Long | 8 |
| purchase_id | Long | 8 |
| customer_first_name | String | 10 |
| customer_last_name | String | 10 |
| customer_post_code | Long | 8 |
| payload | String | 10K |
Size of Rows without Secondary Indexes
Size of a Row (in bytes) = 8 + 8 + 10 + 10 + 8 + 10K = 10,044
Estimated Internal Overhead + Primary Index (Long) Overhead per Row = (32 + 27) = 59 bytes
Size of Row Including Overhead = 10,103
Size of All Rows with No Secondary Indexes = 5M x 10,103 = 50.5 GB
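The row-size arithmetic above can be sketched as a short calculation. This is only an illustrative sketch; the field sizes and the 59-byte per-row overhead are the estimates from the table and the lines above.

```python
# Illustrative sketch of the row-size estimate above (all sizes are estimates in bytes).
field_sizes = {
    "customer_id": 8,           # Long, primary index
    "purchase_id": 8,           # Long
    "customer_first_name": 10,  # String
    "customer_last_name": 10,   # String
    "customer_post_code": 8,    # Long
    "payload": 10_000,          # ~10K String
}

row_count = 5_000_000
overhead_per_row = 32 + 27      # internal overhead + primary index (Long) overhead = 59 bytes

row_size = sum(field_sizes.values())                    # 10,044 bytes
row_size_with_overhead = row_size + overhead_per_row    # 10,103 bytes
all_rows_bytes = row_count * row_size_with_overhead     # ~50.5 GB

print(f"All rows, no secondary indexes: {all_rows_bytes / 1e9:.1f} GB")
```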
Size of Rows with Secondary Indexes
Index Overhead per Row = 45 bytes (might vary depending on the actual values being indexed)
purchase_id_idx = 5M x (45 + 8) = 0.27 GB
customer_full_name_idx = 5M x (45 + 10 + 10) = 0.33 GB
Size of Secondary Indexes = 0.6 GB
Total Size (All Rows + Secondary Indexes) = 51.1 GB
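The same kind of sketch covers the secondary indexes and the combined total; the 45-byte index overhead is the estimate used in this example.

```python
# Illustrative sketch of the secondary-index estimate above (sizes in bytes).
row_count = 5_000_000
index_overhead = 45                      # estimated overhead per indexed row

purchase_id_idx = row_count * (index_overhead + 8)                # ~0.27 GB (Long key)
customer_full_name_idx = row_count * (index_overhead + 10 + 10)   # ~0.33 GB (two String keys)
secondary_indexes = purchase_id_idx + customer_full_name_idx      # ~0.6 GB

all_rows = row_count * 10_103            # from the previous step, ~50.5 GB
total = all_rows + secondary_indexes     # ~51.1 GB

print(f"Secondary indexes: {secondary_indexes / 1e9:.1f} GB")
print(f"Total (all rows + secondary indexes): {total / 1e9:.1f} GB")
```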
Additional Factors That Affect Sizing
The purchase table with 5,000,000 rows is estimated to occupy 51.1 GB on disk. You must provision additional disk space to account for compactions and other internal activity. Depending on the configuration options, you might need 10% to 100% additional space on disk. The amount depends on the trade-off you choose: better write performance by not using compaction and providing more disk space, or less disk space by using compaction. The different compaction levels can reduce the actual disk space required by a typical 10K XML string, at the cost of some performance. For better performance, apply the 100% Free Space Factor from the Excel spreadsheet, which results in an estimated 51.1 GB of additional free disk space.
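A small sketch of the free-space provisioning described above; the 10%, 50%, and 100% factors shown are simply sample points from the 10% to 100% range mentioned.

```python
# Illustrative sketch: provisioned disk = data size plus a Free Space Factor (10%-100%).
total_data_gb = 51.1

for free_space_factor in (0.10, 0.50, 1.00):
    extra_gb = total_data_gb * free_space_factor
    print(f"Free Space Factor {free_space_factor:.0%}: "
          f"{extra_gb:.1f} GB additional, {total_data_gb + extra_gb:.1f} GB provisioned")
```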
Determining the Allocation of Physical Computers to the Nodes
When mapping the required disk space to node processes running on the hardware, a general guideline is to start with two copysets. Each copyset owns half the data (approximately 26 GB), and each copyset is configured with at least two nodes for redundancy, where each node holds the full amount of data owned by that copyset. As a result, each copyset contains two copies of the data for redundancy.
Here is how this would map to four node processes (where each node process would be run on a separate computer):
Copyset1
Node = 26 GB
Node = 26 GB
Copyset2
Node = 26 GB
Node = 26 GB
Apply a multiplier (the Free Space Factor times the node data size) to ensure that there is enough free disk space. A general guideline is to double each of these numbers, resulting in approximately 52 GB on each node.
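The per-node figures above can be sketched as follows; the two-copyset layout and the 100% Free Space Factor are the values used in this example.

```python
import math

# Illustrative sketch of the per-node sizing above.
total_data_gb = 51.1
copysets = 2
free_space_factor = 1.0                  # 100%: double the per-node data size

data_per_node_gb = math.ceil(total_data_gb / copysets)          # ~26 GB; each node holds its copyset's full data
disk_per_node_gb = data_per_node_gb * (1 + free_space_factor)   # ~52 GB including free space

print(f"Data per node: ~{data_per_node_gb} GB, disk per node: ~{disk_per_node_gb:.0f} GB")
```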
However, there is no strict guideline for the amount of RAM needed based on the amount of data; any available RAM is used for caching. In this case, given that nearly all the data must be able to fit in RAM on each node, you can opt for 32 GB or 64 GB of RAM on each node.
Similarly, you can opt for an SSD large enough to hold the data and the free space, such as a 256 GB SSD on each computer.
Remember that every node must hold all of the data owned by its copyset. After determining the size of a node, the next step is to decide how many nodes you want in a copyset. The number of nodes depends on how many replicas of the data you want to maintain and how many copysets you want in the data grid.
After you estimate the number of copysets and the number of replicas of the data, you can use the following formula to determine the total number of nodes in the data grid:
Total number of nodes in the data grid = Number of copysets * number of replicas in a copyset
However, depending on the application's read access patterns, you can provision additional replicas to optimize read performance.
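As a quick check of the formula, using the two copysets and two replicas from this example:

```python
# Illustrative sketch of the node-count formula above.
copysets = 2
replicas_per_copyset = 2

total_nodes = copysets * replicas_per_copyset    # 4 node processes in this example
print(f"Total number of nodes in the data grid: {total_nodes}")
```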