Example of a Sizing Calculation

Consider a scenario where the purchasing details of a customer are stored in the purchase table.

There are around five million rows in this table with the following schema:
Name of the Field               Data Type   Estimated Disk Space Consumed (in bytes)
customer_id (Primary Index)     Long        8
purchase_id                     Long        8
customer_first_name             String      10
customer_last_name              String      10
customer_post_code              Long        8
payload                         String      10K

Size of Rows Without Secondary Indexes

Size of Row (in bytes) = 8 + 8 + 10 + 10 + 8 + 10,000 = 10,044

Estimated Internal Overhead + Primary Index (Long) Overhead per Row = (32 + 27) = 59 bytes

Size of Row Including Overhead = 10,103

Size of All Rows with No Secondary Indexes = 5M x 10,103 = 50.5GB
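
As a rough illustration, the row-size arithmetic above can be expressed as a short Python sketch (the language, variable names, and constants are purely illustrative; the per-field sizes and overheads are the estimates from this example):

    # Estimated per-field sizes in bytes, from the schema table above.
    FIELD_SIZES = {
        "customer_id": 8,           # Long (primary index)
        "purchase_id": 8,           # Long
        "customer_first_name": 10,  # String
        "customer_last_name": 10,   # String
        "customer_post_code": 8,    # Long
        "payload": 10_000,          # 10K string
    }

    ROW_COUNT = 5_000_000
    INTERNAL_OVERHEAD = 32       # estimated internal overhead per row
    PRIMARY_INDEX_OVERHEAD = 27  # overhead per row for a Long primary index

    row_size = sum(FIELD_SIZES.values())                                            # 10,044 bytes
    row_size_with_overhead = row_size + INTERNAL_OVERHEAD + PRIMARY_INDEX_OVERHEAD  # 10,103 bytes
    all_rows_bytes = ROW_COUNT * row_size_with_overhead                             # ~50.5 GB

    print(f"Row size: {row_size} bytes")
    print(f"Row size with overhead: {row_size_with_overhead} bytes")
    print(f"All rows, no secondary indexes: {all_rows_bytes / 1e9:.1f} GB")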

Size of Rows with Secondary Indexes

Index Overhead per Row = 45 bytes (might vary depending on actual values being indexed)

purchase_id_idx = 5M x (45 + 8) = 0.27GB

customer_full_name_idx = 5M x (45 + 10 + 10) = 0.33GB

Size of Secondary Indexes = 0.6GB

Total Size In Bytes (All Rows + Secondary Indexes) = 51.1GB
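
The secondary-index arithmetic can be sketched in the same way (again a Python sketch for illustration only; the 45-byte overhead is the estimate quoted above and varies in practice):

    ROW_COUNT = 5_000_000
    INDEX_OVERHEAD = 45  # estimated index overhead per row; varies with the values being indexed

    # purchase_id_idx indexes one Long field (8 bytes).
    purchase_id_idx_bytes = ROW_COUNT * (INDEX_OVERHEAD + 8)

    # customer_full_name_idx indexes two String fields (10 + 10 bytes).
    customer_full_name_idx_bytes = ROW_COUNT * (INDEX_OVERHEAD + 10 + 10)

    secondary_index_bytes = purchase_id_idx_bytes + customer_full_name_idx_bytes
    all_rows_bytes = ROW_COUNT * 10_103  # size of all rows, from the previous step
    total_bytes = all_rows_bytes + secondary_index_bytes

    print(f"purchase_id_idx: {purchase_id_idx_bytes / 1e9:.3f} GB")                # 0.265 GB
    print(f"customer_full_name_idx: {customer_full_name_idx_bytes / 1e9:.3f} GB")  # 0.325 GB
    print(f"Secondary indexes: {secondary_index_bytes / 1e9:.1f} GB")              # ~0.6 GB
    print(f"Total (all rows + secondary indexes): {total_bytes / 1e9:.1f} GB")     # ~51.1 GB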

Additional Factors That Affect Sizing

The purchase table with five million rows is estimated to occupy 51.1GB on disk. You must provision additional disk space to account for compactions and other internal activity. Depending on the configuration options, you might need 10% to 100% additional space on disk. The amount of additional disk space depends on whether you opt for better write performance (by not using compaction and providing more disk space) or for a smaller disk footprint (by using compaction). The different compaction levels can be used to reduce the actual disk space requirements of a typical 10K XML string, at the cost of some performance. For better performance, apply the 100% Free Space Factor from the Excel spreadsheet, which results in an estimated 51.1GB of additional free disk space needed.
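
To see how the Free Space Factor translates into provisioned disk, here is a minimal Python sketch (illustrative only; the 10% to 100% range is the one quoted above, and the actual factor depends on your compaction configuration):

    total_gb = 51.1  # estimated on-disk size of the purchase table (rows + secondary indexes)

    # Lower factors assume compaction is used; a factor of 1.0 (100%) assumes no
    # compaction and favors write performance at the cost of more disk space.
    for free_space_factor in (0.10, 0.50, 1.00):
        extra_gb = total_gb * free_space_factor
        print(f"Free Space Factor {free_space_factor:.0%}: "
              f"{extra_gb:.1f} GB extra, {total_gb + extra_gb:.1f} GB total")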

Determining the Allocation of Physical Machines to the Nodes

When you map the disk space needed to the node processes running on the hardware, a general guideline is to start with two copysets. Each copyset would own half the data (approximately 26GB), and each copyset would be configured with at least two nodes for redundancy, where each node must hold the full amount of data owned by that copyset. As a result, each copyset holds two copies of the data for redundancy.

Here is how this would map to four node processes (where each node process would be run on a separate machine):

Copyset1
    Node = 26GB
    Node = 26GB

Copyset2
    Node = 26GB
    Node = 26GB

Adding a multiplier (Free Space Factor times the node data size) brings each node to approximately 52GB.

Apply the multiplier to ensure that there is enough free disk space. A general guideline is to double all of these numbers, giving approximately 52GB on each node.
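
The copyset and node arithmetic can be sketched the same way (a Python illustration using the copyset count and the 100% Free Space Factor chosen in this example):

    import math

    total_gb = 51.1          # rows + secondary indexes, from the sizing above
    num_copysets = 2         # each copyset owns half the data
    free_space_factor = 1.0  # 100% Free Space Factor: double the space for compactions

    # Each copyset owns an equal share of the data, rounded up to whole gigabytes.
    data_per_copyset_gb = math.ceil(total_gb / num_copysets)       # ~26 GB
    # Every node in a copyset holds the full amount of data owned by that copyset.
    data_per_node_gb = data_per_copyset_gb
    disk_per_node_gb = data_per_node_gb * (1 + free_space_factor)  # ~52 GB

    print(f"Data owned per copyset: {data_per_copyset_gb} GB")
    print(f"Disk to provision per node: {disk_per_node_gb:.0f} GB")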

However, there is no strict guideline for the amount of RAM needed based on the amount of data. Any available RAM is used for caching. In this case, if you want nearly all the data to be able to fit in RAM on each node, you could opt for 32GB or 64GB of RAM on each of the nodes.

Similarly, you could opt for an SSD that can accommodate the data and the free space, such as a 256GB SSD on each machine.

Remember that every node must be capable of holding the amount of data held by the copyset. After determining the size of a node, the next step is to decide how many nodes you want in a copyset. The number of nodes depends on how many replicas of data you want to maintain and how many copysets you want to have in the data grid.

After you have an estimate of the number of copysets and the number of replicas of data, you can use the following formula to determine the total number of nodes in the data grid:

Total number of nodes in the data grid = Number of copysets * Number of replicas in a copyset
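
For instance, the four-node layout above follows directly from this formula; here is the arithmetic as a minimal Python sketch (illustrative only):

    num_copysets = 2          # copysets in this example
    replicas_per_copyset = 2  # nodes (replicas) per copyset for redundancy

    total_nodes = num_copysets * replicas_per_copyset
    print(f"Total number of nodes in the data grid: {total_nodes}")  # 4
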
Note: ActiveSpaces 4.0 does not require each node to have as much RAM as the full amount of data it holds. However, you can provision nodes that way for optimal read performance, depending on the read access patterns of the application.