Example of a Sizing Calculation
Consider a scenario where customer purchase details are stored in the purchase table. The table holds around 5,000,000 rows with the following schema:
| Field Name | Data Type | Estimated Disk Space Consumed (bytes) |
|---|---|---|
| customer_id (Primary Index) | Long | 8 |
| purchase_id | Long | 8 |
| customer_first_name | String | 10 |
| customer_last_name | String | 10 |
| customer_post_code | Long | 8 |
| payload | String | 10K |
Size of Rows without Secondary Indexes
Size of a Row (in bytes) = 8 + 8 + 10 + 10 + 8 + 10K = 10,044
Estimated Internal Overhead + Primary Index (Long) Overhead per Row = (32 + 27) = 59 bytes
Size of Row Including Overhead = 10,103
Size of All Rows with No Secondary Indexes = 5M x 10,103 = 50.5 GB
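The row-size arithmetic above can be sketched as a short calculation. This is only an illustrative sketch; the field sizes and the 59-byte per-row overhead are the estimates from the table and the lines above.

```python
# Illustrative sketch of the row-size estimate above (all sizes are estimates in bytes).
field_sizes = {
    "customer_id": 8,           # Long, primary index
    "purchase_id": 8,           # Long
    "customer_first_name": 10,  # String
    "customer_last_name": 10,   # String
    "customer_post_code": 8,    # Long
    "payload": 10_000,          # ~10K String
}

row_count = 5_000_000
overhead_per_row = 32 + 27      # internal overhead + primary index (Long) overhead = 59 bytes

row_size = sum(field_sizes.values())                    # 10,044 bytes
row_size_with_overhead = row_size + overhead_per_row    # 10,103 bytes
all_rows_bytes = row_count * row_size_with_overhead     # ~50.5 GB

print(f"All rows, no secondary indexes: {all_rows_bytes / 1e9:.1f} GB")
```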
Size of Rows with Secondary Indexes
Index Overhead per Row = 45 bytes (might vary depending on the actual values being indexed)
purchase_id_idx = 5M x (45 + 8) = 0.27 GB
customer_full_name_idx = 5M x (45 + 10 + 10) = 0.33 GB
Size of Secondary Indexes = 0.6 GB
Total Size (All Rows + Secondary Indexes) = 51.1 GB
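The same kind of sketch covers the secondary indexes and the combined total; the 45-byte index overhead is the estimate used in this example.

```python
# Illustrative sketch of the secondary-index estimate above (sizes in bytes).
row_count = 5_000_000
index_overhead = 45                      # estimated overhead per indexed row

purchase_id_idx = row_count * (index_overhead + 8)                # ~0.27 GB (Long key)
customer_full_name_idx = row_count * (index_overhead + 10 + 10)   # ~0.33 GB (two String keys)
secondary_indexes = purchase_id_idx + customer_full_name_idx      # ~0.6 GB

all_rows = row_count * 10_103            # from the previous step, ~50.5 GB
total = all_rows + secondary_indexes     # ~51.1 GB

print(f"Secondary indexes: {secondary_indexes / 1e9:.1f} GB")
print(f"Total (all rows + secondary indexes): {total / 1e9:.1f} GB")
```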
Additional Factors That Affect Sizing
The purchase table with 5,000,000 rows is estimated to occupy 51.1 GB on disk. You must provision additional disk space to account for compactions and other internal activity. Depending on the configuration options, you might need 10% to 100% additional space on disk. The amount depends on the trade-off you choose: better write performance by not using compaction and providing more disk space, or less disk space by using compaction. The different compaction levels can reduce the actual disk space required by a typical 10K XML string, at the cost of some performance. For better performance, apply the 100% Free Space Factor from the Excel spreadsheet, which results in an estimated 51.1 GB of additional free disk space.
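A small sketch of the free-space provisioning described above; the 10%, 50%, and 100% factors shown are simply sample points from the 10% to 100% range mentioned.

```python
# Illustrative sketch: provisioned disk = data size plus a Free Space Factor (10%-100%).
total_data_gb = 51.1

for free_space_factor in (0.10, 0.50, 1.00):
    extra_gb = total_data_gb * free_space_factor
    print(f"Free Space Factor {free_space_factor:.0%}: "
          f"{extra_gb:.1f} GB additional, {total_data_gb + extra_gb:.1f} GB provisioned")
```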
Determining the Allocation of Physical Computers to the Nodes
When mapping the required disk space to node processes running on the hardware, a general guideline is to start with two copysets. Each copyset owns half the data (approximately 26 GB), and each copyset is configured with at least two nodes for redundancy, where each node holds the full amount of data owned by that copyset. As a result, each copyset contains two copies of the data for redundancy.
Here is how this would map to four node processes (where each node process would be run on a separate computer):
Copyset1
Node = 26 GB
Node = 26 GB
Copyset2
Node = 26 GB
Node = 26 GB
Apply a multiplier (the Free Space Factor times the node data size) to ensure that there is enough free disk space. A general guideline is to double each of these numbers, resulting in approximately 52 GB on each node.
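The per-node figures above can be sketched as follows; the two-copyset layout and the 100% Free Space Factor are the values used in this example.

```python
import math

# Illustrative sketch of the per-node sizing above.
total_data_gb = 51.1
copysets = 2
free_space_factor = 1.0                  # 100%: double the per-node data size

data_per_node_gb = math.ceil(total_data_gb / copysets)          # ~26 GB; each node holds its copyset's full data
disk_per_node_gb = data_per_node_gb * (1 + free_space_factor)   # ~52 GB including free space

print(f"Data per node: ~{data_per_node_gb} GB, disk per node: ~{disk_per_node_gb:.0f} GB")
```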
However, there is no strict guideline for the amount of RAM needed based on the amount of data; any available RAM is used for caching. In this case, given that nearly all the data must be able to fit in RAM on each node, you can opt for 32 GB or 64 GB of RAM on each node.
Similarly, you can opt for an SSD large enough to hold the data and the free space, such as a 256 GB SSD on each computer.
Remember that every node must hold all of the data owned by its copyset. After determining the size of a node, the next step is to decide how many nodes you want in a copyset. The number of nodes depends on how many replicas of the data you want to maintain and how many copysets you want in the data grid.
After you estimate the number of copysets and the number of replicas of the data, you can use the following formula to determine the total number of nodes in the data grid:
Total number of nodes in the data grid = Number of copysets * number of replicas in a copyset
However, depending on the application's read access patterns, you can provision additional replicas to optimize read performance.
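As a quick check of the formula, using the two copysets and two replicas from this example:

```python
# Illustrative sketch of the node-count formula above.
copysets = 2
replicas_per_copyset = 2

total_nodes = copysets * replicas_per_copyset    # 4 node processes in this example
print(f"Total number of nodes in the data grid: {total_nodes}")
```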