Principles of Data Movement
Good data movement design can be summarized in two principles:
• Move each piece of data over the network as few times as possible, preferably just once.
The less data is moved, the less time is spent moving it. But the many layers of abstraction in modern computer systems can hide data movement, making bottlenecks harder to see. Network file systems are a good example: the code that reads a file looks the same whether the file is on the local disk or on a remote server, but the performance difference can be significant.
• Move data as early as possible, preferably before the computation starts.
Doing so improves the measured performance of the computation because the stopwatch that times the computation starts after the data movement has already occurred. But this is more than a mere accounting trick. Consider a nightly report that must run after 5 PM to avoid conflicting with daytime Services. If the data for the report is available at 4 PM, it can be distributed to Engines in the hour before the report runs, so the report starts at 5 PM with its data already in place.
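Both principles can be illustrated with a minimal Python sketch. Every name in it (CountingSource, LocalCache, run_report) is hypothetical and invented for this illustration; none of them belongs to any product API. The "network" is simulated by a short sleep and a transfer counter. The cache ensures that each key crosses the network at most once, and the prefetch call moves the data before the timed computation begins.

```python
import time


class CountingSource:
    """Hypothetical stand-in for a remote data source; counts transfers."""

    def __init__(self, data, delay=0.01):
        self._data = dict(data)
        self._delay = delay      # simulated network latency, in seconds
        self.transfers = 0       # number of times data crossed the "network"

    def fetch(self, key):
        time.sleep(self._delay)
        self.transfers += 1
        return self._data[key]


class LocalCache:
    """Keeps a local copy so each key is fetched at most once (principle 1)."""

    def __init__(self, source):
        self._source = source
        self._local = {}

    def get(self, key):
        if key not in self._local:
            self._local[key] = self._source.fetch(key)
        return self._local[key]

    def prefetch(self, keys):
        """Move data early, before the computation starts (principle 2)."""
        for key in keys:
            self.get(key)


def run_report(cache, keys):
    """The timed computation: sums the values and returns (total, seconds)."""
    start = time.perf_counter()
    total = sum(cache.get(key) for key in keys)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    source = CountingSource({"a": 1, "b": 2, "c": 3})
    cache = LocalCache(source)
    cache.prefetch(["a", "b", "c"])   # happens before the stopwatch starts
    total, elapsed = run_report(cache, ["a", "b", "c", "a"])
    print(total, source.transfers, elapsed)
```

In this sketch the report reads key "a" twice, yet the source records only three transfers, all of them made during the prefetch. The elapsed time returned by run_report therefore covers only local lookups.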