Preparing Data and Deploying Models

In many cases, data science teams prepare data from a variety of sources before modeling.

This data might originate from relational databases, flat files, or structured and unstructured data in Hadoop. A single workflow can connect to all of these sources, aggregate and cleanse the data into a final consolidated representation, and then move the result with Copy operators to the desired analytics sandbox, such as a folder in HDFS.
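As an illustration only, the sketch below expresses such a flow in PySpark rather than the workflow tool described here; the JDBC connection details, table names, and HDFS paths are hypothetical placeholders.

```python
# A minimal sketch, assuming PySpark; all connection details,
# table names, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare_sandbox").getOrCreate()

# Read from a relational database over JDBC (credentials assumed).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "***")
          .load())

# Read a flat file and semi-structured data already stored in Hadoop.
customers = spark.read.csv("hdfs:///raw/customers.csv",
                           header=True, inferSchema=True)
events = spark.read.json("hdfs:///raw/web_events/")

# Cleanse and aggregate into a single consolidated representation.
consolidated = (orders.join(customers, "customer_id")
                .join(events, "customer_id", "left")
                .dropDuplicates(["order_id"])
                .na.drop(subset=["customer_id"]))

# Move the result into the analytics sandbox folder in HDFS.
consolidated.write.mode("overwrite").parquet("hdfs:///sandbox/consolidated/")
```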

Teams typically operationalize these flows using the job scheduler, periodically updating the analytics sandbox with a cleansed and aggregated version of the latest live data. The same jobs can contain subsequent modeling flows that update trained models as soon as the new data are available.
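The text does not name a particular scheduler, so as one way to picture this pattern, the sketch below uses Apache Airflow (2.x); the `refresh_sandbox` and `retrain_model` helpers are hypothetical stand-ins for the data-preparation and modeling flows.

```python
# A minimal sketch of scheduling a refresh-then-retrain job,
# assuming Apache Airflow 2.x; the two callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def refresh_sandbox():
    """Re-run the aggregation/cleansing flow against the latest live data."""
    ...  # e.g., submit the data-preparation job sketched earlier

def retrain_model():
    """Retrain the model once the refreshed sandbox data are available."""
    ...  # e.g., fit and persist an updated model

with DAG(
    dag_id="sandbox_refresh_and_retrain",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # periodic update of the analytics sandbox
    catchup=False,
) as dag:
    refresh = PythonOperator(task_id="refresh_sandbox",
                             python_callable=refresh_sandbox)
    retrain = PythonOperator(task_id="retrain_model",
                             python_callable=retrain_model)

    # Chaining the tasks ensures the model is retrained only after
    # the sandbox has been refreshed with new data.
    refresh >> retrain
```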