Preparing Data and Deploying Models

In many cases, data science teams prepare data from a variety of sources before modeling.

This data might originate from relational databases, flat files, or structured and unstructured data in Hadoop. A single workflow can connect to all of these sources, aggregate and cleanse the data into a final consolidated representation, and then move the result with Copy operators to the desired analytics sandbox, such as a folder in HDFS.
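As an illustration only, the sketch below expresses such a flow in PySpark rather than the workflow tool described here; the JDBC connection details, table names, and HDFS paths are hypothetical placeholders.

```python
# A minimal sketch, assuming PySpark; all connection details,
# table names, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare_sandbox").getOrCreate()

# Read from a relational database over JDBC (credentials assumed).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "***")
          .load())

# Read a flat file and semi-structured data already stored in Hadoop.
customers = spark.read.csv("hdfs:///raw/customers.csv",
                           header=True, inferSchema=True)
events = spark.read.json("hdfs:///raw/web_events/")

# Cleanse and aggregate into a single consolidated representation.
consolidated = (orders.join(customers, "customer_id")
                .join(events, "customer_id", "left")
                .dropDuplicates(["order_id"])
                .na.drop(subset=["customer_id"]))

# Move the result into the analytics sandbox folder in HDFS.
consolidated.write.mode("overwrite").parquet("hdfs:///sandbox/consolidated/")
```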

Teams typically operationalize these flows using the job scheduler, periodically updating the analytics sandbox with a cleansed and aggregated version of the latest live data. The same jobs can contain subsequent modeling flows that update trained models as soon as the new data are available.
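The text does not name a particular scheduler, so as one way to picture this pattern, the sketch below uses Apache Airflow (2.x); the `refresh_sandbox` and `retrain_model` helpers are hypothetical stand-ins for the data-preparation and modeling flows.

```python
# A minimal sketch of scheduling a refresh-then-retrain job,
# assuming Apache Airflow 2.x; the two callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def refresh_sandbox():
    """Re-run the aggregation/cleansing flow against the latest live data."""
    ...  # e.g., submit the data-preparation job sketched earlier

def retrain_model():
    """Retrain the model once the refreshed sandbox data are available."""
    ...  # e.g., fit and persist an updated model

with DAG(
    dag_id="sandbox_refresh_and_retrain",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # periodic update of the analytics sandbox
    catchup=False,
) as dag:
    refresh = PythonOperator(task_id="refresh_sandbox",
                             python_callable=refresh_sandbox)
    retrain = PythonOperator(task_id="retrain_model",
                             python_callable=retrain_model)

    # Chaining the tasks ensures the model is retrained only after
    # the sandbox has been refreshed with new data.
    refresh >> retrain
```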