Preparing Data and Deploying Models
In many cases, data science teams prepare data from a variety of sources before modeling.
This data might originate from relational databases, flat files, or structured and unstructured sources in Hadoop. A single workflow can connect to all of these sources, aggregate and cleanse the data into a final consolidated representation, and then move the result using Copy operators to the desired analytics sandbox, such as a folder in HDFS.
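As a rough sketch of such a consolidation step, the PySpark job below reads from the three kinds of sources mentioned above, joins and cleanses them, and lands the result in an HDFS sandbox folder. The connection URL, table names, join key, and paths are all placeholders, not part of the original workflow.

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox_refresh").getOrCreate()

# Relational source read over JDBC (URL, table, and credentials are placeholders)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com/sales")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", os.environ["DB_PASSWORD"])
    .load()
)

# Flat-file source
customers = spark.read.csv("/landing/customers.csv", header=True, inferSchema=True)

# Semi-structured source already stored in Hadoop
events = spark.read.json("hdfs:///raw/clickstream/")

# Aggregate and cleanse into a single consolidated representation
consolidated = (
    orders.join(customers, "customer_id")
    .join(events, "customer_id", "left")
    .dropDuplicates()
    .na.drop(subset=["customer_id"])
)

# Land the result in the analytics sandbox: a folder in HDFS
consolidated.write.mode("overwrite").parquet("hdfs:///sandbox/consolidated/")
```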
Teams typically operationalize these flows using the job scheduler, periodically updating the analytics sandbox with a cleansed and aggregated version of the latest live data. The same jobs can include subsequent modeling flows that retrain models as soon as the new data are available, as in the sketch below.
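The scheduling mechanism depends on the platform in use; as one possible illustration, an Apache Airflow DAG could refresh the sandbox nightly and kick off retraining only after the fresh data lands. The DAG id, script names, and paths here are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical nightly job: refresh the sandbox, then retrain on the new data
with DAG(
    dag_id="sandbox_refresh_and_retrain",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    refresh_sandbox = BashOperator(
        task_id="refresh_sandbox",
        bash_command="spark-submit refresh_sandbox.py",
    )
    retrain_model = BashOperator(
        task_id="retrain_model",
        bash_command="spark-submit train_model.py hdfs:///sandbox/consolidated/",
    )
    # Model training runs only once the refreshed data is available
    refresh_sandbox >> retrain_model
```

Ordering the two tasks this way guarantees that models are never trained against a half-written sandbox: the retraining step starts only after the refresh task succeeds.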