Workload selection

Depending on the size (CPU, memory, and cache) of your CDW environment, you must choose which workloads to migrate from Impala on the base cluster to Impala in CDW.

Business Intelligence workloads

If you have teams of data analysts using Business Intelligence (BI) tools and applications, then you can migrate their workloads to CDW Data Service on Private Cloud to benefit from data and result caching. If you have repeated BI workloads, see Guidelines for moving repeated BI workloads.

Pioneering user base

New features are developed and released in CDW first. If your user base needs newer technologies such as Iceberg, Unified Analytics, or data and result caching, then you can consider moving your workloads to CDW Data Service on Private Cloud.

Performance-demanding workloads

If your teams run large, complex queries that require a lot of memory, run for a long time, and impact other SLA-driven workloads, then you can move those queries off the Impala service on the base cluster and run them in dedicated Impala Virtual Warehouses to achieve compute isolation.
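As a minimal sketch of how you might shortlist such queries, the following Python example flags queries that are both memory-heavy and long-running. The QueryRecord fields and both thresholds are hypothetical and are not part of any CDW or Impala API. Populate the records from whatever query history you have available and tune the thresholds for your environment.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    # Hypothetical fields; fill them from your own query history export.
    query_id: str
    peak_memory_gb: float
    duration_s: float


# Assumed thresholds for illustration only; not documented CDW values.
MEMORY_THRESHOLD_GB = 64.0
DURATION_THRESHOLD_S = 900.0


def isolation_candidates(queries,
                         memory_gb=MEMORY_THRESHOLD_GB,
                         duration_s=DURATION_THRESHOLD_S):
    """Return the IDs of queries that exceed both thresholds."""
    return [
        q.query_id
        for q in queries
        if q.peak_memory_gb >= memory_gb and q.duration_s >= duration_s
    ]


history = [
    QueryRecord("q1", peak_memory_gb=120.0, duration_s=3600.0),
    QueryRecord("q2", peak_memory_gb=2.0, duration_s=5.0),
    QueryRecord("q3", peak_memory_gb=80.0, duration_s=60.0),
]
print(isolation_candidates(history))
```

Queries that the sketch returns are only candidates. Whether they actually impact SLA-driven workloads still has to be confirmed against your own monitoring data.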

Kudu workloads

If your workloads depend on streaming datasets that are inserted, updated, or deleted frequently, then you must configure Kudu on the base cluster. You can then use Impala to query Kudu on the base cluster. CDW also supports creating Impala tables in Kudu. For more information, see Configuring Impala Virtual Warehouses to create Impala tables in Kudu in Cloudera Data Warehouse Private Cloud.

Anti-pattern workloads

If your workloads scan extremely large datasets, then you must consider data locality before moving them from the Impala service on the base cluster to Impala in CDW Data Service, because scanning large datasets over the network can take time. If you plan to move such workloads to CDW, then ensure that the data can be cached for better performance. You can also increase the auto-suspend time for the Impala Virtual Warehouse so that the cache is retained for longer.
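A rough way to reason about whether the data can be cached is to compare the size of the frequently scanned data with the aggregate cache available to the Virtual Warehouse. The following Python sketch does only that arithmetic. The function name, its parameters, and the headroom factor are assumptions made for illustration and do not correspond to CDW settings.

```python
def fits_in_cache(working_set_gb, cache_per_executor_gb, executor_count,
                  headroom=0.8):
    """Rough check: does the hot dataset fit in the aggregate data cache?

    headroom is an assumed safety factor that leaves part of the cache
    free for other data; it is not a documented CDW setting.
    """
    if cache_per_executor_gb <= 0 or executor_count <= 0:
        raise ValueError("cache size and executor count must be positive")
    usable_cache_gb = cache_per_executor_gb * executor_count * headroom
    return working_set_gb <= usable_cache_gb


# Example with made-up numbers: 5 executors with 600 GB of cache each.
print(fits_in_cache(working_set_gb=2000, cache_per_executor_gb=600,
                    executor_count=5))
print(fits_in_cache(working_set_gb=5000, cache_per_executor_gb=600,
                    executor_count=5))
```

If the check fails, the workload is likely to keep reading data over the network, which is the situation this section advises against.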

When assessing the workloads, you must also consider tenant modeling and sizing, followed by performance testing. Performance testing helps confirm that queries return correct results and that performance is acceptable.
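A performance test of this kind can be sketched as running the same query on both sides and comparing the rows and the elapsed time. In the following Python example, run_on_base and run_on_cdw are hypothetical callables that you supply (for example, thin wrappers around your own database client), and max_slowdown is an assumed acceptance threshold.

```python
import time


def timed(run_query):
    """Run a zero-argument callable and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = run_query()
    return rows, time.perf_counter() - start


def compare_runs(run_on_base, run_on_cdw, max_slowdown=1.2):
    """Compare one query on the base cluster and on CDW.

    Rows are compared after sorting, so row order is ignored.
    max_slowdown is an assumed budget: CDW may take up to 20% longer.
    """
    base_rows, base_s = timed(run_on_base)
    cdw_rows, cdw_s = timed(run_on_cdw)
    return {
        "results_match": sorted(base_rows) == sorted(cdw_rows),
        "base_seconds": base_s,
        "cdw_seconds": cdw_s,
        "within_budget": cdw_s <= base_s * max_slowdown,
    }


# Stand-in callables that return fixed rows instead of querying a cluster.
report = compare_runs(lambda: [(1, "a"), (2, "b")],
                      lambda: [(2, "b"), (1, "a")])
print(report["results_match"])
```

A single run is noisy, so in practice you would repeat each query several times and compare representative timings rather than one measurement.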