Products Comparison

GCP Data Analytics Products Differences

Cloud Dataproc vs Cloud Dataflow

Here are three main points to consider when choosing between Dataproc and Dataflow:

  • Provisioning: Dataproc requires manual provisioning of clusters. Dataflow is serverless and provisions resources automatically.

  • Hadoop dependencies: Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem.

  • Portability: Dataflow/Beam provides a clear separation between the processing logic and the underlying execution engine. This helps with portability across execution engines that support the Beam model, i.e. the same pipeline code can run on Dataflow, Spark, or Flink by selecting a different runner.
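The three points above can be sketched as a small decision helper. This is only an illustration of the reasoning in this list, not an official Google tool: the `Workload` class, its field names, and the `suggest_service` function are hypothetical, and the fallback to Dataproc when none of the criteria apply is an assumption (it treats hands-on cluster management as the default).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a processing job (field names are illustrative)."""
    uses_hadoop_ecosystem: bool      # depends on Hadoop ecosystem tools
    wants_serverless: bool           # prefers no manual cluster provisioning
    needs_engine_portability: bool   # may need to run on Spark or Flink later


def suggest_service(w: Workload) -> str:
    """Return "Dataproc" or "Dataflow" based on the three points in the list."""
    # Hadoop ecosystem dependencies take priority over the other criteria.
    if w.uses_hadoop_ecosystem:
        return "Dataproc"
    # Serverless provisioning or Beam portability both point to Dataflow.
    if w.wants_serverless or w.needs_engine_portability:
        return "Dataflow"
    # Assumption: with no stated preference, fall back to managed clusters.
    return "Dataproc"


if __name__ == "__main__":
    job = Workload(uses_hadoop_ecosystem=False,
                   wants_serverless=True,
                   needs_engine_portability=False)
    print(suggest_service(job))
```

Checking the Hadoop dependency first reflects the wording of the second point, which says Dataproc should be used whenever such a dependency exists, regardless of the other preferences.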

A flowchart on the Google Cloud website explains how to go about choosing one over the other.

More information: https://stackoverflow.com/questions/46436794/what-is-the-difference-between-google-cloud-dataflow-and-google-cloud-dataproc
