Python has quickly solidified itself as one of the top languages for data scientists looking to prep, process, and analyze data for analytics and machine learning related use cases. Dask is a Python library for parallel computing with similar APIs to the most popular Python data science libraries such as Pandas, NumPy and scikit-learn. Dask’s parallel processing
Today, Dask is the most commonly used parallelism framework within the PyData and SciPy communities. Dask is designed to scale from parallelizing workloads on the CPUs in your laptop to thousands of nodes in a cloud cluster. In conjunction with the RAPIDS framework developed by NVIDIA, you can utilize the parallel processing power of both CPUs and NVIDIA GPUs.
Dask is built for the Python data science community
Dask is built on top of NumPy, Pandas, Scikit-Learn and other popular Python data science libraries. As such, the APIs are deliberately designed to help you seamlessly transition from these core libraries to the scalable Dask versions of each. The Dask documentation shows some excellent examples of how some of these libraries translate to Dask, which you can find here.
How Dask is used
Dask is being used by data science teams working on a wide range of problems, including high-performance computing, climate science, banking and imaging problems. Additionally, Dask is also well-suited for business intelligence problems. See here for a list of problems that teams have made progress using Dask.
Why use Dask on Dataproc
Dask provides a fast and easy way to run data transformation jobs on your big data. With Dask-Yarn, a Skein-based tool for running Dask applications on Yarn, the task scheduling is relegated to the YARN scheduler, freeing you from needing to manage another set of software on your cluster. Yarn takes care of allocating the resource management necessary to finish processing your jobs. Additionally, you get access to the full set of features offered by the Dataproc service, including Autoscaling, Jupyter component and component gateway for submitting jobs via a Jupyter Notebook.
Dask supports data loads from many different sources such as GCS and HDFS, and many different data types such as CSV, parquet and avro. These are supported by different projects such PyArrow, GCSFS, FastParquet, and FastAvro, all of which are included with Dataproc.
Additionally, you can also configure Dask on Dataproc to utilize Dask with its native scheduler, as opposed to Yarn.
Create a Dataproc cluster with Dask
You can create a Dataproc cluster by selecting a region with the Dask initialization action, Jupyter optional component and component gateway enabled with the following command.