Vaex: One Library To Rule Them All.

Pandas vs Dask vs Vaex

Shritam Kumar Mund
5 min read · Jun 6, 2021

Have you ever wondered how much data our planet creates every day?

Well, in 2020, that figure stood at 2.5 quintillion bytes per day.

There are 18 zeros in a quintillion. Just FYI.

Data growth statistics already show some big numbers, and they’ll only get bigger. As data enthusiasts, our goal is to run calculations and processing on that data as fast as possible. In today’s blog, we are going to compare some of the best libraries out there for loading and processing larger datasets.

  1. Pandas
  2. Dask
  3. Vaex

Note:

  1. The dataset used in this blog post is the NYC Taxi Record Data — Jan-2020, which can be downloaded from here.
  2. The size of the CSV file is around 600 MB.
  3. The machine running the scripts has 16 GB of RAM.

Pandas


In Python, Pandas is the most popular library for data engineering; practically every Pythonista working in data science uses it. As long as the data we work with is small enough to fit in RAM, Pandas is amazing. But in reality, we often have to deal with much larger datasets, several gigabytes or more in size.


In such cases, going ahead with Pandas isn’t a very good idea, because it isn’t really built for speed. Pandas is designed to work on a single core: even though most of our machines have multiple CPU cores, Pandas cannot utilize them.

Let’s do a quick experiment and see just how “quick” it really is.

Load the dataset using Pandas:

import pandas as pd

%%time
df = pd.read_csv('yellow_tripdata_2020-01.csv')
df.head()

%%time is a magic command and part of IPython; it prints the wall time for the entire cell. I used %%time to see how long it takes to load the file, and as expected, it took around 10 seconds to load the 600 MB file.

Create a new feature using Pandas:

Let’s go one step further and see how much time it takes to create a new feature, or column, on the dataset. For this test case, I’ll multiply two of the columns to create a new one.

%%time
df['new_column'] = df.PULocationID * df.DOLocationID

Let’s make a note of this observation to compare against Dask and Vaex: it took around 47.6 milliseconds.

Dask

https://dask.org/

Dask is known for parallel computations on single machines by leveraging the multi-core CPUs and streaming data efficiently from disk. It’s one of the toughest competitors to Pandas. Parallel programming offers many benefits, but it’s also quite complex no matter whether you’re using threads, CPU cores, GPUs, or clusters.


Parallel processing doesn’t always work out as neatly as you expect. Sometimes it requires more hardware resources to complete heavy tasks.

Let’s check the performance using Dask. The Jupyter Kernel was restarted before running Dask commands.

Load the dataset using Dask

import dask.dataframe as dd

%%time
dask_df = dd.read_csv('yellow_tripdata_2020-01.csv')
dask_df.head()

As you can see, the same task finishes in 1.37 seconds. Part of this is parallelism, but it also reflects Dask’s lazy design: read_csv builds a task graph and splits the file into partitions, and head() only needs the first partition, so the whole 600 MB file is never loaded eagerly.

Let’s check how much time it takes to create a new feature.

Create a new feature with Dask

%%time
dask_df['new_column'] = dask_df.PULocationID * dask_df.DOLocationID

10.9 milliseconds, less than a quarter of the time Pandas needed! Bear in mind that this, too, is lazy: Dask only records the new column in its task graph here, and the actual multiplication runs when a result is computed.
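To make the lazy model concrete, here is a minimal sketch (an illustration added here, not part of the timed experiment): operations on a Dask DataFrame only build a task graph, and calling .compute() is what triggers the actual parallel execution and returns a concrete result.

import dask.dataframe as dd

# Reading and transforming only build a task graph; no heavy work runs yet.
dask_df = dd.read_csv('yellow_tripdata_2020-01.csv')
dask_df['new_column'] = dask_df.PULocationID * dask_df.DOLocationID

# .compute() executes the graph in parallel and returns a concrete value.
print(dask_df['new_column'].mean().compute())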

Vaex

https://vaex.io/

Vaex is a high-performance Python library for lazy, out-of-core DataFrames, built to explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second, and it supports multiple visualizations, allowing interactive exploration of big data. Vaex relies heavily on two ideas: lazy evaluation and memory mapping.

  1. Lazy evaluation: not doing any computation until it’s certain the results are needed.
  2. Memory mapping: treating files on the hard drive as if they were loaded into RAM (see the short sketch after this list).
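As a quick illustration of both ideas, here is a minimal sketch using Vaex’s built-in example dataset rather than the taxi data:

import vaex

# vaex.example() memory-maps a small HDF5 dataset bundled with Vaex.
df = vaex.example()

# Assigning an expression creates a virtual column; nothing is computed yet.
df['r'] = (df.x**2 + df.y**2)**0.5

# The computation runs, in parallel, only when a result is actually needed.
print(df.r.mean())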

Vaex requires converting the CSV to HDF5 (Hierarchical Data Format version 5) first, which doesn’t bother me: you can go for a short break, come back, and the data will be converted. You can also stream existing data directly from S3.

Create HDF5 files

import vaex

vaex_df = vaex.from_csv('yellow_tripdata_2020-01.csv', convert=True, chunk_size=5_000_000)
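As a side note on the S3 streaming mentioned above, here is a hedged sketch (the bucket and file path are hypothetical); Vaex lazily fetches only the parts of the file it actually touches:

# Hypothetical bucket and file name; '?anon=true' requests anonymous access.
df_s3 = vaex.open('s3://your-bucket/yellow_tripdata_2020-01.hdf5?anon=true')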

Now, let’s repeat the operations above but with Vaex.

Load the dataset using Vaex
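The original post showed this step as a screenshot; here is a minimal sketch of the equivalent code, assuming the default .hdf5 file name produced by the conversion above:

%%time
vaex_df = vaex.open('yellow_tripdata_2020-01.csv.hdf5')
vaex_df.head()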

Wow! Did you see that? It took only 16.9 milliseconds.

Create a new feature using Vaex

Now let’s see the result for our other operation, i.e. creating a new feature:

%%time
vaex_df['new_column'] = vaex_df.PULocationID * vaex_df.DOLocationID

Vaex needed only 751 microseconds to create the new feature. That speed comes from the same lazy design: the new column is virtual, so Vaex merely records the expression and touches no data here.

Results

The table below shows the execution times of the Pandas vs Dask vs Vaex experiment.

Operation              Pandas     Dask      Vaex
Load the dataset       ~10 s      1.37 s    16.9 ms
Create a new feature   47.6 ms    10.9 ms   751 µs

The winner of the experiment is clear. Vaex processes the dataset in a handful of milliseconds or less, while Pandas and Dask can’t come close. And this experiment only tests performance on a comparatively small dataset; the difference gets even more drastic as the file size grows.

Thank you for reading!

Follow me on Medium for the latest updates. 😃

Also Read,

  1. How to Create an AI App to Generate Crontab Rules Using OpenAI and Streamlit.
  2. How I built a Medicine Search Engine using Elastic Search from scratch using Python?
  3. A-Z Exploratory Data Analysis
