Vaex: One Library To Rule Them All.
Pandas vs Dask vs Vaex
Have you ever wondered how much data our planet creates every day?
Well, in 2020, that figure stood at 2.5 quintillion bytes per day.
There are 18 zeros in a quintillion. Just FYI.
Data growth stats throw around some big numbers, and they’ll only get bigger. As data enthusiasts, our goal is to run whatever calculations or processing we need on that data as fast as possible. In today’s blog, we are going to compare some of the best libraries out there for loading and processing larger datasets:
- Pandas
- Dask
- Vaex
Note:
1. The dataset we are going to use in this blog post is the NYC Taxi Record Data — Jan 2020, which can be downloaded from here.
2. The size of the CSV file is around 600MB.
3. The system I am running the scripts on has 16GB of RAM.
Pandas
In Python, Pandas is the most popular library for working with data, and nearly every Pythonista in the field of data science uses it. As long as the data we work with is small enough to fit in RAM, Pandas is amazing. But in reality, we often have to deal with much larger datasets, several gigabytes in size or more.
In such cases, going ahead with pandas isn’t a very good idea, because it isn’t really built for speed: Pandas is designed to work on a single core, and even though most of our machines have multiple CPU cores, it can’t utilize them.
Let's do a quick experiment. "Quick"? Well, let's see.
Load the dataset using Pandas:
import pandas as pd

%%time
df = pd.read_csv('yellow_tripdata_2020-01.csv')
df.head()
%%time is a magic command that is part of IPython: it prints the wall time for the entire cell. I have used %%time to see how much time it takes to load the file, and as expected, it took around 10 seconds to load the 600MB file.
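Note that %%time only works inside IPython/Jupyter. If you want to reproduce the measurement in a plain Python script, a minimal equivalent using only the standard library looks like this:

import time
import pandas as pd

# wall-clock timing outside a notebook
start = time.perf_counter()
df = pd.read_csv('yellow_tripdata_2020-01.csv')
print(f'Load took {time.perf_counter() - start:.1f} s')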
Create a new feature using Pandas:
Let’s go one step further and see how much time it takes to create a new feature (column) on the dataset. For this test case, I’ll multiply two of the columns to create a new one.
%%time
df['new_column'] = df.PULocationID * df.DOLocationID
Let's make a note of this result so we can compare it with Dask and Vaex: it took around 47.6 milliseconds.
Dask
Dask is known for parallel computation on a single machine, leveraging multi-core CPUs and streaming data efficiently from disk. It’s one of the toughest competitors to Pandas. Parallel programming offers many benefits, but it’s also quite complex, whether you’re using threads, CPU cores, GPUs, or clusters.
Parallel processing doesn’t always work out as neatly as you expect, and heavy tasks can end up demanding more hardware resources than anticipated.
Let’s check the performance using Dask. The Jupyter Kernel was restarted before running Dask commands.
Load the dataset using Dask
import dask.dataframe as dd

%%time
dask_df = dd.read_csv('yellow_tripdata_2020-01.csv')
dask_df.head()
As you can see, with the help of parallel computing it finished the same task in just 1.37 seconds.
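One caveat worth knowing: part of this speed comes from laziness. dd.read_csv doesn’t read the whole file up front; it builds a task graph, and .head() only pulls in the first partition. To make Dask scan the entire file, you have to ask for a concrete result, for example (a small sketch using fare_amount, one of the columns in the taxi dataset):

%%time
# .mean() is also lazy; .compute() triggers the full parallel read
print(dask_df.fare_amount.mean().compute())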
Let’s check how much time it takes to create a new feature.
Create a new feature with Dask
%%time
dask_df['new_column'] = dask_df.PULocationID * dask_df.DOLocationID
10.9 milliseconds! Less than a quarter of the time Pandas needed. (Keep in mind that this assignment is lazy as well: Dask records the operation in its task graph and only computes the column when a result is requested.)
Vaex
Vaex is a high-performance Python library for lazy, out-of-core DataFrames, built to explore big tabular datasets. It can calculate basic statistics over more than a billion rows per second, and it supports multiple visualizations for interactive exploration of big data. Vaex relies heavily on two ideas, illustrated in the sketch after this list:
- lazy evaluation: not doing any computation until it’s certain the results are needed.
- memory mapping: treating files on hard drives as if they were loaded into RAM
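Here is a minimal sketch of lazy evaluation in action, using small synthetic arrays rather than the taxi file:

import numpy as np
import vaex

# build a dataframe from in-memory arrays (synthetic data)
df = vaex.from_arrays(x=np.arange(1_000_000), y=np.arange(1_000_000))
# 'z' is a virtual column: Vaex stores only the expression, no new array
df['z'] = df.x * df.y
# the multiplication actually runs here, in chunks, when a result is needed
print(df.z.mean())

Memory mapping is the second half of the story: when you open an HDF5 file, the columns stay on disk and the operating system pages in only the parts that are actually touched.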
Vaex requires converting the CSV to HDF5 (Hierarchical Data Format version 5), which doesn’t bother me: you can go for a short break, come back, and the data will be converted. You can also stream existing data directly from S3.
Create HDF5 files
import vaex

vaex_df = vaex.from_csv('yellow_tripdata_2020-01.csv', convert=True, chunk_size=5_000_000)
Now, let’s repeat the operations above but with Vaex.
Load the dataset using Vaex
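convert=True writes an HDF5 copy next to the source CSV, so loading afterwards is just a matter of opening (memory-mapping) that file. Roughly, the step looks like this; the exact output filename is an assumption based on Vaex’s convention of appending .hdf5 to the source path:

%%time
# opening an HDF5 file memory-maps it; no data is copied into RAM
vaex_df = vaex.open('yellow_tripdata_2020-01.csv.hdf5')  # filename assumed
vaex_df.head()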
Wow! Did you see that? It took only 16.9 milliseconds.
Create a new feature using Vaex
Now let’s see the result for our next operation, i.e., creating a new feature:
%%time
vaex_df['new_column'] = vaex_df.PULocationID * vaex_df.DOLocationID
Vaex needed only 751 microseconds to create the new feature.
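It’s that fast because, as with the load, nothing has actually been computed yet: the new column is virtual, and Vaex merely records the expression. Aggregating over the column forces the evaluation, as in this quick sketch:

%%time
# force evaluation of the virtual column by computing a statistic over it
print(vaex_df.new_column.mean())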
Results
The table below shows the execution times from the Pandas vs Dask vs Vaex experiment.

| Operation | Pandas | Dask | Vaex |
| --- | --- | --- | --- |
| Load the dataset | ~10 s | 1.37 s | 16.9 ms |
| Create a new feature | 47.6 ms | 10.9 ms | 751 µs |
The winner of the experiment is clear. Vaex processes the dataset in milliseconds or less, while Pandas and Dask can’t come close. And keep in mind that this experiment used a relatively small dataset; the gap only becomes more drastic as the file size grows.
Thank you for reading!
Follow me on Medium for the latest updates. 😃