The Python ecosystem is continuously changing. Many people who use Pandas could benefit from Polars, an alternative that is generally faster. This article motivates the transition from Pandas to Polars on the basis of performance.

Regardless of the package, many people who use Pandas will:

  • Read Data
  • Derive some insight
  • Share data

The data set used is International trade: June 2022 quarter, provided by the New Zealand Government. See this link.

The purpose of this analysis is to compare imports vs exports across the different services per country for New Zealand; in other words, to build a simple pivot table of the data provided.
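To make the goal concrete, here is a minimal sketch of what a pivot table computes, written in plain Python with no third-party packages. The rows are made up for illustration (only the column names come from the data set), and `pivot_sum` is a hypothetical helper, not a Pandas or Polars function.

```python
from collections import defaultdict

# Made-up rows that mimic three of the data set's columns (values are illustrative only).
rows = [
    {"account": "Exports", "product_type": "Goods", "value": 100.0},
    {"account": "Exports", "product_type": "Services", "value": 40.0},
    {"account": "Imports", "product_type": "Goods", "value": 80.0},
    {"account": "Imports", "product_type": "Goods", "value": 20.0},
    {"account": "Imports", "product_type": "Services", "value": None},  # missing field
]


def pivot_sum(records, index, columns, values):
    """Sum `values` for every (index, columns) pair, skipping missing values."""
    table = defaultdict(lambda: defaultdict(float))
    for record in records:
        if record[values] is None:
            continue
        table[record[index]][record[columns]] += record[values]
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {row_key: dict(cols) for row_key, cols in table.items()}


pivot = pivot_sum(rows, index="account", columns="product_type", values="value")
print(pivot)
```

Each distinct `account` becomes a row, each distinct `product_type` becomes a column, and the cells hold the summed `value`. The Pandas and Polars calls later in the article do the same job on the full file.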

Read Data

If Excel were your go-to tool, you would quickly realise that the data is too big, and even if it did fit, some missing fields prevent the pivot table experience from working out of the box. Here is a side-by-side comparison of how to bring the CSV file into Python using Pandas and Polars.

import pandas as pd


filename = "output_csv_full.csv"
dict_types = {
    "time_ref": int,
    "account": str,
    "code": str,
    "country_code": str,
    "product_type": str,
    "value": float,
    "status": str,
}

def load_pandas():
    return pd.read_csv(filename, dtype=dict_types)


import polars as pl


filename = "output_csv_full.csv"
dict_types = {
    "time_ref": int,
    "account": str,
    "code": str,
    "country_code": str,
    "product_type": str,
    "value": float,
    "status": str,
}

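def load_polars():
    # Note: newer Polars releases rename `dtypes` to `schema_overrides`;
    # check the documentation for the version you have installed.
    return pl.read_csv(filename, dtypes=dict_types)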

Benchmarking only the CSV read with pytest and pytest-benchmark showed that pandas is about 9x slower than polars.

Package    Duration (ns)    Times Slower
Polars          128.1748            1.0
Pandas        1,158.5350           9.04
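The timings in this article come from pytest-benchmark. As a rough stand-in that needs only the standard library, the sketch below uses `timeit` instead. The helper names (`best_time`, `make_sample_csv`, `load_with_csv_module`) and the sample file are assumptions made for illustration; to time the real loaders, pass `load_pandas` or `load_polars` to `best_time`.

```python
import csv
import tempfile
import timeit
from pathlib import Path


def best_time(func, repeat=5, number=3):
    """Return the best average wall-clock time (in seconds) of one call to func."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def make_sample_csv(path, rows=1000):
    """Write a small, made-up CSV so the sketch runs without the real data set."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["time_ref", "account", "value"])
        for i in range(rows):
            writer.writerow([202206, "Exports" if i % 2 else "Imports", i * 1.5])


def load_with_csv_module(path):
    """Stand-in loader; swap in load_pandas or load_polars from the article."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    make_sample_csv(sample)
    seconds = best_time(lambda: load_with_csv_module(sample))
    row_count = len(load_with_csv_module(sample))

print(f"loaded {row_count} rows, best run: {seconds * 1e3:.3f} ms per load")
```

Taking the minimum over several repeats reduces noise from other processes. pytest-benchmark reports richer statistics (mean, standard deviation, rounds), so this sketch only approximates it.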

Derive some insight – creating a pivot table

Looking at the data only makes sense if you can derive some insight or information from it. One common way is to create a pivot table. Please note that this is where, according to the Polars documentation, polars != pandas.

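import numpy as np  # needed for np.sum below


def derive_insight_pandas(df):
    return pd.pivot_table(
        df, index="account", columns="product_type", values="value", aggfunc=np.sum
    )


def derive_insight_polars(df):
    # Note: newer Polars releases rename some of these arguments (for example,
    # aggregate_fn became aggregate_function); check the docs for your version.
    return df.pivot(
        index="account", columns="product_type", aggregate_fn="sum", values="value"
    )
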
Package    Duration (ns)    Times Slower
Polars           44.5088            1.0
Pandas          229.5117           5.16

In this case, the speed-up is not as big as 9x, but 5x is still impressive. Please note that we are using just a simple pivot command, which in most cases is too simplistic but is a common first approach when looking at a single CSV file.
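The 9x and 5x figures follow directly from the reported durations: each package's time is divided by the fastest time in the same test. A minimal sketch of that arithmetic, using the numbers from the two tables above (`times_slower` is a hypothetical helper name):

```python
def times_slower(durations):
    """Divide every duration by the fastest one and round to two decimals."""
    fastest = min(durations.values())
    return {name: round(value / fastest, 2) for name, value in durations.items()}


# Durations as reported in the article's tables.
read_csv_durations = {"Polars": 128.1748, "Pandas": 1158.5350}
pivot_durations = {"Polars": 44.5088, "Pandas": 229.5117}

read_ratios = times_slower(read_csv_durations)
pivot_ratios = times_slower(pivot_durations)

print(read_ratios)   # {'Polars': 1.0, 'Pandas': 9.04}
print(pivot_ratios)  # {'Polars': 1.0, 'Pandas': 5.16}
```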

Writing to CSV

The last test to compare the two packages is exporting the resulting data to CSV. In this case we export the result of the pivot table.

def write_csv_pandas(df):
    df.to_csv("pandas_pivot.csv")


def write_csv_polars(df):
    df.write_csv("polars_pivot.csv")

Package    Duration (ns)    Times Slower
Polars          263.1620            1.0
Pandas          733.1875           2.79

In this case we see a 2.79x speed improvement over pandas. Polars therefore outperformed pandas in each of the three steps measured here: reading, pivoting, and writing.

Please note: this article may give the impression that there is a one-to-one mapping of Pandas functions to Polars. This is NOT the case. To get the most out of Polars, please read the official documentation, which has a section about coming from Pandas.

Given this insight, we are also selling t-shirts with the Polars logo on the VersionBay Store.