...

Polars vs. Pandas: Polars DataFrame Tutorial

Hello friends!

Today we are going to talk about Polars, the new DataFrame library that has the data science world abuzz. Will Polars replace Pandas? Find out here.

 

What is Polars?

If you’re reading this site and/or you’ve taken my courses, you know all about Pandas. Pandas is most popular for its DataFrame, inspired by DataFrames in R. But Pandas is not without its limits, especially when it comes to “big data”. The Polars library aims to improve on Pandas’ lack of speed and efficiency for large datasets.

So what is Polars? Polars is a high-speed DataFrame library and in-memory query engine that excels in data wrangling, pipelines, and super-fast APIs. Its parallel execution capabilities, efficient caching algorithms, and user-friendly API make it an ideal choice for data manipulation.

Features of Polars:

  • Polars DataFrames have no index, which makes the data easier to manipulate.
  • Polars uses Apache Arrow arrays, which are more efficient in load time, memory usage, and computation than the NumPy arrays used by Pandas.
  • Polars can run more operations in parallel, thanks to its implementation in Rust.
  • Polars supports lazy evaluation, which lets it optimize queries for faster performance and lower memory usage, whereas Pandas only supports eager evaluation.

 

How to install Polars

Install Polars using pip:

pip install polars

Or install Polars in your Anaconda environment:

conda install -c conda-forge polars

 

How to create a DataFrame with Polars

 

This will output a nicely formatted table just like Pandas does.

Note: unlike Pandas, column names must be strings. You may recall that Pandas allows you to use integers as column names.

 

Check the types of each column:

This outputs:

[Utf8, Int64, Int64, Int64]

 

Like Pandas, you can get the column names by calling the columns attribute:

This outputs:

['name', 'hours_studied', 'hours_on_tiktok', 'passed_exam']

 

You can also get the DataFrame as a list of tuples, one per row, by calling the rows method (not recommended for large datasets):

This outputs:

[('Alice', 1, 10, 0), ('Bob', 2, 0, 1), ('Carol', 3, 0, 1)]

 

Selecting Rows and Columns

The syntax for selecting columns in Polars is a bit more verbose than in Pandas, but still relatively simple:

This outputs:

You can also select multiple columns by passing in a list of column names:

This outputs:

You can also select columns using col expressions (when you pass a plain string instead, Polars infers that it refers to a column):

This outputs:

Using the col object, you can also select columns by their type:

This returns all Int64 columns:

This is how you’d select all string columns:

This outputs:

There are multiple ways to select rows.

The row function:

As you can see, this returns a tuple containing the elements of the row.

You can also use square brackets, which return a DataFrame (unlike the row method, which returns a tuple). One thing I always thought was weird about Pandas is that square brackets select columns, whereas most tabular or array-like data structures (e.g. NumPy arrays) use this notation to select rows.

Select the first 2 rows (you can use ranges):

This outputs:

Select the first row only (you can use integer indices):

This outputs:

Select the first and last rows (you can use multiple integer indices in a list, and you can use negative numbers to count from the end):

This outputs:

You can use the filter method to select rows based on logical conditions:

This outputs:

You can chain multiple logical conditions together:

This outputs:

Note: in addition to AND (&), other operators include OR (|) and NOT (~), just like Pandas.

Finally, you can chain filter and select methods together:

This outputs:

 

Read and write data files

Of course, you’re not going to be entering your data manually in the form of lists (unless you are a masochist).

Polars includes the familiar “read_csv” and “read_json” functions that Pandas has:

I always like to get a sense of what’s in my DataFrames by using the head() and tail() functions (just like in Pandas):

This outputs:

And tail:

This outputs:

Save your CSVs:

 

Will Polars replace Pandas?

It’s probably too early to tell, but it’s definitely something to keep an eye on!