Today we are going to talk about Polars, the new DataFrame library that has the data science world abuzz. Will Polars replace Pandas? Find out here.
What is Polars?
If you’re reading this site and/or you’ve taken my courses, you know all about Pandas. Pandas is best known for its DataFrame, which was inspired by data frames in R. But Pandas is not without its limits, especially when it comes to “big data”. The Polars library aims to improve on Pandas’ speed and efficiency with large datasets.
So what is Polars? Polars is a high-speed DataFrame library and in-memory query engine that excels at data wrangling, data pipelines, and fast APIs. Its parallel execution engine, cache-efficient algorithms, and user-friendly API make it a strong choice for data manipulation.
Features of Polars:
- Polars DataFrames have no index, which makes the data easier to manipulate.
- Polars uses Apache Arrow arrays, which are more efficient in load time, memory usage, and computation than the NumPy arrays that Pandas uses by default.
- Polars can run more operations in parallel, thanks to its implementation in Rust.
- Polars supports lazy evaluation, which lets it optimize queries for faster performance and lower memory usage. Pandas supports only eager evaluation.
How to install Polars
Install Polars using pip:
pip install polars
Or install Polars in your Anaconda environment:
conda install -c conda-forge polars
How to create a DataFrame with Polars
This will output a nicely formatted table just like Pandas does.
Note: unlike Pandas, column names must be strings. You may recall that Pandas allows you to use integers as column names.
Check the types of each column:
[Utf8, Int64, Int64, Int64]
(Recent Polars releases display the string type as String rather than Utf8.)
Like Pandas, you can get the column names by calling the columns attribute:
['name', 'hours_studied', 'hours_on_tiktok', 'passed_exam']
You can also get the DataFrame represented as a list of tuples (not recommended for large datasets) by calling the rows function:
[('Alice', 1, 10, 0), ('Bob', 2, 0, 1), ('Carol', 3, 0, 1)]
Selecting Rows and Columns
The syntax for selecting columns in Polars is a bit more verbose than in Pandas, but still relatively simple:
You can also select multiple columns by passing in a list of column names:
You can select columns using ‘col’ objects (when using a string, Polars infers that it corresponds to a column):
Using the col object, you can also select columns by their type:
This returns all Int64 columns (hours_studied, hours_on_tiktok, and passed_exam).
This is how you’d select all string columns:
There are multiple ways to select rows.
The row function:
As you can see, this returns a tuple containing the elements of the row.
You can also use square brackets, which will return a DataFrame (not a tuple, unlike the row function). One thing I always thought was weird about Pandas is that square brackets are used to select columns, whereas for most tabular/array-like data structures (e.g. NumPy arrays) this notation is used to select rows.
Select the first 2 rows (you can use ranges):
Select the first row only (you can use integer indices):
Select the first and last rows (you can use multiple integer indices in a list, and you can use negative numbers to count from the end):
You can use the filter method to select rows based on logical conditions:
You can chain multiple logical conditions together:
Note: in addition to AND (&), other operators include OR (|) and NOT (~), just like Pandas.
Finally, you can chain filter and select methods together:
Read and write data files
Of course, you’re not going to be entering your data manually in the form of lists (unless you are a masochist).
Polars includes the familiar “read_csv” and “read_json” functions that Pandas has:
I always like to get a sense of what’s in my DataFrames by using the head() and tail() functions (just like in Pandas):
Save your CSVs:
Will Polars replace Pandas?
It’s probably too early to tell, but it’s definitely something to keep an eye on!