INTRODUCTION TO PANDAS

Jul 5, 2021

Pandas is an open-source library made mainly for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical data and time series. The library is built on top of NumPy. Pandas is fast, and it helps users stay productive when working with data.

Install Pandas

pip install pandas

Pandas Data Structures

Series

A Series is an object similar to Python's built-in list, but it differs in that each element has an associated label, the so-called index. This feature makes it look like an associative array or dictionary (a hashmap).

>>> import pandas as pd
>>> my_series = pd.Series([5, 6, 7, 8, 9, 10])
>>> my_series
0 5
1 6
2 7
3 8
4 9
5 10
dtype: int64
>>>

Take a look at the output above and you will see that the index is on the left and the values are on the right. If the index is not provided explicitly, pandas creates a RangeIndex running from 0 to N-1, where N is the total number of elements. Moreover, each Series object has a data type (dtype); in our case the data type is int64.

A Series has attributes that expose its index and its values:

>>> my_series.index
RangeIndex(start=0, stop=6, step=1)
>>> my_series.values
array([ 5, 6, 7, 8, 9, 10], dtype=int64)

You can retrieve elements by their index number:

>>> my_series[4]
9

You can provide the index (labels) explicitly:

>>> my_series2 = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> my_series2['f']
10

It is easy to retrieve several elements by their labels, or to assign a value to several elements at once:

>>> my_series2[['a', 'b', 'f']]
a 5
b 6
f 10
dtype: int64
>>> my_series2[['a', 'b', 'f']] = 0
>>> my_series2
a 0
b 0
c 7
d 8
e 9
f 0
dtype: int64

Filtering and math operations are easy as well:

>>> my_series2[my_series2 > 0]
c 7
d 8
e 9
dtype: int64
>>> my_series2[my_series2 > 0] * 2
c 14
d 16
e 18
dtype: int64

Because a Series is very similar to a dictionary, where the key is an index label and the value is an element, we can do this:

>>> my_series3 = pd.Series({'a': 5, 'b': 6, 'c': 7, 'd': 8})
>>> my_series3
a 5
b 6
c 7
d 8
dtype: int64
>>> 'd' in my_series3
True

A Series object and its index also have name attributes, so you can label them:

>>> my_series3.name = 'numbers'
>>> my_series3.index.name = 'letters'
>>> my_series3
letters
a 5
b 6
c 7
d 8
Name: numbers, dtype: int64

The index can be changed on the fly by assigning a list to the index attribute:

>>> my_series3.index = ['A', 'B', 'C', 'D']
>>> my_series3
A 5
B 6
C 7
D 8
Name: numbers, dtype: int64

Bear in mind that the length of the list must equal the number of elements in the Series. Labels do not strictly have to be unique, since pandas accepts duplicates, but unique labels keep label-based lookups unambiguous.
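The length rule can be checked with a small sketch. The series and labels here are made up for illustration: assigning a list of the wrong length raises a ValueError, and `Index.is_unique` reports whether the labels are distinct.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

# A list with the wrong number of labels is rejected.
try:
    s.index = ['A', 'B', 'C']
    length_error = None
except ValueError as exc:
    length_error = exc

print(type(length_error).__name__)

# A list of the right length replaces the labels.
s.index = ['A', 'B', 'C', 'D']
print(list(s.index))

# is_unique tells you whether any label is repeated.
print(s.index.is_unique)
```

Checking `is_unique` before relying on label lookups is a cheap way to catch accidental duplicates.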

DataFrame

Simply put, a DataFrame is a table. It has rows and columns. Each column in a DataFrame is a Series object, and each row consists of one element from every column.

A DataFrame can be constructed from built-in Python dicts:

>>> df = pd.DataFrame({
... 'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
... 'population': [17.04, 143.5, 9.5, 45.5],
... 'square': [2724902, 17125191, 207600, 603628]
... })
>>> df
country population square
0 Kazakhstan 17.04 2724902
1 Russia 143.50 17125191
2 Belarus 9.50 207600
3 Ukraine 45.50 603628

To confirm that each column is a Series object, let’s do this:

>>> df['country']
0 Kazakhstan
1 Russia
2 Belarus
3 Ukraine
Name: country, dtype: object
>>> type(df['country'])
<class 'pandas.core.series.Series'>

A DataFrame object has two indexes: a column index and a row index. If you do not provide the row index explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the number of rows in the DataFrame.

>>> df.columns
Index(['country', 'population', 'square'], dtype='object')
>>> df.index
RangeIndex(start=0, stop=4, step=1)

Our table/DataFrame has 4 rows, labeled 0 to 3.
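If you do want to provide the row index yourself, the constructor takes an `index` argument. The following is a minimal sketch using the same data; the two-letter codes are illustrative labels chosen for this example, not something the data requires.

```python
import pandas as pd

# Same columns as in the example above; the two-letter codes are
# illustrative row labels.
df_labeled = pd.DataFrame(
    {
        'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
        'population': [17.04, 143.5, 9.5, 45.5],
        'square': [2724902, 17125191, 207600, 603628],
    },
    index=['KZ', 'RU', 'BY', 'UA'],
)

# .loc selects by label, .iloc selects by integer position.
by_label = df_labeled.loc['BY', 'population']
by_position = df_labeled.iloc[2]['country']
print(by_label, by_position)
```

With explicit labels, `.loc` and `.iloc` make it clear whether you mean "the row called BY" or "the third row".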

Pandas Panels

In older versions of pandas, a Panel was the container for three-dimensional data. It was used less frequently than Series or DataFrame. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so on current versions three-dimensional data is usually held in a DataFrame with a MultiIndex or in the xarray library.

A Panel had three axes:

items: Each item on this axis corresponds to one DataFrame. This is axis 0.
major_axis: This axis contains the rows (index) of each DataFrame. This is axis 1.
minor_axis: This axis contains the columns of each DataFrame. This is axis 2.

Creating a Panel in pandas

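import pandas as pd
# constructor signature; available only in pandas versions before 0.25
pd.Panel(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)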
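On pandas 0.25 and later, where `pd.Panel` is no longer available, one common substitute is a DataFrame with a two-level row index. The sketch below is one way to do that, not the only one; the frames `q1` and `q2` and the level names are made up for illustration.

```python
import pandas as pd

# Two small frames standing in for the "items" of a panel (made-up data).
q1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])
q2 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]}, index=['r0', 'r1'])

# Passing a dict to pd.concat uses its keys as the outer index level.
stacked = pd.concat({'q1': q1, 'q2': q2}, names=['item', 'row'])
print(stacked)

# Selecting one outer label gives back that item's rows and columns.
print(stacked.loc['q2'])

# A single cell is addressed by (item, row) and column.
print(stacked.loc[('q1', 'r1'), 'y'])
```

The outer level plays the role of the items axis, the inner level the major axis, and the columns the minor axis.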

Different ways to create a Pandas DataFrame

A Pandas DataFrame can be created in multiple ways. Let’s go through them one by one.
Method #1: Creating Pandas DataFrame from objects.

import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe.
df
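A list of dicts is another common in-memory source. As a small sketch with made-up records, each dict becomes one row, the keys become column names, and a missing key becomes NaN.

```python
import pandas as pd

# Each dict is one row; keys become column names.
records = [
    {'Name': 'tom', 'Age': 10},
    {'Name': 'nick', 'Age': 15},
    {'Name': 'juli'},  # 'Age' is missing here and becomes NaN
]
df_records = pd.DataFrame(records)
print(df_records)
```

Because of the missing value, pandas stores the Age column as floats rather than integers.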

Method #2: Creating Pandas DataFrame from file.

df = pd.read_excel(r"C:\Users\rs960\Downloads\Churn.xlsx")

Method #3: Creating Pandas DataFrame from database.
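A minimal sketch of this method, assuming an SQLite database: it uses an in-memory database so that it runs anywhere, and the `people` table and its columns are invented for the example. `pd.read_sql_query` runs a query over an open connection and returns the result as a DataFrame.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database with a hypothetical "people" table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [('tom', 10), ('nick', 15), ('juli', 14)],
)
conn.commit()

# Run a query and load the result set into a DataFrame.
df_db = pd.read_sql_query('SELECT name, age FROM people ORDER BY age', conn)
conn.close()
print(df_db)
```

For other database engines, the same call is typically used with a connection created through SQLAlchemy.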

Method #4: Creating Pandas DataFrame from API.
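Most web APIs return JSON, so this method usually comes down to turning parsed JSON into a DataFrame. The sketch below skips the network call and uses a hard-coded string in place of a response body; the field names are hypothetical. It assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level.

```python
import json

import pandas as pd

# Stand-in for the body of an HTTP response. A real program would fetch
# this with an HTTP client. The field names are hypothetical.
response_text = '''
[
  {"id": 1, "name": "tom", "address": {"city": "Almaty"}},
  {"id": 2, "name": "nick", "address": {"city": "Minsk"}}
]
'''

payload = json.loads(response_text)

# json_normalize flattens nested objects into dotted column names.
df_api = pd.json_normalize(payload)
print(df_api)
```

For flat records with no nesting, `pd.DataFrame(payload)` gives the same result.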

Describe and summarize the data

import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# removing rows with null values to avoid errors
data.dropna(inplace=True)
# percentile list
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
# calling describe method
desc = data.describe(percentiles=perc, include=include)
# display
desc
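The same call can be tried without downloading anything. This sketch uses a small made-up table and `include='all'`, which asks `describe` to summarize every column regardless of dtype; the column names and values are assumptions for illustration.

```python
import pandas as pd

# A small made-up table with text, integer-like and float columns.
data = pd.DataFrame({
    'Name': ['tom', 'nick', 'juli', 'anna', 'tom'],
    'Age': [10, 15, 14, None, 12],
    'Height': [1.40, 1.62, 1.55, 1.50, 1.48],
})

# Drop rows with missing values, returning a new frame.
clean = data.dropna()

# Summarize all columns and add custom percentiles for the numeric ones.
desc = clean.describe(percentiles=[.20, .40, .60, .80], include='all')
print(desc)
```

Numeric columns get count, mean, std, min, max and the percentiles, while text columns get count, unique, top and freq; rows that do not apply to a column are filled with NaN.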

DataFrame objects

Pandas offers two primary data structures: the Series and the DataFrame. Whereas a Series is a one-dimensional labeled array built on the NumPy ndarray, a DataFrame holds tabular data (and, through hierarchical indexing, higher-dimensional data) as a labeled, indexed set of observations. You can compare a DataFrame with an Excel spreadsheet or a relational database table. If you use R, this will look very familiar, as R also has data frames. You can use DataFrames for organizing data and for exploratory data analysis.

Creating a DataFrame Object

The following code loads the pandas package, reads a CSV file using a tab as the separator, and prints the DataFrame object. If you're running the same code in a Jupyter Notebook and display the DataFrame as the last expression in a cell, you'll notice that it is rendered as a neat table with borders, which plain IDE output lacks. When a large DataFrame is truncated for display, the notebook also prints the total number of rows and columns underneath it.

>>> import pandas as pd
>>> df = pd.read_csv(r"c:\data\myfile.csv", sep='\t')
>>> print(df)
>>> df.shape # tuple of (number of rows, number of columns)
>>> df.columns # column names of the dataset
>>> df.dtypes # data types of all columns
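To try these attributes without a file on disk, here is a self-contained sketch that reads tab-separated text from memory. The column names and values are made up; `io.StringIO` simply lets `read_csv` treat a string as if it were a file.

```python
import io

import pandas as pd

# Tab-separated text standing in for the contents of a file (made-up data).
tsv_text = (
    "name\tage\tscore\n"
    "tom\t10\t8.5\n"
    "nick\t15\t9.0\n"
    "juli\t14\t7.25\n"
)

df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_tsv.shape)          # (rows, columns)
print(list(df_tsv.columns))  # column names
print(df_tsv.dtypes)         # one dtype per column
```

pandas infers the dtypes while parsing, so `age` is read as an integer column and `score` as a float column.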