INTRODUCTION TO PANDAS
Pandas is an open-source library built mainly for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical data and time series. The library is built on top of NumPy and is designed for high performance and user productivity.
Install Pandas
pip install pandas
Pandas Data Structures
Series
A Series is an object similar to Python's built-in list, but it differs in that each element has an associated label, the so-called index. This feature makes it resemble an associative array or dictionary (hash map).
>>> import pandas as pd
>>> my_series = pd.Series([5, 6, 7, 8, 9, 10])
>>> my_series
0 5
1 6
2 7
3 8
4 9
5 10
dtype: int64
>>>
Look at the output above: the index is on the left and the values are on the right. If no index is provided explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the total number of elements. Each Series object also has a data type (dtype); in our case it is int64.
Series has attributes to extract its values and index:
>>> my_series.index
RangeIndex(start=0, stop=6, step=1)
>>> my_series.values
array([ 5, 6, 7, 8, 9, 10], dtype=int64)
You can retrieve elements by their index number:
>>> my_series[4]
9
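Because my_series uses the default RangeIndex, the label 4 and the position 4 happen to coincide. A minimal sketch of making the distinction explicit with .loc (label-based) and .iloc (position-based) follows; the series s and its integer labels are made up for illustration.

```python
import pandas as pd

# Hypothetical series whose integer labels do not match positions
s = pd.Series([5, 6, 7, 8, 9, 10], index=[10, 20, 30, 40, 50, 60])

by_label = s.loc[40]      # element whose label is 40
by_position = s.iloc[4]   # fifth element, regardless of its label

print(by_label, by_position)
```

Spelling out .loc or .iloc avoids ambiguity when the labels themselves are integers.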
You can provide the index (labels) explicitly:
>>> my_series2 = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> my_series2['f']
10
It is easy to retrieve several elements by their labels or to assign a value to a group of them at once:
>>> my_series2[['a', 'b', 'f']]
a 5
b 6
f 10
dtype: int64
>>> my_series2[['a', 'b', 'f']] = 0
>>> my_series2
a 0
b 0
c 7
d 8
e 9
f 0
dtype: int64
Filtering and math operations are easy as well:
>>> my_series2[my_series2 > 0]
c 7
d 8
e 9
dtype: int64
>>> my_series2[my_series2 > 0] * 2
c 14
d 16
e 18
dtype: int64
Because a Series is very similar to a dictionary, where each key is an index label and each value is an element, we can do this:
>>> my_series3 = pd.Series({'a': 5, 'b': 6, 'c': 7, 'd': 8})
>>> my_series3
a 5
b 6
c 7
d 8
dtype: int64
>>> 'd' in my_series3
True
The Series object and its index also have name attributes, so you can label them:
>>> my_series3.name = 'numbers'
>>> my_series3.index.name = 'letters'
>>> my_series3
letters
a 5
b 6
c 7
d 8
Name: numbers, dtype: int64
The index can be changed on the fly by assigning a list to the index attribute:
>>> my_series3.index = ['A', 'B', 'C', 'D']
>>> my_series3
A 5
B 6
C 7
D 8
Name: numbers, dtype: int64
But bear in mind that the length of the list must equal the number of elements in the Series. Labels do not have to be unique, although unique labels keep lookups unambiguous.
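A short sketch of the length rule, using a small throwaway series (s is not from the text above): assigning a list of the wrong length raises ValueError, and the index attribute is_unique reports whether any label repeats.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8])

try:
    s.index = ['A', 'B', 'C']   # 3 labels for 4 elements
    length_error = None
except ValueError as exc:
    length_error = exc          # pandas rejects the mismatched length

s.index = ['A', 'B', 'C', 'D']  # correct length: accepted
print(length_error is not None, s.index.is_unique)
```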
DataFrame
Simply put, a DataFrame is a table with rows and columns. Each column in a DataFrame is a Series object, and each row consists of the elements found at the same position in those Series.
DataFrame can be constructed using built-in Python dicts:
>>> df = pd.DataFrame({
... 'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
... 'population': [17.04, 143.5, 9.5, 45.5],
... 'square': [2724902, 17125191, 207600, 603628]
... })
>>> df
country population square
0 Kazakhstan 17.04 2724902
1 Russia 143.50 17125191
2 Belarus 9.50 207600
3 Ukraine 45.50 603628
To confirm that each column is a Series object, let’s do this:
>>> df['country']
0 Kazakhstan
1 Russia
2 Belarus
3 Ukraine
Name: country, dtype: object
>>> type(df['country'])
<class 'pandas.core.series.Series'>
A DataFrame object has two indexes: a column index and a row index. If you do not provide a row index explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the number of rows in the DataFrame.
>>> df.columns
Index(['country', 'population', 'square'], dtype='object')
>>> df.index
RangeIndex(start=0, stop=4, step=1)
Our DataFrame has 4 rows, labeled 0 to 3.
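To illustrate the other case, here is a minimal sketch of passing a row index explicitly. The two-letter labels are an assumption chosen for illustration and do not come from the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
        'population': [17.04, 143.5, 9.5, 45.5],
        'square': [2724902, 17125191, 207600, 603628],
    },
    index=['KZ', 'RU', 'BY', 'UA'],  # assumed labels, for illustration only
)

row = df.loc['BY']   # one row, returned as a Series
print(df.index.tolist())
print(row['country'], row['square'])
```

With an explicit index, rows are selected by label through .loc instead of by their default integer position.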
Pandas Panels
In pandas, a Panel was the container for three-dimensional data. It was used less frequently than Series or DataFrame, and it was deprecated in pandas 0.20 and removed in pandas 0.25, so the description below applies only to older versions.
A Panel has the following three axes:
- items: Each item in this axis corresponds to one data frame, and this is called axis 0.
- major_axis: This axis actually contains the rows or indexes of each of the data frames, and this is called axis 1.
- minor_axis: This axis actually contains all the columns of each of the data frames, and this is called axis 2.
Creating a Panel in pandas
import pandas as pd
# available only in pandas versions before 0.25
pd.Panel(data, items, major_axis, minor_axis, copy, dtype)
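On current pandas releases, which no longer ship Panel, a similar three-axis layout can be approximated with a DataFrame whose rows carry a two-level MultiIndex. The following is a minimal sketch under that assumption; the labels item1, r0, c0 and the variable panel_like are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two "items", each a 2x2 table
items = ['item1', 'item2']
major = ['r0', 'r1']
minor = ['c0', 'c1']

# Row labels are (item, major_axis) pairs; columns play the role of minor_axis
index = pd.MultiIndex.from_product([items, major], names=['items', 'major_axis'])
panel_like = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    index=index,
    columns=pd.Index(minor, name='minor_axis'),
)

first_item = panel_like.loc['item1']   # the 2x2 DataFrame for one item
print(first_item)
```

Selecting on the outer index level returns an ordinary DataFrame, which mirrors how each item of a Panel corresponded to one data frame.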
Different ways to create Pandas Dataframe
Pandas DataFrame can be created in multiple ways. Let’s discuss different ways to create a DataFrame one by one.
Method #1: Creating Pandas DataFrame from objects.
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
# display the DataFrame
df
Method #2: Creating Pandas DataFrame from file.
df = pd.read_excel(r"C:\Users\rs960\Downloads\Churn.xlsx")
Method #3: Creating Pandas DataFrame from database.
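One possible way to do this, sketched with Python's built-in sqlite3 module and pd.read_sql_query. The in-memory database, the people table, and its rows are assumptions made up for this example; a real project would connect to its own database instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a made-up "people" table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [('tom', 10), ('nick', 15), ('juli', 14)],
)
conn.commit()

# read_sql_query runs the SQL and returns the result set as a DataFrame
df_db = pd.read_sql_query('SELECT name, age FROM people ORDER BY age', conn)
conn.close()
print(df_db)
```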
Method #4: Creating Pandas DataFrame from API.
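APIs commonly return JSON, so one hedged sketch is to parse the response body and hand the records to pandas. The response_text string below is a stand-in for a real HTTP response and its fields are invented; no actual endpoint is called.

```python
import json
import pandas as pd

# Stand-in for the body of an HTTP response; a real call might use
# urllib.request or the requests package against an actual endpoint.
response_text = '''
[
  {"name": "tom",  "age": 10, "address": {"city": "Paris"}},
  {"name": "nick", "age": 15, "address": {"city": "Rome"}}
]
'''

records = json.loads(response_text)
# json_normalize flattens nested objects into dotted column names
df_api = pd.json_normalize(records)
print(df_api.columns.tolist())
```

For flat records without nesting, passing the list of dicts directly to pd.DataFrame works as well.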
Describe and summarize the data
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# removing null values to avoid errors
data.dropna(inplace=True)
# percentile list
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
# calling describe method
desc = data.describe(percentiles=perc, include=include)
# display
desc
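The same call can be tried without a network connection. This is a small sketch on a hand-made table; the DataFrame small and its values are invented for illustration.

```python
import pandas as pd

# Small made-up table standing in for a real dataset
small = pd.DataFrame({
    'name': ['tom', 'nick', 'juli', 'anna'],
    'age': [10, 15, 14, 11],
})

# By default only numeric columns are summarized
numeric_summary = small.describe(percentiles=[.20, .80])
# include='all' adds count/unique/top/freq rows for non-numeric columns
full_summary = small.describe(include='all')

print(numeric_summary)
print(full_summary)
```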
Dataframe objects
Pandas offers two primary data structures: the Series and the DataFrame. Whereas a Series represents a one-dimensional labeled array based on the NumPy ndarray, a DataFrame treats tabular (and multi-dimensional) data as a labeled, indexed collection of observations. You can compare a DataFrame to an Excel spreadsheet or a relational database table. If you use R, this will look very familiar, as R also has data frames. You can use DataFrames for organizing data or for exploratory data analysis.
Creating a DataFrame Object
The following code loads the pandas package, reads a CSV file using a tab as the separator, and prints the DataFrame object inside an IDE. If you're running the same code in a Jupyter Notebook, you'll notice that the cells have a neat layout with borders, which is lacking in an IDE. For larger DataFrames, the notebook also shows the total number of rows and columns underneath the table.
>>> import pandas as pd
>>> df = pd.read_csv(r"c:\data\myfile.csv", sep='\t')
>>> print(df)
>>> df.shape  # tuple of (number of rows, number of columns)
>>> df.columns  # column names of the dataset
>>> df.dtypes  # data types of all columns
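Since the file path above is specific to one machine, here is a self-contained sketch of the same steps that reads tab-separated text from memory through io.StringIO. The column names and values in raw are made up for illustration.

```python
import io
import pandas as pd

# Made-up tab-separated content standing in for a file on disk
raw = "name\tage\theight\ntom\t10\t1.40\nnick\t15\t1.72\n"

df_tsv = pd.read_csv(io.StringIO(raw), sep='\t')

print(df_tsv.shape)     # (number of rows, number of columns)
print(df_tsv.columns)   # column labels
print(df_tsv.dtypes)    # data type of each column
```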