INTRODUCTION TO DATA SCIENCE
BY-RAVI SHANKAR
Data handling is the process of gathering, recording, and presenting information in a form that others can readily use, for instance as graphs or charts. It is closely related to statistics, and it is also used to compare data sets and to compute summary measures such as the mean, median, and mode.
1. NumPy — Introduction
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed and offered some additional functionality. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into the Numeric package. Many contributors work on this open-source project.
Operations using NumPy
Using NumPy, a developer can perform the following operations −
- Mathematical and logical operations on arrays.
- Fourier transforms and routines for shape manipulation.
- Operations related to linear algebra. NumPy has in-built functions for linear algebra and random number generation.
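The operations listed above can be sketched as follows. This is a minimal illustration; the array values and variable names are assumptions chosen for the example and do not come from the original notes.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

# Mathematical and logical operations work element by element.
total = a + b              # element-wise sum
mask = a > b               # element-wise comparison, gives booleans

# Shape manipulation and a Fourier transform.
grid = a.reshape(2, 2)     # the same data viewed as a 2x2 array
spectrum = np.fft.fft(a)   # spectrum[0] is the sum of all samples

# Linear algebra and random number generation.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
m_inv = np.linalg.inv(m)   # inverse of a diagonal matrix
rng = np.random.default_rng(seed=0)
sample = rng.random(3)     # three floats in the interval [0, 1)

print(total, mask, grid.shape, spectrum[0], m_inv, sample.shape)
```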
The standard Python distribution does not come bundled with the NumPy module. A lightweight way to get it is to install NumPy using the popular Python package installer, pip.
pip install numpy
NumPy supports a much greater variety of numerical types than Python does. The following table shows different scalar data types defined in NumPy.
Sr.No. Data Type − Description
1. bool_ − Boolean (True or False) stored as a byte
2. int_ − Default integer type (same as C long; normally either int64 or int32)
3. intc − Identical to C int (normally int32 or int64)
4. intp − Integer used for indexing (same as C ssize_t; normally either int32 or int64)
5. int8 − Byte (-128 to 127)
6. int16 − Integer (-32768 to 32767)
7. int32 − Integer (-2147483648 to 2147483647)
8. int64 − Integer (-9223372036854775808 to 9223372036854775807)
9. uint8 − Unsigned integer (0 to 255)
10. uint16 − Unsigned integer (0 to 65535)
11. uint32 − Unsigned integer (0 to 4294967295)
12. uint64 − Unsigned integer (0 to 18446744073709551615)
13. float_ − Shorthand for float64 (this alias was removed in NumPy 2.0; use float64 there)
14. float16 − Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
15. float32 − Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
16. float64 − Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
17. complex_ − Shorthand for complex128 (this alias was removed in NumPy 2.0; use complex128 there)
18. complex64 − Complex number, represented by two 32-bit floats (real and imaginary components)
19. complex128 − Complex number, represented by two 64-bit floats (real and imaginary components)
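The ranges and sizes in the table can be checked from NumPy itself. The sketch below uses np.iinfo and the itemsize attribute of a dtype; the variable names are illustrative.

```python
import numpy as np

# Integer limits reported by NumPy match the ranges in the table.
int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)
uint16_max = np.iinfo(np.uint16).max

# itemsize is the storage size of one element, in bytes.
float32_bytes = np.dtype(np.float32).itemsize
complex128_bytes = np.dtype(np.complex128).itemsize

# Fixed-width integer arrays wrap around on overflow instead of growing.
wrapped = np.array([127], dtype=np.int8) + np.array([1], dtype=np.int8)

print(int8_range, uint16_max, float32_bytes, complex128_bytes, wrapped)
```

The last line shows why the choice of type matters: adding 1 to the largest int8 value gives -128, not 128.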
Data Type Objects (dtype)
A data type object describes how the fixed block of memory corresponding to an array is interpreted, in terms of the following aspects −
- Type of data (integer, float or Python object)
- Size of data
- Byte order (little-endian or big-endian)
- In case of structured type, the names of fields, data type of each field and part of the memory block taken by each field.
- If data type is a subarray, its shape and data type
The byte order is specified by prefixing ‘<’ or ‘>’ to the data type. ‘<’ means that the encoding is little-endian (the least significant byte is stored at the smallest address). ‘>’ means that the encoding is big-endian (the most significant byte is stored at the smallest address).
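To make the difference concrete, the following sketch stores the integer 1 as a 4-byte value in both byte orders and looks at the raw bytes. The variable names are illustrative.

```python
import numpy as np

# The same value, stored with explicit little-endian and big-endian layouts.
little = np.array([1], dtype='<i4').tobytes()
big = np.array([1], dtype='>i4').tobytes()

print(little)  # b'\x01\x00\x00\x00' (least significant byte first)
print(big)     # b'\x00\x00\x00\x01' (most significant byte first)
```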
A dtype object is constructed using the following syntax −
numpy.dtype(object, align, copy)
The parameters are −
- object − The object to be converted to a data type object
- align − If True, adds padding to the fields so that the layout matches that of a similar C struct
- copy − If True, makes a new copy of the dtype object. If False, the result may be a reference to a built-in data type object
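The effect of the align parameter is easiest to see on a structured type. The sketch below is an illustration only: the field names flag and count are hypothetical, and the sizes shown assume a typical platform where a 4-byte integer is aligned to 4 bytes.

```python
import numpy as np

# Hypothetical structured type: a 1-byte field followed by a 4-byte field.
fields = [('flag', 'i1'), ('count', 'i4')]

packed = np.dtype(fields)               # align defaults to False
aligned = np.dtype(fields, align=True)  # pad like a C struct

print(packed.itemsize)               # 5: the fields are stored back to back
print(aligned.itemsize)              # 8: three padding bytes follow 'flag'
print(aligned.fields['count'][1])    # 4: byte offset of 'count'
```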
Example 1
# using array-scalar type
import numpy as np
dt = np.dtype(np.int32)
print(dt)
The output is as follows −
int32
Example 2
#int8, int16, int32, int64 can be replaced by equivalent string 'i1', 'i2','i4', etc.
import numpy as np
dt = np.dtype('i4')
print(dt)
The output is as follows −
int32
Example 3
# using endian notation
import numpy as np
dt = np.dtype('>i4')
print(dt)
The output is as follows −
>i4
The following examples show the use of a structured data type. Here, each field name and its corresponding scalar data type are declared.
Example 4
# first create structured data type
import numpy as np
dt = np.dtype([('age',np.int8)])
print(dt)
The output is as follows −
[('age', 'i1')]
Example 5
# now apply it to ndarray object
import numpy as np
dt = np.dtype([('age',np.int8)])
a = np.array([(10,),(20,),(30,)], dtype = dt)
print(a)
The output is as follows −
[(10,) (20,) (30,)]
Example 6
# field name can be used to access content of age column
import numpy as np
dt = np.dtype([('age',np.int8)])
a = np.array([(10,),(20,),(30,)], dtype = dt)
print(a['age'])
The output is as follows −
[10 20 30]
Example 7
The following example defines a structured data type called student with a string field ‘name’, an integer field ‘age’ and a float field ‘marks’. Example 8 then applies this dtype to an ndarray object.
import numpy as np
student = np.dtype([('name','S20'), ('age', 'i1'), ('marks', 'f4')])
print(student)
The output is as follows −
[('name', 'S20'), ('age', 'i1'), ('marks', '<f4')]
Example 8
import numpy as np
student = np.dtype([('name','S20'), ('age', 'i1'), ('marks', 'f4')])
a = np.array([('abc', 21, 50),('xyz', 18, 75)], dtype = student)
print(a)
The output is as follows (under Python 3, the ‘S20’ field holds byte strings, shown with a b prefix) −
[(b'abc', 21, 50.) (b'xyz', 18, 75.)]
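Once a structured array exists, each field behaves like an ordinary column. The sketch below reuses the student dtype from the examples above; the particular operations (mean, argmin, sort by field) are illustrative choices, not part of the original notes.

```python
import numpy as np

student = np.dtype([('name', 'S20'), ('age', 'i1'), ('marks', 'f4')])
a = np.array([('abc', 21, 50), ('xyz', 18, 75)], dtype=student)

# Select one field across all records.
names = a['name']

# Ordinary array functions work on a single field.
average_marks = float(a['marks'].mean())

# Find the record with the smallest age, then read its name.
youngest = a[np.argmin(a['age'])]['name']

# Sort whole records by one field.
by_age = np.sort(a, order='age')

print(names, average_marks, youngest, by_age['name'])
```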
2. Python Pandas — Introduction
The standard Python distribution does not come bundled with the Pandas module. A lightweight way to get it is to install Pandas using the popular Python package installer, pip.
pip install pandas
Pandas deals with the following three data structures −
- Series
- DataFrame
- Panel (deprecated, and no longer available in recent pandas releases)
2.1 Python Pandas Series
A Pandas Series is a one-dimensional labeled array that is capable of storing various data types. We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor. The row labels of a Series are called the index. A Series cannot contain multiple columns. The constructor takes the following parameters:
- data: It can be any list, dictionary, or scalar value.
- index: The index values must be hashable (they do not have to be unique), and the index must be of the same length as data. If we do not pass any index, the default np.arange(n) will be used.
- dtype: It refers to the data type of series.
- copy: It is used for copying the data.
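The parameters above can be sketched as follows. The values and labels are assumptions chosen for illustration.

```python
import pandas as pd

# data, index and dtype passed explicitly.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], dtype='float64')
value_b = s['b']               # look up a value by its label
labels = list(s.index)

# Without an index, the labels default to 0, 1, ..., n-1.
t = pd.Series([10, 20, 30])
default_labels = list(t.index)

print(value_b, labels, default_labels)
```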
Creating a Series:
We can create a Series in two ways:
- Create an empty Series
- Create a Series using inputs.
Create an Empty Series:
We can easily create an empty series in Pandas which means it will not have any value.
The syntax used for creating an empty Series:
<series object> = pandas.Series()
The below example creates an empty Series object that has no values. The dtype is passed explicitly as float64, because the default dtype of an empty Series differs between pandas versions.
import pandas as pd
x = pd.Series(dtype='float64')
print(x)
Output
Series([], dtype: float64)
Creating a Series using inputs:
We can create Series by using various inputs:
- Array
- Dict
- Scalar value
Creating Series from Array:
Before creating a Series from an array, we first import the numpy module and then use its array() function. If the data is an ndarray, then the passed index must be of the same length.
If we do not pass an index, a default index of range(n) is used, where n is the length of the array, i.e., [0, 1, 2, ..., len(array) - 1].
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
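The other two inputs listed above, a dict and a scalar value, can be sketched in the same way. The keys and values here are illustrative.

```python
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({'x': 1, 'y': 2, 'z': 3})

# From a scalar: the value is repeated once for every index label,
# so an index has to be supplied.
from_scalar = pd.Series(5, index=[0, 1, 2])

print(from_dict)
print(from_scalar)
```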
2.2 Python Pandas DataFrame
Pandas DataFrame is a widely used two-dimensional data structure with labeled axes (rows and columns). It is a standard way to store data that has two different indexes, i.e., a row index and a column index. It has the following properties:
- The columns can be of heterogeneous types such as int, bool, and so on.
- It can be seen as a dictionary of Series structures where both the rows and columns are indexed. The labels are denoted as “columns” in the case of columns and “index” in the case of rows.
Parameters and descriptions:
data: It can take different forms, such as ndarray, Series, map (dict), constants, lists, or arrays.
index: The row labels. The default np.arange(n) index is used if no index is passed.
columns: The column labels. The default np.arange(n) is used if no column labels are passed.
dtype: It refers to the data type of each column.
copy: It is used for copying the data.
Create a DataFrame
We can create a DataFrame using following ways:
- dict
- Lists
- NumPy ndarrays
- Series
Create an empty DataFrame
The below code shows how to create an empty DataFrame in Pandas:
# importing the pandas library
import pandas as pd
df = pd.DataFrame()
print(df)
Output
Empty DataFrame
Columns: []
Index: []
Explanation: In the above code, we first import the pandas library under the alias pd, then define a variable named df that holds an empty DataFrame. Finally, we print it by passing df to print().
Create a DataFrame using List:
We can easily create a DataFrame in Pandas using a list.
# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output
        0
0  Python
1  Pandas
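The remaining inputs named earlier (dict, NumPy ndarray, and Series) follow the same pattern. The sketch below is illustrative; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column labels.
from_dict = pd.DataFrame({'name': ['Python', 'Pandas'], 'rank': [1, 2]})

# From a 2-D ndarray: column labels are passed separately.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

# From a Series: the Series index becomes the row index.
from_series = pd.DataFrame({'score': pd.Series([9.5, 8.0], index=['x', 'y'])})

print(from_dict)
print(from_array)
print(from_series)
```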
3. Python File I/O
Files
Files are named locations on disk to store related information. They are used to permanently store data in a non-volatile memory (e.g. hard disk).
Since Random Access Memory (RAM) is volatile (it loses its data when the computer is turned off), we store data permanently in files so that it can be used in the future.
When we want to read from or write to a file, we need to open it first. When we are done, it needs to be closed so that the resources that are tied with the file are freed.
Hence, in Python, a file operation takes place in the following order:
- Open a file
- Read or write (perform operation)
- Close the file
Opening Files in Python
Python has a built-in open() function to open a file. This function returns a file object, also called a handle, as it is used to read or modify the file accordingly.
>>> f = open("test.txt") # open file in current directory
>>> f = open("C:/Python38/README.txt") # specifying full path
We can specify the mode while opening a file. In mode, we specify whether we want to read 'r', write 'w', or append 'a' to the file. We can also specify if we want to open the file in text mode or binary mode.
The default is reading in text mode. In this mode, we get strings when reading from the file.
On the other hand, binary mode returns bytes and this is the mode to be used when dealing with non-text files like images or executable files.
f = open("test.txt") # equivalent to 'r' or 'rt'
f = open("test.txt",'w') # write in text mode
f = open("img.bmp",'r+b') # read and write in binary mode
Unlike other languages, the character 'a' does not imply the number 97 until it is encoded using ASCII (or other equivalent encodings).
Moreover, the default encoding is platform dependent. On Windows it is typically cp1252, but on Linux it is utf-8.
So, we must not rely on the default encoding, or else our code will behave differently on different platforms.
Hence, when working with files in text mode, it is highly recommended to specify the encoding type.
f = open("test.txt", mode='r', encoding='utf-8')
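The point about encodings can be checked without touching a file. In the sketch below, the accented character is an assumed example chosen because cp1252 and utf-8 store it differently.

```python
# A character only becomes a number once it is encoded to bytes.
letter = "a"
as_ascii = letter.encode("ascii")
print(as_ascii[0])                 # 97

# The same text produces different bytes under different encodings.
accented = "é"
utf8_bytes = accented.encode("utf-8")
cp1252_bytes = accented.encode("cp1252")
print(utf8_bytes)                  # b'\xc3\xa9' (two bytes)
print(cp1252_bytes)                # b'\xe9' (one byte)
```

A file written with one of these encodings and read back with the other would not reproduce the original text, which is why passing encoding explicitly is recommended.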
Closing Files in Python
When we are done with performing operations on the file, we need to properly close the file.
Closing a file will free up the resources that were tied with the file. It is done using the close() method available in Python.
Python has a garbage collector to clean up unreferenced objects but we must not rely on it to close the file.
f = open("test.txt", encoding = 'utf-8')
# perform file operations
f.close()
This method is not entirely safe. If an exception occurs when we are performing some operation with the file, the code exits without closing the file.
A safer way is to use a try…finally block.
f = open("test.txt", encoding = 'utf-8')
try:
    # perform file operations
    pass
finally:
    # runs whether or not an exception was raised above
    f.close()
This way, we are guaranteeing that the file is properly closed even if an exception is raised that causes program flow to stop.
The best way to close a file is by using the with statement. This ensures that the file is closed when the block inside the with statement is exited.
We don’t need to explicitly call the close() method. It is done internally.
with open("test.txt", encoding = 'utf-8') as f:
    # perform file operations
    pass
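A small experiment shows that the with statement closes the file both on normal exit and when an exception is raised. The temporary file path and the simulated error are assumptions made so that the sketch can run anywhere.

```python
import os
import tempfile

# Work in a temporary directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n")
closed_after_with = f.closed        # True: closed on normal exit

try:
    with open(path, encoding="utf-8") as g:
        raise ValueError("simulated failure while using the file")
except ValueError:
    pass
closed_after_error = g.closed       # True: closed even though an error occurred

print(closed_after_with, closed_after_error)
```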
Writing to Files in Python
In order to write into a file in Python, we need to open it in write 'w', append 'a', or exclusive creation 'x' mode.
We need to be careful with the 'w' mode, as it will overwrite the file if it already exists. Due to this, all the previous data are erased.
Writing a string or sequence of bytes (for binary files) is done using the write() method. This method returns the number of characters written to the file (or the number of bytes, in binary mode).
with open("test.txt",'w',encoding = 'utf-8') as f:
    f.write("This is my first file\n")
    f.write("This file\n")
    f.write("contains three lines\n")
This program will create a new file named test.txt in the current directory if it does not exist. If it does exist, it is overwritten.
We must include the newline characters ourselves to distinguish the different lines.
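The three writing modes behave differently when the file already exists. The sketch below illustrates this on a temporary file; the file name and contents are assumptions.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'w' creates the file (or truncates it) and write() returns a count.
with open(path, "w", encoding="utf-8") as f:
    written = f.write("first\n")    # 6 characters

# 'a' keeps the existing content and adds to the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("second\n")

# 'x' refuses to open a file that already exists.
try:
    with open(path, "x", encoding="utf-8") as f:
        created = True
except FileExistsError:
    created = False

with open(path, encoding="utf-8") as f:
    content = f.read()

print(written, created, repr(content))
```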
Reading Files in Python
To read a file in Python, we must open the file in reading 'r' mode.
There are various methods available for this purpose. We can use the read(size) method to read up to size characters (or bytes, in binary mode). If the size parameter is not specified, it reads and returns up to the end of the file.
We can read the test.txt file we wrote in the above section in the following way:
>>> f = open("test.txt",'r',encoding = 'utf-8')
>>> f.read(4)    # read the first 4 characters
'This'
>>> f.read(4)    # read the next 4 characters
' is '
>>> f.read()     # read in the rest till end of file
'my first file\nThis file\ncontains three lines\n'
>>> f.read()     # further reading returns empty string
''
We can see that the read() method returns a newline as '\n'. Once the end of the file is reached, we get an empty string on further reading.
We can change our current file cursor (position) using the seek() method. Similarly, the tell() method returns our current position (in number of bytes). Because the value is a byte offset, it depends on how newlines are stored: the 56 shown here corresponds to a file whose three newlines each take two bytes, as on Windows.
>>> f.tell()    # get the current file position
56
>>> f.seek(0)   # bring file cursor to initial position
0
>>> print(f.read())  # read the entire file
This is my first file
This file
contains three lines
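We can read a file line-by-line using a for loop. This is memory-efficient, since only one line is held in memory at a time. Each of the following examples assumes the cursor is at the start of the file, for example after calling f.seek(0).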
>>> for line in f:
...     print(line, end = '')
...
This is my first file
This file
contains three lines
In this program, the lines in the file itself include a newline character \n. So, we use the end parameter of the print() function to avoid two newlines when printing.
Alternatively, we can use the readline() method to read individual lines of a file. This method reads a file till the newline, including the newline character.
>>> f.readline()
'This is my first file\n'
>>> f.readline()
'This file\n'
>>> f.readline()
'contains three lines\n'
>>> f.readline()
''
Lastly, the readlines() method returns a list of the remaining lines of the file. All these reading methods return empty values when the end of file (EOF) is reached.
>>> f.readlines()
['This is my first file\n', 'This file\n', 'contains three lines\n']
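The reading methods above can be combined into one runnable sketch. It writes the three lines to a temporary file first; the temporary path and the newline="\n" argument (used so the byte count is the same on every platform) are assumptions made for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
lines = ["This is my first file\n", "This file\n", "contains three lines\n"]

# newline="\n" stops Windows from storing each newline as two bytes.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.writelines(lines)

with open(path, "r", encoding="utf-8") as f:
    first_four = f.read(4)       # 'This'
    rest = f.read()              # everything after the first 4 characters
    end_position = f.tell()      # 53: total size of the file in bytes
    f.seek(0)                    # move the cursor back to the start
    first_line = f.readline()    # one line, including its '\n'
    remaining = f.readlines()    # list of the lines that are left
    at_eof = f.read()            # '' because the end was reached

print(first_four, end_position, repr(first_line), remaining, repr(at_eof))
```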
4. Database Connection
Connecting a Python application to our database involves the following steps.
- Import mysql.connector module
- Create the connection object.
- Create the cursor object
- Execute the query
Creating the connection
To create a connection between the MySQL database and the Python application, the connect() method of the mysql.connector module is used.
Pass the database details, such as the host name, username, and database password, in the method call. The method returns the connection object.
The syntax of connect() is given below.
Connection-Object = mysql.connector.connect(host = <host-name>, user = <username>, passwd = <password>)
Example
import mysql.connector
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google")
print(myconn)
Output:
<mysql.connector.connection.MySQLConnection object at 0x7fb142edd780>
Note that we can also specify the database name in the connect() method if we want to connect to a specific database.
Example
import mysql.connector
#Create the connection object
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google", database = "mydb")
#printing the connection object
print(myconn)
Output:
<mysql.connector.connection.MySQLConnection object at 0x7ff64aa3d7b8>
Creating a cursor object
The cursor object is an abstraction specified in the Python DB-API 2.0. It lets us have multiple separate working environments through the same connection to the database. We can create the cursor object by calling the cursor() method of the connection object. The cursor object is what we use to execute queries against the database.
The syntax to create the cursor object is given below.
<my_cur> = conn.cursor()
Example
import mysql.connector
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google", database = "mydb")
print(myconn)
cur = myconn.cursor()
print(cur)
Output:
<mysql.connector.connection.MySQLConnection object at 0x7faa17a15748>
MySQLCursor: (Nothing executed yet)
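The last step in the list, executing a query, can be sketched end to end. Because running mysql.connector requires a live MySQL server, this sketch swaps in Python's built-in sqlite3 module, which follows the same DB-API 2.0 pattern of connection, cursor, and execute. It is a stand-in for illustration, not the MySQL code itself, and the student table and its rows are hypothetical.

```python
import sqlite3

# Step 2: create the connection object (an in-memory database here,
# instead of a host, user and password as with MySQL).
conn = sqlite3.connect(":memory:")

# Step 3: create the cursor object.
cur = conn.cursor()

# Step 4: execute queries through the cursor.
cur.execute("CREATE TABLE student (name TEXT, age INTEGER)")
cur.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("abc", 21), ("xyz", 18)],
)
conn.commit()

cur.execute("SELECT name, age FROM student ORDER BY age")
rows = cur.fetchall()
print(rows)   # [('xyz', 18), ('abc', 21)]

conn.close()
```

With a MySQL connection the flow would be expected to look much the same, although details such as the parameter placeholder style differ between database drivers.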