INTRODUCTION TO DATA SCIENCE

Ravi Shankar
12 min read · Jun 27, 2021


Data handling is the process of gathering, recording, and presenting information in a form that others can readily use (read or fetch), for instance in graphs or charts. It is sometimes also referred to as statistics. It is also used to compare data and to compute summary measures such as the mean, median, and mode.

1. NumPy — Introduction

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed, with some additional functionality. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into Numeric. NumPy is an open source project with many contributors.

Operations using NumPy

Using NumPy, a developer can perform the following operations −

  • Mathematical and logical operations on arrays.
  • Fourier transforms and routines for shape manipulation.
  • Operations related to linear algebra. NumPy has in-built functions for linear algebra and random number generation.
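As a rough illustration of the operations listed above, the following sketch runs one small example of each. The array values and variable names are made up for illustration only.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Mathematical and logical operations work element by element
elementwise_sum = a + b
mask = np.logical_and(a > 1, b < 6)

# Shape manipulation: turn six values into a 2 x 3 array
reshaped = np.arange(6).reshape(2, 3)

# Linear algebra: solve 2x + y = 5 and x + 3y = 10
coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([5.0, 10.0])
solution = np.linalg.solve(coeffs, rhs)

# Fourier transform of a constant signal
spectrum = np.fft.fft(np.ones(4))

# Random number generation with a fixed seed for repeatability
rng = np.random.default_rng(seed=0)
samples = rng.random(3)

print(elementwise_sum)  # [5. 7. 9.]
print(mask)             # [False  True False]
print(solution)
```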

The standard Python distribution does not come bundled with the NumPy module. A lightweight way to get it is to install NumPy with pip, the popular Python package installer.

pip install numpy

NumPy supports a much greater variety of numerical types than Python does. The following table shows different scalar data types defined in NumPy.

Sr. No.  Data type    Description

1        bool_        Boolean (True or False) stored as a byte
2        int_         Default integer type (same as C long; normally either int64 or int32)
3        intc         Identical to C int (normally int32 or int64)
4        intp         Integer used for indexing (same as C ssize_t; normally either int32 or int64)
5        int8         Byte (-128 to 127)
6        int16        Integer (-32768 to 32767)
7        int32        Integer (-2147483648 to 2147483647)
8        int64        Integer (-9223372036854775808 to 9223372036854775807)
9        uint8        Unsigned integer (0 to 255)
10       uint16       Unsigned integer (0 to 65535)
11       uint32       Unsigned integer (0 to 4294967295)
12       uint64       Unsigned integer (0 to 18446744073709551615)
13       float_       Shorthand for float64 (this alias was removed in NumPy 2.0; use float64)
14       float16      Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
15       float32      Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
16       float64      Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
17       complex_     Shorthand for complex128 (this alias was removed in NumPy 2.0; use complex128)
18       complex64    Complex number, represented by two 32-bit floats (real and imaginary components)
19       complex128   Complex number, represented by two 64-bit floats (real and imaginary components)
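The ranges and sizes in the table can be checked from NumPy itself. The sketch below is one way to do that with np.iinfo, np.finfo, and the itemsize attribute of a dtype; the selection of types is arbitrary.

```python
import numpy as np

# Integer limits as reported by NumPy
int8_info = np.iinfo(np.int8)
uint16_info = np.iinfo(np.uint16)
print(int8_info.min, int8_info.max)      # -128 127
print(uint16_info.min, uint16_info.max)  # 0 65535

# Storage size in bytes for a few of the types in the table
sizes = {
    name: np.dtype(name).itemsize
    for name in ("int8", "int64", "float16", "float64", "complex128")
}
print(sizes)

# Number of mantissa bits for the floating-point types
mantissa_bits = {
    "float16": np.finfo(np.float16).nmant,
    "float32": np.finfo(np.float32).nmant,
    "float64": np.finfo(np.float64).nmant,
}
print(mantissa_bits)
```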

Data Type Objects (dtype)

A data type object (dtype) describes how the fixed block of memory corresponding to an array is interpreted. It covers the following aspects −

  • Type of data (integer, float or Python object)
  • Size of data
  • Byte order (little-endian or big-endian)
  • In case of structured type, the names of fields, data type of each field and part of the memory block taken by each field.
  • If data type is a subarray, its shape and data type

The byte order is set by prefixing ‘<’ or ‘>’ to the data type. ‘<’ means that the encoding is little-endian (the least significant byte is stored at the smallest address). ‘>’ means that the encoding is big-endian (the most significant byte is stored at the smallest address).
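To see what the prefix changes in memory, the short sketch below stores the integer 1 as a 4-byte value in both byte orders and prints the raw bytes. The variable names are illustrative.

```python
import numpy as np

value = 1

# Same number, same size, opposite byte order
little = np.array([value], dtype='<i4').tobytes()
big = np.array([value], dtype='>i4').tobytes()

print(little)  # b'\x01\x00\x00\x00' (least significant byte first)
print(big)     # b'\x00\x00\x00\x01' (most significant byte first)
```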

A dtype object is constructed using the following syntax −

numpy.dtype(object, align, copy)

The parameters are −

  • Object − To be converted to data type object
  • Align − If true, adds padding to the field to make it similar to C-struct
  • Copy − Makes a new copy of dtype object. If false, the result is reference to builtin data type object
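The effect of the align parameter is easiest to see on a structured type. In the sketch below, the field names flag and count are made up; the padded size is what typical platforms report and may differ elsewhere.

```python
import numpy as np

fields = [('flag', 'i1'), ('count', 'i4')]

packed = np.dtype(fields)               # align defaults to False: no padding
aligned = np.dtype(fields, align=True)  # padding added, as a C compiler would

print(packed.itemsize)          # 5 (1 byte + 4 bytes)
print(aligned.itemsize)         # typically 8, because count is aligned to 4 bytes
print(aligned.isalignedstruct)  # True
```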

Example 1

# using array-scalar type 
import numpy as np
dt = np.dtype(np.int32)
print(dt)

The output is as follows −

int32

Example 2

#int8, int16, int32, int64 can be replaced by equivalent string 'i1', 'i2','i4', etc. 
import numpy as np
dt = np.dtype('i4')
print(dt)

The output is as follows −

int32

Example 3

# using endian notation 
import numpy as np
dt = np.dtype('>i4')
print(dt)

The output is as follows −

>i4

The following examples show the use of a structured data type. Here, the field name and the corresponding scalar data type are declared for each field.

Example 4

# first create structured data type 
import numpy as np
dt = np.dtype([('age',np.int8)])
print(dt)

The output is as follows −

[('age', 'i1')]

Example 5

# now apply it to ndarray object 
import numpy as np
dt = np.dtype([('age',np.int8)])
a = np.array([(10,),(20,),(30,)], dtype = dt)
print(a)

The output is as follows −

[(10,) (20,) (30,)]

Example 6

# field name can be used to access content of age column
import numpy as np
dt = np.dtype([('age',np.int8)])
a = np.array([(10,),(20,),(30,)], dtype = dt)
print(a['age'])

The output is as follows −

[10 20 30]

Example 7

The following examples define a structured data type called student with a string field ‘name’, an integer field ‘age’ and a float field ‘marks’. This dtype is applied to ndarray object.

import numpy as np 
student = np.dtype([('name','S20'), ('age', 'i1'), ('marks', 'f4')])
print(student)

The output is as follows −

[('name', 'S20'), ('age', 'i1'), ('marks', '<f4')]

Example 8

import numpy as np
student = np.dtype([('name','S20'), ('age', 'i1'), ('marks', 'f4')])
a = np.array([('abc', 21, 50),('xyz', 18, 75)], dtype = student)
print(a)

The output is as follows (in Python 3 the 'S20' field holds bytes, so the names are shown with a b prefix) −

[(b'abc', 21, 50.) (b'xyz', 18, 75.)]
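Once a structured array exists, its fields can be used like columns. The sketch below filters and summarizes the student records; it uses 'U20' (a Unicode string field) instead of 'S20' so that names come back as ordinary Python strings, which is a substitution made here for readability.

```python
import numpy as np

student = np.dtype([('name', 'U20'), ('age', 'i1'), ('marks', 'f4')])
a = np.array([('abc', 21, 50), ('xyz', 18, 75)], dtype=student)

# Boolean mask on one field selects whole records
older_than_20 = a[a['age'] > 20]

# A single field behaves like a normal one-dimensional array
mean_marks = a['marks'].mean()

print(older_than_20['name'])  # ['abc']
print(mean_marks)             # 62.5
```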

2. Python Pandas — Introduction

The standard Python distribution does not come bundled with the Pandas module. A lightweight way to get it is to install Pandas with pip, the popular Python package installer.

pip install pandas

Pandas deals with the following three data structures −

  • Series
  • DataFrame
  • Panel (deprecated and later removed; it is not available in current Pandas releases)

2.1 Python Pandas Series

The Pandas Series can be defined as a one-dimensional array that is capable of storing various data types. We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor. The row labels of a Series are called the index. A Series cannot contain multiple columns. It has the following parameters:

  • data: It can be any list, dictionary, or scalar value.
  • index: The index labels must be hashable (they do not have to be unique). The index must be of the same length as data. If we do not pass any index, the default np.arange(n) is used.
  • dtype: It refers to the data type of the Series.
  • copy: It controls whether the input data is copied.
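A minimal sketch of these parameters in use follows. The labels and values are arbitrary; the point is only to show data, index, and dtype being passed together.

```python
import pandas as pd

# data is a list, index supplies one label per value, dtype forces floats
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype='float64')

print(s['b'])         # 2.0
print(s.dtype)        # float64
print(list(s.index))  # ['a', 'b', 'c']
```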

Creating a Series:

We can create a Series in two ways:

  1. Create an empty Series
  2. Create a Series using inputs.

Create an Empty Series:

We can easily create an empty series in Pandas which means it will not have any value.

The syntax that is used for creating an Empty Series:

<series object> = pandas.Series()

The example below creates an empty Series object that has no values. The dtype is passed explicitly as float64, because the default dtype of an empty Series differs between Pandas versions.

import pandas as pd
x = pd.Series(dtype="float64")
print(x)

Output

Series([], dtype: float64)

Creating a Series using inputs:

We can create Series by using various inputs:

  • Array
  • Dict
  • Scalar value

Creating Series from Array:

Before creating a Series from an array, we import the numpy module and build the data with its array() function. If the data is an ndarray, any index we pass must have the same length as the array.

If we do not pass an index, a default index of range(n) is used, where n is the length of the array, i.e., [0, 1, 2, ..., len(array) - 1].

import pandas as pd
import numpy as np
info = np.array(['P', 'a', 'n', 'd', 'a', 's'])
a = pd.Series(info)
print(a)

Output

0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
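The other two inputs named earlier, a dict and a scalar value, work in a similar way. The following sketch shows both; the keys and values are invented for illustration.

```python
import pandas as pd

# From a dict: the keys become the index labels
from_dict = pd.Series({'x': 0.0, 'y': 1.0, 'z': 2.0})

# From a scalar: an index is required, and the value is repeated for each label
from_scalar = pd.Series(4, index=[0, 1, 2, 3])

print(list(from_dict.index))  # ['x', 'y', 'z']
print(from_dict['y'])         # 1.0
print(from_scalar.tolist())   # [4, 4, 4, 4]
```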

2.2 Python Pandas DataFrame

Pandas DataFrame is a widely used data structure that stores data in a two-dimensional array with labeled axes (rows and columns). A DataFrame is a standard way to store data that has two different indexes, i.e., a row index and a column index. It has the following properties:

  • The columns can be of heterogeneous types like int, bool, and so on.
  • It can be seen as a dictionary of Series structures in which both the rows and the columns are indexed. The labels are called “columns” for the columns and “index” for the rows.

Parameter & Description:

data: It can take different forms such as an ndarray, Series, map (dict), constants, or lists.

index: The row labels. If no index is passed, the default np.arange(n) is used.

columns: The column labels. If no column labels are passed, the default np.arange(n) is used.

dtype: It refers to the data type of each column.

copy: It controls whether the input data is copied.
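The parameters above can be combined in one constructor call. The sketch below is an illustration with made-up row and column labels.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],      # data: a list of rows
    index=['r1', 'r2'],    # row labels
    columns=['a', 'b'],    # column labels
    dtype='float64',       # one dtype applied to every column
)

print(df.loc['r2', 'b'])  # 4.0
print(list(df.columns))   # ['a', 'b']
print(df.shape)           # (2, 2)
```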

Create a DataFrame

We can create a DataFrame from the following inputs:

  • dict
  • Lists
  • NumPy ndarrays
  • Series

Create an empty DataFrame

The below code shows how to create an empty DataFrame in Pandas:

# importing the pandas library
import pandas as pd
df = pd.DataFrame()
print(df)

Output

Empty DataFrame
Columns: []
Index: []

Explanation: In the above code, we first import the pandas library with the alias pd and then define a variable named df that holds an empty DataFrame. Finally, we print it by passing df to print().

Create a DataFrame using List:

We can easily create a DataFrame in Pandas using list.

# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)

Output

        0
0  Python
1  Pandas
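The remaining inputs from the list above (dict, NumPy ndarray, and Series) can be passed to the same constructor. The sketch below shows one small case of each; all column names and values are invented.

```python
import numpy as np
import pandas as pd

# From a dict: keys become column names
df_dict = pd.DataFrame({'name': ['Python', 'Pandas'], 'rank': [1, 2]})

# From a NumPy ndarray: column labels are supplied separately
df_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# From a Series: the Series index becomes the row index
df_series = pd.DataFrame({'score': pd.Series([90, 80], index=['x', 'y'])})

print(df_dict.shape)          # (2, 2)
print(df_array.loc[1, 'c'])   # 5
print(list(df_series.index))  # ['x', 'y']
```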

3. Python File I/O

Files

Files are named locations on disk to store related information. They are used to permanently store data in a non-volatile memory (e.g. hard disk).

Since Random Access Memory (RAM) is volatile (it loses its data when the computer is turned off), we use files to store data permanently for future use.

When we want to read from or write to a file, we need to open it first. When we are done, it needs to be closed so that the resources that are tied with the file are freed.

Hence, in Python, a file operation takes place in the following order:

  1. Open a file
  2. Read or write (perform operation)
  3. Close the file

Opening Files in Python

Python has a built-in open() function to open a file. This function returns a file object, also called a handle, as it is used to read or modify the file accordingly.

>>> f = open("test.txt")    # open file in current directory
>>> f = open("C:/Python38/README.txt") # specifying full path

We can specify the mode while opening a file. In mode, we specify whether we want to read r, write w or append a to the file. We can also specify if we want to open the file in text mode or binary mode.

The default is reading in text mode. In this mode, we get strings when reading from the file.

On the other hand, binary mode returns bytes and this is the mode to be used when dealing with non-text files like images or executable files.

f = open("test.txt")      # equivalent to 'r' or 'rt'
f = open("test.txt",'w') # write in text mode
f = open("img.bmp",'r+b') # read and write in binary mode

Unlike other languages, the character a does not imply the number 97 until it is encoded using ASCII (or other equivalent encodings).

Moreover, the default encoding is platform dependent. On Windows it is typically cp1252, while on Linux it is utf-8.

So, we must not rely on the default encoding, or else our code will behave differently on different platforms.

Hence, when working with files in text mode, it is highly recommended to specify the encoding type.

f = open("test.txt", mode='r', encoding='utf-8')
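The difference between text mode with an explicit encoding and binary mode can be seen by writing one non-ASCII character and reading it back both ways. This sketch writes to a temporary directory; the file name demo.txt is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write text with an explicit encoding ("caf\u00e9" is "cafe" with an accented e)
with open(path, mode='w', encoding='utf-8') as f:
    f.write("caf\u00e9")

# Text mode returns a str, decoded with the encoding we name
with open(path, mode='r', encoding='utf-8') as f:
    text = f.read()

# Binary mode returns the raw bytes stored on disk
with open(path, mode='rb') as f:
    raw = f.read()

print(len(text))  # 4 characters
print(len(raw))   # 5 bytes, because the accented character takes two bytes in UTF-8
```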

Closing Files in Python

When we are done with performing operations on the file, we need to properly close the file.

Closing a file will free up the resources that were tied with the file. It is done using the close() method available in Python.

Python has a garbage collector to clean up unreferenced objects but we must not rely on it to close the file.

f = open("test.txt", encoding = 'utf-8')
# perform file operations
f.close()

This method is not entirely safe. If an exception occurs when we are performing some operation with the file, the code exits without closing the file.

A safer way is to use a try…finally block.

try:
    f = open("test.txt", encoding = 'utf-8')
    # perform file operations
finally:
    f.close()

This way, we are guaranteeing that the file is properly closed even if an exception is raised that causes program flow to stop.

The best way to close a file is by using the with statement. This ensures that the file is closed when the block inside the with statement is exited.

We don’t need to explicitly call the close() method. It is done internally.

with open("test.txt", encoding = 'utf-8') as f:
    # perform file operations
    pass
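The claim that the with statement closes the file, even when an exception is raised, can be checked through the closed attribute of the file object. The sketch below uses a temporary file and a deliberately raised ValueError as a stand-in for a real failure.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, 'w', encoding='utf-8') as f:
    f.write("hello\n")
    open_inside = not f.closed  # still open inside the block
closed_after = f.closed         # closed once the block is exited

# The file is also closed when the block is left because of an exception
try:
    with open(path, encoding='utf-8') as g:
        raise ValueError("simulated failure")
except ValueError:
    pass
closed_after_error = g.closed

print(open_inside, closed_after, closed_after_error)  # True True True
```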

Writing to Files in Python

In order to write into a file in Python, we need to open it in write w, append a or exclusive creation x mode.

We need to be careful with the w mode, as it overwrites the file if it already exists, erasing all the previous data.

Writing a string or sequence of bytes (for binary files) is done using the write() method. This method returns the number of characters written to the file.

with open("test.txt",'w',encoding = 'utf-8') as f:
    f.write("This is my first file\n")
    f.write("This file\n")
    f.write("contains three lines\n")

This program will create a new file named test.txt in the current directory if it does not exist. If it does exist, it is overwritten.

We must include the newline characters ourselves to distinguish the different lines.
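The three write-capable modes behave differently when the file already exists. The sketch below, which uses a temporary file with an arbitrary name, writes with w, adds a line with a, and shows that x refuses to touch an existing file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'w' creates the file (or truncates it) and write() returns the character count
with open(path, 'w', encoding='utf-8') as f:
    written = f.write("first line\n")

# 'a' keeps the existing content and adds to the end
with open(path, 'a', encoding='utf-8') as f:
    f.write("second line\n")

# 'x' raises FileExistsError because the file is already there
try:
    with open(path, 'x', encoding='utf-8') as f:
        f.write("never written\n")
    created = True
except FileExistsError:
    created = False

with open(path, encoding='utf-8') as f:
    content = f.read()

print(written)  # 11
print(created)  # False
print(repr(content))
```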

Reading Files in Python

To read a file in Python, we must open the file in reading r mode.

There are various methods available for this purpose. We can use the read(size) method to read up to size characters (or bytes, in binary mode). If the size parameter is not specified, it reads and returns everything up to the end of the file.

We can read the test.txt file we wrote in the above section in the following way:

>>> f = open("test.txt",'r',encoding = 'utf-8')
>>> f.read(4) # read the first 4 characters
'This'
>>> f.read(4) # read the next 4 characters
' is '
>>> f.read() # read in the rest till end of file
'my first file\nThis file\ncontains three lines\n'
>>> f.read() # further reading returns empty string
''

We can see that the read() method returns a newline as '\n'. Once the end of the file is reached, we get an empty string on further reading.

We can change our current file cursor (position) using the seek() method. Similarly, the tell() method returns our current position (in number of bytes).

>>> f.tell()    # get the current file position
56
>>> f.seek(0) # bring file cursor to initial position
0
>>> print(f.read()) # read the entire file
This is my first file
This file
contains three lines
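The same cursor movements can be reproduced in a script. The sketch below writes the three-line file to a temporary location with newline='' so that line endings are stored as a single byte on every platform, which is an assumption made here to keep the positions predictable.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write("This is my first file\nThis file\ncontains three lines\n")

with open(path, 'r', encoding='utf-8', newline='') as f:
    first = f.read(4)        # cursor moves forward by what was read
    pos_after_4 = f.tell()
    f.read()                 # read the rest of the file
    end_pos = f.tell()       # cursor is now at the end
    f.seek(0)                # back to the start
    again = f.read(4)

print(first, pos_after_4)  # This 4
print(end_pos == os.path.getsize(path))
print(again)               # This
```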

We can read a file line-by-line using a for loop. This is both efficient and fast.

>>> f.seek(0) # return to the start before iterating
0
>>> for line in f:
...     print(line, end = '')
...
This is my first file
This file
contains three lines

In this program, the lines in the file itself include a newline character \n. So, we use the end parameter of the print() function to avoid two newlines when printing.

Alternatively, we can use the readline() method to read individual lines of a file. This method reads a file till the newline, including the newline character.

>>> f.readline()
'This is my first file\n'
>>> f.readline()
'This file\n'
>>> f.readline()
'contains three lines\n'
>>> f.readline()
''

Lastly, the readlines() method returns a list of the remaining lines of the file. All these reading methods return empty values when the end of file (EOF) is reached.

>>> f.readlines()
['This is my first file\n', 'This file\n', 'contains three lines\n']
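The three line-oriented approaches can be compared side by side in one script. The sketch below recreates the three-line file in a temporary directory and reopens it for each approach so that every read starts from the beginning.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, 'w', encoding='utf-8') as f:
    f.write("This is my first file\nThis file\ncontains three lines\n")

# Iterating over the file object yields one line at a time
with open(path, encoding='utf-8') as f:
    looped = []
    for line in f:
        looped.append(line.rstrip('\n'))

# readline() returns one line; readlines() returns whatever is left
with open(path, encoding='utf-8') as f:
    first = f.readline()
    remaining = f.readlines()
    at_eof = f.readline()    # empty string once the end is reached

print(looped)
print(repr(first))
print(remaining)
print(repr(at_eof))
```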

4. Database Connection

The following steps connect a Python application to a MySQL database.

  1. Import mysql.connector module
  2. Create the connection object.
  3. Create the cursor object
  4. Execute the query

Creating the connection

To create a connection between the MySQL database and the Python application, the connect() method of the mysql.connector module is used.

Pass the database details, such as the host name, username, and database password, in the method call. The method returns the connection object.

The syntax to use the connect() is given below.

Connection-Object = mysql.connector.connect(host = <host-name>, user = <username>, passwd = <password>)

Example

import mysql.connector
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google")
print(myconn)

Output:

<mysql.connector.connection.MySQLConnection object at 0x7fb142edd780>

Note that we can specify the database name in the connect() method if we want to connect to a specific database.

Example

import mysql.connector
# Create the connection object
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google", database = "mydb")
# printing the connection object
print(myconn)

Output:

<mysql.connector.connection.MySQLConnection object at 0x7ff64aa3d7b8>

Creating a cursor object

The cursor object is an abstraction specified in the Python DB-API 2.0. It lets us have multiple separate working environments through the same connection to the database. We create the cursor object by calling the cursor() method of the connection object. The cursor object is essential for executing queries against the database.

The syntax to create the cursor object is given below.

<my_cur> = conn.cursor()

Example

import mysql.connector
myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google", database = "mydb")
print(myconn)
cur = myconn.cursor()
print(cur)

Output:

<mysql.connector.connection.MySQLConnection object at 0x7faa17a15748> 
MySQLCursor: (Nothing executed yet)
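The last step, executing a query, follows the same DB-API 2.0 pattern. The sketch below walks through all four steps using the sqlite3 module from the standard library as a stand-in, so that it runs without a MySQL server; it is not the mysql.connector API itself. The employee table and its rows are hypothetical, and sqlite3 uses ? as the parameter placeholder where mysql.connector uses %s.

```python
import sqlite3  # stand-in for mysql.connector; both follow DB-API 2.0

# Steps 1 and 2: import the module and create the connection object
conn = sqlite3.connect(":memory:")
try:
    # Step 3: create the cursor object
    cur = conn.cursor()

    # Step 4: execute queries through the cursor
    cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
    cur.executemany(
        "INSERT INTO employee (name) VALUES (?)",
        [("Asha",), ("Ravi",)],
    )
    conn.commit()

    cur.execute("SELECT id, name FROM employee ORDER BY id")
    rows = cur.fetchall()
finally:
    # Close the connection even if a query fails
    conn.close()

print(rows)  # [(1, 'Asha'), (2, 'Ravi')]
```

With a MySQL server available, the same cursor calls (execute, executemany, fetchall) and the commit and close calls on the connection apply to the myconn and cur objects created above.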
