INTRODUCTION TO PYTHON
BY- RAVI SHANKAR
1. Indexing and Slicing
Strings in Python support indexing and slicing. To extract a single character from a string, follow the string with the index of the desired character surrounded by square brackets ([ ]), remembering that the first character of a string has index zero.
>>> what = 'This parrot is dead'
>>> what[3]
's'
>>> what[0]
'T'
If the subscript you provide between the brackets is less than zero, Python counts from the end of the string, with a subscript of -1 representing the last character in the string.
>>> what[-1]
'd'
To extract a contiguous piece of a string (known as a slice), use a subscript consisting of the starting position followed by a colon (:), followed by one more than the ending position of the slice you want to extract. Notice that the slicing stops immediately before the second value:
>>> what[0:4]
'This'
>>> what[5:11]
'parrot'
One way to think about the indexes in a slice is that you give the starting position as the value before the colon, and the starting position plus the number of characters in the slice after the colon.
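The start-plus-length reading can be checked directly. This is only a small sketch; the names start and length are introduced here for illustration and do not appear elsewhere in these notes.

```python
what = 'This parrot is dead'

start = 5    # position of the 'p' in 'parrot'
length = 6   # number of characters wanted

# The value after the colon is the start plus the slice length.
piece = what[start:start + length]
print(piece)                 # parrot
print(len(piece) == length)  # True
```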
For the special case when a slice starts at the beginning of a string, or continues until the end, you can omit the first or second index, respectively. So to extract all but the first character of a string, you can use a subscript of 1: (a one followed by a colon).
>>> what[1:]
'his parrot is dead'
To extract the first 3 characters of a string you can use :3 .
>>> what[:3]
'Thi'
If you use a value for a slice index which is larger than the length of the string, Python does not raise an exception, but treats the index as if it were the length of the string.
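A short sketch of this behaviour, using the same string as above. The contrast with single-character indexing is added here for illustration: an out-of-range index without a colon does raise IndexError.

```python
what = 'This parrot is dead'   # 19 characters

# A slice index past the end is treated as len(what).
tail = what[15:100]
print(tail)                # dead
print(repr(what[50:60]))   # ''  (both indexes are past the end)

# Indexing a single character is stricter and does raise.
try:
    what[100]
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```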
As always, variables and integer constants can be freely mixed:
>>> start = 3
>>> finish = 8
>>> what[start:finish]
's par'
>>> what[5:finish]
'par'
Using a second index that resolves to a position less than or equal to the first will result in an empty string. If either index is not an integer, a TypeError exception is raised unless, of course, that index was omitted.
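Both cases can be tried out as follows. This is a minimal sketch; the variable caught exists only so the result can be printed.

```python
what = 'This parrot is dead'

print(repr(what[8:3]))   # ''  (second index before the first)
print(repr(what[4:4]))   # ''  (indexes equal)

# A non-integer index raises TypeError...
try:
    what[1.5:4]
except TypeError as err:
    caught = type(err).__name__
    print(caught)        # TypeError

# ...but an omitted index is fine.
print(what[:4])          # This
```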
2. Python Exception Handling (try..except..finally)
Exceptions in Python
Python has many built-in exceptions that are raised when your program encounters an error (something in the program goes wrong).
When these exceptions occur, the Python interpreter stops the current process and passes it to the calling process until it is handled. If not handled, the program will crash.
For example, let us consider a program where we have a function A that calls function B, which in turn calls function C. If an exception occurs in function C but is not handled in C, the exception passes to B and then to A.
If never handled, an error message is displayed and our program comes to a sudden unexpected halt.
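The call chain described above might look like the following sketch. The function names function_a, function_b and function_c are hypothetical stand-ins for A, B and C, and the division by zero is just a convenient way to trigger an exception.

```python
def function_c():
    # The error happens here and is not handled here.
    return 1 / 0

def function_b():
    # No handler either, so the exception passes straight through.
    return function_c()

def function_a():
    try:
        return function_b()
    except ZeroDivisionError as err:
        # The exception raised in function_c is finally handled here.
        return "handled in A: " + str(err)

result = function_a()
print(result)
```

If the try statement were removed from function_a as well, the exception would reach the top level and the program would stop with a traceback.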
Catching Exceptions in Python
In Python, exceptions can be handled using a try statement.
The critical operation which can raise an exception is placed inside the try clause. The code that handles the exceptions is written in the except clause.
We can thus choose what operations to perform once we have caught the exception. Here is a simple example.
# import module sys to get the type of exception
import sys

randomList = ['a', 0, 2]

for entry in randomList:
    try:
        print("The entry is", entry)
        r = 1/int(entry)
        break
    except:
        print("Oops!", sys.exc_info()[0], "occurred.")
        print("Next entry.")
        print()
print("The reciprocal of", entry, "is", r)
Output
The entry is a
Oops! <class 'ValueError'> occurred.
Next entry.

The entry is 0
Oops! <class 'ZeroDivisionError'> occurred.
Next entry.

The entry is 2
The reciprocal of 2 is 0.5
In this program, we loop through the values of the randomList list. As previously mentioned, the portion that can cause an exception is placed inside the try block.
If no exception occurs, the except block is skipped and normal flow continues (for the last value). But if any exception occurs, it is caught by the except block (first and second values).
Here, we print the name of the exception using the exc_info() function inside the sys module. We can see that 'a' causes ValueError and 0 causes ZeroDivisionError.
Since the built-in exceptions that signal errors all inherit from the base Exception class (only system-exiting ones such as KeyboardInterrupt and SystemExit do not), we can also perform the above task in the following way:
# import module sys to get the type of exception
import sys

randomList = ['a', 0, 2]

for entry in randomList:
    try:
        print("The entry is", entry)
        r = 1/int(entry)
        break
    except Exception as e:
        print("Oops!", e.__class__, "occurred.")
        print("Next entry.")
        print()
print("The reciprocal of", entry, "is", r)
This program has the same output as the above program.
Catching Specific Exceptions in Python
In the above example, we did not mention any specific exception in the except clause.
This is not a good programming practice as it will catch all exceptions and handle every case in the same way. We can specify which exceptions an except clause should catch.
A try clause can have any number of except clauses to handle different exceptions; however, only one will be executed in case an exception occurs.
We can use a tuple of values to specify multiple exceptions in an except clause. Here is an example pseudo code.
try:
    # do something
    pass
except ValueError:
    # handle ValueError exception
    pass
except (TypeError, ZeroDivisionError):
    # handle multiple exceptions
    # TypeError and ZeroDivisionError
    pass
except:
    # handle all other exceptions
    pass
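A runnable version of the same layout could look like this. It is only a sketch: the helper name describe and the messages it returns are made up for illustration.

```python
def describe(value):
    """Return a short message saying how 1/int(value) turned out."""
    try:
        return "reciprocal is " + str(1 / int(value))
    except ValueError:
        # int('a') ends up here.
        return "ValueError: not a number"
    except (TypeError, ZeroDivisionError) as err:
        # One clause handles both types; err tells us which one occurred.
        return type(err).__name__ + ": cannot invert"

for item in ['a', 0, None, 2]:
    print(repr(item), "->", describe(item))
```

Only one except clause runs per exception: 'a' is handled by the first clause, while 0 and None are both handled by the second.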
Raising Exceptions in Python
In Python programming, exceptions are raised when errors occur at runtime. We can also manually raise exceptions using the raise keyword.
We can optionally pass values to the exception to clarify why that exception was raised.
>>> raise KeyboardInterrupt
Traceback (most recent call last):
...
KeyboardInterrupt

>>> raise MemoryError("This is an argument")
Traceback (most recent call last):
...
MemoryError: This is an argument

>>> try:
...     a = int(input("Enter a positive integer: "))
...     if a <= 0:
...         raise ValueError("That is not a positive number!")
... except ValueError as ve:
...     print(ve)
...
Enter a positive integer: -2
That is not a positive number!
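The same check can be wrapped in a function so that it runs without keyboard input. This is a sketch under that assumption; require_positive is a hypothetical name.

```python
def require_positive(number):
    """Return number unchanged, or raise ValueError if it is not positive."""
    if number <= 0:
        raise ValueError("That is not a positive number!")
    return number

try:
    require_positive(-2)
except ValueError as ve:
    message = str(ve)
    print(message)   # That is not a positive number!
```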
Python try with else clause
In some situations, you might want to run a certain block of code if the code block inside try ran without any errors. For these cases, you can use the optional else keyword with the try statement.
Note: Exceptions in the else clause are not handled by the preceding except clauses.
Let’s look at an example:
# program to print the reciprocal of even numbers

try:
    num = int(input("Enter a number: "))
    assert num % 2 == 0
except:
    print("Not an even number!")
else:
    reciprocal = 1/num
    print(reciprocal)
Output
If we pass an odd number:
Enter a number: 1
Not an even number!
If we pass an even number, the reciprocal is computed and displayed.
Enter a number: 4
0.25
However, if we pass 0, we get ZeroDivisionError as the code block inside else is not handled by the preceding except.
Enter a number: 0
Traceback (most recent call last):
  File "<string>", line 7, in <module>
    reciprocal = 1/num
ZeroDivisionError: division by zero
Python try…finally
The try statement in Python can have an optional finally clause. This clause is executed no matter what, and is generally used to release external resources.
For example, we may be connected to a remote data center through the network or working with a file or a Graphical User Interface (GUI).
In all these circumstances, we must clean up the resource before the program comes to a halt, whether it ran successfully or not. These actions (closing a file or GUI, or disconnecting from the network) are performed in the finally clause to guarantee their execution.
Here is an example of file operations to illustrate this.
try:
    f = open("test.txt", encoding='utf-8')
    # perform file operations
finally:
    f.close()
This type of construct makes sure that the file is closed even if an exception occurs during the program execution.
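To see that finally runs in both the successful and the failing case, here is a small sketch that needs no file on disk. The events list and the divide function are hypothetical and stand in for a real resource and its cleanup.

```python
events = []

def divide(a, b):
    try:
        return a / b
    finally:
        # Runs whether the division succeeded or raised.
        events.append("cleanup")

divide(4, 2)
try:
    divide(1, 0)
except ZeroDivisionError:
    events.append("error seen by caller")

print(events)   # ['cleanup', 'cleanup', 'error seen by caller']
```

For files specifically, a with statement (with open(...) as f:) closes the file automatically, and it avoids referring to f in the finally clause when open() itself fails.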
3. Python Regular Expressions
Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python “re” module provides regular expression support.
In Python a regular expression search is typically written as:
match = re.search(pat, str)
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example, which searches for the pattern ‘word:’ followed by a 3 letter word (details below):
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')
The code match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.
The ‘r’ at the start of the pattern string designates a Python “raw” string, which passes backslashes through unchanged. This is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the ‘r’ just as a habit.
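The effect of the ‘r’ prefix can be seen in a short sketch. The sample sentence is made up for illustration; \b is used because it means backspace to Python but word boundary to the re module.

```python
import re

plain = '\n'    # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n
print(len(plain), len(raw))   # 1 2

# Without the r prefix, Python turns '\b' into a backspace character,
# so the regex engine never sees a word-boundary code.
no_prefix = re.search('\bcat\b', 'a cat sat')
with_prefix = re.search(r'\bcat\b', 'a cat sat')
print(no_prefix)              # None
print(with_prefix.group())    # cat
```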
Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:
- a, X, 9, < — ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
- . (a period) — matches any single character except newline ‘\n’
- \w — (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
- \b — boundary between word and non-word
- \s — (lowercase s) matches a single whitespace character — space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r — tab, newline, return
- \d — decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
- ^ = start, $ = end — match the start or end of the string
- \ — inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as ‘@’, you can put a backslash in front of it, \@, to make sure it is treated just as a character.
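A few of the patterns listed above can be tried out as follows. This is a sketch only; the sample strings and variable names are invented for illustration.

```python
import re

sample = '3x10 or 3.10'

# . matches any character, so it also accepts the 'x'.
any_char = re.search(r'\d.\d\d', sample).group()      # '3x10'
# \. matches only a literal period.
literal = re.search(r'\d\.\d\d', sample).group()      # '3.10'

text = 'version 3.10 released'
# \b requires 're' to sit at a word boundary (the start of 'released').
at_boundary = re.search(r'\bre\w', text).group()      # 'rel'
# ^ and $ anchor the match to the start and end of the string.
first_three = re.search(r'^\w\w\w', text).group()     # 'ver'
last_three = re.search(r'\w\w\w$', text).group()      # 'sed'

print(any_char, literal, at_boundary, first_three, last_three)
```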
Basic Examples
Joke: what do you call a pig with three eyes? piiig!
The basic rules of regular expression search for a pattern within a string are:
- The search proceeds through the string from start to end, stopping at the first match found
- All of the pattern must be matched, but not all of the string
- If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
match = re.search(r'igs', 'piiig') # not found, match == None

## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"
Repetition
Things get more interesting when you use + and * to specify repetition in the pattern
- + — 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
- * — 0 or more occurrences of the pattern to its left
- ? — match 0 or 1 occurrences of the pattern to its left
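The difference between the three repetition characters shows up in a small sketch. The word list and the helper name spelling are hypothetical and chosen only for illustration.

```python
import re

def spelling(word):
    """Return the text matched by 'colou?r' in word, or None."""
    match = re.search(r'colou?r', word)
    return match.group() if match else None

# ? makes the 'u' optional, so both spellings match.
for word in ['color', 'colour', 'colr']:
    print(word, '->', spelling(word))

# * allows zero occurrences of 'i', + requires at least one.
star = re.search(r'pi*g', 'pg')
plus = re.search(r'pi+g', 'pg')
print(star.group())   # pg
print(plus)           # None
```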
Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible — i.e. + and * go as far as possible (the + and * are said to be “greedy”).
Repetition Examples
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx') # found, match.group() == "1 2 3"
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx') # found, match.group() == "12 3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
Emails Example
Suppose you want to find the email address inside the string ‘purple alice-b@google.com monkey dishwasher’. We’ll use this as a running example to demonstrate more regular expression features. Here’s an attempt using the pattern r’\w+@\w+’:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'
The search does not get the whole email address in this case because the \w does not match the ‘-’ or ‘.’ in the address. We’ll fix this using the regular expression features below.
Square Brackets
Square brackets can be used to indicate a set of chars, so [abc] matches ‘a’ or ‘b’ or ‘c’. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add ‘.’ and ‘-’ to the set of chars which can appear around the @ with the pattern r’[\w.-]+@[\w.-]+’ to get the whole email address:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'
(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except ‘a’ or ‘b’.
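The range, trailing-dash and inverted forms can each be checked with re.search(). This is a sketch with invented sample strings.

```python
import re

# [a-z] is a range: lowercase letters only, so the capital H is skipped.
lower = re.search(r'[a-z]+', 'Hello World').group()      # 'ello'

# A dash placed last is a literal dash, not a range.
dashed = re.search(r'[abc-]+', 'x a-b-c y').group()      # 'a-b-c'

# ^ at the start inverts the set: any chars except 'a' or 'b'.
inverted = re.search(r'[^ab]+', 'abracadabra').group()   # 'r'

print(lower, dashed, inverted)
```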
Group Extraction
The “group” feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r’([\w.-]+)@([\w.-]+)’. In this case, the parentheses do not change what the pattern will match; instead they establish logical “groups” inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)
A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)
findall With Files
For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you — much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):
# Open file (the with statement closes it again when the block ends)
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
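For a version that can be run as is, the sketch below first writes a small sample file into a temporary folder and then reads it back. The file name sample.txt and its contents are assumptions made for illustration.

```python
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')

    # Write a two-line sample file so the example is self-contained.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('alice@google.com wrote to\nbob@abc.com today\n')

    # One findall() call over the whole text covers every line.
    with open(path, encoding='utf-8') as f:
        emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

print(emails)   # ['alice@google.com', 'bob@abc.com']
```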
findall and Groups
The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. (‘alice’, ‘google.com’).
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for tuple in tuples:
    print(tuple[0])  ## username
    print(tuple[1])  ## host
Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parentheses, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parentheses, then findall() returns a list of strings corresponding to that single group. (Obscure optional feature: Sometimes you have paren ( ) groupings in the pattern that you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ) and that left paren will not count as a group result.)
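The three cases (no group, one group, and a (?: ) grouping that is not extracted) can be compared side by side. This is a sketch; the variable names are invented, and the alternation alice|bob is only there to give the non-capturing parentheses something to group.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: a list of whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: a list of strings for that group only.
users = re.findall(r'([\w.-]+)@[\w.-]+', text)

# (?: ) groups the alternation without counting as a result,
# so the host is still the only captured group.
hosts = re.findall(r'(?:alice|bob)@([\w.-]+)', text)

print(whole)   # ['alice@google.com', 'bob@abc.com']
print(users)   # ['alice', 'bob']
print(hosts)   # ['google.com', 'abc.com']
```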