Yuchen Liu

May 4, 2020 · 4 min read

Python Data Structures and Sequences Cheatsheet (Updated on 6/1/2020)

Updated: Jun 1, 2020

Data structures are central to data analysis with Python. They are simple, yet powerful and flexible. I keep my notes on them in Jupyter notebooks for quick look-up. In this post, I'll share some of those write-ups and keep updating the content with whatever proves helpful.

---Updated June 1st, 2020---

GitHub repository for Python Data Structures

'List' - an ordered, mutable collection. Allows duplicate members, and its contents can be modified in place.

# Two ways to define a list
my_list = ['Niu','Gu','Shi']

food = ['apples','beef','potatoes','pasta']
my_list_2 = list(food)

# Combining lists
my_list_combined = my_list + my_list_2
my_list_combined.extend(['a','b','c'])

my_list_combined

Adding / Removing / Checking elements in a list

my_list.append('Santa')

# Insert an element at a specific index (here the string '(1,2)', not a tuple)
my_list.insert(1,'(1,2)')

print(my_list)

# Remove an element at a particular index
my_list.pop(1)
my_list

# Elements can also be removed by value (only the first match is removed)
my_list.remove('Santa')
my_list

# Check if a list contains a value
'Gu' in my_list

Sort a list

list_a = [1,5,7,3,5,9,7,33,4,77,8,5,701,109]
list_a.sort()
list_a

# Sort strings by length
my_list.sort(key=len)
my_list

Binary search & Maintain a sorted list

import bisect as bi

# Find the index where 55 would be inserted to keep list_a sorted
bi.bisect(list_a,55)

# Insert a new element while keeping the list sorted
bi.insort(list_a,888)

list_a
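
When the value being searched for is already in the list, the left and right variants of bisect give different answers. The following is a minimal sketch of that behavior; the list `scores` is a made-up example, not part of the original notebook.

```python
import bisect

scores = [10, 20, 20, 30]  # hypothetical sorted list with a duplicate

# bisect_left returns the index before any equal values;
# bisect_right (bisect.bisect is an alias for it) returns the index after them.
left = bisect.bisect_left(scores, 20)
right = bisect.bisect_right(scores, 20)
print(left, right)  # 1 3

# insort places the new value at the position bisect_right would report
bisect.insort(scores, 25)
print(scores)  # [10, 20, 20, 25, 30]
```

Note that bisect assumes the list is already sorted; on an unsorted list the returned index is not meaningful.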

Slicing (select parts of a list by passing start:stop:step notation to the indexing operator [ ])

my_list[1:2]
my_list[:1]
my_list[2:]
my_list[-2]
my_list[-2:]
my_list[::2]
my_list[::-2]
my_list[-2::]
my_list[-2:-1]
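
To make the slice notations easier to read, here is a small sketch with the results written out. It uses a hypothetical five-element list, `letters`, rather than `my_list`, so the outputs do not depend on the earlier cells.

```python
letters = ['a', 'b', 'c', 'd', 'e']  # hypothetical list

print(letters[1:3])   # ['b', 'c']  (start is included, stop is excluded)
print(letters[:2])    # ['a', 'b']
print(letters[-2:])   # ['d', 'e']  (negative indices count from the end)
print(letters[::2])   # ['a', 'c', 'e']  (every second element)
print(letters[::-1])  # ['e', 'd', 'c', 'b', 'a']  (reversed copy)
print(letters[-2])    # d  (a single index returns an element, not a list)
```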

🤪 Check the code with outputs for 'list' on my GitHub: Python Data Structures Cheat Sheet - list.ipynb

'Tuple' - a one-dimensional, fixed-length, immutable sequence of Python objects

my_tup = 4,5,6
print(my_tup)

# Elements can be accessed with []
print(my_tup[2])

# Unpacking tuples
a,b,c = my_tup
print(b)

my_tup2 = 7,8,(9,10)
a,b,(c,d) = my_tup2
print(c)

Any sequence can be converted to a tuple

print(tuple(['can','I','be','a','tuple','?']))

tuple_string = tuple('string')
print(tuple_string)

Tuple method

my_tup3 = ('a','e','f','c','g','b','c','d','e','a','a','a')
my_tup3.count('a')
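
The word "immutable" is worth seeing in action. The sketch below uses a hypothetical tuple, `point`. It shows that assigning to a slot fails, while a mutable object stored inside the tuple can still be changed.

```python
point = (1, 2, [3, 4])  # hypothetical tuple that holds a mutable list

error_message = None
try:
    point[0] = 99  # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
print(error_message)

# The tuple itself is fixed, but the list stored inside it is still mutable
point[2].append(5)
print(point)  # (1, 2, [3, 4, 5])
```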

'Set' - a collection which is unordered and unindexed, with no duplicate members. In Python, sets are written with curly brackets

my_set = {'apples','bananas','carrots'}
print(my_set)

# Check if a set is a subset of another set
my_set_2 = {'apples'}
my_set_2.issubset(my_set)

Set methods - mathematical set operations

set_a = {1,6,8,9,3}
set_b = {0,6,8,10,29,46,74,66}
print(set_a | set_b) # union
print(set_a ^ set_b) # symmetric difference
print(set_a & set_b) # intersection
print(set_a - set_b) # difference
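
Each operator above also has a method form, and sets are a handy way to drop duplicates. The following sketch illustrates both; `purchases`, `a`, and `b` are made-up data.

```python
purchases = ['apples', 'bananas', 'apples', 'carrots', 'bananas']  # hypothetical data

# Converting to a set removes duplicates
unique_items = set(purchases)
print(len(unique_items))  # 3

a = {1, 6, 8}
b = {6, 8, 10}

# Operators and their method equivalents give the same result
print(a.union(b) == a | b)                 # True
print(a.intersection(b) == a & b)          # True
print(a.difference(b) == a - b)            # True
print(a.symmetric_difference(b) == a ^ b)  # True

unique_items.add('dates')
unique_items.discard('kiwis')  # discard does not raise if the value is absent
print(sorted(unique_items))    # ['apples', 'bananas', 'carrots', 'dates']
```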

Dictionary - a changeable collection of key-value pairs, indexed by key. Since Python 3.7, dictionaries preserve insertion order. In Python, dictionaries are written with curly brackets

thisdict = {
  "brand": "Nike",
  "model": "classic",
  "price": 198
}

print(thisdict.keys())
print(thisdict.values())

print(thisdict['model'])

# Check if a dict contains a key
print("brand" in thisdict)

# One dict can be merged into another dict
thisdict.update({'color': 'blk'})

Categorizing a list of colors by their first letter with a 'for' loop

colors = ['black','red','yellow','blue','grey','orange','green']
by_letter = {}
for color in colors:
    letter = color[0]
    if letter not in by_letter:
        by_letter[letter] = [color]
    else:
        by_letter[letter].append(color)
            
by_letter
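
The same grouping can be written without the explicit if/else. The sketch below is one possible alternative that uses `collections.defaultdict` from the standard library; the name `by_letter_dd` is introduced here only for illustration.

```python
from collections import defaultdict

colors = ['black', 'red', 'yellow', 'blue', 'grey', 'orange', 'green']

# defaultdict(list) creates an empty list the first time a key is accessed
by_letter_dd = defaultdict(list)
for color in colors:
    by_letter_dd[color[0]].append(color)

print(dict(by_letter_dd))
# {'b': ['black', 'blue'], 'r': ['red'], 'y': ['yellow'], 'g': ['grey', 'green'], 'o': ['orange']}
```

With a plain dict, `by_letter.setdefault(letter, []).append(color)` achieves the same effect in one line.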

List / set / dict comprehensions let you build a new list / set / dict by transforming, and optionally filtering, the elements of a collection.

#[expr for val in collection if condition]

the_city = ['yellow','taxi','New York']
print( [x.upper() for x in the_city if len(x)>7] )


lengths = {len(x) for x in the_city}
print(lengths)

mapping = {k : index for index, k in enumerate(the_city)}
print(mapping)
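
A comprehension is shorthand for a loop that appends to a new collection. As a rough illustration, the sketch below writes the same filter both ways; the threshold of 4 characters is an arbitrary choice for this example.

```python
the_city = ['yellow', 'taxi', 'New York']

# Comprehension form: [expr for val in collection if condition]
long_upper = [x.upper() for x in the_city if len(x) > 4]

# Equivalent explicit loop
long_upper_loop = []
for x in the_city:
    if len(x) > 4:
        long_upper_loop.append(x.upper())

print(long_upper)                     # ['YELLOW', 'NEW YORK']
print(long_upper == long_upper_loop)  # True
```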

🤪 Check the code with outputs for 'tuple, set, dict and comprehensions' on my GitHub: Tuple, set, dictionary and comprehensions.ipynb

Array - a NumPy array is a fast, flexible container for large data sets in Python

import numpy as np
my_array = np.array([1, 2, 3])
my_array

# Multi-dimensional array
my_ndarray = np.array([[3,4,5],[6,7,8]])
print("my_ndarray = ",my_ndarray)
print("np.zeros = ",np.zeros((3,6)))

Basic indexing and slicing

my_ndarray = np.array([[3,4,5],[6,7,8],[7,2,8],[6,6,6]])

print(my_ndarray[0][1])
print(my_ndarray[0,1])
print(my_ndarray[:1])
print(my_ndarray[:2,1:])
print(my_ndarray[:,:1])

Boolean Indexing

names = np.array(['one','two','two','six','five','seven','two','one'])
data = np.random.randn(8,5)
print(names)
print(data)
print('\n')
print(names == 'one')
print(data[names == 'one'])
print('\n')
print(data[names == 'one', 3])
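
Boolean masks can also be combined and used for assignment. Because the example above uses random data, the sketch below switches to small fixed arrays (`labels` and `grid`, both made up) so the results are reproducible.

```python
import numpy as np

labels = np.array(['one', 'two', 'two', 'six'])  # hypothetical row labels
grid = np.arange(12).reshape(4, 3)               # rows: [0,1,2], [3,4,5], [6,7,8], [9,10,11]

# Combine conditions with | (or), & (and) and ~ (not); Python's and/or do not work on arrays
mask = (labels == 'one') | (labels == 'six')
selected = grid[mask]  # boolean indexing returns a copy
print(selected)        # rows 0 and 3
print(grid[~mask])     # rows 1 and 2

# A mask on the left-hand side assigns to the matching rows in place
grid[labels == 'two'] = 0
print(grid)
```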

Fancy Indexing

my_array_2 = np.empty((7,5))
for i in range(7):
    my_array_2[i]=i
         
print(my_array_2)
print('\n')

# Select a subset of the rows in a particular order
print(my_array_2[[5,6,2,3]])
print('\n')
print(my_array_2[[-5,-6,-2,-3]])
print('\n')

# Reshape
my_array_3 = np.arange(16).reshape((2,8))
print(my_array_3)

Inner matrix product and Transposing

# Inner matrix product
my_array_4 = np.random.randn(6,3)
print(my_array_4)
print(np.dot(my_array_4.T,my_array_4))
print('\n')

# Transposing
my_array_5 = np.arange(16).reshape((2,2,4))
print(my_array_5)
my_array_5.transpose((1,0,2))

Conditional logic as array operations

my_array_6 = np.random.randn(4,4)
print(my_array_6)
np.where(my_array_6>0,2,my_array_6)
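
Since the array above is random, the effect of np.where is easier to check on fixed values. This is a minimal sketch with a hypothetical 2x2 array, `values`.

```python
import numpy as np

values = np.array([[-1.5, 0.5],
                   [2.0, -0.25]])  # hypothetical fixed data

# np.where(condition, x, y): take x where the condition is True, otherwise y
replaced = np.where(values > 0, 2, values)
print(replaced)
# [[-1.5   2.  ]
#  [ 2.   -0.25]]
```

Positive entries are replaced by 2, and the remaining entries are copied from the original array unchanged.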

Mathematical and statistical methods

my_array_7 = np.random.rand(5,5)
for i in range(5):
    my_array_7[i]=i
print(my_array_7)
print('\n')
print(my_array_7.mean())

my_array_7.cumsum(0)

Linear algebra

x = np.array([[1,2,3],[4,5,6]])
y = np.array([[5,6],[11,21],[8,9]])
print (x, '\n\n', y)
print('\n')
print(x.dot(y))

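from numpy.linalg import qr

# x.T.dot(x) is a square 3x3 matrix
mat = x.T.dot(x)
print('\n')
print(mat)
print('\n')
# qr returns two arrays: the orthogonal factor q and the upper-triangular factor r
q, r = qr(mat)
print(q)
print(r)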

🤪 Check the code with outputs for 'array' on my GitHub: Python Data Structures Cheat Sheet - np.array.ipynb

Pandas Series - a one-dimensional array-like object containing an array of data and an associated index

#class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
import pandas as pd
my_pandas_series = pd.Series(my_array)
my_pandas_series

Indexing

import pandas as pd
my_pandas_series = pd.Series(data = [3,2,4,6,10], index = ['a','b','c','d','e'])
print(my_pandas_series[:2])
print('\n')
print(my_pandas_series['b'])
print('\n')
print(my_pandas_series[my_pandas_series > 3])

Create Pandas Series from a dict

state_pop = {'California':39500000, 'new york':19450000, 'new jersey':8800000}
s_state = pd.Series(state_pop)
print(s_state)
states = ['California','new york','new jersey','ohio']
s_state_reIndex = pd.Series(state_pop,index = states)
print('\n')

Detect missing data

print(s_state_reIndex)
print(s_state_reIndex.isnull())
print('\n')

Data alignment

ohio_CA_pop = {'ohio': 11600000,'California':39500000}
ohio_CA = pd.Series(ohio_CA_pop)
print(ohio_CA)
print(s_state_reIndex + ohio_CA)
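
When two Series are added, labels that appear in only one of them produce NaN. The sketch below shows this on two small made-up Series, `left` and `right`, and uses `Series.add` with `fill_value` as one way to treat a missing label as 0 instead.

```python
import pandas as pd

left = pd.Series({'a': 1, 'b': 2})     # hypothetical data
right = pd.Series({'b': 10, 'c': 20})  # hypothetical data

# Plain addition aligns on the index; 'a' and 'c' have no partner, so they become NaN
total = left + right
print(total)

# fill_value substitutes 0 for a label missing from one side before adding
filled = left.add(right, fill_value=0)
print(filled)
```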

Ranking

s_state.rank(ascending=False,method='max')
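
The `method` argument only matters when there are ties. The sketch below compares a few tie-breaking methods on a made-up Series, `scores`, that contains a tie.

```python
import pandas as pd

scores = pd.Series([10, 20, 20, 30])  # hypothetical data with a tie

avg_rank = scores.rank(method='average')  # ties share the mean of their ranks
min_rank = scores.rank(method='min')      # ties share the lowest rank in the group
max_rank = scores.rank(method='max')      # ties share the highest rank in the group
desc_max = scores.rank(ascending=False, method='max')

print(avg_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())  # [1.0, 2.0, 2.0, 4.0]
print(max_rank.tolist())  # [1.0, 3.0, 3.0, 4.0]
print(desc_max.tolist())  # [4.0, 3.0, 3.0, 1.0]
```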

DataFrame - a two-dimensional data structure in which data is arranged in a table of rows and columns

data_dic = {'name':['joey','sam','nancy','monica'],
            'age': [33,29,25,31]}
my_df = pd.DataFrame(data_dic)
my_df.index.name = 'index'
my_df.columns.name = 'Students'
print(my_df)
print('\n')

# 'values' returns the data contained in the Dataframe as a 2D array
print(my_df.values)

Indexing

print(my_df["name"][1])
print('\n')
print(my_df["name"][:-1])

Reindexing - Calling reindex may introduce missing values if any of the new index values were not already present

print(my_df)
# Index 4 does not exist in my_df, so its row is filled with NaN
my_df_2 = my_df.reindex([0,1,2,3,4])
print(my_df_2)
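
To show the missing values that reindex introduces, and one way to avoid them, here is a small sketch on a made-up Series, `s`. It uses the `fill_value` parameter of `reindex`.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # hypothetical data

# 'd' is not in the original index, so it becomes NaN (and the dtype becomes float)
with_gap = s.reindex(['a', 'b', 'c', 'd'])
print(with_gap)

# fill_value supplies a default for labels that were not present
filled = s.reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(filled.tolist())  # [1, 2, 3, 0]
```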

Sorting and ranking

my_df_3 = pd.DataFrame({'c': [4,6,9,-1],'a':[0,0,1,1],'b': [3,6,-2,10]})
# Sort by column values with sort_values; sort_index no longer accepts 'by' in current pandas
print(my_df_3.sort_values(by=['a','b']))
print('\n')
print(my_df_3.rank(axis=1))

Summarizing and computing descriptive statistics

print(my_df_3.sum())
print('\n')
print(my_df_3.mean())
print('\n')
print(my_df_3.idxmax())
print('\n')

🤪 Check the code with outputs for 'Series' and 'DataFrame' on my GitHub: Pandas Series and DataFrame.ipynb

See the .ipynb files in my GitHub repository for the code and results. Play around with them and check back for new updates!

🤪Happy sharing!
