Data structures are very important to data analysis with python. They are simple but powerful and flexible. I prefer to store them in the Jupyter notebooks for a quick look-up. In this blog, I'll share some of my write-ups and continue updating the content to add whatever helpful.
---Updated June 1st, 2020---
'List' - a collection which is ordered and changeable. Allows duplicate members. Content can be modified.
#Two ways to define a list
my_list = ['Niu','Gu','Shi']
food = ['apples','beef','potatos','pasta']
my_list_2 = list (food)
# Combining lists
my_list_combined = my_list + my_list_2
my_list_combined.extend(['a','b','c'])
my_list_combined
Adding / Removing / Checking elements in a list
my_list.append('Santa')
# Insert an element at a specific location
my_list.insert(1,'(1,2)')
print (my_list)
# Remove an element at a particular index
my_list.pop(1)
my_list
# Element can also be removed by value
my_list.remove('Santa')
my_list
# Check if a list contains a value
'Gu' in my_list
Sort a list
list_a = [1,5,7,3,5,9,7,33,4,77,8,5,701,109]
list_a.sort()
list_a
# Sort strings by lengths
my_list.sort(key=len)
my_list
Binary search & Maintain a sorted list
import bisect as bi
# Find the location to be inserted
bi.bisect(list_a,55)
# Find the location and insert a new element in a sorted list
bi.insort(list_a,888)
list_a
Slicing ( Play around the data with different slice notations passed to the indexing operator [ ] )
my_list[1:2]
my_list[:1]
my_list[2:]
my_list[-2]
my_list[-2:]
my_list[::2]
my_list[::-2]
my_list[-2::]
my_list[-2:-1]
🤪 Check the codes with outputs for 'list' in my Github: Python Data Structures Cheat Sheet - list.ipynb
'Tuple' - a one-dimensional, fixed-length, immutable sequence of Python Objects
my_tup = 4,5,6
print(my_tup)
# Element can be accessed with []
print(my_tup[2])
# Unpacking tuples
a,b,c = my_tup
print(b)
my_tup2 = 7,8,(9,10)
a,b,(c,d) = my_tup2
print(c)
Any sequence can be converted to a tuple
print(tuple(['can','I','be','a','tuple','?']))
tuple_string = tuple('string')
print(tuple_string)
Tuple method
my_tup3 = ('a','e','f','c','g','b','c','d','e','a','a','a')
my_tup3.count('a')
'Set' - a collection which is unordered and unindexed. In Python sets are written with curly brackets
my_set = {'apples','bananas','carrots'}
print (my_set)
# Check if a set is a subset of another set
my_set_2 = {'apples'}
my_set_2.issubset(my_set)
Set methods - mathematical set operations
set_a = {1,6,8,9,3}
set_b = {0,6,8,10,29,46,74,66}
print(set_a | set_b) #union
print(set_a ^ set_b)#symmetric difference
print(set_a & set_b) # and
print(set_a - set_b) # difference
Dictionary - a collection which is unordered, changeable and indexed. In Python dictionaries are written with curly brackets, and they have keys and values
thisdict = {
"brand": "Nike",
"model": "classic",
"price": 198
}
print(thisdict.keys())
print(thisdict.values())
print(thisdict['model'])
# Check if a dict contains a key
print("brand" in thisdict)
# One dict can be merged into anothor dict
thisdict.update({'color': 'blk'})
Categorizing a list of color by it's first letter with a 'for loop'
colors = ['black','red','yellow','blue','grey','orange','green']
by_letter = {}
for color in colors:
letter = color[0]
if letter not in by_letter:
by_letter[letter] = [color]
else:
by_letter[letter].append(color)
by_letter
List / set / dict comprehension allows you to form a new list / set / dict by filtering the elements.
#[expr for val in collection if condition]
the_city = ['yellow','taxi','New York']
print( [x.upper() for x in the_city if len(x)>7] )
lengths = {len(x) for x in the_city}
print(lengths)
mapping = {k : index for index, k in enumerate(the_city)}
print(mapping)
🤪 Check the codes with outputs for 'tuple, set, dict and comprehensions' in my Github: Tuple, set, dictionary and comprehensions.ipynb
Array - Numpy array is a fast, flexible container for large data sets in Python
#Numpy array is a fast, flexible container for large data sets in Python.
import numpy as np
my_array = np.array([1, 2, 3])
my_array
#Multi-dimentional array
my_ndarray = np.array([[3,4,5],[6,7,8]])
print("my_ndarray = ",my_ndarray)
print("np.zeros = ",np.zeros((3,6)))
Basic indexing and slicing
my_ndarray = np.array([[3,4,5],[6,7,8],[7,2,8],[6,6,6]])
print(my_ndarray[0][1])
print(my_ndarray[0,1])
print(my_ndarray[:1])
print(my_ndarray[:2,1:])
print(my_ndarray[:,:1])
Boolean Indexing
names = np.array(['one','two','two','six','five','seven','two','one'])
data = np.random.randn(8,5)
print(names)
print(data)
print('\n')
print(names == 'one')
print(data[names == 'one'])
print('\n')
print(data[names == 'one', 3])
Fancy Indexing
my_array_2 = np.empty((7,5))
for i in range(7):
my_array_2[i]=i
print(my_array_2)
print('\n')
# Select a subset of the rows in a particular order
print(my_array_2[[5,6,2,3]])
print('\n')
print(my_array_2[[-5,-6,-2,-3]])
print('\n')
# Reshape
my_array_3 = np.arange(16).reshape((2,8))
print(my_array_3)
Inner matrix product and Transposing
# Inner matrix product
my_array_4 = np.random.randn(6,3)
print(my_array_4)
print(np.dot(my_array_4.T,my_array_4))
print('\n')
# Transposing
my_array_5 = np.arange(16).reshape(((2,2,4)))
print(my_array_5)
my_array_5.transpose((1,0,2))
Conditional logic as array operations
my_array_6 = np.random.randn(4,4)
print(my_array_6)
np.where(my_array_6>0,2,my_array_6)
Mathematical and statistical methods
my_array_7 = np.random.rand(5,5)
for i in range(5):
my_array_7[i]=i
print(my_array_7)
print('\n')
print(my_array_7.mean())
my_array_7.cumsum(0)
Linear algebra
x = np.array([[1,2,3],[4,5,6]])
y = np.array([[5,6],[11,21],[8,9]])
print (x, '\n\n', y)
print('\n')
print(x.dot(y))
from numpy.linalg import inv,qr
math = x.T.dot(x)
print('\n')
print(math)
print('\n')
r = qr(math)
print(r)
🤪 Check the codes with outputs for 'array' in my Github: Python Data Structures Cheat Sheet - np.array.ipynb
Pandas Series - one dimensional array-like object containing an array of data and index
#class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
import pandas as pd
my_pandas_series = pd.Series(my_array)
my_pandas_series
Indexing
import pandas as pd
my_pandas_series = pd.Series(data = [3,2,4,6,10], index = ['a','b','c','d','e'])
print(my_pandas_series[:2])
print('\n')
print(my_pandas_series['b'])
print('\n')
print(my_pandas_series[my_pandas_series > 3])
Create Pandas Series from a dict
state_pop = {'California':39500000, 'new york':19450000, 'new jersey':8800000}
s_state = pd.Series(state_pop)
print(s_state)
states = ['California','new york','new jersey','ohio']
s_state_reIndex = pd.Series(state_pop,index = states)
print('\n')
Detect missing data
print(s_state_reIndex)
print(s_state_reIndex.isnull())
print('\n')
Data alignment
ohio_CA_pop = {'ohio': 11600000,'California':39500000}
ohio_CA = pd.Series(ohio_CA_pop)
print(ohio_CA)
print(s_state_reIndex + ohio_CA)
Ranking
s_state.rank(ascending=False,method='max')
DataFrame - two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns
data_dic = {'name':['joey','sam','nancy','monica'],
'age': [33,29,25,31]}
my_df = pd.DataFrame (data_dic)
my_df.index.name='index';my_df.columns.name = 'Students'
print(my_df)
print('\n')
# 'values' returns the data contained in the Dataframe as a 2D array
print(my_df.values)
Indexing
print(my_df["name"][1])
print('\n')
print(my_df["name"][:-1])
Reindexing - Calling reindex may introduce missing values if any index values were not present
print(my_df)
my_df_2 = my_df.reindex([0,1,2,3,4])
Sorting and ranking
my_df_3 = pd.DataFrame ({'c': [4,6,9,-1],'a':[0,0,1,1],'b': [3,6,-2,10]})
print(my_df_3.sort_index(by =['a','b']))
print('\n')
print(my_df_3.rank(axis=1))
Summarizing and computing descriptive statistics
print(my_df_3.sum())
print('\n')
print(my_df_3.mean())
print('\n')
print(my_df_3.idxmax())
print('\n')
🤪 Check the codes with outputs for 'Series' and 'DataFrame' in my Github: Pandas Series and DataFrame.ipynb
See the .ipynb files in my GitHub Repository for the codes and results. Play around with it and check back for the new updates!
🤪Happy sharing!
Comentários