Pandas is a great tool for holding, manipulating, and plotting labeled datasets. It is similar to NumPy, but differs in two important ways:
The rows and columns can be indexed by 'labels' (strings) OR numbers, unlike NumPy arrays, which always use numeric indices
The contents can be different types - NumPy requires every element to have the same datatype
Why pandas?
It is a great way to:
Read data from a file into a structured easy-to-access format
Clean data - deal with missing values, etc.
Select data of interest
Analyze/plot
Output results
Pandas provides two basic data structures - Series (1-D) and DataFrames (2-D). You can think of a DataFrame as a grouping of Series, each with a column label. What's nice is that you can 'slice' these DataFrames either across rows or down columns using the row and column labels, much like you would with dictionaries.
You can learn more about pandas [here]
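To make the Series/DataFrame relationship concrete, here is a minimal sketch (all names - `mass`, `charge`, the `pep*` row labels - are invented for illustration, not from the dataset used below):

```python
import pandas as pd

# Two labeled Series sharing the same index...
mass = pd.Series([450.2, 612.7, 388.1], index=['pep1', 'pep2', 'pep3'])
charge = pd.Series([2, 3, 2], index=['pep1', 'pep2', 'pep3'])

# ...grouped into a DataFrame, one column label per Series
df = pd.DataFrame({'mass': mass, 'charge': charge})

row = df.loc['pep2']   # slice across a row by its label
col = df['mass']       # pull out a column by its label, like a dict lookup
```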
You can "search/select" data by generating "boolean" arrays based on some criteria. This works by effectively generating a column of True/False values that pandas uses to select particular rows (those that are True). There are a few ways to generate this True/False selection column.
You provide a selection criteria for a particular column. Example:
# generates the true/false array
my_dataframe['my_column']>=some_value
You provide a list of values you want to search for. Example:
subset_of_rows = my_dataframe[my_dataframe['column_name'].isin(list_of_values)]
There are lots of ways to do this - you can learn more here
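Both approaches can be sketched on a small invented DataFrame (the `charge`/`mz` column names here are made up, not the real columns used below):

```python
import pandas as pd

df = pd.DataFrame({'charge': [2, 3, 3, 4],
                   'mz': [400.1, 500.2, 650.3, 700.4]})

# Approach 1: a comparison builds a True/False Series, one entry per row
mask = df['charge'] == 3
triply = df[mask]            # keeps only the rows where the mask is True

# Approach 2: .isin matches against a list of values
wanted = df[df['charge'].isin([2, 4])]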
ms['Precursor Charge']==3
This is boolean indexing - you can build very complicated selection criteria to pull out exactly the data you want
selection_criteria = ms['Precursor Charge']==3 #now we have saved the selection criteria
selection_criteria
ms[selection_criteria] #note that only the "True" rows are selected
ms[ms['Precursor Charge']==3]
# Try to select all of the rows with "light Precursor Mz" greater than 800, and do it in one line.
ms[ms['light Precursor Mz']>800]
ms[ms['Peptide Modified Sequence'].str.contains('Q')][['Protein Preferred Name', 'Peptide Modified Sequence']]
ms[ms['Peptide Modified Sequence'].str.contains('SV')]
# Edit the above to only get peptides with the motif 'SV' and only output interested columns
ms[ms['Peptide Modified Sequence'].str.contains('SV')][['Protein Preferred Name', 'Peptide Modified Sequence']]
# now let's try using "isin"
ms[ms['Protein Preferred Name'].isin(['RL27_ECOLI'])]
Pandas provides some simple methods to look at your dataframes:
[your_dataframe_name].head(5)
will provide the first 5 rows
[your_dataframe_name].tail(10)
will provide the last 10 rows
[your_dataframe_name].describe()
is a quick way to get summary statistics on a per-column basis
You can find more useful pandas functions []
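One detail worth noticing before trying these on real data: by default, describe() only summarizes the numeric columns, so string columns silently drop out. A quick sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()   # 'name' is absent; only 'value' gets statistics
first_two = df.head(2)    # first 2 rows
last_one = df.tail(1)     # last row
```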
You can trivially add new columns or change the values in existing columns.
Be sure that the column you are adding has the same index as the old dataframe.
This is most easily accomplished by manipulating an old column and saving the result
Example: dataframe['new_column_name'] = ms['old_column']*value
Example: dataframe['new_column_name'] = ms['old_column1']*ms['old_column2']
You can also select a set of cells using the tools from above and change their values
Note that dataframes are MUTABLE!
dataframe.loc[selection,selection] = 7
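Both patterns can be sketched on a toy DataFrame (column names invented; `neutral_mass` is just an illustrative derived column, not real chemistry):

```python
import pandas as pd

df = pd.DataFrame({'mz': [400.0, 500.0], 'charge': [2, 3]})

# New column derived from existing ones - the index lines up automatically
df['neutral_mass'] = df['mz'] * df['charge']

# Mutate a selection in place with .loc[row_selection, column_selection]
df.loc[df['charge'] == 3, 'charge'] = 7
```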
The dataframe.dropna()
and .fillna()
functions are super helpful for removing/replacing missing values
Example: only_complete_rows = dataframe.dropna(how='any')
Example: replace_with_0 = dataframe.fillna(value=0.0)
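The two examples above can be run on a small invented DataFrame with one missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

only_complete_rows = df.dropna(how='any')   # drops the row holding the NaN
replace_with_0 = df.fillna(value=0.0)       # keeps the row; NaN becomes 0.0
```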
ms.head() #this will give the first 5 rows by default. You can add any number in the () to get that number of rows
ms.tail(10) #and the last 10 rows
ms.describe() #this is a quick way to get summary statistics on a per-column basis
#What do you notice about the number of columns returned by describe vs that in the entire dataframe...
ms.shape
ms.columns
missing = []
des_cols = ms.describe().columns
for col in ms.columns:
    if col in des_cols:
        print('found: ' + col)
    else:
        missing.append(col)
missing
pd.set_option('display.max_rows', 50) #This will set the number of rows you can "see" in the jupyter notebook when you inspect a dataframe
pd.set_option('display.max_columns', 200) #This will set the number of columns you can "see" in the jupyter notebook when you inspect a dataframe
ms.describe() #notice the difference in the number of columns you can see
# what is this line doing?
ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)
ms[['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
#think through what this line is doing
ms.loc[ms['light +1 charge mass']>2000,['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
ms.loc[ms['light +1 charge mass']>2000,'light +1 charge mass'] = 'way too big!'
#you can quickly save your work as a .csv using the command .to_csv(path_to_file)
ms.to_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec_new.csv")
#look in your directory for a new .csv file!
If you don't specify labels, the column and row labels will simply default to the numerical index
import pandas as pd
import numpy as np
z = np.array([[1,2,3,4,5],[6,7,8,9,10]])
z
pd.DataFrame(z) #note the difference to a numpy array z above
my_list = [['a', 'b', 'c'], [10,5,2.5], [3,2,1]]
print(my_list)
df = pd.DataFrame(my_list)
df #note the output here
[['a', 'b', 'c'], [10, 5, 2.5], [3, 2, 1]]
df.shape
(3, 3)
dictionary = {'a':[10,3], 'b':[5,2], 'c':[2.5,1]}
df = pd.DataFrame(dictionary)
df #note the difference with the prior dataframe you made
df.shape #note the new shape
(2, 3)
df.columns #this is how to get a list of the column headers
Index(['a', 'b', 'c'], dtype='object')
dictionary = {'a':{'row1':3, 'row2':2}, 'b':{'row1':5,'row2':2}, 'c':{'row1':2.5,'row2':1}}
df = pd.DataFrame(dictionary)
df #note the new index!
df.index #this is how you get a list of the row labels
Index(['row1', 'row2'], dtype='object')
dictionary2 = {'a':{'row1':3, 'row2':2}, 'b':{'row3':5,'row4':2}, 'c':{'row5':2.5,'row6':1}}
df2 = pd.DataFrame(dictionary2)
df2
#pandas has some great methods to read .csv files
pd.read_csv?
ms=pd.read_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec.csv")
ms
#pandas has read_excel method to read excel files
pd.read_excel?
excelf=pd.read_excel(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\excelfile.xlsx")
excelf
You can access subsets of your dataframe (views) in a few different ways, but we will focus on two here.
Name-based indexing
You provide a row_index and a column_index - they can be slices or lists or whatever - to the .loc[row_names, col_names] indexer
example: [your_dataframe_name].loc[my_row_names, my_col_names]
Index-based indexing
You provide the row and column numbers to the .iloc[row_numbers, col_numbers] indexer
example: [your_dataframe_name].iloc[my_row_numbers, my_col_numbers]
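A sketch contrasting the two indexers on a toy DataFrame with string row labels (all names invented). Note one subtlety: .loc slices include both endpoints, while .iloc slices exclude the stop value, like normal Python slicing:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]},
                  index=['r1', 'r2', 'r3'])

by_name = df.loc['r1':'r2', 'x']    # label slice: INCLUDES 'r2'
by_position = df.iloc[0:2, 0]       # position slice: EXCLUDES row 2
```

Both expressions above select the same two values, but only because the label slice's inclusive endpoint happens to line up with the position slice's exclusive stop.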
ms.loc[:,'Protein Name'] #get all row (:), 'Protein Name' column
#How would you get the first 10 rows using .loc? (note that here the row "names" are just numbers)
ms.loc[:9, 'Protein Name']
ms.loc[0:10,['Protein Name', 'Protein Gene']] #what will this return?
# Note that you can pass any list of column names to the column indexer
ms.loc[:8,[col for col in ms.columns if "Protein" in col]] #what is this doing?
Side topic: get familiar with [List Comprehension]
my_list = []
for col in ms.columns:
    if "Protein" in col:
        my_list.append(col)
my_list
['Protein Name', 'Protein Preferred Name', 'Protein Gene']
my_list = [col for col in ms.columns if "Protein" in col]
my_list
['Protein Name', 'Protein Preferred Name', 'Protein Gene']
ms.loc[:5,my_list] #what is this doing?
list(ms.columns) #this provides the full list of the columns in the dataframe
# write a line to access all columns related to sample BT2_HFX_6
ms.loc[:,[col for col in ms.columns if "BT2_HFX_6" in col]]
# Now let's try indexing with .iloc
ms.iloc[:5,3:9] #note the difference in how iloc and loc work!
ms.iloc[:20,'Precursor Charge'] #Will this work?
ms.iloc[:20,4]