Pandas is a great tool for holding, manipulating, and plotting labeled datasets. It is similar to NumPy, but differs in two important ways:
The rows and columns can be indexed by 'labels' (strings) OR numbers, unlike NumPy arrays, which always use numeric indices
The contents can be different types - NumPy requires every element to have the same datatype
Why pandas?
It is a great way to:
Read data from a file into a structured easy-to-access format
Clean data - deal with missing values, etc.
Select data of interest
Analyze/plot
Output results
Pandas provides two basic data structures - Series (1-D) and DataFrames (2-D). You can think of a DataFrame as a grouping of Series, each with a column label. What's nice is that you can 'slice' these DataFrames either across rows or down columns using the row and column labels, much like you would with dictionaries.
You can learn more about pandas [here]
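To make the Series/DataFrame relationship concrete, here is a minimal sketch (all names - `mass`, `charge`, the `pep*` row labels - are invented for illustration, not from the dataset used below):

```python
import pandas as pd

# Two labeled Series sharing the same index...
mass = pd.Series([450.2, 612.7, 388.1], index=['pep1', 'pep2', 'pep3'])
charge = pd.Series([2, 3, 2], index=['pep1', 'pep2', 'pep3'])

# ...grouped into a DataFrame, one column label per Series
df = pd.DataFrame({'mass': mass, 'charge': charge})

row = df.loc['pep2']   # slice across a row by its label
col = df['mass']       # pull out a column by its label, like a dict lookup
```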
You can "search/select" data by generating "boolean" arrays based on some criteria. This works by effectively generating a column of True/False values that pandas uses to select particular rows (those that are True). There are a few ways to generate this True/False selection column.
You provide a selection criteria for a particular column. Example:
# generates the true/false array
my_dataframe['my_column']>=some_value
You provide a list of values you want to search for. Example:
subset_of_rows = my_dataframe[my_dataframe['column_name'].isin(list_of_values)]
There are lots of ways to do this - you can learn more here
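Both approaches can be sketched on a small invented DataFrame (the `charge`/`mz` column names here are made up, not the real columns used below):

```python
import pandas as pd

df = pd.DataFrame({'charge': [2, 3, 3, 4],
                   'mz': [400.1, 500.2, 650.3, 700.4]})

# Approach 1: a comparison builds a True/False Series, one entry per row
mask = df['charge'] == 3
triply = df[mask]            # keeps only the rows where the mask is True

# Approach 2: .isin matches against a list of values
wanted = df[df['charge'].isin([2, 4])]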
ms['Precursor Charge']==3
This is boolean indexing - you can build very complicated selection criteria to pull out exactly the data you want
selection_criteria = ms['Precursor Charge']==3 #now we have saved the selection criteria
selection_criteria
ms[selection_criteria] #note that only the "True" rows are selected
ms[ms['Precursor Charge']==3]
# Try to select all of the rows with "light Precursor Mz" greater than 800, and do it in one line.
ms[ms['light Precursor Mz']>800]
ms[ms['Peptide Modified Sequence'].str.contains('Q')][['Protein Preferred Name', 'Peptide Modified Sequence']]
ms[ms['Peptide Modified Sequence'].str.contains('SV')]
# Edit the above to only get peptides with the motif 'SV' and only output interested columns
ms[ms['Peptide Modified Sequence'].str.contains('SV')][['Protein Preferred Name', 'Peptide Modified Sequence']]
# now let's try using "isin"
ms[ms['Protein Preferred Name'].isin(['RL27_ECOLI'])]
Pandas provides some simple methods to look at your dataframes:
[your_dataframe_name].head(5)
will provide the first 5 rows
[your_dataframe_name].tail(10)
will provide the last 10 rows
[your_dataframe_name].describe()
is a quick way to get summary statistics on a per-column basis
You can find more useful pandas functions []
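One detail worth noticing before trying these on real data: by default, describe() only summarizes the numeric columns, so string columns silently drop out. A quick sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()   # 'name' is absent; only 'value' gets statistics
first_two = df.head(2)    # first 2 rows
last_one = df.tail(1)     # last row
```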
You can trivially add new columns or change the values in existing columns.
Be sure that the column you are adding has the same index as the old dataframe.
This is most easily accomplished by manipulating an old column and saving the result
Example: dataframe['new_column_name'] = ms['old_column']*value
Example: dataframe['new_column_name'] = ms['old_column1']*ms['old_column2']
You can also select a set of cells using the tools from above and change their values
Note that dataframes are MUTABLE!
dataframe.loc[selection,selection] = 7
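Both patterns can be sketched on a toy DataFrame (column names invented; `neutral_mass` is just an illustrative derived column, not real chemistry):

```python
import pandas as pd

df = pd.DataFrame({'mz': [400.0, 500.0], 'charge': [2, 3]})

# New column derived from existing ones - the index lines up automatically
df['neutral_mass'] = df['mz'] * df['charge']

# Mutate a selection in place with .loc[row_selection, column_selection]
df.loc[df['charge'] == 3, 'charge'] = 7
```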
The dataframe.dropna()
and .fillna()
functions are super helpful for removing/replacing missing values
Example: only_complete_rows = dataframe.dropna(how='any')
Example: replace_with_0 = dataframe.fillna(value=0.0)
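The two examples above can be run on a small invented DataFrame with one missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

only_complete_rows = df.dropna(how='any')   # drops the row holding the NaN
replace_with_0 = df.fillna(value=0.0)       # keeps the row; NaN becomes 0.0
```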
ms.head() #this will give the first 5 rows by default. You can add any number in the () to get that number of rows
ms.tail(10) #and the last 10 rows
ms.describe() #this is a quick way to get summary statistics on a per-column basis
#What do you notice about the number of columns returned by describe vs that in the entire dataframe...
ms.shape
ms.columns
missing = []
des_cols = ms.describe().columns
for col in ms.columns:
    if col in des_cols:
        print('found: ' + col)
    else:
        missing.append(col)
missing
pd.set_option('display.max_rows', 50) #This will set the number of rows you can "see" in the jupyter notebook when you inspect a dataframe
pd.set_option('display.max_columns', 200) #This will set the number of columns you can "see" in the jupyter notebook when you inspect a dataframe
ms.describe() #notice the difference in the number of columns you can see
# what is this line doing?
ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)
ms[['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
#think through what this line is doing
ms.loc[ms['light +1 charge mass']>2000,['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
ms.loc[ms['light +1 charge mass']>2000,'light +1 charge mass'] = 'way too big!'
#you can quickly save your work as a .csv using the command .to_csv(path_to_file)
ms.to_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec_new.csv")
#look in your directory for a new .csv file!
If you don't specify labels, the column and row labels will simply default to the numerical index
import pandas as pd
import numpy as np
z = np.array([[1,2,3,4,5],[6,7,8,9,10]])
z
pd.DataFrame(z) #note the difference to a numpy array z above
my_list = [['a', 'b', 'c'], [10,5,2.5], [3,2,1]]
print(my_list)
df = pd.DataFrame(my_list)
df #note the output here
[['a', 'b', 'c'], [10, 5, 2.5], [3, 2, 1]]
df.shape
(3, 3)
dictionary = {'a':[10,3], 'b':[5,2], 'c':[2.5,1]}
df = pd.DataFrame(dictionary)
df #note the difference with the prior dataframe you made
df.shape #note the new shape
(2, 3)
df.columns #this is how to get a list of the column headers
Index(['a', 'b', 'c'], dtype='object')
dictionary = {'a':{'row1':3, 'row2':2}, 'b':{'row1':5,'row2':2}, 'c':{'row1':2.5,'row2':1}}
df = pd.DataFrame(dictionary)
df #note the new index!
df.index #this is how you get a list of the row labels
Index(['row1', 'row2'], dtype='object')
dictionary2 = {'a':{'row1':3, 'row2':2}, 'b':{'row3':5,'row4':2}, 'c':{'row5':2.5,'row6':1}}
df2 = pd.DataFrame(dictionary2)
df2
#pandas has some great methods to read .csv files
pd.read_csv?
ms=pd.read_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec.csv")
ms
#pandas has read_excel method to read excel files
pd.read_excel?
excelf=pd.read_excel(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\excelfile.xlsx")
excelf
You can access subsets of your dataframe (views) in a few different ways, but we will focus on two here.
Name-based indexing
You provide a row_index and a column_index - they can be slices or lists or whatever - to the .loc[row_names, col_names] indexer
example: [your_dataframe_name].loc[my_row_names, my_col_names]
Index-based indexing
You provide the row and column numbers to the .iloc[row_numbers, col_numbers] indexer
example: [your_dataframe_name].iloc[my_row_numbers, my_col_numbers]
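A sketch contrasting the two indexers on a toy DataFrame with string row labels (all names invented). Note one subtlety: .loc slices include both endpoints, while .iloc slices exclude the stop value, like normal Python slicing:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]},
                  index=['r1', 'r2', 'r3'])

by_name = df.loc['r1':'r2', 'x']    # label slice: INCLUDES 'r2'
by_position = df.iloc[0:2, 0]       # position slice: EXCLUDES row 2
```

Both expressions above select the same two values, but only because the label slice's inclusive endpoint happens to line up with the position slice's exclusive stop.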
ms.loc[:,'Protein Name'] #get all row (:), 'Protein Name' column
#How would you get the first 10 rows using .loc? (note that here the row "names" are just numbers)
ms.loc[:9, 'Protein Name']
ms.loc[0:10,['Protein Name', 'Protein Gene']] #what will this return?
# Note that you can pass any list of column names to the column indexer
ms.loc[:8,[col for col in ms.columns if "Protein" in col]] #what is this doing?
Side topic: get familiar with [List Comprehension]
my_list = []
for col in ms.columns:
    if "Protein" in col:
        my_list.append(col)
my_list
['Protein Name', 'Protein Preferred Name', 'Protein Gene']
my_list = [col for col in ms.columns if "Protein" in col]
my_list
['Protein Name', 'Protein Preferred Name', 'Protein Gene']
ms.loc[:5,my_list] #what is this doing?
list(ms.columns) #this provides the full list of the columns in the dataframe
# write a line to access all columns related to sample BT2_HFX_6
ms.loc[:,[col for col in ms.columns if "BT2_HFX_6" in col]]
# Now let's try indexing with .iloc
ms.iloc[:5,3:9] #note the difference in how iloc and loc work!
ms.iloc[:20,'Precursor Charge'] #Will this work?
ms.iloc[:20,4]