All pages
Powered by GitBook
1 of 7

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Pandas

About Pandas

Pandas is a great tool for holding/manipulating/plotting labeled datasets. It is similar to Numpy, but differs in two important ways:

  1. The rows and columns can be indexed by 'labels' (strings) OR numbers, unlike numpy arrays which always use number indices

  2. The contents can be different types - numpy requires each element is the same datatype

Why pandas?

It is a great way to:

  1. Read data from a file into a structured easy-to-access format

  2. Clean data - deal with missing values, etc.

  3. Select data of interest

  4. Analyze/plot

  5. Output results

Pandas provides two basic data structures - Series (1-D) and DataFrames (2-D). You can think of DataFrames as a grouping of Series, each with a column label. What's nice is that you can 'slice' these Dataframes either across rows or down columns using the row and column labels, much like you would with dictionaries.

You can learn more about pandas [here]

Selecting from DataFrames

You can "search/select" data by generating "boolean" arrays based on some criteria. This works by effectively generating a column of True/False values that Pandas uses to select particular rows (those that are true). There are a few ways to generate this true/false selection column.

Value-based selections

You provide a selection criteria for a particular column. Example:

# generates the true/false array
my_dataframe['my_column']>=some_value

Is-in based selections

You provide a list of values you want to search for. Example:

subset_of_rows = my_dataframe['column_name'].isin([list_of_values])

Other

There are lots of ways to do this - you can learn more here

Boolean Indexing

ms['Precursor Charge']==3

This is boolean indexing - you can make very complicated selection criteria to just pull out the data you want

selection_criteria = ms['Precursor Charge']==3 #now we have saved the selection criteria
selection_criteria
ms[selection_criteria] #note that only the "True" rows are selected
ms[ms['Precursor Charge']==3]
# Try to select all of the rows with "light Precursor Mz" greater than 800, and do it in one line.
ms[ms['light Precursor Mz']>800]
ms[ms['Peptide Modified Sequence'].str.contains('Q')][['Protein Preferred Name', 'Peptide Modified Sequence']]
ms[ms['Peptide Modified Sequence'].str.contains('SV')]
# Edit the above to only get peptides with the motif 'SV' and only output interested columns
ms[ms['Peptide Modified Sequence'].str.contains('SV')][['Protein Preferred Name', 'Peptide Modified Sequence']]
# now let's try using "isin"
ms[ms['Protein Preferred Name'].isin(['RL27_ECOLI'])]

Inspecting DataFrames

Pandas provides some simple methods to look at your dataframes:

  • [your_dataframe_name].head(5) will provide the first 5 rows

  • [your_dataframe_name].tail(10) will provide the last 10 rows

  • [your_dataframe_name].describe() is a quick way to get summary statistics on a per-column basis

You can find more useful pandas functions []

(216, 183)

Editing DataFrames

You can trivially add new columns or change the values in existing column.

Add a new column

  1. Be sure that the column you are adding has the same indices as the old dataframe.

  2. This is most easily accomplished by manipulating an old column and saving the value

  3. Example: dataframe['new_column_name'] = ms['old_column']*value

  4. Example: dataframe['new_column_name'] = ms['old_column1']*ms['old_column2']

Alter a column

  1. You select a set of cells using the tools from above change their values

  2. Note that dataframes are MUTABLE!

  3. dataframe.loc[selection,selection] = 7

Dealing with missing data

  1. The dataframe.dropna() and .fillna() funtions are super helpful in removing/replacing missing values

  2. Example: only_complete_rows = dataframe.dropna(how='any')

  3. Example: replace_with_0 = dataframe.fillna(value=0.0)

ms.head() #this will give the first 5 rows by default. You can add any number in the () to get that number of rows
ms.tail(10) #and the last 10 rows
ms.describe() #this is a quick way to get summary statistics on a per-column basis
#What do you notice about the number of columns returned by describe vs that in the entire dataframe...
ms.shape
ms.columns
missing = []
des_cols = ms.describe().columns
for col in ms.columns:
    if col in des_cols:
        print('found: '+ col)
    else:
        missing.append(col)
missing
pd.set_option('display.max_rows', 50) #This will set the number of rows you can "see" in the jupyter notebook when you inspect a dataframe
pd.set_option('display.max_columns', 200) #This will set the number of columns you can "see" in the jupyter notebook when you inspect a dataframe
ms.describe() #notice the difference in the number of columns you can see
here
# what is this line doing?
ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)
ms[['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
#think through what this line is doing
ms.loc[ms['light +1 charge mass']>2000,['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
ms.loc[ms['light +1 charge mass']>2000,'light +1 charge mass'] = 'way too big!'
#you can quickly save your work at a .csv using the command .to_csv(path_to_file)
ms.to_csv("C:\\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec_new.csv")
#look in your directory for a new .csv file!

Making DataFrames

Provide a list or numpy array

The column and row labels will simply use the numerical index

import pandas as pd
import numpy as np
z = np.array([[1,2,3,4,5],[6,7,8,9,10]])
z
pd.DataFrame(z) #note the difference to a numpy array z above
my_list = [['a', 'b', 'c'], [10,5,2.5], [3,2,1]]
print(my_list)
df = pd.DataFrame(my_list)
df #note the output here

[['a', 'b', 'c'], [10, 5, 2.5], [3, 2, 1]]

df.shape

(3, 3)

Provide a dictionary

dictionary = {'a':[10,3], 'b':[5,2], 'c':[2.5,1]}
df = pd.DataFrame(dictionary)
df #note the difference with the prior dataframe you made
df.shape #note the new shape

(2, 3)

df.columns #this is how to get a list of the column headers

Index(['a', 'b', 'c'], dtype='object')

dictionary = {'a':{'row1':3, 'row2':2}, 'b':{'row1':5,'row2':2}, 'c':{'row1':2.5,'row2':1}}
df = pd.DataFrame(dictionary)
df #note the new index!
df.index #this is how you get a list of the row labels

Index(['row1', 'row2'], dtype='object')

dictionary2 = {'a':{'row1':3, 'row2':2}, 'b':{'row3':5,'row4':2}, 'c':{'row5':2.5,'row6':1}}
df2 = pd.DataFrame(dictionary2)
df2

Read a CSV file

#pandas has some great methods to read .csv files
pd.read_csv?
ms=pd.read_csv("C:\\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec.csv")
ms

Read an Excel file

#pandas has read_excel method to read excel files
pd.read_excel?
excelf=pd.read_excel("C:\\Users\duan\Desktop\PythonDataProcessingVisualization\excelfile.csv")
excelf

Slicing DataFrames

You can access subsets of your dataframe (views) in a few different ways, but we will focus on two here.

Name-based indexing

You provide a row_index and a column_index - they can be slices or lists or whatever to the .loc[row_names, col_names] indexer
example: [your_dataframe_name].loc[my_row_names, my col_names].

Index-based indexing

You provide the row and column numbers to the .iloc[row_numbers, col_numbers] 
example: [your_dataframe_name].iloc[my_row_numbers, my col_numbers]
ms.loc[:,'Protein Name'] #get all row (:), 'Protein Name' column
#How would you get the first 10 rows using .loc (note that here the row "names" are just numbers
ms.loc[:9, 'Protein Name']
ms.loc[0:10,['Protein Name', 'Protein Gene']] #what will this return?
# Note that you can pass any list of column names to the column indexer
ms.loc[:8,[col for col in ms.columns if "Protein" in col]] #what is this doing?

Side topic: get familiar with [List Comprehension]

my_list =[ ]
for col in ms.columns:
    if "Protein" in col:
        my_list.append(col)
my_list

['Protein Name', 'Protein Preferred Name', 'Protein Gene']

my_list = [col for col in ms.columns if "Protein" in col]
my_list

['Protein Name', 'Protein Preferred Name', 'Protein Gene']

ms.loc[:5,my_list] #what is this doing?
list(ms.columns) #this provides the full list of the columns in the dataframe
# write a line to access all columns related to sample BT2_HFX_6
ms.loc[:,[col for col in ms.columns if "BT2_HFX_6" in col]]
# Now let's try indexing with .iloc
ms.iloc[:5,3:9] #note the difference in how iloc and loc work!>
ms.iloc[:20,'Precursor Charge'] #Will this work?
ms.iloc[:20,4]