Data Processing with Python

Acknowledgement

The backbone of this teaching material was developed by Professor Joey Davis in the Department of Biology at MIT. Many sections were adopted directly from Professor Davis's teaching material. Additional material was added from the user guides and tutorials of Pandas, Matplotlib, and Seaborn, as indicated in the corresponding sections of the teaching material.

Pandas

About Pandas

Pandas is a great tool for holding, manipulating, and plotting labeled datasets. It is similar to NumPy, but differs in two important ways:

The rows and columns can be indexed by 'labels' (strings) OR numbers, unlike NumPy arrays, which always use numeric indices
Each column can hold a different type, whereas a NumPy array requires every element to have the same datatype

Why pandas?

It is a great way to:

Read data from a file into a structured easy-to-access format
Clean data - deal with missing values, etc.
Select data of interest
Analyze/plot
Output results

Pandas provides two basic data structures: Series (1-D) and DataFrames (2-D). You can think of a DataFrame as a grouping of Series, each with a column label. What's nice is that you can 'slice' a DataFrame either across rows or down columns using the row and column labels, much like you would look up values in a dictionary.
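As a minimal sketch of how the two structures relate (the row and column labels below are made up for illustration), selecting one column or one row of a DataFrame returns a Series, and single cells can be addressed by label:

```python
import pandas as pd

# Hypothetical data: two labeled columns, three labeled rows
df = pd.DataFrame(
    {"mass": [10.0, 5.0, 2.5], "charge": [3, 2, 1]},
    index=["pep1", "pep2", "pep3"],
)

col = df["mass"]                  # one column -> a 1-D Series that keeps the row labels
row = df.loc["pep2"]              # one row -> also a Series, indexed by column label
value = df.loc["pep3", "charge"]  # a single cell, addressed by row and column label

print(type(col).__name__)
print(row["mass"], value)
```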

You can learn more about pandas [here]

Making DataFrames

Provide a list or numpy array

The column and row labels will simply use the numerical index

import pandas as pd
import numpy as np

z = np.array([[1,2,3,4,5],[6,7,8,9,10]])
z

pd.DataFrame(z) #note the difference to a numpy array z above

my_list = [['a', 'b', 'c'], [10,5,2.5], [3,2,1]]
print(my_list)
df = pd.DataFrame(my_list)
df #note the output here

[['a', 'b', 'c'], [10, 5, 2.5], [3, 2, 1]]

df.shape

(3, 3)

Provide a dictionary

dictionary = {'a':[10,3], 'b':[5,2], 'c':[2.5,1]}
df = pd.DataFrame(dictionary)
df #note the difference with the prior dataframe you made

df.shape #note the new shape

(2, 3)

df.columns #this is how to get a list of the column headers

Index(['a', 'b', 'c'], dtype='object')

dictionary = {'a':{'row1':3, 'row2':2}, 'b':{'row1':5,'row2':2}, 'c':{'row1':2.5,'row2':1}}
df = pd.DataFrame(dictionary)
df #note the new index!

df.index #this is how you get a list of the row labels

Index(['row1', 'row2'], dtype='object')

dictionary2 = {'a':{'row1':3, 'row2':2}, 'b':{'row3':5,'row4':2}, 'c':{'row5':2.5,'row6':1}}
df2 = pd.DataFrame(dictionary2)
df2
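When the inner dictionaries do not share row labels, as in dictionary2, pandas builds the index from all of the labels it sees and fills the gaps with NaN. A small sketch to confirm this (the data is the same dictionary as above; only the two checks are new):

```python
import pandas as pd

dictionary2 = {'a': {'row1': 3, 'row2': 2},
               'b': {'row3': 5, 'row4': 2},
               'c': {'row5': 2.5, 'row6': 1}}
df2 = pd.DataFrame(dictionary2)

print(df2.shape)         # six distinct row labels, three columns
print(df2.isna().sum())  # number of missing cells in each column
```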

Read a CSV file

#pandas has some great methods to read .csv files
pd.read_csv?

ms = pd.read_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec.csv") #raw string (r"...") keeps the backslashes in the Windows path literal
ms

Read an Excel file

#pandas has read_excel method to read excel files
pd.read_excel?

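#note: read_excel expects an Excel workbook (for example .xlsx), so check the file extension if this call fails
excelf = pd.read_excel(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\excelfile.csv")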
excelf

Inspecting DataFrames

Pandas provides some simple methods to look at your dataframes:

[your_dataframe_name].head(5) will provide the first 5 rows
[your_dataframe_name].tail(10) will provide the last 10 rows
[your_dataframe_name].describe() is a quick way to get summary statistics on a per-column basis

You can find more useful pandas functions [here]

ms.head() #this will give the first 5 rows by default. You can add any number in the () to get that number of rows

ms.tail(10) #and the last 10 rows

ms.describe() #this is a quick way to get summary statistics on a per-column basis

#What do you notice about the number of columns returned by describe vs that in the entire dataframe...
ms.shape

(216, 183)

ms.columns

missing = []
des_cols = ms.describe().columns
for col in ms.columns:
    if col in des_cols:
        print('found: '+ col)
    else:
        missing.append(col)

missing
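The columns collected in missing are the non-numeric ones: by default, describe() summarizes only numeric columns. A minimal sketch on a made-up mixed-type table (the column names here are illustrative and not taken from mass_spec.csv):

```python
import pandas as pd

# Hypothetical mixed-type table
df = pd.DataFrame({
    "peptide": ["AAK", "SVR", "QLK", "SVK"],
    "charge": [2, 3, 2, 3],
    "mz": [400.2, 512.7, 388.9, 450.1],
})

numeric_summary = df.describe()            # numeric columns only by default
full_summary = df.describe(include="all")  # also summarizes the text column

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```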

pd.set_option('display.max_rows', 50) #This will set the number of rows you can "see" in the jupyter notebook when you inspect a dataframe
pd.set_option('display.max_columns', 200) #This will set the number of columns you can "see" in the jupyter notebook when you inspect a dataframe

ms.describe() #notice the difference in the number of columns you can see

Slicing DataFrames

You can access subsets of your dataframe (views) in a few different ways, but we will focus on two here.

Name-based indexing

You provide row labels and column labels to the .loc[row_names, col_names] indexer. Each can be a single label, a list of labels, or a slice.
example: [your_dataframe_name].loc[my_row_names, my_col_names]

Index-based indexing

You provide the row and column numbers to the .iloc[row_numbers, col_numbers] indexer
example: [your_dataframe_name].iloc[my_row_numbers, my_col_numbers]
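One difference between the two indexers is easy to miss: a label slice with .loc includes its end label, while a position slice with .iloc excludes its end position. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["w":"y", "a"]  # label slice: includes the end label "y"
by_position = df.iloc[0:2, 0]    # position slice: excludes position 2

print(len(by_label), len(by_position))
```

This is why ms.loc[:9, ...] and ms.iloc[:10, ...] both return ten rows when the row labels are the default 0, 1, 2, ...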

ms.loc[:,'Protein Name'] #get all rows (:), 'Protein Name' column

#How would you get the first 10 rows using .loc? (Note that here the row "names" are just numbers.)

ms.loc[:9, 'Protein Name']

ms.loc[0:10,['Protein Name', 'Protein Gene']] #what will this return?

# Note that you can pass any list of column names to the column indexer
ms.loc[:8,[col for col in ms.columns if "Protein" in col]] #what is this doing?

Side topic: get familiar with [List Comprehension]

my_list = []
for col in ms.columns:
    if "Protein" in col:
        my_list.append(col)

my_list

['Protein Name', 'Protein Preferred Name', 'Protein Gene']

my_list = [col for col in ms.columns if "Protein" in col]

my_list

['Protein Name', 'Protein Preferred Name', 'Protein Gene']

ms.loc[:5,my_list] #what is this doing?

list(ms.columns) #this provides the full list of the columns in the dataframe

# write a line to access all columns related to sample BT2_HFX_6
ms.loc[:,[col for col in ms.columns if "BT2_HFX_6" in col]]

# Now let's try indexing with .iloc
ms.iloc[:5,3:9] #note the difference in how iloc and loc work!

ms.iloc[:20,'Precursor Charge'] #Will this work?

ms.iloc[:20,4]

Selecting from DataFrames

You can "search/select" data by generating "boolean" arrays based on some criteria. This works by effectively generating a column of True/False values that Pandas uses to select particular rows (those that are true). There are a few ways to generate this true/false selection column.

Value-based selections

You provide a selection criterion for a particular column. Example:

# generates the true/false array
my_dataframe['my_column']>=some_value

Is-in based selections

You provide a list of values you want to search for. Example:

subset_of_rows = my_dataframe['column_name'].isin(list_of_values) #list_of_values is already a list, e.g. ['value1', 'value2']

Other

There are lots of ways to do this - you can learn more here
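Selection criteria can also be combined. The sketch below uses a made-up table (column names and values are illustrative) to show the element-wise operators & (and), | (or), and ~ (not); each individual comparison needs its own parentheses:

```python
import pandas as pd

# Hypothetical table
df = pd.DataFrame({
    "protein": ["RL27", "RS10", "RL27", "RL2"],
    "charge": [2, 3, 3, 2],
    "mz": [650.0, 820.5, 910.2, 799.9],
})

# Each comparison yields a boolean Series aligned with the rows of df
mask = (df["charge"] == 3) & (df["mz"] > 800)
in_list = df["protein"].isin(["RL27", "RL2"])

print(df[mask])             # rows where both conditions hold
print(df[mask | ~in_list])  # rows matching mask OR not in the list
```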

Boolean Indexing

ms['Precursor Charge']==3

This is boolean indexing - you can make very complicated selection criteria to just pull out the data you want

selection_criteria = ms['Precursor Charge']==3 #now we have saved the selection criteria

selection_criteria

ms[selection_criteria] #note that only the "True" rows are selected

ms[ms['Precursor Charge']==3]

# Try to select all of the rows with "light Precursor Mz" greater than 800, and do it in one line.
ms[ms['light Precursor Mz']>800]

ms[ms['Peptide Modified Sequence'].str.contains('Q')][['Protein Preferred Name', 'Peptide Modified Sequence']]

ms[ms['Peptide Modified Sequence'].str.contains('SV')]

# Edit the above to only get peptides with the motif 'SV' and only output the columns of interest
ms[ms['Peptide Modified Sequence'].str.contains('SV')][['Protein Preferred Name', 'Peptide Modified Sequence']]

# now let's try using "isin"
ms[ms['Protein Preferred Name'].isin(['RL27_ECOLI'])]

Editing DataFrames

You can trivially add new columns or change the values in an existing column.

Add a new column

Be sure that the column you are adding has the same indices as the existing dataframe.
This is most easily accomplished by manipulating an existing column and saving the result
Example: dataframe['new_column_name'] = dataframe['old_column']*value
Example: dataframe['new_column_name'] = dataframe['old_column1']*dataframe['old_column2']

Alter a column

You select a set of cells using the tools from above and change their values
Note that dataframes are MUTABLE!
dataframe.loc[selection,selection] = 7

Dealing with missing data

The dataframe.dropna() and .fillna() functions are super helpful in removing/replacing missing values
Example: only_complete_rows = dataframe.dropna(how='any')
Example: replace_with_0 = dataframe.fillna(value=0.0)
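The three kinds of edits described above can be sketched together on a tiny made-up DataFrame (the column names x, y, and product are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

df["product"] = df["x"] * df["y"]          # new column; NaN propagates through the arithmetic
only_complete_rows = df.dropna(how="any")  # keep only rows with no missing values
replace_with_0 = df.fillna(value=0.0)      # returns a new frame; df itself is unchanged

df.loc[df["x"] > 2, "x"] = 7               # in-place edit of the selected cells

print(only_complete_rows.shape)
print(replace_with_0)
```

Note that dropna() and fillna() return new DataFrames, while the .loc assignment changes df itself.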

# what is this line doing?
ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)

ms[['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]
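The assignment above converts an m/z value observed at charge z into the value the same peptide would show at charge +1: multiplying by z recovers the total mass including z charge carriers, and subtracting (z-1)*1.0078 removes all but one of them. A worked sketch with made-up numbers (the function name and its arguments are hypothetical; 1.0078 is the constant used in the line above):

```python
def plus_one_charge_mass(mz, charge, carrier_mass=1.0078):
    """Value at charge +1 for a peptide observed at m/z `mz` with charge `charge`."""
    return mz * charge - (charge - 1) * carrier_mass

# A peptide observed at m/z 500.5 with charge 2: 500.5*2 - 1*1.0078
print(plus_one_charge_mass(500.5, 2))
# At charge 1 nothing is subtracted, so the value is unchanged
print(plus_one_charge_mass(750.0, 1))
```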

#think through what this line is doing
ms.loc[ms['light +1 charge mass']>2000,['Peptide Modified Sequence', 'light Precursor Mz', 'Precursor Charge', 'light +1 charge mass']]

ms.loc[ms['light +1 charge mass']>2000,'light +1 charge mass'] = 'way too big!'

#you can quickly save your work as a .csv using the command .to_csv(path_to_file)
ms.to_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\mass_spec_new.csv")
#look in your directory for a new .csv file!

Matplotlib

About Matplotlib

Matplotlib is the primary plotting library in Python. It makes easy things easy, and hard things possible. You can provide it lists or numpy arrays and it can generate virtually any plot you'd like.

Please refer to the Matplotlib introduction page.

Basic Plotting

Getting Ready

Import packages

Read in data

Line Plot

For list of named colors see

Change plotting style and add legend

Bar Plot

add a horizontal line

plot horizontal bars

Histogram Plot

Box Plot

Define a title

Boxplot by celltype

Better layout by excluding the automatic title

More sophisticated plotting can better reveal the trend

Advanced Plotting

Getting Ready

dat2 = pd.read_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\meanByClass.txt", sep=r'\s+') #columns are separated by whitespace

dat2

Explore a fake gene expression data modified from iris.csv

rpkm = pd.read_csv(r"C:\Users\duan\Desktop\PythonDataProcessingVisualization\fakeExpressionDat.csv")

rpkm

Advanced Line Plot

plt.figure(); dat2.plot(); plt.legend(loc='best')

Get rid of the legend

dat2.plot(legend=False)

Separate the features

dat2.plot(subplots=True, figsize=(6, 6)); plt.legend(loc='best')

Plotting on a Secondary Y-axis

plt.figure()
dat2.WtTypeA.plot(color="b")
dat2.WtTypeB.plot(color="turquoise")
dat2.KOTypeA.plot(color="r")
dat2.KOTypeB.plot(color="pink")
dat2.replicate.plot(secondary_y=True, style='g')

Plot a subset of columns

plt.figure()
dat2.WtTypeA.plot(color="b")
dat2.WtTypeB.plot(color="turquoise")
dat2.KOTypeA.plot(color="r")
dat2.KOTypeB.plot(color="pink")

Selective Plotting on Secondary Y-axis

plt.figure()
dat3=dat2.drop(['replicate'], axis = 1)
ax = dat3.plot(secondary_y=['WtTypeA', 'KOTypeA']) #column names are case-sensitive
ax.set_ylabel('TypeB scale')
ax.right_ax.set_ylabel('TypeA scale')

Targeting different subplots by passing an ax argument

fig, axes = plt.subplots(nrows=2, ncols=2)
dat2['WtTypeA'].plot(ax=axes[0,0]); axes[0,0].set_title('WtTypeA')
dat2['KOTypeA'].plot(ax=axes[0,1]); axes[0,1].set_title('KOTypeA')
dat2['WtTypeB'].plot(ax=axes[1,0]); axes[1,0].set_title('WtTypeB')
dat2['KOTypeB'].plot(ax=axes[1,1]); axes[1,1].set_title('KOTypeB')

Adjusting spacing between subplots

fig, axes = plt.subplots(nrows=2, ncols=2)
dat2['WtTypeA'].plot(ax=axes[0,0]); axes[0,0].set_title('WtTypeA')
dat2['KOTypeA'].plot(ax=axes[0,1]); axes[0,1].set_title('KOTypeA')
dat2['WtTypeB'].plot(ax=axes[1,0]); axes[1,0].set_title('WtTypeB')
dat2['KOTypeB'].plot(ax=axes[1,1]); axes[1,1].set_title('KOTypeB')
plt.subplots_adjust(left=0.1,
                    bottom=0.1,
                    right=0.9,
                    top=0.9,
                    wspace=0.4,
                    hspace=0.4)

Advanced Bar Plots

Looking at one replicate at a time

plt.figure();
dat2.iloc[1].plot(kind='bar'); plt.axhline(0, color='k')

Looking at all replicates at the same time

plt.figure();
dat2.plot(kind='bar'); plt.axhline(0, color='k')

plt.figure();
dat2.plot(kind='bar', colormap='Greens')

Stacked bars

dat3.plot(kind='bar', stacked=True);

Advanced Histogram

plt.figure()
dat.hist(by="genotype", figsize=(6, 4),bins=20)

Scatter Plot

from pandas.plotting import scatter_matrix
rpkm = pd.read_csv(r"C:\Users\duan\Desktop\IntroductionToMatplotlib\fakeExpressionDat.csv")
rpkm

scatter_matrix(rpkm, alpha=0.9, figsize=(6, 6), diagonal='kde')

Parallel Coordinates

Parallel coordinates is a technique for plotting multivariate data. It allows one to see clusters in data and to estimate other statistics visually. With parallel coordinates, points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together.

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='gist_rainbow')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='spring')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='autumn')

Andrews Curves

Andrews Curves are smoothed versions of Parallel Coordinates

from pandas.plotting import andrews_curves

plt.figure()
andrews_curves(rpkm, 'pathway')
plt.show()

A potential issue when plotting a large number of columns is that it can be difficult to distinguish some series due to repetition in the default colors. To remedy this, we can either pass an explicit list of colors sampled from a colormap such as cm.rainbow, or use the colormap= argument supported by DataFrame plotting, which accepts either a Matplotlib colormap or a string that is the name of a colormap registered with Matplotlib.
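As a minimal sketch of the first option, a colormap can be called with numbers between 0 and 1 to obtain evenly spaced colors. The number of classes (3) is an assumption chosen for illustration, and the Agg backend is selected only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import matplotlib.pyplot as plt

n_classes = 3  # assumed number of categories to color
cmap = plt.get_cmap("rainbow")
# Calling the colormap with a number in [0, 1] returns one RGBA tuple
colors = [cmap(x) for x in np.linspace(0, 1, n_classes)]

for rgba in colors:
    print(rgba)
```

A list like colors can then be passed through the color= argument of the plotting functions in this section.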

from matplotlib import cm #provides the rainbow colormap used below
plt.figure()
andrews_curves(rpkm, 'pathway', color=[cm.rainbow(i) for i in np.linspace(0, 1, 3)])
plt.show()

plt.figure()
andrews_curves(rpkm, 'pathway',colormap='jet')
plt.show()

plt.figure()
andrews_curves(rpkm, 'pathway',colormap="winter")
plt.show()

RadViz

from pandas.plotting import radviz
plt.figure()
radviz(rpkm, 'pathway')
plt.show()

from pandas.plotting import radviz
plt.figure()
radviz(rpkm, 'pathway',colormap="Set1")
plt.show()

Seaborn

About Seaborn

Seaborn is an extension of matplotlib and pandas. It makes a well-defined set of hard things easy too. Seaborn works directly with pandas dataframes as inputs; if you have lists or numpy arrays, you should probably convert those to pandas dataframes before plotting (or use matplotlib).
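A minimal sketch of such a conversion, assuming a made-up array with one column per condition (the names control, treated, condition, and signal are hypothetical). Reshaping to long form, with one row per observation, is often the most convenient layout for seaborn's data=, x=, and y= arguments:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: 4 samples (rows) x 2 conditions (columns)
arr = np.array([[1.0, 2.0], [1.5, 2.5], [0.8, 2.2], [1.2, 1.9]])

wide = pd.DataFrame(arr, columns=["control", "treated"])
# Long form: one row per observation, with the condition stored in its own column
long = wide.melt(var_name="condition", value_name="signal")

print(long.head())
# The long table could then be plotted with, for example:
# sns.boxplot(data=long, x="condition", y="signal")
```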

See Seaborn introduction

Seaborn has an unreal number of built-in plots. It's best to just explore, but we'll go through some quick examples.

Basic Plotting

Getting Started

import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
sns.set_context('talk') #on your screen, replace 'talk' with 'notebook' or 'paper' or 'poster'
%pylab inline

Line Plot

Please see seaborn.lineplot() function [here]

Bar Plot

Please see seaborn.barplot() function [here]

Histogram Plot

Please see seaborn.histplot() function [here]

Box Plot

Please see seaborn.boxplot() function [here]

Visualizing Statistics

Visualizing Proteomics Data

Read in data

ms = pd.read_csv(r"C:\Users\duan\Desktop\IntroductionToSeaborn\mass_spec_new.csv")

ms

a = sns.histplot(ms['light Precursor Mz'], bins=20) #simple, right?

a = sns.histplot(data=ms, x='light Precursor Mz', bins=20, kde=True, color="lightseagreen", alpha=0.1) #make the plot more elegant

a = sns.histplot(data=ms, x='light +1 charge mass', bins=20, kde=True, color="red", alpha=0.1) #Oh no, what could be wrong?

v = ms['light +1 charge mass'].values #get all of the values
v.sort() #sort the values
v #print the values - what's wrong?

ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)

a = sns.histplot(data=ms, x='light +1 charge mass', bins=20, kde=True, color="red", alpha=0.1)

#Let's see if +2 and +3 charged peptides exhibit a different distribution.
a = sns.histplot(data=ms.loc[ms['Precursor Charge'] == 3], x='light +1 charge mass', bins=10, label='+3', kde=True, color="chocolate", alpha=0.2)
sns.histplot(data=ms.loc[ms['Precursor Charge'] == 2], x='light +1 charge mass', bins=10, ax=a, label='+2', kde=True, color="orchid", alpha=0.1)
a.legend()

#Here, I made a new column that is the ratio of light-to-heavy intensity...the details aren't important. What is important is that we have a new column of values we can plot
ms['Total Area Ratio BT2_HFX_5'] = ms['light BT2_HFX_5 Total Area']/ms['15N BT2_HFX_5 Total Area']
ms['Total Area Ratio BT2_HFX_7'] = ms['light BT2_HFX_7 Total Area']/ms['15N BT2_HFX_7 Total Area']

sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
swarm = sns.swarmplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:100,:], size=6)
pylab.tight_layout()
pylab.savefig('swarmplot.png')

sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
box = sns.boxplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:100,:])
pylab.tight_layout()
pylab.savefig('boxplot.pdf') #this will save your figure!

sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
violin = sns.violinplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:30,:])
#pylab.tight_layout()
pylab.savefig('violinplot.pdf') #this will save your figure without overwriting boxplot.pdf

Visualizing RNAseq Data

Getting Started

import pandas as pd
import numpy as np
import seaborn as sns
import glob
import matplotlib.pyplot as plt
sns.set_context('paper')
sns.set_style("whitegrid")

Volcano Plot

This is what we aim to reproduce based on the file volcano_data.tsv. Let's read the volcano_data.tsv file into a pandas dataframe.

glob.glob(r'C:\Users\duan\Desktop\PythonDataProcessingVisualization\*.tsv') # get a list of files in your directory ending in .tsv

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\KOvsWTdiffExp.tsv', sep='\t')

inspect the 'head' of the file

vol.head()

vol.shape

(4900, 6)

Notice that the q-values in the file need to be log-transformed to match the figure. Create a new column in the vol dataframe where 'log10_q' = -log_base_10(qval) for each gene. The numpy function np.log10 is helpful.

vol['log10_q'] = -np.log10(vol['padj']) #calculate the -log10 of the q-values

output a summary of the data (use the .describe() function)

vol.describe()

We want to plot and color the genes that increase in knockout, decrease with knockout, and show no significant change (q-val > 0.05). Let's categorize our data.

Create a new column called 'data_category' with entries 'increases in Knockout', 'decreases in Knockout', and 'not significant'. Set these values appropriately for each gene. Hint: inspect the 'log2FoldChange' and 'padj' fields to determine each case.
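One possible way to build this column is sketched below on a tiny made-up table that reuses the column names 'log2FoldChange' and 'padj'. The values, the name vol_demo, the use of np.select, and the assumption that a positive log2FoldChange means an increase in the knockout are all illustrative, not the course solution:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the vol dataframe
vol_demo = pd.DataFrame({
    "geneID": ["g1", "g2", "g3", "g4"],
    "log2FoldChange": [2.1, -1.7, 0.3, -0.2],
    "padj": [0.001, 0.01, 0.40, 0.90],
})

significant = vol_demo["padj"] <= 0.05
conditions = [
    significant & (vol_demo["log2FoldChange"] > 0),
    significant & (vol_demo["log2FoldChange"] < 0),
]
choices = ["increases in Knockout", "decreases in Knockout"]

# Rows matching neither condition fall through to the default label
vol_demo["data_category"] = np.select(conditions, choices, default="not significant")
print(vol_demo)
```

The same result can be reached with three .loc assignments, one per category, as long as the 'not significant' assignment is applied last.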

sns.scatterplot(data=vol, y='log10_q', x='log2FoldChange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases in Knockout':"red", 'decreases in Knockout':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

save a list of the significant genes and their qvalues (any gene_ids with qval<0.05), and output this list to a .csv file

sig_genes = vol.loc[vol['padj']<.05, ['geneID', 'padj', 'log2FoldChange']]
sig_genes.shape

(2582, 3)

sig_genes.head(8)

Save significant genes to a csv file

sig_genes.to_csv(r'C:\Users\duan\Desktop\PythonDataProcessingVisualization\significant_hits.csv')

Heatmap

Next, we are going to reproduce the heatmap below. This clustered heat map aims to show groups of genes whose transcript levels are coordinated across age or mutant background.

The plotting will be based on rpkm.tsv which contains ~600 significant genes we want to inspect.

read in the 'rpkm.tsv' file as a pandas dataframe, save it as dataset

dataset = pd.read_csv(r'C:\Users\duan\Desktop\PythonDataProcessingVisualization\rpkm.tsv', sep='\t')

briefly inspect the dataframe for shape, general entries, and summary statistics

dataset.shape

(600, 7)

dataset.head()

dataset.describe()

People often plot 'row median centered' values for each gene. This means they divide each row by the median value across that entire row. They also log-transform the result.

Calculate the median value for each gene (row)

row_medians = dataset.median(axis=1,numeric_only=True)

Now create a copy of the dataset, and save it as dataset_row_norm. Row median center and log2-transform each gene in dataset_row_norm.

dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
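The loop above can also be written without iterating over columns. The sketch below is an assumed equivalent on a tiny made-up table (the names demo, values, and row_norm are hypothetical), using the same 0.1 pseudocount:

```python
import numpy as np
import pandas as pd

# Hypothetical expression table: first column is the gene label
demo = pd.DataFrame({
    "geneID": ["g1", "g2"],
    "s1": [1.0, 10.0],
    "s2": [2.0, 20.0],
    "s3": [4.0, 40.0],
})

values = demo.set_index("geneID")    # keep only the numeric columns
row_medians = values.median(axis=1)  # one median per gene
# Divide every row by its own median (axis=0 aligns on the row index),
# then log2-transform the ratios
row_norm = np.log2(values.add(0.1).div(row_medians.add(0.1), axis=0))

print(row_norm)
print(row_norm.median(axis=1))  # 0 for each gene here, since there are three samples
```

With an even number of samples the row median is the average of the two middle values, so the median after the log transform is close to, but not necessarily exactly, 0.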

Look at the summary statistics now. Also, look at a few random gene rows and make sure the result is sensible. Finally, look at the .head() to see how it is indexed.

dataset_row_norm.describe()

dataset_row_norm.iloc[200,1:].median() #test some random rows, make sure the median value is close to 0

-0.04326176484815142

dataset_row_norm.head()

change the indexing to use the gene name instead of the row number

dataset_row_norm = dataset_row_norm.set_index('geneID')

Now let's plot the full dataset. Use seaborn clustermap to generate a heat map and cluster each row. Look at the seaborn clustermap documentation to figure out what arguments to pass

import fastcluster

sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

Another Example

Sometimes you need additional work to make a nice heatmap. See the example below:

Read in the example data

dataset = pd.read_csv(r'C:\Users\duan\Desktop\PythonDataProcessingVisualization\TPM_reads_raw.tsv', sep='\t')

Examine the data

dataset.shape

(5823, 36)

dataset.head()

dataset.describe()

Prepare the data for heatmap plotting

row_medians = dataset.median(axis=1,numeric_only=True) #calculate the median of each row
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
dataset_row_norm = dataset_row_norm.set_index('gene')

Heatmap plotting

sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

It is necessary to 'zoom in' on the genes that showed significant changes. People sometimes cap their fold changes at -1.5 and +1.5; we'll do the same.

Start by copying our row-normalized dataframe to a new dataframe and setting all values greater than 1.5 to 1.5, and all values less than -1.5 to -1.5. Save this as capped_row_norm_dataset.

capped_row_norm_dataset = dataset_row_norm.copy()
capped_row_norm_dataset[capped_row_norm_dataset>1.5] = 1.5
capped_row_norm_dataset[capped_row_norm_dataset<-1.5] = -1.5
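The two boolean assignments above can also be expressed in one step with DataFrame.clip. A minimal sketch on made-up values (the names demo and capped are hypothetical):

```python
import pandas as pd

# Hypothetical log2 ratios, some outside the [-1.5, 1.5] range
demo = pd.DataFrame({"s1": [-3.0, 0.2, 2.4], "s2": [1.5, -1.6, 0.0]})

capped = demo.clip(lower=-1.5, upper=1.5)  # returns a new frame; demo is unchanged

# After capping, the overall minimum and maximum should sit at the limits
print(capped.min().min(), capped.max().max())
```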

Look at the summary statistics to see if this worked


Read volcano plot file

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\volcano_data.tsv', sep='\t')

Inspect volcano plot file

vol.head()

vol.shape

(13547, 4)

Prepare for volcano plotting

vol['log10_q'] = -np.log10(vol['qval']) #calculate the -log10 of the q-values

vol.describe()

vol.loc[vol['log2foldchange']>0,'data_category'] = 'increases with age'
vol.loc[vol['log2foldchange']<0,'data_category'] = 'decreases with age'
vol.loc[vol['qval']>0.05,'data_category'] = 'not significant'

Volcano plotting

sns.scatterplot(data=vol, y='log10_q', x='log2foldchange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases with age':"red", 'decreases with age':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

Identify significant genes

sig_genes = vol.loc[vol['qval']<.05, ['gene_id', 'qval', 'log2foldchange']]

Now we need to pull out just the genes that show significant age-dependence. Make a list of all the gene names that are in both dataset_row_norm and the list of significant genes (sig_genes) from above. Save this list as genes_to_cluster.

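#get the gene names that are the same between the datasets
sig_gene_ids = set(sig_genes['gene_id']) #compare against the gene_id column only; a set makes the lookup fast
genes_to_cluster = [gene for gene in capped_row_norm_dataset.index if gene in sig_gene_ids]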

use the .loc function to pull out just the rows of genes we want to cluster, and see how many genes that is

capped_row_norm_dataset.loc[genes_to_cluster, :].shape

(1731, 35)

Almost there - now make your heatmap!

sns.clustermap(capped_row_norm_dataset.loc[genes_to_cluster, :], row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(8,8))

Visualizing RNAseq Data

Getting Started

import pandas as pd
import numpy as np
import seaborn as sns
import glob
import matplotlib.pyplot as plt
sns.set_context('paper')
sns.set_style("whitegrid")

Volcano Plot

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\KOvsWTdiffExp.tsv', sep='\t')

inspect the 'head' of the file

vol.head()

vol.shape

(4900, 6)

vol['log10_q'] = -np.log10(vol['padj']) #calculate the -log2 of qvalues

output a summary of the data (use the .describe() function)

vol.describe()

We want to plot and color the genes that increase in knockout, decrease with knockout, and show no significant change (q-val > 0.05). Let's categorize our data

sns.scatterplot(data=vol, y='log10_q', x='log2FoldChange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases in Knockout':"red", 'decreases in Knockout':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

save a list of the significant genes and their qvalues (any gene_ids with qval<0.05), and output this list to a .csv file

sig_genes = vol.loc[vol['padj']<.05, ['geneID', 'padj', 'log2FoldChange']]
sig_genes.shape

(2582, 3)

sig_genes.head(8)

Save significant genes to a csv file

sig_genes.to_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\significant_hits.csv')

Heatmap

Next, we are going to reproduce a heatmap below. This clustered heat map aims to show groups of genes whose transcript levels are coordinated across age or mutant background

The plotting will be based on rpkm.tsv which contains ~600 significant genes we want to inspect.

read in the 'rpkm.tsv' file as a pandas dataframe, save it as dataset

dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\rpkm.tsv', sep='\t')

briefly inspect the dataframe for shape, general entries, and summary statistics

dataset.shape

(600,7)

dataset.head()

dataset.describe()

People often plot using 'row median centered' each gene This means they divided each row by the median value across that entire row. They also log-transformed that result

Calculate the median value for each gene (row)

row_medians = dataset.median(axis=1,numeric_only=True)

Now create a copy of the dataset, and save it as dataset_row_norm Row median center and log2-transform each gene in dataset_row_norm

dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))

look at the summary statistics now Also, look at a few random gene rows and make sure the result in sensible Finally, look at the .head() to see how it is indexed

dataset_row_norm.describe()

dataset_row_norm.iloc[200,1:].median() #test some random rows, make sure the median value is 0

-0.04326176484815142

dataset_row_norm.head()

change the indexing to use the gene name instead of the row number

dataset_row_norm = dataset_row_norm.set_index('geneID')

Now let's plot the full dataset. Use seaborn clustermap to generate a heat map and cluster each row. Look at the seaborn clustermap documentation to figure out what arguments to pass

import fastcluster

sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

Another Example

sometimes you need additional work to make a nice heatmap. See the example below:

Read in the example data

dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\TPM_reads_raw.tsv', sep='\t')

Examine the data

dataset.shape

(5823, 36)

dataset.head()

dataset.describe()

Prepare the data for heatmap plotting

row_medians = dataset.median(axis=1,numeric_only=True) #calculate the median of each row
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
dataset_row_norm = dataset_row_norm.set_index('gene')

Heatmap plotting

sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

It is necessary to 'zoom' in on the genes that showed significant changes. Sometimes people 'capped' their fold changes at -1.5 and +1.5, we'll do the same.

Start by copying our row-normalized dataframe to a new dataframe and setting all values greater than 1.5 to 1.5, and all less than -1.5 to -1.5 Save this as capped_row_norm_dataset

capped_row_norm_dataset = dataset_row_norm.copy()
capped_row_norm_dataset[capped_row_norm_dataset>1.5] = 1.5
capped_row_norm_dataset[capped_row_norm_dataset<-1.5] = -1.5

Look at the summary statistics to see if this worked

Look at the summary statistics to see if this worked

Read volcano plot file

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\volcano_data.tsv', sep='\t')

Inspect volcano plot file

vol.head()

vol.shape

(13547, 4)

Prepare for volcano plotting

vol['log10_q'] = -np.log10(vol['qval']) #calculate the -log2 of qvalues

vol.describe()

vol.loc[vol['log2foldchange']>0,'data_category'] = 'increases with age'
vol.loc[vol['log2foldchange']<0,'data_category'] = 'decreases with age'
vol.loc[vol['qval']>0.05,'data_category'] = 'not significant'

Volcano plotting

sns.scatterplot(data=vol, y='log10_q', x='log2foldchange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases with age':"red", 'decreases with age':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

Identify significant genes

sig_genes = vol.loc[vol['qval']<.05, ['gene_id', 'qval', 'log2foldchange']]

#get the genenames that are the same between the datasets
genes_to_cluster = [gene for gene in capped_row_norm_dataset.index if gene in sig_genes.values]

use the .loc function to pull out just the rows of genes we want to cluster, and see how many genes that is

capped_row_norm_dataset.loc[genes_to_cluster, :].shape

(1731, 35)

Almost there - now make your heatmap!

sns.clustermap(capped_row_norm_dataset.loc[genes_to_cluster, :], row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(8,8))

Advanced Plotting

Getting Ready

dat2=pd.read_csv("C:\\Users\duan\Desktop\PythonDataProcessingVisualization\meanByClass.txt", sep='\s+')

dat2

Explore a fake gene expression data modified from iris.csv

rpkm=pd.read_csv("C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\fakeExpressionDat.csv")

rpkm

Advanced Line Plot

plt.figure(); dat2.plot(); plt.legend(loc='best')

Get rid of the legend

dat2.plot(legend=False)

Separate the features

dat2.plot(subplots=True, figsize=(6, 6)); plt.legend(loc='best')

Plotting on a Secondary Y-axis

plt.figure()
dat2.WtTypeA.plot(color="b")
dat2.WtTypeB.plot(color="turquoise")
dat2.KOTypeA.plot(color="r")
dat2.KOTypeB.plot(color="pink")
dat2.replicate.plot(secondary_y=True, style='g')

Plot a subset of columns

plt.figure()
dat2.WtTypeA.plot(color="b")
dat2.WtTypeB.plot(color="turquoise")
dat2.KOTypeA.plot(color="r")
dat2.KOTypeB.plot(color="pink")

Selective Plotting on Secondary Y-axis

plt.figure()
dat3=dat2.drop(['replicate'], axis = 1)
ax = dat3.plot(secondary_y=['wtTypeA', 'KOTypeA'])
ax.set_ylabel('TypeB scale')
ax.right_ax.set_ylabel('TypeA scale')

Targeting different subplots by passing an ax argument

fig, axes = plt.subplots(nrows=2, ncols=2)
dat2['WtTypeA'].plot(ax=axes[0,0]); axes[0,0].set_title('WtTypeA')
dat2['KOTypeA'].plot(ax=axes[0,1]); axes[0,1].set_title('KOTypeA')
dat2['WtTypeB'].plot(ax=axes[1,0]); axes[1,0].set_title('WtTypeB')
dat2['KOTypeB'].plot(ax=axes[1,1]); axes[1,1].set_title('KOTypeB')

Adjusting spacing between subplots

fig, axes = plt.subplots(nrows=2, ncols=2)
dat2['WtTypeA'].plot(ax=axes[0,0]); axes[0,0].set_title('WtTypeA')
dat2['KOTypeA'].plot(ax=axes[0,1]); axes[0,1].set_title('KOTypeA')
dat2['WtTypeB'].plot(ax=axes[1,0]); axes[1,0].set_title('WtTypeB')
dat2['KOTypeB'].plot(ax=axes[1,1]); axes[1,1].set_title('KOTypeB')
plt.subplots_adjust(left=0.1,
                    bottom=0.1,
                    right=0.9,
                    top=0.9,
                    wspace=0.4,
                    hspace=0.4)
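
As an alternative to hand-tuned margins, matplotlib can work out the spacing itself with fig.tight_layout(). The sketch below assumes made-up data in a hypothetical toy DataFrame whose column names mirror dat2, and loops over the axes rather than writing one line per panel.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data; only the column names mirror dat2
toy = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["WtTypeA", "KOTypeA", "WtTypeB", "KOTypeB"],
)

fig, axes = plt.subplots(nrows=2, ncols=2)

# axes.flat walks the 2x2 grid row by row, pairing each panel with a column
for ax, col in zip(axes.flat, toy.columns):
    toy[col].plot(ax=ax)
    ax.set_title(col)

# Let matplotlib pick margins so titles and tick labels do not overlap
fig.tight_layout()
```

plt.subplots_adjust remains the better choice when exact margins are needed, for example to match panel sizes across several figures.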

Advanced Bar Plots

Looking at one replicate at a time

plt.figure();
dat2.iloc[1].plot(kind='bar'); plt.axhline(0, color='k')

Looking at all replicates at the same time

dat2.plot(kind='bar'); plt.axhline(0, color='k')

dat2.plot(kind='bar', colormap='Greens')

Stacked bars

dat3.plot(kind='bar', stacked=True);

Advanced Histogram

plt.figure()
dat.hist(by="genotype", figsize=(6, 4),bins=20)

Scatter Plot

from pandas.plotting import scatter_matrix
rpkm=pd.read_csv(r"C:\Users\duan\Desktop\IntroductionToMatplotlib\fakeExpressionDat.csv")
rpkm

scatter_matrix(rpkm, alpha=0.9, figsize=(6, 6), diagonal='kde')

Parallel Coordinates

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='gist_rainbow')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='spring')

from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(rpkm, 'pathway',colormap='autumn')

Andrews Curves

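Andrews curves represent each observation as a smooth curve: the feature values are used as the coefficients of a finite Fourier series, so observations with similar features trace similar curves. They can be read as a smoothed counterpart of parallel coordinates.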

from pandas.plotting import andrews_curves

plt.figure()
andrews_curves(rpkm, 'pathway')
plt.show()

from matplotlib import cm  # colormap module, needed for cm.rainbow below

plt.figure()
andrews_curves(rpkm, 'pathway',color = [cm.rainbow(i) for i in np.linspace(0, 1, 3)])
plt.show()

plt.figure()
andrews_curves(rpkm, 'pathway',colormap='jet')
plt.show()

plt.figure()
andrews_curves(rpkm, 'pathway',colormap="winter")
plt.show()
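
To make the idea behind Andrews curves concrete, the sketch below evaluates the standard Andrews function, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., for a single observation. The helper name andrews_curve and the example feature values are assumptions made for illustration; this is not the pandas implementation, only a hand-written version of the formula.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews function for one observation x at angles t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term: first feature divided by sqrt(2)
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # Remaining features alternate sin and cos, raising the harmonic every two terms
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

# Hypothetical observation with four features
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
```

Plotting curve against t for every row of a table, coloured by class, gives the kind of picture that andrews_curves draws.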

RadViz

from pandas.plotting import radviz
plt.figure()
radviz(rpkm, 'pathway')
plt.show()

from pandas.plotting import radviz
plt.figure()
radviz(rpkm, 'pathway',colormap="Set1")
plt.show()
