Read in data
ms=pd.read_csv("C:\\Users\duan\Desktop\IntroductionToSeaborn\mass_spec_new.csv")msa = sns.histplot(ms['light Precursor Mz'], bins=20) #simple, right?a = sns.histplot(data=ms,x='light Precursor Mz', bins=20, kde="TRUE",color="lightseagreen",alpha=0.1)#make the plot more eleganta = sns.histplot(data=ms,x='light +1 charge mass', bins=20, kde="TRUE",color="red",alpha=0.1) #O no, what could be wrong?v = ms['light +1 charge mass'].values #get all of the values
v.sort() #sort the values
v #print the values - what's wrong?
ms['light +1 charge mass']=ms['light Precursor Mz']*ms['Precursor Charge'] - ((ms['Precursor Charge']-1)*1.0078)a = sns.histplot(data=ms,x='light +1 charge mass', bins=20, kde="TRUE",color="red",alpha=0.1)#Let's see if +2 and +3 charged peptides exhibit a different distribution.
a = sns.histplot(data=ms.loc[ms['Precursor Charge'] == 3], x='light +1 charge mass', bins=10, label='+3',kde="TRUE",color="chocolate",alpha=0.2)
sns.histplot(data=ms.loc[ms['Precursor Charge'] == 2], x='light +1 charge mass', bins=10, ax=a, label='+2',kde="TRUE",color="orchid",alpha=0.1)
a.legend()#Here, I made a new column that is the ratio of light-to-heavy intensity...the details aren't important. What is important is that we have a new column of values we can plot
ms['Total Area Ratio BT2_HFX_5'] = ms['light BT2_HFX_5 Total Area']/ms['15N BT2_HFX_5 Total Area']
ms['Total Area Ratio BT2_HFX_7'] = ms['light BT2_HFX_7 Total Area']/ms['15N BT2_HFX_7 Total Area']sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
swarm = sns.swarmplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:100,:], size=6)
pylab.tight_layout()
pylab.savefig('swarmplot.png')sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
box = sns.boxplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:100,:])
pylab.tight_layout()
pylab.savefig('boxplot.pdf') #this will save your figure!sns.set_context('poster') #you don't need to do this - it's just to make the figures easier to see
f = pylab.figure(figsize=(20,10))
box = sns.violinplot(x='Protein Gene', y='Total Area Ratio BT2_HFX_5', data=ms.loc[0:30,:])
#pylab.tight_layout()
pylab.savefig('boxplot.pdf') #this will save your figure!










Seaborn is an extension of matplotlib and pandas. It makes a well-defined set of hard things easy too Seaborn works directly with Pandas dataframes as inputs if you have lists or numpy arrays, you should probably convert those to pandas dataframes to plot (or use matplotlib)
See Seaborn introduction here
Seaborn has an unreal number of built-in plots. It's best to just explore the documentation online, but we'll go through some quick examples
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
sns.set_context('talk') #on your screen, replace 'talk' with 'notebook' or 'paper' or 'poster'
%pylab inlinePlease refer to Visualizing statistical relationships
Please refer to Visualizing distributions of data
Please refer to Plotting with categorical data
Please refer to Visualizing regression models
import pandas as pd
import numpy as np
import seaborn as sns
import glob
import matplotlib.pyplot as plt
sns.set_context('paper')
sns.set_style("whitegrid")This is what we aim to reproduce basing on the file volcano_data.tsv. Let's read the volcano_data.tsv file into a pandas dataframe. glob.glob('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\*.tsv') # get a list of files in your directory ending in .tsv
vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\KOvsWTdiffExp.tsv', sep='\t')inspect the 'head' of the file
vol.head()vol.shape(4900, 6)
Notice that the qvalues in the file need to be log-transformed to match the figure Create a new column in the vol dataframe where 'log10_q' = -log_base_10(qval) for each gene...the numpy function np.log10 is helpful
vol['log10_q'] = -np.log10(vol['padj']) #calculate the -log2 of qvaluesoutput a summary of the data (use the .describe() function)
vol.describe()We want to plot and color the genes that increase in knockout, decrease with knockout, and show no significant change (q-val > 0.05). Let's categorize our data
create a new column called 'data_category' with entires 'increases in Knockout', 'decreases in Knockout', and 'not significant' set these values appropriately for each gene hint - inspect the 'log2FoldChange' value or the 'padj' fields to determine each case
sns.scatterplot(data=vol, y='log10_q', x='log2FoldChange', hue='data_category', legend='brief',
palette={'not significant':'grey', 'increases in Knockout':"red", 'decreases in Knockout':"blue"},
edgecolor='grey',s=80, linewidth=0.25)save a list of the significant genes and their qvalues (any gene_ids with qval<0.05), and output this list to a .csv file
sig_genes = vol.loc[vol['padj']<.05, ['geneID', 'padj', 'log2FoldChange']]
sig_genes.shape(2582, 3)
sig_genes.head(8)Save significant genes to a csv file
sig_genes.to_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\significant_hits.csv')Next, we are going to reproduce a heatmap below. This clustered heat map aims to show groups of genes whose transcript levels are coordinated across age or mutant background
The plotting will be based on rpkm.tsv which contains ~600 significant genes we want to inspect.
read in the 'rpkm.tsv' file as a pandas dataframe, save it as dataset
dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\rpkm.tsv', sep='\t')briefly inspect the dataframe for shape, general entries, and summary statistics
dataset.shape(600,7)
dataset.head()dataset.describe()People often plot using 'row median centered' each gene This means they divided each row by the median value across that entire row. They also log-transformed that result
Calculate the median value for each gene (row)
row_medians = dataset.median(axis=1,numeric_only=True)Now create a copy of the dataset, and save it as dataset_row_norm Row median center and log2-transform each gene in dataset_row_norm
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))look at the summary statistics now Also, look at a few random gene rows and make sure the result in sensible Finally, look at the .head() to see how it is indexed
dataset_row_norm.describe()dataset_row_norm.iloc[200,1:].median() #test some random rows, make sure the median value is 0-0.04326176484815142
dataset_row_norm.head()change the indexing to use the gene name instead of the row number
dataset_row_norm = dataset_row_norm.set_index('geneID')Now let's plot the full dataset. Use seaborn clustermap to generate a heat map and cluster each row. Look at the seaborn clustermap documentation to figure out what arguments to pass
import fastclustersns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))sometimes you need additional work to make a nice heatmap. See the example below:
Read in the example data
dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\TPM_reads_raw.tsv', sep='\t')Examine the data
dataset.shape(5823, 36)
dataset.head()dataset.describe()Prepare the data for heatmap plotting
row_medians = dataset.median(axis=1,numeric_only=True) #calculate the median of each row
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
dataset_row_norm = dataset_row_norm.set_index('gene')Heatmap plotting
sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))It is necessary to 'zoom' in on the genes that showed significant changes. Sometimes people 'capped' their fold changes at -1.5 and +1.5, we'll do the same.
Start by copying our row-normalized dataframe to a new dataframe and setting all values greater than 1.5 to 1.5, and all less than -1.5 to -1.5 Save this as capped_row_norm_dataset
capped_row_norm_dataset = dataset_row_norm.copy()
capped_row_norm_dataset[capped_row_norm_dataset>1.5] = 1.5
capped_row_norm_dataset[capped_row_norm_dataset<-1.5] = -1.5Look at the summary statistics to see if this worked
Look at the summary statistics to see if this workedRead volcano plot file
vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\volcano_data.tsv', sep='\t')Inspect volcano plot file
vol.head()vol.shape(13547, 4)
Prepare for volcano plotting
vol['log10_q'] = -np.log10(vol['qval']) #calculate the -log2 of qvaluesvol.describe()vol.loc[vol['log2foldchange']>0,'data_category'] = 'increases with age'
vol.loc[vol['log2foldchange']<0,'data_category'] = 'decreases with age'
vol.loc[vol['qval']>0.05,'data_category'] = 'not significant'Volcano plotting
sns.scatterplot(data=vol, y='log10_q', x='log2foldchange', hue='data_category', legend='brief',
palette={'not significant':'grey', 'increases with age':"red", 'decreases with age':"blue"},
edgecolor='grey',s=80, linewidth=0.25)Identify significant genes
sig_genes = vol.loc[vol['qval']<.05, ['gene_id', 'qval', 'log2foldchange']]Now we need to just pull out the genes that show significant age-dependence Make a list of all the gene names that are in both dataset_row_norm and the list of significant genes (sig_genes) from above. Save this list as genes_to_cluster
#get the genenames that are the same between the datasets
genes_to_cluster = [gene for gene in capped_row_norm_dataset.index if gene in sig_genes.values] use the .loc function to pull out just the rows of genes we want to cluster, and see how many genes that is
capped_row_norm_dataset.loc[genes_to_cluster, :].shape(1731, 35)
Almost there - now make your heatmap!
sns.clustermap(capped_row_norm_dataset.loc[genes_to_cluster, :], row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(8,8))


















