# Visualizing RNAseq Data

### Getting Started

```
import pandas as pd
import numpy as np
import seaborn as sns
import glob
import matplotlib.pyplot as plt
sns.set_context('paper')
sns.set_style("whitegrid")
```

### Volcano Plot

<figure><img src="/files/KP6gLCTOzreFsgqNrExR" alt=""><figcaption></figcaption></figure>

This is what we aim to reproduce basing on the file volcano\_data.tsv. Let's read the volcano\_data.tsv file into a pandas dataframe. `glob.glob('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\*.tsv') # get a list of files in your directory ending in .tsv`

<figure><img src="/files/la3EApPh1UrOixpAvMXf" alt=""><figcaption></figcaption></figure>

```
vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\KOvsWTdiffExp.tsv', sep='\t')
```

inspect the 'head' of the file

```
vol.head()
```

<figure><img src="/files/Z4amJpWj04XQhIz7FNp9" alt=""><figcaption></figcaption></figure>

```
vol.shape
```

(4900, 6)

Notice that the qvalues in the file need to be log-transformed to match the figure Create a new column in the vol dataframe where 'log10\_q' = -log\_base\_10(qval) for each gene...the numpy function np.log10 is helpful

```
vol['log10_q'] = -np.log10(vol['padj']) #calculate the -log2 of qvalues
```

output a summary of the data (use the .describe() function)

```
vol.describe()
```

<figure><img src="/files/oMiWne8rlPXO3cCU6Vtw" alt=""><figcaption></figcaption></figure>

We want to plot and color the genes that increase in knockout, decrease with knockout, and show no significant change (q-val > 0.05). Let's categorize our data

create a new column called 'data\_category' with entires 'increases in Knockout', 'decreases in Knockout', and 'not significant' set these values appropriately for each gene hint - inspect the 'log2FoldChange' value or the 'padj' fields to determine each case

```
sns.scatterplot(data=vol, y='log10_q', x='log2FoldChange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases in Knockout':"red", 'decreases in Knockout':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)
```

<figure><img src="/files/CEFtez9vwgMlHpfocWmU" alt=""><figcaption></figcaption></figure>

save a list of the significant genes and their qvalues (any gene\_ids with qval<0.05), and output this list to a .csv file

```
sig_genes = vol.loc[vol['padj']<.05, ['geneID', 'padj', 'log2FoldChange']]
sig_genes.shape
```

(2582, 3)

```
sig_genes.head(8)
```

<figure><img src="/files/NSvMphqdqr09P6sdhsfC" alt=""><figcaption></figcaption></figure>

Save significant genes to a csv file

```
sig_genes.to_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\significant_hits.csv')
```

### Heatmap

Next, we are going to reproduce a heatmap below. This clustered heat map aims to show groups of genes whose transcript levels are coordinated across age or mutant background

<figure><img src="/files/lM5yGQBHU4Xe5u6Nbv5p" alt=""><figcaption></figcaption></figure>

The plotting will be based on rpkm.tsv which contains \~600 significant genes we want to inspect.

read in the 'rpkm.tsv' file as a pandas dataframe, save it as dataset

```
dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\rpkm.tsv', sep='\t')
```

briefly inspect the dataframe for shape, general entries, and summary statistics

```
dataset.shape
```

(600,7)

```
dataset.head()
```

<figure><img src="/files/Y75lfy5SWy0uHsu3FKdJ" alt=""><figcaption></figcaption></figure>

```
dataset.describe()
```

<figure><img src="/files/qQIY4c49ruiGNF9JGv33" alt=""><figcaption></figcaption></figure>

People often plot using 'row median centered' each gene This means they divided each row by the median value across that entire row. They also log-transformed that result

Calculate the median value for each gene (row)

```
row_medians = dataset.median(axis=1,numeric_only=True)
```

Now create a copy of the dataset, and save it as dataset\_row\_norm Row median center and log2-transform each gene in dataset\_row\_norm

```
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
```

look at the summary statistics now Also, look at a few random gene rows and make sure the result in sensible Finally, look at the .head() to see how it is indexed

```
dataset_row_norm.describe()
```

<figure><img src="/files/ItwlG4N30N0AvYaNaHzD" alt=""><figcaption></figcaption></figure>

```
dataset_row_norm.iloc[200,1:].median() #test some random rows, make sure the median value is 0
```

-0.04326176484815142

```
dataset_row_norm.head()
```

<figure><img src="/files/NcIQKcoUlYuKuF2A8Tks" alt=""><figcaption></figcaption></figure>

change the indexing to use the gene name instead of the row number

```
dataset_row_norm = dataset_row_norm.set_index('geneID')
```

Now let's plot the full dataset. Use seaborn clustermap to generate a heat map and cluster each row. Look at the seaborn clustermap documentation to figure out what arguments to pass

```
import fastcluster
```

```
sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))
```

<figure><img src="/files/hbu0mDbFZi9CSs2jUsVZ" alt=""><figcaption></figcaption></figure>

### Another Example

sometimes you need additional work to make a nice heatmap. See the example below:

Read in the example data

```
dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\TPM_reads_raw.tsv', sep='\t')
```

Examine the data

```
dataset.shape
```

(5823, 36)

```
dataset.head()
```

<figure><img src="/files/FiLXShvqqGdhNQCs5Odo" alt=""><figcaption></figcaption></figure>

```
dataset.describe()
```

<figure><img src="/files/vTgqM0s7xXap6qlOCx3e" alt=""><figcaption></figcaption></figure>

\
Prepare the data for heatmap plotting

```
row_medians = dataset.median(axis=1,numeric_only=True) #calculate the median of each row
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
dataset_row_norm = dataset_row_norm.set_index('gene')
```

Heatmap plotting

```
sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))
```

<figure><img src="/files/fo4dTFMr5xLuJplFw4LI" alt=""><figcaption></figcaption></figure>

It is necessary to 'zoom' in on the genes that showed significant changes. Sometimes people 'capped' their fold changes at -1.5 and +1.5, we'll do the same.

Start by copying our row-normalized dataframe to a new dataframe and setting all values greater than 1.5 to 1.5, and all less than -1.5 to -1.5 Save this as capped\_row\_norm\_dataset

```
capped_row_norm_dataset = dataset_row_norm.copy()
capped_row_norm_dataset[capped_row_norm_dataset>1.5] = 1.5
capped_row_norm_dataset[capped_row_norm_dataset<-1.5] = -1.5
```

Look at the summary statistics to see if this worked

```
Look at the summary statistics to see if this worked
```

<figure><img src="/files/HCdDy60JBKpMsfz8MOVE" alt=""><figcaption></figcaption></figure>

Read volcano plot file

```
vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\volcano_data.tsv', sep='\t')
```

Inspect volcano plot file

```
vol.head()
```

<figure><img src="/files/K5Dh0j7Q3zS06qmz2Xbb" alt=""><figcaption></figcaption></figure>

```
vol.shape
```

(13547, 4)

Prepare for volcano plotting

```
vol['log10_q'] = -np.log10(vol['qval']) #calculate the -log2 of qvalues
```

```
vol.describe()
```

<figure><img src="/files/Yq2uoxU5HtHJGjoI8puz" alt=""><figcaption></figcaption></figure>

```
vol.loc[vol['log2foldchange']>0,'data_category'] = 'increases with age'
vol.loc[vol['log2foldchange']<0,'data_category'] = 'decreases with age'
vol.loc[vol['qval']>0.05,'data_category'] = 'not significant'
```

Volcano plotting

```
sns.scatterplot(data=vol, y='log10_q', x='log2foldchange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases with age':"red", 'decreases with age':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)
```

<figure><img src="/files/4GNxAT6e79jR6lyvUKiO" alt=""><figcaption></figcaption></figure>

Identify significant genes

```
sig_genes = vol.loc[vol['qval']<.05, ['gene_id', 'qval', 'log2foldchange']]
```

Now we need to just pull out the genes that show significant age-dependence Make a list of all the gene names that are in both dataset\_row\_norm and the list of significant genes (sig\_genes) from above. Save this list as genes\_to\_cluster

```
#get the genenames that are the same between the datasets
genes_to_cluster = [gene for gene in capped_row_norm_dataset.index if gene in sig_genes.values] 
```

use the .loc function to pull out just the rows of genes we want to cluster, and see how many genes that is

```
capped_row_norm_dataset.loc[genes_to_cluster, :].shape
```

(1731, 35)

Almost there - now make your heatmap!

```
sns.clustermap(capped_row_norm_dataset.loc[genes_to_cluster, :], row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(8,8))
```

<figure><img src="/files/sKU8cxWN1PvzV2oEwWnJ" alt=""><figcaption></figcaption></figure>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://igb.mit.edu/mini-courses/python/data-processing-with-python/seaborn/visualizing-rnaseq-data.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
