基于python的scanpy模块的乳腺癌单细胞数据分析

Python
935
1
1
2022-11-07
标签   Python进阶

考虑到咱们生信技能树粉丝对单细胞数据挖掘的需求,我开通了一个专栏《100个单细胞转录组数据降维聚类分群图表复现》,也亲自示范了几个,不过自己带娃,读博,时间精力有限,所以把剩余的90多个任务安排了学徒,实习生,学员。真的是太棒了,群策群力!

值得注意的是本次投稿并不是基于R语言,而是python哦!

下面申小忱的投稿

作者:xjtu申小忱

这次我们来复现一篇单细胞的文章。这篇我们只来复现细胞图谱和拟时序分析 像细胞通讯,还有富集分析还是很简单的。大家可以继续走下去,然后我们来交流讨论! 这篇全篇基于python复现。

首先介绍一下scanpy 网上有很多推文,但是大多是抄的教程,没有自己的理解。我用一句话介绍一下scanpy =pandas+dic。大家只要记住这个就可以完美驾驭!因为平时做深度学习 最常用的库就是 torch+numpy+pandas 所以无论做什么掌握pandas都是关键的!

导库 (如果你还不会安装python的模块,需要自己学一下基础语法哦)

import pandas as pd 
import numpy as np
import pandas as pd
import scanpy as sc 
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc 
这里说明一下,因为如果大家要发文章肯定需要dpi要求。所以这里可以提前设置好。相等于全局变量。
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
#sc.logging.print_versions()
#results_file = './write/pa.h5ad'
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(3, 3), facecolor='white')  
读入数据 为3个txt文件。大概每个1个G左右 分为 乳腺癌 阳性淋巴结 阴性淋巴结 3个文件
raw_UMIcounts = pd.read_table("./GSM4798908_B2019-1.expression_matrix.txt", header=0, index_col=0)
#raw_UMIcounts.head(10)
raw_UMIcounts_pl = pd.read_table("./GSM4798909_B2019-2.expression_matrix.txt", header=0, index_col=0)
raw_UMIcounts_nl= pd.read_table("./GSM4798910_B2019-3.expression_matrix.txt", header=0, index_col=0)
pi_gene_bar= raw_UMIcounts.columns
pi_gene_bar=pd.Series(pi_gene_bar)+"_primary"
raw_UMIcounts.columns= pi_gene_bar
raw_UMIcounts_pl.columns = pd.Series(raw_UMIcounts_pl.columns)+"_positive"
raw_UMIcounts_nl.columns = pd.Series(raw_UMIcounts_nl.columns)+"_negtive"

因为我发现作者就是简单合并的所以这里 就是简单合并 相当于R里面的merge(我会类比R让大家看懂!)
counts_join=raw_UMIcounts.join(raw_UMIcounts_pl,how="inner")
counts_join= counts_join.join(raw_UMIcounts_nl,how="inner")
print(11323+11356+11467)

一共是34146个细胞!

34146
counts_join.T.to_csv("./counts_join.csv")
这一步就相当于构建了seurat对象!对等理解!
adata = sc.read_csv("./counts_join.csv", first_column_names=True)
adata.write('./join_adata.h5ad')

保存成h5ad 文件 这样的话下次读取会很快
adata=sc.read_h5ad("./join_adata.h5ad")
看看前20个表达量高的基因
sc.pl.highest_expr_genes(adata, n_top=20, )

img

下面进行质控 所有内容和在R里是一样的!图片也是一样的!
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=5)
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

img

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

img

img

adata = adata[adata.obs.n_genes_by_counts <3000, :]
adata = adata[adata.obs.pct_counts_mt <20, :]
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

img

这里选择20这个阈值过滤线粒体,虽然有些高,但是原文作者是50过滤,我觉得较为离谱,所以这里选择了比作者小的20.

a=[]
for i in adata.obs.index:
    a.append(i[13:])
from collections import Counter
count = Counter(a)
print(count)
Counter({'positive': 11147, 'negtive': 10036, 'primary': 9621})

这是过滤之后三种样本剩余细胞情况,总体细胞和原作者一致,各个类型还是有一定差异的!

adata.obs["disease_group"]=a
Trying to set attribute `.obs` of view, copying.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

img

这里是筛选高变基因流程和seurat 也是一样的!
adata = adata[:, adata.var.highly_variable]
Counter(adata.var.highly_variable)
Counter({True: 3907})
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
C:\Users\shenxiaochen\anaconda3\lib\site-packages\anndata\_core\anndata.py:1228: ImplicitModificationWarning: Initializing view as actual.
  warnings.warn(
Trying to set attribute `.obs` of view, copying.
... storing 'disease_group' as categorical
sc.pp.scale(adata, max_value=10)

标准化数据用于下一步分析 同:seurat

sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)

img

pca 降维结果。

results_file= './first_ana.h5ad'
adata.write(results_file)
adata=sc.read_h5ad("./first_ana.h5ad")
这里我们选择20个pc
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=20)
computing neighbors
    using 'X_pca' with  = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
sc.tl.umap(adata,min_dist=0.3)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:14)
adata
AnnData object with n_obs × n_vars = 30804 × 3907
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'disease_group' 
    var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'pca', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
sc.pl.umap(adata, color="disease_group",)

画一下不同疾病组别的umap的图!和原文保持一致!

img

import scanpy as sc
import pandas as pd
from matplotlib.pyplot import rc_context
with rc_context({'figure.figsize': (5, 5)}):
    sc.pl.umap(adata, color='disease_group', add_outline=True,
               legend_fontsize=12, legend_fontoutline=2,frameon=False,
               title='clustering of cells', palette='Paired')

可以看下图,原文不同疾病分组的umap和我复现的是一致的!只是他是躺着的。。。

img

img

sc.tl.leiden(adata)
running Leiden clustering
    finished: found 23 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:04)

这里有两种分群算法,一种是leiden 一种是louvain. python 里默认的是leiden R seurat 是 louvain。看过原文,作者说 leiden 是 louvain 的进化版。所以建议大家还是用leiden。

这里的参数同样和 seurat 一样也是resolution python里默认的是1.原文作者也是1.所以我就没写,大家平时更换不同的参数。。因为原文是基于seurat的, 所以我也是选择了原文的 louvain.

sc.tl.louvain(adata)
running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished: found 20 clusters and added
    'louvain', the cluster labels (adata.obs, categorical) (0:00:05)
#with rc_context({'figure.figsize': (10, 5)}):
sc.pl.umap(adata, color=['leiden','louvain'], add_outline=True,legend_loc='on data',
            legend_fontsize=12, legend_fontoutline=2,frameon=False,
            title=['clustering of cells','clustering of cells'], palette='Set1')

img

左边是 leiden 右边是 louvain 大家可以对比一下

sc.tl.rank_genes_groups(adata, 'louvain', method='wilcoxon')
#sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:17)


C:\Users\shenxiaochen\anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2
  self.stats[group_name, 'logfoldchanges'] = np.log2(

这一步相当于 seurat 里面的find_all_markers

result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
marker=pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']})
marker.to_csv("./marker_louvain.csv")

下面进行分群注释!

cluster2annotation = {
     '0':'Bcells',
     '1':'myofibroblasts',
     '2': 'NaiveT',
     '3':'NaiveT',
     '4':'cancercells', 
     '5': 'Bcells',
     '6':'NaiveT', 
     '7': 'NaiveT',
     '8':'myofibroblasts',
    '9':'CD8Effector',
    '10':'myeloid',
    '11':'Bcells',
    '12': 'endothelial',
    '13':'Macrophages',
    '14': 'cancerstemcells',
    '15':'myeloid',
    '16':'CXCL14cancer',
    '17':'plasma',
    '18':'matureDC',
    '19':'pericytes'
    
}

# add a new `.obs` column called `cell type` by mapping clusters to annotation using pandas `map` function
adata.obs['cell type'] = adata.obs['louvain'].map(cluster2annotation).astype('category')

根据作者文中提到的一些marker 和 cellmarker ,PanglaoDB 数据库进行注释!这两个数据库是我经常光顾的。。原因就是简单!!!

marker_genes_dict = {
    'cancercells': ['KRT19'],
    'cancerstemcells': ['KRT19','TOP2A'],
    'CXCL14cancer': ['CXCL14'],
    #'NaiveT': [], 
    #'CD8Effector': [], 
    'Bcells': ['CD79A','CD79B'],
    'Macrophages': ['LYZ','IL1B'],
    'myeloid':['LYZ'],
    'matureDC':['LAMP3','CCR7'],
    'plasma':['JCHAIN','IGHG3','MZB1'],
    'endothelial':['MCAM','PECAM1'],
    'pericytes':['ACTA2','TAGLN','MCAM'],
    'myofibroblasts':['LUM','DCN','TAGLN'],
    'cyclingcells':['TOP2A']
}
sc.pl.dotplot(adata, marker_genes_dict, 'louvain', dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_louvain']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.

画几张marker 基因的图!这些用R都是可以实现。。我曾经无聊的时候写过一写复现的脚本。。后来发现github上也有。。。。

img

png

ax = sc.pl.stacked_violin(adata, marker_genes_dict, groupby='louvain', swap_axes=False, dendrogram=True,cmap='Paired_r')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.

img

sc.pl.tracksplot(adata, marker_genes_dict, groupby='louvain', swap_axes=False, dendrogram=True,cmap='Paired_r')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.

img

adata.var_names
Index(['CCND1', 'ACTA2', 'POSTN', 'GADD45G', 'PSMA7', 'PDCD11', 'IGFBP7',
       'ING5', 'RPL32P3', 'AC241952.1',
       ...
       'AC125807.1', 'CD1A', 'H2AFZP3', 'IGLV1-44', 'CLEC17A', 'EEF1A1P9',
       'AC073111.2', 'AL355076.2', 'AC103563.7', 'IGHV5-78'],
      dtype='object', length=3907)
with rc_context({'figure.figsize': (4.5, 3)}):
    sc.pl.violin(adata, ['CD79A', 'CD79B'], groupby='louvain' )

img

sc.pl.umap(adata, color='louvain', legend_loc='on data',
           frameon=False, legend_fontsize=10, legend_fontoutline=2,title="")

img

sc.pl.umap(adata, color='cell type', legend_loc='on data',
           frameon=False, legend_fontsize=5, legend_fontoutline=0.5,save="jmzeng.jpg")
WARNING: saving figure to file figures\umapjmzeng.jpg.pdf

根据对应marker注释每一群,和原文基本一致!

img

img

adata.obs['cell type'].cat.categories

注释到如下几种细胞类型!

Index(['Bcells', 'CD8Effector', 'CXCL14cancer', 'Macrophages', 'NaiveT',
       'cancercells', 'cancerstemcells', 'endothelial', 'matureDC', 'myeloid',
       'myofibroblasts', 'pericytes', 'plasma'],
      dtype='object')
adata_new=adata[(adata.obs['cell type']== 'CXCL14cancer')|(adata.obs['cell type']=='cancercells')|(adata.obs['cell type']== 'cancerstemcells'), :]
adata_new.obs['cell type'].cat.categories

然后作者选择如下三种细胞类型,做了拟时序分析!go! go! go!

Index(['CXCL14cancer', 'cancercells', 'cancerstemcells'], dtype='object')
adata_new

这里选用paga 大家都知道的应该是monocle3 但是在一篇 nature methods 当中,作者是认为 paga 才是拟时序分析的最优选择!

View of AnnData object with n_obs × n_vars = 2883 × 3907
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'disease_group', 'leiden', 'louvain', 'cell type'
    var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'pca', 'neighbors', 'umap', 'disease_group_colors', 'leiden', 'louvain', 'leiden_colors', 'louvain_colors', 'rank_genes_groups', 'dendrogram_louvain', 'cell type_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
sc.tl.paga(adata_new, groups='cell type')
running PAGA


Trying to set attribute `.uns` of view, copying.


    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
sc.pl.paga(adata_new, color=['cell type','CXCL14'])
--> added 'pos', the PAGA positions (adata.uns['paga'])

img

#adata_new.uns['iroot'] = np.flatnonzero(adata_new.obs['cell type']  == 'cancerstemcells')[0]
sc.tl.draw_graph(adata_new, init_pos='paga')
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:17)
adata_new.uns['iroot'] = np.flatnonzero(adata_new.obs['cell type']  == 'cancerstemcells')[0]
sc.tl.dpt(adata_new)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
    finished (0:00:00)
    eigenvalues of transition matrix
    [1.         0.9954424  0.9901249  0.98625064 0.98440194 0.9764193 
     0.97149456 0.96843475 0.9636642  0.9555143  0.94893146 0.9446058 
     0.94269824 0.9330055  0.9283923 ]
    finished: added
    'X_diffmap', diffmap coordinates (adata.obsm)
    'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
computing Diffusion Pseudotime using n_dcs=10
    finished: added
    'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)

这里看出, cancer_stem_cells 分化为 cancer_cells 和 cxcl14_cancer_cells 和原文分析的基本一致!cxcl14_cancer_cells 主要存在于阳性淋巴结和阴性淋巴结!cxcl14_cancer_cells和 cancer_cells时序上基本一致!

sc.pl.draw_graph(adata_new, color=['cell type', 'dpt_pseudotime','disease_group'], legend_loc='on data')

img

img

sc.tl.paga(adata, groups='cell type')
sc.pl.paga(adata, color=['cell type'],node_size_scale=0.5)
running PAGA
    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns) 
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
--> added 'pos', the PAGA positions (adata.uns['paga'])

下面我又做了作者没有做的所有细胞类型的拟时序分析!

img

sc.tl.draw_graph(adata, init_pos='paga')
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:05:47)
sc.pl.draw_graph(adata, color=['cell type'], legend_loc='on data',legend_fontsize='xx-small',legend_fontweight='normal')

img

adata.uns['iroot'] = np.flatnonzero(adata.obs['cell type']  == 'NaiveT')[0]
sc.tl.dpt(adata) 
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
    finished (0:00:00)
    eigenvalues of transition matrix
    [1.         0.9967204  0.9959651  0.9953239  0.9936516  0.9933531 
     0.98944426 0.98545295 0.9851446  0.9843129  0.9827176  0.98058397 
     0.9788297  0.97733855 0.97462684]
    finished: added
    'X_diffmap', diffmap coordinates (adata.obsm)
    'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
computing Diffusion Pseudotime using n_dcs=10
    finished: added
    'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)
sc.pl.draw_graph(adata, color=['cell type', 'dpt_pseudotime','disease_group'], legend_loc='on data',legend_fontsize='xx-small',legend_fontweight='normal')

img

可以看到从naiveT细胞的逐渐分化轨迹!

最后想和大家说一句话,是我导师教导我的分享给大家。学无止境,得饶人处且饶人,开阔胸襟,是做人的一种境界!