Let’s load the data we’ll use for the package.
library(lionmap)
library(magrittr)
data("train_ex_data_bpcells")
data("train_ex_metadata")
data("test_ex_data_bpcells")
data("test_ex_metadata")
The first thing we need to do is to create a tree that describes the hierarchy of our cell classifications. A tree describes the structure of the data (see the Creating Custom Trees page for more info), and is one of the unique aspects of this package - the ability to create a structure that embeds info on plausible classifications based on previously known info on the relationships between celltypes, and to use this structure to make better classifications. For this basic tutorial, we’ll just assume there’s no hierarchical structure, and that celltypes are equally likely to be assigned to any of the possible classes (aka an equal tree). I’ve created a helper function, CreateEqualTree, to do this, which just takes in a vector of possible cell classes, and creates an equal tree for you.
possible_cell_classes = train_ex_metadata$seurat_annotations %>% unique()
equal_tree = CreateEqualTree(cell_labels = possible_cell_classes)
#OPTIONAL: let's visualize the tree we created (have to install the ggtree package on bioconductor to do this)
#ggtree(equal_tree) +geom_nodelab(geom = "label", node = "all")
Awesome, now we can find marker genes using our reference dataset. The FindMarkerGenes function does this for us. We’ll need to input the following info:
Our reference GE dataset in BPCells format (ref_bpcells)
Our dataframe containing info on each cell (ref_metadata)
The tree we created (tree)
The name of the column in the metadata giving the cell type (metadata_cluster_column)
The name of the column in the metadata giving the cell ID (metadata_cell_id_column)
And optionally we can change the ngenes parameter, which allows us to choose the number of genes we find that distinguishes each pair of corresponding cells, or the n_cells_sampled parameter, which allows us to choose the number of cells we sample to find differences for each particular pair of celltypes.
marker_genes = FindMarkerGenes(ref_bpcells = train_ex_data_bpcells, ref_metadata = train_ex_metadata, tree = equal_tree, metadata_cluster_column = "seurat_annotations", metadata_cell_id_column = "cell_label")
If it works, we should get a bunch of nonsense printed to the output talking about writing temporary files on your computer. There’s also a few automated tests that check whether everything is formatted properly, and they’ll throw error messages with some info on what’s going wrong if something is caught. Under the hood, the file is subsetting the GE matrix to contain 500 cells from two cell classes, writing that data to a temporary file on your computer, and then using that dataset to find marker genes that distinguish the classes. It does this for each of the possible pairs of cell classes you could create.
We’ll also get all the marker genes that distinguish each pair of cell classes! Let’s take a look at the output for the first 5 matchups, stored in the marker_genes variable. Write the following to explore the output and then read on ahead for more info.
str(marker_genes[[1]][1:5])
Now we can use this set of marker genes to create our list of models!
We will use the GetModels function to do this.
Just like before, we’ll need to input the following info, just with the addition of the marker gene list we just created:
The marker gene list (marker_genes)
Our reference GE dataset in BPCells format (ref_bpcells)
Our dataframe containing info on each cell (ref_metadata)
The tree we created (tree)
The name of the column in the metadata giving the cell type (metadata_cluster_column)
The name of the column in the metadata giving the cell ID (metadata_cell_id_column)
And for this function we can optionally change the n_cells_sampled parameter, which is the number of cells sampled from each celltype to create models that separate each pairwise class (default set to 500), the models_to_include parameter which lets us change the models we’re creating (should only be used by more advanced users), and the npcs parameter, which changes the number of principal components used when creating models.
models <- GetModels(marker_genes = marker_genes, ref_bpcells = train_ex_data_bpcells, ref_metadata = train_ex_metadata, tree = equal_tree, metadata_cluster_column = "seurat_annotations", metadata_cell_id_column = "cell_label")
If everything goes well, the output will indicate the tests checking proper formatting passed, and will then list the models as they are being created.
Let’s have a look at the structure of the list with the str function:
str(models[[1]][1:5], max.level = 2)
The structure of this object, just like the marker genes, is a list of lists. Each sublist provides models that can be used to classify a cell between one of two classes, and we have models that differentiate all possible classes at each level of the celltype hierarchy (equal_tree). It also contains information on how to transform data we want to classify to be in the same PC-space as the models we just created. We’ll use all these models in the next step, to classify cells to specific celltypes.
To classify cells to previously defined celltypes, we’ll use the Classify function and will need to enter the following:
Our dataset with raw (not normalized/processed) GE values of cells we want to classify (bpcells_query)
The list of models produced by GetModels (models)
The tree we created (tree_struc)
And we can optionally set the parameter prop_max_threshold, which allows us to set the confidence threshold we need to have in order to classify a cell.
The output is a vector giving the celltype classifications for each cell in the query dataset.
query_classifications = Classify(bpcells_query = test_ex_data_bpcells,models = models,tree_struc = equal_tree)
Awesome! Now we have a vector of classifications, formatted so the name of each element is the cell label and the value is the classification of that cell. Let’s view the first few elements.
head(query_classifications)
We can view the number of classifications to each type using the table function.
table(query_classifications)
We’ll see that one cell has the classification “Rootnode”. This indicates that there wasn’t enough evidence to properly classify the cell. If we used a custom hierarchy, classification occurs at each level of the hierarchy, and stops at a given level of the tree if there’s not enough evidence to continue.