An R package for building kinship trees: fastreeR 1.4.0

Introduction

fastreeR (version 1.4.0) is an R package for computing distance matrices between samples directly from VCF or FASTA files, building kinship trees, and performing hierarchical clustering. Its main goal is to provide convenient, fast functions that help users quickly generate kinship trees or cluster analysis results from sequence data. The fastreeR source code is hosted on GitHub.

1. Install fastreeR 1.4.0

You need to install R and RStudio in advance, then install fastreeR (version 1.4.0) as follows:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("fastreeR")
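To confirm that the installation succeeded, a quick check along these lines can help. This is only a minimal sketch in base R; the helper name `check_pkg` is hypothetical and not part of fastreeR.

```r
# Hypothetical helper (not part of fastreeR): return the installed version
# of a package as a string, or NA if the package is not installed.
check_pkg <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# NA here means fastreeR is not installed yet
print(check_pkg("fastreeR"))
```

If the result is NA, rerun the BiocManager installation step before continuing.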

2. Additional packages to install

Because the fastreeR (version 1.4.0) package has to run in a Java environment, you need to install some additional Java-related packages. The required packages are as follows:

install.packages("rJava")
install.packages("xlsxjars")
install.packages("xlsx")

In addition, you need to download Java and make it visible to R by setting JAVA_HOME. Check where Java is installed on your computer and use that path.

# For the 64-bit version
# rJava must run in a Java environment
Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre-1.8')
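Since a wrong JAVA_HOME path is a common reason for rJava failing to load, it may be worth checking the setting before going further. The following is a minimal sketch using only base R; `java_home_ok` is a hypothetical helper name, not a function from rJava or fastreeR.

```r
# Hypothetical helper: TRUE if the given path (by default the current
# JAVA_HOME environment variable) is non-empty and is an existing directory.
java_home_ok <- function(path = Sys.getenv("JAVA_HOME")) {
  nzchar(path) && dir.exists(path)
}

if (!java_home_ok()) {
  message("JAVA_HOME is unset or does not point to an existing directory; ",
          "adjust the path passed to Sys.setenv().")
}
```

This only confirms that the directory exists; it does not verify that the Java installation inside it is usable.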

3. Load all R packages to be used

Before using any R package, you first need to load it with library().

library(fastreeR)
library(rJava)

4. Preliminary work before analysis

When using fastreeR to analyze data from different samples, all the files must first be merged into a single file. Here we take .vcf files as an example. Assuming we have three sets of whole exome sequencing (WES) results, we use vcftools to prepare and merge them. First, extract the chromosome of interest from each sample:

vcftools --gzvcf sample1.vcf.gz --chr chr16 --recode --out sample1.vcf
vcftools --gzvcf sample2.vcf.gz --chr chr16 --recode --out sample2.vcf
vcftools --gzvcf sample3.vcf.gz --chr chr16 --recode --out sample3.vcf

--recode writes the filtered records to a new VCF file
--chr specifies the name of the chromosome to extract
--out specifies the prefix of the output files

Next, compress the files:

bgzip sample1.vcf.recode.vcf
bgzip sample2.vcf.recode.vcf
bgzip sample3.vcf.recode.vcf

Then use the tabix command to index the files:

tabix -p vcf sample1.vcf.recode.vcf.gz
tabix -p vcf sample2.vcf.recode.vcf.gz
tabix -p vcf sample3.vcf.recode.vcf.gz

Finally, merge all the files into one:

vcf-merge sample1.vcf.recode.vcf.gz sample2.vcf.recode.vcf.gz sample3.vcf.recode.vcf.gz > out.vcf
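Before moving on to R analysis, it can be useful to confirm that the merged file really contains all three samples. The sketch below reads the sample names from the #CHROM header line of an uncompressed VCF using base R only. The helper name `vcf_samples` is hypothetical, and the code assumes the header fits within the first 10000 lines of the file.

```r
# Hypothetical helper: return the sample names listed on the #CHROM header
# line of an uncompressed VCF file.
vcf_samples <- function(path) {
  # Assumption: the whole header fits in the first 10000 lines
  lines <- readLines(path, n = 10000)
  header <- grep("^#CHROM", lines, value = TRUE)
  if (length(header) == 0) stop("no #CHROM header line found")
  fields <- strsplit(header[1], "\t", fixed = TRUE)[[1]]
  # The first 9 columns (CHROM ... FORMAT) are fixed; sample names follow
  fields[-(1:9)]
}

# Example call (path is whatever file vcf-merge wrote):
# vcf_samples("out.vcf")
```

For the merge described here, the call should return three names, one per input file.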

At this point, the sample preparation is complete!

5. Analysis of samples with fastreeR

fastreeR has four main functions:

  1. Sample Statistics: Calculates per-sample statistics from the VCF file, such as the number of samples, the number of SNPs, and the rate of missing values.
  2. Calculate distances from vcf: Calculates the distance matrix between samples from the genotype data in the VCF file, for later construction of a kinship tree or hierarchical clustering.
  3. Histogram of distances: Plots the distance matrix as a histogram to show the distribution of distances between samples.
  4. Plot tree from fastreeR::dist2tree / Plot tree from stats::hclust: Uses the dist2tree function that comes with the fastreeR package, or the hclust function from R's built-in stats package, to convert the distance matrix into a kinship tree and visualize it.

# Load packages
library(fastreeR)
library(rJava)

# Load data: path to the merged VCF file
tempVcf <- "C:/Users/USER/Desktop/out.vcf"

# Sample statistics
myVcfIstats <- fastreeR::vcf2istats(inputFile = tempVcf)
head(myVcfIstats)
plot(myVcfIstats[,7:9])

# Histogram of distances
myVcfDist <- fastreeR::vcf2dist(inputFile = tempVcf, threads = 2)
graphics::hist(myVcfDist, breaks = 100, main = NULL, xlab = "Distance", xlim = c(0, max(myVcfDist)))

# Plot tree from fastreeR::dist2tree (plotting requires the ape package)
myVcfTree <- fastreeR::dist2tree(inputDist = myVcfDist)
plot(ape::read.tree(text = myVcfTree), direction = "down", cex = 0.3)
ape::add.scale.bar()
ape::axisPhylo(side = 2)

# Plot tree from stats::hclust
myVcfTreeStats <- stats::hclust(myVcfDist)
plot(myVcfTreeStats, ann = FALSE, cex = 0.3)
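To make the hclust step more concrete, here is a small self-contained sketch that runs the same kind of clustering on a toy distance matrix. It assumes that fastreeR::vcf2dist returns a stats::dist object, as its direct use in stats::hclust suggests. The sample names and distance values are made up for illustration and do not come from any real VCF.

```r
# Toy pairwise distances for three hypothetical samples (made-up numbers).
m <- matrix(c(0.0, 0.1, 0.6,
              0.1, 0.0, 0.5,
              0.6, 0.5, 0.0),
            nrow = 3,
            dimnames = list(c("s1", "s2", "s3"), c("s1", "s2", "s3")))

# Convert the symmetric matrix to a dist object, the input hclust expects
toyDist <- stats::as.dist(m)

# Hierarchical clustering (complete linkage is the hclust default)
toyTree <- stats::hclust(toyDist)

# Cut the tree into two clusters and show which sample falls in which
groups <- stats::cutree(toyTree, k = 2)
print(groups)
```

In this toy case s1 and s2 are the closest pair, so they end up in the same cluster and s3 is placed on its own. Calling plot(toyTree) draws the corresponding dendrogram.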

After running the above code, you can get a tree diagram similar to the following:

The figure is a dendrogram drawn with the stats::hclust command, taken from the fastreeR tutorial.

Epilogue

The R code in this article has been tested with the fastreeR (version 1.4.0) package and confirmed to work. If you think the writing is okay, please forward it to your friends who are also struggling with money!

I am very grateful for your sharing!!!
Million Quesn
