Foreword
Sequencing technology is now widely used in personalized medicine, the diagnosis of genetic diseases, the identification of cancer gene mutations, and the tracking and study of pathogens. As the price of sequencing has kept falling in recent years, many laboratories use next-generation sequencing (NGS), RNA sequencing, and related methods to explore genome structure, variation, expression, and regulatory mechanisms. But once there are dozens or hundreds of samples, the up-front data collation alone is enough to make your head spin. And when the laboratory cannot afford analysis software, most bosses can only sacrifice the analysts' eyes and have them do this work by hand. If the lab has no money and you do not want to ruin your eyes, the R language is your best friend. The following is a method for merging all the sequencing files, for your reference.
Preparation
This article uses the tidyverse and janitor packages for data processing and cleaning. The teaching files can be downloaded here.
    # install the packages
    install.packages("tidyverse")
    install.packages("janitor")

    # check the versions
    packageVersion("tidyverse")
    packageVersion("janitor")
Once tidyverse and janitor are installed, you can check their versions with the packageVersion() function. This article uses tidyverse v2.0.0 and janitor v2.2.0. Now let's go straight to the workflow!
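If the packages may already be installed, a small check avoids reinstalling them on every run. The following is a minimal sketch in base R; the helper names missing_packages and install_missing are hypothetical and are not part of tidyverse or janitor.

```r
# Hypothetical helper: return the names of packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing.
# Calling it needs an internet connection and a configured CRAN mirror.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling install_missing(c("tidyverse", "janitor")) would then install whichever of the two is not yet available and do nothing otherwise.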
Example of Merging Multiple Sequence Files
Here are the steps to merge the files:
- Load the required packages: first load the tidyverse and janitor packages.
- List the CSV files: create a list of file paths covering every file in the specified directory whose name ends with .csv.
- Process the data: use the map_df() function from the purrr package to process each CSV file in the list. For each file, do the following:
  - a. Use the read_csv() function from the readr package to read the CSV file.
  - b. Use the clean_names() function from the janitor package to clean up the field names.
  - c. Add a new field named symbol that holds the value of the gene_homo_sapiens_refseq_gr_ch37_p13_genes field.
  - d. Keep only the rows whose reference_allele field equals No.
  - e. Use the across() function from the dplyr package to convert all fields to character type.
- Merge the results: map_df() combines the processed data frames of all CSV files into one large data frame. The file_name field tracks which source file each row came from.
- Modify file_name: trim the file_name field so that it keeps only the base name of the file (without the directory path and the .csv extension).
- View the results: use the View() function to inspect the merged and processed data frame (Merge).
- Export the data: save the processed Merge data frame to the specified paths as a CSV file and an RDS file (R's serialized data format). Remember to replace the paths with ones on your own computer.
  - The CSV file is saved as "C:/Users/USER/Desktop/merge.csv".
  - The RDS file is saved as "C:/Users/USER/Desktop/merge.rds".
    # load the packages
    library(tidyverse)
    library(janitor)

    # list every CSV file in the input folder (pattern is a regular expression)
    output_list <- list.files(path = "C:/Users/USER/Desktop/WES_sample",
                              pattern = "\\.csv$", full.names = TRUE)

    # merge all CSV files
    Merge <- output_list %>%
      setNames(nm = .) %>%
      map_df(~ read_csv(.x) %>%
               clean_names() %>%
               mutate(symbol = gene_homo_sapiens_refseq_gr_ch37_p13_genes) %>%
               subset(reference_allele == "No") %>%
               mutate(across(everything(), as.character)),
             .id = "file_name") %>%
      # keep only the base file name: drop the directory path and the .csv extension
      mutate(file_name = gsub(".*/", "", file_name)) %>%
      mutate(file_name = gsub("\\.csv$", "", file_name))

    # view the merged result
    View(Merge)

    # save the merged result
    write.csv(x = Merge, file = "C:/Users/USER/Desktop/merge.csv", row.names = FALSE)
    saveRDS(Merge, file = "C:/Users/USER/Desktop/merge.rds")
After running the commands above, we get a merged result containing 65 records. The generated CSV file contains the merged and processed CNV data from all CSV files in the specified directory. The RDS file stores the same data in R's own format, so it can be read back into R later.
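To try the workflow without real sequencing output, the following sketch builds two tiny CSV files in a temporary folder and merges them. Everything in it is an assumption for illustration: the folder name, the sample file names, the column headers "Gene Symbol" and "Reference allele", and the values are all made up. It also swaps in two alternatives to the steps described above: reading every column as character through the col_types argument of read_csv(), and list_rbind() from purrr 1.0 or later in place of map_df().

```r
library(dplyr)
library(purrr)
library(readr)
library(janitor)

# Made-up input: two small CSV files in a temporary folder.
demo_dir <- file.path(tempdir(), "WES_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(`Gene Symbol` = c("NOC2L", "PERM1"),
                     `Reference allele` = c("No", "Yes"),
                     check.names = FALSE),
          file.path(demo_dir, "sample_A.csv"))
write_csv(data.frame(`Gene Symbol` = c("OR4F5", "NOC2L"),
                     `Reference allele` = c("No", "No"),
                     check.names = FALSE),
          file.path(demo_dir, "sample_B.csv"))

# The pattern argument of list.files() is a regular expression,
# so the dot is escaped and the match is anchored at the end of the name.
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Name each path by its base name without the extension;
# list_rbind() turns these names into the file_name column.
names(demo_files) <- tools::file_path_sans_ext(basename(demo_files))

demo_merge <- demo_files %>%
  map(~ read_csv(.x, col_types = cols(.default = col_character())) %>%
        clean_names() %>%
        filter(reference_allele == "No")) %>%
  list_rbind(names_to = "file_name")

print(demo_merge)

# Round trip through RDS: compare the data frame read back with the one saved.
rds_path <- file.path(demo_dir, "merge.rds")
saveRDS(demo_merge, rds_path)
print(all.equal(readRDS(rds_path), demo_merge))
```

Naming the paths before reading means file_name needs no clean-up afterwards, which is one reason to prefer this variant. Whether it suits real data depends on the actual column headers in your files.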
How to extract the required data from the merged file
Once all the data has been consolidated into one file, we can extract what we need from the Merge data frame. The following is one way to do it:
- Define the list of genes to extract: separate the names of the genes to extract with the pipe symbol (|) and assign the result to the variable gene.list. For example, gene.list="NOC2L|PERM1|OR4F5" extracts the results for the three genes NOC2L, PERM1, and OR4F5.
- Process the results: use the mutate() function from the dplyr package to process the Merge data frame.
- Filter the results: use the subset() function to keep only the rows whose symbol field matches one of the names in gene.list.
- View the results: use the View() function to view the processed Results data frame as a table.
    # select the genes
    gene.list = "NOC2L|PERM1|OR4F5"

    # extract the matching rows
    Results <- Merge %>%
      mutate(amino_acid_change_in_longest_transcript = gsub(".*?:", "", amino_acid_change_in_longest_transcript)) %>%
      mutate(coding_region_change_in_longest_transcript = gsub(".*?:", "", coding_region_change_in_longest_transcript)) %>%
      subset(str_detect(symbol, gene.list))

    # view the result
    View(Results)
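The two gsub() calls delete everything up to and including a colon, which suggests that these fields carry a prefix in front of the actual change. Here is a minimal base R sketch of that behaviour. The strings are made up for illustration and are not taken from the data.

```r
# Made-up annotation strings with a prefix in front of the colon.
changes <- c("NM_000000.1:c.100A>G", "NM_000000.1:p.Lys34Glu", "no_colon_here")

# ".*?:" lazily matches any run of characters followed by a colon.
# gsub() deletes every such match, so only the text after the last colon stays.
# Strings without a colon are returned unchanged.
stripped <- gsub(".*?:", "", changes)
print(stripped)
```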
Epilogue
As a result, 35 of the 65 records in the Merge dataset were extracted. To be honest, the data extracted this way are sometimes mixed with records for other genes with similar names, so you need to screen and check the output yourself. Despite these small shortcomings, the method still saves a lot of time. If you have a better way to extract the data, please share it in the comments below. I would be very grateful!
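One possible way to reduce the similar-name matches is to compare whole gene names instead of searching for the pattern anywhere in the field. The sketch below uses a made-up data frame, in which the name PERM1-LIKE is invented purely to trigger a partial match. It shows two options: the %in% operator with a character vector, and the same regular expression anchored with ^ and $. Both assume that each symbol value holds exactly one gene name, which may not be true for every annotation file.

```r
library(dplyr)
library(stringr)

# Made-up data: PERM1-LIKE is an invented name that contains PERM1.
demo <- data.frame(symbol = c("NOC2L", "PERM1", "PERM1-LIKE", "OR4F5", "SAMD11"))

gene.list <- "NOC2L|PERM1|OR4F5"

# Pattern search anywhere in the field: also keeps PERM1-LIKE.
partial <- demo %>% filter(str_detect(symbol, gene.list))

# Option 1: split the pattern into a vector and require an exact match.
wanted <- str_split(gene.list, fixed("|"))[[1]]
exact_in <- demo %>% filter(symbol %in% wanted)

# Option 2: anchor the pattern so that it must match the whole field.
exact_regex <- demo %>% filter(str_detect(symbol, paste0("^(", gene.list, ")$")))

print(partial$symbol)
print(exact_in$symbol)
print(exact_regex$symbol)
```

If a symbol field can list several genes at once, exact matching would miss those rows, so the pattern search followed by a manual check may still be needed there.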