How to merge multiple sequencing files using R

Foreword

Sequencing technology is now widely used in personalized medicine, the diagnosis of genetic diseases, the identification of cancer gene mutations, and the tracking and study of pathogens. As the price of sequencing has fallen in recent years, many laboratories use Next Generation Sequencing (NGS), RNA sequencing, and related methods to explore genome structure, variation, expression, and regulatory mechanisms. But with dozens or hundreds of samples, the up-front data collation alone is enough to make your head spin. And when the laboratory cannot afford analysis software, most bosses can only sacrifice the analysts' eyes and have them do this work by hand. If the lab has no money and you would rather keep your eyesight, the R language is your best friend. Below is a method for merging all the sequencing files, for your reference.

Preparation

This article uses two packages, tidyverse and janitor, for data processing and cleaning. The tutorial files can be downloaded here.

# install the packages
install.packages("tidyverse")
install.packages("janitor")

# check the versions
packageVersion("tidyverse")
packageVersion("janitor")

After installing tidyverse and janitor, you can check their versions with the packageVersion() function. This article uses tidyverse v2.0.0 and janitor v2.2.0. Now let's go straight to the workflow!
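If you run the script more than once, reinstalling packages that are already present only costs time. The following is a minimal sketch of checking first, using base R only. The helper name missing_packages() is made up for this illustration and is not part of any package.

```r
# Hypothetical helper (not from any package): return the names in `pkgs`
# that cannot be loaded from the current R library
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Check the two packages used in this article
to_install <- missing_packages(c("tidyverse", "janitor"))
print(to_install)  # an empty result means both packages are already installed

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

The install line is left commented out so that nothing is downloaded until you have looked at the list.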

Example of Merging Multiple Sequence Files

Here are the steps to merge files:

  1. Load the required packages: first load the two packages, tidyverse and janitor.
  2. List the CSV files: create a list of file paths containing every file in the specified directory whose name ends with .csv.
  3. Process the data: use the map_df() function from the purrr package to process each CSV file in the list. For each file, do the following:
    • a. Use the read_csv() function from the readr package to read the CSV file.
    • b. Use the clean_names() function from the janitor package to clean up the field names.
    • c. Add a new field named symbol, filled with the values of the gene_homo_sapiens_refseq_gr_ch37_p13_genes field.
    • d. Filter the rows, keeping only those whose reference_allele field equals "No".
    • e. Use the across() function from the dplyr package to convert all fields to character type, which avoids type conflicts when the files are combined.
  4. Modify file_name: change the file_name field so that it keeps only the base name of the file (without the directory path and the .csv extension).
  5. Merge the results: map_df() combines the processed data frames from all CSV files into one large data frame. The file_name field tracks which source file each row came from.
  6. View the results: use the View() function to inspect the merged and processed data frame (Merge).
  7. Export the data: save the processed Merge data frame to the specified paths as a CSV file and as an RDS file (R's own serialized data format). Remember to replace the paths with ones on your own computer.
    • The CSV file is saved as "C:/Users/USER/Desktop/merge.csv".
    • The RDS file is saved as "C:/Users/USER/Desktop/merge.rds".
# load the packages
library(tidyverse)
library(janitor)

# list the CSV files in the input folder
output_list <- list.files(path = "C:/Users/USER/Desktop/WES_sample",
                          pattern = "\\.csv$", full.names = TRUE)

# merge all CSV files
Merge <- output_list %>%
  setNames(nm = .) %>%
  map_df(~ read_csv(.x) %>%
           clean_names() %>%
           mutate(symbol = gene_homo_sapiens_refseq_gr_ch37_p13_genes) %>%
           subset(reference_allele == "No") %>%
           mutate(across(everything(), as.character)),
         .id = "file_name") %>%
  # keep only the base name: drop the directory path and the .csv extension
  mutate(file_name = gsub(".*?/", "", file_name)) %>%
  mutate(file_name = gsub("\\.csv$", "", file_name))

# view the merged result
View(Merge)

# save the merged result
write.csv(file = "C:/Users/USER/Desktop/merge.csv", x = Merge, row.names = FALSE)
saveRDS(Merge, file = "C:/Users/USER/Desktop/merge.rds")

After running the code above, we get a merged result of 65 records. The generated CSV file contains the merged and processed CNV data from all CSV files in the specified directory, and the RDS file stores the same data in R's own format so that it can be read back into R later.
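Reading the RDS file back is done with readRDS(). The sketch below uses a tiny made-up data frame and a temporary file instead of the real Merge object and Desktop path, so every value in it is an assumption for illustration. It shows that the restored object is identical to the one that was saved, including the character column types. A CSV file read back with read.csv() or read_csv() would have its column types guessed again.

```r
# Small stand-in for the merged data frame; all values are made up
merged <- data.frame(
  file_name = c("sample_01", "sample_02"),
  symbol    = c("NOC2L", "PERM1"),
  position  = c("0012345", "0067890"),  # stored as text, like every merged column
  stringsAsFactors = FALSE
)

# Save to a temporary RDS file and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(merged, rds_path)
restored <- readRDS(rds_path)

# Compare the restored object with the original and list the column types
print(identical(restored, merged))
print(sapply(restored, class))
```

With the real data, the equivalent call would be readRDS() on the merge.rds path you chose when saving.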

How to extract the required data from the merged file

Once all the data has been consolidated into one file, we can extract the data we need from Merge. The following is one way to do it:

  1. Define the list of genes to extract: separate the names of the genes with a pipe symbol (|) and assign the result to the variable gene.list. For example, gene.list="NOC2L|PERM1|OR4F5" extracts the results for the three genes NOC2L, PERM1 and OR4F5.
  2. Process the data: use the mutate() function from the dplyr package on the Merge data frame to strip the prefix before the colon from the amino_acid_change_in_longest_transcript and coding_region_change_in_longest_transcript fields.
  3. Filter the results: use the subset() function to keep only the rows whose symbol field matches gene.list.
  4. View the results: use the View() function to view the resulting data frame, Results, in tabular form.
# select the genes
gene.list = "NOC2L|PERM1|OR4F5"

# process and filter the data
Results <- Merge %>%
  mutate(amino_acid_change_in_longest_transcript = gsub(".*?:", "", amino_acid_change_in_longest_transcript)) %>%
  mutate(coding_region_change_in_longest_transcript = gsub(".*?:", "", coding_region_change_in_longest_transcript)) %>%
  subset(str_detect(symbol, gene.list))

# view the results
View(Results)
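To see what the gsub(".*?:", "", ...) calls do, it helps to try the pattern on a few strings. The values below are hypothetical annotation strings in a "transcript:change" form, and the contents of your own files may look different.

```r
# Hypothetical annotation strings; the last one has no colon at all
aa_change <- c("NM_000001:p.Arg10Gln", "geneA:NM_000002:p.Gly25Ser", "p.Leu5Pro")

# ".*?:" matches the shortest run of characters ending in a colon, and gsub()
# removes every such match, so only the text after the last colon remains
stripped <- gsub(".*?:", "", aa_change)
print(stripped)
```

Strings without a colon are returned unchanged, so the step is safe to apply to a whole column.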

Epilogue

In the end, 35 of the 65 records in the Merge dataset were extracted. To be honest, because str_detect() matches the pattern anywhere in the symbol field, the data extracted this way is sometimes mixed with records whose gene names merely resemble the ones requested, so you need to screen and check the results yourself. Despite these small shortcomings, it still saves a lot of time. If you have a better way to extract the data, please share it in the comments below. I will be very grateful!
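One possible way to avoid the similar-name problem is to match whole gene names instead of searching for a pattern. The sketch below is only an illustration under one assumption: each symbol value holds exactly one gene name. It uses base R, with grepl() standing in for str_detect(), and the symbols OR4F5X and XPERM1 are made up to show the effect.

```r
gene.list <- "NOC2L|PERM1|OR4F5"

# Example symbols; the last two are made-up look-alikes
symbols <- c("NOC2L", "PERM1", "OR4F5", "OR4F5X", "XPERM1")

# Pattern search: matches anywhere in the string, so look-alikes slip through
partial <- symbols[grepl(gene.list, symbols)]

# Option 1: split the list into individual names and match them exactly
wanted <- strsplit(gene.list, "|", fixed = TRUE)[[1]]
exact <- symbols[symbols %in% wanted]

# Option 2: keep the pattern but anchor it to the start and end of the string
anchored <- symbols[grepl(paste0("^(", gene.list, ")$"), symbols)]

print(partial)
print(exact)
print(anchored)
```

With the Merge data frame, the first option would mean filtering with subset(symbol %in% wanted) instead of subset(str_detect(symbol, gene.list)). If a symbol value can hold several genes at once, exact matching would miss those rows and the values would need to be split first.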

Million Quesn

A foreigner living in Taiwan, sharing the highlights of a sudden flash of inspiration.
