Model
[1]Obtain transcriptome information of multiple species belonging to subclass of species to be studied from public databases
[2]Assemble transcripts from scratch from downloaded transcriptome data
(1) The fastp v0.21 software[1] was used to perform quality control on the transcriptome data of 112 species, removing low-quality sequences and connector sequences with default parameters;
(2) The data after quality control was assembled from scratch using Trinity v2.11[2], and the parameters were set as default parameters;
(3) Then extract the longest transcript based on the custom Python script;
(4) Using CD-HIT v4.8.1[3] for transcribed de-redundancy, parameter set to "-c 0.8-aS 0.8-d 0-T 40";
(5) TransDecoder software[2] was used to predict the longest open reading frame of transcript sequence.
Construction of phylogenetic relationships
[3]The transcriptome data of all species were integrated to generate the orthologous orthogonal group Matrix 1 to infer the phylogenetic relationships of the species to be studied
Preparation of materials: we collected the special equipments and a large number of spiders with particular characters,then we extracted a lot of gossamer of these spiders.
Designing different condition and operation: we concocted buffer solution with PH gradient and dissolve quantitative gossamer in the different solutions respectively.
Microscopic examination: we observed structural changes of spider silk protein before and after the reaction between the protein and solutions.
[4]The data set Matrix 1 was tested for sequence saturation by DAMBE5 software[8]. After the analysis results were obtained, the graph was mapped and the graph file was edited and output
[5]Construct phylogenetic tree
(1) Using software IQ-TREE[9], prepare sequence data Matrix1 to be saved in FASTA format file
(2) Use the command "iqtree -s input.fasta -m MFP-bb 1000-alRT 1000" to build the maximum likelihood tree, where "bb" represents the number of repetitions for non-parametric bootstrap analysis. "alrt" represents the number of repetitions used for non-parametric SH-like analysis
(3) Evolutionary tree visualization. In the output file to check the maximum likelihood tree, and use exhibited v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/) and iTOL software for visualization and beautify the phylogenetic tree
[1] Species transcriptome data collection:
In this study, Major ampullate gland (Ma), Minor ampullate gland (MA), minor ampullate gland (MA), major ampullate gland (MA), minor ampullate gland (MA), and minor ampullate gland (MA) of five species of Araneus araneus, Araneus maculatus, Araneus araneus, and Black widow Araneus were collected. Mi, Tubuliform gland (Tu), Flagelliform gland (Fl), Aggregate gland (Aggregate gland, Tu) Ag, Aciniform gland & Pyriform gland (Ac&Py), Hemolymph (Hem), Pedipalp (Ped), Leg, Epidermis, Epi, Venom gland (Ven), Fat (Fat) and Ovary (Ova), 13 tissue transcriptome raw data.
[2]Quantitative transcriptome expression
The TPM (Transcripts Per Million) value of the gene expression profile represents the amount of gene or transcript expression in the RNA sequencing data, which can avoid the influence between gene length and sequencing depth in order to more accurately compare the expression levels of different genes. The process is as follows:
(1) Use fastp v0.21[10] for quality control of species transcriptome data, remove low quality sequences and connector sequences, parameter default;
(2) After de novo assembly of the collateral-free transcriptome using Trinity v2.2.1[12] software, the longest transcript of each unigene was retained as the reference database for transcriptome comparison;
(3) TransDecoder [11] software was used to annotate the structure of the longest ORF and generate gff files;
(4) The pre-processed sample reads were compared to the reference database based on the HISAT v2.2.1[12] tool, and BAM file was generated after the result was converted;
(5) Finally, StringTie v2.1.5[13] was used to quantify the TPM value of the transcript.
[3]Define genetic age
Phylostratigraphy, developed by Domazet-Lo et al[14]., is a method based on phylogenetic studies of species by comparing all of their genes and relatives from known or representative genomes of distant and close species. Where genes with similar homology are divided into corresponding phylogenetic hierarchies, representing the approximate time interval of origin of genes in the species under study. The steps are as follows:
(1) Define the hierarchy of system evolution. Will this study rebuild the garden spider superfamily phylogenetic tree Taxonomy database with NCBI web site (https://www.ncbi.nlm.nih.gov/taxonomy) in the combined trapezoid phylogenetic tree, and then divided into different level system evolution. The first level (PS1) is a Cellular organism, and the last level is to study the lineage of a species.
(2) Define the age of gene evolution. First, we built the local database, including the NCBI site non redundant protein database (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), the transcriptome assembly from the start and the representative species genome data of different evolutionary level. Then, using the blastp command in DIAMOND[102] software, the threshold E-value was set to 1e-4[15], all genes or transcripts of the studied species were compared with the local database step by step, and the evolutionary hierarchy of the species to which the genes were first compared was defined as the age of the genes of the studied species. The genes that could not be compared in the database were identified as genealogy-specific genes of the species under study..
[4]Calculation of transcriptome age index
The calculation of the Transcriptome Age Index (TAI) was based on the calculation formula developed by Domazet-Lo et al.[121] by weighting the gene age and gene expression, that is, the TAI index represents the weighted arithmetic average of the transcriptome age, and the higher the TAI value, the younger the transcriptome. In this study, Hajk-Georg et al.[16] published R package myTAI for TAI analysis to perform transcriptome age index analysis from the following four aspects.
(1) Global TAI calculation. The calculation formula is shown as follows:
psi represents the gene age of gene i, ei represents the expression level of gene i at the ontogenetic stage or tissue, and n is the total number of genes in the studied species. A lower TAI value indicates a larger transcriptome age, while a higher TAI value indicates a younger transcriptome age.
(2) Tissue TAI change curve and significance analysis. FlatLineTest function in myTAI was used to quantify the statistical significance of the overall TAI pattern, and then the TAI change curve of spider tissue was drawn by PlotSignature function.
(3) The contribution value of each gene to the global TAI was calculated. Use the pMatrix function in the myTAI package to calculate the relative contribution of each gene of the species to the overall TAI. The sum of the contribution values of all genes in a given tissue is the overall TAI value for that tissue.
(4) Calculate the relative TAIR contribution (TAI)[17] of each phylogenetic level. The formula for calculating the relative contribution of evolutionary level to TAI is shown as follows:
ENS Represents the sum of expression levels of all genes at evolutionary level N in tissue S, represents the expression amount of gene i in tissue S, and n represents the number of genes at evolutionary level N.
[5]Analysis of young gene expression
Younger transcriptomes in a species' tissue often indicate higher levels of gene expression, reflecting that the tissue plays an important role in individual development or adaptation to the environment. This study identified tissues with younger transcriptomes through TAI analysis, and will further explore the expression of young genes in these tissues using the following three strategies. The methods are as follows:
(1) In young tissues, we identified all genes with expression levels greater than zero (TPM > 0) as contributing genes to the function of young tissues. Therefore, in this study, all genes with zero expression levels were filtered first, and the average expression levels of ancient genes and young genes in young tissues were compared.
(2) Then, according to the distribution of genes in the phylogenetic hierarchy, the expression levels of ancient genes and young genes in different evolutionary hierarchies were compared;
(3) Finally, the relative contributions of different evolutionary levels to TAI were calculated, and the relative contributions of each level to the global TAI of young tissues were compared.
[1]Weighted gene co-expression network
(1) Data preprocessing: The TPM expression value was normalized by log2 (TPM+1) method, and then the genes with large median absolute deviation were removed (mad > 0.01).
(2) Co-expression network construction: the weighted correlation coefficient between genes was calculated and converted into adjacency matrix, and then the topological overlap matrix (TOM) statistic was converted into distance matrix. Finally, hierarchical clustering tree was constructed
(3) Module recognition: the Dynamic Tree Cut algorithm is used to identify modules, and Module eigengenes (ME) are combined to combine very similar modules, which are represented by the branches of the cluster tree and different colors
(4) Identification of trait related modules: The correlation between Module eigengenes and traits is used to effectively identify important modules. In WGCNA, the module with the highest Module Significance (MS) score among all modules is usually defined as the critical module and selected for further analysis
(5) Identification of important genes in key models: Using Modularmembership (MM) to measure the importance of genes in modules, GS and MM scatter plots are drawn
(6) Co-expression network visualization: Use Cytoscape[20] software to graphically display the network and perform analysis and editing.
[2]GO analysis
(1) blastp comparison: blastp comparison of all genes of the species;
(2) Screening genes: For the genes in the key modules, select related genes for enrichment analysis
(3) GO enrichment: After the transformation of the Gene ID, the Gene Ontology (GO) enrichment analysis was performed using the clusterProfiler[21] package enrichGO function
(4) Visualization of enrichment analysis results: enrichment analysis was performed using R
[3]Glycosylation site analysis
The NetOGlyc v4.0[22] tool was used to predict O-glycosylation modification sites for genes in the glycoprotein co-expression module.
[1] Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor [J]. Bioinformatics (Oxford, England), 2018, 34(17): i884-i890.
[2] Haas B J, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis [J]. Nature protocols, 2013, 8(8): 1494-1512.
[3] Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data [J]. Bioinformatics (Oxford, England), 2012, 28(23): 3150-3152.
[4] Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND [J]. Nature methods, 2015, 12(1): 59-60.
[5] Edgar R C. MUSCLE: multiple sequence alignment with high accuracy and high throughput [J]. Nucleic acids research, 2004, 32(5): 1792-1797.
[6] Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments [J]. Nucleic acids research, 2006, 34(Web Server issue): W609-612.
[7] Capella-Gutierrez S, Silla-Martinez J M, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses [J]. Bioinformatics (Oxford, England), 2009, 25(15): 1972-1973.
[8] Xia X. DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution [J]. Mol Biol Evol, 2013, 30(7): 1720-1728.
[9] Nguyen L T, Schmidt H A, von Haeseler A, et al. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies [J]. Mol Biol Evol, 2015, 32(1): 268-274.
[10] Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor [J]. Bioinformatics (Oxford, England), 2018, 34(17): i884-i890.
[11] Haas B J, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis [J]. Nature protocols, 2013, 8(8): 1494-1512.
[12] Kim D, Langmead B, Salzberg S L. HISAT: a fast spliced aligner with low memory requirements [J]. Nature methods, 2015, 12(4): 357-360.
[13] Pertea M, Pertea G M, Antonescu C M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads [J]. Nature biotechnology, 2015, 33(3): 290-295.
[14] Domazet-Loso T, Brajkovij, Tautz D. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages [J]. Trends in genetics : TIG, 2007, 23(11): 533-539.
[15] Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND [J]. Nature methods, 2015, 12(1): 59-60.
[16] Drost H G, Gabel A, Liu J, et al. myTAI: evolutionary transcriptomics with R [J]. Bioinformatics (Oxford, England), 2018, 34(9): 1589-1590.
[17] Wang J, Zhang L, Lian S, et al. Evolutionary transcriptomics of metazoan biphasic life cycle supports a single intercalation origin of metazoan larvae [J]. Nature ecology & evolution, 2020, 4(5): 725-736.
[18] Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis [J]. BMC bioinformatics, 2008, 9: 559.
[19] Wright R M, Aglyamova G V, Meyer E, et al. Gene expression associated with white syndromes in a reef building coral, Acropora hyacinthus [J]. BMC genomics, 2015, 16(1): 371.
[20] Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks [J]. Genome Res, 2003, 13(11): 2498-2504.
[21] Yu G, Wang L G, Han Y, et al. clusterProfiler: an R package for comparing biological themes among gene clusters [J]. Omics : a journal of integrative biology, 2012, 16(5): 284-287.
[22] Hansen J E, Lund O, Tolstrup N, et al. NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility [J]. Glycoconjugate journal, 1998, 15(2): 115-130.