Top Image



Model





I. Construction of phylogenetic relationships

[1]Obtain transcriptome information of multiple species belonging to subclass of species to be studied from public databases


[2]Assemble transcripts from scratch from downloaded transcriptome data

    (1) The fastp v0.21 software[1] was used to perform quality control on the transcriptome data of 112 species, removing low-quality sequences and connector sequences with default parameters;

  • (2) The data after quality control was assembled from scratch using Trinity v2.11[2], and the parameters were set as default parameters;

  • (3) Then extract the longest transcript based on the custom Python script;

  • (4) Using CD-HIT v4.8.1[3] for transcribed de-redundancy, parameter set to "-c 0.8-aS 0.8-d 0-T 40";

  • (5) TransDecoder software[2] was used to predict the longest open reading frame of transcript sequence.

#

Construction of phylogenetic relationships


[3]The transcriptome data of all species were integrated to generate the orthologous orthogonal group Matrix 1 to infer the phylogenetic relationships of the species to be studied


[4]The data set Matrix 1 was tested for sequence saturation by DAMBE5 software[8]. After the analysis results were obtained, the graph was mapped and the graph file was edited and output


[5]Construct phylogenetic tree

II. Age index analysis of species transcriptome

[1] Species transcriptome data collection:

In this study, Major ampullate gland (Ma), Minor ampullate gland (MA), minor ampullate gland (MA), major ampullate gland (MA), minor ampullate gland (MA), and minor ampullate gland (MA) of five species of Araneus araneus, Araneus maculatus, Araneus araneus, and Black widow Araneus were collected. Mi, Tubuliform gland (Tu), Flagelliform gland (Fl), Aggregate gland (Aggregate gland, Tu) Ag, Aciniform gland & Pyriform gland (Ac&Py), Hemolymph (Hem), Pedipalp (Ped), Leg, Epidermis, Epi, Venom gland (Ven), Fat (Fat) and Ovary (Ova), 13 tissue transcriptome raw data.


[2]Quantitative transcriptome expression

    The TPM (Transcripts Per Million) value of the gene expression profile represents the amount of gene or transcript expression in the RNA sequencing data, which can avoid the influence between gene length and sequencing depth in order to more accurately compare the expression levels of different genes. The process is as follows:

  • (1) Use fastp v0.21[10] for quality control of species transcriptome data, remove low quality sequences and connector sequences, parameter default;

  • (2) After de novo assembly of the collateral-free transcriptome using Trinity v2.2.1[12] software, the longest transcript of each unigene was retained as the reference database for transcriptome comparison;

  • (3) TransDecoder [11] software was used to annotate the structure of the longest ORF and generate gff files;

  • (4) The pre-processed sample reads were compared to the reference database based on the HISAT v2.2.1[12] tool, and BAM file was generated after the result was converted;

  • (5) Finally, StringTie v2.1.5[13] was used to quantify the TPM value of the transcript.


[3]Define genetic age

Phylostratigraphy, developed by Domazet-Lo et al[14]., is a method based on phylogenetic studies of species by comparing all of their genes and relatives from known or representative genomes of distant and close species. Where genes with similar homology are divided into corresponding phylogenetic hierarchies, representing the approximate time interval of origin of genes in the species under study. The steps are as follows:

  • (1) Define the hierarchy of system evolution. Will this study rebuild the garden spider superfamily phylogenetic tree Taxonomy database with NCBI web site (https://www.ncbi.nlm.nih.gov/taxonomy) in the combined trapezoid phylogenetic tree, and then divided into different level system evolution. The first level (PS1) is a Cellular organism, and the last level is to study the lineage of a species.

  • (2) Define the age of gene evolution. First, we built the local database, including the NCBI site non redundant protein database (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), the transcriptome assembly from the start and the representative species genome data of different evolutionary level. Then, using the blastp command in DIAMOND[102] software, the threshold E-value was set to 1e-4[15], all genes or transcripts of the studied species were compared with the local database step by step, and the evolutionary hierarchy of the species to which the genes were first compared was defined as the age of the genes of the studied species. The genes that could not be compared in the database were identified as genealogy-specific genes of the species under study..


[4]Calculation of transcriptome age index

The calculation of the Transcriptome Age Index (TAI) was based on the calculation formula developed by Domazet-Lo et al.[121] by weighting the gene age and gene expression, that is, the TAI index represents the weighted arithmetic average of the transcriptome age, and the higher the TAI value, the younger the transcriptome. In this study, Hajk-Georg et al.[16] published R package myTAI for TAI analysis to perform transcriptome age index analysis from the following four aspects.

III. Gene function analysis

[1]Weighted gene co-expression network

    (1) Data preprocessing: The TPM expression value was normalized by log2 (TPM+1) method, and then the genes with large median absolute deviation were removed (mad > 0.01).

      (2) Co-expression network construction: the weighted correlation coefficient between genes was calculated and converted into adjacency matrix, and then the topological overlap matrix (TOM) statistic was converted into distance matrix. Finally, hierarchical clustering tree was constructed

        (3) Module recognition: the Dynamic Tree Cut algorithm is used to identify modules, and Module eigengenes (ME) are combined to combine very similar modules, which are represented by the branches of the cluster tree and different colors

          (4) Identification of trait related modules: The correlation between Module eigengenes and traits is used to effectively identify important modules. In WGCNA, the module with the highest Module Significance (MS) score among all modules is usually defined as the critical module and selected for further analysis

            (5) Identification of important genes in key models: Using Modularmembership (MM) to measure the importance of genes in modules, GS and MM scatter plots are drawn

              (6) Co-expression network visualization: Use Cytoscape[20] software to graphically display the network and perform analysis and editing.


              [2]GO analysis

                (1) blastp comparison: blastp comparison of all genes of the species;

              • (2) Screening genes: For the genes in the key modules, select related genes for enrichment analysis

              • (3) GO enrichment: After the transformation of the Gene ID, the Gene Ontology (GO) enrichment analysis was performed using the clusterProfiler[21] package enrichGO function

              • (4) Visualization of enrichment analysis results: enrichment analysis was performed using R


              [3]Glycosylation site analysis

                The NetOGlyc v4.0[22] tool was used to predict O-glycosylation modification sites for genes in the glycoprotein co-expression module.





References

[1] Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor [J]. Bioinformatics (Oxford, England), 2018, 34(17): i884-i890.
[2] Haas B J, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis [J]. Nature protocols, 2013, 8(8): 1494-1512.
[3] Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data [J]. Bioinformatics (Oxford, England), 2012, 28(23): 3150-3152.
[4] Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND [J]. Nature methods, 2015, 12(1): 59-60.
[5] Edgar R C. MUSCLE: multiple sequence alignment with high accuracy and high throughput [J]. Nucleic acids research, 2004, 32(5): 1792-1797.
[6] Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments [J]. Nucleic acids research, 2006, 34(Web Server issue): W609-612.
[7] Capella-Gutierrez S, Silla-Martinez J M, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses [J]. Bioinformatics (Oxford, England), 2009, 25(15): 1972-1973.
[8] Xia X. DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution [J]. Mol Biol Evol, 2013, 30(7): 1720-1728.
[9] Nguyen L T, Schmidt H A, von Haeseler A, et al. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies [J]. Mol Biol Evol, 2015, 32(1): 268-274.
[10] Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor [J]. Bioinformatics (Oxford, England), 2018, 34(17): i884-i890.
[11] Haas B J, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis [J]. Nature protocols, 2013, 8(8): 1494-1512.
[12] Kim D, Langmead B, Salzberg S L. HISAT: a fast spliced aligner with low memory requirements [J]. Nature methods, 2015, 12(4): 357-360.
[13] Pertea M, Pertea G M, Antonescu C M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads [J]. Nature biotechnology, 2015, 33(3): 290-295.
[14] Domazet-Loso T, Brajkovij, Tautz D. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages [J]. Trends in genetics : TIG, 2007, 23(11): 533-539.
[15] Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND [J]. Nature methods, 2015, 12(1): 59-60.
[16] Drost H G, Gabel A, Liu J, et al. myTAI: evolutionary transcriptomics with R [J]. Bioinformatics (Oxford, England), 2018, 34(9): 1589-1590.
[17] Wang J, Zhang L, Lian S, et al. Evolutionary transcriptomics of metazoan biphasic life cycle supports a single intercalation origin of metazoan larvae [J]. Nature ecology & evolution, 2020, 4(5): 725-736.
[18] Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis [J]. BMC bioinformatics, 2008, 9: 559.
[19] Wright R M, Aglyamova G V, Meyer E, et al. Gene expression associated with white syndromes in a reef building coral, Acropora hyacinthus [J]. BMC genomics, 2015, 16(1): 371.
[20] Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks [J]. Genome Res, 2003, 13(11): 2498-2504.
[21] Yu G, Wang L G, Han Y, et al. clusterProfiler: an R package for comparing biological themes among gene clusters [J]. Omics : a journal of integrative biology, 2012, 16(5): 284-287.
[22] Hansen J E, Lund O, Tolstrup N, et al. NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility [J]. Glycoconjugate journal, 1998, 15(2): 115-130.