1 Introduction

CVTree (Composition Vector Tree) is an implementation of alignment-free algorithms for generating dissimilarity matrices from large collections of DNA or amino acid sequences, preferably genome data, for phylogenetic studies. This method was proposed by Professor Bailin Hao and colleagues in 2004 (J. Qi, Wang, & Hao, 2004). In the CVTree algorithm, each genome sequence (including protein, RNA, or DNA) is represented by a composition vector, which is calculated by the difference between the observed frequencies of k-strings and the predicted frequencies based on a Markov model. The similarity between two sequences is measured by the cosine of their composition vectors. CVTree has been successfully applied to various domains, including Archaea and Bacteria (Gao, Qi, Sun, & Hao, 2007; J. Qi et al., 2004; Sun, Xu, & Hao, 2010; G. H. Zuo, Hao, & Staley, 2014a; Zuo, Xu, & Hao, 2013, 2015), viruses (Gao & Qi, 2007; Gao, Qi, Wei, Sun, & Hao, 2003), chloroplasts (Yu et al., 2005), 16S rRNA (Lu, 2025), fungi (Choi, Kim, Jeon, & Lee, 2013; O’Connell, Thon, Hacquard, & Amyotte, 2012; Wang, Xu, Gao, & L., 2009), and metagenomes (Liu et al., 2013; Zhang et al., 2016). The methodological aspects of the CV approach have been extensively described in the literature. In particular, the role and selection of the peptide length $K$ have been discussed in (Li, Xu, & Hao, 2010) and (G. H. Zuo, Li, & Hao, 2014b).
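
The composition-vector idea described above can be illustrated with a minimal sketch. It is not the CVTree source code. It assumes the formulation of Qi, Wang, & Hao (2004): the predicted frequency of a K-string is obtained from its (K-1)- and (K-2)-substrings, each component is the relative deviation of the observed from the predicted frequency, and the dissimilarity is (1 - cosine) / 2. For brevity the sketch only visits K-strings that actually occur in the sequence; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Observed frequencies of all overlapping k-strings in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: c / total for s, c in counts.items()}


def composition_vector(seq, k):
    """Relative deviation of observed K-string frequencies from a Markov prediction.

    Simplification (assumption of this sketch): only K-strings observed in seq
    are visited, so components with zero observed frequency are omitted.
    """
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    fk = kmer_freqs(seq, k)
    fk1 = kmer_freqs(seq, k - 1)
    fk2 = kmer_freqs(seq, k - 2)
    cv = {}
    for s, p in fk.items():
        # Predicted frequency from the two (K-1)-substrings and the shared (K-2)-core.
        p0 = fk1[s[:-1]] * fk1[s[1:]] / fk2[s[1:-1]]
        cv[s] = (p - p0) / p0
    return cv


def cosine_dissimilarity(a, b):
    """Map the cosine of two sparse vectors to a dissimilarity in [0, 1]."""
    dot = sum(v * b.get(s, 0.0) for s, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("empty composition vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0
```

In a real analysis the vector of a genome is built from all of its protein (or DNA) sequences, and the pairwise dissimilarities fill the matrix from which the tree is constructed.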

The CVTree algorithms are accessible through two platforms: the CVTree Web Server and the CVTree Standalone Version. The standalone version is published under the MIT license on both GitHub and Gitee (Zuo, 2021). Following the introduction of the classical CVTree method, our research group has successively released three versions of the CVTree web server (Ji Qi, Luo, & Hao, 2004; Xu & Hao, 2009; Zuo & Hao, 2015). In WebCVTree v3 (Zuo & Hao, 2015), we redesigned the data processing strategy and implemented parallelization for the core program. Additionally, an interactive, collapsible, and expandable CVTree Viewer based on HTML5 was incorporated. These enhancements enable biologists to study phylogeny inferred from thousands of genomes and compare results directly with taxonomy at all ranks in an almost automated manner. WebCVTree v4 represents the latest iteration of the CVTree Web Server. This version incorporates all prokaryotic genomes from NCBI RefSeq, features an improved data processing strategy, introduces a novel algorithm for comparing phylogenetic trees with taxonomy, and implements more efficient core programs. In addition to the interactive HTML5-based Tree Viewer from WebCVTree v3, a new tree drawing page has been added to generate publication-quality phylogenetic tree figures. The WebCVTree4 pipeline is hosted on Aliyun and accessible at http://cvtree.online/v4/. The WebCVTree4 web server can be accessed without login requirements on most modern browsers, including Firefox, Microsoft Edge, and Google Chrome. However, due to HTML5 implementation, some older browsers may not be fully supported.

1.1 New Features in WebCVTree v4

Integrated Taxonomy System: Implementation of a comprehensive taxonomy system based on the NCBI taxonomy database.
Comprehensive Prokaryotic Genomes: Inclusion of all RefSeq prokaryotic genomes for comprehensive analysis.
Phylogenetic Tree Editor: Built-in functionality for editing and customizing phylogenetic trees.
Detailed Lineage Editor: Advanced editor for modifying the taxonomic lineage of individual genomes.
Novel Comparison Algorithm: Implementation of a new algorithm for comparing phylogenetic trees with taxonomic classifications.
Enhanced Taxonomy Display: Updated visualization system for improved taxonomic data presentation.
New Web Infrastructure: Deployment on a new server infrastructure hosted on Aliyun at http://cvtree.online/v4/.
Optimized Server Performance: Implementation of more efficient server-side programs for enhanced computational speed.

1.2 Citation Guidelines

When using WebCVTree in your research, please cite the following relevant publications:

Zuo, G. (2025). WebCVTree4: A Novel Phylogenetic and Taxonomic Study Platform for Prokaryotes Using Composition Vectors and Whole Genomes. In Submission.
Zuo, G. (2025). CLTree: Annotating, Rooting, and Evaluating Phylogenetic Trees Based on Genome Lineages. In Submission.
Zuo, G. (2021). CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool Based on Composition Vectors of Genomes. Genomics Proteomics & Bioinformatics, 19, 662-667.
Qi, J., Wang, B., & Hao, B. (2004). Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. Journal of Molecular Evolution, 58, 1-11.

2 Web Interface

The workflow of WebCVTree v4 is illustrated in Figure 1.

The operational workflow of WebCVTree consists of three main stages: project initiation from the Home Page, parameter configuration and job submission. The parameter configuration encompasses four components: basic parameters (configured in the Project Page), selection of built-in gene sequences (configured in the Inbuilt Page), uploading of external gene sequences (configured in the Upload Page), and setting of genome classification information (configured in the Lineage Page). Once all parameters are configured, the job can be submitted to the server for processing. Upon completion, results are presented from two perspectives: taxonomic classification (TaxonView Page) and phylogenetic tree visualization (TreeView Page). Users can generate publication-ready, high-quality figures directly from the system for use in academic publications via the Output Tree page. For users interested in numerical values representing distances between specific species, this information is available on the Output Distance page. Furthermore, if the classification or phylogenetic tree results are unsatisfactory, users have the flexibility to modify the classification information and re-run the comparison between lineage and phylogenetic tree to iteratively refine their results.

2.1 Getting Started

The Start Page of WebCVTree4 is illustrated in Figure 2. To reload an existing project, enter its Project Number in the designated textbox and click the Load/Create Project button. To initiate a new project, leave the textbox empty and click Load/Create Project. Each user job is assigned a unique Project Number, and a dedicated workspace is allocated to the project. All project configurations, user-uploaded data, and computational results are stored in this workspace. Projects and their associated workspaces are retained for 7 days following the most recent run.

Adjacent to the Load/Create Project button is a button labeled Example. Clicking this button displays a preset example project that was utilized in the publication describing WebCVTree4. First-time users are strongly encouraged to explore this example to gain a comprehensive understanding of WebCVTree4’s functionality and workflow.

Next to the Example button is a button labeled Copy Project. Clicking this button duplicates an existing project into a new project with its own unique Project Number. WebCVTree4 thereby supports an incremental workflow: users can replicate an existing project, modify its parameters, and execute new runs that reuse previous results. This avoids repeating calculations and improves computational efficiency. Furthermore, because all uploaded data are preserved within the project structure, Copy Project eliminates the need to upload the same data again. Note that a valid existing Project Number must be entered in the textbox before clicking the Copy Project button.

2.2 Setting Up A Project

Upon creating, loading, or copying a project, the Setup Page opens in the “Setting parameters” status (see Figure 3). The unique Project Number is displayed at the top of the page; users should retain this number for subsequent project access. The Setup Page begins with a Project Status bar indicating the current status of the project (see Figure 4). The main body of the Setup Page comprises three sections: Basic Parameters, Set Genomes, and Set Lineage.

2.2.1 The Project Status Bar

As illustrated in Figure 4, a project may exist in one of the following five states:

Initial State: The project is awaiting parameter configuration and job submission.
Running State: The project has been configured and submitted to the server, which is currently processing the job. In this state, the configuration interface is locked. A Cancel Project button appears on the right side of the Project Status Bar; clicking this button terminates the running task and unlocks the interface.
Completed State: The server has finished processing the job and successfully returned the results. In this state, a See Result button appears on the right side of the Project Status Bar, providing access to the results page.

In addition to the three normal workflow states described above, there are two exceptional states:

Project Reset: A new configuration has been applied to an already completed project and saved to the server, but not yet resubmitted. An example would be generating a new project via the Copy Project function and subsequently modifying its parameters.
Runtime Error: An error occurred during server processing due to specific issues. The configuration interface remains locked, and a Reset Project button appears on the right side of the Project Status Bar to exit the locked configuration state.
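
The five states above can be read as a small state machine. The sketch below is only an illustration of that reading: the event names are hypothetical, and the state reached after Cancel Project or Reset Project is an assumption, since the text only says that the interface is unlocked.

```python
from enum import Enum


class Status(Enum):
    INITIAL = "Initial State"
    RUNNING = "Running State"
    COMPLETED = "Completed State"
    RESET = "Project Reset"
    ERROR = "Runtime Error"


# (current state, event) -> next state; event names are made up for this sketch.
TRANSITIONS = {
    (Status.INITIAL, "submit"): Status.RUNNING,
    (Status.RUNNING, "finish"): Status.COMPLETED,
    (Status.RUNNING, "fail"): Status.ERROR,
    (Status.RUNNING, "cancel"): Status.INITIAL,        # assumed target state
    (Status.COMPLETED, "save_new_config"): Status.RESET,
    (Status.RESET, "submit"): Status.RUNNING,
    (Status.ERROR, "reset"): Status.INITIAL,           # assumed target state
}

# States in which the configuration interface is locked, as described above.
LOCKED = {Status.RUNNING, Status.ERROR}


def step(state, event):
    """Return the next state, or raise if the event is not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in {state.value}") from None
```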

2.2.2 Basic Parameters

Users configure Basic Parameters in the “Basic Parameters” section of the Setup Page.

Sequence Type: Although protein sequences are preferred, DNA sequences may also be used.
Distance Method: Currently, WebCVTree4 provides only the classic CVTree method, also referred to as the “Hao method” in the new CVTree software. Users wishing to employ other CVTree methods should utilize the local version of the software, whose source code is available for download on both GitHub and Gitee. For detailed algorithmic information, please refer to the reference document (Zuo, 2021).
K-tuple Length: WebCVTree4 can generate trees for all selected K-tuple lengths in a single run. The default K-values range from 5 to 7 for proteins and from 6 to 18 with increments of 3 for coding DNAs, though any single K-value may be selected. When using protein sequences, the optimal K range is 4-5 for viruses, 5-6 for prokaryotes, and 6-7 for fungi (Li et al., 2010). K-values of 8 and 9 are available but generally unnecessary.
Email: Users may optionally provide an email address to receive notification when the project is completed. Alternatively, results can be reloaded at a later time using the Project Number as described above.
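
As a quick illustration of the K-tuple defaults listed above, the following sketch enumerates them. The dictionary and function names are hypothetical and are not part of WebCVTree4.

```python
# Default K-values as stated above: 5 to 7 for proteins, 6 to 18 in steps of 3 for DNA.
DEFAULT_K = {
    "protein": list(range(5, 8)),
    "dna": list(range(6, 19, 3)),
}

# Suggested protein K ranges (Li et al., 2010), as quoted in the text.
SUGGESTED_PROTEIN_K = {
    "viruses": (4, 5),
    "prokaryotes": (5, 6),
    "fungi": (6, 7),
}


def k_values(seq_type, group=None):
    """Return the K-values to run: the suggested range for a group, else the defaults."""
    if seq_type == "protein" and group in SUGGESTED_PROTEIN_K:
        low, high = SUGGESTED_PROTEIN_K[group]
        return list(range(low, high + 1))
    return DEFAULT_K[seq_type]
```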

2.2.3 Choosing Inbuilt Genomes

The WebCVTree4 web server incorporates a comprehensive built-in database of genomes, which are categorized into several groups as indicated by the selectable buttons: Archaea (1,154 genomes), Bacteria (223,734 genomes), Tiny Genomes (81 genomes), and Eukarya (8 genomes). The numerals in parentheses represent the number of genomes within each group. For additional information regarding these genome categories, please refer to Section 3.1. Users can select inbuilt genomes through two methods:

By clicking the checkbox preceding each group, users can select or deselect an entire group. The outgroup will be automatically selected by the web server.
To select genomes one by one and to set the outgroup manually, open the Select Inbuilt Genomes Page by clicking the Check Inbuilt Genome button.

The Inbuilt Genomes page (shown in Figure 5) consists of a long list of all built-in genomes. The first column contains the checkbox for selecting a genome into the project. The table can be sorted by clicking on a column header. For example, clicking on “Genome Name” lists all genomes in alphabetical order of their names; clicking on “No. Genes” orders the genomes by their total number of genes.

A further column gives the assembly level. NCBI defines four assembly levels: “Complete Genome”, “Chromosome”, “Scaffold”, and “Contig”. A filter located above the table enables users to display only genomes of higher assembly quality, so that lower-quality genomic sequences do not introduce errors into the overall results.
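
The effect of the assembly-level filter can be sketched as follows. The record layout (a dictionary with an "assembly_level" key) and the function name are hypothetical; the ordering of the four levels from most to least complete is the assumption the sketch relies on.

```python
# Assembly levels ordered from most to least complete (assumed ordering).
ASSEMBLY_LEVELS = ["Complete Genome", "Chromosome", "Scaffold", "Contig"]


def filter_by_assembly(genomes, minimum="Chromosome"):
    """Keep genomes whose assembly level is at least as complete as `minimum`."""
    cutoff = ASSEMBLY_LEVELS.index(minimum)
    return [
        g for g in genomes
        if ASSEMBLY_LEVELS.index(g["assembly_level"]) <= cutoff
    ]
```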

The last column of the table has a pull-down list of taxonomic ranks with the number of taxa at each rank: from Domain{3} to Species{34478} at the time of writing. Picking an item from the list facilitates the selection of genomes.

After completing the selection, return to the Project page (Figure 3) by clicking Done & Back to Project page.

2.2.4 Upload Genomes

Upon clicking the Upload User's Genomes button, you can select the genome files for upload. Two important points to note here: first, you can select multiple genome sequences at once; second, the file extensions of the uploaded genomes must match the genome type selected in the Basic Parameters section, i.e. .faa for protein sequences or .ffn for the DNA sequences.
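
Before uploading, the file extensions can be checked locally. The sketch below is a hypothetical helper, not part of WebCVTree4; whether the server compares extensions case-sensitively is not stated, so an exact match is assumed here.

```python
from pathlib import Path

# Extension required for each sequence type, as stated above.
EXPECTED_SUFFIX = {"protein": ".faa", "dna": ".ffn"}


def split_by_extension(paths, seq_type):
    """Split file paths into those matching the required extension and the rest."""
    wanted = EXPECTED_SUFFIX[seq_type]
    matching = [p for p in paths if Path(p).suffix == wanted]
    others = [p for p in paths if Path(p).suffix != wanted]
    return matching, others
```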

After the genomes are successfully uploaded, the “Uploaded” checkbox will display the count of uploaded genomes. Similar to the representation for built-in genomes, the number before the ‘/’ indicates the count of genomes selected for the following calculation, and the number after the ‘/’ shows the total number of uploaded genomes. All user genomes, checked or unchecked, together with the configured project will be kept for 7 days after the last run.

Furthermore, once genomes are uploaded, a small blue tag will appear next to the checkbox. Clicking this tag opens a selection page (see Figure 6) where you can choose which genomes to include in the calculation. By default, all uploaded genomes are selected.

Additionally, after upload an ‘UPLOAD’ suffix is appended to the genome names to distinguish them from the built-in genomes. You can also exclude all uploaded genomes from the calculation by unchecking the “Uploaded” checkbox.

2.2.5 Set Lineages of Genomes

The handling of Lineage information represents one of the most significant upgrades in WebCVTree4 compared to its previous version. In WebCVTree4, the lineage information for selected built-in genomes is pre-processed. For user-uploaded genomes, the system attempts to infer a possible lineage based on the file name by referencing NCBI taxonomy data. If the classification is unclear, the genome is labeled as “Unclassified”.

In summary, if no additional action is taken by the user, the system will automatically process all lineage information once the job is submitted. However, if users wish to preview the classification information in advance, they can click the Update Lineage button. The server will then update the lineage information for all selected genomes and display the counts for each taxonomic level in the Set Lineages block. Here, the number before the “+” indicates the count of genomes with a known classification at that level, while the number after the “+” shows how many genomes are currently “Unclassified” for that level.
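
The “known + Unclassified” counts shown in the Set Lineages block can be reproduced with a few lines. This is a sketch under assumptions: each lineage is represented as a dictionary from rank name to taxon name, and a missing rank is treated as “Unclassified”. The function name is hypothetical.

```python
RANKS = ["Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
UNCLASSIFIED = "Unclassified"


def lineage_counts(lineages):
    """Return, per rank, the string 'known+unclassified' for the given lineages."""
    summary = {}
    for rank in RANKS:
        known = sum(
            1 for lineage in lineages
            if lineage.get(rank, UNCLASSIFIED) != UNCLASSIFIED
        )
        summary[rank] = f"{known}+{len(lineages) - known}"
    return summary
```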

To examine the detailed lineage information, users can click the Check Lineage button, which will bring up a page from the server as shown in Figure 7.

Upon entering the “Lineage Page”, the taxonomic information for all selected genomes—including both built-in and user-uploaded genomes—is displayed in a table.

On this page, you can filter the genomes based on their existing classifications. To do this, click the “SearchPanes” button to reveal all available taxonomic categories, as shown in Figure 8. The interface will then allow you to select specific taxa at each classification level to narrow down the genomes of interest. You can also use the search box on the right side to search the genomes by keywords.

These classification details can be manually modified in two ways:

Offline Editing via CSV Export: Click the Export to CSV button to download the entire table as a CSV file. You can edit the file locally using spreadsheet software and then upload the modified file back to the page via drag-and-drop.
Online Editing: Select one or more genomes in the table to activate the Edit button. Clicking it will open a pop-up dialog (shown in Figure 9). In this dialog, the lineage information for all selected genomes is merged. If all selected genomes share the same name at a specific taxonomic level, that name is displayed. If there are multiple different names, “Multi-valued” is shown instead. You can edit these names, and your changes will apply to the corresponding taxonomic level.

Important Note Regarding “Multi-valued”: If you leave a “Multi-valued” entry unchanged, the original distinct names for that level will be preserved for their respective genomes. However, if you enter a new name, it will overwrite and unify the taxonomic level for all selected genomes.
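
The merging and overwriting rules for online editing can be sketched as below. This illustrates the behaviour described above and is not the code used by the web page; genomes are assumed to be dictionaries from rank name to taxon name, and both function names are hypothetical.

```python
MULTI = "Multi-valued"


def merge_for_dialog(selected, ranks):
    """Build the merged lineage shown in the edit dialog for the selected genomes."""
    merged = {}
    for rank in ranks:
        names = {genome[rank] for genome in selected}
        # A single shared name is shown as is; differing names collapse to MULTI.
        merged[rank] = names.pop() if len(names) == 1 else MULTI
    return merged


def apply_edit(selected, edited):
    """Write the edited dialog back: MULTI entries are left untouched per genome."""
    for rank, name in edited.items():
        if name == MULTI:
            continue  # keep each genome's original name at this rank
        for genome in selected:
            genome[rank] = name
```

For instance, selecting two genomes from different genera of the same family shows the family name and “Multi-valued” for the genus; typing a new family name changes both genomes, while their genera stay as they were.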

2.3 Run Project

When all parameters are set, the project is submitted for processing by clicking the button All parameters are fine, Run Project at the right side of the Basic Parameters block (see Figure 3). After submission, the “Setting parameters” status is locked and the project status changes to that shown in the middle of Figure 4, namely, “Running $\cdots$”. The project will finish in a few minutes if only the inbuilt genomes are used. If many new genomes are uploaded, the waiting time might be much longer, depending on the size and number of genomes. One can safely close the page; completion of the job will be notified by email if an address was entered when setting parameters. Users can also revisit the web server and reload the project by entering the Project Number. If necessary, one can cancel the run and reset the parameters by clicking the Cancel Project button.

When the project is completed, the Project Status bar changes to that shown in the third block of Figure 4. Clicking the See Result button leads to the Result Page. It should be noted that, in our view, validation of the generated phylogenetic tree should not be limited to simple sampling tests, as the results obtained from such methods only demonstrate the self-consistency of the data. Instead, we advocate validation methods that are independent of the current dataset. Through years of research, we have found that comparing taxonomic information with phylogenetic relationships serves as an exceptionally effective validation technique. To this end, we have developed a novel algorithm that annotates the phylogenetic tree with taxonomic information, enabling a comprehensive comparison between the taxonomic system and the inferred phylogenetic relationships. For a detailed explanation of the algorithm and its underlying concepts, please refer to Section 4.2. The following sections focus on how these results are presented on the web interface. Overall, the presentation is divided into two perspectives: one based on the taxonomic view and the other based on the phylogenetic tree view.
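
To make the notion of comparing taxonomy with a tree concrete, the following sketch tests whether a taxon is monophyletic on a rooted tree, i.e., whether some clade contains exactly the genomes of that taxon. It is a textbook-style check written for illustration only; the algorithm actually used by WebCVTree4 is the one described in Section 4.2 and may differ. The nested-tuple tree representation and the function names are assumptions of the sketch.

```python
def leaves(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, members):
    """True if some clade of the rooted tree contains exactly the given genomes."""
    members = set(members)

    def search(node):
        here = leaves(node)
        if here == members:
            return True
        # Descend only while the clade still contains every member.
        if isinstance(node, str) or not members <= here:
            return False
        return any(search(child) for child in node)

    return search(tree)
```

A taxon represented by a single genome is trivially monophyletic under this definition, because the leaf itself is the matching clade.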

2.4 Taxon View Page

The Result Page consists almost entirely of a long table summarizing taxa convergence, except for two buttons in the upper-right corner: See Tree and Download Result. See Tree is the portal to the interactive tree display described in Section 2.5; Download Result lets the user download the results to a local computer for further analysis and archiving. It may be used at any time, either online or after reloading a project.

The interface then presents three tabs: By Lineage, By Level, and Unclassified. In our process of comparing the phylogenetic tree with the taxonomic system, we treat genomes with unknown taxonomic units—specifically those labeled as Unclassified—separately from genomes that have complete taxonomic information. Therefore, the first two tabs (By Lineage and By Level) present the comparison between the phylogenetic tree and the taxonomic system for genomes with complete classification, but from two different perspectives. The Unclassified tab displays genomes with incomplete taxonomic information. If no genomes in the project lack classification data, this tab will not appear.

2.4.1 Summary of Taxa by Lineage

On the By Lineage Tab of this page, the taxonomic levels are displayed in a hierarchical order (see Figure 10). For clarity, when the page is first opened, all taxonomic units are collapsed to the highest level. Clicking the triangle ($\blacktriangleright$) next to a unit expands it to reveal the subordinate taxonomic levels, and the triangle icon rotates downward ($\blacktriangledown$). You can also search for specific taxonomic units using the search box at the top, or select a specific taxonomic level from the dropdown menu to collapse the hierarchy to that chosen level.

Convergence of taxa at various K-values provides an additional angle to look at the phylogeny. That is why WebCVTree4 calculates trees at several Ks in one run and produces a summary report. In the taxonomic view, the comparison results under different K-values are consolidated and displayed together. As illustrated in the figure, each taxonomic unit is represented in the following format:

<P>Bacillota:1105                --K6K7

Each entry starts with the abbreviation of the taxon rank, followed by the taxon name and, after the colon, the number of genomes belonging to the taxon. The ranks are abbreviated as <D> Domain, <K> Kingdom, <P> Phylum, <C> Class, <O> Order, <F> Family, <G> Genus, and <S> Species. The same set of abbreviations, with an additional <T> for sTrains, is used in the CVTree Viewer. The monophyly status of the taxon is given at the right. For example, there are 1105 genomes for the bacterial phylum Bacillota, and it is monophyletic at $K=6$ and $K=7$, but not at $K=5$.
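
Lines in this format are easy to parse once downloaded. The sketch below assumes that the status field lists the K-values at which the taxon is monophyletic and that “--” stands in for a K-value at which it is not; this reading is inferred from the example above and is not a documented file specification. The function name is hypothetical.

```python
import re

# <rank>name:count   status, e.g. "<P>Bacillota:1105                --K6K7"
TAXON_LINE = re.compile(
    r"<(?P<rank>[A-Z])>(?P<name>[^:]+):(?P<count>\d+)\s+(?P<status>\S+)"
)


def parse_taxon_line(line):
    """Return (rank, name, genome count, list of K-values named in the status)."""
    match = TAXON_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized taxon line: {line!r}")
    k_values = [int(k) for k in re.findall(r"K(\d+)", match["status"])]
    return match["rank"], match["name"], int(match["count"]), k_values
```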

2.4.2 Summary of Taxa by Taxon Level

The “By Level” tab displays the comparison results organized by taxonomic rank and provides corresponding statistics (see Figure 11). On this page, users can utilize the dropdown menu to filter and display items of a specific category. The number following each taxonomic rank is formatted as X / Y. Here, the number after the slash (/) indicates the total number of items belonging to that rank, while the number before the slash represents the count of items matching the currently selected category. The detailed monophyly status of individual items is displayed in the same manner as in the By Lineage Tab. Furthermore, you can click on any taxonomic rank to view its detailed information.

Figure 11: The Taxa Convergence Summary by Rank

When “Show Monophyly Taxa” is chosen, each taxonomic rank appears as a subtitle carrying a statistic. For example, the line

Phylum: 5+35/41

tells that the total number of phyla in the input dataset is 41 (after taking into account taxonomic revision, if any); among these 41 phyla, 5 are represented by only one genome and hence are trivially monophyletic; the other 35 are represented by two or more genomes and are monophyletic for at least one K.

In the “Show Non-Monophyly Taxa” option, the corresponding line reads

Phylum: 1/41

indicating that there is $41 - 5 - 35 = 1$ phylum that is not monophyletic for any K. Such non-monophyletic taxa are worth further study, as they may hint at possible taxonomic revisions.
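
The arithmetic behind these two summary lines can be checked with a short sketch. It assumes the monophyly line has exactly the form “Rank: a+b/total” shown above; the function name and the dictionary keys are hypothetical.

```python
import re


def parse_rank_summary(line):
    """Parse 'Rank: a+b/total' and derive the number of non-monophyletic taxa."""
    match = re.fullmatch(r"\s*(\w+):\s*(\d+)\+(\d+)/(\d+)\s*", line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    trivial, monophyletic, total = (int(match.group(i)) for i in (2, 3, 4))
    return {
        "rank": match.group(1),
        "trivial": trivial,            # taxa represented by a single genome
        "monophyletic": monophyletic,  # multi-genome taxa monophyletic for some K
        "total": total,
        "non_monophyletic": total - trivial - monophyletic,
    }
```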

2.4.3 Summary of Unclassified Genomes

Items that lack classification information are grouped and displayed in a separate tab labeled Unclassified, as illustrated in Figure 12. This helps to look for taxa with incomplete lineage information.

Figure 12: Summary of Unclassified Genomes

2.5 Tree View Page

It is extremely difficult, if not impossible, to comprehend a tree made of thousands of leaves. For this reason, an interactive display with collapsible and expandable branches has been provided since WebCVTree3. Because skillful manipulation of this display is the key to making the most of WebCVTree4, we explain it in more detail.

By clicking the button See Tree in the upper-right corner of the Result Page (see Figure 10), a TreeViewer page with default $K=6$ opens up.

A typical tree, plotted by using HTML5, is shown in Figure 13.

Figure 13: A typical page of CVTree Viewer

First of all, in Figure 13 all 235 Archaea, 5310 Bacteria and 8 Eukarya genomes are accommodated within a single screen. The {n + m} convention for the number of genomes is explained in Section 2.5.3. This picture was obtained after searching for the order name “Lactobacillales”; only its neighborhood has been partly expanded, while all other branches are maximally collapsed, except for one line shown in purple. In total, four colors are used in this figure. Their meaning is explained in Section 2.5.2.

2.5.1 Circles in Tree

If a node is denoted by an open circle ($\circ$), it is collapsible. Clicking on the circle collapses all the lower branchings; the collapsed branch is labeled by the highest-rank common taxon name. At the lowest level, a rightmost node may be marked by a solid circle ($\bullet$) preceding a taxon name; this indicates that the branch contains more than one genome and can be expanded by clicking on the solid circle. In contrast, a short line (—) in place of a would-be circle means that there is only one genome and the node cannot be expanded further.

2.5.2 Color of Items

Taxon names may appear in one of four colors: red, blue, green and purple.

A red name indicates that the branch is monophyletic and collapsed. This includes the trivial case when a taxon is represented by a single genome. For example: <D>Archaea{235}.
A collapsed but not convergent branch such as <K>Pseudomonadati{503/3107} is shown in blue.
A taxon name in green matches the word typed by the user in the “Search Query” box, e.g., <O>Lactobacillales in this picture.
Purple color is used to show a taxon with incomplete lineage information, with “Unclassified” items. In Figure 13 there is only one line in purple: <D>Eukaryota ... <T>GCF000146465v1-Encephalitozoon_intestinalis_ATCC_50506{0+1}.

2.5.3 The Taxon Name {n+m} Convention

A taxon name such as <D>Eukaryota{7+1} indicates that 1 of the 8 eukaryotic genomes did not come with complete lineage information. Only genomes with complete lineage information are counted in the convergence report; they form the first number, n, in {n+m}. Genomes without complete lineage information are shown in the tree display as the second number, m. This convention is useful for studying taxonomic assignment of newly sequenced genomes without proper lineage information. However, please note that lineage information for a given taxon may be complete but incorrect, thus requiring further modification.
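
Labels that follow this convention can be split into their parts with a regular expression. The sketch below handles only the {n} and {n+m} forms quoted in this manual; other forms, such as the {503/3107} label of a non-convergent branch, are deliberately not interpreted and yield None. The function name is hypothetical.

```python
import re

# <rank>name{n} or <rank>name{n+m}
TREE_LABEL = re.compile(
    r"<(?P<rank>[A-Z])>(?P<name>[^{]+)\{(?P<n>\d+)(?:\+(?P<m>\d+))?\}"
)


def parse_label(label):
    """Return (rank, name, n, m), with m = 0 when absent, or None if not matched."""
    match = TREE_LABEL.fullmatch(label.strip())
    if match is None:
        return None
    return match["rank"], match["name"], int(match["n"]), int(match["m"] or 0)
```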

2.5.4 Select Another Tree

A toolbar at the top of the page provides buttons for performing specific operations. The first button allows you to open phylogenetic trees generated under different K-values within this same page.

2.5.5 Select Root Node

One can pick a branch and let it fill the whole display window. For example, to single out the domain <D>Archaea{235}, hold the Shift key and click on the solid circle in front of the taxon name. The leaf $\bullet$<D>Archaea{235} moves to the leftmost position in the display, and clicking on the solid circle again expands it to fill the whole window.

The same operation may be performed in another way, namely, by using the “Select Node” option at the top of the CVTree Viewer. Selecting a taxon name in the pull-down list, e.g., <P>Methanobacteriota:178, displays the phylum Methanobacteriota, currently represented by 178 genomes, in the whole window.

2.5.6 Display Style

In addition to expanding or collapsing branches on the phylogenetic tree by clicking the circles on the nodes, you can also perform bulk operations using the third button. For example, you can collapse the entire tree to a specific taxonomic rank, or display all items labeled as “Unclassified” or “Upload”.

2.5.7 Lineage Revision

The “Lineage Modification” option in the CVTree Viewer allows users to revise lineages on a trial basis. Clicking this option opens a “Lineage Modification” window like the one in Figure 14. This page is essentially the same as the “Check Lineage” page in the project settings, except for the Redo Collapse button in the upper-right corner, and it is operated as described in Section 2.2.5. When the lineage modifications are ready, click the Redo Collapse button. The system shows “Recollapsing is running. Please wait.” After a minute or two it reports “Recollapse successfully”, and both the taxa convergence table and the CVTree Viewer are renewed. We emphasize that actual taxonomic revisions must comply with the International Code of Nomenclature of Bacteria (Lapage, Sneath, & Lessel, 1992) and follow the established practice in the microbiological community. The Lineage Modification function provided by WebCVTree4 is solely for trial purposes.

Figure 14: An example Lineage Modification file

2.5.8 Search Query

The quickest way to get to a point of interest in a tree is to type a taxon name into the “Search Query” box. For example, one may type “Escherichia” and select an item from the pull-down list, e.g., Escherichia_coli.

2.6 Output and Edit Tree

When a tree view has been adjusted by appropriate collapsing and expanding, a print-quality figure can be obtained by clicking the output button in the upper-right corner of the CVTree Viewer page (Figure 13). It opens an output preview page (Figure 15). The main options for editing the tree are:

On the right side of the page, there is a collapsible control panel. Using this panel, users can modify various aspects of the phylogenetic tree, such as tree style, branch lengths and spacing, text size and font, the colors of branches and labels, the display of collapsed branches, and the circles on nodes, and so on.
Click on a branch to select it, and then change its style, color, and font style in the right panel.
Double-click the name of a terminal node to change the name of the branch.
Hold down the Shift key and click on a specific node to swap the two branches (clades) that descend from it.

A useful tip: When selecting branches, it can be difficult to click on the thin lines. While adjusting the figure, keep the open circles on the branches enabled for easier selection. You can remove them in the final step before exporting, using the right panel.

With the adjustments described above, it is generally possible to generate a phylogenetic tree figure suitable for publication. One may then select a format in which to save the figure. The default format is SVG (Scalable Vector Graphics), as the underlying plot is drawn in SVG. The figure may also be saved in PDF, EPS, and PNG formats.

Figure 15: The Output Preview Page of WebCVTree4

2.7 Obtain Distance between Genomes

To optimize storage efficiency and accelerate computation, the distance files generated by the CVTree software are not saved in a standard format and require specialized commands for parsing. Consequently, WebCVTree does not provide an option to download these distance files.

For users who are interested in the specific numerical distances and wish to conduct repeated comparative analyses, we recommend running CVTree locally. The software can be easily downloaded from GitHub or Gitee. Additionally, a containerized version compatible with Singularity is available, which eliminates the need for complex compilation steps.

However, as a temporary solution within WebCVTree4, we do provide a method to access the distance matrix generated during the CVTree calculation process.

To obtain the distance matrix, the first step is to select the target genomes. This can be done in either of the following two ways:

Click the Get Distance button located in the upper-left corner of Figure 13. This will open the interface shown in Figure 16, where you can select the relevant genomes. Then, clicking Get the Distance will generate a distance list, as displayed in Figure 17. This list can be exported as a CSV file for further analysis.
Alternatively, if you are interested in the distances between species under a specific branch, hold down the Alt key and click the corresponding node in the phylogenetic tree shown in Figure 13. This generates the same kind of distance list as in Figure 17.

Figure 17: Output Distance Page of WebCVTree4
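
Once exported, the distance list can be reshaped into a square matrix for downstream tools. The following is a minimal sketch in Python. The column names `genome_a`, `genome_b`, and `distance` and the function `read_distance_list` are assumptions made for illustration; the header of the actual CSV file should be checked before use.

```python
import csv
import io

def read_distance_list(handle, col_a="genome_a", col_b="genome_b", col_d="distance"):
    """Turn a pairwise distance list into a symmetric nested dict.

    The column names are hypothetical; adjust them to the exported file.
    """
    matrix = {}
    for row in csv.DictReader(handle):
        a, b, d = row[col_a], row[col_b], float(row[col_d])
        # Each genome gets a zero self-distance the first time it is seen.
        matrix.setdefault(a, {a: 0.0})[b] = d
        matrix.setdefault(b, {b: 0.0})[a] = d
    return matrix

# Toy input standing in for an exported file (values are made up).
example = io.StringIO(
    "genome_a,genome_b,distance\n"
    "G1,G2,0.25\n"
    "G1,G3,0.40\n"
    "G2,G3,0.35\n"
)
matrix = read_distance_list(example)
```

For a real file, `open("exported.csv", newline="")` would take the place of the `io.StringIO` object.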

3 Inbuilt Data

3.1 Inbuilt Genomes

3.1.1 Prokaryotic Genomes

The WebCVTree4 web server has all prokaryotic genomes of the NCBI RefSeq database built in.

Archaea of NCBI RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/archaea
Bacteria of NCBI RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria

3.1.2 Eukaryote Genomes

Eight eukaryotic genomes, 4 fungal and 4 non-fungal, are provided to serve as outgroups in tree construction. The non-fungal genomes are Caenorhabditis elegans, Arabidopsis thaliana, Plasmodium falciparum and Drosophila melanogaster.

3.1.3 Tiny Genomes

There are a few highly degenerated genomes of bacterial endosymbionts in the inbuilt database. Their proteomes are very small ($< 10000$ amino acids), hence the adjective “Tiny”. Because they lack many genes, the position of these species in the phylogeny often turns out to be questionable, e.g., they tend toward the root and occasionally violate the trifurcation of the three main domains of life. This is why we suggest not including the “Tiny Genomes” in a study of mostly “free-living” organisms.

On the other hand, if one is interested in these highly degenerated genomes, one should keep in mind that the cut-off at $10^4$ amino acids is artificial and that many slightly larger genomes, e.g., those from some insect symbionts in the family Enterobacteriaceae, must be taken into account as well.
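
For uploaded genomes, the same size criterion can be checked locally before submission. The sketch below counts the amino acids in a protein FASTA text and compares the total with the $10^4$ cut-off. The function names `proteome_size` and `is_tiny` are hypothetical and are not part of WebCVTree4.

```python
def proteome_size(fasta_text):
    """Count amino acids in a protein FASTA string (header lines start with '>')."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line.rstrip("*"))  # ignore a trailing stop symbol
    return total

def is_tiny(fasta_text, cutoff=10_000):
    """Apply the (admittedly artificial) cut-off of 10^4 amino acids."""
    return proteome_size(fasta_text) < cutoff

# Toy proteome with two very short "proteins" (made up for illustration).
toy = ">p1\nMKV\nLLA*\n>p2\nMST\n"
size = proteome_size(toy)
```

Because the cut-off is artificial, the `cutoff` argument can be raised when slightly larger symbiont genomes should be flagged as well.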

3.2 Taxonomic References

For prokaryotes, direct comparison with taxonomy has become feasible only quite recently. On the one hand, the completion of the second edition of Bergey’s Manual of Systematic Bacteriology (Bergey’s Manual Trust, 2001-2012), which has been considered by many microbiologists as the best approximation to an official classification (Konstantinidis & Tiedje, 2005), provides a state-of-the-art framework for taxonomy together with current literature such as the International Journal of Systematic and Evolutionary Microbiology (IJSEM). On the other hand, the development of the CVTree approach has provided prokaryotic phylogeny with a convenient and comprehensible platform (Hao, 2011; Li et al., 2010).

Speaking of taxonomy, one must admit that there is no generally accepted standard for prokaryotic taxonomy. The aspiration to become a standard makes Bergey’s systematics a more conservative source. For example, there were deadlines and other restrictions for inclusion in the Manual. Many newly sequenced genomes have neither a standing in bacterial nomenclature nor a validly published name. These organisms are not reflected in Bergey’s Manual or in the current literature. In contrast, the NCBI taxonomy, though it disclaims being a taxonomic reference, is in fact more dynamic and up-to-date. At least, for any sequence deposited into GenBank there is a piece of lineage information in the NCBI taxonomy, no matter how incomplete it might be. Therefore, WebCVTree4 takes initial lineage information from NCBI.

4 Algorithm

Behind WebCVTree are two core algorithms: CVTree and CLTree.

CVTree (Composition Vector Tree): starts with genomic sequences and constructs a phylogenetic tree using a model based on K-mer frequencies.
CLTree (Collapse Lineage Tree): compares the phylogenetic tree with a taxonomic system and annotates the tree with taxonomic information. Furthermore, based on this taxonomic information, CLTree determines a suitable root for the unrooted tree.

4.1 CVTree: Composition Vector Tree

As the CVTree method has been extensively documented in the literature (Hao, 2011; Li et al., 2010; J. Qi et al., 2004), we present only a brief overview here.

4.1.1 Frequency or Probability of Appearance of K-Strings

The alignment-free approach of genome comparison is realized by extending single nucleotide or single amino acid counting to that of longer K-strings. Among early work along this line we mention the use of dinucleotide relative abundance as a genomic signature (Karlin & Burge, 1995). Given a DNA or amino acid sequence of length $L$, we count the number of appearances of (overlapping) strings of a fixed length $K$ in the sequence. The counting may be performed for a complete genome or for a collection of translated amino acid sequences. There are in total $N$ possible types of such strings: $N=4^K$ for DNA and $N=20^K$ for amino acid sequences.

For concreteness, consider the case of one protein sequence of length $L$. Denote the frequency of appearance of the $K$-string $a_1a_2 \cdots a_K$ by $f(a_1a_2 \cdots a_K)$, where each $a_i$ is one of the 20 amino acid single-letter symbols. This frequency divided by the total number $(L-K+1)$ of $K$-strings in the given protein sequence may be taken as the probability $p(a_1a_2 \cdots a_K)$ of appearance of the string $a_1a_2 \cdots a_K$ in the protein: \[p(a_1a_2 \cdots a_K)=\frac{f(a_1a_2 \cdots a_K)}{(L-K+1)}\] The collection of such frequencies or probabilities reflects both the result of random mutations and selective evolution in terms of $K$-strings as building blocks.
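
The counting step can be sketched in a few lines of Python. This is an illustration for a single sequence only, not the CVTree core program. The function names `kstring_counts` and `kstring_probabilities` are hypothetical.

```python
from collections import Counter

def kstring_counts(seq, k):
    """Count overlapping K-strings in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kstring_probabilities(seq, k):
    """Divide each count f by the total number (L - K + 1) of K-strings."""
    total = len(seq) - k + 1
    return {s: f / total for s, f in kstring_counts(seq, k).items()}

# A toy "protein" of length L = 6 contains L - K + 1 = 4 strings for K = 3.
counts = kstring_counts("MKVMKV", 3)
p = kstring_probabilities("MKVMKV", 3)
```

Only the strings that actually occur are stored, which matters because the number of possible types, $20^K$, grows very quickly with $K$.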

4.1.2 Subtraction of Random Background

Mutations happen in a more or less random manner at the molecular level, while selections shape the direction of evolution. Neutral mutations lead to some randomness in the $K$-string composition. In order to highlight the selective diversification of sequence composition one must subtract a random background from the simple counting results. This is done as follows.

Suppose we have done direct counting for all strings of length $(K-1)$ and $(K-2)$. The probability of appearance of $K$-strings is predicted by using a Markov model: \[p^0(a_1a_2 \cdots a_K) = \frac{p(a_1a_2 \cdots a_{K-1})p(a_2a_3\cdots a_K)}{p(a_2a_3 \cdots a_{K-1})}\] The superscript 0 on $p^0$ indicates that it is a predicted quantity. We note that the denominator comes from the frequency of $(K-2)$-strings. This kind of Markov model prediction has long been used in biological sequence analysis (Brendel, Beckmann, & Trifonov, 1986). It can be justified by virtue of a maximal entropy principle with appropriate constraints (Hu & Wang, 2001).
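
As an illustration of the prediction formula, the sketch below computes $p^0$ for a single toy sequence. The helper names `probs` and `predicted` are hypothetical, and the double loop is written for clarity rather than speed. Only strings with a non-zero prediction are stored.

```python
from collections import Counter

def probs(seq, k):
    """Observed probabilities of overlapping k-strings in one sequence."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: c / total for s, c in counts.items()}

def predicted(seq, k):
    """p0(a1..aK) = p(a1..aK-1) * p(a2..aK) / p(a2..aK-1), for K >= 2."""
    p1, p2 = probs(seq, k - 1), probs(seq, k - 2)
    p0 = {}
    for left in p1:
        for right in p1:
            # The two (K-1)-strings must overlap on their shared K-2 letters.
            if left[1:] == right[:-1]:
                p0[left + right[-1]] = p1[left] * p1[right] / p2[left[1:]]
    return p0

# Toy sequence: p(AB) = 2/3, p(BA) = 1/3, p(A) = p(B) = 1/2.
p0 = predicted("ABAB", 3)
```

In this toy case the prediction for `ABA` is $(2/3)(1/3)/(1/2)=4/9$, and the same value is obtained for `BAB`.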

4.1.3 Composition Vectors and Dissimilarity Matrix

It is the difference between the actual counting result $p$ and the predicted value $p^0$ that really reflects the shaping role of selective evolution. Therefore, we collect \[a(a_1a_2 \cdots a_K) = \begin{cases} \frac{p(a_1a_2 \cdots a_K) - p^0(a_1a_2 \cdots a_K)}{p^0(a_1a_2 \cdots a_K)} & \text{when $p^0 \neq 0$}\\ 0 & \text{when $p^0 = 0$} \end{cases}\] for all possible strings $a_1a_2 \cdots a_K$ as components to form a composition vector for a species. To further simplify the notation, we write $a_i$ for the $i$-th component corresponding to the string type $i$, where $i$ runs from 1 to $N=20^K$. Putting these components in a fixed order, we obtain a composition vector for the species $A$: \[A=(a_1,a_2,\cdots,a_N)\] Likewise, for the species $B$ we have a composition vector \[B=(b_1,b_2,\cdots,b_N)\]
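
The components can be stored sparsely, since most of the $20^K$ string types never occur in a given proteome. The sketch below assumes that the observed and predicted probabilities are already available as dictionaries. The function name `composition_vector` and the toy numbers are made up for illustration.

```python
def composition_vector(p, p0):
    """a = (p - p0) / p0 where p0 != 0, else 0; stored sparsely as a dict.

    `p` and `p0` map K-strings to observed and predicted probabilities.
    Strings missing from `p` have p = 0, which gives a = -1 when p0 > 0.
    Strings missing from `p0` have p0 = 0 and are dropped (component 0).
    """
    return {s: (p.get(s, 0.0) - q) / q for s, q in p0.items() if q != 0}

# Toy probabilities (not from a real genome).
p = {"ABA": 0.5, "BAB": 0.5}
p0 = {"ABA": 0.4, "BAB": 0.5, "BBA": 0.1}
cv = composition_vector(p, p0)
```

Note that a string that was predicted but never observed contributes the component $-1$, so it cannot simply be skipped.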

In principle there are different ways to construct the composition vectors. First, one may use the whole genome sequence. Second, one may just collect the coding sequences in the genome. Third, one makes use of the translated amino acid sequences from the coding segments of DNA. As mutation rates are higher and more variable in non-coding segments and protein sequences change at a more or less constant rate, one expects that the third choice is the best and the second is better than the first. We tried all three choices and the requirement of consistency served as a criterion. By consistency we mean the topology of the trees constructed with growing $K$ should converge. This is best realized with phylogenetic relations obtained from protein sequences. Therefore, in what follows we concentrate on results based on amino acid sequences.

The correlation $C(A,B)$ between any two species $A$ and $B$ is calculated as the cosine function of the angle between the two representative vectors in the $N$-dimensional space of composition vectors: \[C(A,B)=\frac{\sum_{i=1}^Na_i \times b_i}{(\sum_{i=1}^Na_i^2 \times \sum_{i=1}^Nb_i^2)^{\frac{1}{2}}}\] The distance $D(A,B)$ between the two species is defined as \[D(A,B)=\frac{1-C(A,B)}{2}\] Since $C(A,B)$ may vary between -1 and 1, the distance is normalized to the interval $[0,1]$. The collection of distances for all species pairs comprises a dissimilarity matrix. We prefer the term dissimilarity to distance, because the $D(A, B)$ defined above does not guarantee the fulfilment of all triangle inequalities (Li et al., 2010).
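
The two formulas translate directly into code. The sketch below works on sparse vectors stored as dictionaries and assumes that neither vector is all zeros. The names `correlation` and `dissimilarity` are hypothetical.

```python
import math

def correlation(a, b):
    """Cosine of the angle between two sparse composition vectors (dicts)."""
    dot = sum(v * b[s] for s, v in a.items() if s in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm

def dissimilarity(a, b):
    """D(A, B) = (1 - C(A, B)) / 2."""
    return (1.0 - correlation(a, b)) / 2.0

# Toy vectors: identical, orthogonal, and opposite directions.
A = {"ABA": 1.0, "BAB": 0.0}
B = {"ABA": 0.0, "BAB": 1.0}
C = {"ABA": -1.0}
```

Identical vectors give $D=0$, orthogonal vectors give $D=1/2$, and opposite vectors give $D=1$, which are the end points and the middle of the interval.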

4.1.4 Tree Construction

Once a distance matrix has been calculated it is straightforward to construct phylogenetic trees by using the neighbor-joining (NJ) method (Saitou & Nei, 1987).
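
For readers unfamiliar with neighbor-joining, a compact sketch of the textbook procedure is given below. It is an illustration only, not the optimized implementation used by CVTree. The function name `neighbor_joining`, the internal node names, and the toy matrix are assumptions. At least three taxa are required.

```python
def neighbor_joining(names, d):
    """Minimal neighbor-joining; returns an unrooted tree as a Newick string.

    `names` is a list of labels and `d[x][y]` a symmetric dissimilarity.
    """
    d = {x: dict(d[x]) for x in names}   # work on a copy
    label = {x: x for x in names}        # Newick text of each active node
    nodes = list(names)
    count = 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {x: sum(d[x][y] for y in nodes if y != x) for x in nodes}
        # Pick the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(
            ((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]),
            key=lambda pair: (n - 2) * d[pair[0]][pair[1]] - r[pair[0]] - r[pair[1]],
        )
        # Branch lengths from the new internal node to i and j.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        u = f"_node{count}"              # hypothetical internal node name
        count += 1
        label[u] = f"({label[i]}:{li:g},{label[j]}:{lj:g})"
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # The last three nodes meet at a single trifurcation.
    x, y, z = nodes
    lx = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    ly = d[x][y] - lx
    lz = d[x][z] - lx
    return f"({label[x]}:{lx:g},{label[y]}:{ly:g},{label[z]}:{lz:g});"

# Toy additive matrix for five taxa (values are made up).
names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
toy = {x: dict(zip(names, row)) for x, row in zip(names, rows)}
newick = neighbor_joining(names, toy)
```

The result is unrooted: the outermost trifurcation in the Newick string is an artifact of the notation, which is why a separate rooting step is needed before comparison with taxonomy.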

4.2 CLTree: Collapse Lineage Tree

CLTree is a tool for evaluating the congruence between a phylogenetic tree and a taxonomic system. Since the CLTree method has been described in the literature (Zuo, 2025), we present only a brief overview here. Before detailing the specific algorithm of CLTree, it is necessary to first explain why we employ this method to assess a phylogenetic tree.

4.2.1 Monophyly, Collapsing, and Convergence

A prominent feature of the CVTree approach is that the resulting trees are justified by direct comparison with taxonomy rather than by statistical resampling tests such as bootstrap or jackknife. Statistical resampling tests tell at most the stability and self-consistency of the tree with respect to small variations of the input data, not the objective correctness of the trees. We note, nevertheless, that the CVTree results have also successfully passed various statistical resampling tests (Zuo, Xu, Yu, & Hao, 2010).

A central notion in comparing a tree with taxonomy is monophyly. The notion of monophyly applies to phylogeny as well as to taxonomy; see, e.g., the discussion by James Farris (Farris, 1974, 1990). However, we use it in a pragmatic way by restricting it to the classification of genomes in the input dataset and to the collection of genomes in various tree branches.

If all genomes in a certain tree branch come from one and the same taxon and no genomes from other taxa are mixed in, the branch is said to be monophyletic at this taxonomic rank. For example, this happens to Methanobacteriota, as all the 178 genomes assigned to this phylum in the input dataset appear entirely and exclusively in one and the same branch. The branch may then be fully collapsed into one leaf labeled Methanobacteriota{178}. In this way, the total number of leaves seen in a tree may be greatly reduced.

From a taxonomic point of view, a taxon is monophyletic only when all species listed in it are descendants of one and the same ancestor. As this is hardly provable, monophyly has to be deduced from some phylogenetic study. For example, according to vol. 3 of Bergey’s Manual, 77 of the 167 species listed in the genus Clostridium form a cluster in a 16S rRNA gene tree. These are considered members of Clostridium sensu stricto, whereas the remaining 90 species are distributed in 10 different clusters. Naturally, one cannot expect a monophyletic branch of Clostridium genomes for the time being.

When a branch is collapsed monophyletically to a leaf made of genomes from one and the same taxon, it is said to be convergent at this taxonomic rank. In other words, only when collapsing leads to a monophyletic leaf is the taxon considered convergent at the corresponding K. Usually this happens at one or more K-values. Convergence at most or all K-values adds confidence to the result, although the branching topology may be slightly different.
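
The pragmatic notion of monophyly used here can be tested mechanically. The sketch below represents a rooted tree as nested tuples of genome names and asks whether some branch contains exactly the genomes assigned to a taxon. The representation and the names `leaves` and `is_monophyletic` are assumptions made for illustration.

```python
def leaves(tree):
    """Leaves of a rooted tree given as nested tuples of genome names."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def is_monophyletic(tree, members):
    """True if some branch contains exactly the genomes in `members`."""
    if leaves(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, members) for child in tree)

# Toy tree with five genomes; g1, g2 and g3 form one branch.
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
```

Here `{"g1", "g2", "g3"}` is monophyletic, whereas `{"g1", "g3"}` is not, because the smallest branch containing both genomes also contains `g2`.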

4.2.2 Annotating Rooted Phylogenetic Tree

After obtaining the lineages of all leaves of the phylogenetic tree, a rooted tree can be annotated by a recursive algorithm, named the “Collapse” algorithm. The pseudo-code is shown below:

Figure 18: Collapse Algorithm for Rooted Tree
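
To convey the recursive idea, a simplified sketch for a single taxonomic rank is given below. It is not the CLTree implementation: the tree is assumed to be given as nested tuples, `taxon_of` is a hypothetical mapping from genome name to taxon at the chosen rank, and the sketch does not check whether all genomes of a taxon end up in the same collapsed leaf.

```python
def collapse(tree, taxon_of):
    """Return (collapsed_tree, taxon, n_leaves); taxon is None for mixed branches."""
    if isinstance(tree, str):
        return tree, taxon_of[tree], 1
    results = [collapse(child, taxon_of) for child in tree]
    taxa = {taxon for _, taxon, _ in results}
    size = sum(n for _, _, n in results)
    if len(taxa) == 1 and None not in taxa:
        # All genomes below this node share one taxon: collapse to one leaf.
        taxon = taxa.pop()
        return f"{taxon}{{{size}}}", taxon, size
    return tuple(sub for sub, _, _ in results), None, size

# Toy tree and lineage (made up for illustration).
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
taxon_of = {"g1": "X", "g2": "X", "g3": "X", "g4": "Y", "g5": "Z"}
collapsed, taxon, n = collapse(tree, taxon_of)
```

In the toy example the three genomes of taxon `X` collapse into the leaf `X{3}`, while the mixed branch holding `g4` and `g5` is left untouched.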

4.2.3 Rooting the Unrooted Phylogenetic Tree

The Collapse algorithm can only annotate a rooted tree, because biological taxonomy is a hierarchical classification system. In the CLTree system, there are three methods to annotate an unrooted phylogenetic tree:

Rooting the phylogenetic tree to obtain the best consistency between the phylogenetic tree and taxonomy, and rearranging the tree to keep the operational taxonomic units (OTU). This is the default method of CLTree.
Rooting the tree by branch length. Five rooting methods are provided by CLTree: minimal variance (mv, default), minimal ancestor deviation (mad), midpoint (mp), pairwise midpoint root (pmr), and minimal depth (md). For more about these rooting methods, please refer to the references (Farris, 1972; Mai, Sayyari, & Mirarab, 2017; Tria, Landan, & Dagan, 2017).
Setting a leaf as the outgroup of the phylogenetic tree manually (Kluge & Farris, 1969).
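
As an illustration of rooting by branch length, the sketch below locates the midpoint root, the simplest of the branch-length methods listed above. It is a textbook procedure written for clarity and is not taken from the CLTree source. The adjacency-dictionary representation and the names `farthest` and `midpoint_root` are assumptions.

```python
def farthest(adj, start):
    """Return (distance, path) to the node farthest from `start` in a tree."""
    best = (0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[0]:
            best = (dist, path)
        for nxt, length in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best

def midpoint_root(adj):
    """Locate the midpoint of the longest leaf-to-leaf path.

    Returns (u, v, offset): the root lies on edge u-v, `offset` away from u.
    """
    any_node = next(iter(adj))
    _, path = farthest(adj, any_node)        # one end of the longest path
    total, path = farthest(adj, path[-1])    # the longest path itself
    walked = 0.0
    for u, v in zip(path, path[1:]):
        if walked + adj[u][v] >= total / 2:
            return u, v, total / 2 - walked
        walked += adj[u][v]

# Toy unrooted tree: leaves a, b, c, d and internal nodes n1, n2.
adj = {
    "a": {"n1": 1}, "b": {"n1": 3}, "c": {"n2": 1}, "d": {"n2": 6},
    "n1": {"a": 1, "b": 3, "n2": 2}, "n2": {"c": 1, "d": 6, "n1": 2},
}
root_edge = midpoint_root(adj)
```

In the toy tree the longest path runs between `b` and `d` with total length 11, so the root is placed on the edge between `d` and `n2`, 5.5 away from `d`.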

The last two methods have been described in other references. The first method, which is first provided by CLTree, is based on both the lineage and the branch lengths of the tree. In this method, a root with the best consistency with the lineage is sought. Here the degree of consistency is measured by the entropy reduction ratio in the CLTree system (see the next section). The basic idea of the method is as follows:

Randomly select a leaf as the outgroup, and annotate the tree by Algorithm.
Obtain the root candidates by Algorithm . Select one candidate as the root and rearrange the tree.
Annotate the new subtree.
Set the taxon level as operational taxonomic units (OTU), and determine the final root by a phylogenetic rooting method, e.g. minimal variance (mv, default), minimal ancestor deviation (mad), midpoint (mp), pairwise midpoint root (pmr), or minimal depth (md).

Figure 19: The Algorithm for Rooting Tree

5 Development History

The CV approach was first announced in 2002 at C. N. Yang’s 80th Birthday Conference (Hao, Qi, & Wang, 2003) and applied to coronaviruses (Gao et al., 2003) and prokaryotes (J. Qi et al., 2004). Since the publication of that paper (J. Qi et al., 2004), many groups have implemented the classical CVTree algorithms. Here we list the major versions implemented by our group. The version numbers of the Standalone CVTree were reset to follow those of the Web Server CVTree, as the standalone CVTree had never been published:

Most 0.x Standalone CVTree was written by Lei Gao; Ver. 0.9.6 was written by Ji Qi.
Web Server CVTree v1 was written by Ji Qi, Hong Luo, and Bailin Hao
Standalone CVTree 1.x was written by Zhao Xu
Web Server CVTree v2 was written by Zhao Xu and Bailin Hao
Standalone CVTree 2.x was written by Guanghong Zuo
Web Server CVTree v3 was written by Guanghong Zuo and Bailin Hao
Standalone CVTree 3.x was written by Guanghong Zuo
Standalone CLTree 1.0 was written by Guanghong Zuo
Web Server CVTree v4 was written by Guanghong Zuo

6 Acknowledgements

The CVTree project has been supported by the National Basic Research Project of China (973 Programs No. 2007CB814800 and No. 2013CB834100) and the Wenzhou Institute, University of Chinese Academy of Sciences (Grant No. WIUCASQD2021042).

References

Bergey’s Manual Trust. 2001-2012. Bergey’s manual of systematic bacteriology (2nd ed.). New York, Heidelberg: Springer-Verlag.

Brendel, V., Beckmann, J. S., & Trifonov, E. N. 1986. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. Journal of Biomolecular Structure & Dynamics, 4(1): 11–21.

Choi, J., Kim, K. T., Jeon, J., & Lee, Y. H. 2013. Fungal plant cell wall-degrading enzyme database: A platform for comparative and evolutionary genomics in fungi and Oomycetes. BMC Genomics, 14(Suppl. 5): 57.

Farris, J. S. 1972. Estimating phylogenetic trees from distance matrices. American Naturalist, 106(951): 645–667.

Farris, J. S. 1974. Formal definitions of paraphyly and polyphyly. Systematic Zoology, 23(4): 548–554.

Farris, J. S. 1990. Haeckel, History, and Hull. Systematic Zoology, 39(1): 81–88.

Gao, L., & Qi, J. 2007. Whole genome molecular phylogeny of large dsDNA viruses using composition vector method. BMC Evolutionary Biology, 7: 41.

Gao, L., Qi, J., Sun, J. D., & Hao, B. L. 2007. Prokaryote phylogeny meets taxonomy: An exhaustive comparison of composition vector trees with systematic bacteriology. Science in China Series C: Life Sciences, 49(5): 587–599.

Gao, L., Qi, J., Wei, H. B., Sun, Y. G., & Hao, B. L. 2003. Molecular phylogeny of coronaviruses including human SARS-CoV. Chinese Science Bulletin, 48(12): 1170–1174.

Hao, B. L. 2011. CVTrees support the Bergey’s systematics and provide high resolution at species level and below. Bulletin of BISMiS, 2,Part 2: 189–196.

Hao, B. L., Qi, J., & Wang, B. 2003. Prokaryote phylogeny based on complete genomes without sequence alignment. Modern Physics Letters B, 17: 91–94.

Hu, R., & Wang, B. 2001. Statistically significant strings are related to regulatory elements in the promoter regions of Saccharomyces cerevisiae. Physica A: Statistical Mechanics and Its Applications, 290(3-4): 464–474.

Karlin, S., & Burge, C. 1995. Dinucleotide relative abundance extremes: A genomic signature. Trends in Genetics: TIG, 11(7): 283–90.

Kluge, A. G., & Farris, J. S. 1969. Quantitative phyletics and the evolution of anurans. Systematic Zoology, 18(1): 1–32.

Konstantinidis, K. T., & Tiedje, J. M. 2005. Towards a genome-based taxonomy for prokaryotes. Journal of Bacteriology, 187: 6258–6264.

Lapage, S. P., Sneath, P. H. A., & Lessal, E. F. 1992. International code of nomenclature of bacteria: Bacteriological code 1990. Washington, DC: ASM Press.

Li, Q., Xu, Z., & Hao, B. L. 2010. Composition vector approach to whole-genome-based prokaryote phylogeny: Success and foundations. Journal of Biotechnology, 149: 115–119.

Liu, J., Wang, H., Yang, H., Zhang, Y., Wang, J., et al. 2013. Composition-based classification of short metagenomic sequences elucidates the landscapes of taxonomic and functional enrichment of microorganisms. Nucleic Acids Res, 41(1): e3.

Lu, Z., Y. F. 2025. CVTree for 16S rRNA: Constructing taxonomy-compatible all-species living tree effectively and efficiently. Chinese Physics B, 34(8): 088704.

Mai, U., Sayyari, E., & Mirarab, S. 2017. Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. PLoS One, 12(8): e0182238.

O’Connell, R. J., Thon, M. R., Hacquard, S., & Amyotte, S. G. 2012. Lifestyle transitions in plant pathogenic Colletotrichum fungi deciphered by genome and transcriptome analyses. Nature Genetics, 44: 1060–1065.

Qi, Ji, Luo, H., & Hao, B. 2004. CVTree: A phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Research, 32(Web Server issue): W45–7.

Qi, J., Wang, B., & Hao, B. L. 2004. Whole proteome prokaryote phylogeny without sequence alignment: A K-String composition approach. Journal of Molecular Evolution, 58(1): 1–11.

Saitou, N., & Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4): 406–425.

Sun, J. D., Xu, Z., & Hao, B. L. 2010. Whole-genome based Archaea phylogeny and taxonomy: A composition vector approach. Chinese Science Bulletin, 55(24): 2323–2328.

Tria, F. D. K., Landan, G., & Dagan, T. 2017. Phylogenetic rooting using minimal ancestor deviation. Nat Ecol Evol, 1: 193.

Wang, H., Xu, Z., Gao, L., & Hao, B. L. 2009. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evolutionary Biology, 9: 195.

Xu, Z., & Hao, B. L. 2009. CVTree update: A newly designed phylogenetic study platform using composition vectors and whole genomes. Nucleic Acids Research, 37: W174–W178.

Yu, Z. G., Zhou, L. Q., Anh, V. V., Chu, K. H., Long, S. C., et al. 2005. Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment. Journal of Molecular Evolution, 60(4): 538–45.

Zhang, Q., Wu, Y., Wang, J., Wu, G., Long, W., et al. 2016. Accelerated dysbiosis of gut microbiota during aggravation of DSS-induced colitis by a butyrate-producing bacterium. Sci Rep, 6: 27572.

Zuo, G. 2021. CVTree: A parallel alignment-free phylogeny and taxonomy tool based on composition vectors of genome. Genomics Proteomics Bioinformatics, 19: 662–667.

Zuo, G. H. 2025. CLTree: Annotating, rooting and evaluating phylogenetic tree based on lineage of genomes. In Submitting.

Zuo, G. H., Hao, B. L., & Staley, J. T. 2014a. Geographic divergence of "Sulfolobus islandicus" strains by genomic analyses including electronic DNA hybridization confirms they are geovars. Antonie van Leeuwenhoek, 105: 431–435.

Zuo, G. H., Li, Q., & Hao, B. L. 2014b. On K-tuple length in composition vector phylogeny of prokaryotes. Computational Biology and Chemistry, 53: 166–173.

Zuo, G. H., Xu, Z., & Hao, B. L. 2013. Shigella strains are not clones of Escherichia coli but sister species in the genus Escherichia. Genomics Proteomics Bioinformatics, 11: 61–65.

Zuo, G. H., Xu, Z., & Hao, B. L. 2015. Phylogeny and taxonomy of Archaea: A comparison of the whole-genome-based CVTree approach with 16S rRNA sequence analysis. Life, 5: 949–968.

Zuo, G. H., Xu, Z., Yu, H. J., & Hao, B. L. 2010. Jackknife and bootstrap tests of the composition vector trees. Genomics, Proteomics and Bioinformatics, 8: 262–267.

Zuo, G., & Hao, B. 2015. CVTree3 web server for whole-genome-based and alignment-free prokaryotic phylogeny and taxonomy. Genomics Proteomics Bioinformatics, 13(5): 321–31.