Motivation: Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem in which a protein can be associated with multiple Gene Ontology (GO) terms as its labels. Building on our GOLabeler[1], a state-of-the-art method in the third Critical Assessment of Functional Annotation (CAFA3)[2], we propose NetGO 2.0[3], a method that further improves the performance of large-scale AFP by incorporating massive protein-protein network information and biomedical text information.

Results: Specifically, the advantages of NetGO 2.0 in using network, text and sequence information are as follows: (i) NetGO 2.0 relies on a powerful learning-to-rank framework from machine learning to effectively integrate sequence, network and text information of proteins[4][5]; (ii) NetGO 2.0 uses the massive network information of all species (>5000) in STRING, rather than only some specific species; (iii) NetGO 2.0 can still use network information to annotate a protein by homology transfer, even if the protein is not contained in STRING; (iv) text information is useful for predicting BP (Biological Process) and CC (Cellular Component); and (v) the deep learning-based sequence model achieves good performance in predicting CC. Separating training and testing data with the same time-delayed settings as CAFA, we comprehensively examined the performance of NetGO 2.0. Experimental results clearly demonstrate that NetGO 2.0 significantly outperforms GOLabeler and other competing methods. In addition, according to the preliminary results of CAFA4 reported at ISMB 2020 (July 2020), NetGO 2.0 achieved first place on several metrics for MF (Molecular Function), BP and CC.

Version History

2020.09.20: NetGO 2.0 released.
2020.02.22: NetGO 1.1 released.
2018.11.20: NetGO 1.0 released.

References

  1. You R, Zhang Z, Xiong Y, et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics, Volume 34, Issue 14, 15 July 2018, Pages 2465–2473.
  2. Zhou N, Jiang Y, Bergquist T R, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome biology, 2019, 20(1): 1-23.
  3. Yao S, You R, Wang S, et al. NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Research, 2021, gkab398.
  4. You R, Yao S, Xiong Y, et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W379–W387.
  5. You R, Huang X, Zhu S. DeepText2Go: Improving large-scale protein function prediction with deep semantic text representation. Methods, 2018, 145: 82-90.
Note: The maximum supported number of proteins for each online submission is 1,000. Click "Help" for the tutorial on NetGO predictions. If your job contains more than 1,000 proteins, you can split them into separate jobs or send your whole input file to us at swyao18@fudan.edu.cn.
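
If a job exceeds the 1,000-protein limit, the split into separate jobs can be scripted. The following is a minimal Python sketch, not part of NetGO itself: the function names (read_fasta, split_into_jobs, to_fasta) are hypothetical, and it assumes the input is plain FASTA text already loaded into memory.

```python
from typing import Iterator, List, Tuple

# Limit stated in the note above.
MAX_PROTEINS_PER_JOB = 1000

Record = Tuple[str, str]  # (header line without '>', sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) pairs (hypothetical helper)."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_jobs(records: List[Record],
                    size: int = MAX_PROTEINS_PER_JOB) -> Iterator[List[Record]]:
    """Yield consecutive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def to_fasta(records: List[Record]) -> str:
    """Serialize records back to FASTA text, one sequence per line."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)
```

Each batch returned by split_into_jobs can then be written out with to_fasta and submitted as its own job.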

To run NetGO 2.0, you must provide proteins with standard UniProt identifiers. Otherwise, text information will be ignored.

The Tutorial on NetGO 2.0 Predictions

NetGO 2.0 improves large-scale automated function prediction (AFP) with massive sequence, text, domain/family and network information. This tutorial describes how to make a prediction and how to interpret the query results.

A. How to obtain predictions

This is a screenshot of the input page. The numbers in red refer to the sections below.

[Screenshot: the main frame]

1. Input protein sequence(s)

First, specify the sequence(s) for which you want predictions. Sequences can either be typed directly into the text area or uploaded from a file using the upload button. If both the text area and the uploaded file contain sequences, the server only considers the sequence(s) in the text area and ignores those in the uploaded file.

Only FASTA format is accepted by this server. In FASTA format, each record begins with a line starting with '>', followed by the protein sequence on one or more lines. The lines starting with '>' are treated as the identifiers of the sequences that follow.

All sequences must be amino acid sequences in single-letter code (ACDEFGHIKLMNPQRSTVWYVBZX*). Any other non-whitespace character will be rejected by the input processor with a notification. The input processor also checks for empty sequences and for the uniqueness of sequence identifiers, and gives a warning when nucleotide-like sequences are found.
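
To catch problems before submitting, the checks described above can be approximated locally. The following Python sketch is an illustration only and is not the server's input processor: the function name check_fasta is hypothetical, the comparison is done after uppercasing (an assumption), and the nucleotide-like rule (at least 90% of residues in A, C, G, T, N) is an assumed heuristic, since the server's actual criterion is not documented here.

```python
from typing import List, Tuple

# Allowed single-letter codes, copied from the text above.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYVBZX*")

# Assumed heuristic for "nucleotide-like"; the real server rule is unknown.
NUCLEOTIDE_LIKE = set("ACGTN")
NUCLEOTIDE_FRACTION = 0.9


def check_fasta(text: str) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for FASTA text (hypothetical helper)."""
    errors: List[str] = []
    warnings: List[str] = []
    records: List[List[str]] = []  # each item is [identifier, sequence]

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif not records:
            errors.append("sequence data before the first '>' line")
        else:
            # Drop inner whitespace and uppercase (assumption) before checking.
            records[-1][1] += "".join(line.split()).upper()

    seen = set()
    for identifier, sequence in records:
        if identifier in seen:
            errors.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if not sequence:
            errors.append(f"empty sequence: {identifier}")
            continue
        bad = sorted(set(sequence) - ALLOWED)
        if bad:
            errors.append(f"invalid characters in {identifier}: {''.join(bad)}")
        elif (sum(c in NUCLEOTIDE_LIKE for c in sequence) / len(sequence)
              >= NUCLEOTIDE_FRACTION):
            warnings.append(f"nucleotide-like sequence: {identifier}")
    return errors, warnings
```

An input that yields an empty error list here may still be rejected by the server, so treat the result as a pre-check rather than a guarantee.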

If you have trouble choosing protein sequences, the web page also provides an example. Click "Show an example" to load an example input for NetGO, or click "Example File" to download an example file.

Note: To run NetGO 2.0, you must provide proteins with standard UniProt identifiers. Otherwise, text information related to the input proteins will not be used.

2. Input your email address (Optional)

Here you can enter your email address. We will send you a confirmation email after your submission, and another email when the prediction results are available. Although it is optional, we highly recommend using this service.

After you have followed all the above steps, press the "Submit" button to start the prediction. We will provide you with a job id and a web link to the results. The page will refresh automatically when your query results are available (if you provided your email address, you will also receive an email notification). At any time, you can track your job status by clicking the link or by entering your job id on the "Check" page.

The time needed for a prediction depends on the number of input proteins. The table below lists running times for your reference.

Number of proteins   NetGO 1.0   NetGO 2.0
1                    3.1 min     7.3 min
100                  9.1 min     16.8 min
200                  16.9 min    23.5 min
400                  32.5 min    42.4 min
1000                 72.7 min    90.1 min
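
For job sizes that fall between the rows of the table, a rough estimate can be obtained by linear interpolation. The sketch below is only an approximation under that assumption: the function name estimate_minutes is hypothetical, and actual running times may differ from any interpolated value.

```python
from bisect import bisect_left

# (number of proteins, NetGO 1.0 minutes, NetGO 2.0 minutes), from the table above.
RUNTIMES = [
    (1, 3.1, 7.3),
    (100, 9.1, 16.8),
    (200, 16.9, 23.5),
    (400, 32.5, 42.4),
    (1000, 72.7, 90.1),
]


def estimate_minutes(num_proteins: int, version: str = "2.0") -> float:
    """Linearly interpolate the reference running times (assumed model)."""
    if not 1 <= num_proteins <= 1000:
        raise ValueError("the table only covers 1 to 1000 proteins")
    column = {"1.0": 1, "2.0": 2}[version]
    counts = [row[0] for row in RUNTIMES]
    i = bisect_left(counts, num_proteins)
    if counts[i] == num_proteins:
        return RUNTIMES[i][column]  # exact table entry
    x0, y0 = counts[i - 1], RUNTIMES[i - 1][column]
    x1, y1 = counts[i], RUNTIMES[i][column]
    return y0 + (y1 - y0) * (num_proteins - x0) / (x1 - x0)
```

For example, estimate_minutes(300, "1.0") interpolates between the 200-protein and 400-protein rows of the NetGO 1.0 column.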

B. Interpreting the prediction output

Once a prediction has finished, you can obtain your query results in the same way that you track your job status. The following is a screenshot of a result page. The numbers in red refer to different sections of the interface.

[Screenshot: the main frame]

1. Information about results (Part 1)

We show some basic information about results and provide a link to download your results.

2. Prediction results for a protein (Part 2)

Only the results for the first 10 proteins (or 20 or 30, depending on the user's choice) are shown.

(1) Protein name (2.(1))
For each protein, you can click the protein name to obtain its information in UniProt.

(2) Prediction results shown in a graph (2.(2))
We visualize the top m predicted GO terms (m=20 by default; it can be set to 30, 50 or 100) according to the GO structure and organize them in a graph. Note that GO terms of high confidence (score > 0.6) are emphasized with colors ([1.0, 0.9), [0.9, 0.8), [0.8, 0.7), [0.7, 0.6), [0.6, 0.0]). The top predicted GO terms are visualized using the AmiGO API, and the meaning of the coloured lines between GO terms is described in the AmiGO Manual: Visualize: blue denotes the relation "is_a", light blue "part_of", brown "develops_from", black "regulates", red "negatively_regulates" and green "positively_regulates".
Note that within each sub-ontology, GO terms form a directed acyclic graph (DAG) through the "is_a" and "part_of" relationships, while the AmiGO visualization tool also takes into account "develops_from", "regulates" and other relationships. As a result, terms from other sub-ontologies, as well as CARO and CL terms, may be presented.
You can click the graph to show a high resolution version.
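
The score intervals used for the color emphasis can be reproduced when post-processing downloaded results. This Python sketch reads each interval as written above, with the upper bound included and the lower bound excluded (so a score of exactly 0.9 falls in [0.9, 0.8)). The function name confidence_bin and the integer bin indices are hypothetical; the actual colors used by the server are not specified here.

```python
# Lower (exclusive) bounds of the four high-confidence intervals, highest first.
BIN_LOWER_BOUNDS = (0.9, 0.8, 0.7, 0.6)


def confidence_bin(score: float) -> int:
    """Map a score in [0, 1] to an interval index (hypothetical helper).

    0 -> [1.0, 0.9), 1 -> [0.9, 0.8), 2 -> [0.8, 0.7), 3 -> [0.7, 0.6),
    4 -> [0.6, 0.0], i.e. not high confidence (score <= 0.6).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    for index, lower in enumerate(BIN_LOWER_BOUNDS):
        if score > lower:
            return index
    return len(BIN_LOWER_BOUNDS)
```

A term is of high confidence in the sense used above exactly when confidence_bin returns a value below 4.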

(3) Prediction results listed in a table (2.(3))
The top 20 predicted GO terms are also shown in a table. The three columns list the GO terms, their scores and their names, respectively. The numbers in brackets are the ranks of a GO term in the prediction results of the component methods. You can click on a GO term to display its detailed information. GO terms of high confidence (score > 0.6) are also emphasized with colors ([1.0, 0.9), [0.9, 0.8), [0.8, 0.7), [0.7, 0.6), [0.6, 0.0]).
Note that the scores, especially those of the component methods, can roughly be regarded as probabilities. However, since we treat AFP as a ranking problem, NetGO 2.0 focuses more on the relative ranking of the scores of different GO terms for the same protein. Scores are not directly comparable between different proteins, different components or different sub-ontologies.
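
Because scores are only meaningful as a ranking within one protein and one sub-ontology, downstream analysis is safer when it ranks terms per group instead of applying one global cutoff. The following Python sketch illustrates this under stated assumptions: the tuple layout (protein, sub-ontology, GO term, score) and the function name top_terms are hypothetical and do not describe the format of the downloadable result file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Prediction = Tuple[str, str, str, float]  # (protein, sub-ontology, GO term, score)


def top_terms(predictions: Iterable[Prediction],
              m: int = 20) -> Dict[Tuple[str, str], List[Tuple[str, float]]]:
    """Keep the m best-scoring GO terms per (protein, sub-ontology) group."""
    grouped: Dict[Tuple[str, str], List[Tuple[str, float]]] = defaultdict(list)
    for protein, ontology, go_term, score in predictions:
        grouped[(protein, ontology)].append((go_term, score))
    return {
        key: sorted(terms, key=lambda item: item[1], reverse=True)[:m]
        for key, terms in grouped.items()
    }
```

Ranking inside each group avoids comparing, for example, an MF score of one protein with a BP score of another.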

C. Browser compatibility

We have tested the compatibility of our system with the following browsers:
OS        Version     Chrome   Firefox   Microsoft Edge   Safari
Linux     Ubuntu 16   71.0     64.0      n/a              n/a
MacOS     12          83.0     84.0      83.0             13.1
Windows   10          87.0     83.0      87.0             n/a

For any scientific problems, please contact Shanfeng Zhu (zhusf@fudan.edu.cn).

For bug reports or technical problems, please contact Shaojun Wang (shaojunwang20@fudan.edu.cn).

We will highly appreciate your support and kindness.

This web server is free and open to all without a login requirement.