ViMRT
Welcome to the ViMRT docs!

These docs are bundled with the ViMRT download for your convenience, so you can also read them from your local installation.

Using ViMRT

Introduction

ViMRT is a text-mining tool and search engine for automated recognition of virus mutations in the literature. Built on natural language processing, it uses rule patterns and regular expression patterns to cover the different written forms of virus mutations. It can also quickly and accurately search virus mutation-related information, including virus genes and related diseases.

Getting Started with ViMRT

System Python

Before we start, a quick note: using the system-wide installation of Python is not recommended. This often causes problems, and it's a little risky to mess with it. If you find yourself prepending sudo to any ViMRT commands, take a step back and think about Python virtual environments / conda instead (see below).

Installing Python

To see if you have python installed, run python --version on the command line. ViMRT needs Python version 2.7+, 3.7+ or 3.8+.

We recommend using virtual environments to manage your Python installation. Our favourite is conda, a cross-platform tool to manage Python environments. You can find installation instructions for Miniconda here.

Once conda is installed, you can create a Python environment with the following commands:

conda create --name py3.8 python=3.8
conda activate py3.8

You'll want to add the conda activate py3.8 line to your .bashrc file so that the environment is loaded every time you open a terminal.

Installing Python packages

Then install the Python packages needed to run the code:

pip3 install -r requirements.txt

The requirements.txt file is included in ViMRT.zip.

Virus mutation recognition

The recognition of virus mutation is mainly divided into two independent modules:
I. Optimize the recognition result of tmVar by rule patterns
II. Develop regular expression patterns to recognize virus mutation

Downloading BioC format

Each input file should follow the BioC format (from PubMed abstracts and PMC full-text articles). The user can download files in BioC format by running the code below; if BioC files are already available, this step can be skipped.

python Bio_download.py -i [input] -o [output] -s [sources]

input: the input file listing PMIDs, such as PMIDlist.txt.
output: the output folder path.
sources: the source of the output files: PubMed | PMC | PubMed_PMC, which obtain the PubMed abstracts, the PMC full-text articles, or both, respectively.

Example: python Bio_download.py -i PMIDlist.txt -o ./tmVar/tmvar_input -s PubMed
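The PMID input file can be prepared by hand or generated. The sketch below assumes that PMIDlist.txt holds one numeric PMID per line; that layout is an assumption, not something stated in these docs, and the PMIDs used are placeholders.

```python
def pmid_list_text(pmids):
    """Build the text of a PMID list file: unique numeric PMIDs, one per line.

    The one-PMID-per-line layout is an assumption about PMIDlist.txt.
    """
    seen = set()
    cleaned = []
    for pmid in pmids:
        pmid = str(pmid).strip()
        if not pmid.isdigit():
            raise ValueError(f"not a valid PMID: {pmid!r}")
        if pmid not in seen:  # drop duplicates, keep first-seen order
            seen.add(pmid)
            cleaned.append(pmid)
    return "\n".join(cleaned) + "\n"


# Placeholder PMIDs for illustration only.
text = pmid_list_text(["12345678", " 23456789 ", 12345678])
print(text, end="")
```

The returned text can then be saved, for example with pathlib.Path("PMIDlist.txt").write_text(text), and passed to Bio_download.py through -i.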

Identifying mutation by ViMRT

It includes three steps:
1. Optimizing results of tmVar by rule patterns
2. Identifying mutation by regular expression patterns
3. Integrating the results of rule and regex

*Note: step 1 and step 2 can be run independently.

1. Optimizing results of tmVar by rule patterns

In this step, the user first needs to download tmVar from the official website and use it to identify mutations. tmVar runs only in a Windows or Linux environment. Please see the instructions in the zip files for more details. The main command is as follows:

java -Xmx5G -Xms5G -jar tmVar.jar [input] [output]
input: the user can provide the input folder path.
output: the user can provide the output folder path.

*Note: each input file and output file of tmVar should follow the PubTator format or the BioC format. If the input files are from PMC full-text articles, the output files are available only in BioC format.
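For readers unfamiliar with the PubTator format, here is a minimal parsing sketch. It assumes the common layout of PMID|t|title and PMID|a|abstract lines followed by tab-separated annotation lines (PMID, start, end, mention, type, identifier). The sample record is made up for illustration, and this is not the parser used inside ViMRT.

```python
def parse_pubtator(text):
    """Parse PubTator-style text into {pmid: {title, abstract, annotations}}."""
    docs = {}

    def doc_for(pmid):
        return docs.setdefault(pmid, {"title": "", "abstract": "", "annotations": []})

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        # Title/abstract lines look like "<digits>|t|..." or "<digits>|a|..."
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            key = "title" if parts[1] == "t" else "abstract"
            doc_for(parts[0])[key] = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:  # annotation line
            pmid, start, end, mention, etype = fields[:5]
            doc_for(pmid)["annotations"].append({
                "start": int(start),
                "end": int(end),
                "mention": mention,
                "type": etype,
                "id": fields[5] if len(fields) > 5 else "",
            })
    return docs


# Made-up record: placeholder PMID, illustrative annotation.
sample = (
    "12345678|t|HBV surface mutation G145R\n"
    "12345678|a|We studied G145R.\n"
    "12345678\t21\t26\tG145R\tProteinMutation\tp|SUB|G|145|R\n"
)
docs = parse_pubtator(sample)
print(docs["12345678"]["annotations"][0]["mention"])
```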

Then, the user can optimize the recognition results of tmVar by running a Python script:

python ViMRT.py -i [input] -o [output] -v [virus] -f [format] -m rules

input: the input folder path containing the output result files of tmVar.
output: the output folder path. The default is the current path.
virus: one virus name. The default is 'Unknown'.
format: the input file format: PubTator or BioC. The default is 'BioC'.

Example: python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_result/ -v HBV -f BioC -m rules

*Note: ViMRT has specific rules for optimizing the results of tmVar according to the written form of mutations in the literature for five viruses: HBV, HPV, HIV, EBV and HTLV1. For example, sG145R is optimized to G145R for HBV mutations. If the virus parameter is one of these five virus names, its specific rules are applied during optimization.
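To make the sG145R example concrete, the sketch below shows what a virus-specific rule of this kind could look like. The pattern and the function name normalize_mention are hypothetical; the actual rules shipped with ViMRT are not described in these docs and may differ.

```python
import re

# Hypothetical rule: one lowercase prefix letter (such as "s") in front of a
# substitution written as <reference><position><alternative>.
_PREFIXED_SUBSTITUTION = re.compile(r"^[a-z]([A-Z]\d+[A-Z])$")


def normalize_mention(mention, virus="Unknown"):
    """Strip a lowercase prefix from a mutation mention, for HBV only."""
    if virus == "HBV":
        match = _PREFIXED_SUBSTITUTION.match(mention)
        if match:
            return match.group(1)
    return mention  # other viruses and other written forms are left untouched


print(normalize_mention("sG145R", virus="HBV"))
```

Keeping the rule behind a virus check mirrors the behaviour described in the note: the rule only applies when the matching virus name is given.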

2. Identifying mutation by regular expression patterns

Based on the development dataset and the false positive results of tmVar, ViMRT has developed regular expression patterns to recognize virus mutations from the original literature.

python ViMRT.py -i [input] -o [output] -v [virus] -f [format] -m regex

input: the input folder path containing the output result files of tmVar or downloaded BioC format files.
output: the output folder path. The default is the current path.
virus: one virus name. The default is 'Unknown'.
format: the input file format: PubTator or BioC. The default is 'BioC'.

Example: python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_result/ -v HBV -f BioC -m regex

*Note: ViMRT has specific regular expression patterns to identify virus mutations according to their written form in the literature for five viruses: HBV, HPV, HIV, EBV and HTLV1. For example, the HTLV1 M47 mutation will be identified as 'L319R' and 'L320S'. If the virus parameter is one of these five virus names, its specific regular expression patterns are applied during recognition.
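As an illustration of regular-expression-based recognition, here is a minimal sketch that finds simple substitutions written as one letter, a position, and one letter (for example G145R or C1653T). It is an assumption-laden toy pattern, not one of the patterns shipped with ViMRT, and it does not handle virus-specific forms such as the HTLV1 M47 case.

```python
import re

# One-letter amino acid codes; the nucleotides A, C, G and T are a subset.
_RESIDUE = "ACDEFGHIKLMNPQRSTVWY"
_SUBSTITUTION = re.compile(rf"\b([{_RESIDUE}])(\d+)([{_RESIDUE}])\b")


def find_substitutions(sentence):
    """Return (reference, position, alternative) tuples found in a sentence."""
    return [
        (ref, int(pos), alt)
        for ref, pos, alt in _SUBSTITUTION.findall(sentence)
    ]


# Made-up example sentence.
print(find_substitutions("The G145R and C1653T mutations were detected."))
```

The word boundaries keep the pattern from matching inside longer tokens, which is also why a prefixed form such as sG145R is not picked up by this toy pattern.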

3. Integrating the results of rule and regex

If users recognize mutations separately by rule and by regex, they need to merge the results by running ViMRT.py as follows:

python ViMRT.py -i [input] -o [output] -f [format] -c concat

input: the input path containing the identification result files from both rule and regex.
output: the output folder path. The default is the current path.
format: the input file format: PubTator or BioC. The default is 'BioC'.

Example: python ViMRT.py -i ./ViMRT_result/ -o ./ViMRT_mutation/ -f BioC -c concat
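Conceptually, the integration step combines two result sets and removes duplicates. The sketch below shows one way such a merge could work on in-memory records. The field names pmid and mutation and the added sources field are hypothetical; the real column layout of the ViMRT result files is not documented here.

```python
def merge_results(rule_rows, regex_rows, key_fields=("pmid", "mutation")):
    """Merge two lists of record dicts, keeping one entry per key.

    Each merged entry records which method(s) produced it in "sources".
    """
    merged = {}
    for source, rows in (("rules", rule_rows), ("regex", regex_rows)):
        for row in rows:
            key = tuple(row[field] for field in key_fields)
            entry = merged.setdefault(key, {**row, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())


# Placeholder records for illustration only.
rules = [{"pmid": "12345678", "mutation": "G145R"}]
regex = [
    {"pmid": "12345678", "mutation": "G145R"},
    {"pmid": "12345678", "mutation": "C1653T"},
]
for row in merge_results(rules, regex):
    print(row)
```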



Alternatively, the user can run ViMRT.py directly to obtain both the optimization results and the regular expression results from the output files of tmVar.

Identifying the mutation by rule and regex

Using tmVar output result files in BioC format:

python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_mutation/ -v HBV -f BioC

This will generate 3 files: BioC_rules.csv, BioC_regex.csv and BioC_rules_regex.xlsx in the "BioC" folder of the output path. The BioC_rules_regex.xlsx file merges the results of BioC_rules.csv and BioC_regex.csv and holds the final mutation recognition results of ViMRT.


Using tmVar output result files in PubTator format:

python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_mutation/ -v HBV -f PubTator

This will generate 3 files: PubTator_rules.csv, PubTator_regex.csv and PubTator_rules_regex.xlsx in the "PubTator" folder of the output path. The PubTator_rules_regex.xlsx file merges the results of PubTator_rules.csv and PubTator_regex.csv and holds the final mutation recognition results of ViMRT.
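When ViMRT.py is called from another script, it can help to assemble the command line in one place. The sketch below only uses the flags shown in the examples above (-i, -o, -v, -f, -m); the helper name build_vimrt_command and its validation are assumptions for illustration, not part of ViMRT.

```python
import shlex

VALID_FORMATS = {"BioC", "PubTator"}
VALID_MODES = {"rules", "regex"}


def build_vimrt_command(input_dir, output_dir, virus="Unknown", fmt="BioC", mode=None):
    """Return the argument list for one ViMRT.py run.

    mode=None leaves out -m, matching the combined rule-and-regex examples.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)} or None")
    cmd = ["python", "ViMRT.py", "-i", input_dir, "-o", output_dir,
           "-v", virus, "-f", fmt]
    if mode is not None:
        cmd += ["-m", mode]
    return cmd


cmd = build_vimrt_command("./tmVar_result/", "./ViMRT_mutation/", virus="HBV")
print(shlex.join(cmd))
# To execute it from the folder containing ViMRT.py:
#   import subprocess; subprocess.run(cmd, check=True)
```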

Virus gene recognition

ViMRT has built a virus gene corpus from NCBI PubMed and the gene database, and provides a Python script to identify virus genes in virus mutation sentences.

Virus gene corpus

ViMRT has collected the gene name lists of 7,194 viruses. Users can add their own gene names to the gene_vocabulary.txt file in the genecorpus folder. In addition, ViMRT uses the gene_vocabulary_error.txt file in the genecorpus folder to eliminate likely identification errors caused by short virus gene names, e.g., the S gene of HBV. Users can extend this error list according to their own needs.
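The interplay between a vocabulary and an error list can be pictured with a small dictionary-matching sketch. It assumes the error list simply names terms that should never be reported; how Gene_Recognize.py actually uses gene_vocabulary_error.txt is not specified in these docs, and the terms below are illustrative.

```python
import re


def find_genes(sentence, vocabulary, error_terms=()):
    """Return vocabulary terms found as whole tokens, skipping blocked terms."""
    blocked = {term.lower() for term in error_terms}
    hits = []
    # Longer names first, so specific names are reported before short ones.
    for term in sorted(set(vocabulary), key=len, reverse=True):
        if term.lower() in blocked:
            continue
        # Lookarounds act as word boundaries that also work for names with symbols.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, sentence):
            hits.append(term)
    return hits


sentence = "The S gene and preS1 region of HBV"
print(find_genes(sentence, ["S", "preS1", "X"], error_terms=["S"]))
```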

Identifying virus gene

python Gene_Recognize.py -i [input] -o [output] -v [virus]

input: the input file (for example, gene_example.txt). The input file format: PMID+"|pmid|"+sentence
output: the output folder path. The default is the current path.
virus: the virus name, such as HBV or HBV;HPV. The default is "fullvirus".

Example: python Gene_Recognize.py -i ./gene_example.txt -o ./ViMRT_gene/ -v HBV

*Note: we recommend selecting one virus name, because the default parameter matches the genes of all viruses in turn, which takes a long time to run.
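The PMID+"|pmid|"+sentence input format can be written and read back with a few lines of Python. This is a minimal sketch with hypothetical helper names; the PMID and sentence are placeholders.

```python
SEPARATOR = "|pmid|"


def format_sentence_line(pmid, sentence):
    """Build one input line: PMID + "|pmid|" + sentence."""
    return f"{pmid}{SEPARATOR}{sentence}"


def parse_sentence_line(line):
    """Split one input line back into (pmid, sentence)."""
    pmid, sep, sentence = line.rstrip("\n").partition(SEPARATOR)
    if not sep or not pmid.strip():
        raise ValueError(f"line does not match PMID|pmid|sentence: {line!r}")
    return pmid.strip(), sentence.strip()


line = format_sentence_line("12345678", "The G145R mutation lies in the S gene.")
print(line)
print(parse_sentence_line(line))
```

Using partition rather than split keeps any later "|" characters inside the sentence intact.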

Disease recognition

To identify diseases in virus mutation sentences, ViMRT first needs Stanza, a Python NLP library for many human languages. For Stanza usage, refer to its GitHub website.

Installing stanza

pip install stanza

Downloading stanza disease models

If users are running the stanza pipeline for the first time, they need to download stanza disease models:
import stanza
stanza.download('en', package='mimic', processors={'ner': 'bc5cdr'}, verbose=False)
stanza.download('en', package='mimic', processors={'ner': 'ncbi_disease'}, verbose=False)

Identifying disease

Disease corpus

ViMRT has built a disease corpus from the CTD database and uses a Python script to optimize the results of stanza. Users can add their own disease names to the disease_vocabulary.txt file, and add new errors to the disease_vocabulary_error.txt file in the diseasecorpus folder to remove identification errors made by stanza.
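The idea of filtering stanza output against an error list can be sketched as follows. The function extract_diseases is hypothetical and is not the logic of Disease_Recognize.py; it assumes the NER model labels disease entities with the type "DISEASE". A stand-in pipeline is used so that the example runs without downloading the models, and the example sentence is made up.

```python
from types import SimpleNamespace


def extract_diseases(nlp, sentence, error_terms=()):
    """Return disease mentions from an NER pipeline, minus known false positives."""
    blocked = {term.lower() for term in error_terms}
    doc = nlp(sentence)
    return [
        ent.text
        for ent in doc.ents
        if ent.type == "DISEASE" and ent.text.lower() not in blocked
    ]


def fake_nlp(sentence):
    """Stand-in for a stanza pipeline; returns an object with an .ents list."""
    candidates = [
        SimpleNamespace(text="hepatocellular carcinoma", type="DISEASE"),
        SimpleNamespace(text="HBV", type="DISEASE"),  # pretend false positive
    ]
    return SimpleNamespace(ents=[e for e in candidates if e.text in sentence])


sentence = "The C1653T mutation was studied in hepatocellular carcinoma and HBV infection."
print(extract_diseases(fake_nlp, sentence, error_terms=["HBV"]))

# With the real models downloaded, the pipeline would be created as:
#   import stanza
#   nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'bc5cdr'})
```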

Identifying and optimizing disease

python Disease_Recognize.py -i [input] -o [output]

input: the user can provide input file (disease_example.txt). The input file format: PMID+"|pmid|"+sentence.
output: the user can provide output folder path. The default path is the current path.

Example: python Disease_Recognize.py -i ./disease_example.txt -o ./ViMRT_disease/

ViMRT web search engine

Overview


ViMRT provides an easy-to-use graphical web interface for quickly and accurately searching virus mutation-related information, including virus genes and related diseases.

Homepage

The ViMRT homepage provides navigation for the search engine: the modules are listed in the upper right panel, and two main features help users browse and search the data. In the centre select box, users can enter a virus mutation, virus gene or related disease of interest to retrieve detailed information.



Virus mutation information

The “Virus mutation information” page provides the best-matching entity information for the query. Take the entry “HBV-M:C1653T” in HBV as an example. After the user chooses “HBV-M:C1653T”, ViMRT shows a list of publications containing the C1653T mutation, ordered by publication date, with the queried entity highlighted on a red background. Other co-occurring entities in the same sentence are shown in different colors: genes (green text), diseases (purple text) and other mutations (red text). Click ‘More Details’ to show more mutation sentences for the selected entity. The drop-down menu on the upper left lets users retrieve co-occurrence sentences for different entities. The drop-down menu on the lower left filters the publications matching the query by "Year", "Position" and "PMID". The "Download the files" button on the upper right lets the user download the search data files, including mutation information.



Co-occurrence relationship

The “Co-occurrence relationship” page shows sentence-based co-occurrence correlations between mutations and diseases for 7 viruses. Users first click the "To search the relationship" button on the homepage; by default, the site returns the co-occurrence correlations for HBV as a histogram and a table. The histogram shows the scores of the top 10 co-occurring mutations. The table lists the mutation, gene, disease, score and significance (Fisher's test; ***, p < 0.001; **, p < 0.01; *, p < 0.05; ns, not significant). Click "View" to get detailed literature information for the chosen mutation. Click a virus name in the left menu to view the relationship information for that virus.
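The significance labels in the table follow fixed p-value thresholds. A minimal sketch of that mapping, using only the thresholds listed above (the function name is hypothetical):

```python
def significance_label(p_value):
    """Map a p-value to the star notation used in the co-occurrence table."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value must be between 0 and 1, got {p_value}")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"  # not significant


for p in (0.0004, 0.003, 0.02, 0.3):
    print(p, significance_label(p))
```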



Dataset Download

On the “Dataset” page, users can download ViMRT-related tools and the literature files of 7 viruses. Users can also directly download the mutation data files of the 7 viruses in txt format; the records include pmid, title, sentence, year, journal, region, mutation, gene, etc.



Copyright © 2022.Tongji University, Life Science and Technology Department, Xiaoyan Zhang Lab.

Contact us: xyzhang@tongji.edu.cn.