James Borden
Nole Lin
In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2,000 abstracts reviewed in the Sysrev Gene Hunter project. The first part of the series describes how users can load and process data for training with the spaCy library (spacy.io).
In this notebook we:
The Gene Hunter project was an open online review of 2,000 PubMed abstracts, in which 15 reviewers highlighted genes in the text. Sysrev data is accessible using the Sysrev Python client, PySysrev.
PySysrev is a Python client written for sysrev.com. It is built in Python 2.7 and depends on the Python packages spacy, pandas, requests, plac, and pathlib. If you have these dependencies, you can install PySysrev with:
> pip install PySysrev
This notebook is available at github.com/sysrev/sysrev-examples under NERGenes_Processing.ipynb with a minimal working conda environment.
PySysrev provides an API call to download data into a shape spaCy can handle.
Let's look at the data in the Gene Hunter project. The Gene Hunter project has the project_id 3144, which is all we need to get data from the PySysrev.getAnnotations API call.
import PySysrev
PySysrev.getAnnotations(project_id=3144).head(5)
In the above DataFrame we see user annotations of PubMed abstracts. This annotation process involves (1) highlighting words, (2) assigning a semantic_class (gene in this case) and (3) assigning a text annotation to the selected text. Each column is described below:
You can see this annotation workflow in our YouTube video.
We can see the different genes (under the column "selection") identified in the annotation column. The start and end indices indicate where in the text each gene name can be found.
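To make the role of the start and end indices concrete, here is a minimal sketch that slices a text with them. The record below is invented for illustration (it is not taken from the Gene Hunter data), and it assumes the end index is exclusive, as in ordinary Python slicing; check this against real rows before relying on it.

```python
# Hypothetical annotation record: the keys mirror the column names discussed
# above, but the text and values are invented for illustration.
text = "Mutations in BRCA1 increase the risk of breast cancer."
annotation = {"selection": "BRCA1", "start": 13, "end": 18, "semantic_class": "gene"}

# Assumption: "end" is exclusive, so a plain slice recovers the highlighted text.
recovered = text[annotation["start"]:annotation["end"]]
print(recovered)

# A mismatch here would suggest the offsets follow a different convention.
print(recovered == annotation["selection"])
```

If the comparison fails on real data, the offsets may be inclusive or counted against a different version of the text.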
Now, we'll call PySysrev.processAnnotations to get Gene Hunter data from Sysrev in a format directly usable by spaCy. The project id is 3144, and the entity we want is GENE.
processed_output = PySysrev.processAnnotations(project_id=3144, label='GENE')
Let's take a look at the processed JSON. Read into Python, the JSON data becomes a list of lists. For each individual list, we get:
The "entities" value in each entry's dictionary is a list of lists, with one inner list per annotated gene. Each inner list contains the start index of the annotation, the end index of the annotation, and the semantic_class of the annotation.
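# Unpack the first entry: the abstract text and its annotation dictionary
(text, jobj) = processed_output[0]
print("text: {}...".format(text[0:197]))
# Each entity is [start index, end index, semantic_class]
for entity in jobj["entities"]:
    print("start:{},\tend:{},\tsemantic_class:{}".format(entity[0], entity[1], entity[2]))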
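Before training, it can help to confirm that each entity's offsets actually fall inside its text. The sketch below is one way to do that; extract_spans is a hypothetical helper (not part of PySysrev or spaCy), the example entry is invented, and the end index is again assumed to be exclusive.

```python
def extract_spans(text, entities):
    """Return (surface_text, label) pairs, skipping offsets outside the text.

    Hypothetical helper for illustration; not part of PySysrev or spaCy.
    """
    spans = []
    for start, end, label in entities:
        # Keep only offsets that describe a non-empty slice within the text
        if 0 <= start < end <= len(text):
            spans.append((text[start:end], label))
    return spans


# Invented entry in the same (text, {"entities": [...]}) shape as above;
# the third entity is deliberately out of range.
example = (
    "TP53 and BRCA1 are tumor suppressor genes.",
    {"entities": [[0, 4, "GENE"], [9, 14, "GENE"], [40, 99, "GENE"]]},
)
example_text, example_jobj = example
spans = extract_spans(example_text, example_jobj["entities"])
print(spans)
```

Running the same helper over every entry of processed_output gives a quick view of which strings the model will be trained to recognize.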
Now that we have our "processed_output.json" file, we are ready to input it into spaCy for training. This step will be detailed in the next post.
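The calls shown in this post do not make clear where processed_output.json is written. If you need to save the list yourself, a minimal sketch with the standard json module could look like the following; the output location (the system temporary directory) and the sample entry are assumptions made for illustration.

```python
import json
import os
import tempfile

# Stand-in for the list returned by processAnnotations (values invented)
processed_output = [("TP53 is a gene.", {"entities": [[0, 4, "GENE"]]})]

# Assumed output location; use any path that suits your project
out_path = os.path.join(tempfile.gettempdir(), "processed_output.json")
with open(out_path, "w") as f:
    json.dump(processed_output, f)

# Read the file back; JSON has no tuple type, so each entry returns as a list
with open(out_path) as f:
    reloaded = json.load(f)

print(reloaded[0][0])
print(reloaded[0][1]["entities"])
```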