Gene NER using PySysrev and Human Review (Part I)

James Borden

Nole Lin

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes, using data from 2000 abstracts reviewed in the Sysrev Gene Hunter project. This first part describes how to load and process the data for training with the spaCy library.

In this notebook we:

Install the PySysrev package - github.com/sysrev/PySysrev
Download gene annotations from the sysrev.com Gene Hunter project - sysrev.com/p/3144
Format the downloaded annotations to feed into spaCy - https://spacy.io/

The Gene Hunter project was an open online review of 2000 PubMed abstracts in which 15 reviewers highlighted genes in the text. Sysrev data is accessible through PySysrev, the Sysrev Python client.

Install PySysrev

PySysrev is a Python client for sysrev.com. It is built for Python 2.7 and depends on the Python packages spacy, pandas, requests, plac, and pathlib. With these dependencies in place, you can install PySysrev with:

> pip install PySysrev

This notebook is available at github.com/sysrev/sysrev-examples under NERGenes_Processing.ipynb with a minimal working conda environment.

Download Gene Annotations

PySysrev provides an API call to download data in a shape spaCy can handle. First, let's look at the data in the Gene Hunter project. The project has the project_id 3144, which is all we need to fetch its annotations with the PySysrev.getAnnotations API call.

import PySysrev
PySysrev.getAnnotations(project_id=3144).head(5)

The resulting DataFrame holds user annotations of PubMed abstracts. The annotation process involves (1) highlighting words, (2) assigning a semantic_class (gene in this case), and (3) assigning a text annotation to the selected text. Each column is described below:

annotation: user supplied value for some highlighted text.
datasource: the source of the annotated object.
end: the character index of the end of highlighted text.
external_id: the datasource identifier for the annotated object (the PubMed ID in this case, e.g. 29211711).
selection: the highlighted text.
semantic_class: the assigned 'type' of the highlighted text (gene in this case).
start: the character index of the start of the highlighted text.
sysrev_id: a sysrev.com identifier for the annotated object.
text: the full text from the annotated object.

You can see this annotation workflow in our YouTube video.

In the DataFrame, the selection column shows the gene names that reviewers highlighted, and the annotation column holds the value they supplied for each. The start and end indices indicate where in the text each gene name can be found.
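The relationship between start, end, selection, and text can be checked with a plain string slice. In the sample rows, end minus start equals the length of the selection (for example 286 - 280 = 6 for α-KGDH), which suggests that end is exclusive. The sketch below assumes exactly that; the record and the helper names highlighted_text and offsets_match are hypothetical and not part of PySysrev.

```python
# Hypothetical record mimicking one row of the annotations DataFrame.
# The text and offsets are made up for illustration; they are not project data.
record = {
    "text": "Mutations in BRCA1 increase cancer risk.",
    "selection": "BRCA1",
    "start": 13.0,  # offsets appear as floats in the DataFrame output
    "end": 18.0,
}


def highlighted_text(row):
    """Slice the full text with the row's start/end character offsets."""
    # Floats cannot be used as slice indices, so cast them to int first.
    start, end = int(row["start"]), int(row["end"])
    return row["text"][start:end]


def offsets_match(row):
    """Return True when the slice reproduces the stored selection."""
    return highlighted_text(row) == row["selection"]


print(highlighted_text(record))
print(offsets_match(record))
```

A check like this is a cheap way to spot rows whose offsets have drifted from the text before they are used for training.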

Format for spaCy

Now we'll call PySysrev.processAnnotations to get the Gene Hunter data from Sysrev in a format directly usable by spaCy. The project id is 3144, and the entity label we want is GENE.

processed_output = PySysrev.processAnnotations(project_id=3144, label='GENE')

Let's take a look at the processed JSON. Read into Python, the data becomes a list of lists. Each inner list contains:

The text as the first element (string).
The annotations for that text as the second element (dictionary).

Under its "entities" key, the dictionary holds a list of lists with one entry per annotated gene. Each entry contains the start index, the end index, and the semantic_class of the annotation.

(text,jobj) = processed_output[0]

print("text: {}...".format(text[0:197]))

for entity in jobj["entities"]:
    print("start:{},\tend:{},\tsemantic_class:{}".format(entity[0],entity[1],entity[2]))

text: BACKGROUND: Olaparib is an oral poly(adenosine diphosphate-ribose) polymerase inhibitor that has promising antitumor activity in patients with metastatic breast cancer and a germline BRCA mutation....
start:183,	end:187,	semantic_class:GENE
start:1726,	end:1730,	semantic_class:GENE
start:354,	end:358,	semantic_class:GENE
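To connect the two formats seen so far, here is a minimal sketch of how flat annotation rows could be grouped into the (text, {"entities": [...]}) structure printed above. This is not PySysrev's actual implementation: the function name rows_to_spacy_format and the sample rows are hypothetical, and sorting the spans is a choice made for this sketch (the output above is not sorted).

```python
from collections import OrderedDict


def rows_to_spacy_format(rows, label):
    """Group flat annotation rows by text into (text, {"entities": [...]}) pairs."""
    grouped = OrderedDict()
    for row in rows:
        # One [start, end, label] entry per annotation, with integer offsets.
        span = [int(row["start"]), int(row["end"]), label]
        grouped.setdefault(row["text"], []).append(span)
    # Sorting the spans is optional; it only makes the result easier to read.
    return [(text, {"entities": sorted(spans)}) for text, spans in grouped.items()]


# Made-up rows for illustration only (not taken from the Gene Hunter project).
rows = [
    {"text": "TP53 and MDM2 interact.", "start": 9.0, "end": 13.0},
    {"text": "TP53 and MDM2 interact.", "start": 0.0, "end": 4.0},
    {"text": "EGFR is often amplified.", "start": 0.0, "end": 4.0},
]

training_data = rows_to_spacy_format(rows, label="GENE")
for text, annotations in training_data:
    print(text, annotations["entities"])
```

Grouping by text keeps all annotations for one abstract in a single entry, which matches the one-text, many-entities shape shown in the printed output.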

Now that we have our "processed_output.json" file, we are ready to input it into spaCy for training. This step will be detailed in the next post.
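The notebook does not show how the "processed_output.json" file is written. As a sketch using only Python's standard json module, and assuming processed_output is a list of (text, annotations) pairs as unpacked earlier, saving and reloading could look like this. The sample entry is made up, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Made-up example in the same shape as processed_output (not project data).
processed_example = [
    ("TP53 and MDM2 interact.", {"entities": [[0, 4, "GENE"], [9, 13, "GENE"]]}),
]

# Write to a temporary directory; the file name mirrors the one mentioned above.
path = os.path.join(tempfile.mkdtemp(), "processed_output.json")
with open(path, "w") as handle:
    json.dump(processed_example, handle)

with open(path) as handle:
    reloaded = json.load(handle)

# JSON has no tuple type, so each (text, annotations) pair comes back as a list.
text, annotations = reloaded[0]
print(type(reloaded[0]).__name__)
print(annotations["entities"])
```

The round trip is why the data read back from the file is a list of lists even if it was built from tuples.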

Output of PySysrev.getAnnotations(project_id=3144).head(5):

   annotation    datasource  end    external_id  selection     semantic_class  start  sysrev_id  text
0  α-KGDH        pubmed      286.0  29211711     α-KGDH        gene            280.0  1524023    Histone modifications, such as the frequently ...
1  KAT2A         pubmed      391.0  29211711     KAT2A         gene            386.0  1524023    Histone modifications, such as the frequently ...
2  GCN5          pubmed      411.0  29211711     GCN5          gene            407.0  1524023    Histone modifications, such as the frequently ...
3  succinyl-CoA  pubmed      493.0  29211711     succinyl-CoA  gene            481.0  1524023    Histone modifications, such as the frequently ...
4  KAT2A         pubmed      509.0  29211711     KAT2A         gene            504.0  1524023    Histone modifications, such as the frequently ...