Gene NER using PySysrev and Human Review (Part I)

James Borden

Nole Lin

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2,000 abstracts reviewed in the Sysrev Gene Hunter project. This first part of the series describes how to load and process data for training with the spaCy library.

In this notebook we:

  1. Install the PySysrev package - github.com/sysrev/PySysrev
  2. Download Gene Annotations from the sysrev.com Gene Hunter project - sysrev.com/p/3144
  3. Format downloaded annotations to feed into spaCy - https://spacy.io/

The Gene Hunter project was an open online review of 2,000 PubMed abstracts in which 15 reviewers highlighted genes in the text. Sysrev data is accessible through PySysrev, the Sysrev Python client.

Install PySysrev

PySysrev is a Python client written for sysrev.com. It is built in Python 2.7 and depends on the Python packages spacy, pandas, requests, plac, and pathlib. If you have these dependencies, you can install PySysrev with:

> pip install PySysrev

This notebook is available at github.com/sysrev/sysrev-examples under NERGenes_Processing.ipynb with a minimal working conda environment.

Download Gene Annotations

PySysrev provides an API call to download data into a shape spaCy can handle. Let's look at the data in the Gene Hunter project. Its project_id is 3144, which is all we need to get data from the PySysrev.getAnnotations API call.

In [6]:
import PySysrev
PySysrev.getAnnotations(project_id=3144).head(5)
Out[6]:
   annotation    datasource  end    external_id  selection     semantic_class  start  sysrev_id  text
0  α-KGDH        pubmed      286.0  29211711     α-KGDH        gene            280.0  1524023    Histone modifications, such as the frequently ...
1  KAT2A         pubmed      391.0  29211711     KAT2A         gene            386.0  1524023    Histone modifications, such as the frequently ...
2  GCN5          pubmed      411.0  29211711     GCN5          gene            407.0  1524023    Histone modifications, such as the frequently ...
3  succinyl-CoA  pubmed      493.0  29211711     succinyl-CoA  gene            481.0  1524023    Histone modifications, such as the frequently ...
4  KAT2A         pubmed      509.0  29211711     KAT2A         gene            504.0  1524023    Histone modifications, such as the frequently ...

In the above DataFrame we see user annotations of PubMed abstracts. The annotation process involves (1) highlighting words, (2) assigning a semantic_class (gene in this case), and (3) assigning a text annotation to the selected text. Each column is described below:

  1. annotation: user supplied value for some highlighted text.
  2. datasource: the source of the annotated object.
  3. end: the character index of the end of highlighted text.
  4. external_id: the datasource identifier for the annotated object (the PubMed id in this case, e.g. 29211711).
  5. selection: the highlighted text.
  6. semantic_class: the assigned 'type' of the highlighted text (gene in this case).
  7. start: the character index of the start of the highlighted text.
  8. sysrev_id: a sysrev.com identifier for the annotated object.
  9. text: the full text from the annotated object.

You can see this annotation workflow in our YouTube video.

We can see the genes that reviewers highlighted in the selection column, along with the value assigned to each highlight in the annotation column. The start and end indices indicate where in the text the gene name can be found.
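To make the role of start and end concrete, here is a minimal sketch using a made-up record (not taken from the Gene Hunter data) with the same column names as the DataFrame above. It assumes the indices are character offsets with an exclusive end, which is consistent with the sample rows (280 to 286 for the six-character α-KGDH). Because the indices arrive as floats, the hypothetical helper highlighted_span casts them to int before slicing.

```python
# Hypothetical record mirroring the columns returned by getAnnotations.
# The text and offsets are invented for illustration only.
record = {
    "text": "Mutations in BRCA1 increase cancer risk.",
    "selection": "BRCA1",
    "semantic_class": "gene",
    "start": 13.0,  # floats, as in the DataFrame output
    "end": 18.0,
}


def highlighted_span(row):
    """Return the substring of row['text'] between start and end (end exclusive)."""
    start, end = int(row["start"]), int(row["end"])
    return row["text"][start:end]


span = highlighted_span(record)
print(span)
```

If the assumption about offsets holds, the recovered span should equal the value in the selection column, which makes this a cheap consistency check on downloaded annotations.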

Format for spaCy

Now we'll call PySysrev.processAnnotations to get the Gene Hunter data from Sysrev in a format directly usable by spaCy. The project id is 3144, and the entity label we want is GENE.

In [4]:
processed_output = PySysrev.processAnnotations(project_id=3144, label='GENE')

Let's take a look at the processed JSON. Read into Python, the data becomes a list of lists. Each inner list contains:

  1. The text as the first element (string).
  2. The entities of the text as the second element (dictionary).

The dictionary for each entry holds, under the key "entities", a list of lists with one entry per annotated gene. Each entry contains the start index of an annotation, the end index of an annotation, and the semantic_class of the annotation.
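The following sketch shows one way annotation rows could be grouped into the structure described above. It is an illustration of the data shape only, not PySysrev's actual implementation: the function name to_spacy_format and the sample rows are hypothetical, and the rows are plain dictionaries standing in for DataFrame records.

```python
from collections import OrderedDict


def to_spacy_format(rows, label):
    """Group annotation rows by text into [text, {"entities": [...]}] pairs."""
    grouped = OrderedDict()
    for row in rows:
        # One [start, end, label] entry per annotated gene.
        entity = [int(row["start"]), int(row["end"]), label]
        grouped.setdefault(row["text"], []).append(entity)
    return [[text, {"entities": entities}] for text, entities in grouped.items()]


# Invented rows for illustration; real rows come from the annotations DataFrame.
rows = [
    {"text": "BRCA1 and TP53 are tumor suppressors.", "start": 0.0, "end": 5.0},
    {"text": "BRCA1 and TP53 are tumor suppressors.", "start": 10.0, "end": 14.0},
    {"text": "KRAS is frequently mutated.", "start": 0.0, "end": 4.0},
]

training_data = to_spacy_format(rows, label="GENE")
for text, annotations in training_data:
    print(text, annotations["entities"])
```

Grouping by text means each abstract appears once, with all of its gene spans collected in a single entities list.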

In [37]:
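# Unpack the first entry: the abstract text and its annotation dictionary.
(text, jobj) = processed_output[0]

# Show only the beginning of the abstract.
print("text: {}...".format(text[0:197]))

# Each entity is [start index, end index, semantic_class].
for entity in jobj["entities"]:
    print("start:{},\tend:{},\tsemantic_class:{}".format(entity[0],entity[1],entity[2]))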
text: BACKGROUND: Olaparib is an oral poly(adenosine diphosphate-ribose) polymerase inhibitor that has promising antitumor activity in patients with metastatic breast cancer and a germline BRCA mutation....
start:183,	end:187,	semantic_class:GENE
start:1726,	end:1730,	semantic_class:GENE
start:354,	end:358,	semantic_class:GENE

Now that we have our "processed_output.json" file, we are ready to input it into spaCy for training. This step will be detailed in the next post.
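Before moving on, a quick sanity check on the spans can catch problems early. The sketch below assumes that training expects every span to lie inside its text and that spans within one text should not overlap; the helper name check_entities and the sample data are hypothetical. Note that the entities printed above are not sorted by start index, so the helper sorts them before comparing neighbours.

```python
def check_entities(text, entities):
    """Return a list of problems found in [start, end, label] spans for text."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        # A valid span is non-empty and lies fully inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append("out of bounds: {}".format((start, end, label)))
            continue
        # After sorting, a span overlaps if it starts before an earlier one ends.
        if start < previous_end:
            problems.append("overlaps earlier span: {}".format((start, end, label)))
        previous_end = max(previous_end, end)
    return problems


# Invented data for illustration only.
text = "BRCA1 and TP53 are tumor suppressors."
good = [[10, 14, "GENE"], [0, 5, "GENE"]]
bad = [[0, 5, "GENE"], [3, 8, "GENE"], [30, 99, "GENE"]]

print(check_entities(text, good))
print(check_entities(text, bad))
```

An empty list means no issues were found under these assumptions; anything else points at the spans worth inspecting by hand.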