Analyzing INCEpTION annotations using DKPro cassis

By Peter Boot

In the Impact & Fiction project, we are currently using INCEpTION to annotate book reviews. INCEpTION is an impressive toolset that lets users define annotation types of many different kinds and apply them to texts. We are using it to apply various tags to online book reviews, in order to get an overview of the kinds of impact that reviews express.

INCEpTION also lets users export annotations for further processing, and its documentation suggests using DKPro cassis for that purpose. DKPro cassis is a Python library that helps you analyze annotated texts, provided they conform to the UIMA framework. When I set out to do this, however, I found that the process of working with exported annotations in DKPro cassis is not very well documented. So what follows is a piece of documented Python code that shows one way of accessing your annotations. Once you have access to them, you can start to compute statistics, measure annotator agreement, or even set up a machine learning pipeline that automates the annotation process.
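If you want to follow along: DKPro cassis is distributed on PyPI as dkpro-cassis, so (assuming a reasonably recent Python environment) it can be installed with:

pip install dkpro-cassis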

From INCEpTION, you export data by selecting Export in the Settings menu. The following code assumes that you selected UIMA CAS XMI (XML 1.1) as the secondary output format. The export produces a zip file. The ‘annotation’ folder inside that zip file contains a folder for each document, and each of these folders contains a zip file per annotator. The code below assumes that the annotation folder has been unzipped to some location accessible to your Python script.
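For concreteness, the unzipped annotation folder will look roughly like this (the names below are made-up examples; in a real export the zip file names are gibberish strings generated by INCEpTION):

annotation/
    document-1.txt/
        webanno10148686510395989892export.zip
        webanno11353453410395981234export.zip
    document-2.txt/
        ...

Each zip file inside a document folder holds one annotator's annotations.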

from cassis import *
import os
import zipfile

# this is the 'annotation' subfolder of the unzipped export file
anndir = '/path/to/annotation'   # replace with your own location

# the folder contains a subfolder for each document in INCEpTION
# we loop over these subfolders (here limited to the first 10)
for docdir in os.listdir(anndir)[:10]:
    print(docdir,'\n')
    fulldocdir = os.path.join(anndir,docdir)

# Each document subfolder contains a zip file per annotator.
# The name of the zip file is gibberish (like webanno10148686510395989892export.zip).
# First we loop over these zip files to collect all annotations on the document.
# We collect the annotations in the annots dictionary, grouping them into
# sets by their start position.

    annots = {}
    for f in os.listdir(fulldocdir):
        file = os.path.join(fulldocdir, f)
        z = zipfile.ZipFile(file)
        names = z.namelist()

# Each zip file contains a file TypeSystem.xml (identical in
# all zip files, I assume) and a file called <annotator-name>.xmi,
# which contains a CAS object holding that annotator's annotations on the document.
# The content of a zip file's member files, when read, is a bytestring.
# It must be decoded into a unicode string before we can process it.

        ts = load_typesystem(z.open('TypeSystem.xml').read().decode())
        xmi = names[0] if names[0] != 'TypeSystem.xml' else names[1]  # get the name of the other file in the zip file
        annotator = xmi.split('.')[0]                                 # annotator name derived from the filename
        cas = load_cas_from_xmi(z.open(xmi).read().decode(), typesystem=ts)

# We loop over annotation types defined in the type system:
        
        for t in ts.get_types():                        
            if 'custom' in t.name: # we are only interested in the custom annotation types

# We loop over the annotations of that type:
        
                for ann in cas.select(t.name):
                    if ann.begin not in annots:
                        annots[ann.begin] = set()          # empty set to start with
                    annots[ann.begin].add((ann,annotator)) # store annotation and annotator name 

# In the next step we process the annotations (in this case, by just printing them).
# Based on the last annotator's CAS object (we could have used any of them),
# we list the tokens in the document. If there are annotations starting
# at a token's position, we print each annotation's type, annotator,
# covered text, features and tokens.
                    
    token_type = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'
    for token in cas.select(token_type):
        print(token.xmiID, token.begin, token.get_covered_text())
        if token.begin in annots:
            for ann, annotator in annots[token.begin]:
                print(ann.type.name, annotator)
                print(ann.get_covered_text())
                for f in ann.type.features:
                    print('ann feature', f.name, ann[f.name])

# strictly speaking we are not using the correct CAS object here
# (it belongs to whichever annotator we processed last), but the
# tokens are identical in all of them:

                for covered_token in cas.select_covered(token_type, ann):
                    print('ann token', covered_token.xmiID, covered_token.get_covered_text())

This may not be the most efficient way of accessing the annotations. I should also note that all our annotations are ‘span’ type annotations (not chains or relations), defined with token-level granularity. This code was run using INCEpTION version 22.4 and DKPro cassis version 0.7.
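As a small example of the kind of statistic mentioned at the start, here is a minimal sketch, assuming the annots dictionary built above (for a single document) is still in scope, that counts annotations per annotator and per annotation type:

from collections import Counter

counts = Counter()
for anns in annots.values():          # annots maps start position -> set of (annotation, annotator)
    for ann, annotator in anns:
        counts[(annotator, ann.type.name)] += 1

for (annotator, typename), n in sorted(counts.items()):
    print(annotator, typename, n)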

I hope this helps someone.

2 comments

  1. Thank you so much for this post. I also found it difficult to deal with and find information about the UIMA framework and DKPro cassis. Do you happen to know of any other resources that describe how to create a pandas dataframe from these annotations?

  2. No, I’m afraid I don’t know of any. The code above was the result of a lot of guesswork starting from the DKPro cassis documentation.
