Plugins¶

Indigo uses plugins to allow specific functionality to be customised for different countries, localities and languages. For example, extracting a Table of Contents and automatically linking references both use the plugin system. This means they can be adjusted to suit different languages and reference styles.

Locales¶

Most plugins are locale-aware. That is, Indigo looks for a plugin implementation that best matches the locale of a work or a document.

The locale is a tuple of strings (country, language, locality), such as ('za', 'eng', None), where None is a wildcard that will match anything. The tuple describes the locales to which the plugin applies. In this example, the plugin applies to any work in South Africa (za) with an English (eng) expression, and will match on any locality within South Africa (the last None item).

If there are multiple plugins with locales that match a document or work, Indigo will use the one that most specifically matches it (i.e. the one with the fewest wildcards).
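The matching rule described above can be sketched in plain Python. This is an illustration only, not Indigo's implementation: the function names match_score and best_match, and the locality code 'cpt', are assumptions made for the example.

```python
def match_score(plugin_locale, target_locale):
    """Count the exact matches, or return None if the locales conflict."""
    score = 0
    for wanted, actual in zip(plugin_locale, target_locale):
        if wanted is None:
            continue  # wildcard: matches anything and adds no specificity
        if wanted != actual:
            return None  # a concrete value that differs rules this plugin out
        score += 1
    return score


def best_match(candidates, target_locale):
    """Pick the matching candidate locale that has the fewest wildcards."""
    scored = [(match_score(c, target_locale), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


candidates = [
    (None, None, None),
    ('za', None, None),
    ('za', 'eng', None),
    ('za', 'afr', None),
]
# 'cpt' is a hypothetical locality code used only for this example
chosen = best_match(candidates, ('za', 'eng', 'cpt'))
```

Here the Afrikaans plugin is excluded because its language conflicts, and the remaining candidates are ranked by how many concrete values they specify.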

Plugin Registry¶

Plugins register themselves with the plugin registry for a certain topic. The following plugin topics are understood by Indigo:

importer plugins import text from documents and mark it up with Akoma Ntoso. Usually extend indigo_api.importers.base.Importer.
publications plugins provide publication documents for works. Usually extend indigo.analysis.publications.base.BasePublicationFinder.
refs plugins automatically identify and mark up references between works in the text of a document. Usually extend indigo.analysis.refs.base.BaseRefsFinder.
terms plugins automatically identify and mark up defined terms in document markup. Usually extend indigo.analysis.terms.base.BaseTermsFinder.
toc plugins return a Table of Contents from document markup. Usually extend indigo.analysis.toc.base.TOCBuilderBase.
work-detail plugins return tradition-specific information for a work, such as numbered titles. Usually extend indigo.analysis.work_detail.base.BaseWorkDetail.

Register a plugin using plugins.register(topic) and include a locale that describes which locales your plugin is specific to:

from indigo.analysis.work_detail.base import BaseWorkDetail
from indigo.plugins import plugins


@plugins.register('work-detail')
class CustomisedWorkDetail(BaseWorkDetail):
    locale = ('za', 'afr', None)
    ...

Fetching a Plugin¶

You can fetch a plugin for a work or a document using for_work(), for_document(), or for_locale() on the plugin registry, giving it a plugin topic and a work, document or locale:

from indigo.plugins import plugins

toc_builder = plugins.for_document('toc', document)
if toc_builder:
    toc_builder.table_of_contents_for_document(document)

Custom Tasks¶

You can also create custom tasks using the plugin system. Custom tasks can provide specific URLs for performing the task, control who can close a task, etc.

Indigo recognises a custom task using the Task.code attribute on the task. This is an arbitrary string value which you provide when you register your custom task with the registry.

Like plugins, tasks are also locale-specific so you can provide locale-dependent implementations. More than one custom task can be registered for the same task code. Indigo will use the implementation with the closest locale match.

Register your task with the task system like this:

from indigo.custom_tasks import CustomTask, tasks

@tasks.register('my-custom-code')
class MyCustomTask(CustomTask):
    locale = (None, None, None)

    def setup(self, task):
        self.task = task

When Indigo sees a task with a task code attribute, it will look up the custom task from the registry, create an instance, and call setup(task) with the task instance.
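The lookup, instantiate and setup sequence can be sketched with a small stand-in registry. Everything here is hypothetical and exists only to show the flow: SketchTaskRegistry, its for_task method and FakeTask are not part of Indigo, and the sketch ignores locale matching entirely.

```python
class SketchTaskRegistry:
    """Hypothetical stand-in for Indigo's task registry; not the real class."""

    def __init__(self):
        self.implementations = {}

    def register(self, code):
        # class decorator that files the class under the given task code
        def decorator(cls):
            self.implementations[code] = cls
            return cls
        return decorator

    def for_task(self, task):
        # look up by task code, create an instance, then hand it the task
        cls = self.implementations.get(task.code)
        if cls is None:
            return None
        handler = cls()
        handler.setup(task)
        return handler


sketch_tasks = SketchTaskRegistry()


@sketch_tasks.register('my-custom-code')
class MyCustomTask:
    locale = (None, None, None)

    def setup(self, task):
        self.task = task


class FakeTask:
    """Minimal object with a code attribute, standing in for a real Task."""

    def __init__(self, code):
        self.code = code


task = FakeTask('my-custom-code')
handler = sketch_tasks.for_task(task)
plain = sketch_tasks.for_task(FakeTask(None))  # no code, so no custom task
```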

Loading Plugins and Custom Tasks¶

It’s common to place your plugins in plugins.py and custom tasks in custom_tasks.py in your project directory. Then load those files in your Django apps.py when Django calls your app’s ready() method:

from django.apps import AppConfig


class MyAppConfig(AppConfig):
    name = 'my_app'

    def ready(self):
        # ensure our plugins are pulled in
        import my_app.plugins
        import my_app.custom_tasks

Plugin API Reference¶

class indigo_api.importers.base.Importer¶

Import from PDF and other document types using Slaw.

Slaw is a command-line tool from the slaw Ruby Gem which generates Akoma Ntoso from PDF and other documents. See https://rubygems.org/gems/slaw

analyse_after_import(doc)¶: Run analysis after import. Usually only used on PDF documents.

create_from_docx(docx_file, doc)¶: Import from a docx file. A mammoth image handler stashes the binary data of each image and returns an appropriate img attribute to be put into the HTML (and eventually the XML). Once the document is created, attachments are created from the stashed image data and given appropriate filenames.

create_from_pdf(upload, doc)¶: Import from a PDF upload.

create_from_upload(upload, doc, request)¶: Create a new Document by importing it from a django.core.files.uploadedfile.UploadedFile instance.

cropbox = None¶: Crop box to import within, as [left, top, width, height]

expand_ligatures(text)¶: Replace ligatures with separate characters, e.g. ﬁ -> fi.

fragment = None¶: The name of the AKN element that we’re importing, or None for a full act.

fragment_id_prefix = None¶: The prefix for all ids generated for this fragment

import_from_text(input, frbr_uri, suffix=u'')¶: Create a new Document by importing it from plain text.

locale = (None, None, None)¶: Locale for this analyzer, as a tuple: (country, language, locality). None matches anything.

reformat = False¶: Should we tell Slaw to reformat before parsing? Only do this with initial imports.

reformat_text(text)¶: Clean up extracted text before giving it to Slaw.

section_number_position = u'before-title'¶: By default, where do section numbers usually lie in relation to their title? One of: before-title, after-title or guess.

slaw_grammar = u'za'¶: Slaw grammar to use

tempfile_for_upload(upload)¶: Uploaded files might not be on disk, ensure it is by creating a temporary file.

use_ascii = True¶: Should we pass --ascii to slaw? This can have significant performance benefits for large files. See https://github.com/cjheath/treetop/issues/31
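As an illustration of what expand_ligatures does to extracted text, here is a minimal standalone sketch. The mapping below is an assumption covering the common Latin ligatures; it is not taken from Indigo's source, which may handle a different set.

```python
# Assumed mapping of common Latin ligature code points to plain letters
LIGATURES = {
    '\ufb00': 'ff',
    '\ufb01': 'fi',
    '\ufb02': 'fl',
    '\ufb03': 'ffi',
    '\ufb04': 'ffl',
}


def expand_ligatures(text):
    """Replace each ligature character with its separate letters."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return text


cleaned = expand_ligatures('the of\ufb01ce on the \ufb01rst \ufb02oor')
```

PDF text extraction often yields these single-character ligatures, which would otherwise prevent plain-text matches on words such as "office".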

class indigo.analysis.publications.base.BasePublicationFinder¶

This finds publication details for a published document. For example, a country-specific implementation can look up a Government Gazette given a date, gazette name, and number.

find_publications(params)¶: Return a list of publications matching the given params, a dict of arbitrary key-value pairs.

locale = (None, None, None)¶: The locale this finder is suited for, as (country, language, locality).

class indigo.analysis.refs.base.BaseRefsFinder¶

Finds references to Acts in documents.

Subclasses must implement find_references_in_document.

act_re = None¶: This must be defined by a subclass. It should be a compiled regular expression, with named captures for ref, num and year.

find_references_in_document(document)¶: Find references in +document+, which is an Indigo Document object.

make_href(match)¶: Turn this match into a full FRBR URI href

make_ref(match)¶

Make a reference out of this match, returning a (ref, start, end) tuple which is the new ref node, and the start and end position of what text in the parent element it should be replacing.

By default, the first group in the act_re is substituted with the ref.
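To make the act_re and make_href contract more concrete, here is a standalone sketch using only the re module. The pattern and the URI layout (/country/act/year/number) are assumptions chosen for illustration; a real subclass would define its own expression for its jurisdiction's citation style.

```python
import re

# Hypothetical pattern: matches citations such as "Act 108 of 1997" or
# "Act no. 108 of 1997". The first group is the named group "ref", which is
# the text that would be wrapped in a reference.
act_re = re.compile(
    r'\bAct,?\s+(?P<ref>(?:no\.?\s*)?(?P<num>\d+)\s+of\s+(?P<year>\d{4}))',
    re.IGNORECASE,
)


def make_href(match, country='za'):
    """Build an href from a match. The URI layout here is an assumption."""
    return '/%s/act/%s/%s' % (country, match.group('year'), match.group('num'))


text = 'as defined in the Water Services Act 108 of 1997.'
match = act_re.search(text)
href = make_href(match)

# start and end positions of the text that the new ref node would replace
start, end = match.start('ref'), match.end('ref')
```

Keeping ref as the first group means the default substitution described above replaces only the number and year, leaving the word "Act" as ordinary text.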

class indigo.analysis.terms.base.BaseTermsFinder¶

Finds references to defined terms in documents.

Subclasses must implement find_terms_in_document.

add_terms_to_references(doc, terms)¶: Add defined terms to the references section of the XML.

build_tlc_term(parent, id, term)¶: Build an element such as <TLCTerm id="term-applicant" href="/ontology/term/this.eng.applicant" showAs="Applicant"/>

definition_sections(doc)¶: Yield sections (or other basic units) that potentially contain definitions of terms.

find_definitions(doc)¶: Find def elements in the document and return a dict from term ids to the text of the term.

find_term_references(doc, terms)¶: Find and decorate references to terms in the document. The +terms+ param is a dict from term_id to actual term.

find_terms_in_document(document)¶: Find defined terms in +document+, which is an Indigo Document object.

guess_at_definitions(doc)¶

Find defined terms in the document, such as:

“this word” means something…

It identifies “this word” as a defined term and wraps it in a def tag with a refersTo attribute referencing the term being defined. The surrounding block structure also has its refersTo attribute set to the term. This way, the term is both marked as defined, and the container element with the full definition of the term is identified.

mark_definition(container, term, start_pos, end_pos)¶: Update the container node to wrap the given term in a definition tag.

renumber_terms(doc)¶: Recalculate ids for <term> elements
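The kind of pattern matching that guess_at_definitions describes can be sketched on plain text. This is a simplified illustration that works on strings rather than XML; the regular expression, the guess_terms helper and the id scheme (term- plus the term with spaces replaced by underscores) are all assumptions, not Indigo's code.

```python
import re

# Hypothetical pattern: a quoted term at the start of a line followed by
# the word "means". Accepts both curly and straight double quotes.
definition_re = re.compile(
    r'^\s*["“](?P<term>[^"”]+)["”]\s+means\b',
    re.MULTILINE,
)


def guess_terms(text):
    """Return a dict from assumed term ids to the text of each term."""
    terms = {}
    for match in definition_re.finditer(text):
        term = match.group('term').strip()
        term_id = 'term-' + re.sub(r'\s+', '_', term)
        terms[term_id] = term
    return terms


sample = (
    '“applicant” means a person who applies for a licence;\n'
    '"water services" means the supply of water.'
)
found = guess_terms(sample)
```

The resulting dict has the same shape that find_definitions is documented to return, which is what find_term_references then consumes.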

class indigo.analysis.toc.base.TOCBuilderBase¶

This builds a Table of Contents for an Act.

A Table of Contents is a tree of TOCElement instances, each element representing an item of interest in the Table of Contents. Each item has attributes useful for presenting a Table of Contents, such as a type (chapter, part, etc.), a number, a heading and further child elements.

The TOC is assembled from certain tags in the document, see toc_elements.

The Table of Contents can also be used to look up the XML element corresponding to an item in the Table of Contents identified by its subcomponent path. This is useful when handling URIs such as .../eng/main/section/1 or .../eng/main/part/C. See cobalt.act.Act.get_subcomponent().

Some components can be uniquely identified by their type and number, such as Section 2. Others require context, such as Part 2 of Chapter 1. The latter are controlled by toc_non_unique_components.

determine_component(element)¶: Determine the component element which contains +element+.

friendly_title(item)¶: Build a friendly title for this, based on heading names etc.

locale = (None, None, None)¶: The locale this TOC builder is suited for, as (country, language, locality).

process_elements(component, elements, parent=None)¶: Process the list of elements and their children, and return a (potentially empty) set of TOC items.

table_of_contents(act, language)¶: Get the table of contents of act as a list of TOCElement instances.

table_of_contents_entry_for_element(document, element)¶: Build the table of contents entry for an element from a document.

table_of_contents_for_document(document)¶: Build the table of contents for a document.

titles = {}¶

Dict from toc elements (tag names without namespaces) to functions that take a TOCElement instance and return a string title for that element.

Include the special item default to handle elements not in the list.

toc_elements = [u'coverpage', u'preface', u'preamble', u'part', u'chapter', u'section', u'conclusions', u'doc']¶: Elements we include in the table of contents, without their XML namespace. Subclasses must provide this.

toc_non_unique_components = [u'chapter', u'part']¶: These TOC elements (tag names without namespaces) aren’t numbered uniquely throughout the document and will need their parent components for context. Subclasses must provide this.
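The titles dict and its default entry can be illustrated with a standalone sketch. Item is a minimal stand-in for TOCElement (which carries more attributes), and title_for and the two title functions are hypothetical names; the title formats are assumptions.

```python
from collections import namedtuple

# Minimal stand-in for TOCElement, with only the fields this sketch needs
Item = namedtuple('Item', ['type', 'num', 'heading'])


def section_title(item):
    # sections read best as "1. Definitions"
    if item.heading:
        return '%s. %s' % (item.num, item.heading)
    return 'Section %s' % item.num


def default_title(item):
    # everything else: "Chapter 2 - Water boards"
    title = item.type.capitalize()
    if item.num:
        title += ' ' + item.num
    if item.heading:
        title += ' - ' + item.heading
    return title


# tag names without namespaces, plus the special 'default' entry
titles = {
    'section': section_title,
    'default': default_title,
}


def title_for(item):
    """Use the function for this element type, falling back to 'default'."""
    build = titles.get(item.type, titles['default'])
    return build(item)


section = title_for(Item('section', '1', 'Definitions'))
chapter = title_for(Item('chapter', '2', 'Water boards'))
```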

class indigo.analysis.work_detail.base.BaseWorkDetail¶

Provides some locale-specific work details.

Subclasses should implement work_numbered_title.

work_friendly_type(work)¶: Return a friendly document type for this work, such as “Act” or “By-law”.

work_numbered_title(work)¶: Return a formatted title using the number for this work, such as “Act 5 of 2009”. This usually differs from the short title. May return None.
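A numbered title of the form shown above could be produced along these lines. This is a standalone sketch: the number and year attributes on the work are assumptions, and a real plugin would implement this as a method on a BaseWorkDetail subclass.

```python
from types import SimpleNamespace


def work_numbered_title(work):
    """Return a title such as 'Act 5 of 2009', or None if it can't be built."""
    # 'number' and 'year' are assumed attribute names for this sketch
    number = getattr(work, 'number', None)
    year = getattr(work, 'year', None)
    if not number or not year:
        return None
    return 'Act %s of %s' % (number, year)


titled = work_numbered_title(SimpleNamespace(number='5', year='2009'))
untitled = work_numbered_title(SimpleNamespace(number=None, year='2009'))
```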

class indigo.plugins.LocaleBasedRegistry¶

Base class for locale-based registries. Helps register and lookup locale-based classes.

for_document(topic, document)¶: Find an appropriate helper for this document.

for_locale(topic, country=None, language=None, locality=None)¶: Find an appropriate helper for this locale description. Tightest match wins.

for_work(topic, work)¶: Find an appropriate helper for this work.

register(topic, name=None)¶: Class decorator that registers a new class with the registry.

registry = None¶: Registry of class names to classes. Subclasses MUST define this to avoid sharing registry classes.
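The reason each subclass defines its own registry is ordinary Python class-attribute lookup. The sketch below is a simplified illustration, not Indigo's code: SketchRegistry uses a classmethod and a flat dict keyed by topic, both of which are assumptions.

```python
class SketchRegistry:
    """Simplified stand-in for a locale-based registry base class."""

    registry = None  # subclasses are expected to replace this

    @classmethod
    def register(cls, topic):
        def decorator(target):
            # resolved through cls, so it hits the subclass's own dict
            cls.registry[topic] = target
            return target
        return decorator


class PluginRegistry(SketchRegistry):
    registry = {}  # this subclass's own dict


class TaskRegistry(SketchRegistry):
    registry = {}  # a separate dict, isolated from PluginRegistry


@PluginRegistry.register('toc')
class MyTOCBuilder:
    pass
```

If the base class held a single dict and the subclasses did not override it, both subclasses would write into that one shared dict, and plugins and custom tasks would end up mixed together.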