NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Paper link

Code link

Main image

This paper presents a new participatory Python-based natural language augmentation framework that supports the creation of transformations (modifications to the data) and filters (data splits according to specific features).

The current version of the framework contains 117 transformations and 23 filters for a variety of natural language tasks.

The authors demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models.

P. S. I’m one of the co-authors! :)


Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of their training data. But most transformations do not alter the structure of examples in drastic and meaningful ways, making them less effective as potential training or test examples.

Some transformations are universally useful, for example, changing places to ones from different geographic regions or changing names to those from different cultures. On the other hand, some NLP tasks may benefit from transforming specific linguistic properties: changing the word “happy” to “very happy” in the input is more relevant for sentiment analysis than for summarization. As such, having a single place to collect both task-specific and task-independent augmentations will lower the barrier to creating appropriate suites of augmentations for different tasks.

In 2021, several evaluation suites for the GEM benchmark were proposed:

  • transformations (e.g. back-translation, introduction of typographical errors, etc.)
  • subpopulations, i.e., test subsets filtered according to features such as input complexity, input size, etc.
  • data shifts, i.e., new test sets that do not contain any of the original test set material.

NL-Augmenter is a participant-driven repository that aims to enable more diverse and better-characterized data during testing and training. To encourage task-specific implementations, transformations are tied to a widely used data format (e.g., a text pair, a question-answer pair, etc.) along with the task types (e.g., entailment, tagging, etc.) that they intend to benefit.

The organization of the process


A workshop was organized to construct this repository. Unlike a traditional workshop where people submit papers, participants were asked to submit Python implementations of transformations to the GitHub repository.

The organizers created a base repository and a set of interfaces. Participants could then submit code that followed the pre-defined interfaces; each submission went through code review, which checked adherence to the style guide and the presence of tests. Participants were also encouraged to submit novel and/or task-specific transformations and filters.
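To give a flavor of what a submission looks like: a sentence-level transformation subclasses a `SentenceOperation` interface and implements `generate`. Below is a minimal self-contained sketch; the stub base class and the `UppercaseIntensifier` transformation are illustrative stand-ins of my own, not code from the repository (the real interfaces live in the repo's `interfaces` package).

```python
from typing import List


class SentenceOperation:
    """Simplified stand-in for the repository's sentence-level interface."""

    tasks: List = []
    languages = "All"

    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError


class UppercaseIntensifier(SentenceOperation):
    """Hypothetical transformation: uppercases common intensifiers,
    a change more relevant for sentiment analysis than for summarization."""

    def generate(self, sentence: str) -> List[str]:
        intensifiers = {"very", "really", "extremely"}
        perturbed = " ".join(
            w.upper() if w.lower() in intensifiers else w
            for w in sentence.split()
        )
        return [perturbed]
```

A submission like this would also ship tests and metadata (tags, supported languages) for the review process.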

At the time of the paper’s publication, there were 117 transformations and 23 filters in the repository.


To make searching for specific perturbations and understanding their characteristics easier, three main categories of tags were introduced:

  • General properties;
  • Output properties;
  • Processing properties.

Some of the tags were assigned automatically (from the metadata in the code); others were assigned by the contributors themselves.

Robustness analysis

All authors of the accepted perturbations were asked to provide the task performance scores for each of their respective transformations or filters.

The perturbations are mainly split into text classification tasks, tagging tasks, and question-answering tasks. For the experiments in this paper, the authors focus on text classification and the relevant perturbations. They compare model performance on the original and the perturbed data; the percentage of changed sentences and the drop in performance are reported.

Four datasets are used: SST-2 and IMDB for sentiment analysis, QQP for duplicate question detection, and MNLI for natural language inference. The corresponding models (the most downloaded on Hugging Face) are used for evaluation. A random 20% of each validation set is perturbed during the evaluation.
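The protocol above (perturb a random 20% of the validation examples, then compare accuracy before and after) can be sketched as follows. This is my own minimal reconstruction, not the authors' evaluation code; `model`, `dataset`, and `perturb` are placeholders for a real classifier, a labeled validation set, and an NL-Augmenter transformation.

```python
import random
from typing import Callable, List, Tuple


def evaluate_with_perturbation(
    model: Callable[[str], int],
    dataset: List[Tuple[str, int]],
    perturb: Callable[[str], str],
    fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (accuracy on original data, accuracy with `fraction` perturbed)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(dataset)), int(len(dataset) * fraction)))
    correct_orig = correct_pert = 0
    for i, (text, label) in enumerate(dataset):
        correct_orig += model(text) == label
        # perturb only the randomly selected subset of examples
        text_p = perturb(text) if i in chosen else text
        correct_pert += model(text_p) == label
    return correct_orig / len(dataset), correct_pert / len(dataset)
```

The reported "drop in performance" is then simply the difference between the two returned accuracies.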

Discussions and impact


  • the results show that the tested perturbations introduce serious challenges for the models and decrease their score, but it would be better to analyze the contribution of each separate perturbation;
  • some of the perturbations’ tags could be inconsistent, and it is necessary to assess the quality of the tag assignment to ensure a reliable analysis;
  • the robustness analysis shows the weaknesses of the models, but a separate analysis is necessary to verify that using these perturbations while training will mitigate these weaknesses;

NL-Augmenter was created by many participants, and there is a risk that the contributions of individuals will be less appreciated than in standalone projects. To proactively give appropriate credit, each transformation has a data card mentioning its contributors, and all participants are listed as co-authors of this paper.

My contribution

When I saw this project, I was immediately interested and wanted to contribute. I had worked on multiple NLP projects before and knew that the robustness of models is often an issue, so I wanted to take part in dealing with it.

At that time, there were already many transformations in the repository covering both common augmentations and many specific and new ones. However, I didn’t want to make a meaningless contribution, so I started looking for new ideas.

After some time, I chose the paper “An Analysis of Simple Data Augmentation for Named Entity Recognition”. It introduced several interesting data augmentations, and I thought they would be useful. (Later I found this implementation, but I had missed it during my initial research.)

I chose the following augmentation, which is relevant for NER tasks:

Shuffle within segments (SiS): We first split the token sequence into segments of the same label. Thus, each segment corresponds to either a mention or a sequence of out-of-mention tokens. For example, a sentence She did not complain of headache or any other neurological symptoms . with tags O O O O O B-problem O B-problem I-problem I-problem I-problem O is split into five segments: [She did not complain of], [headache], [or], [any other neurological symptoms], [.]. Then for each segment, we use a binomial distribution to randomly decide whether it should be shuffled. If yes, the order of the tokens within the segment is shuffled while the label order is kept unchanged.
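The segment-splitting step from this description can be reproduced with `itertools.groupby`, grouping consecutive tags by their label (the part after the `B-`/`I-` prefix). Here is a small self-contained check on the paper's example sentence:

```python
import itertools

tokens = "She did not complain of headache or any other neurological symptoms .".split()
tags = ["O", "O", "O", "O", "O", "B-problem", "O",
        "B-problem", "I-problem", "I-problem", "I-problem", "O"]

# group consecutive (token, tag) pairs whose tags share the same label
segments = [
    [tok for tok, _ in group]
    for _, group in itertools.groupby(
        zip(tokens, tags), key=lambda pair: pair[1].split("-")[-1]
    )
]
print(segments)  # five segments, as in the paper's example
```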

Here is the full code of the implementation:

import itertools
import random
from typing import List, Tuple

import numpy as np

from interfaces.TaggingOperation import TaggingOperation
from tasks.TaskTypes import TaskType

class ShuffleWithinSegments(TaggingOperation):
    """
    Shuffling tokens within segments.
    The token sequence is split into segments of the same label;
    for each segment, a decision is made randomly: to shuffle it or not.
    In case of a positive decision, the tokens within the segment are shuffled.
    """

    tasks = [TaskType.TEXT_TAGGING]
    languages = "All"
    keywords = ["lexical"]

    def __init__(self, seed=0, max_outputs=1):
        super().__init__(seed, max_outputs=max_outputs)

    def generate(
            self, token_sequence: List[str], tag_sequence: List[str]
    ) -> List[Tuple[List[str], List[str]]]:

        # seed both RNGs: np.random.binomial and random.shuffle are used below
        random.seed(self.seed)
        np.random.seed(self.seed)

        token_seq = token_sequence.copy()
        tag_seq = tag_sequence.copy()
        perturbed_sentences = []

        assert len(token_seq) == len(
            tag_seq
        ), "Lengths of `token_seq` and `tag_seq` should be the same"

        # we need the original indices of each tag
        tags = [(i, t) for i, t in enumerate(tag_seq)]

        # split the tags into the segments with the same label
        groups = [list(g) for k, g in itertools.groupby(tags, lambda s: s[1].split('-')[-1])]

        for _ in itertools.repeat(None, self.max_outputs):
            new_tokens = []

            for group in groups:
                # now we need only indices of each group
                indices = [i[0] for i in group]

                if np.random.binomial(1, 0.5):
                    # shuffle the token order within this segment;
                    # the label order is kept unchanged
                    random.shuffle(indices)

                new_tokens.extend([token_seq[idx] for idx in indices])

            perturbed_sentences.append((new_tokens, tag_seq))

        return perturbed_sentences

The link to the augmentation page.

This was an interesting experience for me, and I hope that people will benefit from using NL-Augmenter.

paperreview deeplearning nlp augmentation robustness