Collaborative labelling tool
Obtaining annotated data is a key challenge in developing any machine learning algorithm. As the diagram below shows, labeling is the most time-consuming task in a machine learning project, so optimizing it is essential.
In our case, the task is to annotate sentences that customers have written or said to the chatbot. Until now, annotation was done in spreadsheets, a laborious procedure that the available tools did little to ease. This is why I decided to set up a dedicated annotation tool.
Architecture of the solution
After benchmarking several open-source solutions, I chose Label Studio. It supports many types of labeling across many data formats. Label Studio can also be integrated with machine learning models to pre-fill annotations with predictions or to drive continuous active learning. The diagram below shows how Label Studio fits into a machine learning environment:
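To make the integration more concrete, here is a minimal sketch of an ML backend using the `label-studio-ml` package's `LabelStudioMLBase` interface. The `IntentBackend` class, the `classify` helper, and the `intent`/`text` field names are assumptions for illustration and would have to match the actual labeling configuration; the exact `predict` signature also varies between SDK versions.

```python
# A minimal sketch of a Label Studio ML backend, assuming the
# label-studio-ml package; IntentBackend and classify() are
# hypothetical names, not part of the library.
from label_studio_ml.model import LabelStudioMLBase

class IntentBackend(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        """Return one prediction per task so annotators see pre-filled labels."""
        predictions = []
        for task in tasks:
            text = task["data"]["text"]  # field name depends on the labeling config
            intent, confidence = self.classify(text)  # hypothetical model call
            predictions.append({
                "score": confidence,
                "result": [{
                    "from_name": "intent",  # must match the control name in the config
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [intent]},
                }],
            })
        return predictions

    def classify(self, text):
        # Placeholder: plug the real intent model in here.
        return "greeting", 0.93
```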
Using the platform
Label Studio is very quick to learn, both for the administrator in charge of creating annotation projects and for the annotators. First of all, each new annotator has to create an account, which secures access and also makes it possible to track each person's annotation quality afterwards.
The annotator sees a sentence displayed. Above it, they choose which entities appear in the sentence and delimit their locations by highlighting with the mouse; below it, they see a list of intents with a prediction already made. In the vast majority of cases, annotation therefore becomes verification, since the algorithm is already highly accurate.
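To show how those pre-filled predictions can reach the interface, here is a hedged example of importing a task together with a prediction through Label Studio's project import API. The URL, API token, project id, and the `text`/`intent` field names are placeholders that must match your own instance and labeling configuration.

```python
# Importing one pre-annotated task into Label Studio; all identifiers
# below (URL, token, project id, field names) are illustrative.
import requests

task = {
    "data": {"text": "I want to reset my password"},
    "predictions": [{
        "model_version": "intent-v1",
        "score": 0.91,
        "result": [{
            "from_name": "intent",   # control name in the labeling config
            "to_name": "text",
            "type": "choices",
            "value": {"choices": ["password_reset"]},
        }],
    }],
}

resp = requests.post(
    "http://localhost:8080/api/projects/1/import",
    headers={"Authorization": "Token <your-api-token>"},
    json=[task],  # the endpoint accepts a list of tasks
)
resp.raise_for_status()
```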
Labelling strategy
It is important to set up an annotation strategy, because annotating sentences at random only reproduces the representation disparities already present in the initial dataset. To increase annotation speed, the predictions of a machine learning model can be used. This strategy is called "active learning" and generally proceeds as follows:
- First, manually label a small subsample of the data
- Train a first model on this labeled subset. The model will not be perfect, but it will indicate which data should be labeled next to improve it
- Use the model to predict the class of each remaining unlabeled data point
- Compute a priority score for each unlabeled data point based on the model's prediction (see the sketch after this list)
- Repeat the process iteratively: label the highest-priority points, retrain the model on the enlarged labeled set, then submit the remaining unlabeled data to the new model to update the priority scores. In this way, the labeling strategy keeps being optimized as the models improve
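As a concrete example of the scoring step, here is a minimal sketch of least-confidence uncertainty sampling, one common way to compute the priority score; it assumes a scikit-learn-style classifier exposing `predict_proba` and is only one of several possible scoring functions.

```python
# Least-confidence uncertainty sampling: the lower the model's top
# probability for a point, the higher its labeling priority.
import numpy as np

def prioritize(model, unlabeled):
    """Return indices of unlabeled points, most uncertain first."""
    probas = model.predict_proba(unlabeled)  # shape: (n_samples, n_classes)
    confidence = probas.max(axis=1)          # top-class probability per sample
    priority = 1.0 - confidence              # least-confidence score
    return np.argsort(priority)[::-1]        # highest priority first

# Each iteration: label the top-ranked points, retrain, re-score the rest.
```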
This methodology grows the dataset very efficiently, but more importantly, by targeting the most informative data to annotate, it improves the model's accuracy at a lower cost.