MLRC

MLRC (Medical, Legal, Regulatory, and Compliance) teams take weeks, and sometimes months, to review consumer-facing content submitted by marketing agencies, e.g., website text, Facebook ads, Instagram posts, and TV ads. Content can be text, audio, images, or video. The review process involves tens of people across medical, legal, regulatory, and compliance, which slows the release of ad campaigns and website content to consumers. With thousands of content jobs submitted for MLRC review every month, the backlog cuts into the time reviewers have for their actual day jobs, and the volume keeps growing month over month, with mounting pressure on the teams to speed up reviews.

Inabia-AI

Inabia AI will reduce the review time from weeks to days by front-loading the review onto the content creators (e.g., marketing agencies) through a Grammarly-like web UI that performs four levels of review, similar to the levels MLRC reviewers apply to actual content.

Level-1-Review (Detection)

Locate the problem sentences and clauses in submitted text, i.e., error detection.

Fine-tuned BERT-large on MLRC dataset

This custom BERT-large was fine-tuned on a balanced MLRC dataset.
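
As an illustration of Level-1 detection, the sketch below splits a draft into sentences, scores each one, and reports the character offsets of flagged sentences. It assumes a binary classification head (e.g., BertForSequenceClassification with labels 0 = clean, 1 = problem) has been trained on top of this checkpoint; that head and label mapping are assumptions for illustration, not part of the published usage example further down:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumption: a binary classification head (0 = clean, 1 = problem) has been
# trained on top of the Inabia-AI/bert-large-uncased-mlrc checkpoint.
tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertForSequenceClassification.from_pretrained(
    'Inabia-AI/bert-large-uncased-mlrc', num_labels=2
)
model.eval()

draft = "Our supplement cures arthritis in two weeks. Individual results may vary."

# Naive sentence split; a production system would use a proper sentence splitter.
for sentence in draft.split('. '):
    sentence = sentence.strip()
    if not sentence:
        continue
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    if logits.argmax(dim=-1).item() == 1:  # flagged as a problem sentence
        start = draft.find(sentence)
        print(f"Flagged chars {start}-{start + len(sentence)}: {sentence!r}")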

Model description

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

- Masked language modeling (MLM): the model randomly masks 15% of the words in the input sentence and must predict the masked words, which lets it learn a bidirectional representation of the sentence rather than a purely left-to-right one.
- Next sentence prediction (NSP): the model receives two sentences concatenated as input and must predict whether the second sentence actually followed the first in the original text.
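
To see the MLM objective in action, the snippet below fills in a masked token. It is a minimal sketch using the public base checkpoint (bert-large-uncased), since this fine-tuned MLRC model is intended for feature extraction and detection rather than mask filling:

from transformers import pipeline

# Masked language modeling demo with the public base checkpoint,
# not the fine-tuned MLRC model.
unmasker = pipeline('fill-mask', model='bert-large-uncased')
for prediction in unmasker("This medication may cause mild [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))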

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
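
As a minimal sketch of that workflow (the sentences and labels below are placeholders, not MLRC data), the pooled [CLS] feature from this checkpoint can feed a standard scikit-learn classifier:

import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model.eval()

def embed(sentence):
    # Pooled representation derived from the [CLS] token
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return model(**inputs).pooler_output[0].numpy()

# Placeholder labeled sentences (1 = problem, 0 = clean)
sentences = ["Guaranteed to cure arthritis.", "Ask your doctor if this is right for you."]
labels = [1, 0]
features = [embed(s) for s in sentences]

# Train a standard classifier on the extracted features
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict([embed("Results guaranteed in two weeks.")]))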

The detailed release history can be found in the google-research/bert README on GitHub.

Model                                   #params   Language
bert-base-uncased                       110M      English
bert-large-uncased                      340M      English
bert-base-cased                         110M      English
bert-large-cased                        340M      English
bert-base-chinese                       110M      Chinese
bert-base-multilingual-cased            110M      Multiple
bert-large-uncased-whole-word-masking   340M      English
bert-large-cased-whole-word-masking     340M      English

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel

# Load the tokenizer and the fine-tuned model weights
tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')

# Tokenize the input and run a forward pass to get the features
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel

# Load the tokenizer and the fine-tuned model weights
tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = TFBertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')

# Tokenize the input and run a forward pass to get the features
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
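
In both frameworks, output.last_hidden_state holds one 1024-dimensional vector per token (the BERT-large hidden size), and output.pooler_output gives a single pooled vector per input that can be fed to a downstream classifier.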

Evaluation results

When fine-tuned on the downstream task (text classification), this model achieves the following results:

Training dataset   TP:TN ratio   # of TPs   # of TNs   Precision   Recall   F1
MLRC               1:1           200        200        64%         55%      50%
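
For reference, a minimal sketch of how these metrics can be computed for a balanced evaluation set with scikit-learn; the labels and predictions below are placeholders, not the actual MLRC evaluation data:

from sklearn.metrics import precision_recall_fscore_support

# Placeholder gold labels and predictions (1 = problem sentence, 0 = clean)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary'
)
print(f"Precision: {precision:.0%}  Recall: {recall:.0%}  F1: {f1:.0%}")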