Benchmarking HuggingFace models on a dataset

At some point, you will likely find yourself wanting to benchmark previously trained models. This guide shows you how to do it for a HuggingFace model with nnbench.

Example: Named Entity Recognition

We start with a small tangent about the example setup that we will use in this guide. If you are only interested in the application of nnbench, you can skip this section.

There are many reasons why you might want to retrieve saved models for benchmarking. Among them are reviewing the work of colleagues, comparing model performance against an existing benchmark, or dealing with models that require so much compute that in-place retraining is impractical.

For this example, we look at a named entity recognition (NER) model based on the pre-trained transformer encoder BERT from HuggingFace. The model is trained on the CoNLLpp dataset, which consists of sentences from news stories in which words referring to entities are tagged as Person, Organization, Location, or Miscellaneous. Words that do not refer to an entity are assigned an out-of-entity label.
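
As a brief, made-up illustration of the tagging scheme (the tokens below are hypothetical and not taken from CoNLLpp), each word carries either an entity tag or the out-of-entity label O:

# Hypothetical example of the CoNLL-style BIO tagging scheme:
# "B-" marks the first word of an entity, "I-" a continuation,
# and "O" is the out-of-entity label.
tokens = ["Alice", "Smith", "works", "for", "Acme", "Corp", "in", "Berlin"]
ner_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]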

Model Training

You can find the code to train the model in the nnbench repository. If you want to skip running the training script but still want to reproduce this example, you can take any BERT model fine-tuned for NER on the CoNLL dataset family. You can find many on the HuggingFace model hub, for example this one. You need to download the model.safetensors, config.json, tokenizer_config.json, and tokenizer.json files. If you want to train your own model, continue below.

Training the model requires some preprocessing and data wrangling. We will not go into the details here, but if you are interested in a more thorough walkthrough, have a look at this resource by HuggingFace, which served as the basis for this example.

It is not feasible to train the model on a CPU. If you do not have access to a GPU, you can use free GPU instances on Google Colab. When opening a new Colab notebook, make sure to select a GPU instance in the upper right corner. Then, you can upload the training.py. You can ignore any warnings about the data not being persisted.

Next, install the necessary dependencies: !pip install datasets transformers[torch]. Google Colab comes with some dependencies already installed in the environment. Hence, if you are working with a different GPU instance, make sure to install everything from the pyproject.toml in the examples/artifact_benchmarking folder.

Finally, you can execute the training script with !python training.py. This will train two BERT models ("bert-base-uncased" and "distilbert-base-uncased") which we can then compare using nnbench. If you want, you can adapt the script to train other models by editing the tuples in the tokenizers_and_models list at the bottom of the file, sketched below. Training the models takes around 10 minutes.
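
For orientation, the list might look roughly like the sketch below, with one entry per (tokenizer, model) checkpoint pair; the exact layout is defined in training.py, so check the script before editing it.

# Hypothetical shape of the tokenizers_and_models list in training.py:
# one (tokenizer checkpoint, model checkpoint) pair per model to fine-tune.
tokenizers_and_models = [
    ("bert-base-uncased", "bert-base-uncased"),
    ("distilbert-base-uncased", "distilbert-base-uncased"),
]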

Once it is done, download the respective files and save them to your disk. They should be the same files mentioned above. We will need the paths to these files for benchmarking later.

The benchmarks

You can find the benchmarking code in examples/huggingface/benchmark.py. We calculate precision, recall, accuracy, and F1 scores for the whole test set as well as for specific labels. Additionally, we obtain information about the model, such as its memory footprint and inference time.
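
For reference, the aggregate metrics are standard functions of the per-class confusion counts collected by the evaluation loop. The sketch below shows an F1 computation under the assumption that the counts are per-class tensors; the exact implementation in benchmark.py may differ slightly.

import torch


def f1_from_counts(tp: torch.Tensor, fp: torch.Tensor, fn: torch.Tensor, eps: float = 1e-6) -> float:
    # Standard precision/recall/F1 per class, averaged over all classes.
    # eps guards against division by zero for classes without any predictions.
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return torch.mean(f1).item()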

We will not walk through the whole file but instead point out certain design choices as inspiration. If you are interested in a more detailed walkthrough of how to set up benchmarks, you can find it here.

One notable design choice is that we factored out the evaluation loop, since it is needed by all evaluation metrics. We cache it using the functools.cache decorator, so the evaluation loop runs only once per benchmark run instead of once per metric, which greatly reduces runtime.
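
A minimal sketch of this pattern, reusing the function name from benchmark.py but with the body elided: functools.cache memoizes on the argument objects, so every metric that calls the loop with the same model and dataloader reuses the counts from the first call.

from functools import cache


@cache
def run_eval_loop(model, dataloader, padding_id: int = -100):
    # Runs the model over the validation set once and accumulates per-class
    # true/false positives and negatives; later calls with the same arguments
    # return the cached counts without touching the data again.
    ...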

We also parametrize the benchmarks with nnbench.product to obtain the per-class metrics. As the parametrization needs the same arguments for each benchmark, we use Python's built-in functools.partial to fill them in; calling the resulting partial once yields the finished decorator, parametrize_label.

parametrize_label = partial(
    nnbench.product,
    model=[TokenClassificationModelMemo("dslim/distilbert-NER")],
    tokenizer=[TokenizerMemo("dslim/distilbert-NER")],
    valdata=[ConllValidationMemo(path="conllpp", split="validation")],
    index_label_mapping=[IndexLabelMapMemo(path="conllpp", split="validation")],
    class_label=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"],
    tags=("metric", "per-class"),
)()

Tip

In this parametrization, the model path is hardcoded to "dslim/distilbert-NER" on the HuggingFace hub. When benchmarking other models, be sure to change this path to the actual model you want to benchmark.

After this, the benchmark runner code is actually very simple, as in most of the other examples. You can find it in the nnbench repository in examples/huggingface/runner.py.

Custom memo classes

The parametrization contains a list of models, each an instance of TokenClassificationModelMemo, a custom class we implemented that inherits from the nnbench.Memo class. A big advantage of a memo in this case is its ability to lazy-load models and later evict the loaded models from the cache again.

class TokenClassificationModelMemo(Memo[Module]):
    def __init__(self, path: str):
        self.path = path

    @cached_memo
    def __call__(self) -> Module:
        model: Module = AutoModelForTokenClassification.from_pretrained(
            self.path, use_safetensors=True
        )
        return model

    def __str__(self):
        return self.path

The Memo class is a generic wrapper around serialized data of any kind. It allows for lazy deserialization of artifacts from uniquely identifying metadata such as storage paths, checksums, or, in our case, model names on the HuggingFace Hub. In our derived class, we have to override the Memo.__call__() method to properly load the memoized value into memory.

We do the same for the CoNLLpp dataset.

class ConllValidationMemo(Memo[Dataset]):
    def __init__(self, path: str, split: str):
        self.path = path
        self.split = split

    @cached_memo
    def __call__(self) -> Dataset:
        dataset = load_dataset(self.path)
        path = dataset.cache_files[self.split][0]["filename"]
        dataset = Dataset.from_file(path)
        return dataset

    def __str__(self):
        return self.path + "/" + self.split

In this case, we lazy-load the datasets.Dataset object. In the following IndexLabelMapMemo class, we store a dictionary mapping the label ID to a semantic string.

class IndexLabelMapMemo(Memo[dict[int, str]]):
    def __init__(self, path: str, split: str):
        self.path = path
        self.split = split

    @cached_memo
    def __call__(self) -> dict[int, str]:
        dataset = load_dataset(self.path)
        path = dataset.cache_files[self.split][0]["filename"]
        dataset = Dataset.from_file(path)
        label_names: Sequence[str] = dataset.features["ner_tags"].feature.names
        id2label = {i: label for i, label in enumerate(label_names)}
        return id2label

    def __str__(self):
        return self.path + "/" + self.split
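
As a quick usage sketch (not part of benchmark.py): constructing the memo is cheap and nothing is loaded until it is called, and assuming cached_memo caches the materialized value, repeated calls are served from memory.

# Hypothetical usage of the memo class defined above.
id2label_memo = IndexLabelMapMemo(path="conllpp", split="validation")
print(id2label_memo)        # "conllpp/validation" - only metadata exists so far
id2label = id2label_memo()  # first call loads the dataset and builds the mapping
id2label = id2label_memo()  # second call returns the cached mapping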

Info

There is no need to type-hint benchmark parameters as TokenClassificationModelMemo in the corresponding benchmarks - the benchmark runner takes care of replacing the memos with their memoized values.

Because we implemented our memoized values as four different memo classes, the benchmark input parameters stay modular - each benchmark only references the memos it actually uses. Consider the recall benchmarks:

@nnbench.benchmark(tags=("metric", "aggregate"))
def recall(
    model: Module,
    tokenizer: PreTrainedTokenizerBase,
    valdata: Dataset,
    padding_id: int = -100,
) -> float:
    dataloader = make_dataloader(tokenizer, valdata)

    tp, fp, tn, fn = run_eval_loop(model, dataloader, padding_id)
    recall = tp / (tp + fn + 1e-6)
    return torch.mean(recall).item()


@parametrize_label
def recall_per_class(
    class_label: str,
    model: Module,
    tokenizer: PreTrainedTokenizerBase,
    valdata: Dataset,
    index_label_mapping: dict[int, str],
    padding_id: int = -100,
) -> float:
    dataloader = make_dataloader(tokenizer, valdata)

    tp, fp, tn, fn = run_eval_loop(model, dataloader, padding_id)
    recall_values = tp / (tp + fn + 1e-6)
    for idx, lbl in index_label_mapping.items():
        if lbl == class_label:
            return recall_values[idx].item()
    raise ValueError(f"Key {class_label} not in test labels")

we see that the memoized index_label_mapping argument is only needed in the per-class benchmark, so it is never passed to the aggregate recall computation.

Tip

When implementing memos for a benchmark workload, wrapping only one value per memo is often worth the cost of an extra class definition: you retain more direct control over what goes into your benchmarks, and you avoid unused parameters altogether.