# pytorch / fairseq

### WMT2014 English-French

| MODEL | BLEU SCORE (code) | BLEU SCORE (paper) | SacreBLEU (code) | SacreBLEU (paper) | SPEED | GLOBAL RANK |
|---|---|---|---|---|---|---|
| ConvS2S | 40.13 | 40.46 | 38.76 | -- | 58.8 | #5 |
| DynamicConv | 42.41 | 43.20 | 41.17 | -- | 59.3 | #3 |
| LightConv | 42.25 | 43.10 | 40.88 | -- | 59.7 | #4 |
| Transformer Big | 42.43 | 43.20 | 41.27 | -- | 63.3 | #1 |
### WMT2014 English-German

| MODEL | BLEU SCORE (code) | BLEU SCORE (paper) | SacreBLEU (code) | SacreBLEU (paper) | SPEED | GLOBAL RANK |
|---|---|---|---|---|---|---|
| ConvS2S | 25.70 | 25.16 | 22.34 | -- | 76.6 | #19 |
| DynamicConv | 29.76 | 29.70 | 28.49 | -- | 66.5 | #9 |
| DynamicConv (without GLUs) | 29.17 | -- | 28.42 | -- | 68.6 | #11 |
| Facebook-FAIR (ensemble) | 36.72 | -- | 36.04 | -- | 1.7 | #1 |
| Facebook-FAIR (single) | 36.18 | -- | 35.56 | -- | 9.2 | #2 |
| LightConv | 29.17 | 28.90 | 28.62 | -- | | #9 |
| LightConv (without GLUs) | 28.44 | -- | 27.80 | -- | 69.4 | #13 |
| Transformer Big | 25.39 | 29.30 | 24.77 | -- | 68.4 | #17 |
| Transformer Big + BT | 33.89 | 35.00 | 33.58 | 33.80 | 8.1 | #3 |
### WMT2019 English-German

| MODEL | BLEU SCORE (code) | BLEU SCORE (paper) | SacreBLEU (code) | SacreBLEU (paper) | SPEED | GLOBAL RANK |
|---|---|---|---|---|---|---|
| ConvS2S | 35.34 | -- | 30.89 | -- | 69.3 | #15 |
| DynamicConv | 37.66 | -- | 37.58 | -- | 61.1 | #9 |
| DynamicConv (without GLUs) | 38.28 | -- | 37.93 | -- | 60.3 | #3 |
| Facebook-FAIR (ensemble) | 42.25 | 43.10 | 42.02 | 42.70 | 1.5 | #1 |
| Facebook-FAIR (single) | 40.94 | -- | 40.71 | -- | 8.2 | #2 |
| LightConv | 36.63 | -- | 36.53 | -- | | #13 |
| LightConv (without GLUs) | 37.78 | -- | 37.51 | -- | 61.7 | #6 |
| Transformer Big | 32.76 | -- | 32.61 | -- | 60.6 | #17 |
| Transformer Big + BT | 37.82 | -- | 37.70 | -- | 7.4 | #7 |
[![SotaBench](https://img.shields.io/endpoint.svg?url=https://sotabench.com/api/v0/badge/gh/mkardas/fairseq)](https://sotabench.com/user/marcin/repos/mkardas/fairseq)

## How the Repository is Evaluated

The full `sotabench.py` file (source):

```python
import re
from collections import OrderedDict

from sotabencheval.machine_translation import WMTEvaluator, WMTDataset, Language
from fairseq import utils
from tqdm import tqdm
import hubconf


class ModelCfg:
    def __init__(self, model_name, arxiv_id, src_lang, dst_lang, hubname, batch_size, description='', **kwargs):
        self.model_name, self.arxiv_id, self.src_lang, self.dst_lang = model_name, arxiv_id, src_lang, dst_lang
        self.hubname, self.batch_size = hubname, batch_size
        self.params = kwargs
        if self.params.get('tokenizer') == 'moses':
            self.params.setdefault('moses_no_dash_splits', True)
            self.params.setdefault('moses_no_escape', False)

        self.description = self._get_description(description)

    def _get_description(self, description):
        details = []
        if description:
            details.append(description)

        # a colon-separated checkpoint_file denotes an ensemble of checkpoints
        ensemble_len = len(self.params.get('checkpoint_file', '').split(':'))
        if ensemble_len > 1:
            details.append('ensemble of {} models'.format(ensemble_len))
        details.append('batch size: {}'.format(self.batch_size))
        details.append('beam width: {}'.format(self.params['beam']))
        lenpen = self.params.get('lenpen', 1)
        if lenpen != 1:
            details.append('length penalty: {:.2f}'.format(lenpen))
        return ', '.join(details)

    def get_evaluator(self, model, dataset):
        def tok4bleu(sentence):
            # protect intra-word hyphens with fairseq's ##AT##-##AT## marker
            # so tokenized BLEU is computed consistently with the references
            tokenized = model.tokenize(sentence)
            return re.sub(r'(\S)-(\S)', r'\1 ##AT##-##AT## \2', tokenized)

        return WMTEvaluator(
            dataset=dataset,
            source_lang=self.src_lang,
            target_lang=self.dst_lang,
            local_root="data/nlp/wmt",
            model_name=self.model_name,
            paper_arxiv_id=self.arxiv_id,
            model_description=self.description,
            tokenization=tok4bleu
        )

    def load_model(self):
        # similar to torch.hub.load, but makes sure to load hubconf from the current commit
        load = getattr(hubconf, self.hubname)
        return load(**self.params).cuda()


def translate_batch(model, sids, sentences):
    inputs = [model.encode(sentence) for sentence in sentences]
    lengths = [len(t) for t in inputs]
    dataset = model.task.build_dataset_for_inference(inputs, lengths)
    # collate all items into a single padded batch and move it to the GPU
    samples = dataset.collater(dataset)
    samples = utils.apply_to_sample(
        lambda tensor: tensor.to(model.device),
        samples
    )
    ids = samples['id'].cpu()

    generator = model.task.build_generator(model.args)

    translations = model.task.inference_step(generator, model.models, samples)
    # keep only the highest-scoring hypothesis for each sentence
    hypos = [translation[0]['tokens'] for translation in translations]
    translated = [model.decode(hypo) for hypo in hypos]
    return OrderedDict([(sids[idx], tr) for idx, tr in zip(ids, translated)])


def batchify(items, batch_size):
    # sort by sentence length, longest first, so each batch pads
    # sentences of similar length together
    items = sorted(list(items), key=lambda x: len(x[1]), reverse=True)
    length = len(items)
    return [items[i * batch_size: (i + 1) * batch_size] for i in range((length + batch_size - 1) // batch_size)]


datasets = [
    (WMTDataset.News2014, Language.English, Language.German),
    (WMTDataset.News2014, Language.English, Language.French),
    (WMTDataset.News2019, Language.English, Language.German),
]

models = [
    # English -> German models
    ModelCfg("ConvS2S", "1705.03122", Language.English, Language.German, 'conv.wmt14.en-de',
             description="trained on WMT14",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt'),
    ModelCfg("ConvS2S", "1705.03122", Language.English, Language.German, 'conv.wmt17.en-de',
             description="trained on WMT17",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt'),

    # ModelCfg(Language.English, Language.German, 'transformer.wmt16.en-de', checkpoint_file=?),
    ModelCfg("LightConv (without GLUs)", "1901.10430", Language.English, Language.German, 'lightconv.wmt16.en-de.noglu',
             description="trained on WMT16",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt'),
    ModelCfg("DynamicConv (without GLUs)", "1901.10430", Language.English, Language.German, 'dynamicconv.wmt16.en-de.noglu',
             description="trained on WMT16",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt'),
    ModelCfg("LightConv", "1901.10430", Language.English, Language.German, 'lightconv.wmt16.en-de',
             description="trained on WMT16",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt', lenpen=0.5),
    ModelCfg("DynamicConv", "1901.10430", Language.English, Language.German, 'dynamicconv.wmt16.en-de',
             description="trained on WMT16",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt', lenpen=0.5),
    ModelCfg("DynamicConv", "1901.10430", Language.English, Language.German, 'dynamicconv.wmt17.en-de',
             description="trained on WMT17",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt', lenpen=0.5),

    ModelCfg("Transformer Big", "1806.00187", Language.English, Language.German, 'transformer.wmt16.en-de',
             description="trained on WMT16",
             batch_size=128, beam=4, tokenizer='moses', bpe='fastbpe', lenpen=0.6),
    ModelCfg("Transformer Big + BT", "1808.09381", Language.English, Language.German, 'transformer.wmt18.en-de',
             description="trained on WMT18",
             batch_size=24, beam=5, tokenizer='moses', bpe='subword_nmt',
             checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt:wmt18.model6.pt'),
    ModelCfg("Facebook-FAIR (single)", "1907.06616", Language.English, Language.German, 'transformer.wmt19.en-de.single_model',
             description="trained on WMT19",
             batch_size=20, beam=50, tokenizer='moses', bpe='fastbpe'),

    ModelCfg("Facebook-FAIR (ensemble)", "1907.06616", Language.English, Language.German, 'transformer.wmt19.en-de',
             description="trained on WMT19",
             batch_size=4, beam=50, tokenizer='moses', bpe='fastbpe',
             checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt'),

    # English -> French models
    ModelCfg("ConvS2S", "1705.03122v3", Language.English, Language.French, 'conv.wmt14.en-fr',
             description="trained on WMT14",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt'),
    ModelCfg("Transformer Big", "1806.00187", Language.English, Language.French, 'transformer.wmt14.en-fr',
             description="trained on WMT14",
             batch_size=128, beam=4, tokenizer='moses', bpe='fastbpe', lenpen=0.6),
    ModelCfg("LightConv", "1901.10430", Language.English, Language.French, 'lightconv.wmt14.en-fr',
             description="trained on WMT14",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt', lenpen=0.9),
    ModelCfg("DynamicConv", "1901.10430", Language.English, Language.French, 'dynamicconv.wmt14.en-fr',
             description="trained on WMT14",
             batch_size=128, beam=5, tokenizer='moses', bpe='subword_nmt', lenpen=0.9),
]

for model_cfg in models:
    print("Evaluating model {} ({} -> {})".
          format(model_cfg.model_name, model_cfg.src_lang.name, model_cfg.dst_lang.name))
    model = model_cfg.load_model()
    for ds, src_lang, dst_lang in datasets:
        if src_lang == model_cfg.src_lang and dst_lang == model_cfg.dst_lang:
            evaluator = model_cfg.get_evaluator(model, ds)

            with tqdm(batchify(evaluator.metrics.source_segments.items(), model_cfg.batch_size)) as pbar:
                for batch in pbar:
                    sids, texts = zip(*batch)
                    answers = translate_batch(model, sids, texts)
                    evaluator.add(answers)
                    # stop early if the server already has cached results for this submission
                    if evaluator.cache_exists:
                        break

            evaluator.save()
            print(evaluator.results)
```
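The `tok4bleu` helper in the script wraps intra-word hyphens with fairseq's `##AT##-##AT##` marker before BLEU is computed. A minimal, standalone sketch of just the regex step (the `hyphen_retok` name is ours; the real helper also runs the model's Moses tokenizer first):

```python
import re

def hyphen_retok(tokenized):
    # wrap every hyphen that sits between two non-space characters
    return re.sub(r'(\S)-(\S)', r'\1 ##AT##-##AT## \2', tokenized)

print(hyphen_retok("a state-of-the-art system"))
# a state ##AT##-##AT## of ##AT##-##AT## the ##AT##-##AT## art system
```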
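The `batchify` helper sorts segments longest-first so that each batch groups sentences of similar length, which keeps padding (and wasted GPU work) low. A toy run on made-up segment IDs:

```python
def batchify(items, batch_size):
    # sort (sid, sentence) pairs by sentence length, longest first
    items = sorted(list(items), key=lambda x: len(x[1]), reverse=True)
    length = len(items)
    return [items[i * batch_size: (i + 1) * batch_size]
            for i in range((length + batch_size - 1) // batch_size)]

pairs = [('s1', 'a short one'), ('s2', 'x'), ('s3', 'a considerably longer sentence')]
batches = batchify(pairs, batch_size=2)
print([[sid for sid, _ in batch] for batch in batches])  # [['s3', 's1'], ['s2']]
```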
| COMMIT MESSAGE | COMMIT | RUN TIME |
|---|---|---|
| Use release version of sotabench | mkardas · 37ae3e0 · Oct 09 2019 | 0h:26m:28s, 0h:37m:34s, 0h:28m:18s |
| Add two models and fix target langauge | mkardas · e24c870 · Oct 08 2019 | 0h:15m:31s |
| Remove a model with broken link | mkardas · f49bc9a · Oct 08 2019 | 0h:48m:21s |
| Add more models | mkardas · 58fd84d · Oct 08 2019 | 1h:36m:14s |
| Add more models | mkardas · 0658e71 · Oct 08 2019 | 0h:51m:01s |
| Use all wmt18.en-de models for ensemble | mkardas · e1992e2 · Oct 08 2019 | 0h:12m:53s |
| Use dev. version of sotabench-api | mkardas · 7ce7093 12253ca · Oct 08 2019 | 0h:38m:35s |
| Add descriptions to models * close tqdm iterator when cache exi… | mkardas · e39c4d4 · Oct 07 2019 | 1h:42m:36s |
| Add a back-translation model to sotabench | mkardas · 31072fc · Oct 06 2019 | 1h:35m:23s |
| Add more models | mkardas · 6aa4fa6 · Oct 06 2019 | 0h:45m:06s, 0h:18m:22s |
| Decrease batch size for transformers | mkardas · 07a34c2 · Oct 06 2019 | 1h:54m:04s |
| Decrease batch size for transformers | mkardas · 16b127f · Oct 06 2019 | 0h:33m:23s |
| Decrease batch size for transformers | mkardas · abe212f · Oct 06 2019 | 0h:10m:43s |
| Decrease batch size for transformers | mkardas · 41dfd3c · Oct 06 2019 | 0h:10m:22s |
| Decrease batch size for transformers | mkardas · 8c7083e · Oct 06 2019 | 0h:19m:36s |
| Decrease batch size for transformers | mkardas · ac7f415 · Oct 06 2019 | 0h:09m:47s |
| Batch sentences for sotabench evaluation | mkardas · 065b955 · Oct 06 2019 | 0h:08m:57s |
| Increase beam | mkardas · 33828fe · Oct 05 2019 | 0h:10m:43s |
| Enable en-de ensemble | mkardas · 230b162 · Oct 05 2019 | 5h:18m:03s |
| Use release version of sotabench | mkardas · 62a81cf · Oct 04 2019 | 1h:48m:04s |
| Don't benchmark ensembles | mkardas · ed63323 · Oct 04 2019 | 0h:12m:15s, 1h:32m:31s, 0h:12m:25s, 1h:41m:40s |
| Add models and datasets to sotabench | mkardas · 7a3c8bd · Oct 04 2019 | 0h:33m:26s |
| Test ensemble of models | mkardas · 8899fd4 · Oct 04 2019 | 1h:16m:14s, 0h:11m:05s, 1h:09m:34s |
| | | 0h:04m:20s, 0h:21m:15s, 0h:04m:42s |