Hacker News

Yes, absolutely want to do comparisons to ELMo.

Perhaps even more interesting than a comparison would be modifying ULMFit to incorporate good ideas from the AllenNLP ELMo paper.

The learned weighting of representation layers seems like a decent candidate, as does giving the model the flexibility to use something other than a concatenated [mean / max / last state] representation of the final LSTM output layer (as is the case in some of ELMo's task models). I'm personally curious about using an attention mechanism in conjunction with something like ELMo's gamma task parameter (the learned scalar that scales the weighted combination of layer outputs), but I haven't been able to get that to work well in practice.
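For anyone following along, the two pieces mentioned above are pretty small in isolation. Here's a rough NumPy sketch (function names and shapes are mine, not from either paper's code) of ELMo-style learned layer weighting with the gamma scalar, next to the ULMFiT-style concat pooling it might replace or feed:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of layer scalars
    e = np.exp(x - np.max(x))
    return e / e.sum()

def scalar_mix(layers, s, gamma):
    """ELMo-style task representation: gamma * sum_j softmax(s)_j * h_j.

    layers: list of L arrays, each (seq_len, dim) -- one hidden state per LM layer
    s:      (L,) scalars, learned per task (fixed here for illustration)
    gamma:  scalar task parameter scaling the mixed representation
    """
    w = softmax(s)
    return gamma * sum(wj * h for wj, h in zip(w, layers))

def concat_pool(h):
    """ULMFiT-style pooling of a (seq_len, dim) layer: [mean, max, last] -> (3*dim,)."""
    return np.concatenate([h.mean(axis=0), h.max(axis=0), h[-1]])
```

With three layers of shape (seq_len, dim), `scalar_mix` returns (seq_len, dim) and `concat_pool` of that returns (3*dim,); the attention variant would replace the uniform mean in `concat_pool` with learned weights over positions.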

The dataset the ELMo model is trained on might also be preferable to WikiText-103 for practical English tasks, although you lose the nice multilingual benefits you get from working with WikiText-103.

In general, it seems like the approach described in the ELMo paper simply isn't designed to work at very low N, because the weights of the (often complex) task models used in ELMo's benchmarks are learned entirely from scratch. That's not possible without a decent amount of labeled training data.

Anyhow, I thought the paper was very well put together, and definitely an enjoyable read. Hope you and Sebastian collaborate on future papers, as good things certainly came of this one!



