  • NovEval: Assess scientific novelty in alignment with human evaluations.

  • Scientific Abstract Simplification: Rewrite complex scientific abstracts into lay summaries, making scientific knowledge accessible to everyone.

  • Blog-1K: The Blog-1K corpus is a redistributable English authorship identification benchmark. It has a roughly balanced sample distribution per author and fixed data splits (train/val/test), allowing for a fair comparison among deep learning-based models.

  • RAABT: The Reproducible Authorship Attribution Benchmark Tasks contain five tasks for attributing the authorship of contemporary non-fiction American English prose. They have fixed training/testing splits to prevent accuracy inflation caused by homogeneous corpora.

  • Cross-Register Authorship Attribution Corpus: This corpus contains writings from eight authors known to have written in both vernacular and classical Chinese. At 4.2 million Chinese characters, it is intended for authorship identification research.

Python package