Here are the two research papers I’ve had the privilege of co-authoring:

Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus (2021)

pdf

OSCAR originally had a Go pipeline that extracted raw text for a plethora of languages from the web. This paper outlines an improved pipeline architecture, adds metadata in a backward-compatible format, and documents a new dataset built from a more recent CommonCrawl dump.

Towards a cleaner document-oriented multilingual crawled corpus (2022)

pdf

This one builds on the previous paper, adding more metadata and keeping documents intact rather than splitting them by line. IMHO it's a better paper overall, since I had more time to analyze the generated dataset.