Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Abstract

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SAT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted.


Citation

Markus Frohmann, Igor Sterner, Ivan Vulic, Benjamin Minixhofer, Markus Schedl
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 11908–11941, doi:10.18653/v1/2024.emnlp-main.665, 2024.

BibTeX

@inproceedings{MarkusFrohmann2024acl_segment,
    title = {Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation},
    author = {Markus Frohmann and Igor Sterner and Ivan Vulic and Benjamin Minixhofer and Markus Schedl},
    booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    editor = {Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen},
    publisher = {Association for Computational Linguistics},
    address = {Miami, Florida, USA},
    doi = {10.18653/v1/2024.emnlp-main.665},
    url = {https://aclanthology.org/2024.emnlp-main/},
    pages = {11908--11941},
    month = {November},
    year = {2024}
}