Text Splitting

beginner
Techniques
Last updated: 2025-01-15

What is Text Splitting?


Text splitting is the process of dividing documents or long text passages into manageable segments for processing, embedding, and storage in retrieval systems. This segmentation is necessary because embedding models have input length limits, and working with smaller chunks often produces better retrieval results than processing entire documents as single units. Text splitting strategies balance preserving semantic coherence within chunks against maintaining manageable sizes.


Several splitting strategies exist, each with different tradeoffs. Character-based splitting divides text at fixed character or token counts; it is simple to implement but may cut mid-sentence or mid-thought. Sentence-based splitting respects sentence boundaries, so each chunk is a more coherent semantic unit. Paragraph-based splitting preserves document structure. Recursive splitting adapts to document structure by working through a hierarchy of separators: it first splits on major boundaries like "\n\n", then falls back to sentences, and finally to fixed character counts if a piece is still too long. Some advanced approaches use semantic analysis to identify natural topic boundaries.
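The recursive strategy can be sketched in a few lines of Python. This is a minimal illustration, not any framework's implementation: the function name recursive_split, the max_chars parameter, and the separator list are assumptions chosen for the example. It also drops the separators it splits on and does not merge small neighboring pieces back together, which a production splitter would typically do.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split text into pieces of at most max_chars characters.

    Tries the coarsest separator first and only falls back to finer
    ones (and finally to fixed character windows) for pieces that are
    still too long. Hypothetical helper for illustration.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                # Oversized parts are split again with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
            return chunks

    # No separator applies: fall back to fixed character windows.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


doc = "First paragraph.\n\nSecond paragraph is longer. It has two sentences."
pieces = recursive_split(doc, max_chars=30)
```

Here the first paragraph fits within the limit and is kept whole, while the second is too long and is split again at the sentence boundary.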


Text splitting significantly impacts RAG system performance. Chunks that are too small lack sufficient context and may not contain complete thoughts. Chunks that are too large include irrelevant information that dilutes relevance scores and wastes context window space. The optimal splitting strategy depends on the document type (technical docs vs. narratives), the embedding model's capabilities, the expected query patterns, and how the retrieved chunks will be used. Most frameworks provide configurable text splitters with options for chunk size, overlap between adjacent chunks, and splitting logic, so the behavior can be tuned for specific use cases.
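To make the chunk size and overlap settings concrete, here is a small sketch of fixed-size splitting with overlap, measured in characters. The function split_with_overlap and its parameter names are hypothetical, and real frameworks often count tokens rather than characters.

```python
def split_with_overlap(text, chunk_size=100, overlap=20):
    """Cut text into windows of chunk_size characters, where each window
    repeats the last `overlap` characters of the previous one.
    Hypothetical helper for illustration."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


windows = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 characters after the previous one, so the last character of one chunk is repeated as the first character of the next. Larger overlaps reduce the chance that a thought is cut at a chunk boundary, at the cost of storing and embedding more duplicated text.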


Related Terms