Dynamic Script Chunking for interviews and presentations #6800

TomLucidor · 2024-01-22T05:15:16Z


Currently the DocumentSplitter only splits text by sentences, passages (the paragraph equivalent), and pages, and then groups them into fixed-size chunks with overlaps. Sometimes, however, related information gets pushed into two separate chunks. This might not work well for oral content.
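To make the problem concrete, here is a minimal sketch of fixed-size windowing with overlap. It is not Haystack's code; `window_chunks` is a hypothetical helper, and the parameter names `split_length` and `split_overlap` are only assumed to loosely mirror the splitter's settings. Sentences are assumed to be pre-split already.

```python
def window_chunks(units, split_length=3, split_overlap=1):
    """Group units (e.g. sentences) into fixed-size windows that share
    `split_overlap` units with the previous window."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        # Stop once the current window already reaches the end of the input.
        if start + split_length >= len(units):
            break
    return chunks


# Hypothetical interview snippet: the question and its full answer
# span four sentences, one more than fits in a single window.
script = [
    "Q: How did the project start?",
    "A: It began as a weekend experiment.",
    "A: Then a colleague joined.",
    "A: Within a year we had funding.",
    "Q: What came next?",
]
for chunk in window_chunks(script, split_length=3, split_overlap=1):
    print(chunk)
```

The window boundaries depend only on position, so the end of the first answer lands in a chunk without its question. That is the failure mode described above.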

Reference to software https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_splitter.py

For context: if the "document" in question is an interview script with no well-defined paragraphs/passages or even pages, only a long run of sentences along a timeline, how can the script be split without losing the surrounding context?

Idea borrowed from https://github.com/nicktill/YTRecap/blob/main/src/app.py
Side note: Video chatbots can be made https://github.com/Anil-matcha/Youtube-to-chatbot

TomLucidor · 2024-01-22T08:36:45Z


Also spotted this for reference: https://www.assemblyai.com/blog/text-segmentation-approaches-datasets-and-evaluation-metrics/
Polovinkin's cosine similarity and cut-off points method could come in handy as well: https://archive.ph/dL7wa https://github.com/poloniki/quint/blob/master/notebooks/Chunking%20text%20into%20paragraphs.ipynb
Note on a counter-problem: classical tools like spaCy and NLTK can produce chunks that are too long: https://archive.ph/5XgJI
