@rkuo-danswer (Contributor) commented May 19, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1985/nltk-punkt-deprecated

related to nltk/nltk#3293

Log excerpt, oldest line first (every line is timestamped 2025-05-19 08:53:43.267):

  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
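
When debugging an image by hand, the lookup that fails in the log can be approximated with the standard library alone. This is a minimal sketch, not NLTK's own resolution logic: `find_resource_dir` is a hypothetical helper, and it only checks for a plain directory at `tokenizers/punkt_tab/english` under each search root, ignoring zip archives and any other cases NLTK handles.

```python
from pathlib import Path
from typing import Iterable, Optional

# Search roots copied from the log excerpt (duplicates removed).
SEARCH_ROOTS = [
    "/root/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
]


def find_resource_dir(
    roots: Iterable[str],
    relative: str = "tokenizers/punkt_tab/english",
) -> Optional[Path]:
    """Return the first existing <root>/<relative> directory, or None."""
    for root in roots:
        candidate = Path(root) / relative
        if candidate.is_dir():
            return candidate
    return None
```

Calling `find_resource_dir(SEARCH_ROOTS)` inside the container returns `None` when none of the listed roots holds the resource, which would be consistent with the error in the log.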

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented May 19, 2025

The latest updates on your projects:

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| internal-search | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | May 19, 2025 5:08pm |

@rkuo-danswer rkuo-danswer marked this pull request as ready for review May 19, 2025 17:05
@rkuo-danswer rkuo-danswer requested a review from a team as a code owner May 19, 2025 17:05
@greptile-apps bot (Contributor) left a comment

PR Summary

Updates the NLTK Punkt tokenizer download command in /backend/Dockerfile to use 'punkt_tab' instead of 'punkt' due to deprecation of the original tokenizer.

  • Changed RUN python -c "import nltk; nltk.download('punkt')" in /backend/Dockerfile to download punkt_tab instead, addressing nltk/nltk#3293
  • Keeps sentence tokenization working by installing the punkt_tab resource that NLTK tries to load (tokenizers/punkt_tab/english/ in the log above)
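
The PR installs the resource at image build time. As a complement, a startup check can download it only when it is missing. The following is a rough sketch under stated assumptions, not what this PR does: `ensure_resource` and `ensure_punkt_tab` are hypothetical helper names, and the sketch relies on `nltk.data.find` raising `LookupError` for a missing resource, as the error in the log suggests.

```python
from typing import Callable


def ensure_resource(
    resource_path: str,
    package: str,
    find: Callable[[str], object],
    download: Callable[[str], object],
) -> bool:
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True when a download was triggered, False when the resource
    was already present.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True


def ensure_punkt_tab() -> bool:
    """Wire the helper to NLTK (imported lazily so the helper is testable without it)."""
    import nltk

    return ensure_resource(
        "tokenizers/punkt_tab/english",  # path taken from the log excerpt
        "punkt_tab",
        nltk.data.find,
        nltk.download,
    )
```

Passing `find` and `download` as parameters keeps the decision logic independent of NLTK and of network access, so it can be exercised with simple fakes. Baking the download into the Dockerfile, as the PR does, remains the more reproducible option for containers without outbound network access.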

1 file(s) reviewed, no comment(s)

@rkuo-danswer rkuo-danswer added this pull request to the merge queue May 19, 2025
Merged via the queue into main with commit b108503 May 19, 2025
13 checks passed
@rkuo-danswer rkuo-danswer deleted the bugfix/nltk-punkt-tab branch May 19, 2025 21:25
ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 27, 2025
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request May 27, 2025
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
ZhipengHe pushed a commit to ZhipengHe/onyx that referenced this pull request Jun 6, 2025
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>