Skip to content

Several improvements to token filter types #4291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 29, 2025
Merged

Several improvements to token filter types #4291

merged 8 commits into from
Apr 29, 2025

Conversation

JoshMock
Copy link
Member

@JoshMock JoshMock commented Apr 17, 2025

An issue on the JS repo was opened about a missing cjk_bigram token filter type. I did an audit of the token filter docs and found a few missing types, so I:

  • Added the missing types
  • Added some missing values to enums
  • Moved a couple types into their relevant plugin files
  • Created a few base types for token filters that share most of the same parameters
  • Added docstrings to as many parameters as I could find
  • Removed a few parameters not mentioned in the docs

Fixes #4251
Fixes elastic/elasticsearch-js#2726

@JoshMock JoshMock enabled auto-merge (squash) April 17, 2025 18:42
@JoshMock JoshMock requested review from pquentin, l-trotta and a team April 17, 2025 18:43
@l-trotta
Copy link
Contributor

l-trotta commented Apr 28, 2025

@JoshMock I'm having some doubts about this one, I think while it's good to have the full list of stopwords, some of the specific tokenizers only accept part of it? For example, this is the code for the JapaneseStopTokenFilter, where stopwords seems to be a list of strings defined in Lucene

@JoshMock
Copy link
Member Author

@l-trotta Good catch. I updated the stopwords types in 5fc7dbe.

Copy link
Member

@flobernd flobernd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JoshMock 🙂 Just a minor comment.

@@ -56,7 +56,7 @@ export class FingerprintAnalyzer {
*
* @server_default _none_
*/
stopwords?: StopWords
stopwords?: StopWordLanguage | string[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause inconveniences for users in statically typed languages.

Could we please continue using the StopWords type here?

StopWords is currently defined as:

export type StopWords = string | string[]

Changing that to:

export type StopWords = StopWordLanguage | string[]

while keeping this file "as is", should do the trick.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't comment on this because the java client simplifies enum | string to string, so I assumed other static clients would have something similar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For .NET I don’t do this since enums are way nicer to use.

In this case it’s as well string[]. Do you simplify that as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it also works as enum | string[] -> string[]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flobernd I think I got what you mean. Made a change in db8c130. Let me know if that looks better to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect now! Thank you @JoshMock

@JoshMock JoshMock merged commit f5dca08 into main Apr 29, 2025
8 checks passed
@JoshMock JoshMock deleted the token-filters branch April 29, 2025 15:40
github-actions bot pushed a commit that referenced this pull request Apr 29, 2025
github-actions bot pushed a commit that referenced this pull request Apr 29, 2025
github-actions bot pushed a commit that referenced this pull request Apr 29, 2025
JoshMock added a commit that referenced this pull request Apr 30, 2025
(cherry picked from commit f5dca08)

Co-authored-by: Josh Mock <joshua.mock@elastic.co>
JoshMock added a commit that referenced this pull request Apr 30, 2025
(cherry picked from commit f5dca08)

Co-authored-by: Josh Mock <joshua.mock@elastic.co>
Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
JoshMock added a commit that referenced this pull request Apr 30, 2025
(cherry picked from commit f5dca08)

Co-authored-by: Josh Mock <joshua.mock@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Several missing token filter types TypescriptError: Missing filter type cjk_bigram
3 participants