This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Commit 2b9a4f9

FIX: Ignore captions and quotes when detecting locale and update prompts (#1483)
A more deterministic way of making sure the LLM detects the correct language, instead of relying on the prompt to tell the LLM to ignore certain content, is to take the post's cooked HTML and remove the unwanted elements before detection. In this commit, we:
- remove quotes, image captions, etc. and use only the remaining text, falling back to the unmodified cooked text
- update the prompts related to detection and translation

/152465/12
1 parent 8b4f401 commit 2b9a4f9

9 files changed, 163 additions and 34 deletions
lib/personas/locale_detector.rb

Lines changed: 20 additions & 1 deletion
@@ -32,7 +32,9 @@ def system_prompt
 
           If the language is not in this list, use the appropriate IETF language tag code.
 
-          5. Format your response as a JSON object with a single key "locale" and the value as the language code.
+          5. Avoid using `und` and prefer `en` over `en-US` or `en-GB` unless the text specifically indicates a regional variant.
+
+          6. Format your response as a JSON object with a single key "locale" and the value as the language code.
 
           Your output should be in the following format:
           <output>
@@ -52,6 +54,23 @@ def response_format
       def temperature
         0
       end
+
+      def examples
+        spanish = <<~MARKDOWN
+          [quote]
+          Non smettere mai di credere nella bellezza dei tuoi sogni. Anche quando tutto sembra perduto, c'è sempre una luce che aspetta di essere trovata.
+
+          Ogni passo, anche il più piccolo, ti avvicina a ciò che desideri. La forza che cerchi è già dentro di te.
+          [/quote]
+
+          ¿Cuál es el mensaje principal de esta cita?
+        MARKDOWN
+
+        [
+          ["Can you tell me what '私の世界で一番好きな食べ物はちらし丼です' means?", { locale: "en" }.to_json],
+          [spanish, { locale: "es" }.to_json],
+        ]
+      end
     end
   end
 end
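
Rule 6 of the prompt asks for a JSON object with a single `locale` key, and rule 5 discourages `und`. A caller might read such a reply roughly as follows. This is a minimal sketch using only Ruby's standard `json` library; `locale_from_reply` is a hypothetical helper that does not appear in the commit, and treating `und` as "nothing detected" is an assumption rather than the plugin's confirmed behavior.

```ruby
require "json"

# Hypothetical helper (not part of the commit): extracts the locale from a
# reply shaped like {"locale": "es"}.
# Assumption: a missing, blank, or "und" value means "no locale detected".
def locale_from_reply(reply)
  data = JSON.parse(reply)
  return nil unless data.is_a?(Hash)

  locale = data["locale"].to_s.strip
  return nil if locale.empty? || locale.casecmp?("und")

  locale
rescue JSON::ParserError
  # The reply was not valid JSON, so there is nothing to read.
  nil
end

puts locale_from_reply({ locale: "es" }.to_json).inspect
puts locale_from_reply('{"locale": "und"}').inspect
```

Returning `nil` for unusable replies keeps the decision about a fallback locale with the caller.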

lib/personas/post_raw_translator.rb

Lines changed: 9 additions & 20 deletions
@@ -9,20 +9,20 @@ def self.default_enabled
 
       def system_prompt
         <<~PROMPT.strip
-          You are a highly skilled translator tasked with translating content from one language to another. Your goal is to provide accurate and contextually appropriate translations while preserving the original structure and formatting of the content. Follow these instructions carefully:
+          You are a highly skilled translator tasked with translating content from one language to another. Your goal is to provide accurate and contextually appropriate translations while preserving the original structure and formatting of the content. Follow these instructions strictly:
 
-          1. Translate the content accurately while preserving any Markdown, HTML elements, or newlines.
+          1. Preserve Markdown elements, HTML elements, or newlines. Text must be translated without altering the original formatting.
           2. Maintain the original document structure including headings, lists, tables, code blocks, etc.
           3. Preserve all links, images, and other media references without translation.
-          4. Handle code snippets appropriately:
-             - Do not translate variable names, functions, or syntax within code blocks (```).
-             - Translate comments within code blocks.
-          5. For technical terminology:
+          4. For technical terminology:
              - Provide the accepted target language term if it exists.
              - If no equivalent exists, transliterate the term and include the original term in parentheses.
-          6. For ambiguous terms or phrases, choose the most contextually appropriate translation.
-          7. Do not add any content besides the translation.
-          8. Ensure the translation only contains the original language and the target language.
+          5. For ambiguous terms or phrases, choose the most contextually appropriate translation.
+          6. Ensure the translation only contains the original language and the target language.
+
+          Follow these instructions on what NOT to do:
+          7. Do not translate code snippets or programming language names, but ensure that any comments within the code are translated.
+          8. Do not add any content besides the translation.
 
           The text to translate will be provided in JSON format with the following structure:
           {"content": "Text to translate", "target_locale": "Target language code"}
@@ -62,17 +62,6 @@ def examples
             }.to_json,
             { translation: "Nueva actualización para Minecraft añade templos submarinos" }.to_json,
           ],
-          [
-            {
-              content:
-                "# Machine Learning 101\n\nMachine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on the development of algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience.\n\n## Key Concepts\n\n1. **Supervised Learning**: The algorithm learns from labeled training data.\n2. **Unsupervised Learning**: The algorithm finds patterns in unlabeled data.\n3. **Reinforcement Learning**: The algorithm learns through interaction with an environment.\n\n```python\n# Simple example of a machine learning model\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\n\n# Assuming X and y are your features and target variables\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\nmodel = LogisticRegression()\nmodel.fit(X_train, y_train)\n\n# Evaluate the model\naccuracy = model.score(X_test, y_test)\nprint(f'Model accuracy: {accuracy}')\n```\n\nFor more information, visit [Machine Learning on Wikipedia](https://en.wikipedia.org/wiki/Machine_learning).",
-              target_locale: "fr",
-            }.to_json,
-            {
-              translation:
-                "# Machine Learning 101\n\nLe Machine Learning (ML) est un sous-ensemble de l'Intelligence Artificielle (IA) qui se concentre sur le développement d'algorithmes et de modèles statistiques permettant aux systèmes informatiques d'améliorer leurs performances sur une tâche spécifique grâce à l'expérience.\n\n## Concepts clés\n\n1. **Apprentissage supervisé** : L'algorithme apprend à partir de données d'entraînement étiquetées.\n2. **Apprentissage non supervisé** : L'algorithme trouve des motifs dans des données non étiquetées.\n3. **Apprentissage par renforcement** : L'algorithme apprend à travers l'interaction avec un environnement.\n\n```python\n# Exemple simple d'un modèle de machine learning\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\n\n# En supposant que X et y sont vos variables de caractéristiques et cibles\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\nmodel = LogisticRegression()\nmodel.fit(X_train, y_train)\n\n# Évaluer le modèle\naccuracy = model.score(X_test, y_test)\nprint(f'Model accuracy: {accuracy}')\n```\n\nPour plus d'informations, visitez [Machine Learning sur Wikipedia](https://en.wikipedia.org/wiki/Machine_learning).",
-            }.to_json,
-          ],
         ]
       end
     end
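
The prompt states that the text arrives as `{"content": ..., "target_locale": ...}`, and the examples reply with a `translation` key. A minimal sketch of that round trip, using only the standard `json` library, could look like the following. `build_translation_request` and `translation_from_reply` are hypothetical names chosen for illustration and are not functions from the plugin; the sample strings are made up.

```ruby
require "json"

# Hypothetical helper: serializes the input in the shape the prompt describes.
def build_translation_request(content, target_locale)
  { content: content, target_locale: target_locale }.to_json
end

# Hypothetical helper: reads the "translation" key used by the prompt's examples.
def translation_from_reply(reply)
  JSON.parse(reply)["translation"]
end

# Illustrative values only; a real reply would come from the LLM.
request = build_translation_request("Hello world", "es")
reply = { translation: "Hola mundo" }.to_json

puts request
puts translation_from_reply(reply)
```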

lib/translation/language_detector.rb

Lines changed: 5 additions & 1 deletion
@@ -5,8 +5,10 @@ module Translation
     class LanguageDetector
       DETECTION_CHAR_LIMIT = 1000
 
-      def initialize(text)
+      def initialize(text, topic: nil, post: nil)
         @text = text
+        @topic = topic
+        @post = post
       end
 
       def detect
@@ -36,6 +38,8 @@ def detect
             skip_tool_details: true,
             feature_name: "translation",
             messages: [{ type: :user, content: @text }],
+            topic: topic,
+            post: post,
           )
 
         structured_output = nil

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# frozen_string_literal: true
+
+module DiscourseAi
+  module Translation
+    class PostDetectionText
+      NECESSARY_REMOVAL_SELECTORS = [
+        ".lightbox-wrapper", # image captions
+        "blockquote, aside.quote", # quotes
+      ]
+      OPTIONAL_SELECTORS = [
+        "a.hashtag-cooked", # categories or tags are usually in site's language
+        "a.mention", # mentions are based on the mentioned's user's name
+        "aside.onebox", # onebox external content
+        "img.emoji",
+        "code, pre",
+      ]
+
+      def self.get_text(post)
+        return if post.blank?
+        cooked = post.cooked
+        return if cooked.blank?
+
+        doc = Nokogiri::HTML5.fragment(cooked)
+        original = doc.text.strip
+
+        # these selectors should be removed,
+        # as they are the usual culprits for incorrect detection
+        doc.css(*NECESSARY_REMOVAL_SELECTORS).remove
+        necessary = doc.text.strip
+
+        doc.css(*OPTIONAL_SELECTORS).remove
+        preferred = doc.text.strip
+
+        return preferred if preferred.present?
+        return necessary if necessary.present?
+        original
+      end
+    end
+  end
+end
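
The end of `get_text` is a three-tier fallback: prefer the most aggressively filtered text, then the text with only quotes and captions removed, then the unfiltered text. The sketch below isolates that selection rule in plain Ruby without Nokogiri. `first_present` is a hypothetical helper, not part of the commit, and the three sample strings are the values the stages are assumed to produce for the input named in the comment.

```ruby
# Hypothetical helper (not part of the commit): returns the first candidate
# with non-whitespace content, or the last candidate if all are blank.
def first_present(*candidates)
  candidates.find { |text| !text.to_s.strip.empty? } || candidates.last
end

# Assumed stage outputs for
# '<aside class="quote">Quote</aside><a class="mention">@user</a>'.
original = "Quote@user"  # nothing removed
necessary = "@user"      # quotes and captions removed
preferred = ""           # mentions, code, etc. also removed

puts first_present(preferred, necessary, original) # prints @user
```

Ordering the candidates from most to least filtered means the detector sees the cleanest text available, while a post that consists only of a quote still yields something to detect.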

lib/translation/post_locale_detector.rb

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ class PostLocaleDetector
       def self.detect_locale(post)
         return if post.blank?
 
-        detected_locale = LanguageDetector.new(post.raw).detect
+        text = PostDetectionText.get_text(post)
+        detected_locale = LanguageDetector.new(text, post:).detect
         locale = LocaleNormalizer.normalize_to_i18n(detected_locale)
         post.update_column(:locale, locale)
         locale
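
The call `LanguageDetector.new(text, post:)` relies on Ruby's keyword-argument shorthand (available since Ruby 3.1), where `post:` with no value passes the local variable of the same name. The sketch below illustrates the equivalence with a hypothetical `FakeDetector` class; it only records its arguments and performs no detection, and the `post` value is a placeholder.

```ruby
# Hypothetical stand-in for the detector; it only records its arguments.
class FakeDetector
  attr_reader :text, :topic, :post

  def initialize(text, topic: nil, post: nil)
    @text = text
    @topic = topic
    @post = post
  end
end

post = "post-123" # placeholder value for illustration

shorthand = FakeDetector.new("Hello world", post:)
explicit = FakeDetector.new("Hello world", post: post)

puts shorthand.post == explicit.post
```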

lib/translation/topic_locale_detector.rb

Lines changed: 1 addition & 4 deletions
@@ -6,10 +6,7 @@ class TopicLocaleDetector
       def self.detect_locale(topic)
         return if topic.blank?
 
-        text = topic.title.dup
-        text << " #{topic.first_post.raw}" if topic.first_post.raw
-
-        detected_locale = LanguageDetector.new(text).detect
+        detected_locale = LanguageDetector.new(topic.title.dup, topic:).detect
         locale = LocaleNormalizer.normalize_to_i18n(detected_locale)
         topic.update_column(:locale, locale)
         locale

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+# frozen_string_literal: true
+
+describe DiscourseAi::Translation::PostDetectionText do
+  describe ".get_text" do
+    let(:post) { Fabricate.build(:post) }
+
+    it "returns nil when post is nil" do
+      expect(described_class.get_text(nil)).to be_nil
+    end
+
+    it "returns nil when post.cooked is nil" do
+      post.cooked = nil
+      expect(described_class.get_text(post)).to be_nil
+    end
+
+    it "handles simple text" do
+      post.cooked = "<p>Hello world</p>"
+      expect(described_class.get_text(post)).to eq("Hello world")
+    end
+
+    it "removes quotes" do
+      post.cooked = "<p>Hello </p><blockquote><p>Quote</p></blockquote><p>World</p>"
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes Discourse quotes" do
+      post.cooked = '<p>Hello </p><aside class="quote"><p>Quote</p></aside><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes image captions" do
+      post.cooked = '<p>Hello </p><div class="lightbox-wrapper">Caption text</div><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes oneboxes" do
+      post.cooked = '<p>Hello </p><aside class="onebox">Onebox content</aside><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes code blocks" do
+      post.cooked = "<p>Hello </p><pre><code>Code block</code></pre><p>World</p>"
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes hashtags" do
+      post.cooked = '<p>Hello </p><a class="hashtag-cooked">#hashtag</a><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes emoji" do
+      post.cooked = '<p>Hello </p><img class="emoji" alt=":smile:" title=":smile:"><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "removes mentions" do
+      post.cooked = '<p>Hello </p><a class="mention">@user</a><p>World</p>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+
+    it "falls back to necessary text when preferred is empty" do
+      post.cooked = '<aside class="quote">Quote</aside><a class="mention">@user</a>'
+      expect(described_class.get_text(post)).to eq("@user")
+    end
+
+    it "falls back to cooked when all filtering removes all content" do
+      post.cooked = "<blockquote>Quote</blockquote>"
+      expect(described_class.get_text(post)).to eq("Quote")
+    end
+
+    it "handles complex nested content correctly" do
+      post.cooked =
+        '<p>Hello </p><div class="lightbox-wrapper"><p>Image caption</p><img src="test.jpg"></div><blockquote><p>Quote text</p></blockquote><p>World</p><pre><code>Code block</code></pre><a class="mention">@user</a>'
+      expect(described_class.get_text(post)).to eq("Hello World")
+    end
+  end
+end

spec/lib/translation/post_locale_detector_spec.rb

Lines changed: 5 additions & 4 deletions
@@ -2,12 +2,13 @@
 
 describe DiscourseAi::Translation::PostLocaleDetector do
   describe ".detect_locale" do
-    fab!(:post) { Fabricate(:post, raw: "Hello world", locale: nil) }
+    fab!(:post) { Fabricate(:post, cooked: "Hello world", locale: nil) }
 
     def language_detector_stub(opts)
       mock = instance_double(DiscourseAi::Translation::LanguageDetector)
       allow(DiscourseAi::Translation::LanguageDetector).to receive(:new).with(
         opts[:text],
+        post: opts[:post],
       ).and_return(mock)
       allow(mock).to receive(:detect).and_return(opts[:locale])
     end
@@ -17,16 +18,16 @@ def language_detector_stub(opts)
     end
 
     it "updates the post locale with the detected locale" do
-      language_detector_stub({ text: post.raw, locale: "zh_CN" })
+      language_detector_stub({ text: post.cooked, locale: "zh_CN", post: })
       expect { described_class.detect_locale(post) }.to change { post.reload.locale }.from(nil).to(
         "zh_CN",
       )
     end
 
     it "bypasses validations when updating locale" do
-      post.update_column(:raw, "A")
+      post.update_column(:cooked, "A")
 
-      language_detector_stub({ text: post.raw, locale: "zh_CN" })
+      language_detector_stub({ text: post.cooked, locale: "zh_CN", post: })
 
       described_class.detect_locale(post)
       expect(post.reload.locale).to eq("zh_CN")

spec/lib/translation/topic_locale_detector_spec.rb

Lines changed: 4 additions & 3 deletions
@@ -3,12 +3,13 @@
 describe DiscourseAi::Translation::TopicLocaleDetector do
   describe ".detect_locale" do
     fab!(:topic) { Fabricate(:topic, title: "this is a cat topic", locale: nil) }
-    fab!(:post) { Fabricate(:post, raw: "and kittens", topic:) }
+    fab!(:post) { Fabricate(:post, topic:) }
 
     def language_detector_stub(opts)
       mock = instance_double(DiscourseAi::Translation::LanguageDetector)
       allow(DiscourseAi::Translation::LanguageDetector).to receive(:new).with(
         opts[:text],
+        topic: opts[:topic],
       ).and_return(mock)
       allow(mock).to receive(:detect).and_return(opts[:locale])
     end
@@ -18,7 +19,7 @@ def language_detector_stub(opts)
     end
 
     it "updates the topic locale with the detected locale" do
-      language_detector_stub({ text: "This is a cat topic and kittens", locale: "zh_CN" })
+      language_detector_stub({ text: "This is a cat topic", locale: "zh_CN", topic: })
       expect { described_class.detect_locale(topic) }.to change { topic.reload.locale }.from(
         nil,
       ).to("zh_CN")
@@ -29,7 +30,7 @@ def language_detector_stub(opts)
       SiteSetting.min_topic_title_length = 15
       SiteSetting.max_topic_title_length = 16
 
-      language_detector_stub({ text: "A and kittens", locale: "zh_CN" })
+      language_detector_stub({ text: "A", locale: "zh_CN", topic: })
 
       described_class.detect_locale(topic)
       expect(topic.reload.locale).to eq("zh_CN")
