[FLINK-35652][doc] Add document for lookup custom shuffle

yunfengzhou-hub · web-flow · commit 3cfc211e152a · 2025-04-18T11:31:28.000+08:00
diff --git a/docs/content.zh/docs/dev/table/sql/queries/hints.md b/docs/content.zh/docs/dev/table/sql/queries/hints.md
@@ -359,6 +359,14 @@ LOOKUP 联接提示允许用户建议 Flink 优化器:
 	<td>N/A</td>
 	<td>固定延迟策略的最大重试次数</td>
 </tr>
+<tr>
+	<td>shuffle</td>
+	<td>shuffle</td>
+	<td>N</td>
+	<td>boolean</td>
+	<td>false</td>
+	<td>是否开启自定义数据分发功能。此功能允许 Lookup Source 自行决定数据分布方式并依此对数据查询逻辑做相应优化</td>
+</tr>
 </tbody>
 </table>
 
@@ -445,6 +453,19 @@ LOOKUP('table'='Customers', 'async'='false', 'retry-predicate'='lookup_miss', 'r
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
+#### 4. 启用自定义数据分布
+
+在默认情况下，Lookup Join 的输入流数据分布是随机的，因此数据源可能无法有效利用缓存来加速查找。用户可以通过如下方式
+启用自定义数据分发，使数据源能够自行决定输入数据的分布，并利用这一先验知识来优化其缓存和查找策略。
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+为了充分利用这个优化，目标 Lookup Source 应该提供对自定义数据分发能力的支持。连接器开发人员可以通过让
+`LookupTableSource` 子类实现 `SupportsLookupCustomShuffle` 接口来支持这种能力。即使 Source 尚未提供这种能力，
+用户依然可以选择先启用这个功能，此时 Flink 将会尝试应用哈希分区的优化方式以尽可能带来性能提升。
+
 #### 进一步说明
 
 #### 开启缓存对重试的影响
diff --git a/docs/content/docs/dev/table/sql/queries/hints.md b/docs/content/docs/dev/table/sql/queries/hints.md
@@ -369,6 +369,14 @@ The LOOKUP hint allows users to suggest the Flink optimizer to:
 	<td>N/A</td>
 	<td>max attempt number of the 'fixed_delay' strategy</td>
 </tr>
+<tr>
+	<td>shuffle</td>
+	<td>shuffle</td>
+	<td>N</td>
+	<td>boolean</td>
+	<td>false</td>
+	<td>whether to enable custom lookup shuffle, which allows the lookup source to decide the input data distribution and optimize its lookup strategy accordingly</td>
+</tr>
 </tbody>
 </table>
 
@@ -464,6 +472,23 @@ If the lookup source only has one capability, then the 'async' mode option can b
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
+#### 4. Enable Custom Data Distribution
+
+By default, the input stream of a lookup join is distributed arbitrarily, so lookup sources may not
+be able to use their caches effectively to accelerate lookups. Enabling custom shuffle as follows
+lets a source decide how the input data is distributed and use this prior knowledge to optimize
+its caching and lookup strategy.
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+To make full use of this feature, the target lookup source should support custom shuffle.
+Connector developers can add this support by having their `LookupTableSource` implementation
+also implement `SupportsLookupCustomShuffle`. Even if the source does not provide such support yet,
+users can still enable this feature; Flink will then make a best effort to apply hash partitioning,
+which should also improve performance.
+
 #### Further Notes
 
 #### Effect Of Enabling Caching On Retries
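
Note on the new `shuffle` option: the added section shows only the hint fragment. In a full statement the fragment goes inside a hint comment in the `SELECT` clause of a lookup join. The following is a minimal sketch, assuming a hypothetical `Orders` stream with a processing-time attribute `proc_time` and a `customer_id` column, and a `Customers` lookup table keyed by `id`. Only the table name `Customers` and the hint options come from this change; every other name is illustrative.

```sql
-- Sketch only: Orders, proc_time, order_id, total, customer_id, country and id are assumed names.
-- The hint asks the planner to enable custom shuffle for the Customers lookup source.
SELECT /*+ LOOKUP('table'='Customers', 'shuffle'='true') */
    o.order_id,
    o.total,
    Customers.country
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time
    ON o.customer_id = Customers.id;
```

The lookup table is left without an alias here so that the `'table'` option matches the name used in the query.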