Commit 3cfc211

[FLINK-35652][doc] Add document for lookup custom shuffle
1 parent de280fa commit 3cfc211

2 files changed, with 46 additions and 0 deletions

docs/content.zh/docs/dev/table/sql/queries/hints.md

+21
@@ -359,6 +359,14 @@ The LOOKUP join hint allows users to suggest that the Flink optimizer:
 <td>N/A</td>
 <td>maximum number of retry attempts for the fixed-delay strategy</td>
 </tr>
+<tr>
+<td>shuffle</td>
+<td>shuffle</td>
+<td>N</td>
+<td>boolean</td>
+<td>false</td>
+<td>whether to enable custom data distribution. This feature allows the lookup source to decide how data is distributed and to optimize its lookup logic accordingly</td>
+</tr>
 </tbody>
 </table>

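@@ -445,6 +453,19 @@ LOOKUP('table'='Customers', 'async'='false', 'retry-predicate'='lookup_miss', 'r
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
+#### 4. Enable Custom Data Distribution
+
+By default, the input stream of a Lookup Join is distributed randomly, so the source may not be able to use its cache effectively to speed up lookups. Users can enable
+custom data distribution as follows, so that the source can decide the distribution of the input data itself and use this prior knowledge to optimize its cache and lookup strategy.
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+To take full advantage of this optimization, the target Lookup Source should support custom data distribution. Connector developers can provide this by having the
+LookupTableSource subclass implement the SupportsLookupCustomShuffle interface. Even if the source does not yet provide this capability, users
+can still enable the feature first; Flink will then try to apply hash partitioning to improve performance where possible.
+
 #### Further Notes
 
 #### Effect of Enabling Caching on Retries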

docs/content/docs/dev/table/sql/queries/hints.md

+25
@@ -369,6 +369,14 @@ The LOOKUP hint allows users to suggest the Flink optimizer to:
 <td>N/A</td>
 <td>max attempt number of the 'fixed_delay' strategy</td>
 </tr>
+<tr>
+<td>shuffle</td>
+<td>shuffle</td>
+<td>N</td>
+<td>boolean</td>
+<td>false</td>
+<td>whether to enable custom lookup shuffle, which allows the lookup source to decide the input data distribution and to optimize its lookup strategy accordingly</td>
+</tr>
 </tbody>
 </table>

@@ -464,6 +472,23 @@ If the lookup source only has one capability, then the 'async' mode option can b
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
+#### 4. Enable Custom Data Distribution
+
+By default, the data distribution of Lookup Join's input stream is arbitrary, so sources may not
+make effective use of caches to accelerate lookups. When custom shuffle is enabled as follows, the
+sources can decide the distribution of the input data on their own and use this prior
+knowledge to optimize their caches and lookup strategy.
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+To make full use of this feature, the target lookup source should support custom
+shuffle. Connector developers can achieve this by having the `LookupTableSource` subclass
+implement `SupportsLookupCustomShuffle`. Even if the source does not provide such support yet, users
+can still enable this feature, and Flink will then make a best-effort attempt to apply hash partitioning,
+which should also improve performance.
+
 #### Further Notes
 
 #### Effect Of Enabling Caching On Retries
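
Editor's note: the added documentation says that a source may decide the input distribution itself and that Flink otherwise tries to fall back to hash partitioning. The following is a minimal, self-contained Java sketch of that idea only. It does not use Flink's classes: `KeyPartitioner`, `HASH`, and `choose` are hypothetical names introduced for illustration, and the real `SupportsLookupCustomShuffle` interface may look different. The point it shows is that hashing the join key sends every row with the same key to the same lookup subtask, which is what lets a per-subtask cache be reused.

```java
import java.util.Arrays;
import java.util.Optional;

class LookupShuffleSketch {

    /** Hypothetical stand-in for a source-provided partitioner; NOT Flink's actual interface. */
    interface KeyPartitioner {
        int partition(Object[] joinKeys, int numPartitions);
    }

    /** Assumed fallback: hash the join keys so equal keys always map to the same subtask. */
    static final KeyPartitioner HASH =
            (keys, n) -> Math.floorMod(Arrays.hashCode(keys), n);

    /** Uses the source's own partitioner when it offers one, otherwise the hash fallback. */
    static KeyPartitioner choose(Optional<KeyPartitioner> fromSource) {
        return fromSource.orElse(HASH);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // The source offers no custom partitioner here, so the hash fallback is used.
        KeyPartitioner partitioner = choose(Optional.empty());
        for (String customerId : new String[] {"c-1", "c-2", "c-1", "c-3", "c-2"}) {
            int subtask = partitioner.partition(new Object[] {customerId}, parallelism);
            // Repeated customer ids are printed with the same subtask index.
            System.out.println(customerId + " -> lookup subtask " + subtask);
        }
    }
}
```

With an arbitrary distribution, the two `c-1` rows could reach different subtasks and each would miss its local cache; with a key-based partitioner they reach the same one. A source with knowledge of its own storage layout could pass a different `KeyPartitioner` to `choose` in this sketch.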
