Commit fdd683e

Address to review comments
1 parent b35df7d commit fdd683e

2 files changed

Lines changed: 40 additions & 200 deletions

File tree

  • docs
    • content.zh/docs/deployment/filesystems
    • content/docs/deployment/filesystems

docs/content.zh/docs/deployment/filesystems/s3.md

Lines changed: 20 additions & 100 deletions
@@ -66,19 +66,23 @@ Note that these examples are *not* exhaustive and you can use S3 in other places
 
 ## S3 FileSystem Implementations
 
-Flink provides three independent S3 filesystem implementations, each with different trade-offs:
+Flink provides three independent S3 filesystem implementations:
 
-- **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK v2 with async I/O and parallel transfers, removing the dependency from Hadoop entirely. Supports both checkpointing and the FileSink in a single plugin, removing the need to choose between Presto (checkpointing) and Hadoop (FileSink). [Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3.
-- **Presto S3 FileSystem** (`flink-s3-fs-presto`): Based on Presto project code. The proven choice for checkpointing in production.
-- **Hadoop S3 FileSystem** (`flink-s3-fs-hadoop`): Based on Hadoop project code. Supports both checkpointing and the FileSink.
+| Implementation | Checkpointing | FileSink | Notes |
+|---------------|:---:|:---:|-------|
+| **Native S3** (`flink-s3-fs-native`) | ✓ | ✓ | **Experimental** in Flink 2.3. Built on AWS SDK v2; no Hadoop dependency. |
+| **Presto S3** (`flink-s3-fs-presto`) | ✓ | ✗ | Production-proven for checkpointing. |
+| **Hadoop S3** (`flink-s3-fs-hadoop`) | ✓ | ✓ | Mature; the only stable implementation that provides `RecoverableWriter` for the FileSink. |
+
+Previously, users had to choose between Presto (recommended for checkpointing throughput) and Hadoop (the only implementation with `RecoverableWriter`, required by the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}})). The Native S3 implementation unifies both capabilities in a single plugin and measurements show significant checkpoint throughput improvements over the Presto implementation.
 
 All three are self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use them.
 
 ## Common Configuration
 
 ### Configure Access Credentials
 
-After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets.
+After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets. The following three approaches are **independent alternatives** — choose the one that fits your environment:
 
 #### Identity and Access Management (IAM) (Recommended)
 
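For orientation, a minimal sketch of what any of the three plugins enables once installed: pointing checkpoint storage at an S3 bucket. The bucket name is a placeholder, and the exact option key for the checkpoint directory varies across Flink versions — check the configuration reference for your release.

```yaml
# Placeholder bucket; the path works with s3:// and s3a:// (Native/Hadoop)
# or s3p:// (Presto). Key name may differ in your Flink version.
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
```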
@@ -151,7 +155,7 @@ The legacy configuration key `s3.path.style.access` is still supported as a fall
 **Experimental**: The Native S3 FileSystem is experimental in Flink 2.3. It is functionally complete and has demonstrated strong performance in benchmarks.
 {{< /hint >}}
 
-The Native S3 FileSystem is a pure-Java implementation built on the AWS SDK v2 completely removing the dependencies from hadoop. It is registered under the schemes *s3://* and *s3a://*. It requires no additional dependencies and provides a drop-in replacement for the Presto and Hadoop implementations.
+The Native S3 FileSystem is a pure-Java implementation built on the AWS SDK v2, completely removing the dependency on Hadoop. It is registered under the schemes *s3://* and *s3a://*. It provides a drop-in replacement for the Presto and Hadoop implementations, supporting checkpointing, the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (via `RecoverableWriter`), server-side encryption (SSE-S3, SSE-KMS), cross-account access via IAM role assumption, entropy injection, and bulk copy via S3TransferManager.
 
 #### Setup
 
@@ -162,37 +166,14 @@ mkdir -p ./plugins/s3-fs-native
 cp ./opt/flink-s3-fs-native-{{< version >}}.jar ./plugins/s3-fs-native/
 ```
 
-#### Features
-
-- **No external dependencies**: Built on AWS SDK v2 with minimal footprint
-- **Drop-in replacement**: Compatible with the same S3 URI schemes (`s3://`, `s3a://`)
-- **FileSystem sink support**: Supports the [FileSystem sink]({{< ref "docs/connectors/datastream/filesystem" >}}) via `RecoverableWriter`
-- **Encryption support**: Server-side encryption (SSE-S3, SSE-KMS)
-- **Assume role**: Cross-account access via IAM role assumption
-- **Entropy injection**: Optimize S3 scalability through random key prefixes
-- **Bulk copy**: Efficient multi-part copy operations via S3TransferManager
-
 #### Configuration
 
-The Native S3 FileSystem uses the following configuration options:
+In addition to the [common configuration](#common-configuration) options (`s3.access-key`, `s3.secret-key`, `s3.endpoint`, `s3.path-style-access`), the Native S3 FileSystem supports the following options:
 
 ```yaml
-# AWS credentials (if using static credentials)
-s3.access-key: your-access-key
-s3.secret-key: your-secret-key
-
-# AWS region (optional; auto-detected if not specified)
-s3.region: us-east-1
-
-# Custom S3 endpoint for S3-compatible storage
-s3.endpoint: your-endpoint-hostname
-
-# Path style access for S3-compatible storage
-s3.path-style-access: true
-
 # Server-side encryption
 s3.sse.type: sse-s3 # or sse-kms, aws:kms, AES256, none (default)
-s3.sse.kms.key-id: arn:aws:kms:region:account:key/id # For SSE-KMS
+s3.sse.kms.key-id: arn:aws:kms:region:account:key/id # Required for SSE-KMS
 
 # IAM role assumption for cross-account access
 s3.assume-role.arn: arn:aws:iam::account:role/RoleName
@@ -201,28 +182,16 @@ s3.assume-role.session-name: flink-s3-session
 s3.assume-role.session-duration: 3600
 
 # Performance tuning
-s3.upload.min.part.size: 5242880 # 5MB default
+s3.upload.min.part.size: 5242880 # 5 MB default
 s3.upload.max.concurrent.uploads: 4 # Based on CPU cores
-s3.read.buffer.size: 262144 # 256KB default
-s3.async.enabled: true # Enable async operations
-s3.bulk-copy.enabled: true # Enable bulk copy
+s3.read.buffer.size: 262144 # 256 KB default
+s3.async.enabled: true # Async read/write operations
+s3.bulk-copy.enabled: true # Bulk copy via S3TransferManager
 s3.bulk-copy.max-concurrent: 16 # Max concurrent copy ops
-
-# Entropy injection for scalability
-s3.entropy.key: _entropy_
-s3.entropy.length: 4
-
-# Retry configuration
-s3.retry.max-num-retries: 3
-
-# Credentials provider (optional; see note below)
-# fs.s3.aws.credentials.provider: software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider
 ```
 
 When `fs.s3.aws.credentials.provider` is not set, the Native S3 FileSystem automatically builds a credentials chain in the following order: delegation tokens, static credentials (if `s3.access-key` and `s3.secret-key` are configured), and the AWS SDK v2 `DefaultCredentialsProvider` (environment variables, instance profiles, etc.). You only need to set this option if you require a custom provider chain.
 
-See the [AWS SDK v2 documentation](https://docs.aws.amazon.com/sdk-for-java/) for additional configuration details.
-
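As an illustrative sketch of the credentials chain described above (the keys appear elsewhere in this page; the values are placeholders, not real credentials): setting static keys makes the second step of the chain apply, while leaving them unset falls through to the SDK's `DefaultCredentialsProvider`.

```yaml
# Step 2 of the default chain: static credentials (placeholder values)
s3.access-key: your-access-key
s3.secret-key: your-secret-key

# Or bypass the default chain with a custom provider class, e.g. the
# SDK's anonymous provider (shown here purely as an illustration):
# fs.s3.aws.credentials.provider: software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider
```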
---
### Presto S3 FileSystem
@@ -231,13 +200,7 @@ See the [AWS SDK v2 documentation](https://docs.aws.amazon.com/sdk-for-java/) fo
 You don't have to configure this manually if you are running [Flink on EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html).
 {{< /hint >}}
 
-The Presto S3 FileSystem is based on code from the [Presto project](https://prestodb.io/). It is registered under the schemes *s3://* and *s3p://*.
-
-#### Features
-
-- **Recommended for checkpointing**: The Presto implementation is the recommended file system for checkpointing to S3
-- **Self-contained**: No Hadoop dependency required
-- **Production-ready**: Stable and widely used
+The Presto S3 FileSystem is based on code from the [Presto project](https://prestodb.io/). It is registered under the schemes *s3://* and *s3p://*. It is the production-proven choice for checkpointing to S3. It does not support the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (`createRecoverableWriter` throws `UnsupportedOperationException`).
 
 #### Setup
 
@@ -250,36 +213,13 @@ cp ./opt/flink-s3-fs-presto-{{< version >}}.jar ./plugins/s3-fs-presto/
 
 #### Configuration
 
-Configure it using [the same configuration keys as the Presto file system](https://prestodb.io/docs/0.272/connector/hive.html#amazon-s3-configuration), by adding the configurations to your [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}):
-
-```yaml
-# AWS credentials
-s3.access-key: your-access-key
-s3.secret-key: your-secret-key
-
-# Custom endpoint
-s3.endpoint: your-endpoint-hostname
-
-# Path style access
-s3.path-style-access: true
-
-# Credentials provider
-presto.s3.credentials-provider: org.apache.flink.fs.s3.common.token.DynamicTemporaryAWSCredentialsProvider
-```
-
-Refer to the [Presto documentation](https://prestodb.io/docs/0.272/connector/hive.html#amazon-s3-configuration) for all available configuration options.
+The [common configuration](#common-configuration) options apply. In addition, Presto-specific keys are supported via the [Presto file system configuration](https://prestodb.io/docs/0.272/connector/hive.html#amazon-s3-configuration).
 
 ---
 
 ### Hadoop S3 FileSystem
 
-The Hadoop S3 FileSystem is based on code from the [Hadoop Project](https://hadoop.apache.org/). It is registered under the schemes *s3://* and *s3a://*.
-
-#### Features
-
-- **FileSystem sink support**: Supports the [FileSystem sink]({{< ref "docs/connectors/datastream/filesystem" >}})
-- **Self-contained**: No additional Hadoop installation required
-- **Mature implementation**: Long-established code from the Hadoop ecosystem
+The Hadoop S3 FileSystem is based on code from the [Hadoop Project](https://hadoop.apache.org/). It is registered under the schemes *s3://* and *s3a://*. It is the only stable implementation that supports the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (via `RecoverableWriter`).
 
 #### Setup
 
@@ -292,27 +232,7 @@ cp ./opt/flink-s3-fs-hadoop-{{< version >}}.jar ./plugins/s3-fs-hadoop/
 
 #### Configuration
 
-Configure it using [Hadoop's s3a configuration keys](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A) by adding the configurations to your [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}):
-
-```yaml
-# AWS credentials
-s3.access-key: your-access-key
-s3.secret-key: your-secret-key
-
-# Custom endpoint
-s3.endpoint: your-endpoint-hostname
-
-# Path style access
-s3.path-style-access: true
-
-# Connection settings
-s3.connection.maximum: 10
-
-# Credentials provider
-fs.s3a.aws.credentials.provider: org.apache.flink.fs.s3.common.token.DynamicTemporaryAWSCredentialsProvider
-```
-
-Hadoop configuration keys are automatically translated. For example, `fs.s3a.connection.maximum` becomes `s3.connection.maximum`. Refer to the [Hadoop S3A documentation](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A) for all available options.
+The [common configuration](#common-configuration) options apply. In addition, [Hadoop's s3a configuration keys](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A) are supported. Hadoop configuration keys are automatically translated — for example, `fs.s3a.connection.maximum` becomes `s3.connection.maximum`.
 
 ---
 
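A minimal sketch of the key-translation rule stated in the new text, using the key and value that appear in the removed block (the value is illustrative):

```yaml
# Flink-prefixed form, translated automatically from Hadoop's
# fs.s3a.connection.maximum
s3.connection.maximum: 10
```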