Improve data locality by considering Kubernetes topology

## Description

As users of the HDFS operator and a Stackable deployed HDFS we want it to ensure data locality by talking to a DataNode on the same Kubernetes node as the client first if one exists.

## Value

HDFS tries to store the first copy of a block on a "local" machine before shipping data to remote machines over the network. This relies on a simple IP address comparison in the HDFS code which breaks due to the nature of Kubernetes where pods don't share the same IP even if they are on the same Kubernetes node.

I believe we can improve this situation by changing the HDFS code to consider the Kubernetes node while looking for a "local" machine.

We already have precedent with the [`hdfs-topology-provider`](https://github.com/stackabletech/hdfs-topology-provider/) which does something similar. I believe we can plug this logic into the `chooseLocalOrFavoredStorage` method of `BlockPlacementPolicyDefault`.

We want this because it will probably benefit _all_ workloads that are using HDFS and locally attached storage and that are using things like Spark or HBase where processing can happen on the same Kubernetes node as the storage. The benefit is going to be less network traffic and a boost in performance.

## Dependencies

It probably makes sense to reuse code from the `hdfs-topology-provider` project.


## Tasks

- Understand exactly what the topology provider is doing: We need a way - from within Kubernetes - to get the node a client is "calling" from (via its IP) as well as the node the DataNodes are running on
- Decide whether this behavior is going to be enabled by default or not
- Discuss the best point where to plug this behavior in and create a patch or pluggable class (etc. a BlockPlacementPolicy)
- Document the behavior
- Test it on a multi-node cluster and also test it with external clients (listener etc.)
- Marketing: Discuss with marketing whether we should create a blog post about it

## Acceptance Criteria

- We have a way to compare data locality based on Kubernetes nodes that falls back to the default in case there is an error
- The behavior is documented even if it is not changeable/pluggable  

## Release Notes

The HDFS NameNodes will now look at the Kubernetes topology when considering whether a `client` request is made `locally` or not. This means it will consider all clients "local" that are hosted on the same Kubernetes node as a DataNode.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Improve data locality by considering Kubernetes topology #595

Description

Value

Dependencies

Tasks

Acceptance Criteria

Release Notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Improve data locality by considering Kubernetes topology #595

Description

Description

Value

Dependencies

Tasks

Acceptance Criteria

Release Notes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions