Optimize document-similarity memory requirements

Currently the top-level `mapredChildJavaOpts` value (e.g. the one defined at the `document-similarity-oap-uberworkflow` workflow level) is propagated all the way down to every subworkflow and every Pig script.

Does this mean that all the subworkflows and scripts have the same, fairly high, memory requirements?

In the OpenAIRE CDH5 OCEAN cluster, after a number of experiments, we were able to lower the top-level `mapredChildJavaOpts` value to `4g` without affecting document-similarity stability. This still causes a performance bottleneck, though: because of the physical memory shortage, YARN can allocate at most ~200 of the 608 cores available in total.

If we could get down to e.g. `1638m` for some of the subworkflows, then all 608 cores could be utilized at this phase of processing.
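
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each task needs one core plus a container roughly the size of its Java heap, and it uses a hypothetical total memory figure (`TOTAL_MEMORY_MB`), since the real cluster capacity is not stated here. The helper names are made up for illustration, and real YARN containers are usually somewhat larger than the heap, so the numbers will not exactly match the ~200 cores observed.

```python
def heap_to_mb(size: str) -> int:
    """Convert a JVM-style size such as '4g' or '1638m' to megabytes."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


def max_concurrent_tasks(total_cores: int, total_memory_mb: int, task_memory_mb: int) -> int:
    """Number of tasks that fit at once when each needs one core and task_memory_mb."""
    return min(total_cores, total_memory_mb // task_memory_mb)


TOTAL_CORES = 608            # from the description above
TOTAL_MEMORY_MB = 1_024_000  # hypothetical figure, NOT the real cluster capacity

for heap in ("4g", "1638m"):
    tasks = max_concurrent_tasks(TOTAL_CORES, TOTAL_MEMORY_MB, heap_to_mb(heap))
    bound = "memory-bound" if tasks < TOTAL_CORES else "core-bound"
    print(f"{heap}: {tasks} of {TOTAL_CORES} cores usable ({bound})")
```

Under these assumptions the `4g` setting is limited by memory, while `1638m` is limited only by the number of cores, which is the effect described above.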



Optimize document-similarity memory requirements #415
