Replace buggy pig rank function with custom solution #432

@marekhorst

This issue was originally reported in openaire/iis#927, but since it requires changes to the CoAnSys Pig script, I am reporting it again here.

Problems related to the Pig RANK operation have already been mitigated several times within the OpenAIRE scope: either by increasing the amount of memory (openaire/iis#796, openaire/iis#807) or by refactoring the Pig script to minimize the memory footprint during the RANK operation (#425).

After the recent increase in the number of publications (to 37M), we are again struggling with a memory-related problem:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)

The full log is available here: https://pastebin.com/dk2C8wLF

The Pig execution plan is available here:
https://pastebin.com/bAUsCNjb

Again, the RANK operation is reported as the phase in which the map task failed:

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1524597382992_21544 wc_ranked   ORDER_BY    Message: Job failed!   
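
For illustration only, one possible shape for a custom RANK replacement is a two-pass scheme: count the records in each partition, turn the counts into starting ranks, then number each partition independently so that no single task has to hold the whole relation. This is a minimal sketch in plain Python, not CoAnSys or Pig code; the function names (`partition_offsets`, `rank_partition`, `rank_all`) are hypothetical, and it assumes the partitions are already in the desired global order.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def partition_offsets(sizes: Sequence[int]) -> List[int]:
    """Return the first global rank (1-based) assigned to each partition."""
    offsets = []
    next_rank = 1
    for size in sizes:
        offsets.append(next_rank)
        next_rank += size  # the next partition starts right after this one
    return offsets


def rank_partition(records: Iterable[str], offset: int) -> Iterator[Tuple[int, str]]:
    """Number the records of one partition, starting at its offset."""
    for local_index, record in enumerate(records):
        yield offset + local_index, record


def rank_all(partitions: Sequence[Sequence[str]]) -> List[Tuple[int, str]]:
    """Assumption: partitions are already globally ordered (hypothetical setup)."""
    # Pass 1: only the per-partition counts are collected centrally.
    offsets = partition_offsets([len(p) for p in partitions])
    # Pass 2: each partition is ranked on its own, using its offset.
    ranked: List[Tuple[int, str]] = []
    for records, offset in zip(partitions, offsets):
        ranked.extend(rank_partition(records, offset))
    return ranked
```

In a real job the second pass would run per partition in parallel; the loop in `rank_all` only stands in for that here.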
