Replace buggy pig rank function with custom solution #927

@marekhorst

This problem has already been fixed several times: either by increasing the amount of memory (#796, #807) or by refactoring the Pig script to minimize the memory footprint of the RANK operation (CeON/CoAnSys#425).

After the recent increase in the number of publications (to 37M), we are again struggling with a memory-related problem:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)

The full log is available here: https://pastebin.com/dk2C8wLF

The Pig execution plan is available here: https://pastebin.com/bAUsCNjb

The job summary again points to the RANK operation as the phase in which the map task failed:

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1524597382992_21544 wc_ranked   ORDER_BY    Message: Job failed!   
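For discussion, here is a minimal sketch of one way a custom rank could avoid a single large buffer. It is not the fix adopted in this project and not how Pig implements RANK. It assumes the records are already split into partitions that are globally ordered (every record in partition i sorts before every record in partition i+1) and sorted within each partition. All names (`partition_offsets`, `rank_partition`, `rank_all`) and the `(key, score)` record shape are hypothetical. The first pass only counts records per partition. The second pass assigns `rank = offset + local position`, so no task has to hold more than its own partition.

```python
from itertools import accumulate
from typing import Iterable, Iterator, List, Sequence, Tuple

Record = Tuple[str, int]  # hypothetical (key, score) record


def partition_offsets(partition_sizes: Sequence[int]) -> List[int]:
    """Return, for each partition, the number of records that precede it."""
    # Exclusive prefix sum: [2, 1, 2] -> [0, 2, 3]
    return [0] + list(accumulate(partition_sizes))[:-1]


def rank_partition(records: Iterable[Record], offset: int) -> Iterator[Tuple[int, str, int]]:
    """Assign 1-based global ranks to one already-sorted partition."""
    for local_position, (key, score) in enumerate(records, start=1):
        yield (offset + local_position, key, score)


def rank_all(partitions: Sequence[Sequence[Record]]) -> List[Tuple[int, str, int]]:
    """Driver that simulates both passes in a single process."""
    # Pass 1: only the per-partition counts are collected centrally.
    offsets = partition_offsets([len(p) for p in partitions])
    # Pass 2: each partition is ranked independently using its offset.
    ranked: List[Tuple[int, str, int]] = []
    for records, offset in zip(partitions, offsets):
        ranked.extend(rank_partition(records, offset))
    return ranked
```

In a real job the two passes would run as separate distributed stages, and only the small list of per-partition counts would need to fit in memory on a single node.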
