Add option to resample features at nodes without replacement #192

@mharradon

Hello, thanks for the nice package.

I was working on an application where I wanted perfect prediction in a classification task, and found that I could not get it with partial_frac = 1.0, which I did not expect. After some investigation, it appears that instances are sampled with replacement when constructing forests. As a result, although N samples are drawn for each individual tree fit, the draw almost always contains duplicates and therefore omits some instances. See e.g.:

https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104

julia> rand(1:5, 5)
5-element Vector{Int64}:
 5
 5
 2
 2
 3

I think it would be preferable if sampling were performed without replacement, so that partial_frac = 1.0 gives each tree every instance exactly once. I don't know whether this is the standard convention for random forests, though.
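
To make the difference concrete, here is a minimal sketch comparing the two sampling schemes using only the Random standard library. The helper `sample_indices` and its `replace` keyword are hypothetical names I made up for illustration; they are not part of DecisionTree.jl, and the actual implementation may compute the sample size differently.

```julia
using Random

# Hypothetical helper (NOT part of DecisionTree.jl): draw the row indices
# used to fit a single tree from a dataset with `n_samples` rows.
function sample_indices(rng::AbstractRNG, n_samples::Integer, frac::Real;
                        replace::Bool=true)
    k = round(Int, frac * n_samples)          # number of rows to draw (assumed rounding)
    if replace
        # With replacement: the same index can be drawn more than once.
        return rand(rng, 1:n_samples, k)
    else
        # Without replacement: take the first k entries of a random permutation.
        return randperm(rng, n_samples)[1:k]
    end
end

rng = MersenneTwister(42)
n = 1000

with_repl    = sample_indices(rng, n, 1.0; replace=true)
without_repl = sample_indices(rng, n, 1.0; replace=false)

# Fraction of distinct rows each tree would see at frac = 1.0.
println("with replacement:    ", length(unique(with_repl)) / n)
println("without replacement: ", length(unique(without_repl)) / n)
```

With replacement, the expected fraction of distinct rows is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 0.63) for large N, so roughly a third of the instances are missing from each tree. Without replacement, a fraction of 1.0 covers every row exactly once.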

I would be happy to contribute a PR if it's agreed that sampling without replacement is preferred.

Thank you!
