Hello, thanks for the nice package.
I was working on an application where I wanted perfect prediction in a classification task and found that I could not achieve it with partial_frac = 1.0, which I did not expect. After some investigation, it appears that instances are sampled with repetition (i.e. with replacement) when constructing forests. As a result, although N samples are drawn for each individual tree fit, they almost always include duplicates and omit other instances. See e.g.:
https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104
julia> rand(1:5, 5)
5-element Vector{Int64}:
 5
 5
 2
 2
 3
I think it would be preferable if sampling were performed without repetition, so that the partial_frac = 1.0 limit is exact: every instance would then appear exactly once in each tree's training set. I don't know if this is the standard convention for random forests, though.
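To make the difference concrete, here is a minimal sketch comparing the two sampling schemes using only the Random standard library. The helper names `sample_with_replacement` and `sample_without_replacement` are hypothetical and not part of DecisionTree.jl; the package's actual sampling code may differ. For reference, when N indices are drawn with replacement from 1:N, the expected fraction of distinct indices is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 63%) for large N.

```julia
using Random

# Hypothetical helpers for illustration only; not part of DecisionTree.jl.

# Draw k indices from 1:n with replacement (indices may repeat).
sample_with_replacement(rng, n, k) = rand(rng, 1:n, k)

# Draw k distinct indices from 1:n by taking the first k entries
# of a random permutation (requires k <= n).
sample_without_replacement(rng, n, k) = randperm(rng, n)[1:k]

rng = MersenneTwister(1)
n = 1000
partial_frac = 1.0                  # fraction of instances used per tree
k = round(Int, partial_frac * n)

with_rep = sample_with_replacement(rng, n, k)
without_rep = sample_without_replacement(rng, n, k)

println("distinct (with replacement):    ", length(unique(with_rep)), " of ", n)
println("distinct (without replacement): ", length(unique(without_rep)), " of ", n)
```

With partial_frac = 1.0, the second scheme always covers all n instances, while the first typically leaves a substantial share of them out of any given tree.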
I would be happy to contribute a PR if it's agreed that non-repeated sampling is preferred.
Thank you!