Add fromArrayMonolithic #115

Open: treeowl wants to merge 2 commits into master from fromArrayMonolithic

Conversation

treeowl (Contributor) commented Dec 25, 2014

For GHC, this should let us create a sequence out of an array
without having any stray pointers to the array in the sequence.

treeowl force-pushed the fromArrayMonolithic branch 3 times, most recently from 12a2eeb to 68d58e2 on December 25, 2014 03:45
treeowl force-pushed the fromArrayMonolithic branch from 68d58e2 to 6ce774c on December 25, 2014 03:57

treeowl (Contributor, Author) commented Dec 25, 2014

Note: if you're concerned about depending on primitive, we could reimplement what we need; it's just really ugly.
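
For context, "reimplement what we need" would mean writing directly against GHC's primops. A rough sketch of the central trick, written against GHC.Exts rather than primitive (illustrative only, not the code in this branch):

{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts (Array#, Int (I#), indexArray#)

-- Read the element pointer at position i without evaluating it: matching
-- on the unary unboxed tuple (# x #) unpacks the read without entering x,
-- so thunks are copied around untouched.
indexLazily :: Array# a -> Int -> a
indexLazily arr (I# i) =
  case indexArray# arr i of
    (# x #) -> x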

treeowl (Contributor, Author) commented Dec 25, 2014

After a lot of trouble, I finally figured out how to test this properly, and it works: holding on to a split-off piece of a sequence produced by fromArrayMonolithic requires much less memory than holding on to one produced by fromArray, although actually performing the conversion produces a lot more garbage. I doubt that many people will need fromArrayMonolithic, but if they do, they probably won't want to write it themselves.

The testing code (compile it with -with-rtsopts=-T and give it a number like 100000):

module Main where

import Prelude hiding (splitAt)

import Data.Array hiding (index)
import Data.Sequence

import Control.Exception
import Control.DeepSeq

import GHC.Stats
import System.Mem (performMajorGC)

import qualified Data.Foldable as F

sillyArray :: Int -> Array Int Int
sillyArray n = listArray (0,n-1) [0 .. n-1]

testFromArray :: Int -> String -> (Array Int Int -> Seq Int) -> IO ()
testFromArray size name fa = do
  putStrLn $ "Testing " ++ name
  -- Convert the array, split off a small prefix, force the large suffix,
  -- and retain only the small piece. If the conversion left pointers to
  -- the array in the tree, the whole array stays reachable from sm.
  let sm = case splitAt 10 . fa $ sillyArray size of
             (small, large) -> large `deepseq` small
  evaluate sm
  stats1 <- getGCStats
  putStrLn $ "Bytes used before GC: " ++ show (currentBytesUsed stats1)
  performMajorGC
  stats2 <- getGCStats
  putStrLn $ "Bytes used after GC: " ++ show (currentBytesUsed stats2)
  putStrLn $ "Result: " ++ show (F.sum sm)

main = do
  putStrLn "How large should the array be?"
  size <- readLn
  testFromArray size "fromArray" fromArray
  testFromArray size "fromArrayMonolithic" fromArrayMonolithic

treeowl (Contributor, Author) commented Dec 25, 2014

We should also test how this ST implementation compares to one using State. State leads to a bunch of pairs showing up in the Core, which suggests extra allocation, but that may not be the case.
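
(For context on the pairs: this is just the standard State representation, not code from this branch.)

-- Every bind/ap step produces a (result, newState) tuple, which is what
-- surfaces as extra allocation in the Core output.
newtype State s a = State { runState :: s -> (a, s) }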

Commit: Allocation actually goes down quite a bit when using State instead of ST.

treeowl (Contributor, Author) commented Dec 25, 2014

State is better, by quite a bit, and avoids silly GHC version issues. Perfect.

foxik (Contributor) commented Dec 30, 2014

Personally, I am against this. Why have two 'fromArray' functions? Also, what is the difference between fromArrayMonolithic and rnf . fromArray? The second one is a trivial way of achieving the same effect, isn't it?

In any case, this is an API change, so please propose it to libraries@haskell if you want to pursue this.

treeowl (Contributor, Author) commented Dec 30, 2014

> Personally, I am against this. Why have two 'fromArray' functions?

Because they have very different performance tradeoffs.

> Also, what is the difference between fromArrayMonolithic and rnf . fromArray? The second one is a trivial way of achieving the same effect, isn't it?

No. fromArrayMonolithic does not force any of the array elements. It just copies them out of the array (so pointers to thunks will just get copied). What I'd really like is to add support for a sort of copy-on-GC array slice to GHC's RTS, which would let us have it all, but it sounds like that would be a pretty large project.
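
To make the distinction concrete, a small sketch (the error thunk is purely for illustration, and fromArrayMonolithic refers to the function proposed in this PR):

import Control.DeepSeq (rnf)
import Data.Array (Array, listArray)
import Data.Sequence (fromArray)

-- An array whose second element is an unevaluated error thunk.
arr :: Array Int Int
arr = listArray (0, 1) [1, error "boom"]

-- rnf (fromArray arr) forces every element and therefore throws "boom".
-- fromArrayMonolithic arr succeeds: it copies the element pointers out of
-- the array without forcing them, so the thunk survives, yet the resulting
-- Seq no longer references the array.
forcesEverything :: ()
forcesEverything = rnf (fromArray arr)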

> In any case, this is an API change, so please propose it to libraries@haskell if you want to pursue this.

Will do.



tibbe (Member) commented Dec 30, 2014

I find the name confusing. Does this just create an element-strict Seq (which would imply that there aren't any references to the array) or is it still value-lazy, but careful to evaluate the array reads just enough to make sure there are no "read array" thunks left? If it's the latter, perhaps we should just change fromArray to do that? Having a fromArray that leaks memory by holding references to the array sounds bad.

treeowl (Contributor, Author) commented Dec 30, 2014

It does the latter (the careful thing). I guess changing fromArray is an
option, but fromArray has much better performance in other contexts, so I'm
not sure.

treeowl (Contributor, Author) commented Dec 30, 2014

The monolithic version can also be expanded to grab a limited slice, which
I don't think fromArray can do (because of Ix nonsense). But the
monolithic way feels a lot like a workaround for the absence of RTS support
for slicing.

treeowl (Contributor, Author) commented Dec 30, 2014

Er .... actually, fromArray can probably do that too. Maybe we should
yank fromArray pending further thought?

foxik (Contributor) commented Dec 30, 2014

Personally I am not happy about exporting two fromArray functions, so I would agree with modifying fromArray.

Nevertheless, we are not going to use primitive for this -- every dependency costs something. I understand that vector uses primitive, but I am unwilling to depend on primitive for one (probably very little used) conversion function in Data.Sequence.

Personally, I would even consider getting rid of the GHC-specific stuff in fromArray and using elems. Yes, it is slow, but the alternative is using a semi-official API and a bunch of extensions (probably MagicHash and UnboxedTuples) for one conversion function I am not sure anyone really wants. Any thoughts, @tibbe?
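
Concretely, the portable version would be something along these lines (the name is made up for illustration):

import Data.Array (Array, Ix, elems)
import Data.Sequence (Seq, fromList)

-- No GHC-specific primitives or extensions, but it goes through an
-- intermediate list and has neither cheap immediate access nor the
-- space-safety guarantee discussed in this thread.
fromArrayViaElems :: Ix i => Array i a -> Seq a
fromArrayViaElems = fromList . elems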

treeowl (Contributor, Author) commented Dec 30, 2014

I'm not happy with the elems approach at all. I'd rather just kill the
function altogether for now and consider the design space more carefully
for the next version.

foxik (Contributor) commented Dec 30, 2014

Well, it is my mistake that I did not insist on a proposal to libraries@haskell before merging the first fromArray. That is not going to happen again in the future :-)

Nevertheless, we are probably stuck with fromArray now. If the numbers show there is a reasonably large difference, I can probably live with MagicHash, UnboxedTuples and a dependency on ghc-prim, but there should be a nontrivial improvement to justify all this.

treeowl (Contributor, Author) commented Dec 30, 2014

Insisting on bureaucratic process doesn't always make things work better, but perhaps this time it would have. There is a huge time-vs-space-safety tradeoff between fromArray and fromArrayMonolithic. fromArray uses unsafeAt to give immediate O(log(min(i, n-i))) access to the resulting sequence. fromArrayMonolithic uses the primitive machinery to guarantee space safety when an array is converted and the resulting sequence split up. An approach using elems offers neither of these benefits: no immediate access and no space safety.
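
For reference, the fromArray side of that tradeoff looks roughly like this (a sketch, not necessarily the exact containers source):

import qualified Data.Array.Base as Base
import Data.Array (Array)
import Data.Ix (Ix)
import Data.Sequence (Seq, fromFunction)

-- Each element is a thunk that reads from the array only when that position
-- of the Seq is demanded: access is immediate and cheap, but every unforced
-- element keeps the entire array alive.
fromArrayLazy :: Ix i => Array i a -> Seq a
fromArrayLazy a = fromFunction (Base.numElements a) (Base.unsafeAt a)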

tibbe (Member) commented Dec 30, 2014

I generally agree with @foxik here. It would help to clearly show the benefits of having both (i.e. with benchmarks). We also need to be careful that we measure the right thing. Deferring lots of work lazily can often give performance that we don't see in real programs (where we typically force the elements eventually and suffer higher GC costs down the line due to holding on to thunks).

treeowl (Contributor, Author) commented Dec 30, 2014

I can put benchmarks together for sure. In this case, deferring things lazily is a pretty clear performance win for some reasonably likely situations. In particular, it's extremely common to build a structure (e.g., a Seq), and then only use it as-is a single time. By deferring the construction of each subtree until that subtree is used, we can avoid the very high cost of letting all the pieces fall out of cache as we construct the sequence only to drag them back in again to use it. Something like f <$> fromArray arry <*> s2, for example, will get a big speed boost from this.

tibbe (Member) commented Dec 30, 2014

It sounds like you have a good benchmark then. :) Please add it to the benchmark suite and post the results here (including the naive toList comparison).
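
A rough shape for such a benchmark entry, assuming a criterion-based harness (the commented-out lines are the variants to fill in, including the naive list-based comparison):

import Criterion.Main (bench, bgroup, defaultMain, env, nf)
import Data.Array (Array, listArray)
import Data.Sequence (fromArray)
import qualified Data.Foldable as F

main :: IO ()
main = defaultMain
  [ env (return arr100k) $ \arr ->
      bgroup "array conversions"
        [ bench "fromArray/sum" $ nf (F.sum . fromArray) arr
        -- , bench "fromArrayMonolithic/sum" $ nf (F.sum . fromArrayMonolithic) arr
        -- , bench "fromList . elems/sum"    $ nf (F.sum . fromList . elems) arr
        ]
  ]
  where
    arr100k :: Array Int Int
    arr100k = listArray (0, 99999) [0 .. 99999]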

foxik (Contributor) commented Dec 30, 2014

The bureaucratic process you are referring to serves a reasonably good purpose -- on the one hand, it allows the community to express its opinion, and on the other, it makes the proposal more explicit and usually ensures that the alternatives are spelled out and considered. In this situation, we probably would have discussed the fast_but_keeping_reference vs slow_but_no_references approaches and settled on one of them, or decided that we need both.

treeowl (Contributor, Author) commented Dec 30, 2014

@foxik Yes, you are right. It can definitely be helpful sometimes. Sometimes it just bogs down in silly bickering, which does not negate the fact that sometimes it is helpful. I made a mistake here, and I'm sorry.

foxik (Contributor) commented Dec 30, 2014

There is no need to feel sorry -- the mistake of adding fromArray without discussing the semantics is mine as a maintainer, not yours :-)

I understand you may feel that the library submission process is blocking and slowing down your work. But there is a good reason for it: widely used packages like containers need a lot of stability. Also, multiple heads are better than one.

sjakobi (Member) left a comment


I somewhat agree that this should be discussed on the libraries list.

I'll "request changes" to clarify that this discussion should happen before the PR could be merged.

sjakobi added the Seq, feature-request, and needs-libraries-proposal (a proposal needs to be discussed on the libraries mailing list) labels on Jul 15, 2020