Concurrent normalization #2074

Closed · wants to merge 1 commit

Conversation

@alex-mckenna alex-mckenna commented Feb 3, 2022

Still TODO:

  • Write a changelog entry (see changelog/README.md)
  • Check copyright notices are up to date in edited files
  • Make sure the flag to turn it on/off works

@alex-mckenna alex-mckenna force-pushed the concurrent-normalization branch 6 times, most recently from b188fa8 to e2a4522 Compare February 8, 2022 16:52
@vmchale vmchale force-pushed the concurrent-normalization branch 4 times, most recently from f2e77e4 to de54ed3 Compare March 21, 2022 16:18
@vmchale vmchale marked this pull request as ready for review March 21, 2022 16:27
@vmchale vmchale requested review from christiaanb, martijnbastiaan and leonschoorl and removed request for christiaanb and martijnbastiaan March 21, 2022 16:28
@vmchale vmchale force-pushed the concurrent-normalization branch 6 times, most recently from 90b1223 to 8cf1a7e Compare March 21, 2022 20:06
@martijnbastiaan (Member)

@vmchale I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue? Have you got any numbers for us? :)

@alex-mckenna (Contributor, Author)

I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue?

Part of that was a lack of programs to try it on with a significant branching factor when normalizing top-level binders. For the benchmarks and examples we have, I found that not much actually runs at the same time (and since so many entities are also very small, concurrency doesn't help them). In many of our examples the number of children for a binder is typically 1, which in the worst case means no concurrency at all. When there are multiple children it typically doesn't go higher than 2 or 3 in a design, so the concurrency you actually observe is somewhat limited.

The point is, I don't think this can really be answered without a benchmark program where normalizing a binder results in many child binders needing to be normalized (i.e. an entity where it's pretty much guaranteed the compiler can find enough work for all cores on a machine). Ideally this would be two benchmarks:

  • one where every child in the tree of things to normalize is distinct. This lets us see just the effect of concurrency without the work queue being able to share any normalization results
  • one where children can be repeated in the tree. This lets us see the effect of the work queue in further reducing the time needed to normalize the entire design
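The queue-plus-shared-children structure described above can be sketched as a small standalone program. This is a hedged illustration only: the names (`children`, `normalizeAll`, the `Binder` type) are hypothetical and do not reflect Clash's actual implementation. Workers pull binders from a shared queue, and a "seen" set guarantees each binder is queued at most once even when several parents reference it, which is exactly the sharing the second benchmark above would exercise.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Control.Monad (forM_, replicateM_, when)
import qualified Data.Set as Set

type Binder = String

-- Hypothetical call graph: the children discovered while normalizing
-- a binder ("h" is shared between "f" and "g").
children :: Binder -> [Binder]
children "top" = ["f", "g"]
children "f"   = ["h"]
children "g"   = ["h"]
children _     = []

-- Workers pull binders from a shared queue; the "seen" set ensures a
-- binder is queued at most once even when several parents mention it.
normalizeAll :: Int -> Binder -> IO [Binder]
normalizeAll nWorkers root = do
  queue    <- newChan
  seenV    <- newMVar (Set.singleton root)
  doneV    <- newMVar []
  pendingV <- newMVar (1 :: Int)   -- queued-but-unfinished binders
  finished <- newEmptyMVar
  writeChan queue root
  let worker = do
        b <- readChan queue
        modifyMVar_ doneV (pure . (b :))      -- stand-in for real work
        forM_ (children b) $ \c -> do
          isNew <- modifyMVar seenV $ \seen ->
            pure (Set.insert c seen, not (Set.member c seen))
          when isNew $ do
            modifyMVar_ pendingV (pure . (+ 1))
            writeChan queue c
        remaining <- modifyMVar pendingV (\n -> pure (n - 1, n - 1))
        if remaining == 0 then putMVar finished () else worker
  replicateM_ nWorkers (forkIO worker)
  takeMVar finished
  readMVar doneV
```

With the toy graph above, "h" is normalized once despite having two parents, which is the effect the work queue is meant to provide.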

@vmchale (Contributor) commented Mar 22, 2022

@martijnbastiaan sort of? It runs faster instead of slower, but not always drastically better.

Let me gather more data.

@leonschoorl (Member) left a comment

Is there a way to easily disable concurrent normalization?
I'd think that disabling it might be useful when trying to debug the compiler: with it enabled, the various debug messages Clash generates can be interleaved from different things being normalized in parallel.

I tried running with +RTS -N1 or +RTS -maxN1 but that doesn't seem to work.

@vmchale (Contributor) commented Mar 22, 2022

comparison

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

This is just the clash-benchmark-normalization we have, not particularly parallel as far as I can tell (?)

@alex-mckenna (Contributor, Author)

Is there a way to easily disable concurrent normalization?

For easy debugging you would likely have to compile without -threaded. This is already somewhat of an implicit recommendation for when you need to use gdb with Haskell

@alex-mckenna (Contributor, Author) commented Mar 22, 2022

One thing we should probably do is make sure we record transformation history sensibly though. Actually debugging the normalization process becomes somewhat more hectic if rewrites are seen (e.g. through the medium of -fclash-debug-info AppliedTerm) in the order they are performed.

Perhaps that means waiting until an entity is normalized, and only then printing out the rewrite steps it took (with locked access to stdout). That way, even if entities aren't printed in the order they're normalized, you still see all their rewrite steps together without steps from other entities interspersed.
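The buffering idea can be sketched standalone. This is a hypothetical illustration, not Clash code: each worker records its rewrite-step messages in a local buffer and only takes a shared stdout lock once, when the whole entity is done, so one entity's steps are never interleaved with another's.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forM_)
import Data.IORef

-- Buffer the rewrite-step messages of one entity locally and print them
-- in a single critical section once the entity is fully normalized, so
-- steps from different entities never interleave on stdout.
normalizeWithLog :: MVar () -> String -> [String] -> IO ()
normalizeWithLog stdoutLock entity steps = do
  buf <- newIORef []
  forM_ steps $ \s ->
    modifyIORef' buf (("  applied " ++ s) :)   -- record, don't print yet
  msgs <- reverse <$> readIORef buf
  withMVar stdoutLock $ \() ->
    mapM_ putStrLn (("normalized " ++ entity) : msgs)
```

Entities may finish in any order, but each one's log comes out as a contiguous block.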

@vmchale vmchale force-pushed the concurrent-normalization branch from 8cf1a7e to 6866bcf Compare March 22, 2022 18:23
@martijnbastiaan (Member)

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

Okay, that's good - hopefully it means our examples are just not benefitting from concurrent normalization. An example of a more realistic codebase would be Christiaan's Contranomy, perhaps try there?

(GitHub please copy threaded conversation from GitLab next pls.)

@christiaanb (Member)

Other "larger" public code bases I can think of are:

  1. https://github.com/cbiffle/cfm (I started a port to Clash 1.4 over here: Port to Clash 1.4.2 cbiffle/cfm#2, perhaps port over to Clash 1.6 first)
  2. https://github.com/gergoerdi/clash-spaceinvaders
  3. https://github.com/gergoerdi/clash-compucolor2

@vmchale (Contributor) commented Mar 23, 2022

cc @christiaanb

Unfortunately it seems to be a little slower on Contranomy too!

Not concurrent:

time                 7.020 s    (6.929 s .. 7.067 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.929 s    (6.882 s .. 6.960 s)
std dev              48.11 ms   (31.76 ms .. 60.14 ms)

Concurrent:

time                 8.506 s    (8.338 s .. 8.615 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 8.438 s    (8.375 s .. 8.479 s)
std dev              61.79 ms   (24.56 ms .. 84.55 ms)

So you can see it's slower but nothing crazy (with work queues)

@alex-mckenna (Contributor, Author)

Yeah, this is fairly similar to my experience before Vanessa's improvements, i.e. the work queue and non-parallel GC (which I guess doesn't matter here anyway). At least Contranomy is large enough that the std. dev. doesn't go ridiculous.

@alex-mckenna (Contributor, Author)

Maybe what we want to do is add a new flag, -fclash-concurrent-normalization, which turns on concurrent normalization only; concurrent top-entity compilation remains always on. We'd document this flag with the caveat that it's only worth it for designs where normalization is sufficiently wide that cores have enough to do.

@vmchale (Contributor) commented Mar 23, 2022

Unfortunately cfm seems to require work and in any case we see slowdowns in the majority of our little examples so I think @alex-mckenna's suggestion is the way to go.

@vmchale vmchale force-pushed the concurrent-normalization branch from 352d131 to dc07f93 Compare March 23, 2022 17:47
@vmchale vmchale requested a review from leonschoorl March 23, 2022 17:49
@vmchale vmchale force-pushed the concurrent-normalization branch from dc07f93 to a1cf3d1 Compare March 24, 2022 13:38
@alex-mckenna (Contributor, Author)

To save you looking around for everything, adding a flag is basically:

  • add the new configuration option to ClashOpts
  • add the new option to the NFData, Eq and Hashable instances, and to defClashOpts further down in the same file
  • define the new flag in ClashFlags.hs
  • update the documentation for compiler flags

Then you can use the new flag by getting the ClashEnv from the monad and looking at its ClashOpts inside.
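As a rough, self-contained illustration of those steps (the field name opt_concurrentNormalization and the parseFlag function below are hypothetical stand-ins; the real ClashOpts record and flag parser have far more to them):

```haskell
-- A toy ClashOpts with just the new field; the real record carries
-- many more options.
data ClashOpts = ClashOpts
  { opt_concurrentNormalization :: Bool
  } deriving (Eq, Show)

-- Default options: concurrent normalization off unless requested.
defClashOpts :: ClashOpts
defClashOpts = ClashOpts
  { opt_concurrentNormalization = False
  }

-- A toy version of the flag handling: recognise the flag string and
-- flip the option, leaving unknown flags untouched.
parseFlag :: String -> ClashOpts -> ClashOpts
parseFlag "-fclash-concurrent-normalization" opts =
  opts { opt_concurrentNormalization = True }
parseFlag _ opts = opts
```

The normalization code would then branch on the field after reading ClashOpts out of the environment.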

@vmchale (Contributor) commented Mar 24, 2022

Oh wow, thank you!!

@vmchale vmchale force-pushed the concurrent-normalization branch 2 times, most recently from bd22ac3 to 8e7a8de Compare March 24, 2022 18:54
@vmchale vmchale force-pushed the concurrent-normalization branch from 8e7a8de to 9c2b6f9 Compare April 7, 2022 11:35
@@ -28,25 +29,25 @@ import Clash.Rewrite.Types (Rewrite, RewriteMonad)
-- | State of the 'NormalizeMonad'
data NormalizeState
= NormalizeState
{ _normalized :: BindingMap
{ _normalized :: MVar BindingMap
Member:

This is insufficient because inlineWorkFree calls normalizeTopLvlBndr

So why is this bad? Take the following example:

top x = (g x, f x)

f x = h + x
{-# NOINLINE f #-}
g x = h * x
{-# NOINLINE g #-}

h = expensive_workfree_expression

After normalizing top, we normalize f and g concurrently, and both will do an inlineWorkFree of h. They could then both start normalizing h before committing it to the cache. Thus I think normalizeTopLvlBndr should do something akin to:

normalizeTopLvlBndr isTop nm (Binding nm' sp inl pr tm _) = do
  normalizedV <- Lens.use (extra.normalized)
  cache <- takeMVar normalizedV
  case lookup nm cache of
    Just vMVar -> do
      putMVar normalizedV cache
      readMVar vMVar
    Nothing -> do
      tmp <- newEmptyMVar
      putMVar normalizedV (extendVarEnv nm tmp cache)
      -- ... continue normalization, producing value ...
      putMVar tmp value
      return value

It doesn't save us much time, because the thread arriving second still has to wait for the first. But the further along the first thread is, the more time we save.
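The placeholder-MVar pattern sketched above can be demonstrated as a standalone program (hedged illustration: hypothetical names, and plain Data.Map in place of Clash's VarEnv). The first caller for a key installs an empty MVar and computes; later callers block on that MVar instead of redoing the work.

```haskell
import Control.Concurrent.MVar
import qualified Data.Map.Strict as Map

-- A cache from key to an MVar holding the (eventually computed) value.
-- The cache MVar itself acts as a lock around lookups and insertions;
-- crucially, it is released *before* the expensive computation runs.
memoized :: Ord k => MVar (Map.Map k (MVar v)) -> (k -> IO v) -> k -> IO v
memoized cacheV compute k = do
  cache <- takeMVar cacheV
  case Map.lookup k cache of
    Just slot -> do
      putMVar cacheV cache      -- release the cache lock first
      readMVar slot             -- block until the first caller fills it
    Nothing -> do
      slot <- newEmptyMVar
      putMVar cacheV (Map.insert k slot cache)
      v <- compute k            -- only this caller does the work
      putMVar slot v
      return v
```

Because the placeholder is published before the computation starts, a second thread arriving mid-computation waits on the slot rather than starting a duplicate normalization.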

Contributor:

Done! That is clever.

It slows things down a little bit, however, perhaps due to more MVars or perhaps because I convert to lists in order to mapM over a VarEnv.

Member:

Ah... slowing down... not the effect I was hoping for :)

@martijnbastiaan (Member)

So a couple of things:

  1. We haven't been able to verify any (meaningful) speedups on public codebases. I'll go around and ask some commercial partners to see if they observe any speedups. If not, I don't think we should merge this PR given that it adds quite a bit of complexity.
  2. We should finish the flag work. Thanks @leonschoorl.
  3. We should clean up the commit history.

No use in doing (2) and (3) before (1) though.

@@ -0,0 +1 @@
CHANGED: Add concurrent normalization flag [#2074](https://github.com/clash-lang/clash-compiler/pull/2074)
Member:

Drive by review: Please explicitly name the flag (-fclash-concurrent-normalization) in the Changelog, otherwise people still have to hunt for it.

Member:

As well as a brief description of what it does. The Changelog should be an understandable stand-alone unit. It doesn't have to be extensive, but it should be informative.

Member:

Also, it should be ADDED, not CHANGED. It would probably be good to document somewhere the keywords we expect here.

Member:

I started documenting changelog best practices in #2169.

Co-authored-by: Vanessa McHale <vamchale@gmail.com>
@martijnbastiaan martijnbastiaan force-pushed the concurrent-normalization branch from 5b32515 to 861f350 Compare April 8, 2022 07:21
@@ -271,6 +271,11 @@ Clash Compiler Flags

.. _`Edalize`: https://github.com/olofk/edalize

-fclash-concurrent-normaliztation
Member:

Suggested change
-fclash-concurrent-normaliztation
-fclash-concurrent-normalization

Comment on lines +180 to +188
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
MVar.modifyMVar_ binds (pure . second (pair:))
else
MVar.putMVar binds (bound, pairs)
Member:

I think it could do with some explanation.
Is this accurate?

Suggested change
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
MVar.modifyMVar_ binds (pure . second (pair:))
else
MVar.putMVar binds (bound, pairs)
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
-- record the normalized id'
MVar.modifyMVar_ binds (pure . second (pair:))
else
-- id' is already (being) normalized
MVar.putMVar binds (bound, pairs)

Comment on lines +188 to +190
nextS <- Lens.use uniqSupply
normalizeStep q binds nextS
Member:

Couldn't this just be:

Suggested change
nextS <- Lens.use uniqSupply
normalizeStep q binds nextS
normalizeStep q binds s

Since the Supply is local to our monad and it couldn't have changed, right?

Comment on lines +376 to +378
MVar.withMVar ioLockV $ \() ->
traceWhen (hasTransformationInfo AppliedTerm opts)
("Dropping type application on TopEntity: " ++ showPpr (varName f) ++ "\ntype:\n" ++ showPpr tyArg)
@leonschoorl (Member) commented Apr 11, 2022:

Instead of having to do

    ioLockV <- Lens.use ioLock

    MVar.withMVar ioLockV $ \() ->
      traceWhen cond "..."

all over the place.

Maybe we should have a helper, something like:

traceWithIoLockWhen cond msg = when cond $ do
  ioLockV <- Lens.use ioLock
  MVar.withMVar ioLockV $ \() ->
    traceM msg

Which has the added benefit of not needing to get the lock when the condition is False.
(the name could do with some bikeshedding probably)

Contributor (Author):

Or just

Monad.when (hasTransformationInfo AppliedTerm opts) $
  MVar.withMVar ioLockV $ \() -> traceM ...

I would rather just use the "normal" when / unless from Control.Monad in situations like these instead of making new strangely-specific functions
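Alex's version amounts to guarding the lock acquisition with plain when. As a standalone sketch (simplified and hypothetical: the lock is passed explicitly here, whereas in Clash it would come from the rewrite monad's state):

```haskell
import Control.Concurrent.MVar
import Control.Monad (when)

-- Take the stdout lock only when the message will actually be printed,
-- using the ordinary 'when' from Control.Monad rather than a bespoke
-- helper; threads that have nothing to say never contend for the lock.
traceLocked :: MVar () -> Bool -> String -> IO ()
traceLocked ioLockV cond msg =
  when cond $
    withMVar ioLockV $ \() ->
      putStrLn msg
```

This keeps the benefit leonschoorl mentioned (no lock traffic when the condition is False) without introducing a new specialised function name.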

@martijnbastiaan (Member)

We've only been able to measure slight slowdowns, and never speedups. Let's keep the branch around, but I doubt we're going to merge this as is.
