Concurrent normalization #2074

Closed · wants to merge 1 commit

Conversation

@alex-mckenna alex-mckenna commented Feb 3, 2022

Still TODO:

  • Write a changelog entry (see changelog/README.md)
  • Check copyright notices are up to date in edited files
  • Make sure the flag to turn it on/off works

@alex-mckenna alex-mckenna force-pushed the concurrent-normalization branch 6 times, most recently from b188fa8 to e2a4522 Compare February 8, 2022 16:52
@vmchale vmchale force-pushed the concurrent-normalization branch 4 times, most recently from f2e77e4 to de54ed3 Compare March 21, 2022 16:18
@vmchale vmchale marked this pull request as ready for review March 21, 2022 16:27
@vmchale vmchale requested review from christiaanb, martijnbastiaan and leonschoorl and removed request for christiaanb and martijnbastiaan March 21, 2022 16:28
@vmchale vmchale force-pushed the concurrent-normalization branch 6 times, most recently from 90b1223 to 8cf1a7e Compare March 21, 2022 20:06
@martijnbastiaan (Member)

@vmchale I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue? Have you got any numbers for us? :)

@alex-mckenna (Contributor, Author)

I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue?

Part of that was a lack of programs to try it on with a significant branching factor when normalizing top-level binders. For the benchmarks and examples we have, I found that not much actually runs at the same time (and since so many entities are also very small, concurrency doesn't help them). In many of our examples the number of children for a binder is typically 1, which in the worst case means no concurrency at all. When there are multiple children it typically doesn't go higher than 2 or 3 in a design, so the concurrency you actually observe is somewhat limited.

The point is, I don't think this can really be answered without a benchmark program where normalizing a binder results in many child binders needing to be normalized (i.e. an entity where it's pretty much guaranteed the compiler can find enough work for all cores on a machine). Ideally this would be two benchmarks:

  • one where every child in the tree of things to normalize is distinct. This lets us see just the effect of concurrency without the work queue being able to share any normalization results
  • one where children can be repeated in the tree. This lets us see the effect of the work queue in further reducing the time needed to normalize the entire design
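The queue-plus-shared-children structure described above can be sketched as a small standalone program. This is a hedged illustration only: the names (`children`, `normalizeAll`, the `Binder` type) are hypothetical and do not reflect Clash's actual implementation. Workers pull binders from a shared queue, and a "seen" set guarantees each binder is queued at most once even when several parents reference it, which is exactly the sharing the second benchmark above would exercise.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Control.Monad (forM_, replicateM_, when)
import qualified Data.Set as Set

type Binder = String

-- Hypothetical call graph: the children discovered while normalizing
-- a binder ("h" is shared between "f" and "g").
children :: Binder -> [Binder]
children "top" = ["f", "g"]
children "f"   = ["h"]
children "g"   = ["h"]
children _     = []

-- Workers pull binders from a shared queue; the "seen" set ensures a
-- binder is queued at most once even when several parents mention it.
normalizeAll :: Int -> Binder -> IO [Binder]
normalizeAll nWorkers root = do
  queue    <- newChan
  seenV    <- newMVar (Set.singleton root)
  doneV    <- newMVar []
  pendingV <- newMVar (1 :: Int)   -- queued-but-unfinished binders
  finished <- newEmptyMVar
  writeChan queue root
  let worker = do
        b <- readChan queue
        modifyMVar_ doneV (pure . (b :))      -- stand-in for real work
        forM_ (children b) $ \c -> do
          isNew <- modifyMVar seenV $ \seen ->
            pure (Set.insert c seen, not (Set.member c seen))
          when isNew $ do
            modifyMVar_ pendingV (pure . (+ 1))
            writeChan queue c
        remaining <- modifyMVar pendingV (\n -> pure (n - 1, n - 1))
        if remaining == 0 then putMVar finished () else worker
  replicateM_ nWorkers (forkIO worker)
  takeMVar finished
  readMVar doneV
```

With the toy graph above, "h" is normalized once despite having two parents, which is the effect the work queue is meant to provide.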

@vmchale (Contributor) commented Mar 22, 2022

@martijnbastiaan sort of? It runs faster instead of slower, but not always drastically better.

Let me gather more data.

@leonschoorl (Member) left a comment

Is there a way to easily disable concurrent normalization?
I'd think that disabling it might be useful when trying to debug the compiler: with it enabled, the various debug messages Clash generates can be interleaved from different things being normalized in parallel.

I tried running with +RTS -N1 or +RTS -maxN1 but that doesn't seem to work.

@vmchale (Contributor) commented Mar 22, 2022

comparison

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

This is just the clash-benchmark-normalization we have, not particularly parallel as far as I can tell (?)

@alex-mckenna (Contributor, Author)

Is there a way to easily disable concurrent normalization?

For easy debugging you would likely have to compile without -threaded. This is already somewhat of an implicit recommendation for when you need to use gdb with Haskell

@alex-mckenna (Contributor, Author) commented Mar 22, 2022

One thing we should probably do is make sure we record transformation history sensibly though. Actually debugging the normalization process becomes somewhat more hectic if rewrites are seen (e.g. through the medium of -fclash-debug-info AppliedTerm) in the order they are performed.

Perhaps that means waiting until an entity is normalized, and only then printing out the rewrite steps it took (with locked access to stdout). That way, even if entities aren't printed in the order they're normalized, you still see all their rewrite steps together without steps from other entities interspersed.
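The buffering idea can be sketched standalone. This is a hypothetical illustration, not Clash code: each worker records its rewrite-step messages in a local buffer and only takes a shared stdout lock once, when the whole entity is done, so one entity's steps are never interleaved with another's.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forM_)
import Data.IORef

-- Buffer the rewrite-step messages of one entity locally and print them
-- in a single critical section once the entity is fully normalized, so
-- steps from different entities never interleave on stdout.
normalizeWithLog :: MVar () -> String -> [String] -> IO ()
normalizeWithLog stdoutLock entity steps = do
  buf <- newIORef []
  forM_ steps $ \s ->
    modifyIORef' buf (("  applied " ++ s) :)   -- record, don't print yet
  msgs <- reverse <$> readIORef buf
  withMVar stdoutLock $ \() ->
    mapM_ putStrLn (("normalized " ++ entity) : msgs)
```

Entities may finish in any order, but each one's log comes out as a contiguous block.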

@vmchale vmchale force-pushed the concurrent-normalization branch from 8cf1a7e to 6866bcf Compare March 22, 2022 18:23
@martijnbastiaan (Member)

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

Okay, that's good - hopefully it means our examples are just not benefitting from concurrent normalization. An example of a more realistic codebase would be Christiaan's Contranomy, perhaps try there?

(GitHub please copy threaded conversation from GitLab next pls.)

@christiaanb (Member)

Other "larger" public code bases I can think of are:

  1. https://github.com/cbiffle/cfm (I started a port to Clash 1.4 over here: Port to Clash 1.4.2 cbiffle/cfm#2, perhaps port over to Clash 1.6 first)
  2. https://github.com/gergoerdi/clash-spaceinvaders
  3. https://github.com/gergoerdi/clash-compucolor2

@vmchale (Contributor) commented Mar 23, 2022

cc @christiaanb

Unfortunately it seems to be a little slower on Contranomy too!

Not concurrent:

time                 7.020 s    (6.929 s .. 7.067 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.929 s    (6.882 s .. 6.960 s)
std dev              48.11 ms   (31.76 ms .. 60.14 ms)

Concurrent:

time                 8.506 s    (8.338 s .. 8.615 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 8.438 s    (8.375 s .. 8.479 s)
std dev              61.79 ms   (24.56 ms .. 84.55 ms)

So you can see it's slower but nothing crazy (with work queues)

@alex-mckenna (Contributor, Author)

Yeah, this is fairly similar to my experience before Vanessa's improvements, i.e. the work queue and non-parallel GC (which I guess doesn't matter here anyway). At least Contranomy is large enough that the std. dev. doesn't go ridiculous.

@alex-mckenna (Contributor, Author)

Maybe what we want to do is add a new flag, -fclash-concurrent-normalization, which turns on concurrent normalization only; concurrent top-entity compilation remains always on. We'd document this flag with the caveat that it's only worth it for designs where normalization is sufficiently wide that cores have enough to do.

@vmchale (Contributor) commented Mar 23, 2022

Unfortunately cfm seems to require work and in any case we see slowdowns in the majority of our little examples so I think @alex-mckenna's suggestion is the way to go.

@vmchale vmchale force-pushed the concurrent-normalization branch from 352d131 to dc07f93 Compare March 23, 2022 17:47
@vmchale vmchale requested a review from leonschoorl March 23, 2022 17:49
@vmchale vmchale force-pushed the concurrent-normalization branch from dc07f93 to a1cf3d1 Compare March 24, 2022 13:38
@alex-mckenna (Contributor, Author)

To save you looking around for everything, adding a flag is basically:

  • add the new configuration option to ClashOpts
  • add the new option to the NFData, Eq and Hashable instances, and to defClashOpts further down in the same file
  • define the new flag in ClashFlags.hs
  • update the documentation for compiler flags

Then you can use the new flag by getting the ClashEnv from the monad and looking at its ClashOpts inside.
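As a rough, self-contained illustration of those steps (the field name opt_concurrentNormalization and the parseFlag function below are hypothetical stand-ins; the real ClashOpts record and flag parser have far more to them):

```haskell
-- A toy ClashOpts with just the new field; the real record carries
-- many more options.
data ClashOpts = ClashOpts
  { opt_concurrentNormalization :: Bool
  } deriving (Eq, Show)

-- Default options: concurrent normalization off unless requested.
defClashOpts :: ClashOpts
defClashOpts = ClashOpts
  { opt_concurrentNormalization = False
  }

-- A toy version of the flag handling: recognise the flag string and
-- flip the option, leaving unknown flags untouched.
parseFlag :: String -> ClashOpts -> ClashOpts
parseFlag "-fclash-concurrent-normalization" opts =
  opts { opt_concurrentNormalization = True }
parseFlag _ opts = opts
```

The normalization code would then branch on the field after reading ClashOpts out of the environment.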

@vmchale (Contributor) commented Mar 24, 2022

Oh wow, thank you!!

@vmchale vmchale force-pushed the concurrent-normalization branch 2 times, most recently from bd22ac3 to 8e7a8de Compare March 24, 2022 18:54
@vmchale vmchale force-pushed the concurrent-normalization branch from 8e7a8de to 9c2b6f9 Compare April 7, 2022 11:35
@@ -28,25 +29,25 @@ import Clash.Rewrite.Types (Rewrite, RewriteMonad)
-- | State of the 'NormalizeMonad'
data NormalizeState
= NormalizeState
{ _normalized :: BindingMap
{ _normalized :: MVar BindingMap
Member:

This is insufficient because inlineWorkFree calls normalizeTopLvlBndr

So why is this bad? Take the following example:

top x = (g x, f x)

f x = h + x
{-# NOINLINE f #-}
g x = h * x
{-# NOINLINE g #-}

h = expensive_workfree_expression

After normalizing top, we normalize f and g concurrently, and both will do an inlineWorkFree of h. They could then both start normalizing h before committing it to the cache. Thus I think normalizeTopLvlBndr should do something akin to:

normalizeTopLvlBndr isTop nm (Binding nm' sp inl pr tm _) = do
  normalizedV <- Lens.use (extra.normalized)
  cache <- takeMVar normalizedV
  case lookup nm cache of
    Just vMVar -> do
      putMVar normalizedV cache
      readMVar vMVar
    Nothing -> do
      tmp <- newEmptyMVar
      putMVar normalizedV (extendVarEnv nm tmp cache)
      -- ... continue normalization, producing value ...
      putMVar tmp value
      return value

It doesn't save us much time, because the thread arriving second still has to wait for the first. But the further along the first thread is, the more time we save.
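The placeholder-MVar pattern sketched above can be demonstrated as a standalone program (hedged illustration: hypothetical names, and plain Data.Map in place of Clash's VarEnv). The first caller for a key installs an empty MVar and computes; later callers block on that MVar instead of redoing the work.

```haskell
import Control.Concurrent.MVar
import qualified Data.Map.Strict as Map

-- A cache from key to an MVar holding the (eventually computed) value.
-- The cache MVar itself acts as a lock around lookups and insertions;
-- crucially, it is released *before* the expensive computation runs.
memoized :: Ord k => MVar (Map.Map k (MVar v)) -> (k -> IO v) -> k -> IO v
memoized cacheV compute k = do
  cache <- takeMVar cacheV
  case Map.lookup k cache of
    Just slot -> do
      putMVar cacheV cache      -- release the cache lock first
      readMVar slot             -- block until the first caller fills it
    Nothing -> do
      slot <- newEmptyMVar
      putMVar cacheV (Map.insert k slot cache)
      v <- compute k            -- only this caller does the work
      putMVar slot v
      return v
```

Because the placeholder is published before the computation starts, a second thread arriving mid-computation waits on the slot rather than starting a duplicate normalization.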

Contributor:

Done! That is clever.

It slows things down a little bit, however, perhaps due to more MVars or perhaps because I convert to lists in order to mapM over a VarEnv.

Member:

Ah... slowing down... not the effect I was hoping for :)

@martijnbastiaan (Member)

So a couple of things:

  1. We haven't been able to verify any (meaningful) speedups on public codebases. I'll go around and ask some commercial partners to see if they observe any speedups. If not, I don't think we should merge this PR given that it adds quite a bit of complexity.
  2. We should finish the flag work. Thanks @leonschoorl.
  3. We should clean up the commit history.

No use in doing (2) and (3) before (1) though.

@@ -0,0 +1 @@
CHANGED: Add concurrent normalization flag [#2074](https://github.com/clash-lang/clash-compiler/pull/2074)
Member:

Drive by review: Please explicitly name the flag (-fclash-concurrent-normalization) in the Changelog, otherwise people still have to hunt for it.

Member:

As well as a brief description of what it does. The Changelog should be an understandable stand-alone unit. It doesn't have to be extensive, but it should be informative.

Member:

Also, it should be ADDED, not CHANGED. It would probably be good to document somewhere the keywords we expect here.

Member:

I started documenting changelog best practices in #2169.

Co-authored-by: Vanessa McHale <vamchale@gmail.com>
@martijnbastiaan martijnbastiaan force-pushed the concurrent-normalization branch from 5b32515 to 861f350 Compare April 8, 2022 07:21
@@ -271,6 +271,11 @@ Clash Compiler Flags

.. _`Edalize`: https://github.com/olofk/edalize

-fclash-concurrent-normaliztation
Member:

Suggested change
-fclash-concurrent-normaliztation
-fclash-concurrent-normalization

Comment on lines +180 to +188
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
MVar.modifyMVar_ binds (pure . second (pair:))
else
MVar.putMVar binds (bound, pairs)
Member:

I think it could do with some explanation.
Is this accurate?

Suggested change
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
MVar.modifyMVar_ binds (pure . second (pair:))
else
MVar.putMVar binds (bound, pairs)
if not (id' `elemVarSet` bound)
then do
-- mark that we are attempting to normalize id'
MVar.putMVar binds (bound `extendVarSet` id', pairs)
pair <- normalize' id' q
-- record the normalized id'
MVar.modifyMVar_ binds (pure . second (pair:))
else
-- id' is already (being) normalized
MVar.putMVar binds (bound, pairs)

Comment on lines +188 to +190
nextS <- Lens.use uniqSupply
normalizeStep q binds nextS
Member:

Couldn't this just be:

Suggested change
nextS <- Lens.use uniqSupply
normalizeStep q binds nextS
normalizeStep q binds s

Since the Supply is local to our monad and it couldn't have changed, right?

Comment on lines +376 to +378
MVar.withMVar ioLockV $ \() ->
traceWhen (hasTransformationInfo AppliedTerm opts)
("Dropping type application on TopEntity: " ++ showPpr (varName f) ++ "\ntype:\n" ++ showPpr tyArg)
@leonschoorl (Member) commented Apr 11, 2022:

Instead of having to do

    ioLockV <- Lens.use ioLock

    MVar.withMVar ioLockV $ \() ->
      traceWhen cond "..."

all over the place.

Maybe we should have a helper, something like:

traceWithIoLockWhen cond msg = when cond $ do
  ioLockV <- Lens.use ioLock
  MVar.withMVar ioLockV $ \() ->
    traceM msg

Which has the added benefit of not needing to get the lock when the condition is False.
(the name could do with some bikeshedding probably)

Contributor (Author):

Or just

Monad.when (hasTransformationInfo AppliedTerm opts) $
  MVar.withMVar ioLockV $ \() -> traceM ...

I would rather just use the "normal" when / unless from Control.Monad in situations like these instead of making new strangely-specific functions
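Alex's version amounts to guarding the lock acquisition with plain when. As a standalone sketch (simplified and hypothetical: the lock is passed explicitly here, whereas in Clash it would come from the rewrite monad's state):

```haskell
import Control.Concurrent.MVar
import Control.Monad (when)

-- Take the stdout lock only when the message will actually be printed,
-- using the ordinary 'when' from Control.Monad rather than a bespoke
-- helper; threads that have nothing to say never contend for the lock.
traceLocked :: MVar () -> Bool -> String -> IO ()
traceLocked ioLockV cond msg =
  when cond $
    withMVar ioLockV $ \() ->
      putStrLn msg
```

This keeps the benefit leonschoorl mentioned (no lock traffic when the condition is False) without introducing a new specialised function name.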

@martijnbastiaan (Member)

We've only been able to measure slight slowdowns, and never speedups. Let's keep the branch around, but I doubt we're going to merge this as is.
