Skip to content

Conversation

tirix
Copy link
Collaborator

@tirix tirix commented May 23, 2022

This introduces a trait Resolver meant to handle work variables outside of metamath-knife.

The new trait shall be able to resolve work variable names like e.g. &W1 into a virtual symbol, as well as into the corresponding a virtual floating axiom, and vice-versa for printing back that symbol.

An implementation is provided in the form of Nameset, which does not resolve any additional work variables, but only regular symbols straight from the database.

@tirix tirix mentioned this pull request May 23, 2022
@david-a-wheeler
Copy link
Member

I haven't had time to review the code, but the idea makes complete sense. If no one objects for a while, I'd say merge it. I'd like to hear from @digama0 if he has a chance to review it.

src/formula.rs Outdated
}

fn label_for_symbol(&self, _: Symbol) -> (Label, TypeCode) {
panic!("Symbols in regular formulas shall only be regular symbols");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a panic? Even in the regular case, var -> varhyp lookups are a thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place where it is called currently is a case where it has been established the symbol tested is work variable, so an implementation is currently not required.
grammar.rs, line 1276:

      let opt_node = if !rslv.is_work_variable(token.symbol) {
          map.get(&(SymbolType::Constant, token.symbol))
      } else {
          // The next symbol is not a regular symbol, resolve it first
          let (label, typecode) = rslv.label_for_symbol(token.symbol);

But you are right that we could provide an implementation for the case where the symbol provided is a regular variable. I'll add that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit.

src/formula.rs Outdated
});
}
dt.finish()
}
}

type FlattenStack<'a> = Vec<(
Either<TokenIter<'a>, Symbol>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using an enum here instead of Either. It is not obvious what the two variants mean from the type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Copy link
Collaborator Author

@tirix tirix Jun 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.stack.pop();
self.next()
(Either::Right(symbol), _) => {
self.stack.pop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. You are popping the stack but you haven't done anything with the sibling_iter yet. Isn't that going to lose items following the work variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually whenever there is a single symbol on the stack, there are no siblings, so there is no sibling iterator to track.
This is actually better visible with the new enum you suggested above.

src/formula.rs Outdated
match self.symbol_token(symbol) {
Cow::Borrowed(token) => as_str(token).into(),
Cow::Owned(token) => String::from_utf8(token)
.expect("TokenPtr is supposed to be UTF8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be modified to use as_str()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try e.g. as_str(&token), Rust complains that I'm trying to return a reference to a local variable.
Indeed, as_str does not return an owned structure, but just a &str which is borrowed from its input token.

In any case there is duplicated code here, and I agree this is a concern. Maybe I could move this code to a new as_string function, right next to as_str ?
Or do you have a better suggestion ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_string sounds good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See as_string commit

@@ -1102,29 +1100,6 @@ impl Grammar {
{
// Empty parser command
match k.as_ref(buf) {
b"syntax" => match rest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this get removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I shall have put that in a separated PR.

This piece of code records the typecodes declared with the $j syntax commands.
However this is done both at the initialization phase, and again here during the handling of the special garden_path commands, which happens after a first grammar has been built.

As a result, the typecodes array contained duplicated elements.

@@ -116,7 +116,7 @@ fn test_parse_formula() {
&mut fmla_vec.clone().into_iter(),
&[wff, class],
false,
&names,
&names as &Nameset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&names as &Nameset,
&**names,

Alternatively and additionally, Arc<T: Resolver>: Resolver seems like a reasonable impl and would allow you to pass &names directly, although it is probably better at runtime to undo the indirection here rather than in the method calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check that - here this is only for test purpose, so we don't need to squeeze out the highest performance.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/nameck.rs Outdated
@@ -63,7 +63,7 @@ use std::u32;
///
/// [INC]: https://github.com/sorear/smetamath-rs/issues/11
#[derive(Copy, Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default, Hash)]
pub struct Atom(u32);
pub struct Atom(pub u32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for making this public? This will make garbage atoms user-constructible, and its relation to the other invariants of the library are not clear. I don't see why you would ever want to be able to write Atom(37). Read only access to the number can be done by a Atom::index() function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main drawback of this approach: an implementation of the Resolver trait needs to build and return its own Symbol for work variables, which are atoms.

Slightly better here might be to only provide a public constructor, this would at least make these atoms practically unmutable, since even with a mutable reference to one, a library caller could not modify the underlying identifier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the resolver needs to create Symbols, then it should have access to a name generator which can ensure that the Atoms are globally unique, and it should not be directly responsible for manipulating the number.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[must_use]
pub fn last_atom(&self) -> Atom {
Atom(self.atom_table.table.len() as u32)
pub fn is_extra_atom(&self, atom: Atom) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we should promise that extra atoms always come in a block after regular atoms, or even that there is a way to distinguish them. What if the parser is restarted after some extra atoms are created, in incremental use cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So right now I'm kind of eating the cake starting from both sides:
Nameset naturally allocates atoms from 0 up, continuously, while I'm getting the identifiers from MAX down, assuming the passed ID stays small.

What my client currently does to compute the ID, is that it also ranks them and assumes that small indices are used first (&W1, &S1, &C1, &W2, &S2, &C2, &W3, etc.) This allows a direct computation without any lookup.

There are currently around 100'000 atoms (99'577) in set.mm, so there is still plenty of space until 2^32.

Beyond the fact that this method works with direct computations and no lookup, which shall be faster, it most importantly works with an immutable Nameset instance. If Nameset were to reserve internal spots for the "extra" atoms, this means it needs to be mutated, which goes against the current philosophy of "once the DB is parsed, everything is immutable", which allows very fast execution since we only pass around Arcs to the different structures of the DB.

Let's assume we went up to work variable &W99999, and that each time the parser is restarted everything changes and it requires a new set of 100'000 atom indices. Using u32 as currently, the parsing could be restarted 43'000 times until there is a clash.

Do you have another idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond the fact that this method works with direct computations and no lookup, which shall be faster, it most importantly works with an immutable Nameset instance. If Nameset were to reserve internal spots for the "extra" atoms, this means it needs to be mutated, which goes against the current philosophy of "once the DB is parsed, everything is immutable", which allows very fast execution since we only pass around Arcs to the different structures of the DB.

I think that a name generator that does not mutate can't possibly work correctly. I think it is fine to use interior mutability here, and the only state a name generator needs to track is the last allocated ID so a AtomicU32 will suffice, so API-wise it can still take a &self.

I'm still a little dubious about this mmj2-styled work variable thing. This really ought to be some kind of enum, work vars are not the same as regular vars in a whole bunch of different ways, and adding phantom $f statements to the database seems like a really weird encoding to me.

@tirix tirix mentioned this pull request Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants