Skip to content

Conversation

IsaacMilarky
Copy link
Contributor

Description

  • Add a date filter to the Facade contributer resolution logic
  • Now, we will only use commits that have appeared since last time facade was ran in contributer resolution
  • I have also removed the start_date setting because it wasn't doing anything
  • Added a helper function for if we need to get the date that the last commit was added

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change noted with some additional comments on how getting the since_date bulletproofed against dufus contributors who make commit_date random could affect some other parts of the code. LOOKS VERY GOOD! Let me know when this is ready to test.

@sgoggins sgoggins added the workers Related to data workers label Sep 2, 2025
@IsaacMilarky IsaacMilarky marked this pull request as ready for review September 9, 2025 22:31
Copy link
Contributor

@Ulincsys Ulincsys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question about the change to the makefile.

On many systems, that should fail, because installing packages systemwide is not supported anymore (on EG: Ubuntu), and I believe with UV, we are no longer intended to run Augur in a venv (as UV supposedly manages those automatically).

sgoggins
sgoggins previously approved these changes Sep 30, 2025
Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Lets let it fly!

@sgoggins
Copy link
Member

The test would be:

  1. New Augur instance
  2. Identify 3-5 very large repositories
  3. Do collection for them (note that for Facade collection it typically requires an Augur restart)
  4. Truncate the augur_operations.collection_status table
  5. Restart collection after commits were collected the first time and see if it goes insanely fast, which it should if this branch does what we claim it does.

@sgoggins sgoggins requested review from MoralCode and removed request for Ulincsys September 30, 2025 14:28
Copy link
Contributor

@ABrain7710 ABrain7710 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure we can still support full collection on demand I think it would good to pass a full_collection flag and if it is true then grab all commits, otherwise do this new logic to only get new commits

@MoralCode
Copy link
Contributor

in poking through the code, im actually not sure this even affects analyze_commits_in_parallel beyond removing some date related things

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@IsaacMilarky
Copy link
Contributor Author

in poking through the code, im actually not sure this even affects analyze_commits_in_parallel beyond removing some date related things

The partial commit collection was already supported previously in the body of the analyze_commits_in_parallel method:

    repo = get_repo_by_repo_id(repo_id)

    #Get the huge list of commits to process.
    absolute_path = get_absolute_repo_path(facade_helper.repo_base_directory, repo.repo_id, repo.repo_path, repo.repo_name)
    repo_loc = (f"{absolute_path}/.git")
    # Grab the parents of HEAD

    parent_commits = get_parent_commits_set(repo_loc)

    # Grab the existing commits from the database
    existing_commits = get_existing_commits_set(repo_id)

    # Find missing commits and add them
    missing_commits = parent_commits - existing_commits

    facade_helper.log_activity('Debug',f"Commits missing from repo {repo_id}: {len(missing_commits)}")

These changes just effect the contributor resolution logic to only go through commits that we haven't before by default.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@MoralCode
Copy link
Contributor

What would the behavior of this PR be in the case of a repo deciding to rewrite the history of their git repo? Is that something we can detect and maybe just fall back to the full recollection?

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
from augur.tasks.init.celery_app import AugurFacadeRepoCollectionTask
from augur.tasks.github.util.github_data_access import GithubDataAccess, UrlNotFoundException
from augur.tasks.github.util.github_random_key_auth import GithubRandomKeyAuth
from augur.application.db.models import Contributor
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
W0611: Unused Contributor imported from augur.application.db.models (unused-import)

@IsaacMilarky
Copy link
Contributor Author

What would the behavior of this PR be in the case of a repo deciding to rewrite the history of their git repo? Is that something we can detect and maybe just fall back to the full recollection?

The logic is based on git commit hashes for the main analysis logic and for the contributor resolution logic it's based on the date that the commit record was inserted.

So, if a repository is rewriting their history the hashes would presumably be different and would be collected by the main logic into the commits table. Then that new commit would be collected later and therefore not be filtered by the new logic. The date that it is filtering by is the date that the commit record was collected not the date the commit record was committed.

Comment on lines 25 to 28
conn.execute(text(f"""
INSERT INTO "augur_operations"."config" ("section_name", "setting_name", "value", "type") VALUES ('Facade', 'facade_contributor_full_recollect', '{0}', 'int');
"""))
# ### end Alembic commands ###
Copy link
Contributor

@MoralCode MoralCode Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A migration is not needed if you are simply adding a value to the database - its really only needed for changing the structure of the DB tables/columns themselves. I suspect you could add this to augur/application/config.py:~71 or so and it would get included in the database for anyone who has augur set to insert that config on startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotcha, I just wasn't sure where we create the config in the code. Have changed to reflect.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
workers Related to data workers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants