Querying diffs is very slow on moderately large repositories #124

@mplanchard

Describe the bug

Queries on diffs for even moderately large repositories are incredibly slow. Our repository at work has ~5,500 commits.

The following query, which finds the diff with the most deletions, took ~27 minutes:

❯ time .cargo/bin/gitql --query 'select * from diffs order by deletions desc limit 1'
╭──────────────────────────────────────────┬───────────────────┬───────────────────────┬────────────┬───────────┬───────────────┬─────────────────────────┬───────────────────────────────────╮
│ commit_id                                ┆ name              ┆ email                 ┆ insertions ┆ deletions ┆ files_changed ┆ datetime                ┆ repo                              │
╞══════════════════════════════════════════╪═══════════════════╪═══════════════════════╪════════════╪═══════════╪═══════════════╪═════════════════════════╪═══════════════════════════════════╡
│ 8b685201464c3027afe9105bb5ed9b40a1befce7 ┆ Matthew Planchard ┆ msplanchard@gmail.com ┆ 3284       ┆ 41552     ┆ 212           ┆ 2024-08-15 18:15:45.000 ┆ /home/matthew/s/spec/.git         │
╰──────────────────────────────────────────┴───────────────────┴───────────────────────┴────────────┴───────────┴───────────────┴─────────────────────────┴───────────────────────────────────╯

________________________________________________________
Executed in   27.37 mins    fish           external
   usr time   27.25 mins  569.00 micros   27.25 mins
   sys time    0.04 mins    0.00 micros    0.04 mins

During the entire time, a single thread was pretty much pegged. I can get this same result using git and awk in a fraction (1/270th, 0.37%) of the time:

❯ time git log --pretty="@%h" --shortstat | tr "\n" " " | tr "@" "\n" | awk '{if ($7 > deletions) { deletions = $7; commit = $1 }}; END { print commit; print deletions }' 
8b6852014
41720

________________________________________________________
Executed in    6.01 secs    fish           external
   usr time    5.41 secs    0.00 millis    5.41 secs
   sys time    0.63 secs    1.78 millis    0.63 secs
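
One caveat about the git/awk comparison: the one-liner reads the deletion count from a fixed field (`$7`), which only holds for commits that have both insertions and deletions, because `--shortstat` omits a count when it is zero. The following is a minimal sketch of the same "largest deletion count" scan that matches the `deletions(-)` text instead of a field position. The function name `most_deletions` is hypothetical, and it assumes the log was produced with `--pretty="@%h" --shortstat` as in the command above.

```python
import re

# Assumption: commits are marked by lines of the form "@<abbreviated hash>",
# as produced by `git log --pretty="@%h" --shortstat`.
_COMMIT = re.compile(r"^@([0-9a-f]+)$")
# Matches "N deletion(-)" or "N deletions(-)" anywhere in a shortstat line.
_DELETIONS = re.compile(r"(\d+) deletions?\(-\)")


def most_deletions(log_text):
    """Return (abbreviated_hash, deletions) for the commit with the most deletions.

    Returns (None, -1) if no commit in the log reports any deletions.
    """
    best_commit, best_count = None, -1
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        commit = _COMMIT.match(line)
        if commit:
            # Start of a new commit; remember its hash for the stat line that follows.
            current = commit.group(1)
            continue
        deleted = _DELETIONS.search(line)
        if deleted and current is not None:
            count = int(deleted.group(1))
            if count > best_count:
                best_commit, best_count = current, count
    return best_commit, best_count
```

To run it against a real repository, the text can be captured with `subprocess.run(["git", "log", "--pretty=@%h", "--shortstat"], capture_output=True, text=True, check=True).stdout` and passed to `most_deletions`. This is only meant as a baseline for comparison, not a description of how GitQL computes the `diffs` table.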

Queries on commits seem to run in a more reasonable amount of time, e.g.:

❯ time .cargo/bin/gitql --query "select count(author_name) from commits where author_name like '%matthew%'"
╭──────────╮
│ column_2 │
╞══════════╡
│ 1001     │
╰──────────╯

________________________________________________________
Executed in  357.45 millis    fish           external
   usr time  351.94 millis    0.00 micros  351.94 millis
   sys time    4.62 millis  641.00 micros    3.98 millis

To Reproduce

  1. Check out any large repo
  2. Run the example command above

Expected behavior
Speed is at least within an order of magnitude of git/awk

GQL version
GitQL version 0.28.0



Labels: enhancement (New feature or request)
