We’re receiving consistent reports of timeouts when users enable this. There are two distinct sources of timeouts:
Very long lines, or poorly written regexes, slow down regex matching.
Very large binary files (this ticket).
There’s a more detailed report of the issue from a savvy user.
There’s a limit on how long git cat-file is allowed to run before it times out.
Currently we reduce the likelihood of timeouts by batching 20 files at a time, but this can fail if there is a very large file that needs to be scanned.
It’s also very inefficient most of the time, when files are small.
Once we invoke git cat-file, we always consume all the bytes, even if we determine early on that a file is binary (reference FileValidationOutputHandler).
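To illustrate the point about deciding early that a file is binary, here is a minimal Python sketch (not the plugin's code). It assumes a NUL-byte check over the first 8000 bytes, which to my knowledge mirrors Git's own binary heuristic; the actual check in FileValidationOutputHandler may differ. The names `looks_binary` and `SNIFF_LIMIT` are hypothetical.

```python
import io

# Assumption: sniff window modelled on Git's own binary heuristic.
SNIFF_LIMIT = 8000


def looks_binary(stream, sniff_limit=SNIFF_LIMIT, chunk_size=1024):
    """Return (is_binary, bytes_consumed), reading at most sniff_limit bytes."""
    consumed = 0
    while consumed < sniff_limit:
        chunk = stream.read(min(chunk_size, sniff_limit - consumed))
        if not chunk:
            break  # end of file reached before the sniff limit
        consumed += len(chunk)
        if b"\x00" in chunk:
            # A NUL byte is treated as proof of binary content; stop reading.
            return True, consumed
    return False, consumed


# A 10 MB binary blob is classified after reading only the first chunk.
is_binary, consumed = looks_binary(io.BytesIO(b"\x00" + b"a" * 10_000_000))
```

Detecting the binary file early only helps if the remaining bytes do not have to be drained from the same git cat-file process, which is what the restart approach in this ticket is meant to achieve.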
Set the git cat-file timeout explicitly (reference )
Rather than feeding git cat-file a fixed list of files, set up a CommandInputHandler that can continue streaming input files on demand.
During file validation, check how much time has elapsed (e.g. every 1000 bytes). If it’s close to the timeout, break and restart scanning the file with a new git cat-file command.
During file validation, if we find a large binary file (e.g. > 3 MB), break and restart scanning with a new git cat-file command rather than consuming all the bytes.
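The "streaming input files on demand" idea can be sketched against the documented `git cat-file --batch` protocol: one object name is written to stdin, and one response (`<oid> <type> <size>`, a newline, the contents, a newline) is read back before the next name is sent. This is a rough Python illustration rather than the CommandInputHandler implementation; `read_batch_entry` and `stream_objects` are hypothetical names.

```python
import io


def read_batch_entry(out):
    """Read one `git cat-file --batch` response from a binary stream."""
    line = out.readline()
    if not line:
        raise EOFError("git cat-file closed its output")
    line = line.rstrip(b"\n")
    if line.endswith(b" missing"):
        # Unknown objects are reported as "<name> missing" with no body.
        return line[: -len(b" missing")].decode(), None, None
    oid, obj_type, size = line.split(b" ")
    data = out.read(int(size))
    out.read(1)  # consume the newline that follows the contents
    return oid.decode(), obj_type.decode(), data


def stream_objects(proc_stdin, proc_stdout, names):
    """Feed object names one at a time, yielding each response as it arrives."""
    for name in names:
        proc_stdin.write(name.encode() + b"\n")
        proc_stdin.flush()
        yield read_batch_entry(proc_stdout)
```

With a real process, the two streams would be the `stdin` and `stdout` pipes of `subprocess.Popen(["git", "cat-file", "--batch"], ...)`. Because the caller decides when to send the next name, it can stop at any point and start a fresh process, which is what makes the batch size dynamic.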
In addition to fixing timeouts, this will speed up scanning (based on profiling in SOTERIA-68) since the effective batch size will be much bigger. Source code files are usually small, and with a dynamic batching approach we can scan thousands of files in a single git cat-file, rather than just 20.
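The restart logic described above can be sketched roughly as follows. This is a Python simulation under stated assumptions, not the PR's code: each inner list in `sessions` stands for one git cat-file invocation, files are in-memory byte strings, and `scan_stream`, `scan_all`, `margin_s` and the injectable `clock` are all hypothetical. Skipping a file that still does not fit after one restart is my own assumption; the ticket does not say what should happen in that case.

```python
import io
import time
from enum import Enum

LARGE_BINARY_BYTES = 3 * 1024 * 1024  # "> 3 MB" threshold suggested in the ticket
CHECK_INTERVAL = 1000                 # look at the clock every 1000 bytes


class Outcome(Enum):
    DONE = "done"
    NEAR_TIMEOUT = "near_timeout"
    LARGE_BINARY = "large_binary"


def scan_stream(stream, size, deadline, clock=time.monotonic):
    """Scan one file, giving up early on a large binary or a nearby deadline."""
    consumed = 0
    while consumed < size:
        chunk = stream.read(min(CHECK_INTERVAL, size - consumed))
        if not chunk:
            break  # stream ended early; nothing more to read
        if consumed == 0 and size > LARGE_BINARY_BYTES and b"\x00" in chunk:
            return Outcome.LARGE_BINARY
        # Real validation (regex matching etc.) of `chunk` would go here.
        consumed += len(chunk)
        if consumed < size and clock() >= deadline:
            return Outcome.NEAR_TIMEOUT
    return Outcome.DONE


def scan_all(files, timeout_s, margin_s, clock=time.monotonic):
    """files is a list of (path, bytes). Returns (sessions, skipped)."""
    sessions, skipped = [[]], []
    deadline = clock() + timeout_s - margin_s
    i, retried = 0, False
    while i < len(files):
        path, data = files[i]
        outcome = scan_stream(io.BytesIO(data), len(data), deadline, clock)
        if outcome is Outcome.DONE:
            sessions[-1].append(path)
            i, retried = i + 1, False
            continue
        # Any other outcome ends the current session and starts a new one.
        sessions.append([])
        deadline = clock() + timeout_s - margin_s
        if outcome is Outcome.LARGE_BINARY or retried:
            skipped.append(path)  # large binary, or too slow even on its own
            i, retried = i + 1, False
        else:
            retried = True  # rescan the same file from the start
    return sessions, skipped
```

With small files and a generous timeout, everything lands in a single session, which is the "thousands of files in a single git cat-file" case; a restart only happens when a large binary or a slow file forces one.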
George’s statement on the work remaining: “I can walk through the optimization ticket w/ Andrey or Alexey. It’s working, just needs one failsafe to prevent it getting hung if a slow regex encounters catastrophic backtracking. I added a TODO in the code and pushed. You can grab the code from the PR and run it through its paces. I think it’s close to the theoretical limit of how fast we can scan linux, because even with 0 processing, ‘git cat-file’ of linux takes a while. Once the scanning thread speed matches the git cat-file speed there’s no additional improvement to be made because we’ll just be waiting on git to stream the data.”