Timeouts on large binary files

Description

We’re getting consistent reports of timeouts when users enable this. There are two distinct sources of timeouts:

  • Very long lines, or poorly written regexes, slow down regex matching ( )

  • Very large binary files (this ticket).

There’s a more detailed report of the issue from a savvy user.

Problem

  • There’s a limit to how long git cat-file is allowed to run before timing out

  • Currently we reduce the likelihood of timeouts by batching 20 files at a time, but this can fail if there is a very large file that needs to be scanned.

    • It’s also very inefficient most of the time, when files are small.

  • Once we invoke git cat-file, we always consume all the bytes, even if we determine early on that the file is binary (see FileValidationOutputHandler).

Proposed implementation

  • Set the git cat-file timeout explicitly (reference )

  • Rather than feeding git cat-file a fixed list of files, set up a CommandInputHandler that can continue streaming input files on demand.

  • During file validation, check how much time has elapsed (e.g. every 1000 bytes). If it’s close to the timeout, break and restart scanning the file with a new git cat-file command.

  • During file validation, if we find a large binary file (e.g. > 3 MB), break and restart scanning with a new git cat-file command rather than consuming all the bytes.
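
The two early-exit checks above can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin’s code: validate_stream, Outcome, SNIFF_BYTES and safety_margin are hypothetical names, the NUL-byte check is just one common heuristic for spotting binary content, and the 3 MB threshold is the example figure from the list above.

```python
import io
import time
from enum import Enum

CHUNK_SIZE = 1000                      # re-check the clock every 1000 bytes, as proposed
LARGE_BINARY_BYTES = 3 * 1024 * 1024   # example threshold from the ticket (assumed to mean 3 MB)
SNIFF_BYTES = 8000                     # hypothetical: only look for NUL bytes in this prefix


class Outcome(Enum):
    SCANNED = "scanned"                  # the whole file was consumed normally
    RESTART_TIMEOUT = "restart_timeout"  # deadline is close: abandon this command
    RESTART_BINARY = "restart_binary"    # large binary: abandon instead of draining it


def validate_stream(stream, declared_size, deadline,
                    safety_margin=1.0, clock=time.monotonic):
    """Consume one file's bytes, returning early when continuing is not worthwhile.

    Returns (outcome, bytes_consumed). On either RESTART_* outcome the caller
    would start a new git cat-file command rather than keep reading.
    """
    consumed = 0
    while consumed < declared_size:
        # Check 1: are we within safety_margin seconds of the command timeout?
        if deadline - clock() <= safety_margin:
            return Outcome.RESTART_TIMEOUT, consumed
        chunk = stream.read(min(CHUNK_SIZE, declared_size - consumed))
        if not chunk:
            break  # stream ended before declared_size bytes arrived
        # Check 2: a NUL byte near the start of a large file suggests a large binary.
        if (consumed < SNIFF_BYTES and b"\0" in chunk
                and declared_size > LARGE_BINARY_BYTES):
            return Outcome.RESTART_BINARY, consumed + len(chunk)
        consumed += len(chunk)
        # Regex scanning of `chunk` would happen here.
    return Outcome.SCANNED, consumed
```

The clock is injectable so the timeout path can be exercised without waiting; a real handler would take its deadline from whatever timeout is set on the git cat-file command.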

In addition to fixing timeouts, this will speed up scanning (based on profiling in SOTERIA-68) since the effective batch size will be much bigger. Source code files are usually small, and with a dynamic batching approach we can scan thousands of files in a single git cat-file, rather than just 20.
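
As a rough sketch of what dynamic batching could look like, the Python below feeds object IDs to a git cat-file --batch style process one at a time and stops as soon as a caller-supplied check asks it to. It relies only on the documented --batch output format (a "<oid> <type> <size>" header line, the contents, then a trailing newline, or "<oid> missing"). The function names and the should_stop callback are hypothetical, and for simplicity each object is read in full, which the real handler would avoid for large binaries.

```python
import io


def read_batch_entry(stdout):
    """Read one --batch response: b'<oid> <type> <size>\\n<contents>\\n'."""
    line = stdout.readline()
    if not line:
        raise EOFError("git cat-file closed its output unexpectedly")
    header = line.rstrip(b"\n").split(b" ")
    if len(header) == 2 and header[1] == b"missing":
        return header[0].decode(), None, b""
    oid, obj_type, size = header[0].decode(), header[1].decode(), int(header[2])
    contents = stdout.read(size)
    stdout.read(1)  # consume the LF that follows every object's contents
    return oid, obj_type, contents


def stream_objects(object_ids, stdin, stdout, should_stop=lambda: False):
    """Feed object IDs on demand instead of as a fixed list of 20.

    should_stop is a hypothetical hook (e.g. "the timeout is close"); when it
    returns True, the remaining IDs are left for a fresh git cat-file command.
    """
    for oid in object_ids:
        if should_stop():
            break
        stdin.write(oid.encode() + b"\n")
        stdin.flush()  # make sure git sees the request before we wait on its reply
        yield read_batch_entry(stdout)
```

In practice stdin and stdout would be the pipes of a running git cat-file --batch process; in-memory byte streams are enough to exercise the parsing and the stop condition.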

Environment

None

Activity

Mohammed Davoodi
September 28, 2020, 10:35 PM

George’s statement on the work remaining: “I can walk through the optimization ticket w/ Andrey or Alexey. It’s working, just needs one failsafe to prevent it getting hung if a slow regex encounters catastrophic backtracking. I added a TODO in the code and pushed. You can grab the code from the PR and run it through its paces. I think it’s close to the theoretical limit of how fast we can scan linux, because even with 0 processing, “git cat-file” of linux takes a while. Once the scanning thread speed matches the git cat-file speed, there’s no additional improvement to be made because we’ll just be waiting on git to stream the data.”

Assignee

Alexey Remnev

Reporter

George V @Mohami

Sprint

None

Labels

None

Github URL

None

Priority

High