Historical Scan


Be able to scan commit history for vulnerabilities. This will let people catch things that were accidentally committed and then committed over.

Design considerations

Deduplication of found secrets

If a secret was committed in rev10, then subsequently removed in rev100, we don’t want our historical scan to show the same failure in rev11, rev12, rev13, etc. Our top-level historical scan report should only show the rev that a secret was originally introduced in.

Implementation idea:

  • Secret deduplication:

    • Salt & hash each secret using a strong cryptographic hash (SHA1 or stronger).

    • When a rule matches a secret, look up the previously stored secrets.

  • Avoiding unnecessary duplicate scanning:

    • Caching Option 1:

      • Cache list vulnerabilities found per blob ID, along with which rules the blob was already scanned it

      • If the same file is moved, renamed, or appears in a different branch, we won’t have to rescan it because the blob ID will be the same.

      • If users add a new rule, we won’t get a cache hit (since blob is scanned with old rules only) and will have to re-scan.

    • Caching Option 2:

      • Cache scan results on commit ID only, no blob caching.

      • Many branches and tags share common commits, so this way we avoid most duplicate scanning.

      • If users rewrite history we have to re-scan files because commits changed even though blobs are the same.

      • If users move or rename files, we have to re-scan them

      • If a new custom rule is added, there’s no way to re-scan the new rule only, have to re-scan all commits with all rules.

  • Scanning algorithm

    • When evaluating a commit, collect all the blobs changed from its parent commits (assuming the parent commit is already scanned).

    • Skip scanning blobs that are previously scanned, based on our cache

    • If a blob has hits, but those hits match the hashes of previously found secrets, skip reporting those

    • Note that this requires scanning the repository in reverse topological sort order, so we can use the results of the parent commit.


Even if users commit a whitelist annotation for a secret, that secret will persist in older commits without a whitelisted annotation. We don’t want to report historical failures for stuff that’s been whitelisted.

Implementation idea: Store a “whitelisted” flag for each secret hash (the one used for deduplicating secrets). Any secret that is whitelisted on a tip of a branch should be considered whitelisted everywhere. We should only look at the tip, otherwise we can never remove a whitelisting annotation.

Lack of propagation of whitelisting to older commits is already confusing users. See MS-3924 for an example.

Not caching secrets in our app

Spun off to

Discovering all commits to scan

We need to make sure to include all reachable commits are scanned.

  • Any commits reachable from refs/tags and refs/heads for sure (so git log --all)

  • What about dangling commits?

    • Users will fix issues found by rewriting history to removes secrets. This will by definition create dangling commits, so it makes sense to not report issues in dangling commits.

    • However, even if users fix issues by rewriting history, the secrets may still be present in Bitbucket unless a garbage collection is performed: link1, link2. In the initial version we should document this, in future versions we could add a feature to ensure offending commits are actually pruned.

  • What happens to pull requests if history is rewritten to remove a commit with secrets in it?

    • E.g. even if history is rewritten, is the offending commit still viewable in Bitbucket’s PR UI? We should test this to make sure.


Historical scans will be long-running processes. We may want to allow historical scans to be paused and resumed, rather than losing all progress if a scan has to be killed.

  • If we store results as commit ID => ScanResult map, we can easily restart a Historical scan because any previously scanned commits will be cache hits. (this is being performed on SOTERIA-48)

Handling mohami-config.yml

When doing historical scans, we don’t want to look for mohami-config.yml in every different commit. We should just use the config file found at the tip of the default branch. We should have an informational message on the scan page that only the config file from the default branch will be used for all older commits.


Historical Scans should be triggered from the Repository Scan Page. We should add a

button next to the Start Scanning button which opens a dropdown where you can start a historical scan. Then you would be redirected to the historical scan page.

Results would look like the current scanning report, but with the following changes:

  • Each item would show the diff where the vulnerability was introduced, instead of the raw content of the file.

  • Each item would be tagged with all branches which contain the revision listed.


George V @Mohami
May 25, 2020, 6:47 PM

I didn’t write the “Interface” section of the ticket, I think will have to weigh in on the UI design question

Mohammed Davoodi
May 25, 2020, 7:52 PM

yes, that page should be very similar to the current scanning page, so you should see stuff that was found before. If you need some help with the UI we can get Rytis if you need it.

Andrey Levchenko
May 26, 2020, 3:25 AM

Original question was where I should put navigation links to the history tab. Adding second 'Security' link to repository side panel doesn’t look good.

Mohammed Davoodi
May 26, 2020, 4:45 PM

There should be no new nav links. The only way to reach the historical scan page is from the repo scan page for now as I described with the new button.

George V @Mohami
June 11, 2020, 6:49 PM

Handling whitelisting well will be important – I updated the issue description with a link to , where a user thought whitelisting wasn’t working because the whitelist pragma in a newer commit didn’t whitelist the error in an older commit.


Andrey Levchenko


Mohammed Davoodi





Github URL