Add options to scan remote Git repos #41

Open
4 of 5 tasks
samsmithnz opened this issue Aug 7, 2022 · 13 comments

Comments

@samsmithnz
Owner

samsmithnz commented Aug 7, 2022

@blueboxes

backstage.io scans repos using the APIs, as they perform better than plain git commands. It may be worth doing something similar.

@samsmithnz
Owner Author

@blueboxes I was planning to use the REST APIs - is that what you mean?

@blueboxes

blueboxes commented Aug 8, 2022

Yes, it was. I guess it is the obvious choice, though I have seen other approaches before.

@samsmithnz
Owner Author

Absolutely. Appreciate the input and engagement! Very helpful to confirm I'm moving in the right direction.

@jedjohan

jedjohan commented Aug 15, 2022

That's a great idea. Any plans on supporting scanning multiple repos? Azure DevOps uses a structure of projects and repos, where a project may contain multiple repos. In that case an option for "scan all repos in this project" would be useful 😎

https://docs.microsoft.com/en-us/rest/api/azure/devops/git/repositories/list?view=azure-devops-rest-6.0&tabs=HTTP

Need any help? Just let us know 👌

@samsmithnz
Owner Author

@jedjohan I'm looking into it right now - starting with repos, and then expanding to organizations/projects. The challenge I'm seeing so far is that listing all files is relatively expensive from an API perspective - I'm worried about hitting rate limits for large organizations/projects. I will try to build that into the error handling, but I'm not sure how it will scale.

@jedjohan

jedjohan commented Aug 15, 2022

Hmm, true. It will put some load on the process, and I guess it also means individual repos need to be able to fail without crashing the rest? Are you planning to run parallel scans after fetching the list of repos? A simple Polly retry on the HttpClient is a good start, I guess. Something like this maybe:

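    using System.Net;
    using Polly;
    using Polly.Extensions.Http;

    // Retries transient HTTP errors (5xx, 408, network failures) and 429 responses up to 3 times.
    // The delay doubles per attempt (2s, 4s, 8s) with up to 2s of random jitter added.
    // Logger is assumed to be an ILogger field on the containing class.
    public IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
    {
        return HttpPolicyExtensions
            .HandleTransientHttpError()
            .OrResult(msg => msg.StatusCode == HttpStatusCode.TooManyRequests)
            .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 2000)),
                onRetry: (message, timespan, attempt, _) => Logger?.LogInformation($"Retrying request to {message?.Result?.RequestMessage?.RequestUri} in {timespan.TotalSeconds} seconds. Retry attempt {attempt}."));
    }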

@samsmithnz
Owner Author

Yes - definitely want to add parallelism as I go - so far performance is very quick - ms times, unless the directory is massive, and even then it's still seconds.
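
For anyone following along, here is a rough sketch of what "parallel scans where one repo can fail without crashing the rest" could look like. It is not the project's actual code: `ParallelRepoScanner`, `RepoScanResult`, and the `scanRepo` delegate are hypothetical names, and the default of 4 concurrent scans is an arbitrary assumption. A `SemaphoreSlim` caps concurrency, and each repo's exception is captured in its result rather than rethrown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type: one entry per repo, failed or not.
public record RepoScanResult(string Repo, bool Succeeded, string? Error);

public static class ParallelRepoScanner
{
    // scanRepo is a placeholder for whatever performs the real per-repo scan.
    public static async Task<IReadOnlyList<RepoScanResult>> ScanAllAsync(
        IEnumerable<string> repos,
        Func<string, Task> scanRepo,
        int maxParallel = 4)
    {
        // Limits how many scans run at once, to go easier on API rate limits.
        using var gate = new SemaphoreSlim(maxParallel);

        var tasks = repos.Select(async repo =>
        {
            await gate.WaitAsync();
            try
            {
                await scanRepo(repo);
                return new RepoScanResult(repo, true, null);
            }
            catch (Exception ex)
            {
                // Record the failure instead of rethrowing, so other repos keep going.
                return new RepoScanResult(repo, false, ex.Message);
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        return await Task.WhenAll(tasks);
    }
}
```

The caller gets one result per repo, in the same order as the input, and can report or retry the failed ones separately.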

@blueboxes

blueboxes commented Aug 16, 2022

Do you need to scan all the files in each repo?

If you are not already thinking of this: since you only want to look at files with certain extensions, you could use the search API across the whole DevOps project and then fetch just those individual files?

@samsmithnz
Owner Author

@blueboxes I'm open to ideas. Here are the current challenges with GitHub - and while I haven't jumped into issues with other DevOps solutions, I'm guessing the other Git hosts are similar:

  • The search API is rate limited to 10 requests per minute, and 1000 results (https://docs.github.com/en/rest/search).
  • Because of the search restrictions, I'm using a tree call (see first post above).
  • Even then I need to call a contents API to get the text for each individual file - no matter the method to find the project files.
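
As an illustration of the tree-call approach in the list above, here is a minimal sketch of filtering a recursive tree response down to candidate project files, so that only those need a follow-up contents call. It assumes the JSON shape returned by GitHub's `GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1` endpoint (a `tree` array whose entries carry `path` and `type`). The `TreeListing` class and the extension list are hypothetical placeholders, not the tool's real set of project files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class TreeListing
{
    // Placeholder extensions for illustration only (an assumption, not the real list).
    private static readonly HashSet<string> ProjectExtensions =
        new(StringComparer.OrdinalIgnoreCase) { ".csproj", ".vbproj", ".fsproj" };

    // Returns the paths of blobs (files) in the tree whose extension looks like a project file.
    public static List<string> FindProjectFiles(string treeJson)
    {
        using JsonDocument doc = JsonDocument.Parse(treeJson);
        return doc.RootElement.GetProperty("tree")
            .EnumerateArray()
            .Where(e => e.GetProperty("type").GetString() == "blob")
            .Select(e => e.GetProperty("path").GetString()!)
            .Where(p => ProjectExtensions.Contains(Path.GetExtension(p)))
            .ToList(); // materialize before the JsonDocument is disposed
    }
}
```

One caveat worth handling in real code: the tree response also carries a `truncated` flag, which signals that a very large repository was not listed in full by a single call.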

@blueboxes

Just out of interest I have had a look at backstage.io which stores config files in each repo and pulls them into a software catalog. This uses a code search and then downloads each file from the results just as you have described.

https://github.com/backstage/backstage/blob/53a7275955f0c77db2e6e3a7afcee53a80447a4c/plugins/catalog-backend-module-azure/src/processors/AzureDevOpsDiscoveryProcessor.ts#L95

https://github.com/backstage/backstage/blob/1a1e711aced140635964d23794f11fc0d06ca172/plugins/catalog-backend-module-azure/src/lib/azure.ts#L40

The results can number a few hundred; I'm not sure what would happen if they hit the thousands. It does not retry on rate limiting, but it does page through the results.

Good to know this is how others have done this.

@samsmithnz
Owner Author

I'm finding it quite different processing a list of files from a REST API. When I'm scanning directories, I look in a folder for project files, and if I don't find one, I recursively jump into each sub-directory until I find a project file (or don't! :)). With a flat list of all files, I have to reconstruct that directory structure manually. If I don't, I would start to scan web.config files when I don't need to. But I also don't want to search for individual files - there are 8 different project files I scan for today, and it seems silly to search for each one (that is 8 search calls, plus a get-content call for each file found - vs 1 list-files call, plus a get-content call for each file found).
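
To make the "reconstruct the directory structure" step concrete, here is one possible sketch. It mimics "stop descending once a folder has a project file" on a flat list of `/`-separated paths by keeping only project files that have no project file in any ancestor directory. `ProjectRoots` and its method names are hypothetical, and this is only one interpretation of the directory walk described above, not the tool's actual logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProjectRoots
{
    // Input: flat list of project-file paths (e.g. already filtered from a tree listing).
    // Output: only those whose ancestor directories contain no project file.
    public static List<string> TopMostProjectFiles(IEnumerable<string> projectFilePaths)
    {
        var files = projectFilePaths.ToList();
        var projectDirs = new HashSet<string>(files.Select(DirectoryOf), StringComparer.Ordinal);
        return files.Where(f => !HasProjectAncestor(DirectoryOf(f), projectDirs)).ToList();
    }

    // "src/App/App.csproj" -> "src/App"; a file at the repo root maps to "".
    private static string DirectoryOf(string path)
    {
        int slash = path.LastIndexOf('/');
        return slash < 0 ? "" : path.Substring(0, slash);
    }

    // Walks up from dir towards the root, checking each ancestor for a project file.
    private static bool HasProjectAncestor(string dir, HashSet<string> projectDirs)
    {
        while (dir.Length > 0)
        {
            dir = DirectoryOf(dir);
            if (projectDirs.Contains(dir)) return true;
        }
        return false;
    }
}
```

For example, given `src/App/App.csproj`, `src/App/Legacy/Old.csproj`, and `tests/T/T.csproj`, the nested `Old.csproj` is dropped because `src/App` already holds a project file. The same ancestor check could be used to decide whether a web.config file sits inside an already-found project.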

@samsmithnz
Owner Author

GitHub scanning is done! I'm going to experiment with what happens when I add orgs (and iterate through every repo), but I suspect for the really large orgs some customers have, this will always time out and I need a solution that allows you to segment the load.

@samsmithnz changed the title from "Add connections to Git Repos to scan directory" to "Add options to scan remote Git repos" on Apr 17, 2023

3 participants