I want to collect specific data in ScanCode.io with GrimoireLab, driven by a ScanCode.io pipeline.
The outcome should be new ScanCode.io pipeline(s) to download and/or clone code and collect metrics, by orchestrating the execution of AboutCode, GrimoireLab, and other open source tools. Collected data is stored for further metric computations.
The high-level flow would be:
- Through a PurlDB API endpoint, the user requests metric scoring for a PURL
- If the package has already been analyzed, the data should be retrieved from the PurlDB database and returned
- Otherwise, PurlDB queues (or runs) a ScanCode.io data collection/metric computation scoring pipeline
- The pipeline collects the source/binary/git repos for the PURL
- It then does its magic in GrimoireLab
- The GrimoireLab analysis is returned somehow (webhook? polling? direct code integration in SCIO?)
- PurlDB gets the data back, saves it in its database, and returns the results
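To make the flow above a bit more concrete, here is a rough Python sketch of the "return stored data, otherwise run the pipeline and save" step. Everything in it is made up for illustration: `MetricStore`, `request_metrics`, and `fake_pipeline` are hypothetical names, not existing PurlDB or ScanCode.io APIs, and the pipeline is a stub that stands in for the real download plus GrimoireLab analysis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class MetricStore:
    """Hypothetical stand-in for the PurlDB table holding metric results."""

    results: Dict[str, dict] = field(default_factory=dict)

    def get(self, purl: str) -> Optional[dict]:
        return self.results.get(purl)

    def save(self, purl: str, data: dict) -> None:
        self.results[purl] = data


def request_metrics(
    purl: str,
    store: MetricStore,
    run_pipeline: Callable[[str], dict],
) -> dict:
    """Return stored metrics for a PURL, or run the pipeline and store its output."""
    cached = store.get(purl)
    if cached is not None:
        # Package already analyzed: answer from the database.
        return {"purl": purl, "status": "cached", "metrics": cached}

    # Not analyzed yet: run (or, in a real system, queue) the collection pipeline.
    metrics = run_pipeline(purl)
    store.save(purl, metrics)
    return {"purl": purl, "status": "computed", "metrics": metrics}


def fake_pipeline(purl: str) -> dict:
    """Placeholder for the ScanCode.io pipeline (fetch code, run GrimoireLab)."""
    return {"source": purl, "commits": 0}
```

Calling `request_metrics` twice with the same PURL runs the stub once and answers the second call from the store. The sketch runs the pipeline inline only to stay short; whether the real call should block or queue is still an open question.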
I suggest we implement a middle-out strategy: start with Grimoire, then move to ScanCode.io, then to PurlDB.
Questions:
- Does the PurlDB API call return immediately (with the work running in the background) or wait synchronously? (NB: We already have a similar pattern for the on-demand ScanCode scans in the API)
- What if the stored analysis is stale, for example 3 months old?
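One possible way to frame both questions, as a sketch only: decide per request whether to return the stored data, queue a new run, or do both. The 90-day threshold, the `decide_action` name, and the three action strings are assumptions for discussion, not an existing PurlDB behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed staleness threshold; the right value is an open question.
MAX_AGE = timedelta(days=90)


def decide_action(
    collected_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> str:
    """Pick what the API should do for a PURL given the age of its stored analysis."""
    if collected_at is None:
        # Never analyzed: queue a pipeline run and return immediately.
        return "queue"
    if now - collected_at > max_age:
        # Stale: return the old data right away and refresh in the background.
        return "return_and_requeue"
    # Fresh enough: return the stored data as-is.
    return "return"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(decide_action(None, now))
print(decide_action(now - timedelta(days=120), now))
print(decide_action(now - timedelta(days=10), now))
```

With this shape the API call never blocks on a long GrimoireLab run; the caller would poll or be notified once the queued run completes.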