Search before asking
What happened
When the same physical repository is added to DevLake under more than one connection - which the UI currently allows without any warning - every entity collected from that repository (pull requests, issues, board associations, repo-commit links) is stored as a separate record for each connection.
For example, if a repository is connected via four connections, every pull request appears four times in the pull_requests table, with four different primary keys. Any metric computed over these tables - PR count, cycle time, throughput, DORA lead time - is inflated by the number of connections pointing at the same repo.
The UI gives no indication that this configuration will produce duplicate data. Users following the multi-connection workaround suggested in issue #7684 are silently corrupting their metrics.
Verification: All duplicate records share the same url field (e.g., https://github.com/owner/repo/pull/123). Running the following query confirms the problem:
SELECT url, COUNT(*) as copies
FROM pull_requests
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY copies DESC;
What do you expect to happen
When a user adds a repository scope that is already registered under a different connection (detected by matching html_url / clone_url across connections), the UI should display a clear warning before the user saves, for example:
"This repository is already connected via Connection 'GitHub Production'. Collecting it here will create duplicate pull requests and issue records, which will inflate all metrics for this repository."
The warning should not block the action - there are legitimate reasons to have the same repository under multiple connections (different scope configs, different team tokens). But the user should be able to make an informed choice.
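As a rough illustration of the matching step, the sketch below compares an html_url and an HTTPS clone_url by reducing both to a common key. The function names and the normalization rules (dropping the scheme, a trailing `.git`, and letter case) are assumptions made for this example, not existing DevLake code; SSH-style clone URLs are not handled.

```python
from urllib.parse import urlsplit


def normalize_repo_url(url: str) -> str:
    """Reduce an html_url or HTTPS clone_url to a comparable 'host/path' key."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Assumption: owner and repository names compare case-insensitively.
    # That holds for GitHub; other platforms may need a stricter rule.
    return f"{host}{path.lower()}"


def find_existing_connections(candidate_url, registered):
    """Return the names of connections that already hold the candidate repo.

    `registered` is a list of (connection_name, repo_url) pairs.
    """
    key = normalize_repo_url(candidate_url)
    return [name for name, url in registered if normalize_repo_url(url) == key]
```

A non-empty result would be the trigger for showing the warning; an empty list means the repository is not registered under any other connection.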
Additionally, a backend diagnostics endpoint would help existing installations detect the problem:
GET /api/scope-duplicates
Returns a list of repository URLs that appear under more than one connection, along with the affected connection IDs, so administrators can audit and clean up existing configurations.
How to reproduce
- Add the same GitHub repository to DevLake under two different connections.
- Run blueprints for both connections.
- Query pull_requests grouped by url - every PR will appear twice.
- Note that at no point during configuration does the UI warn about this.
Anything else
Proposed implementation
Backend - one new API handler that queries _tool_github_repos (and equivalent tables for other plugins) grouped by html_url, returning repos that appear under more than one connection:
GET /api/plugins/github/scope-duplicates
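The grouping such a handler would perform could look roughly like the sketch below, written against an in-memory SQLite database so it can be run on its own. The three-column `_tool_github_repos` table is a simplified stand-in for the real one, and the `htmlUrl` / `connectionIds` response keys are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a simplified _tool_github_repos table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE _tool_github_repos ("
        "connection_id INTEGER, github_id INTEGER, html_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO _tool_github_repos VALUES (?, ?, ?)",
        [
            (1, 100, "https://github.com/owner/repo"),
            (2, 100, "https://github.com/owner/repo"),  # same repo, second connection
            (1, 200, "https://github.com/owner/other"),
        ],
    )
    return conn


def find_scope_duplicates(conn):
    """List repository URLs registered under more than one connection."""
    rows = conn.execute(
        """
        SELECT html_url, connection_id
        FROM _tool_github_repos
        WHERE html_url IN (
            SELECT html_url
            FROM _tool_github_repos
            GROUP BY html_url
            HAVING COUNT(DISTINCT connection_id) > 1
        )
        ORDER BY html_url, connection_id
        """
    ).fetchall()
    duplicates = {}
    for html_url, connection_id in rows:
        ids = duplicates.setdefault(html_url, [])
        if connection_id not in ids:
            ids.append(connection_id)
    return [{"htmlUrl": url, "connectionIds": ids} for url, ids in duplicates.items()]
```

With the demo data, only `https://github.com/owner/repo` is reported, together with connection IDs 1 and 2.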
Config-UI - when a user selects a repository scope in the blueprint or connection wizard, call the endpoint and render a dismissible warning banner if the selected repo URL is already registered elsewhere.
Additional context
This issue affects all data-source plugins that support multiple connections to the same platform instance (GitHub, GitLab, Bitbucket, etc.).
A related workaround exists: deduplicating views over the domain tables using url as a natural key. We are willing to contribute that as a stopgap alongside the UI fix if it would be useful to the project (see konflux-ci#106).
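To make the idea of a deduplicating view concrete, here is a minimal sketch that keeps one row per url. It is not taken from konflux-ci#106: the view name `pull_requests_dedup`, the two-column table, and the choice to keep the row with the smallest id are all assumptions made for illustration.

```python
import sqlite3

DEDUP_VIEW_SQL = """
CREATE VIEW pull_requests_dedup AS
SELECT *
FROM pull_requests
WHERE id IN (SELECT MIN(id) FROM pull_requests GROUP BY url)
"""


def build_demo():
    """Create a simplified pull_requests table holding duplicated rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pull_requests (id TEXT PRIMARY KEY, url TEXT)")
    conn.executemany(
        "INSERT INTO pull_requests VALUES (?, ?)",
        [
            # The same pull request collected through four connections.
            ("github:GithubPullRequest:1:123", "https://github.com/owner/repo/pull/123"),
            ("github:GithubPullRequest:2:123", "https://github.com/owner/repo/pull/123"),
            ("github:GithubPullRequest:3:123", "https://github.com/owner/repo/pull/123"),
            ("github:GithubPullRequest:4:123", "https://github.com/owner/repo/pull/123"),
            # A pull request collected only once.
            ("github:GithubPullRequest:1:124", "https://github.com/owner/repo/pull/124"),
        ],
    )
    conn.execute(DEDUP_VIEW_SQL)
    return conn


def count_rows(conn, table):
    """Count rows in a table or view; `table` must be a trusted identifier."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Dashboards would have to query the view rather than the base table for the counts to change, and any table that references pull request ids would need a matching rule for which copy survives.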
Version
main
Are you willing to submit PR?
Code of Conduct