[Question][GitLab] Most GitLab Projects Not Selectable #8181
Hello @yu501954162! I'm here to help you with any bugs, questions, or contributions. Let's tackle this GitLab issue together. Most of your GitLab projects might not be selectable in the "Add Data Scope" section because repositories with guest permissions or those that are archived are excluded. Ensure that your role in the GitLab projects is not set to Guest, as repositories with only guest permissions are not shown in the list. Additionally, check if any of the projects are archived, as they will also be excluded from selection [1].
Hello @dosu, GitLab and Jira projects have slow data collection speeds: with over 1,000 data scopes, a full collection takes about 3 days. Is there any configuration that can improve DevLake's collection efficiency or change the data collection time period?
To improve data collection efficiency and reduce the collection time for GitLab and Jira projects in DevLake, you can adjust the "Fixed Rate Limit" setting for both platforms.

For GitLab, you can raise the rate limit to speed up data collection; the maximum rate limit for GitLab Cloud is 120,000 requests/hour. Make sure you do not exceed the rate limit enforced by your GitLab instance. Unselecting GitLab entities you do not need to collect also accelerates collection, and you can set the sync frequency to control how often data is collected.

For Jira, adjust the rate limit to raise or lower collection speed; if you encounter a 403 error, lower the rate limit. Contact your Jira Server admin to get or set the maximum rate limit of your Jira instance. Unselecting unneeded Jira entities also helps, and configuring the data time range and sync frequency can further optimize collection [1][2].
@dosu What are the recommended resources and DevLake configurations for data scope collection with such a large amount of data?
To efficiently collect data scopes with a large amount of data in DevLake, consider the following configurations:
These configurations help manage and optimize the data collection process, especially with large datasets [1][2][3][4][5].
@dosu Can DevLake configure pipelines to collect different projects in parallel?
Yes, DevLake can configure pipeline parallel collection for different projects. To enable or optimize parallel data collection, you need to adjust the `PIPELINE_MAX_PARALLEL` setting.
@dosu What are the recommended memory, CPU, and disk specifications for DevLake to collect large amounts of data?
I wasn't able to find specific recommended hardware specifications for memory, CPU, and disk space for running Apache DevLake in a production environment within the repository. However, for handling large amounts of data, it's generally advisable to ensure that your system has sufficient resources to accommodate the data processing and storage needs. This typically means having a robust multi-core CPU, ample RAM (at least 16GB or more), and sufficient disk space, preferably on SSDs, to handle the data volume and ensure smooth operation. Adjusting these resources based on your specific data size and processing requirements can help optimize performance.
@dosu Where can I change the configuration of `PIPELINE_MAX_PARALLEL`?
You can change the configuration of `PIPELINE_MAX_PARALLEL` in DevLake's `.env` file (or in the environment variables of the `devlake` container), then restart the service for the change to take effect.
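As a minimal sketch, assuming a Docker Compose deployment with a `.env` file next to `docker-compose.yml` (and that this variable name matches your DevLake version), the change could be applied like this:

```shell
# Assumption: DevLake runs via docker-compose and reads a .env file
# in the compose directory. Raise pipeline parallelism, then recreate
# the devlake container so the new value is picked up.
echo "PIPELINE_MAX_PARALLEL=4" >> .env
docker compose up -d devlake
```

Pick a value that matches your available CPU and memory; higher parallelism also multiplies the load on your GitLab/Jira instances and on the database.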
@dosu What is the storage cycle for data in DevLake, and how is it configured?
The cycle and configuration method for storing data in DevLake involves the following steps:
The synchronization policy allows you to set a time filter, frequency, and running policy. The "Skip failed tasks" option is enabled by default to prevent data loss during large data collections. You can rerun failed tasks later from the blueprint's detail page [1][2][3][4][5].
@dosu Do you have suggestions for DevLake performance optimization?
To optimize DevLake's performance, consider the following suggestions:
Implementing these strategies can help improve the performance and efficiency of DevLake in handling large datasets. |
@dosu Where do I configure the Data Collection Modes?
You can configure the Data Collection Modes in DevLake through the Config UI. To access the Config UI, visit http://localhost:4000. Once there, you can choose the desired mode for collecting data as part of Step 2.4 in the data collection process. DevLake supports three data collection modes: "Collect Data (Default)," "Collect Data in Full Refresh Mode," and "Re-transform Data" [1][2][3].
@dosu What is the difference between v1.0.0 and v1.0.1?
@dosu, how do I run tasks in parallel within a stage?
Tasks can be run in parallel within a stage in DevLake. A stage is a collection of tasks performed by data plugins, and all tasks within a stage are executed concurrently, even when they come from different plugins. This parallel execution allows for efficient processing of data collection, extraction, conversion, and enrichment jobs [1][2][3].
@dosu, how do I configure tasks to run in parallel?
To configure tasks to run in parallel in DevLake, you can utilize the pipeline structure, which allows tasks within the same stage to be executed in parallel. Here's how it works:
This setup allows for efficient parallel processing of tasks while maintaining dependencies across different stages [1].
@dosu, how do I set this up in the configuration file?
To configure tasks to run in parallel within a stage in the DevLake configuration file, define a single stage containing multiple tasks in a 2D array: each inner array is a stage, and tasks within the same stage run in parallel. Here is an example that runs a GitLab task and a Jira task in parallel within a single stage (the `projectId` is masked as in the original message):

```json
[
  [
    {
      "plugin": "jira",
      "options": {
        "boardId": 8,
        "connectionId": 1
      }
    },
    {
      "plugin": "gitlab",
      "options": {
        "connectionId": 1,
        "projectId": 116***98
      }
    }
  ]
]
```

In this configuration, the Jira and GitLab tasks run in parallel because they are defined within the same stage (the same inner array) [1].
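For contrast, here is a hedged sketch of the same two tasks split across two stages, so the Jira stage must finish before the GitLab stage starts. Stages (the inner arrays) execute sequentially in order; the `boardId`, `connectionId`, and `projectId` values below are illustrative placeholders, not taken from a real setup:

```json
[
  [
    { "plugin": "jira",   "options": { "boardId": 8, "connectionId": 1 } }
  ],
  [
    { "plugin": "gitlab", "options": { "connectionId": 1, "projectId": 12345 } }
  ]
]
```

Use separate stages like this when a later task depends on data produced by an earlier one; otherwise keeping independent tasks in one stage is faster.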
@dosu
The message field in the commit table is displaying garbled text for Chinese characters because the migration script modifies the
@dosu In GitLab, the "convert users" and "extract users" subtasks take a long time.
To optimize the performance of the "convert users" and "extract users" processes in the GitLab plugin of DevLake, you can adjust the rate limit configuration. By default, DevLake uses a dynamic rate limit of around 12,000 requests per hour for GitLab, but you can increase this to a maximum of 120,000 requests per hour for GitLab Cloud, depending on your specific needs and constraints [1]. This adjustment can help speed up data collection and processing times.
I'm running version v1.0.0-beta3@0b891d5. Can you advise why almost all of my GitLab projects in the "Add Data Scope" section are not selectable? I have tried different GitLab users (at the highest permission levels) and cannot determine why this is occurring. I am running GitLab Community Edition v17.1.2.