-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ms/task creation performance fixes #8741
base: develop
Are you sure you want to change the base?
Ms/task creation performance fixes #8741
Conversation
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes in this pull request involve modifications across three main files: Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant MediaReader
participant ImageListReader
participant ZipReader
User->>MediaReader: Request image processing
MediaReader->>ImageListReader: Check source paths
ImageListReader->>ImageListReader: Use _source_path_set for membership
ImageListReader->>MediaReader: Return processed images
MediaReader->>ZipReader: Get image preview
ZipReader->>ZipReader: Handle 3D previews
ZipReader->>User: Return image preview
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
utils/dataset_manifest/core.py (1)
647-651
: Consider adding type hints and fixing indentationWhile the optimization is excellent, consider these improvements:
- Add type hints to improve code maintainability
- Fix the indentation on line 650 to match the surrounding code
Here's the suggested improvement:
- def get_subset(self, subset_names): + def get_subset(self, subset_names: List[str]) -> Tuple[List[int], List[Dict[str, Any]]]: index_list = [] subset = [] # First, create a dictionary mapping image names to their indices name_to_index = {name: index for index, name in enumerate(subset_names)} # Now, loop through the images and check against the dictionary for _, image in self: image_name = f"{image.full_name}" - if image_name in name_to_index: + if image_name in name_to_index:cvat/apps/engine/media_extractors.py (3)
Line range hint
1-1
: Consider adding dimension type validationWhile the dimension handling is correct, it would be beneficial to add explicit validation of the dimension type to prevent potential issues with invalid values.
Add validation in the
IMediaReader.__init__
method:def __init__( self, source_path, *, step: int = 1, start: int = 0, stop: Optional[int] = None, dimension: DimensionType = DimensionType.DIM_2D ): + if not isinstance(dimension, DimensionType): + raise ValueError(f"Invalid dimension type: {dimension}") self._dimension = dimension
Line range hint
1-1
: Consider adding performance monitoringTo validate and track the performance improvements, consider adding performance monitoring and logging.
Add timing decorators to key methods:
import time import logging from functools import wraps def log_performance(func): @wraps(func) def wrapper(*args, **kwargs): start_time = time.time() result = func(*args, **kwargs) duration = time.time() - start_time logging.info(f"{func.__name__} took {duration:.2f} seconds") return result return wrapperApply to key methods:
+ @log_performance def __contains__(self, media_file): return media_file in self._source_path_set
Line range hint
1-1
: Enhance error handling and documentationConsider improving error handling for corrupted files and invalid formats, and add documentation for error cases.
Add comprehensive error handling:
class MediaError(Exception): """Base exception for media processing errors.""" pass class InvalidFormatError(MediaError): """Raised when a file format is invalid or corrupted.""" pass class DimensionError(MediaError): """Raised when dimension-related operations fail.""" passUpdate error handling in methods:
def get_image_size(self, i): if self._dimension == DimensionType.DIM_3D: with open(self.get_path(i), 'rb') as f: properties = ValidateDimension.get_pcd_properties(f) + if not properties: + raise InvalidFormatError(f"Invalid or corrupted PCD file: {self.get_path(i)}") return int(properties["WIDTH"]), int(properties["HEIGHT"])cvat/apps/engine/task.py (1)
236-240
: Performance optimization: Use set for O(1) membership testingGood optimization! Converting the filter list to a set provides O(1) lookup time while maintaining the original file order in the final result.
Consider adding a comment explaining the performance optimization:
+# Convert to set for O(1) membership testing while maintaining original order without_extra_dirs_set = set(without_extra_dirs)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
cvat/apps/engine/media_extractors.py
(2 hunks)cvat/apps/engine/task.py
(2 hunks)utils/dataset_manifest/core.py
(1 hunks)
🔇 Additional comments (3)
utils/dataset_manifest/core.py (1)
644-646
: Great performance optimization!
Converting the list lookup to a dictionary-based approach improves the time complexity from O(n) to O(1) for each image name lookup. This is particularly beneficial when dealing with large datasets.
cvat/apps/engine/media_extractors.py (1)
206-206
: Excellent performance optimization!
Using a set for membership testing improves the time complexity from O(n) to O(1), which is particularly beneficial for tasks with large numbers of frames.
Also applies to: 311-311
cvat/apps/engine/task.py (1)
712-717
: Efficient validation of job file mapping
Good optimization! Using a set for membership testing improves performance while maintaining robust validation of job mapping files.
Let's verify the validation logic works as expected:
✅ Verification successful
Let me gather more information about the validation logic by checking the implementation of the file validation:
Validation logic is correctly implemented and comprehensive
The optimization and validation are well implemented with multiple layers of protection:
- Uses O(1) set lookups for performance
- Validates that every file in job mapping exists in server files
- Provides clear error messages identifying missing files
- Filters the server files to only include those specified in job mapping
- Has comprehensive validation checks preventing incompatible parameter combinations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that job file mapping validation catches missing files
# Test: Search for ValidationError exceptions in job file mapping validation
rg -A 5 "raise ValidationError.*job.*mapping"
Length of output: 2688
Script:
#!/bin/bash
# Search for the validation logic around the set membership check
rg -B 10 -A 10 "server_files_set = set\(data\['server_files'\]\)"
Length of output: 1546
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #8741 +/- ##
===========================================
+ Coverage 74.04% 74.05% +0.01%
===========================================
Files 409 409
Lines 43786 43788 +2
Branches 3984 3984
===========================================
+ Hits 32420 32427 +7
+ Misses 11366 11361 -5
|
Quality Gate passedIssues Measures |
Hi, thank you for the work done. Please consider comments below.
|
Task creation performance on >50k frames (#8602)
Motivation and context
In our environment we are expected to create tasks with more than 50k frames for single job. I understand this is unusual. This revealed some performance issues on task creation code that may also impact, in less extent, other cases.
How has this been tested?
Tested manually in local environment. Ran unit and integration tests.
For single job task, 140k frames, creation time dropped from 37 minutes to 90 seconds (using cloud storage source without copying images locally).
Checklist
develop
branch- [ ] I have updated the documentation accordingly- [ ] I have added tests to cover my changes- [ ] I have increased versions of npm packages if it is necessary(cvat-canvas,
cvat-core,
cvat-data and
cvat-ui)
License
Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation