Fix an issue of releasing lock for rq export job when the worker subprocess is killed #8721
Conversation
Walkthrough: The changes in this pull request involve multiple files that enhance dataset caching, locking mechanisms, and export functionality. Key modifications include the introduction of new constants for lock timeouts, an exception class for lock errors, and threading support for export operations. Existing functions have been updated with additional parameters and type hints for better clarity. The retry logic for export-related tests has also been adjusted to allow for more attempts. Overall, these updates improve error handling, flexibility, and maintainability within the dataset management system.
/check
❌ Some checks failed
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (9)
cvat/apps/dataset_manager/default_settings.py (2)
Lines 11-18: Consider adding minimum value guidance in the docstring. The implementation looks good, with a reasonable default of 5 minutes and clear documentation of the auto-extension mechanism.
Consider adding a note about the minimum safe value recommendation to prevent potential issues:

```diff
 """
 Default lifetime for the export cache lock, in seconds.
 This value should be short enough to minimize the waiting time until the lock is automatically
 released (e.g., in cases where a worker process is killed by the OOM killer and the lock is not
 released). The lock will be automatically extended as needed for the duration of the worker process.
+Note: It's recommended to keep this value above 60 seconds to account for potential network delays
+and system load during the export process.
 """
```
Lines 23-27: Enhance the deprecation warning message. The deprecation warning should include guidance on how to migrate to the new environment variable.

```diff
-    "The CVAT_DATASET_CACHE_LOCK_TIMEOUT is deprecated, "
-    "use DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT instead", DeprecationWarning)
+    "The CVAT_DATASET_CACHE_LOCK_TIMEOUT environment variable is deprecated. "
+    "Please use CVAT_DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT instead with the same value format.",
+    DeprecationWarning)
```

cvat/apps/dataset_manager/util.py (2)
Lines 103-104: Add a docstring to explain the exception usage. The `ExtendLockError` class is well-named and follows Python's exception naming convention. Consider adding a docstring to explain when this exception is raised.

```diff
 class ExtendLockError(Exception):
+    """Raised when a lock extension operation fails, typically during export job processing."""
     pass
```
Line 118: Document the `num_extensions` parameter and its implications. The implementation looks good. The parameter is well-typed, properly validated, and correctly passed to the Redlock constructor. Consider enhancing the function's docstring to explain:
- The purpose and behavior of `num_extensions`
- The relationship between `num_extensions` and `ttl`
- What happens when the maximum number of extensions is reached

```diff
 def get_export_cache_lock(
     export_path: os.PathLike[str],
     *,
     ttl: int | timedelta,
     block: bool = True,
     acquire_timeout: Optional[int | timedelta] = None,
     num_extensions: int | None = None,
 ) -> Generator[Lock, Any, Any]:
+    """Acquire a distributed lock for export cache operations.
+
+    Args:
+        export_path: Path to the export file to lock
+        ttl: Time-to-live for the lock
+        block: Whether to block waiting for the lock
+        acquire_timeout: Maximum time to wait for lock acquisition
+        num_extensions: Maximum number of times the lock can be extended.
+            None means unlimited extensions.
+
+    Raises:
+        ValueError: If ttl, acquire_timeout, or num_extensions is negative
+        LockNotAvailableError: If the lock cannot be acquired
+        ExtendLockError: If lock extension fails after reaching num_extensions
+    """
```

Also applies to: 132-133, 143-143
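A usage sketch of this helper, assuming it is exposed as a context manager (its `Generator[Lock, Any, Any]` return type suggests a `@contextmanager` wrapper) and that `LockNotAvailableError` is raised on acquisition timeout; the path and numeric values are illustrative:

```python
from datetime import timedelta

from cvat.apps.dataset_manager.util import LockNotAvailableError, get_export_cache_lock

export_path = "/data/cache/export/task_42.zip"  # illustrative path

try:
    with get_export_cache_lock(
        export_path,
        ttl=timedelta(minutes=5),                # short TTL, auto-extended by the worker
        acquire_timeout=timedelta(seconds=305),  # TTL plus a small buffer
        num_extensions=12,                       # cap extensions to the job timeout
    ):
        ...  # read or write the export file while holding the lock
except LockNotAvailableError:
    ...  # e.g., postpone the job instead of failing the request
```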
tests/python/rest_api/utils.py (2)
Line 47: LGTM! Consider adding a comment explaining the increased retry count. The increase in `max_retries` from 30 to 100 makes the tests more resilient to temporary delays that might occur when worker subprocesses are killed and locks need to be released. However, it would be helpful to add a comment explaining this rationale. Consider:
- Adding a comment explaining why 100 retries are needed
- Moving the retry count to a constant at the module level:

```diff
+# Maximum number of retries for export operations
+# Set to 100 (10 seconds with 0.1s interval) to handle cases where
+# worker subprocess termination causes temporary delays in lock release
+EXPORT_MAX_RETRIES = 100

 def wait_and_download_v1(
     endpoint: Endpoint,
     *,
-    max_retries: int = 100,
+    max_retries: int = EXPORT_MAX_RETRIES,
     interval: float = 0.1,
     download_result: bool = True,
     **kwargs,
 ) -> Optional[bytes]:
```

Also applies to: 78-78, 118-118, 156-156
Line range hint 89-89: Update docstrings to reflect the new default value. The docstrings of both `export_v1` and `export_v2` still mention that `max_retries` defaults to 30, but the actual default is now 100. Update the docstrings:

```diff
-        max_retries (int, optional): Number of retries when checking process status. Defaults to 30.
+        max_retries (int, optional): Number of retries when checking process status. Defaults to 100.
```
- max_retries (int, optional): Number of retries when checking process status. Defaults to 30. + max_retries (int, optional): Number of retries when checking process status. Defaults to 100.Also applies to: 169-169
cvat/apps/dataset_manager/views.py (3)
Lines 282-285: Avoid busy waiting and optimize thread coordination. The loop checking `export_thread.is_alive()` with a fixed `sleep(5)` can lead to delays in handling events or excessive waiting times. Consider using synchronization primitives such as `threading.Event`, or `export_thread.join(timeout)`, to optimize waiting:

```python
while export_thread.is_alive():
    if stop_event.is_set():
        raise ExtendLockError("Export aborted because the lock extension failed.")
    export_thread.join(timeout=1)
```

This approach reduces the fixed sleep interval and responds more quickly to thread completion or stop events.
Line range hint 204-311: Ensure proper exception handling in the `export` function. The `export` function may encounter exceptions that are not properly logged or handled, especially from threads. Implement comprehensive exception handling within the `export` function to capture and log any unexpected issues, and ensure that all resources are properly cleaned up in case of errors.
Lines 160-168: Implement exponential backoff for lock extension retries. The current retry logic attempts to extend the lock immediately up to the maximum number of retry attempts, which might not be effective if the underlying issue is transient. Implement an exponential backoff strategy for retries to provide a more robust retry mechanism:

```python
import time

for attempt_number in range(1, self.max_retry_attempt_count + 1):
    try:
        self.lock.extend()
        self._reset()
        break
    except Exception as ex:
        self.logger.glob.exception(...)
        sleep_time = 2 ** attempt_number
        time.sleep(sleep_time)
        if attempt_number == self.max_retry_attempt_count:
            self.stop_event.set()
            return
```

This approach increases the wait time between retries, allowing transient issues to resolve.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- cvat/apps/dataset_manager/default_settings.py (1 hunks)
- cvat/apps/dataset_manager/util.py (4 hunks)
- cvat/apps/dataset_manager/views.py (9 hunks)
- cvat/apps/engine/background.py (5 hunks)
- cvat/rqworker.py (0 hunks)
- tests/python/rest_api/utils.py (4 hunks)
💤 Files with no reviewable changes (1)
- cvat/rqworker.py
🔇 Additional comments (6)
cvat/apps/dataset_manager/default_settings.py (2)
Line 6: LGTM! The `warnings` module import is appropriately added to support the deprecation warning functionality.
Lines 11-35: Verify the lock release behavior under stress conditions. While the implementation looks good, it's crucial to verify the lock release behavior when workers are killed under various conditions.
✅ Verification successful
Based on the search results, I can now generate the final response since I have enough information about the lock handling implementation and testing:
Lock release behavior has been thoroughly tested under stress conditions
The codebase includes comprehensive test coverage for lock release scenarios:
- Tests verify lock behavior when workers are killed (test_concurrent_export_and_cleanup)
- Lock auto-extension mechanism is tested with proper timeouts
- Race conditions and TOCTOU issues are covered in test cases
- OOM killer scenarios are handled via the auto-release mechanism
- Lock acquisition timeouts are properly configured with buffer time
The implementation includes:
- Automatic lock extension during worker process lifetime
- Configurable lock TTL (default 5 minutes) to handle killed workers
- Lock acquire timeout with buffer to handle unreleased locks
- Proper cleanup on process termination
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Look for related test files and lock handling code
# Find related test files
echo "=== Related test files ==="
fd -e py -e yaml test_export
# Look for lock-related error handling
echo -e "\n=== Lock handling code ==="
rg -A 5 "lock|Lock" --type py
# Look for process termination handling
echo -e "\n=== Process termination handling ==="
rg -A 5 "kill|terminate|exit" --type py
Length of output: 89757
cvat/apps/dataset_manager/util.py (1)
Line range hint 103-143: Verify lock release behavior during subprocess termination. The changes look good for controlling lock extensions. However, let's verify how these changes integrate with worker subprocess termination handling.
cvat/apps/dataset_manager/views.py (1)
Line 256: Handle a possible `None` value from `rq.get_current_job()`. There is a potential risk that `rq.get_current_job()` might return `None`, which would cause an `AttributeError` when accessing `timeout`. Confirm that `rq.get_current_job()` cannot be `None` in this context. If there is a possibility, add a check to handle `None` values:

```python
current_job = rq.get_current_job()
if current_job is None:
    raise RuntimeError("Failed to retrieve the current RQ job.")
```

Ensure that the code handles cases where the current job is not available to prevent unexpected crashes.
cvat/apps/engine/background.py (2)
Line 17: The import of `DjangoScheduler` is appropriate; it is necessary for handling scheduled job cancellations later in the code.

Lines 104-110: Verify proper cancellation of deferred and scheduled jobs. The code introduces cancellation of deferred and scheduled jobs using `rq_job.cancel` and `scheduler.cancel`. Ensure that this approach correctly handles job dependencies and does not leave any dependent jobs in an inconsistent state.

Run the following script to list any dependent jobs and verify their statuses after cancellation; it will help confirm that dependents are appropriately managed when a job is cancelled.
✅ Verification successful
Job cancellation implementation is properly handling dependencies
The code analysis reveals that the job cancellation implementation is well-designed and safe:
- The `enqueue_dependents` parameter is consistently controlled by the `ONE_RUNNING_JOB_IN_QUEUE_PER_USER` setting across all cancellation calls
- The implementation is used in multiple places (views.py, background.py) with consistent behavior
- Job dependency handling is properly implemented in utils.py with appropriate checks
- After cancellation, jobs are properly cleaned up with `rq_job.delete()`
- The scheduler cancellation is handled separately for scheduled jobs to ensure proper cleanup (a sketch of this pattern follows below)
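A minimal sketch of the cancellation pattern summarized above, assuming rq's `Job.cancel(enqueue_dependents=...)` and rq-scheduler's `Scheduler.cancel()` APIs; the setting name comes from the review, while the queue wiring and helper name are illustrative rather than CVAT's actual code:

```python
import django_rq
from django.conf import settings


def cancel_export_job(queue_name: str, job_id: str) -> None:
    """Illustrative helper: cancel a deferred or scheduled export job and clean it up."""
    queue = django_rq.get_queue(queue_name)
    rq_job = queue.fetch_job(job_id)
    if rq_job is None:
        return

    if rq_job.is_scheduled:
        # Scheduled jobs also live in the scheduler's sorted set
        # (rq:scheduler:scheduled_jobs), so they are removed there as well.
        scheduler = django_rq.get_scheduler(queue_name)
        scheduler.cancel(rq_job)

    # Whether dependents are re-enqueued follows the same setting that controls
    # the one-running-job-per-user queueing policy.
    rq_job.cancel(enqueue_dependents=settings.ONE_RUNNING_JOB_IN_QUEUE_PER_USER)
    rq_job.delete()
```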
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that cancelling a job properly handles its dependents.
# Fetch all jobs in the queue
rq info -u redis://localhost:6379 --all -r
# Check for dependents of the cancelled job
JOB_ID="<INSERT_CANCELLED_JOB_ID>"
redis-cli SMEMBERS "rq:job:$JOB_ID:dependents"
Length of output: 341
Script:
#!/bin/bash
# Let's analyze the codebase to understand the job cancellation implementation and its handling of dependencies
# Search for job cancellation related code
rg -A 5 "cancel.*enqueue_dependents"
# Look for any tests related to job cancellation
fd test.*py --exec rg -l "cancel.*job"
# Search for scheduler cancellation implementation
ast-grep --pattern 'def cancel($$$)'
# Look for any configuration or settings related to job dependencies
rg "ONE_RUNNING_JOB_IN_QUEUE_PER_USER"
Length of output: 3061
Script:
#!/bin/bash
# Let's examine the job dependency handling implementation and tests
# Look for job dependency implementation
rg -A 10 "should_be_dependent" cvat/apps/engine/utils.py
# Check test implementation for job cancellation
rg -A 10 "test.*cancel.*job" tests/python/rest_api/test_queues.py
# Look for any error handling related to job cancellation
rg -B 2 -A 5 "except.*Job" cvat/apps/engine/
Length of output: 3580
```python
default_dataset_lock_acquire_timeout = default_dataset_export_lock_ttl + 5
"""
Set default lock acquire timeout to the default lock lifetime + small buffer
to handle possible cases when a lock wasn't released by the worker process
and will be released automatically by Redis
"""
DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT = int(os.getenv("DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT", default_dataset_lock_acquire_timeout))
```
🛠️ Refactor suggestion
Consider increasing the buffer time and adding value validation.
The current implementation has two potential issues:
- The 5-second buffer might be insufficient under high system load
- There's no validation to prevent setting these timeouts too low
Consider implementing these improvements:
```diff
-default_dataset_lock_acquire_timeout = default_dataset_export_lock_ttl + 5
+# Use 30-second buffer to better handle high-load scenarios
+default_dataset_lock_acquire_timeout = default_dataset_export_lock_ttl + 30
 """
 Set default lock acquire timeout to the default lock lifetime + small buffer
 to handle possible cases when a lock wasn't released by the worker process
 and will be released automatically by Redis
 """
-DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT = int(os.getenv("DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT", default_dataset_lock_acquire_timeout))
+raw_timeout = os.getenv("DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT", default_dataset_lock_acquire_timeout)
+timeout = int(raw_timeout)
+if timeout < 60:  # Minimum safe timeout
+    warnings.warn(
+        f"DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT value {timeout} is too low. "
+        "Setting it below 60 seconds may cause issues under load.",
+        RuntimeWarning)
+DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT = timeout
```
📝 Committable suggestion
```python
# Use 30-second buffer to better handle high-load scenarios
default_dataset_lock_acquire_timeout = default_dataset_export_lock_ttl + 30
"""
Set default lock acquire timeout to the default lock lifetime + small buffer
to handle possible cases when a lock wasn't released by the worker process
and will be released automatically by Redis
"""
raw_timeout = os.getenv("DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT", default_dataset_lock_acquire_timeout)
timeout = int(raw_timeout)
if timeout < 60:  # Minimum safe timeout
    warnings.warn(
        f"DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT value {timeout} is too low. "
        "Setting it below 60 seconds may cause issues under load.",
        RuntimeWarning)
DATASET_CACHE_LOCK_ACQUIRE_TIMEOUT = timeout
```
cvat/apps/dataset_manager/views.py (outdated)
```python
class ExtendLockThread(threading.Thread):
    def __init__(
        self,
        *,
        lock: Redlock,
        lock_extend_interval: int,
        stop_event: threading.Event,
    ):
        super().__init__(target=self._extend_lock)

        self.lock = lock
        self.lock_extend_interval = lock_extend_interval
        self.cur_sleep_interval = lock_extend_interval
        self.stop_event = stop_event
        self.logger = ServerLogManager(__name__)
        self.max_retry_attempt_count = 3

    def _reset(self):
        self.cur_sleep_interval = self.lock_extend_interval

    def _extend_lock(self):
        """
        Extend the lock's TTL every <lock_extend_interval> seconds until <stop_event> is set.
        The stop event is checked every second to minimize waiting time when the export process is completed.
        """

        while not self.stop_event.is_set():
            sleep(1)
            self.cur_sleep_interval -= 1

            if self.cur_sleep_interval:
                continue

            self.logger.glob.debug(
                f"Extend lock {self.lock.key}, number of remaining extensions: "
                f"{self.lock.num_extensions - self.lock._extension_num}"
            )
            for attempt_number in range(1, self.max_retry_attempt_count + 1):
                try:
                    self.lock.extend()
                    self._reset()
                except Exception as ex:
                    self.logger.glob.exception(
                        f"Attempt number: {attempt_number}, "
                        f"an exception occurred during lock {self.lock.key} extension: ",
                        str(ex),
                    )
                    if attempt_number == self.max_retry_attempt_count:
                        self.stop_event.set()
                        return
```
🛠️ Refactor suggestion
Avoid accessing protected members of Redlock. In line 154, accessing the protected member `self.lock._extension_num` directly violates encapsulation principles and may lead to issues if the internal implementation of `Redlock` changes. Consider modifying the logic to avoid relying on protected members. If necessary, extend the `Redlock` class to include a public method or property that provides the required information.
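A minimal sketch of that option, assuming the lock is pottery's `Redlock` (which accepts `num_extensions` and tracks `_extension_num` internally, matching the attributes used above); if a different `Redlock` implementation is in use, substitute its import. The subclass and property names are illustrative:

```python
from pottery import Redlock


class ExtendableRedlock(Redlock):
    """Redlock subclass that exposes the remaining extension count publicly."""

    @property
    def remaining_extensions(self) -> int:
        # The protected attribute is still read, but only in this one place,
        # so a change in the library only has to be absorbed here.
        return self.num_extensions - self._extension_num
```

The debug log line could then read `lock.remaining_extensions` instead of computing the difference from a protected attribute at the call site.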
cvat/apps/dataset_manager/views.py (outdated)
```python
)
acquire_timeout=EXPORT_CACHE_LOCK_ACQUIRE_TIMEOUT,
ttl=EXPORT_LOCK_TTL,
num_extensions=math.ceil(rq.get_current_job().timeout / EXPORT_LOCK_EXTEND_INTERVAL),
```
Validate division to prevent a potential ZeroDivisionError. Ensure that `EXPORT_LOCK_EXTEND_INTERVAL` is not zero to avoid a `ZeroDivisionError` when calculating `num_extensions`. Add a check before the calculation:

```python
if EXPORT_LOCK_EXTEND_INTERVAL <= 0:
    raise ValueError("EXPORT_LOCK_EXTEND_INTERVAL must be greater than zero.")
```

This validation will prevent runtime errors due to invalid configuration values.
cvat/apps/dataset_manager/views.py (outdated)
```python
while export_thread.is_alive():
    if stop_event.is_set():
        raise ExtendLockError("Export aborted because the lock extension failed.")
    sleep(5)

export_thread.join()
stop_event.set()
extend_lock_thread.join()
```
Handle exceptions from threads properly. If `export_thread` encounters an exception, it may not be propagated to the main thread, potentially leading to silent failures. Modify the `ExportThread` class to store any exceptions that occur during execution and check for them in the main thread after joining:

```python
# In ExportThread
def __init__(self, ...):
    ...
    self.exception = None

def _export_dataset(self):
    try:
        ...
    except Exception as e:
        self.exception = e

# In the main thread after export_thread.join()
export_thread.join()
if export_thread.exception:
    raise export_thread.exception
```

This ensures that any exceptions in the `export_thread` are not silently ignored.
cvat/apps/dataset_manager/views.py (outdated)
```python
try:
    self.lock.extend()
    self._reset()
except Exception as ex:
```
🛠️ Refactor suggestion
Catch specific exceptions instead of a general `Exception`. Catching all exceptions with `except Exception` can mask unexpected errors and make debugging more difficult. It's better to catch specific exceptions that you anticipate could occur during `self.lock.extend()`. Identify the specific exceptions that `self.lock.extend()` might raise (e.g., `LockExtensionError`, `LockNotOwnedError`) and handle them explicitly.

```diff
-                except Exception as ex:
+                except LockExtensionError as ex:
```

Adjust the exception handling as needed based on the actual exceptions raised by the `extend` method.
Committable suggestion skipped: line range outside the PR's diff.
/check
🚫 Workflows have been canceled
/check
❌ Some checks failed
Codecov Report. Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           develop    #8721      +/-   ##
===========================================
- Coverage    73.90%   73.88%   -0.03%
===========================================
  Files          409      408       -1
  Lines        43932    43969      +37
  Branches      3986     3986
===========================================
+ Hits         32470    32487      +17
- Misses       11462    11482      +20
```
@klakhov, Could you please take a look at the failed cypress test?
changelog.d/20241218_120228_maria_fix_export_job_lock_releasing.md (outdated; resolved)
Quality Gate passed
Motivation and context

The main problem fixed by this PR is as follows: in the previous implementation, "long" locks were used when exporting a resource or deleting an export cache. If the export process was killed (e.g., by the OOM killer with signal 9), the acquired lock was not released and remained active until the auto-release timeout expired (e.g., 4 hours). A subsequent user request to export a dataset could not acquire the lock, causing the job to be scheduled for execution after 60 seconds (the default value). When the scheduled job ran again, it still could not acquire the lock, and the entire process was repeated. Additionally, if a user initiated the export process after the job was marked as scheduled, they were unable to re-initiate the process and received an error because the RQ job status was not set and handled correctly (it remained STARTED).

One more problem that was found and fixed is that two users with rights to export a resource could not run the export in parallel (with the same options, such as format and save_images): one of them received a LockNotAvailableError.

How it was fixed:
- … SCHEDULED status and are removed from scheduler jobs (the rq:scheduler:scheduled_jobs set)
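A minimal sketch of the locking approach described above: a short lock TTL combined with periodic extension while the worker is alive, so that a killed worker only blocks other exports until the TTL expires. This is an illustration, not the PR's code; the lock object is a stand-in with an `extend()` method, and the interval value is an example:

```python
import threading


def run_with_auto_extended_lock(lock, do_export, extend_interval: int = 60) -> None:
    """Run do_export() while periodically extending `lock` (a stand-in object
    with an extend() method). If the process is killed, the lock simply
    expires after its short TTL instead of staying held for hours."""
    stop_event = threading.Event()

    def keep_extending() -> None:
        # wait() returns False on timeout, True once the export signals completion
        while not stop_event.wait(timeout=extend_interval):
            lock.extend()  # push the expiration forward while work is in progress

    extender = threading.Thread(target=keep_extending, daemon=True)
    extender.start()
    try:
        do_export()
    finally:
        stop_event.set()
        extender.join()
```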
How has this been tested?
Checklist
- I submit my changes into the `develop` branch
- I have increased versions of npm packages if it is necessary (cvat-canvas, cvat-core, cvat-data and cvat-ui)

License
- I submit my code changes under the same license that covers the project. Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit

Release Notes
- New Features
- Bug Fixes
- Documentation
- Tests: increased `max_retries` for export-related functions to enhance reliability during export processes.