Passing context to crawler handlers #103
Curious which part of the current JS implementation you see as iffy? :]
We use this kind of API currently? You create the crawler class and decorate a global function with that? I recall some discussion around decorators, but honestly, I was expecting the same API as we have in the JS version, i.e. an options object with a request handler as a key. If we want this, I don't understand why it's called a "default request handler", as there can be only one request handler within a single crawler instance.

I like the first variant better, as that way you only care how a single interface is called, and then get IntelliSense on what it provides - as opposed to the second one, where you need to know both the key and its type yourself. But the second one is also an interesting idea. Could we support both at the same time, e.g. via different decorators?
The part where we disable type checking so that we can push an invalid object through an inheritance chain and hope that it will come out valid 🙂
Yes, but the example was slightly incorrect. Passing a request_handler is possible (even in the current implementation), but since lambdas (anonymous functions) are limited to a single expression in Python, it kinda sucks - you need to define a named function before you instantiate the crawler.
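For illustration, a minimal runnable sketch of that limitation - the Crawler class and handler names here are hypothetical stand-ins, not the actual API:

```python
import asyncio

class Crawler:
    """Hypothetical stand-in for the crawler class under discussion."""
    def __init__(self, request_handler):
        self.request_handler = request_handler

    async def run(self, url: str) -> None:
        await self.request_handler(url)

# A lambda can only hold a single expression, so multi-statement handler
# logic cannot be written inline as `Crawler(request_handler=lambda url: ...)`.
# A named function has to exist before the crawler is instantiated:
async def request_handler(url: str) -> None:
    print(f'Crawling {url}')
    await asyncio.sleep(0)  # stand-in for real async work

crawler = Crawler(request_handler=request_handler)
asyncio.run(crawler.run('https://example.com'))
```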
We could, but I have a bad feeling about it - users being confused and all that.
Ok good, for router the decorators actually make a lot of sense to me.
A question regarding version B: as a user, how would I determine which arguments are available for injection into a handler? Will there be a mechanism in the Crawler (e.g. a protocol or abstract method) that informs users about the handler interface, or will this information be provided only through documentation? In version A, I suppose the module with the Crawler would also contain a Context class to expect in the handler.
I believe that documentation, and possibly hints in exception messages if you attempt to use something we can't provide, are sadly our only options here. I'd be overjoyed if we could think of some way to advertise this with Python types.
Exactly. Though I'm afraid state-of-the-art Python type checkers are not capable of inferring the argument type - you have to specify it manually. But it's one look in the docs instead of several.
Handler decorator API
Using handler registration as a decorator reflects a more Pythonic approach to this kind of functionality, and it aligns with practices seen in various frameworks: the web frameworks Flask and FastAPI use decorators to register handlers for routes, Airflow uses them to register tasks for DAGs, and pytest uses them for registering fixtures and marks. I would rather give priority to doing things in a Pythonic way than strive for a uniform API with the JS version of the library.
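For reference, a minimal sketch of the registration pattern those frameworks share - all names here are hypothetical, not the proposed API:

```python
class Router:
    """Hypothetical router that collects handlers via a decorator."""
    def __init__(self):
        self._default_handler = None

    def default_handler(self, func):
        # Store the function and return it unchanged, so the decorated
        # function also remains directly callable.
        self._default_handler = func
        return func

router = Router()

@router.default_handler
async def handle_request(context):
    print(f'Handling {context}')
```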
More handler interfaces
To be honest, I don't see an advantage in offering multiple ways to achieve the same result, especially if it means maintaining a larger and more complex codebase. I would prefer to settle on one approach - even if it's not my preferred one - rather than complicate the library.
So which way to go
From the user's perspective, version A seems to be the winner to me. It just provides a better developer experience: explicit declaration of the handler interface, which means better type checking and code suggestions. Now it depends on whether the complexity and reduced flexibility outweigh the benefits. We already discussed it with Honza yesterday, and the main issue is the absence of an Intersection type in Python. Because of that, we are compelled to define explicit merged context classes and the potential boilerplate around them. It's not great, but IMO it remains manageable. So I slightly prefer version A (however, not 100% sure).
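To illustrate the merged-context issue mentioned above: since Python has no Intersection type, every combination of capabilities needs its own explicitly declared class. A sketch with hypothetical class names:

```python
from dataclasses import dataclass

@dataclass
class BasicCrawlingContext:
    url: str

@dataclass
class HttpCrawlingContext(BasicCrawlingContext):
    http_response: bytes

# There is no way to write "HttpCrawlingContext & HasParsedContent" as a
# type annotation, so each combination has to be spelled out by hand:
@dataclass
class ParsedHttpCrawlingContext(HttpCrawlingContext):
    parsed_content: str
```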
Understood, no problem with that, as long as we keep the same or similar DX - which I don't think would be the case with variant B.
Sounds good to me, a bit worse for maintenance, but much better observability for new users.
I prefer version A.
I believe it's clear that version A is the winner 🙂
The problem
We want to pick the best approach for passing context data/helpers to the various handler functions in crawlee.py. We already have an implementation in place, but if there's a better way, we should do it sooner rather than later.
What OG Crawlee does
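In short (as described in the comments above): the JS crawler constructor accepts an options object with a requestHandler key, and the handler receives a single context object carrying the request and its helpers.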
Python version A (+/- current implementation)
[Code example: the handler takes a single typed context argument, so every piece of data and every helper is accessed via the context. prefix.]
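A runnable sketch of what version A could look like; CrawlingContext, default_handler, and the other names are illustrative stand-ins rather than the actual API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class CrawlingContext:
    """Illustrative context; the real one would carry more data/helpers."""
    url: str

    def log(self, message: str) -> None:
        print(f'[{self.url}] {message}')

class Crawler:
    def __init__(self):
        self._handler = None

    def default_handler(self, func):
        self._handler = func
        return func

    async def _run_one(self, url: str) -> None:
        await self._handler(CrawlingContext(url=url))

crawler = Crawler()

# The single typed parameter is the point of version A: one look at
# CrawlingContext tells you (and your IDE) everything the handler can use.
@crawler.default_handler
async def handler(context: CrawlingContext) -> None:
    context.log('processing page')

asyncio.run(crawler._run_one('https://example.com'))
```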
Python version B
This proposal is similar to how pytest fixtures or FastAPI dependencies work.
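A runnable sketch of version B under the same hypothetical names: the crawler inspects the handler's signature and injects only the arguments it asks for, the way pytest resolves fixtures:

```python
import asyncio
import inspect

class Crawler:
    def __init__(self):
        self._handler = None

    def default_handler(self, func):
        self._handler = func
        return func

    async def _run_one(self, url: str) -> None:
        available = {
            'url': url,
            'log': lambda message: print(f'[{url}] {message}'),
        }
        # Pass only what the handler declared; a real implementation would
        # raise a helpful error for names it cannot provide.
        wanted = inspect.signature(self._handler).parameters
        await self._handler(**{name: available[name] for name in wanted})

crawler = Crawler()

# The handler asks for `url` and `log` by name; there is no context object.
@crawler.default_handler
async def handler(url, log):
    log('processing page')

asyncio.run(crawler._run_one('https://example.com'))
```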
[Code example: the handler declares the specific arguments it needs, which are injected directly, without the context. prefix.]
Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.