aws-s3: Inconsistent package list in cluster deployment #317
Comments
Continuing the discussion. The package list is (lazy-)loaded and cached in memory. I'm digging into a related but old cluster thread, where @wzrdtales originally asked how to solve the same problem.

Solution A: remove the cache of the package list. It seems a no-brainer, but it has a race condition. Say two users publish two new packages at the same time: both verdaccio instances read the package list from S3, add their package name, and write it back to S3. The second write overwrites the first, so the first newly added package goes missing. So solution A doesn't work as-is.

Solution B: get rid of the package list file. Use the directory name to represent an existing package.
Solution C: move the package list file to redis or another database. Take redis as an example: it's fast enough that you don't need to worry about performance, and we can save the package list as a hash.
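For illustration, here is a rough sketch of what that could look like with ioredis; the key name, field layout, and helper functions are my assumptions, not a settled design:

```ts
// Sketch only: store the package list as a Redis hash keyed by package name.
// The "verdaccio:packages" key and the field payload are assumptions.
import Redis from "ioredis";

const redis = new Redis(); // defaults to redis://127.0.0.1:6379

// Adding a package is a single atomic HSET, so two verdaccio instances
// publishing at the same time can no longer overwrite each other's entry.
async function addPackage(name: string): Promise<void> {
  await redis.hset("verdaccio:packages", name, JSON.stringify({ addedAt: Date.now() }));
}

async function removePackage(name: string): Promise<void> {
  await redis.hdel("verdaccio:packages", name);
}

// Listing packages is one HKEYS call; no shared JSON file to read/modify/write.
async function listPackages(): Promise<string[]> {
  return redis.hkeys("verdaccio:packages");
}
```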
Thoughts? |
Updated solution B in #275 and added scoped package support. I implemented solution B first, since it's the cleanest one and doesn't change the structure of the verdaccio storage plugin. Solution C, however, depends on how we integrate redis into the system.
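For reference, a minimal sketch of the idea behind solution B using the v3 AWS SDK; the bucket name and key prefix are placeholders, and this is not the PR's actual code (scoped packages need one more level of listing):

```ts
// Sketch: derive the package list from S3 "directory" names instead of a db file.
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function listPackageNames(bucket: string, keyPrefix = ""): Promise<string[]> {
  const res = await s3.send(
    new ListObjectsV2Command({ Bucket: bucket, Prefix: keyPrefix, Delimiter: "/" })
  );
  // Each top-level "folder" (common prefix) corresponds to one package.
  return (res.CommonPrefixes ?? []).map((p) =>
    (p.Prefix ?? "").slice(keyPrefix.length).replace(/\/$/, "")
  );
}
```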
|
For the curious, I had a look at the google-cloud storage backend created by @griffithtp. Surprisingly, the implementation should work well in a cluster (but I'm not using google cloud :-(). It can be a good reference for understanding the issue. In the google-cloud storage backend,
Based on that information, I'm more confident about implementing solution C as option C-1, a new storage backend. The C-1 solution will be similar to the google-cloud one,
|
Hi @favoyang, thanks for all of this thread, and sorry for the delay. I'll read all your proposals carefully; I know you are right, and I'll post my thoughts about it here later. |
I see, well, the #275 approach is interesting. When I made the google-cloud plugin I could use the datastore to make it scalable. The AWS plugin is a fork of verdaccio-s3-storage, which in turn is a fork of the default local storage. First of all, I'll move this ticket to the monorepo, since Verdaccio does not define itself as scalable by default; that depends on which storage you are using. Second, let's give option B a try; I'll run some tests and see what happens, though I'm worried about the performance of that approach. Third, I don't have any example setup for practising clustering with Verdaccio; my dream is to have a nice local setup with Kubernetes to easily scale and run more accurate tests. I'd like to keep the |
The AWS version of google-cloud/datastore is DynamoDB. But I don't like the idea, because the plugin is called aws-s3, not aws. For example, I'm maintaining a deployment on DigitalOcean Spaces, an S3-compatible solution, and the DigitalOcean platform doesn't have DynamoDB. I will update solution B (#275) later. It's the simplest way to solve the scalability issue, at least to some degree, without extra dependencies.
I've used #275 in a deployment for a while (low traffic so far). This deployment doesn't support scoped packages, so the list-packages endpoint performs in O(1). It's the ideal scenario. This also raises an interesting point about splitting the data store for different types of data.
I wouldn't be surprised if, in a real-world cluster, the store ended up being named s3-mysql / s3-postgresql / s3-mysql-redis. |
+1, got bit by this too (using ECS+S3, will probably scale to 1 instance in the meantime).
|
I'm facing the same issues described by @favoyang when trying to deploy multiple Verdaccio instances with the aws-s3-storage plugin. I dived into this thread and had a look at your pull request #275, but I'm worried about some of the changes added there. Currently, only packages published on the proxy are saved in the database JSON file. With the changes in your pull request, every package in the storage is considered an "internal" package, and is therefore affected by the "add"/"remove"/"search" APIs. Maybe solution A you exposed (the simplest one, I think, since it doesn't add a database mechanism) should be reconsidered, with the addition of a simple locking mechanism? |
You're right, because there's no way to filter out uplinked (proxied) packages based on the directory name alone.
Prefix search itself seems fast enough, though you may still need multiple network round-trips because the data can be returned chunk by chunk. But I don't really worry about it, because it's rare to use verdaccio to serve many thousands of packages; the verdaccio UI would become the bottleneck first.
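To make the round-trip point concrete, this is roughly how the paginated listing looks with the v3 AWS SDK; each page is one request, and S3 caps a page at 1000 entries (a sketch, with the bucket name as a placeholder):

```ts
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Each iteration is one network round-trip; very large registries need
// several requests because S3 returns at most 1000 entries per page.
async function listAllPrefixes(bucket: string): Promise<string[]> {
  const prefixes: string[] = [];
  let token: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: bucket, Delimiter: "/", ContinuationToken: token })
    );
    for (const p of page.CommonPrefixes ?? []) {
      if (p.Prefix) prefixes.push(p.Prefix);
    }
    token = page.NextContinuationToken;
  } while (token);
  return prefixes;
}
```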
Refs barolab/verdaccio-minio#13, a robust solution would be:
To be honest, I'm not interested in implementing the correct but useless solution A. It works in a cluster, but it isn't designed for a cluster. |
Thanks for taking care of it. If you follow the related issue #317, you will notice that PR #275 implemented solution B, using the S3 list action to construct the package list. But in #317 (comment), @lneveu found an issue with that solution: it cannot distinguish user-submitted (internal) packages from proxied (cached) packages, because verdaccio by default saves them all into the same directory. I'm not using a proxy setting, so I didn't notice the issue in my production deployment. @lneveu also suggested reconsidering solution A with a locking mechanism. I haven't had enough time to implement a file locking mechanism. It needs to be carefully designed, so that whenever you want to write the package list (the verdaccio db file), e.g. when adding or removing a package in the registry,
Basically, it's a transaction lock. It's useless to lock only the write method; you will still get the race condition (see the discussion of solution A in #317 (comment)). But I'm not sure whether it's possible to implement a transaction lock; what I got is a
Anyway, I prefer to close #275, unless anyone can figure out a way to distinguish internal packages from proxied packages to make it useful, or anyone familiar with the project can tell whether it's possible to implement a transaction lock in a storage plugin. I also want to point out that implementing a file lock doesn't make the system scalable, as it keeps the single-point write bottleneck. But at least it fixes the concurrency issue when more than one person submits packages at the same time, by letting one of them fail. That could be good enough for most people who actually want to scale verdaccio for heavy reads. In the meantime, I created two other projects; by combining them you can use S3 for tarball serving and Redis as the database (for the package list, package metadata, and security info). If anyone is looking for a scalable workaround, please check them out, |
@favoyang thanks for all the context. IMO an optimistic lock would be a lot better for this use case (try to write; if the version has changed, fail and retry), but it doesn't seem like that is supported out of the box, nor do I think it'd be simple to implement. What do you think about using S3's native lock capability? (I'm still reading whether that's a bucket-wide lock or whether it can be used for a single file.) |
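For illustration, a sketch of the optimistic pattern described above; the version-checked store interface here is hypothetical, since neither the plugin nor plain S3 exposes a compare-and-set write out of the box:

```ts
// Hypothetical storage interface: read returns the current version tag,
// writeIf only succeeds when the version is unchanged (compare-and-set).
interface PackageListStore {
  read(): Promise<{ packages: string[]; version: string }>;
  writeIf(packages: string[], expectedVersion: string): Promise<boolean>;
}

// Optimistic "try to write, fail and retry if the version changed" loop.
async function addPackageOptimistic(
  store: PackageListStore,
  name: string,
  maxRetries = 5
): Promise<void> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const { packages, version } = await store.read();
    if (packages.includes(name)) return; // already present
    const ok = await store.writeIf([...packages, name], version);
    if (ok) return; // no concurrent writer got in between
    // Another instance updated the list first; re-read and try again.
  }
  throw new Error(`could not update package list after ${maxRetries} attempts`);
}
```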
I closed #275. Anyone who has more energy can propose an implementation of solution A using a locking mechanism. |
Describe the bug
The bug demonstrates that verdaccio with the S3 backend is stateful, with the package list cached in memory, which causes inconsistency and race issues in a cluster environment. A solution is attached, and discussion is welcome.
I've set up a minimal clustered verdaccio deployment, using two verdaccio instances, an S3 store, and nginx as a reverse proxy. The S3 plugin is slightly modified, but nothing that really touches the core logic.
Before the test, I already have one package (`com.littlebigfun.addressable-importer`) in verdaccio. Let's publish another package (`com.bastianblokland.enumgenerator`) for testing.

Logs show that the return code is 201, so the publish is successful. The `NotFoundError` is harmless for a new package. Notice that the publish job is executed by verdaccio instance 0 (the log prefix tells). The added package is verified in S3.
Now the buggy part: let's curl the package list, twice. Notice that only the second call returns the newly added package.
Logs show that the second (correct) curl result comes from verdaccio instance 0, the one that just executed the publish command. The incorrect curl result comes from verdaccio instance 1. We can run it multiple times and the result is the same: verdaccio instance 1 never returns the newly added package.
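One way to see the divergence without nginx in the middle is to query each instance directly and compare the results; the port numbers and the web UI endpoint used below are assumptions about this particular setup:

```ts
// Sketch: query each verdaccio instance directly and compare the package lists.
// Ports 4873/4874 and the /-/verdaccio/packages endpoint are assumptions.
// Requires Node 18+ for the global fetch.
const instances = ["http://localhost:4873", "http://localhost:4874"];

async function comparePackageLists(): Promise<void> {
  for (const base of instances) {
    const res = await fetch(`${base}/-/verdaccio/packages`);
    const packages = (await res.json()) as Array<{ name: string }>;
    console.log(base, packages.map((p) => p.name).sort());
  }
}

comparePackageLists().catch(console.error);
```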
This behavior seems to imply that verdaccio keeps some sort of in-memory local cache of the package list (verdaccio-s3-db.json), so until I restart verdaccio instance 1, there's no way to notify it to refresh the cache. I haven't checked the source code yet, so this is just my guess. But if it's true, it means verdaccio is not scalable and can only run with one instance. That isn't what I expected when discussing with @juanpicado in verdaccio/verdaccio#1459 (comment), where I asked how a shared package list is handled in a cluster environment.
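As a simplified illustration of the suspected pattern (not the actual plugin code), a lazily loaded and never-invalidated cache behaves exactly like this:

```ts
// Simplified illustration of a lazy in-memory cache over a shared db file.
// This is NOT the plugin's real code; names are made up for the example.
class CachedPackageList {
  private cache: string[] | null = null;

  constructor(private loadFromS3: () => Promise<string[]>) {}

  async get(): Promise<string[]> {
    if (this.cache === null) {
      // First call loads the package list (verdaccio-s3-db.json) from S3...
      this.cache = await this.loadFromS3();
    }
    // ...every later call returns the cached copy, so a package published
    // through another instance never shows up until this process restarts.
    return this.cache;
  }

  async add(name: string): Promise<void> {
    const list = await this.get();
    list.push(name); // updates only this instance's in-memory copy
    // (a real implementation would also write the file back to S3 here)
  }
}
```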
I need some time to read the getLocalDatabase method in https://github.com/verdaccio/verdaccio/blob/dbf20175dc68dd81e52363cc7e8013e24947d0fd/src/lib/storage.ts to figure it out. But please guide me if you think there's something obvious I missed.
To Reproduce
You will need a similar deployment - two instances of verdaccio managed by pm2, with an S3 backend. Nginx isn't necessary.
Expected behavior
All verdaccio instances should return the latest package list right after a new package is added (or removed).
Configuration File (cat ~/.config/verdaccio/config.yaml)
Environment information
verdaccio: 4.3.4 (modified: verdaccio/verdaccio#1580)
s3-plugin: 8.4.2 (modified: #249)
My modifications are made for #250, which is not related to this bug.
Debugging output
- `$ NODE_DEBUG=request verdaccio` displays request calls (verdaccio <--> uplinks)
- `$ DEBUG=express:* verdaccio` enables extreme verdaccio debug mode (verdaccio api)
- `$ npm -ddd` prints:
- `$ npm config get registry` prints:

Additional context