Add branch for historical data points #9
Not knowing the details of the project, the workflow can be fairly simple: check out some data branch ref, generate the summary file for a point in time (which will likely require cloning the whole repository), and then add, commit, and push.

As for the data type and organization of that branch, it's largely up to you and what you might want to do with it. If you want data that can render into a site, you could make a Jekyll collection and put the data in Markdown front matter to render into the site. If you want to mimic a static API, you can just dump to JSON and have some known schema for getting a particular page. Some examples I have (that might inspire) are the caliper metrics, which organizes by UI, dumps JSON named by the metric, and then adds an index.json to summarize that. More complex is the RSEpedia, which generates a site with Flask and then saves the static pages (API and interfaces) there. Or any of these that use the GitHub API to do some search and then save results in some organized fashion to render from the repo:

I haven't used it, but you could also look into a tool like Datalad to handle this for you. Hope that helps! Let me know if you have specific questions.
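To make the checkout/generate/commit/push workflow concrete, here is a minimal sketch of such a CI step in shell. This is only an illustration under stated assumptions: the `data` branch name comes from this thread, while `generate-stats` and the `stats/` directory layout are hypothetical placeholders.

```sh
# Minimal sketch: append one summary file per CI run to a dedicated data branch.
# Assumptions: the repo is already cloned, "data" is the stats branch, and
# generate-stats is a hypothetical command that prints one JSON object.
set -e

git fetch origin data
git checkout data

# One file per build, named by UTC timestamp, so parallel runs rarely collide.
STAMP=$(date -u +%Y-%m-%dT%H-%M-%S)
mkdir -p stats
generate-stats > "stats/${STAMP}.json"

# Rebuild a small index so a static consumer can discover every data point.
ls stats/*.json | jq -Rn '[inputs]' > index.json

git add stats/ index.json
git commit -m "Add data point ${STAMP}"
git push origin data
```

The one-file-per-build naming plus the regenerated `index.json` gives the "static API" shape described above: a known path scheme for individual pages and a summary index to discover them.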
Thanks for the quick feedback. I have to run, but here are a few quick comments.
`fedora-packages-static/.github/workflows/manual.yml` (lines 46 to 50 in 8787cbf)
Oh, that makes sense! So watchme is probably different in terms of goals: with watchme the idea is that you can store files in git over time and then assemble all the results in one place upon an extraction. E.g., if git timepoint A has a JSON structure with value 2 and git timepoint B has 3, the results will be [2,3]. But it sounds like you want to be able to just get the entire summary statistics for a given timepoint.

You are correct that you'll hit a point of failure if you have, say, 100 of these running at once, all trying to fetch and push: it's just hugely likely that upstream will change in the process and then the push will fail. Even for watchme, trying to run extractions on a cluster was a massive failure, because git just isn't intended to work that way. Are you able to exert some control over how many are running and pushing? If you think it's a reasonable number, I think it's reasonable to try, and you just need to come up with a directory organization that ensures things are modular.
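One generic mitigation for the concurrent-push failure mode described above (a sketch of a common pattern, not something from watchme) is to rebase and retry the push a bounded number of times. Since each run would commit a distinct file, the rebase should apply cleanly:

```sh
# Retry the push a few times, rebasing our single data-point commit on top
# of whatever upstream changed in the meantime.
for attempt in 1 2 3 4 5; do
    git push origin data && break
    echo "push attempt ${attempt} failed; rebasing and retrying" >&2
    git pull --rebase origin data
    sleep $((attempt * 5))   # simple backoff between attempts
done
```

This keeps a handful of concurrent jobs workable, but it does not change the underlying point: git is not designed for hundreds of simultaneous writers.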
Yeah, I'm thinking of a roughly two-step CI/CD pipeline.

[1], [2], [3] are writing to … This way there is always a static dataset in a single JSON file that can be used without additional filesystem loops.
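For the single-file variant, here is a minimal sketch of appending one build's record to a cumulative JSON array with jq; `history.json` and `current.json` are hypothetical names, not anything specified in this thread:

```sh
# Append the current build's stats object to one cumulative JSON array.
# history.json holds all data points; initialize it as [] if missing.
[ -f history.json ] || echo '[]' > history.json
jq --slurpfile point current.json '. + $point' history.json > history.json.tmp
mv history.json.tmp history.json
```

The trade-off is the concurrency issue raised above: parallel builds appending to the same file will produce merge conflicts, which is exactly what a modular one-file-per-build layout avoids.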
It would be convenient to store historical info in git as well. Right now, to get a graph of how many packages there were over time, one needs to fetch the whole repository and then process `git diff` for `index.html` with some script. That's not convenient. The solution is to have a branch named `data` with stats added after each CI build. Stats should probably be in separate files to prevent merge conflicts.

@vsoch, can you recommend a simple way/scheme for the task? I remember I've seen some of your experiments with managing datasets in `git`.
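On the consumer side, a shallow clone of only the `data` branch would avoid fetching the whole repository, addressing the complaint above. A sketch, where `OWNER/REPO` is a placeholder and the `stats/` layout matches the hypothetical producer sketched earlier:

```sh
# Fetch just the tip of the data branch (no full history needed) and
# assemble every data point into one array for plotting counts over time.
git clone --depth 1 --branch data https://github.com/OWNER/REPO.git data-checkout
jq -s '.' data-checkout/stats/*.json > all-points.json
```

Because every historical data point lives as a file at the branch tip, a depth-1 clone is enough; no `git diff` archaeology over `index.html` is required.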