
Add branch for historical data points #9

Open
abitrolly opened this issue Dec 22, 2021 · 4 comments

@abitrolly (Owner)

It would be convenient to store historical info in git as well.

Last refreshed on 2021-05-31 (62240 packages).

Right now, to get a graph of how many packages there were over time, one needs to fetch the whole repository and then process the git diff history of index.html with some script. That is not convenient.
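For illustration, a rough sketch of that workaround, assuming (only for the example) that each package shows up as an "<a " entry in index.html; the real markup may differ:

# Hypothetical sketch: walk the git history of index.html and count packages
# at each commit. The "<a " heuristic is an assumption, not the real format.
import subprocess

def commits_touching(path="index.html"):
    out = subprocess.run(
        ["git", "log", "--format=%H %cs", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        sha, date = line.split()
        yield sha, date

def package_count(sha, path="index.html"):
    html = subprocess.run(
        ["git", "show", f"{sha}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return html.count("<a ")  # crude counting heuristic

for sha, date in commits_touching():
    print(date, package_count(sha))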

The solution is to have a branch named data with stats added after each CI build. Stats should probably go into separate files to prevent merge conflicts.
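Something like one small JSON file per build could work. A minimal sketch, with the data/ directory and field names as assumptions rather than the project's actual format:

# Hypothetical sketch: write one stats file per CI build on the data branch.
# Separate files keep builds from producing merge conflicts.
import json
from datetime import date
from pathlib import Path

def write_stats(package_count, outdir="data"):
    Path(outdir).mkdir(exist_ok=True)
    today = date.today().isoformat()
    path = Path(outdir) / f"{today}.json"  # one file per build date
    path.write_text(json.dumps({"date": today, "packages": package_count}, indent=2))
    return path

print(write_stats(62240))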

@vsoch, can you recommend a simple way/scheme for this task? I remember seeing some of your experiments with managing datasets in git.

@vsoch commented Dec 22, 2021

Not knowing the details of the project, the workflow can be fairly simple: check out some data branch ref, generate the summary file for a point in time (this will likely require cloning the whole repository), then add, commit, and push. As for the data type and organization of that branch, it's largely up to you and what you want to do with it. If you want data that can render into a site, you could make a Jekyll collection and put the data in markdown front matter. If you want to mimic a static API, you can just dump JSON and have some known scheme for getting a particular page.
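A minimal sketch of that static-API idea, with the layout and field names purely illustrative: one JSON file per timepoint plus an index.json listing them, so a consumer can fetch a single known URL.

# Hypothetical sketch: publish one JSON per timepoint and rebuild index.json.
import json
from pathlib import Path

def publish_timepoint(api_root, day, stats):
    root = Path(api_root)
    root.mkdir(parents=True, exist_ok=True)
    (root / f"{day}.json").write_text(json.dumps(stats, indent=2))
    # index.json lets clients discover which timepoints exist
    entries = sorted(p.stem for p in root.glob("*.json") if p.stem != "index")
    (root / "index.json").write_text(json.dumps(entries, indent=2))

publish_timepoint("api/packages", "2021-05-31", {"packages": 62240})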

Some examples I have (that might inspire you): caliper metrics takes the approach of organizing by UI, dumping JSON named by the metric, plus an index.json to summarize it. More complex is the RSEpedia, which generates a site with Flask and then saves the static pages (API and interfaces) there. Or any of these that use the GitHub API to do some search and then save results in some organized fashion to render from the repo:

I haven't used it, but you could also look into a tool to handle this for you like Datalad.

Hope that helps! Let me know if you have specific questions.

@abitrolly (Owner, Author)

Thanks for the quick feedback. I need to run, so just a few quick comments.

  • watchme is the tool that I remember. Right.

  • The output produced by sources in this repo is already automatically uploaded to the gh-pages branch.

    - name: Upload to GitHub Pages
      uses: crazy-max/[email protected]
      with:
        # Git branch where assets will be deployed
        #target_branch: # optional, default is gh-pages

  • It should not be a problem to upload stuff to the data branch the same way. The only concern I have is about the directory layout, so that parallel steps won't conflict on commits (see the sketch after this list). Maybe I am overthinking the problem while trying to eliminate all these points of failure.
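A rough sketch of one possible conflict-free layout, with the task names and paths as assumptions rather than the project's scheme: each parallel task writes under its own subdirectory with a datetime-random file name, so no two jobs ever touch the same path.

# Hypothetical sketch of a per-task output path for parallel CI steps.
import secrets
from datetime import datetime, timezone
from pathlib import Path

def raw_output_path(taskname, root="data"):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    name = f"{stamp}-{secrets.token_hex(4)}.json"  # datetime-random suffix
    path = Path(root) / taskname / name
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

print(raw_output_path("refresh-index"))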

@vsoch commented Dec 22, 2021

oh that makes sense! So watchme is probably different in terms of goals: with watchme the idea is that you can store files in git over time and then assemble all the results in one place upon extraction. E.g., git timepoint A has a JSON structure with value 2, git timepoint B has 3, and the extracted result is [2, 3]. But it sounds like you want to be able to get the entire summary statistics for a given timepoint. You are correct that you'll hit a point of failure if you have, say, 100 of these running at once, all trying to fetch and push: it's hugely likely that upstream will change in the process and then the push will fail. Even for watchme, trying to run extractions on a cluster was a massive failure, because git just isn't intended to work that way.
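For illustration, a minimal sketch of that extraction model (not watchme's actual API): read the same JSON file at every commit that touched it and assemble the values in order.

# Hypothetical sketch: collect one value per git timepoint into a series.
import json
import subprocess

def extract_series(path="stats.json", key="packages"):
    shas = subprocess.run(
        ["git", "rev-list", "--reverse", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    series = []
    for sha in shas:
        blob = subprocess.run(
            ["git", "show", f"{sha}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        series.append(json.loads(blob)[key])
    return series  # e.g. [2, 3] for two timepoints

print(extract_series())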

Are you able to have some control with respect to how many are running / pushing? If you think it's a reasonable number, I think it's reasonable to try, and you just need to come up with the directory organization to ensure that things are modular.

@abitrolly (Owner, Author)

Yeah, I am thinking about a two-stage CI/CD pipeline.

[1] ----\
[2] -----[4]-----[5]
[3] ----/

[1], [2], [3] write raw data to the data/ folder in parallel data- branches, using a non-conflicting taskname-datetime-random naming format. When they have finished, a job [4] merges these branches into the data branch and processes the raw data to update the static JSON datasets.

This way there is always a static dataset in a single JSON file that can be used without additional filesystem loops.
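A rough sketch of what job [4] could do, with the paths and field names as assumptions following the taskname-datetime-random idea above:

# Hypothetical sketch of the merge job: fold raw per-task files into one
# static dataset that consumers can fetch as a single JSON file.
import json
from pathlib import Path

def merge_raw(raw_dir="data", dataset="dataset.json"):
    points = []
    for raw in sorted(Path(raw_dir).glob("*/*.json")):
        record = json.loads(raw.read_text())
        record["task"] = raw.parent.name   # taskname from the directory
        record["source"] = raw.name        # datetime-random file name
        points.append(record)
    Path(dataset).write_text(json.dumps(points, indent=2))
    return len(points)

print(merge_raw(), "data points merged")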
