
Working directory needs more efficient re-use or cleanup #1506

Open
bergmeister opened this issue Apr 13, 2018 · 79 comments
@bergmeister

Agent Version and Platform

  • 2.131.0
  • Windows

VSTS Type and Version

VSTS but agent is on-premise

What's not working?

We created new agent machines in Azure. For fast builds, the working directory is on the D drive, which is only around 30 GB in size.
We have a big monolithic repo with multiple build definitions, and the agent seems to keep multiple checkouts on disk when running different build definitions and branches, even though they all point to the same repository. Git was designed for fast branch switching, so I do not see why this is necessary. As a result, the agent runs out of disk space after a few builds within a few hours. Adding cleanup steps to our builds because of this, as proposed in #708, is not a reasonable solution.
Therefore the agent either needs a setting to clean up the working directory afterwards, or needs to re-use the same repository more efficiently across branches and build definitions.
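For illustration, the reuse model asked for here can be sketched as a plain git workflow — one working folder per remote, fetched, cleaned, and switched between branches instead of re-cloned per definition. This is a hedged sketch, not agent behavior today; a throwaway local repository stands in for the real remote.

```shell
# One shared working folder per remote, reused across definitions/branches.
set -e
work=$(mktemp -d)

# Stand-in remote with two branches (assumption: the real remote is in VSTS).
git init -q -b main "$work/remote"
git -C "$work/remote" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial"
git -C "$work/remote" branch topic

# Clone once; every later "build" just fetches, cleans, and switches branch.
git clone -q "$work/remote" "$work/shared"
cd "$work/shared"
git fetch -q origin      # cheap: only new objects are downloaded
git clean -dfxq          # drop untracked build output from the previous run
git checkout -q topic    # fast branch switch, no second clone needed
```

Since one agent runs one job at a time, no locking is needed around the shared folder.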

@TingluoHuang
Contributor

@bergmeister Just to let you know, we do have a feature in our backlog to share the .git folder across definitions.

@bergmeister
Author

bergmeister commented Apr 13, 2018

@TingluoHuang Thanks for the clarification. What is the timeline for it? It would be great to have public tracking for that. And why does the agent create a separate folder for each build definition, even when it is for the same repo? From my point of view, all one needs is one folder per git remote, shared by all build definitions.

@ericsciple
Contributor

@bergmeister Would a shared .git folder help in your case? The way we are thinking about the feature, each definition would still get its own build directory for checkout and other folders.

Also have you considered the shallow fetch option?

How many build definitions do you have?
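For reference, shallow fetch caps how much history is transferred. A minimal local demonstration of the effect (the local repository is a stand-in for the real remote):

```shell
# Depth-1 (shallow) clone: only the newest commit is transferred
# instead of the full history.
set -e
d=$(mktemp -d)
git init -q -b main "$d/remote"
for i in 1 2 3; do
  git -C "$d/remote" -c user.email=ci@example.com -c user.name=ci \
      commit -q --allow-empty -m "commit $i"
done

# file:// forces the transport code path so --depth is honored.
git clone -q --depth 1 "file://$d/remote" "$d/shallow"
git -C "$d/shallow" rev-list --count HEAD   # 1 commit instead of 3
```

Shallow history trims the .git folder but, as noted below in the thread, does not address the per-definition duplication of checkouts.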

@bergmeister
Author

@ericsciple It would help but not solve the problem (I am already using shallow fetch). The repo is a couple of gigabytes, and fully compiled its size triples, but we have around 10 build definitions that build sub-components. Appropriately sized Azure VMs have only 16-32 GB of disk space on the temp disk (which I should be able to use, because it is much more performant and gives me free cleanup when the machine shuts down overnight).
The way I think about it, you only need one folder per repository. Git was designed to switch branches very fast. One agent can only run one build at a time, so there are no concurrency issues, and I think every build runs something like git clean -dfx at the start anyway. Why would you want/need a separate folder for each build definition?

@ericsciple
Contributor

Note, the sources directory is a subdirectory within the build directory. git clean only cleans the sources directory.

@bryanmacfarlane
Contributor

Yeah, as long as we keep it down to per agent, we ensure an agent runs one job at a time, so there's no concurrency. Very important. The shared .git would basically allow many definitions to share the .git on disk. Technically we don't even have to repoint .git via config; we just need to ensure it's keyed by the git repo location instead of definition + repo.
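As an aside, Git itself already supports the "one .git, several checkouts" shape via worktrees. A small sketch purely as an illustration of the idea — the agent does not do this today:

```shell
# Two checkouts backed by a single .git object store via `git worktree`.
set -e
d=$(mktemp -d)
git init -q -b main "$d/repo"
git -C "$d/repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial"
git -C "$d/repo" branch release

# Second checkout shares the same object database: no duplicate clone on disk.
git -C "$d/repo" worktree add "$d/release-checkout" release
git -C "$d/release-checkout" rev-parse --abbrev-ref HEAD   # release
```

Each worktree has its own index and HEAD, so two definitions could in principle check out different branches without duplicating objects.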

@ericsciple
Contributor

@bryanmacfarlane it sounds like we may want to consider sharing the checkout folder too, or the entire build directory. Sharing the entire build directory introduces more challenges for multi-checkout, so I would rather not go that far. Note the cloud build scenario is also the entire build directory, although I never got an answer whether they need to share the entire build directory, or whether .git folder is sufficient.

@ericsciple
Contributor

@bergmeister how many megabytes are all of the checked-out files? (exclude the .git folder from the calculation)

@bergmeister
Author

bergmeister commented Apr 14, 2018

@ericsciple I would need to check when I am back in the office, but it's around 1-2 GB. However, this issue is not about how big the repo is or how many build definitions I use; it is about the agent keeping multiple clones of the same repo, for which I do not see a reason. Also, there is no built-in setting to do a cleanup along the lines of git clean -dfx after each build, which is not a problem for me but could be a problem for someone with lots of small repos instead of one monolithic one.

@ericsciple
Contributor

@bergmeister I'm also curious how big the .git folder is. I do understand that, ideally, you want to reuse the same build directory across multiple definitions. Simply sharing the .git folder (the cloned repo) gets a lot of customers a long way. Sharing the entire build directory would need to be opt-in, or use a tag to control which definitions share the same folder, or something similar. It would break compat if every existing build definition started sharing the entire build directory; other directories exist inside the build directory, not just the sources folder.

The only way our infrastructure supports this today is with the Don't sync sources option. You would also need to check the enable-scripts-access-to-oauth-token checkbox. The get sources step prints the command lines it runs (fetch/checkout). You would need to run similar command lines, and the environment variable SYSTEM_ACCESSTOKEN contains the credential (masked as *** in the logs). The step could be wrapped up into a script in your repo, or a task group, and reused across multiple definitions.
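A rough sketch of that manual flow. The variable names are the standard pipeline ones, but everything else (the shared-folder layout, the local stand-in remote and dummy values used so the sketch runs) is an assumption, not the agent's actual script:

```shell
# Manual "Don't sync sources" checkout into a folder shared across
# definitions. Stand-in values replace the real pipeline environment.
set -e
d=$(mktemp -d)
git init -q -b main "$d/remote"
git -C "$d/remote" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial"
AGENT_WORKFOLDER="$d/work"
BUILD_REPOSITORY_URI="file://$d/remote"
BUILD_SOURCEBRANCH=main
SYSTEM_ACCESSTOKEN=dummy-token    # real value comes from the job

repoDir="$AGENT_WORKFOLDER/shared-repo"   # keyed by repo, not by definition
[ -d "$repoDir/.git" ] || git init -q "$repoDir"
# The AUTHORIZATION header mirrors what the get sources step logs; it is
# simply ignored for this local file:// stand-in.
git -C "$repoDir" -c http.extraheader="AUTHORIZATION: bearer $SYSTEM_ACCESSTOKEN" \
    fetch -q "$BUILD_REPOSITORY_URI" "$BUILD_SOURCEBRANCH"
git -C "$repoDir" checkout -qf FETCH_HEAD
git -C "$repoDir" clean -dfxq
```

Every definition running this step on the same agent would reuse `$repoDir` instead of its own numbered build directory.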

@bergmeister
Author

bergmeister commented Apr 15, 2018

@ericsciple I do not see how sharing the same folder is breaking back-compat:

  • the build agent defines source, output paths, etc. as variables and people should use those. The source path variable would simply point to the new, central place
  • if someone used a relative path between the source and output paths (which I think should not be supported anyway, since the build variables should be used for that), then you could still provide folder mappings (this works on Windows and Unix systems) to give the user/system the illusion that the folder is still where it used to be

And even if there is a special/hairy case, then it should at least be a configurable option of the VSTS build agent to have a centrally shared checkout, which would help many people.

@ericsciple
Contributor

I think we are saying similar things using different terminology. By build directory, I am referring to the directory specified by the AGENT_BUILDDIRECTORY variable:

AGENT_BUILDDIRECTORY=D:\a\1
BUILD_BINARIESDIRECTORY=D:\a\1\b
BUILD_SOURCESDIRECTORY=D:\a\1\s
COMMON_TESTRESULTSDIRECTORY=D:\a\1\TestResults
SYSTEM_ARTIFACTSDIRECTORY=D:\a\1\a

@alexdrl

alexdrl commented Apr 16, 2018

We're having the same problem with our repository. The code is in a Git repository with a size of 100 MB (including the .git folder, which is 45 MB). The problem we encounter is that we have a lot of builds (almost 50) pointing to the same repository, which makes each agent hold 5 GB of source code alone; multiplied by 8 agents on one machine, that is 40 GB of duplication. We would be happy if the agent shared each repository's sources folder, as long as the following assumption is met:

Yeah, as long as we keep it down to per agent, then we assure an agent runs one job at a time so no concurrency.

If the sources directory were shared between builds that map to the same Git repository, each agent would only hold 100 MB of source, with the artifacts in each build directory.

Also, to save some space, we needed to run a git clean -fdx command after each build, because since the sources folder is not shared, we had a lot of unused DLL files on agents that were not currently running builds; those only get removed when the agent runs the same build again, which might not happen for a long time.

As I am writing this post, I have tried to change the $(Build.SourcesDirectory) variable to $(Agent.WorkFolder)\s, with no luck, as it seems that it is specified in SourceFolder.json. Is there any workaround that does not involve the Don't sync sources option?
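For anyone unsure exactly what that cleanup step removes: `git clean -fdx` deletes untracked files (-f), untracked directories (-d), and ignored files such as build output (-x), while leaving tracked files alone. A quick local demonstration:

```shell
# What `git clean -fdx` removes and what it keeps.
set -e
d=$(mktemp -d)
git init -q -b main "$d/repo"
cd "$d/repo"
echo 'bin/' > .gitignore
echo source > tracked.txt
git add .
git -c user.email=ci@example.com -c user.name=ci commit -qm "initial"

mkdir bin; echo dll > bin/out.dll   # ignored build output directory
echo tmp > scratch.txt              # plain untracked file

git clean -fdxq   # removes scratch.txt and bin/, keeps tracked files
```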

@oskarm93

We have the same issue at my company. We use private Azure VMs as build servers, and many of them have to be sized up to the bigger Dv2 series rather than BMS because they require 100 GB+ temporary drives for work directories. Clean-up steps are all well and good, but they seem like workarounds for agent inefficiency.
When I look at an agent's work folder, it usually contains ~15 folders with the same content, eating up space like it's candy. I don't see a reason not to have just one folder per Git repo per agent.

@jsheetzati

jsheetzati commented Apr 16, 2018

Also running into similar issues with a 1GB legacy git repo + multiple build definitions against the same repo. Shallow fetch helps but does not fully solve the problem.

@alexdrl

alexdrl commented May 8, 2018

Any timeline on this? Our agents' work folders are getting bigger and bigger with new build definitions and repository growth.

@glaenzesch

Hi 👋

we had the same issue and solved it with the "Maintenance" tab of the agent pools.
Important: this setting is only available at collection level.

[screenshot: agent pool Maintenance settings]

With this setting, TFS sends a maintenance job to the agent and it will clean up the working folder.

We use Team Foundation Server 2018.1 on-premise.
Hope this helps 😉

@alexdrl

alexdrl commented May 8, 2018

@glaenzesch This is only half a solution, as the working directory will still get filled with new folders, one per build definition, as builds get queued...

@ppejovic

ppejovic commented Jun 8, 2018

To get around this in our on-prem instance of TFS, I've written a custom build task, typically added as the last step of the build, which will (if not already done) move the repo into a shared location and then update the build sources directory path for the definition (in SourceFolder.json) to point to the new location. The next time the build runs, it will use the repo in the shared location.

The shared folder the repo is added to is a hash of the following (in ps):

"$agentId\$collectionId\$teamProject\$repository"

This means there is one shared repo per agent, so there are no concurrency issues. I'm sure fiddling with SourceFolder.json isn't a supported scenario, but we have a huge repo to contend with and it has worked nicely for us so far.
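A hypothetical shell rendering of that keying scheme (the real task is PowerShell and is not shown in this thread; all variable values below are made-up examples). Hashing the identifying tuple gives each (agent, collection, project, repo) combination one stable folder, so two agents never share a repo:

```shell
# Derive one stable shared folder per (agent, collection, project, repo)
# tuple. All values are illustrative stand-ins.
set -e
agentId=1
collectionId=DefaultCollection
teamProject=App
repository=MainRepo

key="${agentId}\\${collectionId}\\${teamProject}\\${repository}"
hash=$(printf '%s' "$key" | sha1sum | cut -c1-12)
sharedDir="_work/shared/${hash}"
echo "$sharedDir"
```

The same tuple always hashes to the same directory, which is what makes pointing SourceFolder.json at it safe across runs.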

@bergmeister
Author

@TingluoHuang Any progress/timeline on this?

@alexdrl

alexdrl commented Aug 3, 2018

@ppejovic As Microsoft does not seem to have this marked as urgent or with priority, could you share the code of the custom build task?

Thank you in advance.

@littleninja

@alexdrl I tried writing my own script but ran into the problem of another process (the build/release?) holding files in the build directory open. We have only just started using a free extension, Post Build Cleanup.


@MichelZ

MichelZ commented Dec 19, 2018

Would you mind updating us if this is still on the radar for the agent, and if you might be able to share a rough timeline? (2019H1? 2019? Next 3 years?)
@TingluoHuang @ericsciple @bryanmacfarlane

@alexdrl

alexdrl commented Dec 19, 2018

This would be ideal. We "solved" this problem by activating deduplication in Windows Server 2016, which in turn is a mess, because when the agents' maintenance job executes, free space drops to 0 and errors are sometimes thrown.
We tried to modify the builds as @littleninja and @ppejovic described, but had no luck.

This is curious: if Microsoft is using this agent code itself, why is this optimization not implemented?

@alexdrl

alexdrl commented Feb 14, 2019

Is microsoft/azure-pipelines-yaml#113 going to help with repository caching? Each build on each agent does not cache the repository clone, which is worse in terms of performance, because each git clone slows the build down a lot. We think package restore caching is great, but improving repository download times (which also affects people who do not use packages) is necessary.

@bergmeister
Author

The caching feature will not help my case (agents on Azure VMs), because as long as each build definition clones its own repository, I still have to pay for the disk space of the repositories that are lying around.

@oskarm93

Today I ran out of disk space on the build servers AGAIN. We are using self-hosted build servers on Azure IaaS VMs, B4ms series with a 32 GB temp drive, and 8 build agents' work folders live on that temp drive.
Here's a visual representation of what my problem is:

[screenshot: agent work folder with several near-identical numbered repo folders]

Why do the agents have separate folders called 1, 2, 3 under them? I would understand if they stored different git repositories, so you don't have to clone each one from scratch every time. But in my case they do not: all of these directories contain the same git repository. Even when I build different branches, I would want the same git folder to be used, with just a different branch checked out. This would be the first step to massively reduce duplication.

Interestingly enough, releases are already well behaved when it comes to cleaning up after themselves. A release definition, no matter where the artifacts come from, will always go into the same r1, r2 folders and will clean up the contents before re-downloading the required artifacts. This is why we never have to clean up our test machines; they just cycle through the same amount of storage space.

Only then do we need to talk about caching of the _tool and _tasks folders. I understand that agents may run under different user accounts, but if they don't, couldn't the tools be stored under the user profile folder instead? It would save my steps from re-downloading .NET Core, NuGet, VSTest, Helm, etc. every time I clean the temp drive.

@bergmeister
Author

@xenalite I agree with you, but the D series is much better in terms of how much temporary disk space you get for the money (B4ms has only 32 GB for $128.71):

  • D11_v2 has 100 GB for $127.22 (but only 2 cores instead of the B4ms's 4)
  • D4_v3 has 200 GB and 4 cores (and is only slightly more expensive at $160.70)

@PaulVrugt

We would love to have this feature too. Please don't forget support for scale set agents. It might be tricky to set agent settings in that scenario, since the agent is installed automatically by Azure DevOps when provisioning instances.

@github-actions

This issue has had no activity in 180 days. Please comment if it is not actually stale

@github-actions github-actions bot added the stale label Nov 12, 2022
@PaulVrugt

Not stale, just lack of response from Microsoft

@github-actions github-actions bot removed the stale label Nov 12, 2022
@echalone
Contributor

Not stale, just lack of response from Microsoft

we need a bot for this ^^

@github-actions

This issue has had no activity in 180 days. Please comment if it is not actually stale

@github-actions github-actions bot added the stale label May 15, 2023
@ChristianStadelmann

Not stale, just lack of response from Microsoft

Same.

we need a bot for this ^^

Definitely.

@EugenMayer

I implemented https://github.com/EugenMayer/azure-agent-self-hosted-toolkit, which fixes those issues: cleanup and pollution (for the next job). Check the project README.

@balchen

balchen commented May 15, 2023

I read the README but couldn't see how these tools solve the issue at hand, which is building several pipelines from the same git repo while avoiding multiple checkouts of the repo on disk.

@EugenMayer

It fixes the cleanup part, since --once ensures that after the job the agent is disconnected (so no new job is started), the workdir is cleaned up, and the agent reconnects. This takes about 5 seconds before the agent is available again.

What it does not fix (so 'all of those issues' was indeed wrong) is making the checkout more efficient. That is out of scope and, to be honest, re-using the workdir on an agent on the hunch 'that a job has probably run here before' only works in environments where an agent runs exactly one specific job for one pipeline. IMHO a super-specific (I would say uncommon) case.

@balchen

balchen commented May 15, 2023

OK, so not "all of the issues", but the issue of cleaning up the workdir after each build. That is a solution, but definitely a second choice when the repo is 30 GB (as stated in the original issue) and you need to check out 30 GB for every single build, even on the same pipeline.

Regarding your second paragraph: it seems a number of people want this, so it can't be that super-specific and uncommon.

@EugenMayer

In regards to your second paragraph, it seems a number of people want this, so it can't be that super-specific and uncommon.

To be honest, there is IMHO not a single CI/CD solution that can do what you ask here: Travis, CircleCI, GitLab, Bitbucket Cloud, Bamboo, GoCD, Concourse, Buildkite, to name a few. I'm not even sure this could be custom-crafted in Jenkins without doing the exact same thing you would do in Azure Pipelines anyway.

So if this is a common issue, it is a huge gap in all those toolkits. In fact, offering an agent runner that is generic at one point (ephemeral) but at the same time suddenly 'knows that there is a folder locally that can be used' is just not something that can be introduced in a sane manner.

There are caches for that purpose, but of course they will fetch 30 GB anyhow.

So to be honest, no offense, waiting for MS to implement this feature is, I assume, a bad bet.

What you most probably want is a step that downloads your repo from a static server running locally on the agent host. Of course, this does not make sense if you need the history. But if you really have 30 GB of assets in a git repo and need the history... well, the issue goes deeper, I guess.

Don't get me wrong: if the feature happens, I'm happy for you all. But if you place bets on this, I would assume the odds are very bad, especially considering that the current CI/CD space does not care about something like this AFAICS (yet).

@balchen

balchen commented May 15, 2023

I see. Thank you for telling us that. Since you obviously have no interest in this feature, how about just staying away from it?

@github-actions github-actions bot removed the stale label May 15, 2023
@PaulVrugt

Well at least the above discussion made sure the stale label was removed

@echalone
Contributor

echalone commented Jun 2, 2023

Hi, I think I've already programmed the solution for this, including UnitTests, as we need it too, and I've also made two pull requests for two different functions.

This pull request would (on self-hosted agents) allow repositories to be put not just in the build directory but also in the work directory, therefore allowing self-hosted agents to reuse repositories between build pipelines: #3475 This option would of course need to be specifically enabled on self-hosted agents in the .agent settings file, so as not to pose a security risk on public agents. There is also a unit test verifying that the agent continues to throw an error if somebody tries to do this on an agent for which it wasn't specifically enabled. Repositories above the work directory level or with an absolute path are still not allowed, even if this new "AllowWorkDirectoryRepositories" option is set (those unit tests are also included).

And this second pull request would allow setting the default working directory (not to be confused with the agent work directory of the previous pull request) to the checkout path of the desired repository in a multi-checkout scenario during the checkout step: #3479 This would make it possible to call scripts and use files in the desired repository of a multi-checkout scenario without needing relative paths or build variables to point to the correct repository. Just define in the YAML checkout steps which repository should be the working directory for all build steps and you're done. Unit tests are again included.

I also have a third pull request which would fix the primary/self repository detection (in some specific scenarios there is some undesired behaviour), which would be good to have fixed for the work directory feature, as well as fixing some unit test localization problems (5 unit tests don't work correctly on some non-English systems): #3473

Also, I can only reiterate that I think I've actually already programmed the solution for this and the pull request is active. But sadly it's taking Microsoft a really long time to review pull requests for this software :/ those pull requests are now 2 years old... I'm still keeping them up to date and hope that one day the feature(s) and fixes in my pull requests will be included in the agent.

@6heads

6heads commented Jan 19, 2024

Hi everyone! As a quick update - we are planning to review PRs above soon, thanks for the contribution! @mjthurlkill the agent re-uses an already existing repo for the same build definition, devops collection and repo - could you please share logs (in debug mode) for a pipeline where checkout does not re-use an existing large repository? Please also make sure that you mask any sensitive data.

Your reply is now almost two years old. Was the PR that extensive?

@echalone
Contributor

echalone commented Jan 19, 2024

Your reply is now almost two years old. Was the PR that extensive?

My friend, he's not even working for Microsoft any more 😆 I've had a few PRs waiting for review for about 2-3 years now; so far they've managed to include one ^^ Thankfully it was the most important one.

@jrnewton

@kirill-ivlev any update on this issue and the related PRs?

@ADD-ACS

ADD-ACS commented Apr 10, 2024

You can use #4423.

@jrnewton

@ADD-ACS - not sure how that issue is related. My take: this issue is about cleanup of the work directory, while #4423 is about changing the location of the work directory.

@balchen

balchen commented Apr 12, 2024

@ADD-ACS - not sure how that issue is related. My take: this issue is about cleanup of the work directory, while #4423 is about changing the location of the work directory.

The motivation for this issue is avoiding multiple checkouts of the same, very large repository, typically one per pipeline. Either re-use of the repo or a different clean-up mechanism was suggested as a way to solve it.

The standard checkout is to _work//. #4423 allows us to change this to a static location (e.g. just _work/), effectively forcing re-use of the repo between pipelines. That will provide a solution to the original issue in many circumstances.

@mjthurlkill

You can tell how closely I have been following this issue... :-(
I'll try to find time to get some logs. However, I'm using a different solution now.
I currently have a template that uses partial clone/fetch (git fetch --filter=blob:none) and sparse-checkout. I should turn this into an extension, but it would be nicer if it were part of the checkout command.
Partial fetch gives about the same benefit as shallow fetch but isn't as problematic.
Partial fetch + sparse-checkout provides tremendous benefits.
It still uses a source directory per pipeline, but that becomes less of a problem.
The main drawback is that you need to specify the directories to be included in the sparse-checkout. If you set a CI trigger on the appropriate directories of the pipeline dependencies, you basically need to specify those directories there as well.
(It would be nice to have a system variable containing the directories specified for the trigger, or to be able to set those directories from a variable, though to trigger they probably need to be static.)
(There is the SelectiveCheckout extension in the marketplace, but it isn't quite there yet. It uses shallow instead of partial clone, which causes problems. In mine, I do different things depending on whether the sources dir is new/empty, shallow or not, partial or not, and sparse or not, because different branches may have used different methods, so it needs to account for that.)
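The combination described above looks roughly like this as plain git commands — a sketch against a local stand-in remote; the directory names are examples, and the actual template is not shown in this thread:

```shell
# Blobless partial clone plus sparse-checkout (Git 2.25+): only blobs for
# the selected directories are downloaded and materialized.
set -e
d=$(mktemp -d)
git init -q -b main "$d/remote"
mkdir -p "$d/remote/componentA" "$d/remote/componentB"
echo a > "$d/remote/componentA/a.txt"
echo b > "$d/remote/componentB/b.txt"
git -C "$d/remote" add .
git -C "$d/remote" -c user.email=ci@example.com -c user.name=ci \
    commit -qm "initial"

# Partial clone with the same filter as `git fetch --filter=blob:none`.
git clone -q --filter=blob:none --no-checkout "file://$d/remote" "$d/src"
git -C "$d/src" sparse-checkout set componentA
git -C "$d/src" checkout -q main
ls "$d/src"   # componentA only; componentB is never materialized
```

The blob filter limits what is transferred, while sparse-checkout limits what is written to the working tree, which is why the two combine so well.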

The normal clone size for one of my repos is 30 GB.
For some of the pipelines, the partial/sparse size is 5-10 GB. For some it is < 100 MB.
This really speeds up the checkout: e.g. 10 seconds for the small ones, or 3 minutes for the large ones, vs 30 minutes for a full clone.
Besides the speed, it really reduces the footprint on the agent server, e.g. source directories consume < 1 to 10 GB vs 30 GB each normally.
Also, I'm not sure we would want to, but given the reduced size it would now be possible to use the ADO hosted agents instead of self-hosted agents for at least some of our pipelines, and the checkout time would be reasonable.
I haven't done anything for multiple repos in this, but I will have to look at that.

I'll review the thread about repo directory reuse described above. That still may be the ideal solution for my needs. The first clone into that directory will be slow, but after that it should be fast.
However, what I have right now is pretty good.
