
Implement pre-packed blobs serialization on disk and their memory mapping on load #23069

Open
yuslepukhin wants to merge 9 commits into base: main from yuslepukhin/prepack_serialize
Conversation

@yuslepukhin (Member) commented Dec 10, 2024

Description

Pre-packing is a feature that allows kernels to re-arrange weight data to gain performance at inference time.

Currently, pre-packed blobs are shared only when cross-session weight sharing is enabled, and only for those weights that are marked as shared by the user. Otherwise, the data resides on the heap, the kernels own the data, and it may be duplicated.

This change enables pre-packed data to be stored on disk alongside the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the X-session shared container
or a new container that shares pre-packed blobs within the session.

With the new approach, pre-packed blobs are always owned by the shared container using the existing pre-pack mechanism for sharing. When X-session sharing is enabled, the external container owns the data.
A separate container owned by a root SessionState owns and shares the data when X-session sharing is not enabled.

To facilitate this new approach, we introduce a new container that works in two modes. When an optimized model is being saved, and pre-packed weights saving is enabled, the new container will record pre-packed blobs and serialize them to disk using existing ToGraphProtoWithExternalInitializers function.

To externalize the pre-packed weights, we introduce a new session option kOrtSessionOptionsSavePrePackedConstantInitializers. Note that pre-packing must be enabled (the default) for this to work.

The ToGraphProtoWithExternalInitializers function is modified to recurse into subgraphs to make sure local initializer names are properly accounted for.

In the second mode, the container simply holds the pre-packed weights memory mapped from disk and shares them with the kernels.
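The two modes described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual ORT container; the class and member names here are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of a container that works in two modes: in save mode it records
// pre-packed blobs keyed by weight name so they can later be serialized to
// disk; in load mode it merely holds blobs that were memory mapped from
// disk and hands them out for sharing with the kernels.
class PrepackedBlobContainer {
 public:
  explicit PrepackedBlobContainer(bool save_mode) : save_mode_(save_mode) {}

  // In save mode, record the blob for later serialization.
  // In load mode, the blob comes pre-populated from the mapped file.
  void PutBlob(const std::string& key, std::vector<uint8_t> blob) {
    key_to_blob_[key] = std::move(blob);
  }

  // Kernels query by key; a miss means the blob must be re-generated.
  const std::vector<uint8_t>* GetBlob(const std::string& key) const {
    auto it = key_to_blob_.find(key);
    return it == key_to_blob_.end() ? nullptr : &it->second;
  }

  bool save_mode() const { return save_mode_; }

 private:
  bool save_mode_;
  std::map<std::string, std::vector<uint8_t>> key_to_blob_;
};
```

In the real implementation the blobs are reference counted and backed by mapped memory rather than owned vectors; this only illustrates the key-to-blob lookup shared between modes.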

Motivation and Context

Reduce memory usage by pre-packed initializers and externalize them.

- Enable pre-packed blob sharing when weight sharing is not enabled.
- Memory map pre-packed blobs.
- Recurse into subgraphs in ToGraphProtoWithExternalInitializers to make sure all big weights are serialized along with their pre-packs, which are to be shared between the subgraphs.
@yuslepukhin yuslepukhin force-pushed the yuslepukhin/prepack_serialize branch from 7fc9a93 to fab27a7 Compare December 10, 2024 23:00
@yuslepukhin yuslepukhin marked this pull request as ready for review December 11, 2024 18:38
// This includes options to align external initializer offset.
// For models running on CPU, ORT will try to use mmap to load external
// initializers. To use mmap, external initializer need to be offset aligned.
// ORT saves external initializers into signle data file, each initializer is
@tianleiwu (Contributor) commented Dec 12, 2024:

typo: single #Pending

bool save_mode_on_ = false;
Subgraph* parent_ = nullptr;
KeyToBlobMap& key_to_blobs_;
WeightToPrePacksMap sorted_by_weight_for_writing_;
@tianleiwu (Contributor) commented Dec 12, 2024:

The name sorted_by_weight_for_writing_ is confusing. I have not seen the sorting logic. #Pending

const auto blob_length = std::get<1>(blob);
SafeInt<FileOffsetType> end_of_blob{blob_offset};
end_of_blob += blob_length;
ORT_RETURN_IF(blob_offset < 0 || static_cast<std::uintmax_t>(end_of_blob) > file_length,
@tianleiwu (Contributor) commented Dec 12, 2024:

nit: use static_cast<FileOffsetType>(end_of_blob) so that it has the same data type as file_length. #Pending

@yuslepukhin (Member, Author) replied:

This is due to file_length type.
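The check being discussed guards against a blob's offset and length running past the end of the mapped file. A minimal stdlib-only sketch of the same validation, without ORT's SafeInt helper (function and parameter names are illustrative):

```cpp
#include <cstdint>

// Verify that a blob occupying [offset, offset + length) lies entirely
// within a file of file_length bytes. Plain stand-in for the
// SafeInt-based overflow check in the PR.
bool BlobWithinFile(int64_t blob_offset, uint64_t blob_length,
                    uint64_t file_length) {
  if (blob_offset < 0) return false;
  const uint64_t start = static_cast<uint64_t>(blob_offset);
  // Overflow-safe form of: start + blob_length <= file_length
  if (blob_length > file_length) return false;
  return start <= file_length - blob_length;
}
```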

for (size_t f = 1; f < split_fields.size(); ++f) {
const auto& blob = split_fields[f];
auto blob_fields = utils::SplitString(blob, ";", false);
if (blob_fields.size() == 3) {
A Contributor commented:

Here blob_fields.size() != 3 is ignored. Shall we return failure in that case?

@yuslepukhin (Member, Author) replied:

I have been pondering this and decided to be forgiving. After all, if a pre-packed blob is not loaded, it is simply going to be re-generated.
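The forgiving parse described here can be sketched as follows. The "key;offset;length" record format, the struct, and the function name are illustrative stand-ins, not the actual ORT format:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct BlobRecord {
  std::string key;
  int64_t offset;
  int64_t length;
};

// Parse records of the form "key;offset;length". Malformed records are
// skipped rather than treated as an error, since a pre-packed blob that
// fails to load is simply re-generated at session initialization.
std::vector<BlobRecord> ParseBlobRecords(const std::vector<std::string>& fields) {
  std::vector<BlobRecord> out;
  for (const auto& field : fields) {
    std::vector<std::string> parts;
    std::stringstream ss(field);
    std::string part;
    while (std::getline(ss, part, ';')) parts.push_back(part);
    if (parts.size() != 3) continue;  // forgiving: skip the malformed entry
    out.push_back({parts[0], std::stoll(parts[1]), std::stoll(parts[2])});
  }
  return out;
}
```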


class Subgraph {
public:
Subgraph(Subgraph* par, KeyToBlobMap& key_blobs, bool overwrite_for_save)
@tianleiwu (Contributor) commented Dec 12, 2024:

nit: use the full name parent instead of par #Pending

@yuslepukhin yuslepukhin requested a review from edgchen1 December 12, 2024 22:47
/// </summary>
using WeightToPrePacksMap = NodeHashMap<std::string, InlinedHashSet<std::string>>;

Status ToGraphProtoWithExternalInitiallizersImpl(
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
Status ToGraphProtoWithExternalInitiallizersImpl(
Status ToGraphProtoWithExternalInitializersImpl(

Some more documentation of the parameters would help here. It is not clear what modified_external_file_path is. #Pending

Comment on lines +16 to +18
// initializer) of the data file. To use mmap, each offset need to be aligned
// which means offset need to divisible by allocation granularity(64KB for
// windows and 4K for other OSes). With align_offset to true, ORT will align
@skottmckay (Contributor) commented Dec 13, 2024:

Why do we need to align individual entries in the file? Naively I would have expected you'd mmap the whole file, and the internal offset would only need to be aligned for the data type. #Resolved

@yuslepukhin (Member, Author) replied:

This is historical. This PR does not change that, but we can revisit the pros and cons.

Comment on lines 35 to 37
// The allocation Granularity for mmap() support.
// Typically 64KB for Windows & 4KB for other OSes. Default to 64KB.
int64_t allocation_granularity = 65536;
@skottmckay (Contributor) commented Dec 13, 2024:

Can we set the default based on the platform we're building on?

I thought the prepack logic could potentially be platform or architecture specific (e.g. different layout is preferred on ARM64 vs AMD64), so I'm wondering whether you would only be creating this serialized pre-packed data on the target platform, and hence can set the default based on that. #Pending

@yuslepukhin (Member, Author) replied:

I will set it to 4096 for non-Windows, but generally one has to call sysconf to find out the page size.
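The platform-dependent default discussed here could be sketched like this; the function names are illustrative, and the Windows value follows the 64 KB allocation granularity mentioned in the review:

```cpp
#include <cstdint>
#if !defined(_WIN32)
#include <unistd.h>  // sysconf; POSIX only
#endif

// Choose the mmap allocation granularity per platform: on POSIX, query
// the page size at runtime instead of hard-coding 4096; on Windows the
// allocation granularity is 64 KB (GetSystemInfo would report it).
int64_t AllocationGranularity() {
#if defined(_WIN32)
  return 65536;
#else
  long page = sysconf(_SC_PAGESIZE);
  return page > 0 ? page : 4096;  // fall back to the common default
#endif
}

// Round an offset up to the next multiple of the granularity so a
// mapping can start exactly at that offset.
int64_t AlignUp(int64_t offset, int64_t granularity) {
  return ((offset + granularity - 1) / granularity) * granularity;
}
```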


class PrepackedForSerialization;

// These options that affect how the model initializers are saved.
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
// These options that affect how the model initializers are saved.
// These options affect how the model initializers are saved.
#Pending

Comment on lines 253 to 257
// Use this config when save pre-packed constant initializers to an external data file.
// This allows to minimize ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
// Use this config when save pre-packed constant initializers to an external data file.
// This allows to minimize ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
// Use this config when saving pre-packed constant initializers to an external data file.
// This allows you to minimize the ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.

How does it allow minimizing the ONNX model file size? Would be good to explain in the comment. Naively I would have thought we need the same amount of initializer data and at best we wouldn't increase the total data size, and at worst we'd double it (having original and prepacked data and choosing which to load at runtime). #Pending

@yuslepukhin (Member, Author) replied:

I added some more explanations. The native data is also memory mapped, but if its reference count drops to zero during pre-packing, it is unmapped.
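The drop-to-zero unmapping behavior described here can be sketched with a reference-counted handle whose deleter releases the mapping when the last user goes away. A boolean flag stands in for the actual munmap call, and the names are hypothetical:

```cpp
#include <memory>

// Stand-in for a memory-mapped region of native weight data. The flag
// records whether the mapping has been released.
struct MappedRegion {
  bool* unmapped_flag;
};

// Hand out a shared handle to the mapping. When the reference count
// drops to zero, the custom deleter "unmaps" the region (sets the flag
// where the real code would call munmap/UnmapViewOfFile).
std::shared_ptr<MappedRegion> MapRegion(bool* unmapped_flag) {
  *unmapped_flag = false;
  return std::shared_ptr<MappedRegion>(
      new MappedRegion{unmapped_flag},
      [](MappedRegion* r) {
        *r->unmapped_flag = true;  // stand-in for the unmap syscall
        delete r;
      });
}
```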

ONNX_NAMESPACE::GraphProto& output_graph_proto,
std::ostream& external_stream,
int64_t& external_offset) const {
// Process subgraphs
@skottmckay (Contributor) commented Dec 13, 2024:

nit: Can this please say what the processing is intended to do? It is hard to review when the intention is not stated. It is also easier for the next developer if you state what the processing does, so they do not need to read and understand the entire implementation to infer it. #Pending

Comment on lines +4103 to +4104
auto hit = std::find_if(output_graph_proto.mutable_node()->begin(),
output_graph_proto.mutable_node()->end(),
@skottmckay (Contributor) commented Dec 13, 2024:

nit: why do we need a mutable Node to check the name? #Pending

@yuslepukhin (Member, Author) commented Dec 16, 2024:

We do this so we can obtain a non-const iterator that allows us to refer to and modify the node attributes.

Comment on lines 4120 to 4121
ORT_RETURN_IF_NOT(sub_hit != result_node.mutable_attribute()->end() && utils::HasGraph(*sub_hit),
"Subgraph ", name, " not found in node ", node.Name());
@skottmckay (Contributor) commented Dec 13, 2024:

nit: this says 'what' happened but not 'why' it's an error, which limits the usefulness of the error message to the user. #Pending

// be empty. Else, save external data file in same directory as the model.
const std::filesystem::path modified_external_file_path = model_file_path.parent_path() / external_file_path;
// A recursive function that does bottom up with subgraphs
Status Graph::ToGraphProtoWithExternalInitiallizersImpl(
@skottmckay (Contributor) commented Dec 13, 2024:

nit: this name might be confusing. IIUC the bulk of converting to the GraphProto is handled by the call to ToGraphProtoInternal. This implementation seems to be about updating the GraphProto with prepacked initializers once that GraphProto is created.

i.e. it's more like 'add prepacked weights to graph proto' than being the overall implementation of 'create graph proto with external initializers' #Resolved

@yuslepukhin (Member, Author) replied:

This is the original name with Impl appended, in the same manner we do bottom-up graph processing when partitioning or inlining functions.

It actually externalizes the initializers that exceed the specified threshold.

I added the pre-packs here.
ToGraphProtoInternal is utilized to produce the graph proto that needs to be augmented with external data.
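The bottom-up recursion described here, where subgraphs are handled before their parent so local initializer names are accounted for first, can be sketched as follows. The GraphNode type and fields are hypothetical stand-ins for the actual GraphProto structures:

```cpp
#include <string>
#include <vector>

// Stand-in for a graph with nested subgraphs (e.g. If/Loop bodies).
struct GraphNode {
  std::string name;
  std::vector<GraphNode> subgraphs;
};

// Recurse into subgraphs first, then process the current graph, so each
// subgraph's initializers are externalized before its parent's are.
void ProcessBottomUp(const GraphNode& graph, std::vector<std::string>& order) {
  for (const auto& sub : graph.subgraphs) {
    ProcessBottomUp(sub, order);  // subgraphs before parent
  }
  order.push_back(graph.name);  // then handle this graph's initializers
}
```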

Comment on lines 1495 to 1497
/// that contains the initializer data. However, it may be an outerscope initializer.
/// Thus we need to keep track of the pre-packed blobs that are not serialized in this
/// graph, so the parent can make sure it is being serialized.
A Contributor commented:

Can we just keep the prepacked weight in the same Graph instance as the initializer? Seems complicated to be doing this sort of tracking vs. keeping the original and prepacked values in the same Graph instance.

Maybe that needs a similar lookup logic as we have for initializers with an outerscope search, but that's hopefully relatively trivial to implement.
