
Implement pre-packed blobs serialization on disk and their memory mapping on load #23069

Open
yuslepukhin wants to merge 9 commits into base: main from yuslepukhin/prepack_serialize
Conversation

@yuslepukhin (Member) commented Dec 10, 2024

Description

Pre-packing is a feature that allows kernels to re-arrange weight data to gain performance at inference time.

Currently, pre-packed blobs are shared only when cross-session weight sharing is enabled, and only for those weights that are marked as shared by the user. Otherwise, the data resides on the heap, the kernels own the data, and it may be duplicated.

This change enables pre-packed data to be stored on disk alongside the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the X-session shared container
or a new container that shares pre-packed blobs within the session.

With the new approach, pre-packed blobs are always owned by the shared container using the existing pre-pack mechanism for sharing. When X-session sharing is enabled, the external container owns the data.
A separate container owned by a root SessionState owns and shares the data when X-session sharing is not enabled.

To facilitate this new approach, we introduce a new container that works in two modes. When an optimized model is being saved, and pre-packed weights saving is enabled, the new container will record pre-packed blobs and serialize them to disk using existing ToGraphProtoWithExternalInitializers function.

To externalize the pre-packed weights, we introduce a new session option kOrtSessionOptionsSavePrePackedConstantInitializers. Note that pre-packing must be enabled (the default) for this to work.

The ToGraphProtoWithExternalInitializers function is modified to recurse into subgraphs to make sure local initializer names are properly accounted for.

In the second mode, the container simply holds the pre-packed weights memory mapped from disk and shares them with the kernels.
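The two modes described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual ORT container; the class and member names here are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of a container that works in two modes: in save mode it records
// pre-packed blobs keyed by weight name so they can later be serialized to
// disk; in load mode it merely holds blobs that were memory mapped from
// disk and hands them out for sharing with the kernels.
class PrepackedBlobContainer {
 public:
  explicit PrepackedBlobContainer(bool save_mode) : save_mode_(save_mode) {}

  // In save mode, record the blob for later serialization.
  // In load mode, the blob comes pre-populated from the mapped file.
  void PutBlob(const std::string& key, std::vector<uint8_t> blob) {
    key_to_blob_[key] = std::move(blob);
  }

  // Kernels query by key; a miss means the blob must be re-generated.
  const std::vector<uint8_t>* GetBlob(const std::string& key) const {
    auto it = key_to_blob_.find(key);
    return it == key_to_blob_.end() ? nullptr : &it->second;
  }

  bool save_mode() const { return save_mode_; }

 private:
  bool save_mode_;
  std::map<std::string, std::vector<uint8_t>> key_to_blob_;
};
```

In the real implementation the blobs are reference counted and backed by mapped memory rather than owned vectors; this only illustrates the key-to-blob lookup shared between modes.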

Motivation and Context

Reduce memory usage by pre-packed initializers and externalize them.

- Enable pre-packed blob sharing when weight sharing is not enabled.
- Memory map pre-packed blobs.
- Recurse into subgraphs in ToGraphProtoWithExternalInitializers to make sure all big weights are serialized along with their pre-packs, which are to be shared between the subgraphs.
@yuslepukhin yuslepukhin force-pushed the yuslepukhin/prepack_serialize branch from 7fc9a93 to fab27a7 Compare December 10, 2024 23:00
@yuslepukhin yuslepukhin marked this pull request as ready for review December 11, 2024 18:38
// This includes options to align external initializer offset.
// For models running on CPU, ORT will try to use mmap to load external
// initializers. To use mmap, external initializer need to be offset aligned.
// ORT saves external initializers into signle data file, each initializer is
@tianleiwu (Contributor) commented Dec 12, 2024:

typo: single #Pending

bool save_mode_on_ = false;
Subgraph* parent_ = nullptr;
KeyToBlobMap& key_to_blobs_;
WeightToPrePacksMap sorted_by_weight_for_writing_;
@tianleiwu (Contributor) commented Dec 12, 2024:

The name sorted_by_weight_for_writing_ is confusing. I have not seen the sorting logic. #Pending

const auto blob_length = std::get<1>(blob);
SafeInt<FileOffsetType> end_of_blob{blob_offset};
end_of_blob += blob_length;
ORT_RETURN_IF(blob_offset < 0 || static_cast<std::uintmax_t>(end_of_blob) > file_length,
@tianleiwu (Contributor) commented Dec 12, 2024:

nit: use static_cast<FileOffsetType>(end_of_blob) so that it has the same data type as file_length. #Pending

@yuslepukhin (Member, Author) replied:

This is due to file_length type.
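The check being discussed guards against a blob's offset and length running past the end of the mapped file. A minimal stdlib-only sketch of the same validation, without ORT's SafeInt helper (function and parameter names are illustrative):

```cpp
#include <cstdint>

// Verify that a blob occupying [offset, offset + length) lies entirely
// within a file of file_length bytes. Plain stand-in for the
// SafeInt-based overflow check in the PR.
bool BlobWithinFile(int64_t blob_offset, uint64_t blob_length,
                    uint64_t file_length) {
  if (blob_offset < 0) return false;
  const uint64_t start = static_cast<uint64_t>(blob_offset);
  // Overflow-safe form of: start + blob_length <= file_length
  if (blob_length > file_length) return false;
  return start <= file_length - blob_length;
}
```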

for (size_t f = 1; f < split_fields.size(); ++f) {
const auto& blob = split_fields[f];
auto blob_fields = utils::SplitString(blob, ";", false);
if (blob_fields.size() == 3) {
A Contributor commented:

Here blob_fields.size() != 3 is ignored. Shall we return failure in that case?

@yuslepukhin (Member, Author) replied:

I have been pondering this and decided to be forgiving. After all, if a pre-packed blob is not loaded, it is simply going to be re-generated.
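The forgiving parse described here can be sketched as follows. The "key;offset;length" record format, the struct, and the function name are illustrative stand-ins, not the actual ORT format:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct BlobRecord {
  std::string key;
  int64_t offset;
  int64_t length;
};

// Parse records of the form "key;offset;length". Malformed records are
// skipped rather than treated as an error, since a pre-packed blob that
// fails to load is simply re-generated at session initialization.
std::vector<BlobRecord> ParseBlobRecords(const std::vector<std::string>& fields) {
  std::vector<BlobRecord> out;
  for (const auto& field : fields) {
    std::vector<std::string> parts;
    std::stringstream ss(field);
    std::string part;
    while (std::getline(ss, part, ';')) parts.push_back(part);
    if (parts.size() != 3) continue;  // forgiving: skip the malformed entry
    out.push_back({parts[0], std::stoll(parts[1]), std::stoll(parts[2])});
  }
  return out;
}
```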


class Subgraph {
public:
Subgraph(Subgraph* par, KeyToBlobMap& key_blobs, bool overwrite_for_save)
@tianleiwu (Contributor) commented Dec 12, 2024:

nit: use the full name parent instead of par #Pending

@yuslepukhin yuslepukhin requested a review from edgchen1 December 12, 2024 22:47
/// </summary>
using WeightToPrePacksMap = NodeHashMap<std::string, InlinedHashSet<std::string>>;

Status ToGraphProtoWithExternalInitiallizersImpl(
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
Status ToGraphProtoWithExternalInitiallizersImpl(
Status ToGraphProtoWithExternalInitializersImpl(

Some more documentation of the parameters would help here. It is not clear what modified_external_file_path is. #Pending

Comment on lines +16 to +18
// initializer) of the data file. To use mmap, each offset need to be aligned
// which means offset need to divisible by allocation granularity(64KB for
// windows and 4K for other OSes). With align_offset to true, ORT will align
@skottmckay (Contributor) commented Dec 13, 2024:

Why do we need to align individual entries in the file? Naively I would have expected you'd mmap the whole file, and the internal offset would only need to be aligned for the data type. #Resolved

@yuslepukhin (Member, Author) replied:

This is historical. This PR does not change that, but we can revisit the pros and cons.

Comment on lines 35 to 37
// The allocation Granularity for mmap() support.
// Typically 64KB for Windows & 4KB for other OSes. Default to 64KB.
int64_t allocation_granularity = 65536;
@skottmckay (Contributor) commented Dec 13, 2024:

Can we set the default based on the platform we're building on?

I thought the prepack logic could potentially be platform or architecture specific (e.g. different layout is preferred on ARM64 vs AMD64), so I'm wondering whether you would only be creating this serialized pre-packed data on the target platform, and hence can set the default based on that. #Pending

@yuslepukhin (Member, Author) replied:

I will set it to 4096 for non-Windows, but generally one has to call sysconf to find out the page size.
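The platform-dependent default discussed here could be sketched like this; the function names are illustrative, and the Windows value follows the 64 KB allocation granularity mentioned in the review:

```cpp
#include <cstdint>
#if !defined(_WIN32)
#include <unistd.h>  // sysconf; POSIX only
#endif

// Choose the mmap allocation granularity per platform: on POSIX, query
// the page size at runtime instead of hard-coding 4096; on Windows the
// allocation granularity is 64 KB (GetSystemInfo would report it).
int64_t AllocationGranularity() {
#if defined(_WIN32)
  return 65536;
#else
  long page = sysconf(_SC_PAGESIZE);
  return page > 0 ? page : 4096;  // fall back to the common default
#endif
}

// Round an offset up to the next multiple of the granularity so a
// mapping can start exactly at that offset.
int64_t AlignUp(int64_t offset, int64_t granularity) {
  return ((offset + granularity - 1) / granularity) * granularity;
}
```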


class PrepackedForSerialization;

// These options that affect how the model initializers are saved.
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
// These options that affect how the model initializers are saved.
// These options affect how the model initializers are saved.
#Pending

Comment on lines 253 to 257
// Use this config when save pre-packed constant initializers to an external data file.
// This allows to minimize ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
@skottmckay (Contributor) commented Dec 13, 2024:

Suggested change
// Use this config when save pre-packed constant initializers to an external data file.
// This allows to minimize ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
// Use this config when saving pre-packed constant initializers to an external data file.
// This allows you to minimize the ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.

How does it allow minimizing the ONNX model file size? Would be good to explain in the comment. Naively I would have thought we need the same amount of initializer data and at best we wouldn't increase the total data size, and at worst we'd double it (having original and prepacked data and choosing which to load at runtime). #Pending

@yuslepukhin (Member, Author) replied:

I added some more explanations. The native data is also memory mapped, but if its reference count drops to zero during pre-packing, it is unmapped.
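The drop-to-zero unmapping behavior described here can be sketched with a reference-counted handle whose deleter releases the mapping when the last user goes away. A boolean flag stands in for the actual munmap call, and the names are hypothetical:

```cpp
#include <memory>

// Stand-in for a memory-mapped region of native weight data. The flag
// records whether the mapping has been released.
struct MappedRegion {
  bool* unmapped_flag;
};

// Hand out a shared handle to the mapping. When the reference count
// drops to zero, the custom deleter "unmaps" the region (sets the flag
// where the real code would call munmap/UnmapViewOfFile).
std::shared_ptr<MappedRegion> MapRegion(bool* unmapped_flag) {
  *unmapped_flag = false;
  return std::shared_ptr<MappedRegion>(
      new MappedRegion{unmapped_flag},
      [](MappedRegion* r) {
        *r->unmapped_flag = true;  // stand-in for the unmap syscall
        delete r;
      });
}
```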

ONNX_NAMESPACE::GraphProto& output_graph_proto,
std::ostream& external_stream,
int64_t& external_offset) const {
// Process subgraphs
@skottmckay (Contributor) commented Dec 13, 2024:

nit: Can this please say what the processing is intended to do? It is hard to review when the intention is not stated. It is also easier for the next developer if you state what the processing does, so they do not need to read and understand the entire implementation to infer it. #Pending

Comment on lines +4103 to +4104
auto hit = std::find_if(output_graph_proto.mutable_node()->begin(),
output_graph_proto.mutable_node()->end(),
@skottmckay (Contributor) commented Dec 13, 2024:

nit: why do we need a mutable Node to check the name? #Pending

@yuslepukhin (Member, Author) commented Dec 16, 2024:

We do this so we can obtain a non-const iterator that allows us to refer to and modify the node attributes.

Comment on lines 4120 to 4121
ORT_RETURN_IF_NOT(sub_hit != result_node.mutable_attribute()->end() && utils::HasGraph(*sub_hit),
"Subgraph ", name, " not found in node ", node.Name());
@skottmckay (Contributor) commented Dec 13, 2024:

nit: this says 'what' happened but not 'why' it's an error, which limits the usefulness of the error message to the user. #Pending

// be empty. Else, save external data file in same directory as the model.
const std::filesystem::path modified_external_file_path = model_file_path.parent_path() / external_file_path;
// A recursive function that does bottom up with subgraphs
Status Graph::ToGraphProtoWithExternalInitiallizersImpl(
@skottmckay (Contributor) commented Dec 13, 2024:

nit: this name might be confusing. IIUC the bulk of converting to the GraphProto is handled by the call to ToGraphProtoInternal. This implementation seems to be about updating the GraphProto with prepacked initializers once that GraphProto is created.

i.e. it's more like 'add prepacked weights to graph proto' than being the overall implementation of 'create graph proto with external initializers' #Resolved

@yuslepukhin (Member, Author) replied:

This is the original name with Impl appended, in the same manner we do bottom-up graph processing when partitioning or inlining functions.

It actually externalizes the initializers that exceed the specified threshold.

I added the pre-packs here.
ToGraphProtoInternal is utilized to produce the graph proto that needs to be augmented with external data.
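The bottom-up recursion described here, where subgraphs are handled before their parent so local initializer names are accounted for first, can be sketched as follows. The GraphNode type and fields are hypothetical stand-ins for the actual GraphProto structures:

```cpp
#include <string>
#include <vector>

// Stand-in for a graph with nested subgraphs (e.g. If/Loop bodies).
struct GraphNode {
  std::string name;
  std::vector<GraphNode> subgraphs;
};

// Recurse into subgraphs first, then process the current graph, so each
// subgraph's initializers are externalized before its parent's are.
void ProcessBottomUp(const GraphNode& graph, std::vector<std::string>& order) {
  for (const auto& sub : graph.subgraphs) {
    ProcessBottomUp(sub, order);  // subgraphs before parent
  }
  order.push_back(graph.name);  // then handle this graph's initializers
}
```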

Comment on lines 1495 to 1497
/// that contains the initializer data. However, it may be an outerscope initializer.
/// Thus we need to keep track of the pre-packed blobs that are not serialized in this
/// graph, so the parent can make sure it is being serialized.
A Contributor commented:

Can we just keep the prepacked weight in the same Graph instance as the initializer? Seems complicated to be doing this sort of tracking vs. keeping the original and prepacked values in the same Graph instance.

Maybe that needs a similar lookup logic as we have for initializers with an outerscope search, but that's hopefully relatively trivial to implement.
