Implement pre-packed blobs serialization on disk and their memory mapping on load #23069
base: main
Conversation
and pre-packed blob sharing when weight sharing is not enabled. Memory map pre-packed blobs. Recurse into subgraphs in ToGraphProtoWithExternalInitializers to make sure all big weights are serialized along with their pre-packs that are to be shared between the subgraphs.
7fc9a93 to fab27a7
// This includes options to align external initializer offset.
// For models running on CPU, ORT will try to use mmap to load external
// initializers. To use mmap, external initializer need to be offset aligned.
// ORT saves external initializers into signle data file, each initializer is
typo: single #Pending
bool save_mode_on_ = false;
Subgraph* parent_ = nullptr;
KeyToBlobMap& key_to_blobs_;
WeightToPrePacksMap sorted_by_weight_for_writing_;
The name sorted_by_weight_for_writing_ is confusing. I have not seen the sorting logic. #Pending
const auto blob_length = std::get<1>(blob);
SafeInt<FileOffsetType> end_of_blob{blob_offset};
end_of_blob += blob_length;
ORT_RETURN_IF(blob_offset < 0 || static_cast<std::uintmax_t>(end_of_blob) > file_length,
nit: use static_cast<FileOffsetType>(end_of_blob) so that it has the same data type as file_length. #Pending
This is due to the type of file_length.
for (size_t f = 1; f < split_fields.size(); ++f) {
const auto& blob = split_fields[f];
auto blob_fields = utils::SplitString(blob, ";", false);
if (blob_fields.size() == 3) {
Here blob_fields.size() != 3 is ignored. Shall we return failure in that case?
I have been pondering on this, and decided to be forgiving. After all, if a pre-packed blob is not loaded, then it is going to be re-generated.
class Subgraph {
public:
Subgraph(Subgraph* par, KeyToBlobMap& key_blobs, bool overwrite_for_save)
nit: use full name for parent instead of par #Pending
/// </summary>
using WeightToPrePacksMap = NodeHashMap<std::string, InlinedHashSet<std::string>>;
Status ToGraphProtoWithExternalInitiallizersImpl( |
Suggested change:
Status ToGraphProtoWithExternalInitializersImpl(
Some more documentation of the parameters would help here. Not clear what modified_external_file_path is. #Pending
// initializer) of the data file. To use mmap, each offset need to be aligned
// which means offset need to divisible by allocation granularity(64KB for
// windows and 4K for other OSes). With align_offset to true, ORT will align
Why do we need to align individual entries in the file? Naively I would have expected you'd mmap the whole file, and the internal offset would only need to be aligned for the data type. #Resolved
This is historical. This PR does not change that, but we can revisit the pros and cons.
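To illustrate the constraint under discussion: mmap()/MapViewOfFile() require the file offset to be a multiple of the allocation granularity, so if a blob's offset were not aligned at write time, the loader would have to map from the rounded-down offset and skip the difference. A rough sketch of that arithmetic (illustrative only, not code from this PR; the names are hypothetical):

```cpp
#include <cstdint>

struct MappedRange {
  std::uint64_t map_offset;  // offset actually passed to mmap()/MapViewOfFile()
  std::uint64_t delta;       // bytes to skip inside the mapping to reach the blob
  std::uint64_t map_length;  // mapping length that still covers the whole blob
};

// Round the blob offset down to the allocation granularity and remember the delta.
MappedRange ComputeMapping(std::uint64_t blob_offset, std::uint64_t blob_length,
                           std::uint64_t allocation_granularity) {
  const std::uint64_t map_offset = (blob_offset / allocation_granularity) * allocation_granularity;
  const std::uint64_t delta = blob_offset - map_offset;
  return {map_offset, delta, blob_length + delta};
}
```

Aligning each blob's offset at write time makes the delta zero, so every blob starts exactly at a mappable boundary.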
// The allocation Granularity for mmap() support.
// Typically 64KB for Windows & 4KB for other OSes. Default to 64KB.
int64_t allocation_granularity = 65536;
Can we set the default based on the platform we're building on?
I thought the prepack logic could potentially be platform or architecture specific (e.g. different layout is preferred on ARM64 vs AMD64), so I'm wondering whether you would only be creating this serialized pre-packed data on the target platform, and hence can set the default based on that. #Pending
I will set it to 4096 for non-Windows, but generally, one has to call sysconf to find out the page size.
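For reference, a rough sketch of how the granularity could be queried at runtime instead of hard-coding 65536/4096 (illustrative only; this helper is not part of the PR):

```cpp
#include <cstdint>
#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

// Query the mmap allocation granularity for the current platform.
std::int64_t GetAllocationGranularity() {
#if defined(_WIN32)
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  // MapViewOfFile offsets must be multiples of dwAllocationGranularity (usually 64KB).
  return static_cast<std::int64_t>(info.dwAllocationGranularity);
#else
  // mmap offsets must be multiples of the page size (usually 4KB).
  return static_cast<std::int64_t>(sysconf(_SC_PAGESIZE));
#endif
}
```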
class PrepackedForSerialization;
// These options that affect how the model initializers are saved.
Suggested change:
// These options affect how the model initializers are saved.
#Pending
// Use this config when save pre-packed constant initializers to an external data file.
// This allows to minimize ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
Suggested change:
// Use this config when saving pre-packed constant initializers to an external data file.
// This allows you to minimize the ONNX model file size and memory map pre-packed initializers on
// model load.
// - "0": Default is not save pre-packed initializers to a data file.
// - "1": Save pre-packed constant initializers to an external data file.
How does it allow minimizing the ONNX model file size? Would be good to explain in the comment. Naively I would have thought we need the same amount of initializer data and at best we wouldn't increase the total data size, and at worst we'd double it (having original and prepacked data and choosing which to load at runtime). #Pending
I added some more explanations. The native data is also mem-mapped, but if its reference count drops to zero during pre-packing, it is unmapped.
onnxruntime/core/graph/graph.cc (Outdated)
ONNX_NAMESPACE::GraphProto& output_graph_proto,
std::ostream& external_stream,
int64_t& external_offset) const {
// Process subgraphs
nit: Can this please say what the processing is intending to do? It's hard to review when the intention isn't stated. It's also easier for the next developer if you say what the processing is, so they don't need to read and understand the entire implementation to infer it. #Pending
auto hit = std::find_if(output_graph_proto.mutable_node()->begin(),
output_graph_proto.mutable_node()->end(),
nit: why do we need a mutable Node to check the name? #Pending
We do this so we can obtain a non-const iterator that allows us to refer to and modify the node attributes.
onnxruntime/core/graph/graph.cc (Outdated)
ORT_RETURN_IF_NOT(sub_hit != result_node.mutable_attribute()->end() && utils::HasGraph(*sub_hit),
"Subgraph ", name, " not found in node ", node.Name());
nit: this says 'what' happened but not 'why' it's an error, which limits the usefulness of the error message to the user. #Pending
// be empty. Else, save external data file in same directory as the model.
const std::filesystem::path modified_external_file_path = model_file_path.parent_path() / external_file_path;
// A recursive function that does bottom up with subgraphs
Status Graph::ToGraphProtoWithExternalInitiallizersImpl(
nit: this name might be confusing. IIUC the bulk of converting to the GraphProto is handled by the call to ToGraphProtoInternal. This implementation seems to be about updating the GraphProto with prepacked initializers once that GraphProto is created.
i.e. it's more like 'add prepacked weights to graph proto' than being the overall implementation of 'create graph proto with external initializers' #Resolved
This is the original name with Impl added to it, in the same manner we do graph processing when partitioning or function inlining bottom up.
It actually externalizes the initializers that exceed the specified threshold. I added the pre-packs here.
ToGraphProtoInternal is utilized to produce the graph proto that needs to be augmented with external data.
/// that contains the initializer data. However, it may be an outerscope initializer.
/// Thus we need to keep track of the pre-packed blobs that are not serialized in this
/// graph, so the parent can make sure it is being serialized.
Can we just keep the prepacked weight in the same Graph instance as the initializer? Seems complicated to be doing this sort of tracking vs. keeping the original and prepacked values in the same Graph instance.
Maybe that needs a similar lookup logic as we have for initializers with an outerscope search, but that's hopefully relatively trivial to implement.
Description
Pre-packing is a feature that allows kernels to re-arrange weight data to gain performance at inference time.
Currently, pre-packed blobs are shared when cross-session weight sharing is enabled, and only for those weights that are marked as shared by the user. Otherwise, the data resides on the heap and the kernels own it, which may lead to duplication.
This change enables pre-packed data to be stored on disk alongside the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the X-session shared container
or a new container that shares pre-packed blobs within the session.
With the new approach, pre-packed blobs are always owned by the shared container using the existing pre-pack mechanism for sharing. When X-session sharing is enabled, the external container owns the data.
A separate container owned by a root SessionState owns and shares the data when X-session sharing is not enabled.
To facilitate this new approach, we introduce a new container that works in two modes. When an optimized model is being saved and pre-packed weight saving is enabled, the new container records pre-packed blobs and serializes them to disk using the existing ToGraphProtoWithExternalInitializers function.
To externalize the pre-packed weights, we introduce a new session option, kOrtSessionOptionsSavePrePackedConstantInitializers. Note that pre-packing must be enabled (the default) for this to work. The ToGraphProtoWithExternalInitializers function is modified to recurse into subgraphs to make sure we properly account for local initializer names.
In the second mode, the container simply holds the pre-packed weights memory-mapped from disk and shares them with the kernels.
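For illustration, a sketch of how the option might be enabled from the C++ API when saving an optimized model. It assumes the new key is exposed in onnxruntime_session_options_config_keys.h alongside the existing session option keys; the exact key string and the companion external-initializers option shown here may differ:

```cpp
#include <onnxruntime_cxx_api.h>
#include <onnxruntime_session_options_config_keys.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "prepack_save"};
  Ort::SessionOptions so;

  // Existing mechanism: write large initializers to an external data file.
  so.AddConfigEntry(kOrtSessionOptionsOptimizedModelExternalInitializersFileName,
                    "model_optimized.onnx.data");

  // New option from this PR: also serialize pre-packed blobs so they can be
  // memory mapped on load. Pre-packing itself must stay enabled (the default).
  so.AddConfigEntry(kOrtSessionOptionsSavePrePackedConstantInitializers, "1");

  so.SetOptimizedModelFilePath(ORT_TSTR("model_optimized.onnx"));

  // The optimized model and its external data file are written during session creation.
  Ort::Session session{env, ORT_TSTR("model.onnx"), so};
  return 0;
}
```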
Motivation and Context
Reduce the memory usage of pre-packed initializers by externalizing them to disk.