Comparison of multiple images with the Phi-3 Vision model #563
- onnxruntime-genai currently does not support loading multiple images and running the Phi-3 vision model with more than one image. Depending on our internal prioritization, we may add support for the multiple-images-per-prompt scenario in the near future. I'll convert this issue into a discussion now.
- You can try passing the images in separately; by carrying the previous messages forward as context, you can get output that covers multiple images.
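  A minimal sketch of that workaround, reusing the single-image API from the example further down in this thread. The model path, image paths, and prompts are placeholders, and the generation loop may differ between onnxruntime-genai versions:

  ```python
  import onnxruntime_genai as og

  # Placeholder paths -- adjust to your local model and images.
  model = og.Model(r'phi-3-vision-128k\cpu-int4-rtn-block-32-acc-level-4')
  processor = model.create_multimodal_processor()
  tokenizer_stream = og.Tokenizer(model).create_stream()

  def ask_about_image(image_path, question, max_length=3072):
      """Run one single-image prompt and return the generated text."""
      image = og.Images.open(image_path)
      prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
      inputs = processor(prompt, images=image)
      params = og.GeneratorParams(model)
      params.set_inputs(inputs)
      params.set_search_options(max_length=max_length)
      generator = og.Generator(model, params)
      answer = ""
      while not generator.is_done():
          generator.compute_logits()
          generator.generate_next_token()
          answer += tokenizer_stream.decode(generator.get_next_tokens()[0])
      del generator
      return answer

  # First pass: describe image 1. Second pass: carry that description forward
  # as "previous message" context while showing image 2.
  description1 = ask_about_image('Pic1.jpg', 'Describe this image in detail.')
  comparison = ask_about_image(
      'Pic2.jpg',
      f'A previous image was described as follows: "{description1}" '
      'Compare that image with this one and list the similarities and differences.')
  print(comparison)
  ```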
- @baijumeswani since Phi-3.5 Vision is essentially designed to handle multiple images, unlike Phi-3 Vision, will this be prioritized now?
- What about just combining the two images into one image? Would that work?
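  For anyone who wants to try that idea, a small illustrative sketch using Pillow (not part of onnxruntime-genai; the file names are placeholders) that stitches the two pictures side by side so the result can go through the existing single-image path as `<|image_1|>`:

  ```python
  from PIL import Image

  # Stitch two placeholder images side by side on a white canvas.
  img1 = Image.open('Pic1.jpg')
  img2 = Image.open('Pic2.jpg')

  canvas = Image.new('RGB',
                     (img1.width + img2.width, max(img1.height, img2.height)),
                     'white')
  canvas.paste(img1, (0, 0))
  canvas.paste(img2, (img1.width, 0))
  canvas.save('combined.jpg')  # feed this file to the single-image pipeline
  ```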
- Support for multiple images will be added in this PR for Phi-3.5 vision. More work is needed to support Phi-3.5 vision, however, and that work is in progress.
- The new Phi-3 vision and Phi-3.5 vision ONNX models have now been released. The new models support no-image, single-image, and multi-image scenarios.
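  For reference, a sketch of the multi-image call with the updated models. It assumes `og.Images.open` accepts several paths and that the processor takes the resulting `Images` object directly rather than a Python list; the model path is a placeholder, and the phi3v example in the onnxruntime-genai repo is the authoritative version:

  ```python
  import onnxruntime_genai as og

  # Placeholder model path -- point this at a released multi-image model.
  model = og.Model(r'phi-3.5-vision-instruct\cpu-int4-rtn-block-32-acc-level-4')
  processor = model.create_multimodal_processor()
  tokenizer_stream = og.Tokenizer(model).create_stream()

  # Open both images in a single og.Images object and reference each one
  # with its own <|image_N|> tag in the prompt.
  images = og.Images.open('Pic1.jpg', 'Pic2.jpg')
  prompt = ("<|user|>\n<|image_1|>\n<|image_2|>\n"
            "Please explain the similarities and differences between these two images<|end|>\n"
            "<|assistant|>\n")

  inputs = processor(prompt, images=images)
  params = og.GeneratorParams(model)
  params.set_inputs(inputs)
  params.set_search_options(max_length=3072)

  generator = og.Generator(model, params)
  while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()
      print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end='', flush=True)
  del generator
  ```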
- Does the Phi-3 Vision ONNX model support the comparison of multiple images? In the Phi-3Cookbook they show an example with two images. Based on that example, I would expect the code to look something like this:
  ```python
  import onnxruntime_genai as og
  import os

  def processImage(image_path):
      image = None
      if len(image_path) == 0:
          print("No image provided")
      else:
          print("Loading image...")
          if not os.path.exists(image_path):
              raise FileNotFoundError(f"Image file not found: {image_path}")
          image = og.Images.open(image_path)
      return image

  modelName = r'phi-3-vision-128k\cpu-int4-rtn-block-32-acc-level-4'
  image1_path = 'Pic1.jpg'
  image2_path = 'Pic2.jpg'
  output_tokens = 3072
  text = 'Please explain the similarities and differences between these two images'

  model = og.Model(modelName)
  tokenizer = og.Tokenizer(model)
  processor = model.create_multimodal_processor()
  tokenizer_stream = tokenizer.create_stream()

  # Build the prompt with one <|image_N|> tag per image
  prompt = "<|user|>\n"
  image1 = processImage(image1_path)
  prompt += "<|image_1|>\n"
  image2 = processImage(image2_path)
  prompt += "<|image_2|>\n"
  prompt += f"{text}<|end|>\n<|assistant|>\n"

  # Pass both images to the multimodal processor
  inputs = processor(prompt, images=[image1, image2])

  params = og.GeneratorParams(model)
  params.set_inputs(inputs)
  params.set_search_options(max_length=output_tokens)

  # Generate tokens one at a time and accumulate the decoded text
  generator = og.Generator(model, params)
  output_str = ""
  while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()
      new_token = generator.get_next_tokens()[0]
      output_str += tokenizer_stream.decode(new_token)
  print(output_str)
  del generator
  ```
  This code throws the error `RuntimeError: Unable to cast Python instance to C++ type (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)`. I can't find documentation for the MultiModalProcessor class to see whether there is a different way to build the inputs. Any help or suggestions would be appreciated. Thanks!