attempt to multiply with overflow when using i32 #138
https://github.com/hamirmahal/rstar-i32/ has some code that reproduces the issue. |
https://github.com/hamirmahal/rstar-i32/actions/runs/6836250994/job/18590953607#step:3:43 also shows the output on a machine that isn't mine, although you have to be signed in to see it via this link. |
I'm not sure if the error is in my code, or in
Rust code that resulted in the above output
use rstar::Point;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Data {
    coordinates: [i32; 18],
}

impl Point for Data {
    type Scalar = i32;
    const DIMENSIONS: usize = 18;

    fn nth(&self, index: usize) -> Self::Scalar {
        self.coordinates[index]
    }

    fn nth_mut(&mut self, index: usize) -> &mut Self::Scalar {
        &mut self.coordinates[index]
    }

    fn generate(mut generator: impl FnMut(usize) -> Self::Scalar) -> Self {
        Data {
            coordinates: [
                generator(0),
                generator(1),
                generator(2),
                generator(3),
                generator(4),
                generator(5),
                generator(6),
                generator(7),
                generator(8),
                generator(9),
                generator(10),
                generator(11),
                generator(12),
                generator(13),
                generator(14),
                generator(15),
                generator(16),
                generator(17),
            ],
        }
    }
}

fn main() {
    let mut tree = rstar::RTree::new();
    tree.insert(Data {
        coordinates: [
            30, 30, 62, 30, 31, 30, 20, 12, 30, 80, 40, 132, 28, 78, 140, 139, 195, 67,
        ],
    });
    tree.insert(Data {
        coordinates: [
            40, 9, 9, 144, 136, 149, 148, 140, 198, 36, 50, 167, 179, 234, 38, 2, 103, 38,
        ],
    });
    tree.insert(Data {
        coordinates: [
            38, 38, 38, 6, 2, 3, 155, 61, 61, 195, 155, 15, 70, 134, 158, 126, 94, 63,
        ],
    });
    tree.insert(Data {
        coordinates: [
            154, 154, 138, 179, 121, 75, 143, 31, 7, 67, 11, 3, 113, 113, 65, 65, 73, 65,
        ],
    });
    tree.insert(Data {
        coordinates: [
            221, 215, 209, 115, 210, 198, 224, 236, 111, 6, 7, 7, 85, 92, 203, 197, 36, 44,
        ],
    });
    tree.insert(Data {
        coordinates: [
            176, 240, 176, 176, 112, 241, 240, 240, 192, 44, 44, 12, 206, 204, 142, 4, 44, 2,
        ],
    });
    tree.insert(Data {
        coordinates: [
            212, 218, 202, 70, 70, 23, 23, 23, 1, 230, 250, 55, 30, 23, 60, 60, 92, 16,
        ],
    });
} |
rstar uses a fold to calculate the area of the envelope (an
diag.fold(one, |acc, cur| {
    max_inline(cur, zero) * acc
})
During the insertion operation on line 78 of your code, you can see the problem:
Multiplying |
Ah, because the product, 24,420,917,248, is larger than the max
If each coordinate is a
So the maximum possible product currently is 255^18 = 2.078371831996e43, which I don’t even think the maximum unsigned integer size,
Is calculating the area of the bounding box required to use an R-tree? I’m wondering if it’s possible to write an implementation that avoids this calculation altogether, so there is no risk of overflow.
Alternatively, would checking for overflow before the multiplication, and setting the product to
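To make the overflow concrete, here is a minimal sketch of the two guards floated in this comment, written against a plain slice of `i32` extents rather than rstar's internals. The function names `checked_area`, `saturating_area` and `fits_in_u128`, and any sample extents, are hypothetical; only the shape of the fold (clamp each extent at zero, multiply into an accumulator that starts at one) follows the snippet quoted in this thread.

```rust
/// Product of the non-negative extents, or `None` if it does not fit in an i32.
/// Hypothetical helper for illustration; not part of rstar.
fn checked_area(extents: &[i32]) -> Option<i32> {
    extents
        .iter()
        .try_fold(1i32, |acc, &cur| acc.checked_mul(cur.max(0)))
}

/// Same fold, but clamps the running product at i32::MAX instead of failing.
/// Hypothetical helper for illustration; not part of rstar.
fn saturating_area(extents: &[i32]) -> i32 {
    extents
        .iter()
        .fold(1i32, |acc, &cur| acc.saturating_mul(cur.max(0)))
}

/// Whether 255^exp is representable in a u128, the widest built-in integer.
fn fits_in_u128(exp: u32) -> bool {
    255u128.checked_pow(exp).is_some()
}
```

With five extents of 200, `checked_area` returns `None` and `saturating_area` returns `i32::MAX`, while `fits_in_u128(18)` is `false`, consistent with the estimate that 255^18 is beyond any built-in integer type. Saturating would keep insertion from panicking, but every saturated envelope then compares equal by area, so the quality of the tree may degrade.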
Hmm... where does the current value of
Also, how were you able to see the current and total values in
I tried adding
diag.fold(one, |acc, cur| {
    std::println!("acc: {}, cur: {}", acc, cur);
    return max_inline(cur, zero) * acc;
})
to
Thanks for the detailed response. |
Just to clarify: You're using 18-dimensional coordinates, is that right? |
Ok, I see your example: So you are using 18-D coordinates. Just to ask a silly question: Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates? |
This is correct. |
I'm actually trying to find the element in my dataset with the closest Hamming distance to some input. The things I'm comparing currently have 18 bytes, hence the 18 coordinates that can each be as large as Do you think R-trees are a suitable data structure for this purpose? |
Well, I can say with certainty that this particular R-tree implementation is apparently not currently suitable for this purpose. 🤣 But more seriously... I've never worked with Hamming distance. The definition I searched up was:
So, Hamming distance is about bitwise distance - not Euclidean distance. In other words:
But with typical numeric distance we get a very different answer:
So intuitively, I don't understand how you'd use an R-tree, which deals with Euclidean numeric distance, to efficiently compute Hamming distance. |
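A small sketch of that mismatch, using byte values chosen purely for illustration (they are not taken from the thread): `hamming` counts the bit positions that differ, and `numeric` is the ordinary absolute difference. Both function names are hypothetical.

```rust
/// Number of bit positions in which two bytes differ.
fn hamming(a: u8, b: u8) -> u32 {
    (a ^ b).count_ones()
}

/// Ordinary one-dimensional distance between the same two bytes.
fn numeric(a: u8, b: u8) -> u8 {
    a.abs_diff(b)
}
```

127 and 128 are numeric neighbours (`numeric` gives 1) yet differ in all eight bits (`hamming` gives 8), while 0 and 128 are 128 apart numerically but only one bit apart. Points that are close in one metric can be far apart in the other.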
Hmm. Hamming distance satisfies the triangle inequality, so I think it's possible to use R[*]-trees, but I have no idea how we're going to make it work for rstar… |
Ha!
That all sounds correct.
I ended up doing
impl PointDistance for Data {
    fn distance_2(&self, point: &[f32; 18]) -> f32 {
        let hamming_distance: f32 = self
            .hash
            .as_slice()
            .iter()
            .zip(point)
            .map(|(l, r)| {
                // cast l and r to u8, because we know that the values are in the range [0, 255].
                let l = *l as u8;
                let r = *r as u8;
                (l ^ r).count_ones() as f32
            })
            .sum();
        // We must return the squared distance!
        hamming_distance * hamming_distance
    }
}
but it seems to be quite computationally expensive, and not very efficient. I used |
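For comparison, a brute-force baseline is sketched below: a linear scan that XORs the bytes and counts set bits, staying in integers throughout. It assumes the items are plain `[u8; 18]` arrays. The names `hamming18` and `nearest` are hypothetical, and this is not something the thread settled on, only a reference point against which a tree-based approach could be measured.

```rust
/// Hamming distance between two 18-byte values.
fn hamming18(a: &[u8; 18], b: &[u8; 18]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(l, r)| (l ^ r).count_ones())
        .sum()
}

/// Index and distance of the item closest to `query`, or `None` for an empty slice.
/// Ties resolve to the earliest index.
fn nearest(items: &[[u8; 18]], query: &[u8; 18]) -> Option<(usize, u32)> {
    items
        .iter()
        .enumerate()
        .map(|(i, item)| (i, hamming18(item, query)))
        .min_by_key(|&(_, d)| d)
}
```

Each query costs one pass over the dataset, so this only scales to modest sizes, but it involves no float casts and no squaring of the distance.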
What's the |
Oh, so I'm ideally trying to find the closest Hamming distance between |
Hm. If you're using the default |
Ah, I see… I’ll look into some different approaches. |
https://nnethercote.github.io/2021/12/08/a-brutally-effective-hash-function-in-rust.html FxHash will certainly be faster (https://crates.io/crates/fxhash) |
Do we support bnum? It supports a u256, which should suffice for the above calculation. |
I think before we go ahead and make this crate more difficult to maintain, it should be established whether R* trees are actually suitable for querying Hamming distances. While it could be innovative cross-pollination between different areas of mathematics, I am somewhat doubtful that the geometric intuitions underpinning the construction of R* trees transfer over to code spaces. Generally, there exist a lot of metric spaces which are decidedly weird compared to the two- or three-dimensional Euclidean space our intuition is still based on, even when alternative metrics like Manhattan are used.
So, I think the first step would be to just modify the source of
(But if the changes are too extensive, the answer might still be no, as saying no is often what we as maintainers need to do to keep complexity in check. We also have to represent the interests of our current user base, which I guess is mostly GIS, games and then some simulation stuff, in order of popularity.)
Personally, my first guess when reading the original problem would be to look into Levenshtein automata, which are used for typo-resistant search, i.e. to find a matching term with the fewest possible edits away from the query term. |
Oh, I don't think anyone was actually proposing a change. Certainly when I said "make it work" I didn't mean it in the sense of modification. |
@hamirmahal I think you can get it working by just converting the [u8; 18] array into a [f64; 18] array before storing it in the hash variable. Floats (f64 or f32) should avoid the overflow panic, and f64 covers the range you need. There'll be some loss in precision in that calculation, but I think it's only used for re-balancing of the R-tree that is constructed (and so must still work, with possibly poorer performance). |
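A minimal sketch of that conversion, using only the standard library. `to_f64`, `worst_case_area_f64` and `worst_case_area_f32` are hypothetical names, and the worst case assumed here is an envelope spanning 0 to 255 in all 18 dimensions.

```rust
/// Widen each byte to an f64 coordinate.
/// Lossless: every u8 value is exactly representable as an f64.
fn to_f64(bytes: [u8; 18]) -> [f64; 18] {
    bytes.map(f64::from)
}

/// Worst-case envelope area, 255^18, accumulated in f64.
fn worst_case_area_f64() -> f64 {
    to_f64([255u8; 18]).iter().product()
}

/// The same product accumulated in f32, for comparison.
fn worst_case_area_f32() -> f32 {
    [255.0f32; 18].iter().product()
}
```

In f64 the product is about 2.08e43 and stays finite. In f32 it exceeds `f32::MAX` (about 3.4e38) and becomes infinite. That does not panic, but it would make the largest envelopes indistinguishable by area, so f64 looks like the safer of the two choices here.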
@rmanoka I'll give that a look. Thanks for the suggestion. |