5-minute Aggregation Datasets Comparison Between Old and Modernized PeMS System --- A Case Study #501

Open
thehanggit opened this issue Dec 18, 2024 · 4 comments

Comments

@thehanggit
Contributor

The motivation, goal, and step-by-step process are illustrated in the Snowflake Notebook. In short, we compare the detector 5-minute aggregation tables in the clearinghouse between the old and modernized PeMS systems, since these tables serve as the foundation for all downstream models.

Overall, 93.20% of the data from districts 6, 8, and 12 is identical. The remaining 6.80% of differences are categorized as follows.

  • 0.67%: attributed to rounding bias, so these records can be treated as identical
  • 3.18%: potentially caused by data relay loss
  • 2.95%: currently unexplained, but extreme volume values (over 1000) observed in the modernized PeMS data may provide clues for further investigation
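
For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how matched 5-minute records could be bucketed. It is not the notebook's actual logic: the function names, the `rounding_tol` of 1, and the rule that a value missing on one side is a relay-loss candidate are all assumptions made for illustration. Only the "over 1000" cutoff for extreme volumes comes from the analysis above.

```python
from collections import Counter


def classify_pair(old_volume, new_volume, rounding_tol=1, extreme_threshold=1000):
    """Label one matched (old, new) 5-minute volume pair.

    Hypothetical rules, not the notebook's: a value missing on either side
    is a relay-loss candidate, and a gap of at most rounding_tol is rounding.
    """
    if old_volume is None or new_volume is None:
        return "relay_loss_candidate"
    if old_volume == new_volume:
        return "identical"
    if abs(old_volume - new_volume) <= rounding_tol:
        return "rounding"
    if new_volume > extreme_threshold:
        return "extreme_value"
    return "unexplained"


def summarize(pairs):
    """Return the percentage of pairs that fall into each label."""
    counts = Counter(classify_pair(old, new) for old, new in pairs)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}


# Toy data only; these are not values from the PeMS tables.
sample_pairs = [(10, 10), (10, 11), (10, None), (10, 1500), (10, 20)]
shares = summarize(sample_pairs)
```

The order of the checks matters: a pair is only tested against the extreme-value cutoff after the rounding check, so a small gap on a large value still counts as rounding.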
@kengodleskidot
Contributor

Thank you for the analysis @thehanggit. I believe the extreme volume values will be addressed through the high flow value issue described in #278, which has been backlogged. Once a fix for the high flow values is implemented, it would be interesting to see how that impacts the 2.95% of currently unexplained differences.

@thehanggit
Contributor Author

Got it. So these extreme values are observed directly in the raw data rather than produced by normalization or other postprocessing, whereas the old PeMS system dealt with this issue in its 5-minute table.

@kengodleskidot
Contributor

You are correct, the high flow values are being reported directly by the devices in the raw data. There may be instances where normalization results in high values, but I suspect that is very rare. I believe once a high flow value threshold methodology is determined at the appropriate level (detector, station, etc.) the flow value would be replaced by either an imputed flow value or a max flow value that would need to be determined at the same level. There is no documentation that I am aware of that details how existing PeMS handles high flow values.
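
To make the two replacement options concrete, here is a rough sketch, not a decided methodology. The function name `replace_high_flow`, the `max_flow` argument, and the neighbor-mean imputation are assumptions for illustration. How the threshold is chosen, and at what level (detector, station, etc.), is still open.

```python
def replace_high_flow(flows, max_flow, mode="cap"):
    """Replace flow values above max_flow in one detector's 5-minute series.

    mode="cap" substitutes max_flow itself. mode="impute" substitutes the
    mean of the nearest valid values before and after the outlier, which is
    a stand-in here for whatever imputation method is eventually chosen.
    """
    if mode not in ("cap", "impute"):
        raise ValueError("mode must be 'cap' or 'impute'")

    def is_valid(value):
        return value is not None and value <= max_flow

    cleaned = []
    for i, value in enumerate(flows):
        if value is None or value <= max_flow:
            cleaned.append(value)  # missing or in-range values pass through
            continue
        if mode == "cap":
            cleaned.append(max_flow)
            continue
        # Search outward in the original series for valid neighbors.
        prev_valid = next((v for v in reversed(flows[:i]) if is_valid(v)), None)
        next_valid = next((v for v in flows[i + 1:] if is_valid(v)), None)
        neighbors = [v for v in (prev_valid, next_valid) if v is not None]
        cleaned.append(sum(neighbors) / len(neighbors) if neighbors else None)
    return cleaned


# Toy series; 1000 is used as the cutoff only because the analysis flagged
# volumes over 1000 as extreme.
capped = replace_high_flow([40, 1500, 60], max_flow=1000, mode="cap")
imputed = replace_high_flow([40, 1500, 60], max_flow=1000, mode="impute")
```

Capping keeps the record but still reports an unusually high flow, while imputation hides the outlier entirely, so the choice affects how many of the unexplained differences would remain visible in a later comparison.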

@thehanggit
Contributor Author

That is clear enough! Thank you, Ken. I will continue the analysis to find potential reasons for the remaining differences. Hopefully we can explain all of them!
