Releases: IBM/unitxt
1.12.3
Main changes
- New option to use multiple templates and/or num_demos in a single dataset recipe. Unitxt will randomly sample from the provided templates and possible numbers of demos for each instance (see the sketch after this list). See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it. See more details on how to deal with conflicting metric names in https://www.unitxt.ai/en/latest/docs/adding_metric.html#metric-outputs-with-multiple-metrics
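A minimal sketch of the new multi-template / multi-num_demos recipe, based on the linked example; the load_dataset keyword arguments and catalog names below are illustrative assumptions rather than a definitive API reference:

```python
# Minimal sketch, assuming the recipe kwargs used in
# examples/evaluate_different_templates_num_demos.py; catalog names are illustrative.
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.wnli",
    # a list of templates: one is sampled at random for each instance
    template=[
        "templates.classification.multi_class.relation.default",
        "templates.key_val",
    ],
    # a list of demo counts: one is sampled at random for each instance
    num_demos=[0, 1, 3],
    demos_pool_size=50,
    loader_limit=100,
)
```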
Non backward compatible changes in catalog
- Changed RAG metrics naming convention (e.g. "metrics.rag.mrr" -> "metrics.rag.context_correctness.mrr") - a non backward compatible catalog change by @assaftibm in #1104
- Update summarization task and templates to support multiple reference summaries by @yoavkatz in #1126
- Fix belebele due to new convention by @elronbandel in #1145
Additions to catalog
- Add DeepSeek-Coder format and system prompt by @oktie in #1105
- Add a metric to calculate the ratio of references included in the prediction by @marukaz in #1091
- Add RAG bge metrics by @assaftibm
New Features
- Add option to run multiple templates and/or num_demos in a single dataset recipe. It is now possible to give a list of templates or a list of num_demos values; Unitxt will randomly sample from the list and assign a random template (or number of demos) to each instance. by @elronbandel in #1110
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it, by @dafnapension in #1124
- The MetricPipeline field postpreprocess_steps has been renamed to postprocess_steps. The old field (postpreprocess_steps) still exists for backward compatibility but is deprecated. by @dafnapension in #1117
- Decrease runtime of demo examples
- Add tests for RAG metrics by @matanor
- Add dedicated Unitxt warning and error classes that link to the online documentation by @yoavkatz
- The code now uses a central controllable deepcopy function by @elronbandel in #1120
Bug Fixes
- Create a dedicated nltk mixin for downloading all versions of punkt needed by the metrics code. by @elronbandel in #1151
- For bulk instance metrics, replace the mean function with nanmean to support aggregation when some scores are nan. by @elronbandel in #1150
- Fix helm test by @elronbandel in #1109
- Fix bug with RAG metrics: Fix use of minilm model by @assaftibm in #1115
- Fix data classification of WML model to include 'public' classification by @yoavkatz in #1118
- Fix WMLInferenceEngine by @pawelknes in #1122
- Fix belebele HF path due to new convention by @elronbandel in #1145
Documentation changes
- Improve debugging.rst wording
- Improve examples.rst wording by @welisheva22 in #1138
- Improve data_classification_policy.rst wording by @welisheva22 in #1139
- Improve rag_support.rst wording by @welisheva22 in #1139
- Improve production.rst wording by @welisheva22 in #1148
- Improve the clarity of the code examples.
- Improve load_datasets.rst wording by @welisheva22
- Improve introduction.rst wording by @welisheva22
- Improve installation.rst wording by @welisheva22
- Improve adding_format.rst wording by @welisheva22
- Improve adding_task.rst wording by @welisheva22
- Improve adding_template.rst wording by @welisheva22
- Improve adding_dataset.rst wording by @hanansinger
- Improve index.rst page by @yoavkatz
- Fix link to llama blog in adding_format.rst by @andersonm-ibm in #1113
- Added example of RAG response by @yoavkatz in #1121
New Contributors
- @andersonm-ibm made their first contribution in #1113
Unitxt 1.12.2
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to be better reflect their meaning and the type of each field is now define by python class names and not strings (str vs "str") . See example of new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed) - Ability create ensemble of judges . See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval).
- Added the ability to select demonstrations that depend on the specific instance (and not only at random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py. This change alters the random selection of demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
- Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
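A minimal sketch of the new task syntax with Python types, following the adding_task guide linked above; the import path and concrete fields are illustrative:

```python
# Sketch of the new task definition syntax: field types are Python classes
# (str, List[str], ...) instead of strings ("str"); the fields shown are illustrative.
from typing import List

from unitxt.blocks import Task

task = Task(
    input_fields={"text": str, "text_type": str, "classes": List[str]},
    reference_fields={"label": str},
    prediction_type=str,
    metrics=["metrics.f1_micro", "metrics.accuracy"],
)
```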
Non backward compatible changes
- Changed template method names to "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on the instance (this changes the random selection of demos in normal mode) by @yoavkatz in #1034
Changes in Catalog
- The safety and regard metrics became instance metrics and are now named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble metrics by @eladven in #1047
- Additional inference parameters for OpenAI and GenAI, and simplified InferenceEngine API parameter passing, by @pawelknes in #1019 and #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
Bug Fixes
- Solve problem with stripping the format in the LLM as a judge code by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to LLM as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- Fix broken build test by updating the Kaggle version by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst: add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057
Unitxt 1.12.0
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to be better reflect their meaning and the type of each field is now define by python class names and not strings (str vs "str") . See example of new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed) - Ability create ensemble of judges . See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval).
- Added the ability to select demonstrations that depend on the specific instance (and not only at random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py. This change alters the random selection of demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
- Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
Non backward compatible changes
- Changed template method names to "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on the instance (this changes the random selection of demos in normal mode) by @yoavkatz in #1034
Changes in Catalog
- The safety and regard metrics became instance metrics and are now named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble metrics by @eladven in #1047
- Additional inference parameters for OpenAI and GenAI, and simplified InferenceEngine API parameter passing, by @pawelknes in #1019 and #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
Bug Fixes
- Solve problem with stripping the format in the LLM as a judge code by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to LLM as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- Fix broken build test by updating the Kaggle version by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst: add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057
1.11.1
Non backward compatible changes
- The class InputOutputTemplate has the field input_format, which is now a required field. This means templates should explicitly set its value to None if not using it (see the sketch after this list). by @elronbandel in #982
- Fix MRR RAG metric - fix the MRR wiring and allow context_ids to be a list of strings instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7; corresponding tests were added. This change may change the scores of the MRR metric. by @matanor
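A short sketch of the InputOutputTemplate change; the format strings are illustrative, and the point is only that input_format must now be set explicitly (to a value or to None):

```python
# input_format is now a required field of InputOutputTemplate;
# set it explicitly, or to None when it is not used.
from unitxt.templates import InputOutputTemplate

template = InputOutputTemplate(
    input_format="Given the text: {text}\nClassify its sentiment.",
    output_format="{label}",
)
```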
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974 (see the sketch after this list)
- Add an option to lazily load the HF inference engine by @elronbandel in #980
- Added a format based on the HuggingFace format by @yoavkatz in #988
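A sketch of the parallel loading option; it assumes the new num_proc parameter is exposed on the LoadHF loader, and the dataset path and name are placeholders:

```python
# Assumed usage: num_proc controls the number of processes used when the
# loader pulls the dataset from HuggingFace.
from unitxt.loaders import LoadHF

loader = LoadHF(
    path="glue",
    name="wnli",
    num_proc=4,  # parallel dataset loading with 4 processes
)
```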
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.11.0...1.11.1
1.11.0 (#996)
Non backward compatible changes
- The class InputOutputTemplate has the field input_format, which is now a required field. This means templates should explicitly set its value to None if not using it. by @elronbandel in #982
- Fix MRR RAG metric - fix the MRR wiring and allow context_ids to be a list of strings instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7; corresponding tests were added. This change may change the scores of the MRR metric. by @matanor
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add an option to lazily load the HF inference engine by @elronbandel in #980
- Added a format based on the HuggingFace format by @yoavkatz in #988
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.3...1.11.0
1.10.3
Non backward compatible changes
- The class InputOutputTemplate has the field input_format, which is now a required field. This means templates should explicitly set its value to None if not using it. by @elronbandel in #982
- Fix MRR RAG metric - fix the MRR wiring and allow context_ids to be a list of strings instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7; corresponding tests were added. This change may change the scores of the MRR metric. by @matanor
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add an option to lazily load the HF inference engine by @elronbandel in #980
- Added a format based on the HuggingFace format by @yoavkatz in #988
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.2...1.10.3
1.10.2
Non backward compatible changes
- None - this release is fully compatible with the previous release.
New Features
- Added a num_proc parameter - an optional integer specifying the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add option to lazy load hf inference engine and fix requirements mechanism by @elronbandel in #980
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
- Add metrics: domesticated safety and regard by @dafnapension in #983
- Make input_format required field in InputOutputTemplate by @elronbandel in #982
- Added a format based on the HuggingFace format by @yoavkatz in #988
Bug Fixes
- Fix the error at the examples table by @eladven in #976
- Fix MRR RAG metric - fix the MRR wiring and allow context_ids to be a list of strings instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7; corresponding tests were added. by @matanor in #969
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.1...1.10.2
1.10.1
Main Changes
- Continued major improvements to the documentation, including a new code examples section with standalone Python code that shows how to perform evaluation, add new datasets, compare formats, use LLMs as judges, and more. Cards for datasets from HuggingFace now have detailed descriptions. New documentation of RAG tasks and metrics.
- load_dataset can now load cards defined in a Python file (and not only in the catalog). See example.
- The evaluation results returned from evaluate now include two fields: predictions and processed_predictions. See example, and the sketch after this list.
- Task fields can have defaults, so if they are not specified in the card they get a default value. For example, multi-class classification has text as the default text_type. See example.
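A sketch of the end-to-end flow referenced above; the card and template are illustrative catalog entries and the predictions are placeholders for real model outputs:

```python
# Load a dataset from a recipe, produce predictions, and evaluate them.
from unitxt import load_dataset
from unitxt.api import evaluate

dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
)["test"]

predictions = ["entailment" for _ in dataset]  # replace with real model outputs
results = evaluate(predictions=predictions, data=dataset)

# Each evaluated instance now also carries the raw and processed predictions
# alongside its instance and global scores.
print(results[0]["score"]["global"])
```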
Non backward compatible changes
You need to recreate any cards/metrics you added by running the corresponding prepare//.py file. You can recreate all cards simply by running python utils/prepare_all_artifacts.py. This will avoid the type error.
The AddFields operator was renamed Set and the CopyFields operator was renamed Copy. Previous code should continue to work, but we renamed all existing code in the unitxt and fm-eval repos.
- Change Artifact.type to Artifact.__type__ by @elronbandel in #933
- change CopyFields operators name to Copy by @duckling69 in #876
- Rename AddFields to Set, a name that represent its role better and concisely by @elronbandel in #903
New Features
- Allow eager execution by @elronbandel in #888
- Add view option for Task definitions in UI explorer. by @yoavkatz in #891
- Add input type checking in LoadFromDictionary by @yoavkatz in #900
- Add TokensSlice operator by @elronbandel in #902
- Make some logs critical by @elronbandel in #973
- Add LogProbInferenceEngines API and implement for OpenAI by @lilacheden in #909
- Added support for ibm-watsonx-ai inference by @pawelknes in #961
- load_dataset supports loading cards not present in local catalog by @pawelknes in #929
- Added defaults to tasks by @pawelknes in #921
- Add raw predictions and references to results by @yoavkatz in #934
- Allow ad-hoc metrics and templates (and add a first version of a standalone example of a dataset with LLM as a judge) by @eladven in #922
- Add infer() function for end to end inference pipeline by @elronbandel in #952
Bug Fixes
- LLMaaJ implementation of MLCommons' simple-safety-tests by @bnayahu in #873
- Update gradio version on website by @elronbandel in #896
- Improve demo by @elronbandel in #898
- Fix demo and organize files by @elronbandel in #897
- Make sacrebleu robust by @yoavkatz in #892
- Fix huggingface assets to have versions and up to date readme by @elronbandel in #895
- fix(cos loader): account for slashes in cos file name by @jezekra1 in #904
- llama3 instruct and chat system prompts by @oktie in #950
- Added trust_remote_code to HF dataset query operations by @yoavkatz in #911
Documentation
- Update llm_as_judge.rst by @yoavkatz in #970
- Michal Jacovi's completed manual review of the card descriptions by @dafnapension in #883
- In card preparers, generate the tags with "singletons" rather than values paired with True by @dafnapension in #874
- Improved documentation by @yoavkatz in #886
- Update glossary.rst by @yoavkatz in #899
- Add example section to documentation by @yoavkatz in #917
- Added example of open qa using catalog by @yoavkatz in #919
- Update example intro and simplified WNLI cards by @yoavkatz in #923
- Update adding_metric.rst by @yoavkatz in #955
- RAG documentation by @yoavkatz in #928
- docs: update adding_dataset.rst by @eltociear in #927
- Prepare for description= that is different from those embedded automatically by @dafnapension in #937
- Add a simple LLM as a judge example of using it without installation by @eladven in #968
- Add example of using LLM as a judge for summarization dataset. by @eladven in #965
- Improve operators documentation by @elronbandel in #942
New Assets
- Add numeric nlg dataset by @ShirApp in #882
- Add to_list_by_hyphen_space processor by @marukaz in #872
- Added tags and descriptions to safety cards by @bnayahu in #887
- Add Mt-Bench datasets + add operators by @OfirArviv in #870
- Touch up numeric nlg by @elronbandel in #889
- split train to train and validation sets in billsum by @alonh in #901
- modified wikitq, tab_fact taskcards by @ShirApp in #963
- Implementation of TruthfulQA by @bnayahu in #931
- Add bluebench cards by @perlitz in #918
- Add LlamaIndex faithfulness metric by @arielge in #971
- Expanded template support for safety cards by @bnayahu in #943
Testing and CI/CD
- Add end to end realistic test to fusion by @elronbandel in #940
- Moved test_examples to run the actual examples by @yoavkatz in #913
- Use uv for installing requirements in actions by @elronbandel in #960
- Add ability to print_dict to print selected fields by @yoavkatz in #947
- Get rid of pkg_resources dependency by @elronbandel in #932
- adapt filtering lambda to datasets 2.20 by @dafnapension in #930
- Increase preparation log to error. by @elronbandel in #959
New Contributors
Full Changelog: 1.10.0...1.10.1
Unitxt 1.10.0
Main changes
- Added support for handling sensitive data. When data is loaded from a data source using a Loader, the user can specify the classification of the data (e.g. "public" or "proprietary"). Unitxt components such as metrics and inference engines then check whether they are allowed to process the data, based on their configuration. For example, an LLM as judge that sends data to remote services can be configured to send only "public" data to those services. This replaces the UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API option, which was a general flag that was not data dependent and hence error prone. See more details in https://unitxt.readthedocs.io/en/latest/docs/data_classification_policy.html
- Added support for a metric score prefix. Each metric has a new optional string attribute "score_prefix" that is added as a prefix to the names of all scores it generates. This allows the same metric to be used on different fields of the task while keeping the output scores distinguishable. See the sketch after this list.
- New Operators tutorial and Loaders documentation
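A short sketch of the two additions above; LoadFromDictionary and Accuracy are existing unitxt components, while the concrete values and field placements are illustrative assumptions:

```python
from unitxt.loaders import LoadFromDictionary
from unitxt.metrics import Accuracy

# Mark the loaded data as proprietary, so only components whose configuration
# allows "proprietary" data (e.g. a local inference engine) may process it.
loader = LoadFromDictionary(
    data={"test": [{"text": "internal ticket", "label": "bug"}]},
    data_classification_policy=["proprietary"],
)

# Prefix all score names produced by this metric, e.g. to tell apart scores
# when the same metric is applied to different fields of the task.
metric = Accuracy(score_prefix="summary_")
```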
Non backward compatible changes
- StreamInstanceOperator was renamed to InstanceOperator
New Features
- Support for handling sensitive data sent to remote services by @pawelknes in #806 , @yoavkatz in #868
- Added new NER metric using fuzzywuzzy logic by @sarathsgvr in #808
- Added loader from HF spaces by @pawelknes in #860
- Add metric prefix in main by @yoavkatz in #878
- Add MinimumOneExamplePerLabelRefiner to allow ensuring that at least one example of each label appears in the training data by @alonh in #867
Bug Fix
- Explorer UI crashed when no templates were defined in card by @yoavkatz in #855
- Fix operator and metrics data by @yoavkatz in #878
- Improved testing of cards by @yoavkatz in #861
- FormTask deprecation by @yoavkatz in #856
New Assets
- Adding go emotions dataset by @shaigrt in #865
- Implementation of select safety benchmarks by @bnayahu in #854
Documentation
- Update CONTRIBUTING.md by @elronbandel in #859
- Adding operator tutorial and standardizing operator names by @elronbandel in #863
- Fix code blocks in loaders docs by @elronbandel in #866
- Typo fix in unitxt operators docs by @duckling69 in #877
- Add documentation to loaders by @elronbandel in #864
- Changes to introduction page by @yoavkatz in #852
New Contributors
- @sarathsgvr made their first contribution in #808
- @bnayahu made their first contribution in #854
- @shaigrt made their first contribution in #865
- @duckling69 made their first contribution in #877
Full Changelog: 1.9.0...1.10.0
Unitxt 1.9.0
What's Changed
The most important things are:
- Addition of LLM as a Judge metrics and tasks, both for evaluating LLMs as judges and for using them to evaluate other tasks. Read more in the LLM as a Judge Tutorial
- Addition of RAG response generation tasks and datasets, as part of an effort to add comprehensive RAG evaluation to unitxt.
- Renaming FormTask to Task for simplicity
- Major improvements to documentation and tutorials
Breaking Changes 🚨
- Ensure consistent evaluation of CI across implementations [Might change previous results] by @dafnapension in #844
- Fix default format so it will be the same as formats.empty in catalog. Impacts runs that did not specify a format by @yoavkatz in #848
- The LoadJson operator moved from unitxt.processors to unitxt.struct_data_operators
- Fixed YesNoTemplate and DiverseLabelsSampler to support binary task typing. YesNoTemplate now expects the class field to contain a string rather than a list of strings with one element by @yoavkatz in #836
Bug Fixes
- Change processor type for to_list_by_comma_from_references by @antonpibm in #815
- Handle empty text in Literal Eval by @antonpibm in #819
- Fix clash between dir names and artifact names in catalog website by @elronbandel in #825
- NER typing had a mistake. by @yoavkatz in #832
- Fix catalog reference by @elronbandel in #838
- Fix default format by @yoavkatz in #848
- Fixed YesNoTemplate and DiverseLabelsSampler to support binary task typing. by @yoavkatz in #836
New Features
- Support prediction regex match by setting the operator as a postproce… by @antonpibm in #792
- Add sample score output in test card by @yoavkatz in #803
- Support for loading dictionaries by @pawelknes in #784
- Add ability to fuse, split, MultiStreamScoreMean, and merge all by @dafnapension in #767
- Changed default log verbosity to "info" instead of "debug" by @yoavkatz in #822
- Skip artifact prepare and verify in catalog consistency tests by @elronbandel in #839
- Add separation between eager streams and regular streams by @elronbandel in #846
- Add precision and recall scores to f1_binary, max_f1_binary by @lilacheden in #824
- Rename task by @elronbandel in #850
New Assets
- Add basic format for llama3 models by @arielge in #812
- Adding literal eval processor by @antonpibm in #813
- Add RAG (response generation part) tasks and datasets by @perlitz in #811
- Add 5 legalbench tasks (the 5 existing in HELM) by @perlitz in #827
- Add financebench by @perlitz in #828
- Add billsum dataset by @perlitz in #830
- Add tldr dataset by @perlitz in #831
- Add Attaq500 by @naamaz in #835
- Add llm as judge mt-bench dataset and metrics by @OfirArviv in #791
Documentation
- Documentation review by @yoavkatz in #805
- Added documentation for global and huggingface metrics by @yoavkatz in #807
- Touch up docs by @elronbandel in #809
- Remove the contents from main menu by @elronbandel in #810
- Add tags docs by @elronbandel in #814
- Reviewing Unitxt tutorials by @michal-jacovi in #817
- Fix the link to the operators tutorial by @elronbandel in #821
- More documentation changes in metrics by @yoavkatz in #820
- Update adding_task.rst by @michal-jacovi in #823
- Fix missing mandatory new line at the beginning of a code block in documentation by @elronbandel in #829
- Add description, homepage, and citation obtained from HF with datasets.load_dataset_builder by @dafnapension in #818
- Updated documentation by @yoavkatz in #849
New Contributors
- @antonpibm made their first contribution in #792
- @michal-jacovi made their first contribution in #817
Full Changelog: 1.8.1...1.9.0