Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV files should use CRLF as row separator (or at least it should be possible to do so no matter what OS you are using!) #1722

Open
DJCrashdummy opened this issue Nov 23, 2024 · 4 comments
Assignees

Comments

@DJCrashdummy
Copy link

i user Miller for batch manipulating CSV files with an embedded pretty-print JSON (with LF line breaks in it) in one column.
everything worked fine, but i noticed one difference to the original: the row separator got changed from CRLF to LF without invoking it whatsoever!
then i spent quite some time because at the man-page there is a whole section about (changing) separators, but no mater what i tried (adding --rs crlf, --ors crlf, --rs '\r\n' or --ors '\r\n'), the output stayed the same.
after further research via internet i found the following on the website:
https://miller.readthedocs.io/en/6.12.0/reference-main-separators/#which-separators-apply-to-which-file-formats

CSV-RS: Always \n; not alterable (or \r\n on Windows)

there are a few things which are at least quite unfortunate here:

  • yes, i know CSVs and their row separators are not officially defined anywhere, but all RFCs (4180 and 7111) which try to achieve a "common ground", recommend or at least suggest to use CRLF.
  • i get why a program with output at the command line respects (by default) the conventions of the operating system it runs on... but when i can't specify or at least change that, it might get a problem at the latest when i use in-place editing (-I) or write the output into a file and share it with other people (running other OSes).
  • it seems that Miller can work with CRLF row separators (see the Windows note and even on Linux it read the input file quite well without specifying anything about row separators), but it seems that setting a specific output row separator is for whatever reason forbidden.
  • if this is the intended behavior and not just a temporary regression (i found older Linux versions of Miller which used CRLF as row separators in CSV files), it should be mentioned at the man-pages separator section similar to the JSON notes, that at least --ors is ignored and you don't have to bother with that.
  • and last but not least: it is IMHO a terrible practice that the same command with the same program and version produces different files depending on the OS!
    at least when --ors (perhaps even --rs) is used, there should be a warning, that it is ignored and \n|\r\n depending on your OS is used.

for context, the link to the whole case: https://stackoverflow.com/a/79189576/2351568

@DJCrashdummy
Copy link
Author

https://miller.readthedocs.io/en/6.12.0/new-in-miller-6/#line-endings

The --auto flag is now ignored. Before, if a file had CR/LF (Windows-style) line endings on input (on any platform), it would have the same on output; likewise, LF (Unix-style) line endings. Now, files with CR/LF or LF line endings are processed on any platform, but the output line-ending is for the platform. E.g. reading CR/LF files on Linux will now produce LF output.

i'm not sure if "line ending" refers to row separator in this paragraph, but if the row separator is not set to CRLF by default for all CSV files (as recommended), just using the same separator as detected from or specified for the input - no mater what OS you are using - seems to me a reasonable and far better default.
but well... as long as you can override it with an option in the command, the default is not that big of a problem.

@johnkerl
Copy link
Owner

johnkerl commented Nov 24, 2024

@DJCrashdummy my apologies for the negative experience you're having.

Here is a summary of events, and a list of imperfections:

In Miller <= 5:

  • Miller was written in C.
  • Lots of things were "coded from scratch", including my own hand-rolled CSV parser

In Miller 6:

A suggested workaround: Using mlr termcvt you can pipe your output through mlr termcvt lf2crlf. (This needs to be a pipe at the command line; this cannot participate in a then-chain.)

In summary:

  • Thank you for your feedback
  • Please let me know if the proposed workaround is any help for you
  • There are documentation improvements I can make about the status quo -- as you've carefully pointed out
  • I have multiple reasons to want to abandon the standard Go CSV I/O library and port my hand-rolled CSV parser from C to Go -- your feedback here only adds fuel to that fire

@johnkerl johnkerl self-assigned this Nov 24, 2024
@johnkerl johnkerl changed the title CSV files should use CRLF as row reperator (or at least it should be possible to do so no mater what OS you are using!) CSV files should use CRLF as row reperator (or at least it should be possible to do so no matter what OS you are using!) Nov 24, 2024
@DJCrashdummy DJCrashdummy changed the title CSV files should use CRLF as row reperator (or at least it should be possible to do so no matter what OS you are using!) CSV files should use CRLF as row separator (or at least it should be possible to do so no matter what OS you are using!) Nov 25, 2024
@DJCrashdummy
Copy link
Author

DJCrashdummy commented Nov 25, 2024

first and foremost @johnkerl: there is nothing to apologize for! - i have to thank you for developing & maintaining Miller as FOSS!


ok, wow... i didn't know that version 6 was a complete rewrite. kudos! 👍

See also https://miller.readthedocs.io/en/latest/new-in-miller-6/

yes, saw that... that's what i'm referring to in my first comment, that the --auto flag (if i guessed right that "line ending" represent row separators) should IMHO even be the default setting instead of removed.

The standard Go CSV I/O library has [...] not programmable line ending (Miller IRS/ORS). [...] handles CR/LF or LF for input, transparently and without problem.

that's IMHO odd... it handles both line ending without any problem and can even use both for output (depending on the OS), but doesn't let you choose which one it should be. 🤨

I should scrap the Go CSV I/O library and port my hand-rolled CSV parser from C to Go

if you do so, just a tiny heads up as i also tested older versions of Miller: be aware, that at least mlr 3.4.0 changed all line endings (even the ones within the field with pretty-print JSON!) to CRLF, which made the output unusable.

Or I could fork the (open source) standard Go CSV I/O library and make the input/output record separators programmable

if the necessary amount of work is similar, perhaps the better approach.

I said perhaps not enough here: https://miller.readthedocs.io/en/latest/new-in-miller-6/#line-endings

that's IMHO the clearest description of the whole topic, but as i never used Miller before, i wasn't interested in "what's new" in the first place and had no clue about the --auto flag. ...as long as it would have worked manually, i would have been still ok with it. 😉

I said perhaps not enough here: https://miller.readthedocs.io/en/6.12.0/reference-main-separators/#which-separators-apply-to-which-file-formats

well... that was still confusing somehow, as most of the separator section in the man-page seems to more or less contradict it, as there is nowhere mentioned that CSV row separator are not alterable at all.
and when it comes to contradicting docs, i tend to go with the included man-/info-page or the like.

Please let me know if the proposed workaround is any help for you

unfortunately not. piping the output through mlr termcvt --lf2crlf converts all line endings - even the ones within the field with pretty-print JSON in it - to CRLF, which is undoubtedly worse because then some programs butcher the JSON part.
so mlr termcvt seems to not be aware of the CSV format resp. ignores it.


FWIW: interestingly csvkit - written in python - has the same issue (producing CSV files with only LF as row separator)... may it be that there is an underlying multi-lang-lib with the same issue?

@DJCrashdummy
Copy link
Author

as we are talking about room for improvement for the docs, especially the man-page, may i present some suggestions:

SEPARATOR FLAGS
...
       Notes about line endings:

       * Default line endings (‘--irs‘ and ‘--ors‘) are newline
         which is interpreted to accept carriage-return/newline files (e.g. on Windows)
         for input, and to produce platform-appropriate line endings on output.

this is somehow confusing... as it referrers to default "line endings" again. and i'm still not sure if this is true for all file format or just CSVs. if it is basically true for all formats, it should be rather on top of the In brief: paragraph.

in general, i would reorganize the SEPARATOR FLAGS section a bit, mainly the single points of the In brief: paragraph, to not have following points somehow contradict prior points.

In brief:
1. Each file format has its own default separators.
2. Default line endings (‘--irs‘ and ‘--ors‘) are newline [...]
3. You can set separators differently between Miller's input and output [...]
4. Some separators are not programmable [...]
5. Most formats, such as CSV, don't support pair-separators [...]

everything else about specific formats (also For DKVP records [...]) belongs IMHO in the Notes about all other separators: paragraph. perhaps even sorting by separator or format may make it a bit better structured and calling it "Notes about specific separators/file formats:" may be in idea.
at least there should be one point like "ORS can't be changed for CSV files and will be determined according to your operating system (--ors is ignored and --rs equals --irs)."

and the last 2 points (You can specify separators [...]) with the table of Default separators by format: belong IMHO in no list but are rather a separate paragraph.
and well updating this table to be more like the one on the website should make everything clear.


BTW: speaking about the table...

               tsv      "     "    N/A    "\n"

i guess it it should be rather... 😜

               tsv      "\t"    N/A    "\n"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants