dos2unix behaves differently on macOS and CentOS

bash centos linux unix xml

I have a file that I need to convert to Unix format, so I run dos2unix filename.txt. But the converted file on CentOS differs from the one I get on macOS.

I tried updating dos2unix on CentOS, but it didn't help.

CentOS:

-bash-4.1$ dos2unix -V
dos2unix 3.1 (Thu Nov 19 1998)

macOS:

m-c02xd0nmjgh7:test files s0c03h1$ dos2unix -V
dos2unix 7.4.0 (2017-10-10)
With Unicode UTF-16 support.
Without native language support.

Original file format:

<?xml version="1.0" encoding="utf-8"?>
<ACES version="3.0">
  <Header>
    <Company>Disc Brakes Australia</Company>
    <SenderName>SEMA Data Co-op</SenderName>
    <SenderPhone>888-958-6698 option 2</SenderPhone>
    <TransferDate>2019-03-21</TransferDate>
    <BrandAAIAID>DMWK</BrandAAIAID>
    <DocumentTitle>SDC ACES XML File</DocumentTitle>
    <EffectiveDate>2019-03-21</EffectiveDate>
    <SubmissionType>FULL</SubmissionType>
    <VcdbVersionDate>2019-02-22</VcdbVersionDate>
    <QdbVersionDate>2019-02-22</QdbVersionDate>
    <PcdbVersionDate>2019-02-22</PcdbVersionDate>
  </Header>
  <App action="A" id="1">
    <BaseVehicle id="119723" />
    <SubModel id="973" />
    <EngineBase id="6067" />
    <Region id="3" />
    <Qty>2</Qty>
    <PartType id="1896" />
    <MfrLabel>T3 5000 Series T-Slot Slotted Rotor, Black Hat Test label 1</MfrLabel>
    <Position id="22" />
    <Part>DBA52120BLKS</Part>
  </App>
  <App action="A" id="2">
    <BaseVehicle id="119723" />
    <SubModel id="973" />
    <EngineBase id="3930" />
    <Region id="1" />
    <Qty>2</Qty>
    <PartType id="1896" />
    <MfrLabel>T3 5000 Series T-Slot Slotted Rotor, Black Hat Test label 2</MfrLabel>
    <Position id="22" />
    <Part>DBA52120BLKS</Part>
  </App>
  <Footer>
    <RecordCount>2</RecordCount>
  </Footer>
</ACES>

m-c02xd0nmjgh7:test files s0c03h1$ od -bc FileName.XML | head -10
0000000   357 273 277 074 077 170 155 154 040 166 145 162 163 151 157 156
         357 273 277   <   ?   x   m   l       v   e   r   s   i   o   n
0000020   075 042 061 056 060 042 040 145 156 143 157 144 151 156 147 075
           =   "   1   .   0   "       e   n   c   o   d   i   n   g   =
0000040   042 165 164 146 055 070 042 077 076 015 012 074 101 103 105 123
           "   u   t   f   -   8   "   ?   >  \r  \n   <   A   C   E   S
0000060   040 166 145 162 163 151 157 156 075 042 063 056 060 042 076 015
               v   e   r   s   i   o   n   =   "   3   .   0   "   >  \r
0000100   012 040 040 074 110 145 141 144 145 162 076 015 012 040 040 040
          \n           <   H   e   a   d   e   r   >  \r  \n

macOS file format after conversion:

m-c02xd0nmjgh7:test files s0c03h1$ dos2unix FileName.XML
dos2unix: converting file FileName.XML to Unix format...
m-c02xd0nmjgh7:test files s0c03h1$ od -bc FileName.XML | head -10
0000000   074 077 170 155 154 040 166 145 162 163 151 157 156 075 042 061
           <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   056 060 042 040 145 156 143 157 144 151 156 147 075 042 165 164
           .   0   "       e   n   c   o   d   i   n   g   =   "   u   t
0000040   146 055 070 042 077 076 012 074 101 103 105 123 040 166 145 162
           f   -   8   "   ?   >  \n   <   A   C   E   S       v   e   r
0000060   163 151 157 156 075 042 063 056 060 042 076 012 040 040 074 110
           s   i   o   n   =   "   3   .   0   "   >  \n           <   H
0000100   145 141 144 145 162 076 012 040 040 040 040 074 103 157 155 160
           e   a   d   e   r   >  \n                   <   C   o   m   p

CentOS file format after conversion:

-bash-4.1$ dos2unix output.txt
dos2unix: converting file output.txt to UNIX format ...
-bash-4.1$ od -bc output.txt | head -10
0000000 357 273 277 074 077 170 155 154 040 166 145 162 163 151 157 156
        357 273 277   <   ?   x   m   l       v   e   r   s   i   o   n
0000020 075 042 061 056 060 042 040 145 156 143 157 144 151 156 147 075
          =   "   1   .   0   "       e   n   c   o   d   i   n   g   =
0000040 042 165 164 146 055 070 042 077 076 012 074 101 103 105 123 040
          "   u   t   f   -   8   "   ?   >  \n   <   A   C   E   S    
0000060 166 145 162 163 151 157 156 075 042 063 056 060 042 076 012 040
          v   e   r   s   i   o   n   =   "   3   .   0   "   >  \n    
0000100 040 074 110 145 141 144 145 162 076 012 040 040 040 040 074 103
              <   H   e   a   d   e   r   >  \n                   <   C

I want the same result on CentOS that dos2unix gives me on the Mac, as shown above.

Best Answer

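357 273 277 is the octal representation of the UTF-8 BOM (byte order mark). The original file starts with a BOM. Your od dumps show that dos2unix 7.4.0 on macOS removes it, while dos2unix 3.1 on CentOS leaves it in place; the line endings are converted on both.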
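To check for the BOM without reading a full od dump, something along these lines should work on both systems. It is only a sketch: has_utf8_bom is a made-up helper name (not part of dos2unix), it assumes the usual head, od, tr and mktemp tools, and the sample file exists only for the demonstration.

```shell
# has_utf8_bom: hypothetical helper, succeeds if the file starts with EF BB BF.
has_utf8_bom() {
    # Take the first three bytes, print them in octal, drop spaces and newlines.
    first3=$(head -c 3 "$1" | od -An -to1 | tr -d ' \n')
    [ "$first3" = "357273277" ]
}

# Demonstration on a throwaway file that mimics the start of the original XML.
sample=$(mktemp "${TMPDIR:-/tmp}/bomcheck.XXXXXX")
printf '\357\273\277<?xml version="1.0" encoding="utf-8"?>\r\n' > "$sample"
if has_utf8_bom "$sample"; then
    echo "BOM present"
else
    echo "no BOM"
fi
rm -f "$sample"
```

Running has_utf8_bom on the file before and after dos2unix on each machine would show which version strips the mark.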

On my Debian system, man 1 dos2unix says:

-b, --keep-bom
Keep Byte Order Mark (BOM). When the input file has a BOM, write a BOM in the output file. […]

[…]

-r, --remove-bom
Remove Byte Order Mark (BOM). Do not write a BOM in the output file. […]

If you have the same (or similar) options available, use them. Example:

dos2unix -r FileName.XML

But the dos2unix on your CentOS is very old: it is dated 1998-11-19, over 20 years ago, which is all the more odd given that the first CentOS release was in 2004. The changelog says -r and -b were added on 2014-07-07, so version 3.1 cannot have either option. Get a newer dos2unix.

Alternatively, look for bomstrip. The description of the bomstrip package on my Debian system is:

Bomstrip is a very simple tool that removes BOM's (byte-order-marks) from UTF-8 files. UTF-8 does not have byte-ordering issues, so there is absolutely no need to have three bytes (the UTF-8-BOM) that do not say anything about the byte-order (since there is nothing to say).
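If neither a newer dos2unix nor bomstrip can be installed, the same effect can be approximated with sed. This is a rough sketch rather than a replacement for either tool: strip_bom_and_cr is a made-up name, it assumes a POSIX shell with sed, printf and mktemp, and it only handles a UTF-8 BOM at the very start of the file.

```shell
# strip_bom_and_cr: hypothetical helper that rewrites a file in place,
# removing a leading UTF-8 BOM and converting CRLF line endings to LF.
strip_bom_and_cr() {
    bom=$(printf '\357\273\277')   # the three BOM bytes, built portably
    cr=$(printf '\r')              # a literal carriage return
    tmp=$(mktemp "${TMPDIR:-/tmp}/stripbom.XXXXXX") || return 1
    # LC_ALL=C makes sed treat the input as raw bytes.
    # 1st expression: delete the BOM if line 1 starts with it.
    # 2nd expression: delete a CR at the end of every line.
    if LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" "$1" > "$tmp"; then
        cat "$tmp" > "$1"          # overwrite contents, keep permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}

# Demonstration on a throwaway file; inspect the result with od as in the question.
sample=$(mktemp "${TMPDIR:-/tmp}/stripdemo.XXXXXX")
printf '\357\273\277<?xml version="1.0" encoding="utf-8"?>\r\n<ACES version="3.0">\r\n' > "$sample"
strip_bom_and_cr "$sample"
od -bc "$sample" | head -4
rm -f "$sample"
```

Building the BOM and CR with printf avoids relying on \x escapes in sed, which GNU sed accepts but other sed implementations may not.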
