Parsing XML, JSON, and newer data file formats in UNIX using command line utilities

text-processing, xml

The Unix environment has some excellent tools for parsing text in various forms. Lately, however, data no longer arrives in the traditional delimiter-based formats (CSV, TSV, record-based, and the like) it used to. These days data is exchanged in structured formats such as XML and JSON.

I know there are good tools like sed, awk, and Perl that can chew through nearly any form of data. However, to work with this sort of structured data, one often has to write a complete program: given the little time available to extract information, one has to sit down, figure out the whole logic of the query, and put it down programmatically. Sometimes that is not acceptable, partly because the extracted information feeds into further work, and partly because of the time it takes to find the appropriate approach and code it up. What is needed is a command line tool with enough switches to find, query, and dump data.

I'm looking for tools that take XML/JSON or other structured data and dump it into formats like CSV, so that from there one could use other commands to extract any information from it.

Are there any command line utilities you know of which do this kind of job? Are there existing awk/Perl scripts available to do this?

Best Answer

For XML there is XMLStarlet: http://xmlstar.sourceforge.net/

XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using a simple set of shell commands, in a similar way to how it is done for plain text files using the UNIX grep, sed, awk, diff, patch, join, etc. commands.

You can also use xsltproc and similar tools (e.g. Saxon).

For JSON, I think it's better to just use Python, Ruby, or Perl and transform it.
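As a sketch of that approach, a few lines of Python's standard library turn a JSON array of records into CSV (the input data here is made up for illustration):

```python
import json, csv, io, sys

# Hypothetical JSON records, e.g. piped in from an API dump
data = json.loads('[{"name": "alpha", "size": 10}, {"name": "beta", "size": 20}]')

# Write the records out as CSV with a header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "size"])
writer.writeheader()
writer.writerows(data)
sys.stdout.write(buf.getvalue())
```

In practice you would read from `sys.stdin` instead of a literal string, so the script drops straight into a pipeline.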
