[GSoC] Machine readable output from userland utilities

Tue Jun 3 16:04:43 UTC 2014

Hi everybody

I see there are several different ideas about how the output format should be specified.

I initially used an option named -O, on the understanding that it can be changed once the best variant is decided.

There is the environment variable idea that I discussed with Eitan:

On 29 May 2014 at 18:31, Eitan Adler wrote:
> On 29 May 2014 05:12, Zaro Korchev <zkorchev at mail.bg> wrote:
>> I thought about whether it is better to use an option or an environment variable. I did it with an option because it is easier to switch an option on and off. It appears that the flag -O is free in almost all tools. I have no problem switching to an environment variable.
> 
> My concern is that future standards may require this option (or at
> least, would be precluded from using it).  In addition, it may
> conflict with non-base utilities, such as coreutils ones.

----

There is Jonathan's pipelining idea:

On 23 May 2014 at 16:27, Jonathan Anderson wrote:
> Imagine:
> 
> $ ifconfig | filterBy "ether" " 3c:07:.*" | sortBy "ether" | output my_ifconfig.format   # or "json" or "xml" or ...
> 
> A pipeline of little tools, each doing one thing well: how much more unix can you get? Currently, every command-line tool has to do two or three things:
> 1. its primary job,
> 2. output some arbitrary text format (that you're never allowed to change because other tools scrape it) and
> 3. (optionally) parse arbitrary text formats generated by users or some other tool.
> 
> Task 2 is annoying: in order to usefully query command-line tools, I have to write a parser. The tool has binary data, I want binary data, but we have to go through a dump/parse dance in order for me to get the data. This is the approach (again, from Plan 9) that brings you Linux sysfs. Perhaps David would now like to comment on his cross-platform "how much battery do I have" experience. :)
> 
> Task 3 isn't just annoying, however, it's risky. If every tool implements its own string protocol parsing, we greatly increase the risk of unnoticed bugs. Better to centralize as much string parsing as possible into a single library, which can be rigorously analyzed (and optimized!).
> 
> Imagine if geom didn't have to speak XML natively, but rather used a supported-everywhere-in-base data structure that users could convert into XML if they need it. Desktop applications are going to start requiring structured data passing via kdbus-like interfaces (currently based on GLib's GVariant), so we might as well have a structured representation that we like and are able to provide ABI support for (and, in the kdbus case, can possibly be converted to/from GVariant as required).

----

There is David's long option idea:

> From: David Chisnall <theraven at theravensnest.org>
> Subject: Re: [Machine readable output from userland utilities] report
> Date: 2 June 2014 16:31:11
> To: Zaro Korchev <zkorchev at mail.bg>
> Cc: soc-status at freebsd.org
> 
> On 2 Jun 2014, at 12:43, Zaro Korchev <zkorchev at mail.bg> wrote:
> 
>> At the moment both ls and vmstat are told to output JSON by specifying the -O option. However, as I discussed with my mentor, this will be changed. The idea is to use an environment variable instead of the -O flag.
> 
> I don't like the idea of using an environment variable, because this is something that you might want to control on a per-command basis within a pipeline.  Especially with respect to incremental adoption, if you have some commands that will emit their default format, which is sent to sed / awk whatever, and some that will emit json natively, you don't want to suddenly have the output format from the legacy tools change once they gain machine-readable output support.
> 
> One *very* important thing to do is standardise the command-line flag that is used to specify the output format.  This may also involve converting some of the tools to use getopt_long if they don't already (lots of tools already use most single-letter options, so there's no possibility to define a single-letter flag that will be usable on all tools).
> 
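
To make the long option idea concrete, here is a minimal sketch of how a tool could accept a format through getopt_long without claiming a single-letter flag. The option name --output-format and the function parse_output_format are assumptions made for illustration; nothing in this thread has fixed a name yet.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>
#include <string.h>

/* Value above the char range, so no single-letter flag is consumed. */
enum { OPT_OUTPUT_FORMAT = 256 };

/* Hypothetical long option name; the thread has not settled on one. */
static const struct option longopts[] = {
	{ "output-format", required_argument, NULL, OPT_OUTPUT_FORMAT },
	{ NULL, 0, NULL, 0 }
};

/*
 * Return the requested format name, or NULL when the option is absent,
 * in which case the tool would keep its traditional text output.
 */
const char *
parse_output_format(int argc, char **argv)
{
	const char *format = NULL;
	int ch;

	assert(argv != NULL);
	optind = 1;	/* start scanning at the first argument */
	while ((ch = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
		if (ch == OPT_OUTPUT_FORMAT)
			format = optarg;
	}
	return format;
}
```

A tool invoked as `tool --output-format=json` would then get "json" back for that one command only, which keeps the per-command control within a pipeline that David asks for.
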
>> I understand your concerns about multi-threading. The idea is to have functions that serialize the object in an allocated buffer as it is constructed. Here is a more detailed example of what I mean:
> 
> It would be better to have some stream output API as the default.  If one back end only supports writing to buffers, then you can add an extra alloc / write / free sequence to hide it, but it would be good if the interface understands writing directly to file descriptors.  If the back end natively supports streaming, then you don't need to buffer the output.
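
One way to read the streaming suggestion is a small "sink" interface with one back end that writes straight to a file descriptor and another that accumulates into an allocated buffer. The sketch below is only an illustration of that shape: struct out_sink, fd_sink, buf_sink and sink_puts are hypothetical names, not the libsol API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sink interface; illustrative only, not libsol's API. */
struct out_sink {
	int (*write)(struct out_sink *, const void *, size_t);
};

/* Back end that streams directly to a file descriptor. */
struct fd_sink {
	struct out_sink base;	/* must be first so the cast below is valid */
	int fd;
};

/* Back end that accumulates output in a heap buffer. */
struct buf_sink {
	struct out_sink base;	/* must be first so the cast below is valid */
	char *data;
	size_t len;
};

static int
fd_sink_write(struct out_sink *sink, const void *buf, size_t len)
{
	struct fd_sink *fs = (struct fd_sink *)sink;
	const char *p = buf;

	while (len > 0) {	/* write(2) may be partial, so loop */
		ssize_t n = write(fs->fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

static int
buf_sink_write(struct out_sink *sink, const void *buf, size_t len)
{
	struct buf_sink *bs = (struct buf_sink *)sink;
	char *grown;

	if (len == 0)
		return 0;
	grown = realloc(bs->data, bs->len + len);
	if (grown == NULL)
		return -1;
	memcpy(grown + bs->len, buf, len);
	bs->data = grown;
	bs->len += len;
	return 0;
}

void
fd_sink_init(struct fd_sink *fs, int fd)
{
	fs->base.write = fd_sink_write;
	fs->fd = fd;
}

void
buf_sink_init(struct buf_sink *bs)
{
	bs->base.write = buf_sink_write;
	bs->data = NULL;
	bs->len = 0;
}

/* Serialization code only sees the sink, never the back end. */
int
sink_puts(struct out_sink *sink, const char *s)
{
	return sink->write(sink, s, strlen(s));
}
```

With this shape, the code that serializes an object calls sink_puts as each piece is produced, and only the buffer back end pays for allocation; the caller of the buffer back end owns data and frees it when done.
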

Since you have more experience, I trust you to decide which approach is best.

I like the pipelining and the long option ideas the most. At the moment I'm working on porting more tools to use libsol, so this decision is not urgent. I can easily change how the format is specified.

Zaro