Usage

Simple JCS (JSON Canonicalization Scheme) library and command-line application - probably of limited general use.

Synopsis

usage: tallipoika [-h] [--in-path IN_PATH] [--out-path OUT_PATH] [--serialize-only] [--version] [in_path_pos]

Stableson (Finnish: tallipoika) - a JSON Canonicalization Scheme (JCS) implementation.

positional arguments:
  in_path_pos           Path to the file to transform. Optional (default: STDIN)

options:
  -h, --help            show this help message and exit
  --in-path IN_PATH, -i IN_PATH
                        Path to the file to transform. Optional
                        (default: positional path value)
  --out-path OUT_PATH, -o OUT_PATH
                        output file path for transformed file (default: STDOUT)
  --serialize-only, -s  serialize only i.e. do not sort keys (default: False)
  --version, -V         show version of the app and exit

Example

Canonicalization of reference example for arrays:

% tallipoika < test/fixtures/reference_upstream_input/arrays.json
[56,{"1":[],"10":null,"d":true}]

Serialization only:

% tallipoika -s < test/fixtures/reference_upstream_input/arrays.json
[56,{"d":true,"10":null,"1":[]}]
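
For simple inputs like this fixture, the two modes can be approximated with Python's standard json module alone: compact separators, with key sorting switched on for canonicalization. This is only a sketch - full JCS additionally mandates ES6-style number formatting and sorting by UTF-16 code units, which tallipoika handles and plain json.dumps does not:

```python
import json

# The arrays.json reference fixture, keys in their incoming order.
data = [56, {"d": True, "10": None, "1": []}]

# Serialize only: compact output, incoming key order preserved.
serialized = json.dumps(data, separators=(",", ":"), ensure_ascii=False)

# Canonicalize (approximation): additionally sort object keys.
canonical = json.dumps(data, separators=(",", ":"), ensure_ascii=False,
                       sort_keys=True)

print(serialized)  # [56,{"d":true,"10":null,"1":[]}]
print(canonical)   # [56,{"1":[],"10":null,"d":true}]
```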

Version

% tallipoika -V
Stableson (Finnish: tallipoika) - a JSON Canonicalization Scheme (JCS) implementation. version 2024.1.6+parent.g6c33ae2f

Cursory Benchmarks

The testing node is an arbitrary machine, but in case it helps, its node identifier (as per bin/gen_node_identifier.py) is c79891e5-aabf-3a83-95b9-588edcd8327f. The machine is a Mac mini (M1, 2020) with 16 GB of RAM and a nearly full (99%) SSD, running macOS Sonoma 14.2.1.
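
The identifier above has the shape of a version-3 (name-based, MD5) UUID. As a purely hypothetical illustration - the actual bin/gen_node_identifier.py may derive its name material differently - such a stable identifier can be built from host attributes with the standard uuid module:

```python
import platform
import uuid

def node_identifier(name: str) -> str:
    # Version-3 UUIDs are deterministic: the same namespace and name
    # always yield the same identifier.
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, name))

# Hypothetical name material; a real script might mix in more attributes.
print(node_identifier(platform.node()))

# Determinism check against the well-known example from the uuid docs:
print(node_identifier("python.org"))  # 6fa459ea-ee8a-3ca4-894e-db77e160355e
```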

Reference Test Data

Using small JSON files from the reference tests (byte sizes approximately doubling from file to file) and writing to the /dev/null sink:

% hyperfine --warmup 3 \
  'tallipoika < test/fixtures/reference_upstream_input/unicode.json > /dev/null' \
  'tallipoika < test/fixtures/reference_upstream_input/arrays.json > /dev/null' \
  'tallipoika < test/fixtures/reference_upstream_input/structures.json > /dev/null' \
  'tallipoika < test/fixtures/reference_upstream_input/weird.json > /dev/null'
Benchmark 1: tallipoika < test/fixtures/reference_upstream_input/unicode.json > /dev/null
  Time (mean ± σ):     101.1 ms ±   0.4 ms    [User: 28.4 ms, System: 12.0 ms]
  Range (min … max):    99.8 ms … 102.0 ms    28 runs

Benchmark 2: tallipoika < test/fixtures/reference_upstream_input/arrays.json > /dev/null
  Time (mean ± σ):     101.2 ms ±   0.6 ms    [User: 28.5 ms, System: 12.1 ms]
  Range (min … max):   100.2 ms … 102.8 ms    28 runs

Benchmark 3: tallipoika < test/fixtures/reference_upstream_input/structures.json > /dev/null
  Time (mean ± σ):     101.1 ms ±   0.6 ms    [User: 28.5 ms, System: 12.1 ms]
  Range (min … max):    99.8 ms … 102.8 ms    28 runs

Benchmark 4: tallipoika < test/fixtures/reference_upstream_input/weird.json > /dev/null
  Time (mean ± σ):     101.3 ms ±   0.5 ms    [User: 28.4 ms, System: 12.1 ms]
  Range (min … max):   100.2 ms … 102.4 ms    28 runs

Summary
  tallipoika < test/fixtures/reference_upstream_input/unicode.json > /dev/null ran
    1.00 ± 0.01 times faster than tallipoika < test/fixtures/reference_upstream_input/structures.json > /dev/null
    1.00 ± 0.01 times faster than tallipoika < test/fixtures/reference_upstream_input/arrays.json > /dev/null
    1.00 ± 0.01 times faster than tallipoika < test/fixtures/reference_upstream_input/weird.json > /dev/null

Broad size progression (39 to 62 to 138 to 283 bytes):

% wc test/fixtures/reference_upstream_input/{unicode,arrays,structures,weird}.json
       3       4      39 test/fixtures/reference_upstream_input/unicode.json
       8      12      62 test/fixtures/reference_upstream_input/arrays.json
       7      27     138 test/fixtures/reference_upstream_input/structures.json
      11      32     283 test/fixtures/reference_upstream_input/weird.json
      29      75     522 total

Canonicalization of the CSAF v2.0 JSON Schema

% hyperfine --warmup 3 'tallipoika < csaf_2_0.json > csaf_2_0.jcs.json'
Benchmark 1: tallipoika < csaf_2_0.json > csaf_2_0.jcs.json
  Time (mean ± σ):     103.8 ms ±   0.8 ms    [User: 29.8 ms, System: 12.5 ms]
  Range (min … max):   101.8 ms … 105.1 ms    28 runs

% wc csaf_2_0.json
    1343    4565   54123 csaf_2_0.json
% wc csaf_2_0.jcs.json
       0    2371   33804 csaf_2_0.jcs.json
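
The canonical file reports 0 lines because wc counts newline characters and JCS output is a single line without a trailing newline. The size drop comes from stripped inter-token whitespace; a quick check of the ratio, using the byte counts from the wc output above:

```python
# Byte counts taken from the wc output above.
pretty_bytes = 54123     # csaf_2_0.json (indented upstream schema)
canonical_bytes = 33804  # csaf_2_0.jcs.json

ratio = pretty_bytes / canonical_bytes
saved = 1 - canonical_bytes / pretty_bytes
print(f"ratio: {ratio:.3f}, whitespace savings: {saved:.1%}")
```

So canonicalization shrinks this (heavily indented) schema by roughly a third - in contrast to the already-compact large file discussed below, where it saves almost nothing.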

Large Files and Comparing IO Mechanisms

Given the approximately 25 MB JSON test file at https://github.com/json-iterator/test-data/raw/master/large-file.json at revision sha1:0bce379832b475a6c21726ce37f971f8d849513b from 2016-12-02 03:21:00 UTC, with fingerprints:

  • artifact:json-iterator_test-data_sha1-0bce3798_large-file.json:
    • blake2:f306519f67ddf66792eb6bbbcb48acedc7aedd2c9436c92877ecfa2bf36a7d1eaf0ef8895e0adaec54f532182c7a7b0dc8e057d485387ca1b00ba09bb8b79550
    • blake3:529335c194bceb86f853b7ad2db103fcb63fd0d9d9501e8b8610ea043cb9485c
    • bytes:26141343
    • crc32:c4a131b8
    • entropy:5.195488 per byte that is (64.9436 %)
    • file:(Unicode text, UTF-8 text, with very long lines (15435))
    • hex32:5b7b226964223a2232343839363531303435222c2274797065223a2243726561
    • md5:67a1a08c5d0638f0af254d6c0243696d
    • mime-encoding:(utf-8)
    • mime-type:(text/plain)
    • sha:6c5c3f760bb64426760682166e6df9218fe81b0f
    • sha256:4fc1e52c4e609febd05d75a24c84bc6957fa4d2cfb0d5fbebbac650bdc7ed8c0
    • sha384:e1b420c6b145f31a41a19b7f6365f8c25b337858c6d41c4132ef00deb3e3248e209e8b6c30e1fa7e072706927f393272
    • sha512:47bc1b24ffc67c0ebf3625ccd9bf73af2f08956b550e3bf1699c64b8e5685d50f8569ea560bdc050e3b980476f220e262c264e4baabadae57988ee72e526f4a9
    • ssdeep:49152:pjktmgtlFHs0ImDlf/2jhj1a7EksjuyOVQLHa7Ew4MePnj1hEyJWQCQzQQQQQ0kB:w
    • tlsh:T12F47D0E342884496CF433EC0988DB7C892ABA05BDFC4EC49D7B5DC19C9585FB12CE65A

Based on the benchmark results below, the throughput is around 30361606 incoming bytes per second (around 29 Megabytes/second).
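
That figure follows directly from the input size and the mean wall time of roughly 861 ms reported for the file-path canonicalization benchmark further below:

```python
input_bytes = 26141343  # size of large-file.json per the fingerprints above
mean_seconds = 0.861    # ~mean wall time of the canonicalization run

throughput = input_bytes / mean_seconds
print(f"{throughput:.0f} bytes/s")        # ~30.4 million bytes per second
print(f"{throughput / 2**20:.1f} MiB/s")  # ~29 MiB/s
```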

The JCS "compression" in the canonicalization case is around 1.0004 for that input file (the resulting jcs.json has a size of 26129992 bytes), so there is nearly no compression. The entropy, however, drops from 5.195488 to 5.192266 bits per byte, i.e. from 64.9436 % down to 64.903325 % (a change smaller than 0.05 percentage points, or 0.06 % of the incoming entropy). Comparing canonicalization with serialization yields a surprise:

% hyperfine --warmup 3 \
  'tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -s -o json-iterator_test-data_sha1-0bce3798_large-file.serialized.json' \
  'tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -o json-iterator_test-data_sha1-0bce3798_large-file.jcs.json'
Benchmark 1: tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -s -o json-iterator_test-data_sha1-0bce3798_large-file.serialized.json
  Time (mean ± σ):      1.076 s ±  0.014 s    [User: 0.935 s, System: 0.060 s]
  Range (min … max):    1.062 s …  1.100 s    10 runs

Benchmark 2: tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -o json-iterator_test-data_sha1-0bce3798_large-file.jcs.json
  Time (mean ± σ):     861.3 ms ±   1.9 ms    [User: 733.6 ms, System: 57.6 ms]
  Range (min … max):   859.3 ms … 865.2 ms    10 runs

Summary
  tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -o json-iterator_test-data_sha1-0bce3798_large-file.jcs.json ran
    1.25 ± 0.02 times faster than tallipoika json-iterator_test-data_sha1-0bce3798_large-file.json -s \
                                    -o json-iterator_test-data_sha1-0bce3798_large-file.serialized.json

Note: The only difference between the two transforms is that serialization maintains the order of the incoming keys while canonicalization sorts them. This does not mean that ordering is free in the serialize-only case: it may well be more costly to maintain the incoming order on output than to sort on output. This is plausibility reasoning only; I did not inspect the relevant code paths of the underlying implementation in use (on that machine).
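
Under the assumption that the serializer behaves like a json.dumps-style encoder (a guess, not verified against the actual implementation), the relative cost of sorting versus preserving key order can be probed directly with the standard library:

```python
import json
import random
import string
import timeit

# Build a nested document with many objects whose keys arrive unsorted.
random.seed(7)
doc = [
    {"".join(random.choices(string.ascii_lowercase, k=8)): i
     for i in range(50)}
    for _ in range(200)
]

# Time compact serialization with and without key sorting.
t_keep = timeit.timeit(
    lambda: json.dumps(doc, separators=(",", ":")), number=50)
t_sort = timeit.timeit(
    lambda: json.dumps(doc, separators=(",", ":"), sort_keys=True), number=50)

print(f"preserve order: {t_keep:.3f}s, sort keys: {t_sort:.3f}s")
```

The absolute numbers depend on the machine and Python version; the point is only that sorting on output is not automatically the slower path.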

Comparing the above timings with the following (using stdin and stdout instead of file paths) yields no surprises, though:

% hyperfine --warmup 3 \
  'tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json -s > json-iterator_test-data_sha1-0bce3798_large-file.serialized.json' \
  'tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json > json-iterator_test-data_sha1-0bce3798_large-file.jcs.json'
Benchmark 1: tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json -s > json-iterator_test-data_sha1-0bce3798_large-file.serialized.json
  Time (mean ± σ):      1.069 s ±  0.002 s    [User: 0.938 s, System: 0.060 s]
  Range (min … max):    1.066 s …  1.075 s    10 runs

Benchmark 2: tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json > json-iterator_test-data_sha1-0bce3798_large-file.jcs.json
  Time (mean ± σ):     866.9 ms ±   4.8 ms    [User: 736.5 ms, System: 58.7 ms]
  Range (min … max):   862.9 ms … 877.9 ms    10 runs

Summary
  tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json > json-iterator_test-data_sha1-0bce3798_large-file.jcs.json ran
    1.23 ± 0.01 times faster than tallipoika < json-iterator_test-data_sha1-0bce3798_large-file.json -s \
                                    > json-iterator_test-data_sha1-0bce3798_large-file.serialized.json