Processing Wikidata JSON dumps in parallel on the command line with dgsh


Motivation

A few weeks ago, I wondered what the most common descriptions on Wikidata were. I usually use the Wikidata Query Service to answer any questions I have about Wikidata, but this time that didn’t work: there are too many descriptions on Wikidata to check them all without running into the query service’s timeout. At first, I was reduced to guessing a description and checking how many times it was used (counting the occurrences of a single description is possible in the query service). However, after a quick chat on IRC, Nikki produced a script that went through a recent JSON dump of Wikidata and collected the most common descriptions – first for just a handful of languages (pastebin), then for some more (pastebin).

First version

I liked the idea of going through the dump to collect these statistics, and I wanted to see if it was feasible to do this from a shell script, so I put together a quick pipeline and let it run on my laptop. I posted the result in a GitHub Gist, and the pipeline I used was:

time pv latest-all-20170718.json |
    head -n-1 |
    tail -n+2 |
    sed 's/,$//' |
    jq -r '
      .descriptions |
      .[] |
      (.language + "\t" + .value)
    ' |
    awk '
      {
        a[$0]++
      }
      END {
        for (k in a)
          if (a[k] >= 1000)
            print a[k] "\t" k
      }
    ' |
    sort -nr \
    > common-descriptions

Here’s the breakdown (feel free to skip any part you feel you already understand ☺):

  1. time measures how long the whole pipeline takes.
  2. pv copies the dump file to standard output while displaying progress and throughput.
  3. head -n-1 drops the last line of the dump, the closing ] of the JSON array.
  4. tail -n+2 drops the first line, the opening [.
  5. sed 's/,$//' strips the trailing comma from each remaining line, leaving one complete JSON entity per line.
  6. jq -r prints, for every description of every entity, its language and value separated by a tab.
  7. awk counts how often each language–description line occurs and, at the end, prints every line that occurred at least a thousand times together with its count.
  8. sort -nr sorts those lines in descending numeric order of their counts, and the result goes into the file common-descriptions.
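
To make the array-to-lines step concrete, here is a tiny sketch of what head, tail and sed do, run on a made-up two-entity dump instead of the real one (the real dump is a single JSON array with one entity per line):

printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' |
    head -n-1 |      # drop the closing ]
    tail -n+2 |      # drop the opening [
    sed 's/,$//'     # strip the trailing comma from each line
# prints:
# {"id":"Q1"}
# {"id":"Q2"}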

Complaints

This pipeline works, but I wasn’t entirely happy with it, for two reasons:

  1. We’re using three different commands just to convert the input array to a more convenient form, none of which is very efficient for this task, and one of which does regex processing!
  2. The entire pipeline is CPU‑bound by jq, and that part is not parallelized, so we’re only using one CPU core.

Enter ja2l

To fix the first problem, I wrote a tiny C program to do the task of the three commands instead:

#include <stdio.h>
#include <stdlib.h>

int main() {
  char *line = NULL;
  size_t len = 0;
  ssize_t read;

  /* read standard input line by line; getline() keeps the trailing newline
     and returns the number of characters read */
  while ((read = getline(&line, &len, stdin)) != -1) {
    /* assume the line ends in ",\n": cut it off before the comma,
       then print the bare JSON entity followed by its own newline */
    line[read-2] = 0;
    printf("%s\n", line);
  }

  free(line);
  exit(EXIT_SUCCESS);
}

It’s simple, but it does its job and is much more efficient than the three‑command pipeline. I eventually cleaned it up a bit (bugfixes, command line options, error handling, a license, …) and published it as ja2l (JSON array to lines).
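
For illustration, here is what the original pipeline looks like with the three commands replaced by ja2l (a sketch, assuming ja2l has been built and sits in the current directory, and using the same dump file as before):

pv latest-all-20170718.json |
    ./ja2l |
    jq -r '.descriptions | .[] | (.language + "\t" + .value)' |
    awk '{ a[$0]++ } END { for (k in a) if (a[k] >= 1000) print a[k] "\t" k }' |
    sort -nr > common-descriptions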

Adding dgsh support

To fix the second problem, I had the idea of using dgsh to parallelize the processing. dgsh stands for “directed graph shell” and is an alternative shell (technically, a fork of bash) that extends the notion of pipelines from a single linear chain of commands to directed acyclic graphs. Each command can have any number of inputs and outputs, and when a pipeline is started, the programs negotiate the input/output file descriptors between them before starting to execute. This means that we can scatter the JSON lines of the Wikidata dump across several jq processes and then gather the results of those processes before summarizing the description counts and printing the most common descriptions.

That is, we transform the traditional pipeline shown in figure one into the parallelized version shown in figure two.

Figure one: The traditional shell pipeline.
Figure two: The parallelized pipeline. dgsh-tee is another name for cat, which simply concatenates the four output streams into one; the extra awk is needed to sum up the partial counts of each jq | awk sub‑pipeline.
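
Concretely, in dgsh syntax the parallelized pipeline of figure two looks like this (a skeleton of the full script shown further down, using the countDescriptions and summarizeDescriptions shell functions defined there):

pv -- "$file" |
    ./ja2l |                    # scatter the entity lines across the branches
    {{
        countDescriptions &     # four parallel jq | awk counting branches
        countDescriptions &
        countDescriptions &
        countDescriptions &
    }} |
    cat |                       # gather the four partial results into one stream
    summarizeDescriptions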

The version of tee included with dgsh can do this scattering with the -s option, but I wanted to add support for this to ja2l as well: partly because it’s ever so slightly more efficient (ja2l already knows where the line breaks are, since it has to remove the trailing commas, whereas tee has to look for them in the input stream), and partly just to play with dgsh. It turns out that the extra code in ja2l is fairly small, and building with dgsh support is as simple as adding a preprocessor flag and a linker library.
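
As a rough sketch of what that means for the build (the macro and library names here are placeholders, not necessarily the ones ja2l and dgsh actually use):

cc -O2 -o ja2l ja2l.c                      # plain build
cc -O2 -DUSE_DGSH -o ja2l ja2l.c -ldgsh    # hypothetical dgsh-enabled build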

Result

I am very happy to report that on a system with two physical processor cores (hyperthreaded), using ja2l and dgsh results in a 2.16× speedup (wall‑clock time) over the original script, with a pv‑reported throughput of 69.1 MiB/s instead of 31.2 MiB/s (2.21×). I tested this with the following polyglot script:

# count language/description pairs in one stream of JSON entities; emit partial
# counts, separated from the key by \036 (ASCII record separator), since the
# key itself already contains a tab
function countDescriptions {
    jq -r '
      .descriptions |
      .[] |
      (.language + "\t" + .value)
    ' | awk '
      {
        a[$0]++
      }
      END {
        for (k in a)
          print a[k] "\036" k
      }
    '
}

# add up the partial counts from all branches and print the descriptions
# used at least a thousand times, most common first
function summarizeDescriptions {
    awk -F$'\036' '
      {
        a[$2] += $1
      }
      END {
          for (k in a)
            if (a[k] >= 1000)
              print a[k] "\t" k
      }
    ' |
    sort -nr
}

file=~/Downloads/latest-all-20170718/latest-100000-20170718.json

# {{ ... }} is a dgsh multipipe block; under plain bash, {{ is just an unknown
# command that fails (its error message is discarded by the 2>/dev/null),
# so $dgsh ends up true only when running under dgsh
if {{ : ; }} ; then dgsh=true; else dgsh=false; fi 2>/dev/null

time if $dgsh; then
    pv -- "$file" |
        ./ja2l |
        {{
            countDescriptions &
            countDescriptions &
            countDescriptions &
            countDescriptions &
        }} |
        cat |
        summarizeDescriptions > common-descriptions-dgsh
else
    pv -- "$file" |
        head -n-1 |
        tail -n+2 |
        sed 's/,$//' |
        countDescriptions |
        summarizeDescriptions > common-descriptions-bash
fi

The script tests whether it’s running under dgsh or bash, and chooses the parallel or non-parallel pipeline accordingly. It should be possible to write similar scripts for many other data-processing tasks across a Wikidata dump that can be expressed in jq, awk and other Unix command line tools, and to speed them up in the same way.
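
To run the comparison yourself, the same file works under either shell, assuming the script above is saved as, say, count-descriptions.sh (a made-up name for this example):

bash count-descriptions.sh   # sequential pipeline, writes common-descriptions-bash
dgsh count-descriptions.sh   # parallel pipeline, writes common-descriptions-dgsh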