Datahike: Reimportable, Schema-First Export

When may Datahike migration choke?

This can only occur if you migrate a Datahike database that grew alternately: first data, then schema, then more data.

So you are unlikely to see this in a setting where you have a fixed, huge dataset you want to work with.

You will probably see this when you use Datahike as the backend of an application that grows over time while it is in use.

What happens: Import of exported db chokes on schema

The Datahike migrate namespace exports the db by writing the eavt index to disk. This index is sorted by entity id.

Since a living project will add schema over time, and schema gets stored into the same index as data, some schema definitions receive higher entity ids than some data entities, so they get sorted later in the export.

New datoms may then add attributes to entities whose entity id is lower than the entity id of the attribute's schema definition. In the export, these entities then carry attributes whose schema is only defined further down.

This means that the sorted export cannot be imported in its sorted order.

Solution

The solution is to sort the schema datoms to the front when importing.
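
The shell script in the Implementation section below does this on the exported file before it is imported. The same idea can also be sketched directly in Clojure; this is only a rough sketch, assuming one #datahike/Datom line per datom in the export file, and the file names are placeholders:

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Put every datom of a schema entity (an entity that has a :db/ident
;; datom) before all other datoms, keeping the relative order otherwise.
(defn schema-first [in-path out-path]
  (with-open [rdr (io/reader in-path)]
    (let [lines      (vec (line-seq rdr))
          entity-id  (fn [line]
                       (second (re-find #"^#datahike/Datom \[(\d+) " line)))
          schema-ids (->> lines
                          (filter #(str/includes? % " :db/ident "))
                          (map entity-id)
                          set)
          schema?    (fn [line] (contains? schema-ids (entity-id line)))]
      (with-open [w (io/writer out-path)]
        (doseq [line (concat (filter schema? lines)
                             (remove schema? lines))]
          (.write w line)
          (.write w "\n"))))))

;; (schema-first "dump" "dump-sorted")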

Difference to Datomic

In Datomic the schema lives in a separate database from the data, so Datomic avoids this problem altogether.

Implementation

#!/usr/bin/env sh
#
# Datahike mixes schema with data.  When an export is
# imported in its original order, data can arrive before
# the schema that defines it, and the import chokes.
# This script sorts the schema datoms before the data.

if [ -z "${1}" ] || [ -z "${2}" ]
 then
  echo "sort schema of datahike export to top
usage: infile outfile"
  exit 1
fi

if [ ! -f "${1}" ]
 then
  echo "infile ${1} not found"
  exit 1
fi

# Collect the schema entity ids and create one big regex.
# grep finds all lines with :db/ident at the relevant position
# cut gets the entity id with a leading '['
# tr  deletes the leading '['
# This leaves us with entity ids, integers, separated by '\n'
# tr  replaces the '\n' by '|'
# The first sed expression removes the trailing '|'
# The second sed expression escapes the remaining '|'
# The result is 'ids' like '1\|2\|3', which is spliced into 'pattern'.

ids=`grep '^#datahike/Datom \[[0-9]* :db/ident' "${1}" \
 | cut -d' ' -f2 \
 | tr -d '[' \
 | tr "\n" "|" \
 | sed 's/|$//g;s/|/\\\\|/g'`

# Match any datom whose entity id is one of the schema ids.
pattern="^#datahike/Datom \[\(${ids}\) "

# echo $ids
# echo $pattern

# Write the schema datoms first, then everything else.
grep "${pattern}" "${1}" > "${2}"
grep -v "${pattern}" "${1}" >> "${2}"
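
To use it, run the script on the export file before importing, e.g. with the dump file produced in the example below as infile and a new file such as dump-sorted as outfile, and then import the sorted file instead of the original dump.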

Details

The following is a minimal example of extending a schema after some data has been transacted, and then using the new attribute.

(ns core
  (:require [datahike.api :as d]
            [datahike.migrate :as m]))

(def db-config {:store              {:backend :file
                                     :path    "datahike-db"}
                :schema-flexibility :write
                :keep-history?      true})

(d/create-database db-config)

(def conn (d/connect db-config))

(d/transact conn [{:db/ident :demo/name
                   :db/cardinality :db.cardinality/one
                   :db/valueType   :db.type/string}])

(d/transact conn [{:demo/name "Mike"}])

(d/transact conn [{:db/ident       :demo/likes
                   :db/cardinality :db.cardinality/many
                   :db/valueType   :db.type/string}])

(d/transact conn [{:db/id      2
                   :demo/likes "Pizza"}])

(m/export-db @conn "dump")

The attribute :demo/likes is defined after Mike has been added and is then used on Mike.
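
In the live database this is perfectly fine. As a quick check (not part of the original example, just to show the data is consistent), a query along these lines should return Mike together with what he likes:

(d/q '[:find ?name ?likes
       :where
       [?e :demo/name  ?name]
       [?e :demo/likes ?likes]]
     @conn)
;; expected result: #{["Mike" "Pizza"]}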

The export dumps the following index.

#datahike/Datom [1 :db/cardinality :db.cardinality/one 536870913 true]
#datahike/Datom [1 :db/ident :demo/name 536870913 true]
#datahike/Datom [1 :db/valueType :db.type/string 536870913 true]
#datahike/Datom [2 :demo/likes "Pizza" 536870916 true]
#datahike/Datom [2 :demo/name "Mike" 536870914 true]
#datahike/Datom [3 :db/cardinality :db.cardinality/many 536870915 true]
#datahike/Datom [3 :db/ident :demo/likes 536870915 true]
#datahike/Datom [3 :db/valueType :db.type/string 536870915 true]

As you can see, the attribute :demo/likes is defined only after its first use in the index.

Important: this small dump can still be imported back. The import only chokes on bigger dumps. The import works in chunks, and I suspect it chokes when datoms refer to schema that is only declared in a future chunk.

In that case Datahike reports an error like:

20-12-04 13:06:35 beasty ERROR [datahike.db:1041] 
- Bad entity attribute  :user/roles
  at  [:db/add 134 :user/roles :admin 536871112],
  not defined in current schema
  {:error :transact/schema,
   :attribute :user/roles,
   :context [:db/add 134 :user/roles :admin 536871112]}
20-12-04 13:06:35 beasty ERROR [datahike.connector:74]
- Error during transaction clojure.lang.ExceptionInfo:
  Bad entity attribute :user/roles
  at [:db/add 134 :user/roles :admin 536871112],
  not defined in current schema
  {:error :transact/schema,
   :attribute :user/roles,
   :context [:db/add 134 :user/roles :admin 536871112]}
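
With the schema datoms sorted to the front by the script above, the dump can be read back into a fresh database. A minimal sketch, reusing the config from the example, assuming the sorted export is in a file called dump-sorted and that import-db is the counterpart of export-db in datahike.migrate:

;; Fresh target database; the path is a placeholder.
(def restore-config {:store              {:backend :file
                                          :path    "datahike-db-restored"}
                     :schema-flexibility :write
                     :keep-history?      true})

(d/create-database restore-config)

(def restore-conn (d/connect restore-config))

;; Read the sorted dump back; schema datoms now arrive first.
(m/import-db restore-conn "dump-sorted")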

The bigger picture

This might be solved in Datahike itself, either by changing the export, by changing the import, or by separating data and schema. I do not know whether importing into Datahike out of entity-id order, as I do here, has any detrimental implications for building the index.

Sidenote: grep is faster than sed with extended regexes

The code above grew organically as I developed it step by step on the command line. After using it and thinking I was done, I tried to simplify it.

So I tried to replace the grep, the cut, and the tr invocations with a single sed using extended regular expressions.

It turns out that the whole program takes twice as long when I create the pattern like this:

ids=`sed -n \
 -E 's/^#datahike\/Datom \[([0-9]*) :db\/ident.*/\1/p' \
    "${1}" \
 | tr "\n" "|" | sed 's/|$//g;s/|/\\\\|/g'`

So the line itself takes closer to three times as long as the approach above. The first approach processes about 100,000 datoms/s on my laptop.