11. Dataimport

CubicWeb is designed to easily manipulate large amounts of data, and provides utilities to make imports simple.

The main entry point is cubicweb.dataimport.importer, which defines an ExtEntitiesImporter class responsible for importing data from an external source in the form of ExtEntity objects. An ExtEntity is a transitional representation of an entity to be imported into the CubicWeb instance; building this representation is usually domain-specific – e.g. dependent on the kind of data source (RDF, CSV, etc.) – and is thus the responsibility of the end-user.

Along with the importer, a store must be selected; it is responsible for inserting the data into the database. There exist different kinds of stores, allowing data to be inserted at different levels of the CubicWeb API and with different speed/safety trade-offs. Stores that keep all the CubicWeb hooks and security checks are slower, but possible insertion errors (bad data types, integrity errors, …) will be handled.

11.1. Example

Consider the following schema snippet.

from yams.buildobjs import EntityType, RelationDefinition, String

class Person(EntityType):
    name = String(required=True)

class knows(RelationDefinition):
    subject = 'Person'
    object = 'Person'

along with some data in a people.csv file:

# uri,name,knows
http://www.example.org/alice,Alice,
http://www.example.org/bob,Bob,http://www.example.org/alice

The following code (meant to be run in a shell context) defines a function extentities_from_csv that reads Person external entities from a CSV file, then uses an ExtEntitiesImporter to insert the corresponding entities and relations into the CubicWeb instance.

from cubicweb.dataimport import ucsvreader, RQLObjectStore
from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter

def extentities_from_csv(fpath):
    """Yield Person ExtEntities read from `fpath` CSV file."""
    with open(fpath) as f:
        for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
            # attribute and relation values must be sets; skip the empty
            # string produced when the `knows` cell is blank
            yield ExtEntity('Person', uri,
                            {'name': set([name]),
                             'knows': set([knows]) if knows else set()})

extentities = extentities_from_csv('people.csv')
store = RQLObjectStore(cnx)
importer = ExtEntitiesImporter(schema, store)
importer.import_entities(extentities)
commit()
rset = cnx.execute('String N WHERE X name N, X knows Y, Y name "Alice"')
assert rset[0][0] == u'Bob', rset

11.2. Importer API

Data import of external entities.

Main entry points:

class cubicweb.dataimport.importer.ExtEntitiesImporter(schema, store, extid2eid=None, existing_relations=None, etypes_order_hint=(), import_log=None, raise_on_error=False)[source]

This class is responsible for importing external entities, i.e. instances of ExtEntity, into CubicWeb entities.

Parameters:
  • schema – the CubicWeb instance's schema
  • store – a CubicWeb Store
  • extid2eid – optional {extid: eid} dictionary giving information on existing entities. It will be completed during import. You may want to use cwuri2eid() to build it.
  • existing_relations – optional {rtype: set((subj eid, obj eid))} mapping giving information on existing relations of a given type. You may want to use RelationMapping to build it.
  • etypes_order_hint – optional ordered iterable of entity types, giving a hint on the order in which their import should be attempted
  • import_log – optional object implementing the SimpleImportLog interface, used to record events occurring during the import
  • raise_on_error – optional boolean flag (defaults to False) indicating whether errors should be raised or logged. You usually want them raised during tests but logged in production.

Instances of this class are meant to import external entities through import_entities(), which handles a stream of ExtEntity objects. Arbitrary filters may be plugged into that stream.
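
For instance, a minimal sketch of such a filter, assuming schema, store and extentities_from_csv are available as in the example above (the drop_unnamed generator is purely illustrative):

def drop_unnamed(ext_entities):
    """Skip external entities that have no 'name' value."""
    for ext_entity in ext_entities:
        if ext_entity.values.get('name'):
            yield ext_entity

importer = ExtEntitiesImporter(schema, store)
importer.import_entities(drop_unnamed(extentities_from_csv('people.csv')))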

import_entities(ext_entities)[source]

Import the given stream of external entities (ExtEntity), usually a generator.

class cubicweb.dataimport.importer.ExtEntity(etype, extid, values=None)[source]

Transitional representation of an entity for use in data importer.

An external entity has the following properties:

  • extid (external id), an identifier for the ext entity,
  • etype (entity type), a string which must be the name of one entity type in the schema (e.g. 'Person', 'Animal', …),
  • values, a dictionary whose keys are attribute or relation names from the schema (e.g. 'first_name', 'friend'), and whose values are sets. For attributes of type Bytes, byte strings should be inserted in values.

For instance:

ext_entity.extid = 'http://example.org/person/debby'
ext_entity.etype = 'Person'
ext_entity.values = {'first_name': set([u"Deborah", u"Debby"]),
                    'friend': set(['http://example.org/person/john'])}

Utilities:

cubicweb.dataimport.importer.cwuri2eid(cnx, etypes, source_eid=None)[source]

Return a dictionary mapping cwuri to eid for entities of the given entity types and / or source.

class cubicweb.dataimport.importer.RelationMapping(cnx, source=None)[source]

Read-only mapping from relation type to set of related (subject, object) eids.

If source is specified, only relations involving entities from this source are returned.
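
For instance, a sketch of how these utilities may be combined for an incremental import, reusing cnx, schema, store and extentities_from_csv from the example above so that entities and relations created by a previous run are taken into account:

from cubicweb.dataimport.importer import (ExtEntitiesImporter, RelationMapping,
                                           cwuri2eid)

# map already imported Person entities by their cwuri and collect existing
# relations (e.g. 'knows') so that they are not recreated
extid2eid = cwuri2eid(cnx, ('Person',))
existing_relations = RelationMapping(cnx)
importer = ExtEntitiesImporter(schema, store,
                               extid2eid=extid2eid,
                               existing_relations=existing_relations)
importer.import_entities(extentities_from_csv('people.csv'))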

cubicweb.dataimport.importer.use_extid_as_cwuri(extid2eid)[source]

Return a filter to be applied to a stream of ExtEntity objects; it sets cwuri from the entity's extid when the entity does not exist yet and has no cwuri defined.

extid2eid is an extid to eid dictionary coming from an ExtEntitiesImporter instance.

Example usage:

importer = ExtEntitiesImporter(schema, store, import_log=import_log)
set_cwuri = use_extid_as_cwuri(importer.extid2eid)
importer.import_entities(set_cwuri(extentities))

11.2.1. Stores

Stores are responsible for inserting properly formatted entities and relations into the database. They have the following API:

>>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
>>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
>>> store.prepare_insert_relation(user_eid, 'in_group', group_eid)
>>> store.flush()
>>> store.commit()
>>> store.finish()

Some stores require a flush to actually copy data to the database, so if you want store-independent code you should call it explicitly (there may be several flushes during the process, or a single one at the end if memory is not an issue); see the sketch after the list below. This is different from commit, which validates the database transaction. Finally, the finish() method should be called in case the store requires additional work once everything is done.

  • prepare_insert_entity(<entity type>, **kwargs) -> eid: given an entity type, attributes and inlined relations, return the eid of the entity to be inserted, with no guarantee that anything has been inserted in the database yet,
  • prepare_update_entity(<entity type>, eid, **kwargs) -> None: given an entity type and eid, promise to update the given attributes and inlined relations, with no guarantee that anything has been written to the database yet,
  • prepare_insert_relation(eid_from, rtype, eid_to) -> None: indicate that a relation rtype should be added between the entities with eids eid_from and eid_to. As with prepare_insert_entity(), there is no guarantee that the relation will have been inserted in the database yet,
  • flush() -> None: flush any temporary data to the database. May be called several times during an import,
  • commit() -> None: commit the database transaction,
  • finish() -> None: perform any additional work needed once the import is finished.
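
A minimal sketch of a store-independent import loop following this protocol (the rows list, the 'knows' relation between consecutive persons and the batch size are purely illustrative):

rows = [{'name': u'Alice'}, {'name': u'Bob'}]    # illustrative input data
previous_eid = None
for i, row in enumerate(rows):
    eid = store.prepare_insert_entity('Person', name=row['name'])
    if previous_eid is not None:
        # illustrative relation between consecutive persons
        store.prepare_insert_relation(previous_eid, 'knows', eid)
    previous_eid = eid
    if i % 1000 == 999:    # flush in batches to keep memory usage bounded
        store.flush()
store.flush()    # final flush for whatever is left
store.commit()
store.finish()
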
class cubicweb.dataimport.stores.NullStore[source]

Store that mainly describes the store API.

It may be handy for testing input data files or for measuring the time taken by the steps above the store (e.g. data parsing, the importer, etc.): simply give a NullStore instance instead of the actual store.
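
For instance, a sketch of a dry run exercising the parsing and importing steps of the example above without writing anything to the database:

from cubicweb.dataimport.stores import NullStore

store = NullStore()
importer = ExtEntitiesImporter(schema, store, raise_on_error=True)
importer.import_entities(extentities_from_csv('people.csv'))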

class cubicweb.dataimport.stores.RQLObjectStore(cnx)[source]

Store that works by issuing RQL queries, hence with all of CubicWeb's machinery activated.

class cubicweb.dataimport.stores.NoHookRQLObjectStore(cnx, metagen=None)[source]

Store that works by accessing CubicWeb's low-level source API, with all hooks deactivated. It may be given a metadata generator object to handle metadata that is usually handled by hooks.

Arguments:

  • cnx, a connection to the repository
  • metagen, optional MetadataGenerator instance

class cubicweb.dataimport.stores.MetadataGenerator(cnx, baseurl=None, source=None, meta_skipped=())[source]

Class responsible for generating standard metadata for imported entities. You may want to derive it to add application-specific metadata. This class (or a subclass) may be given either to a no-hook or to a massive store.

Parameters:

  • cnx: connection to the repository
  • baseurl: optional base URL to be used for cwuri generation - defaults to config['base-url']
  • source: optional source to be used as cw_source for imported entities
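
For instance, a minimal sketch combining a metadata generator with the no-hook store (the base URL is illustrative, and cnx, schema and extentities_from_csv are assumed to be available as in the example above):

from cubicweb.dataimport.stores import MetadataGenerator, NoHookRQLObjectStore

metagen = MetadataGenerator(cnx, baseurl=u'http://www.example.org/')
store = NoHookRQLObjectStore(cnx, metagen)
importer = ExtEntitiesImporter(schema, store)
importer.import_entities(extentities_from_csv('people.csv'))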

11.3. MassiveObjectStore

This store relies on the PostgreSQL COPY FROM command to push data directly with SQL rather than going through the whole CubicWeb API. For now it only works with PostgreSQL, since it requires the COPY FROM command. Anything related to CubicWeb (hooks, for instance) is bypassed. Entities are inserted directly, using one PostgreSQL COPY FROM query per set of similarly structured entities.

This store is the fastest when the existing table is small compared to the volume of data to insert: it drops all indexes and constraints on the table before importing and recreates them at the end, so it pays off only when the amount of inserted data outweighs the cost of rebuilding them.

NOTE: Because inlined [1] relations are stored in the entity's table, they must be set like any other attribute of the entity. For instance:

store.prepare_insert_entity("MyEType", name="toto", favorite_email=email_address.eid)

[1] An inlined relation is a relation defined in the schema with the keyword argument inlined=True. Such a relation is inserted in the database as an attribute of the entity that is its subject.

class cubicweb.dataimport.massive_store.MassiveObjectStore(cnx, slave_mode=False, eids_seq_range=10000, metagen=None, drop=True)[source]

Store for massive import of data, with delayed insertion of metadata.

WARNINGS:

  • This store may only be used with PostgreSQL for now, as it relies on the COPY FROM method, and on specific PostgreSQL tables to get all the indexes.
  • This store can only insert relations that are not inlined (i.e., which do not have inlined=True in their definition in the schema), unless they are specified as entity attributes.

It should be used as follows:

store = MassiveObjectStore(cnx)
eid_p = store.prepare_insert_entity('Person',
                                    cwuri=u'http://dbpedia.org/toto',
                                    name=u'Toto')
eid_loc = store.prepare_insert_entity('Location',
                                      cwuri=u'http://geonames.org/11111',
                                      name=u'Somewhere')
store.prepare_insert_relation(eid_p, 'lives_in', eid_loc)
store.flush()
...
store.commit()
store.finish()

Full-text indexing is not handled; you will have to reindex the relevant entity types yourself if desired.

Create a MassiveObjectStore, with the following arguments:

  • cnx, a connection to the repository
  • metagen, optional MetadataGenerator instance
  • eids_seq_range: size of eid range reserved by the store for each batch