10. Full Text Indexing in CubicWeb
When an attribute is tagged as fulltext-indexable in the datamodel,
CubicWeb automatically triggers hooks that update the internal
fulltext index (i.e. the appears SQL table) each time this attribute
is modified.
CubicWeb also provides a db-rebuild-fti command to rebuild the whole
fulltext index on demand:
cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance
You can also rebuild the fulltext index for a given set of entity types:
cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance Ticket Version
In the above example, only the fulltext index of the entity types Ticket
and Version will be rebuilt.
10.1. Standard FTI process
Considering an entity type ET, the default fti process is to:
- fetch all entities of type ET
- for each entity, adapt it to IFTIndexable (see IFTIndexableAdapter)
- call get_words() on the adapter, which is supposed to return a dictionary weight -> list of words as expected by index_object(). The tokenization of each attribute value is done by tokenize().
See IFTIndexableAdapter for more documentation.
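The steps above can be sketched in plain Python. This is a toy model under stated assumptions, not CubicWeb code: the Ticket class, the IndexableAdapter class, the rebuild_index() function and the simplified tokenize() are all hypothetical stand-ins, and the only thing taken from the description above is the shape of the get_words() result (a dictionary weight -> list of words).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Simplified stand-in tokenizer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


class Ticket:
    """Hypothetical entity with two fulltext-indexed attributes."""
    indexed_attributes = ("title", "description")

    def __init__(self, eid, title, description):
        self.eid = eid
        self.title = title
        self.description = description


class IndexableAdapter:
    """Hypothetical stand-in for an IFTIndexable adapter."""

    def __init__(self, entity):
        self.entity = entity

    def get_words(self):
        # Return {weight: [words]}; every attribute gets weight 'C' here.
        words = defaultdict(list)
        for attr in self.entity.indexed_attributes:
            words["C"].extend(tokenize(getattr(self.entity, attr)))
        return dict(words)


def rebuild_index(entities):
    """Iterate over the entities, adapt each one and collect its words."""
    index = {}
    for entity in entities:
        adapter = IndexableAdapter(entity)
        index[entity.eid] = adapter.get_words()
    return index


tickets = [Ticket(1, "Login fails", "Cannot log in with LDAP")]
index = rebuild_index(tickets)
print(index)
```

In a real instance the entities would come from the database and the result would be stored in the fulltext index rather than in a dictionary.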
10.2. Yams and fulltext_container
It is possible in the datamodel to indicate that fulltext-indexed
attributes defined for an entity type will be used to index not the
entity itself but a related entity. This is especially useful for
composite entities. Let’s take a look at (a simplified version of)
the base schema defined in CubicWeb (see cubicweb.schemas.base):
from yams.buildobjs import EntityType, RelationDefinition, String, Password
from cubicweb.schema import WorkflowableEntityType

class CWUser(WorkflowableEntityType):
    login = String(required=True, unique=True, maxsize=64)
    upassword = Password(required=True)

class EmailAddress(EntityType):
    address = String(required=True, fulltextindexed=True,
                     indexed=True, unique=True, maxsize=128)

class use_email_relation(RelationDefinition):
    name = 'use_email'
    subject = 'CWUser'
    object = 'EmailAddress'
    cardinality = '*?'
    composite = 'subject'
The schema above states that there is a relation between CWUser and
EmailAddress and that the address field of EmailAddress is fulltext
indexed. Therefore, in your application, if you use fulltext search to
look for an email address, CubicWeb will return the EmailAddress itself.
But the objects we’d like to index are more likely to be the associated
CWUser than the EmailAddress itself.
The simplest way to achieve that is to tag the use_email relation in
the datamodel:
class use_email(RelationType):
    fulltext_container = 'subject'
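The effect of this tag can be illustrated with a small standalone model. It is only a sketch of the idea, not of CubicWeb's implementation: the Entity class, its container attribute and the fti_container() and build_index() helpers are hypothetical names introduced for the example.

```python
class Entity:
    """Hypothetical minimal entity: a type name, an eid and some text."""

    def __init__(self, etype, eid, text=""):
        self.etype = etype
        self.eid = eid
        self.text = text
        self.container = None  # entity to index instead of this one, if any


def fti_container(entity):
    """Follow container links until reaching the entity to index."""
    while entity.container is not None:
        entity = entity.container
    return entity


def build_index(entities):
    """Map each word to the eids of the container entities."""
    index = {}
    for entity in entities:
        target = fti_container(entity)
        for word in entity.text.lower().split():
            index.setdefault(word, set()).add(target.eid)
    return index


user = Entity("CWUser", 1)
email = Entity("EmailAddress", 2, "alice@example.com")
# Models fulltext_container = 'subject' on use_email: the subject
# (the CWUser) is indexed in place of the EmailAddress.
email.container = user

index = build_index([user, email])
print(index)
```

In this model, searching for the address leads to eid 1 (the CWUser) and the EmailAddress no longer appears in the index under its own eid.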
10.3. Customizing how entities are fetched during db-rebuild-fti
db-rebuild-fti calls the cw_fti_index_rql_limit() class method on your
entity type; override that method to control how the entities to index
are fetched.
10.4. Customizing get_words()
You can also customize the FTI process by providing your own get_words()
implementation:
from cubicweb.predicates import is_instance
from cubicweb.entities.adapters import IFTIndexableAdapter

class SearchIndexAdapter(IFTIndexableAdapter):
    __regid__ = 'IFTIndexable'
    __select__ = is_instance('MyEntityClass')

    def fti_containers(self, _done=None):
        """this should yield any entity that must be considered to
        fulltext-index self.entity

        CubicWeb's default implementation will look for yams'
        ``fulltext_container`` property.
        """
        yield self.entity
        yield self.entity.some_related_entity

    def get_words(self):
        # implement any logic here
        # see http://www.postgresql.org/docs/9.1/static/textsearch-controls.html
        # for the actual meaning of 'C'
        return {'C': ['any', 'word', 'I', 'want']}
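As an illustration of the kind of logic a custom get_words() might contain, the standalone sketch below gives each attribute its own weight. PostgreSQL weight labels range from 'A' to 'D', with 'A' ranked highest by default. The ATTRIBUTE_WEIGHTS mapping and the words_by_weight() helper are hypothetical and not part of CubicWeb; only the returned shape (weight -> list of words) follows the example above.

```python
import re

# Hypothetical mapping from attribute name to PostgreSQL weight label.
ATTRIBUTE_WEIGHTS = {"title": "A", "description": "C"}


def words_by_weight(values, weights=ATTRIBUTE_WEIGHTS, default="C"):
    """Build a {weight: [words]} dictionary from {attribute: text}."""
    result = {}
    for attr, text in values.items():
        weight = weights.get(attr, default)
        tokens = re.findall(r"\w+", text.lower())
        result.setdefault(weight, []).extend(tokens)
    return result


words = words_by_weight({
    "title": "Release notes",
    "description": "Fixes for the parser",
})
print(words)
```

A get_words() implementation could return such a dictionary directly, so that matches in a title rank above matches in a description.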