Auto-tagging for publishers: taxonomies don’t have to be taxing
This is a guest post by Tom Morris, Chief Technology Officer at publishing and content specialist Ixxus. Here Tom outlines the challenges publishers face with metadata and the solutions Ixxus’ new auto-tagging software, Taxonixx, offers.
One issue that comes up time and time again with publishers, and with other organisations that make heavy use of content, is the auto-tagging and auto-classification of content with metadata.
Maintaining accurate and consistent metadata, of course, is essential if you want to do anything actually useful with your content. This includes (but is nowhere near limited to) things like:
- Making content searchable, discoverable and surfaceable – aiding users as they seek relevant content.
- Cross-selling content or products to external users.
- Dynamic auto-assembly of content and real-time updates.
- Enabling you to navigate through linked and associated content.
- Giving you holistic insight into your content environment.
For organisations with a large volume of content assets, adding metadata and ensuring its consistency across your content environment can represent a massive manual overhead. This is time-consuming, costly to the business and heavily prone to human bias and error – meaning content may end up mistagged, or tags may be inconsistently applied.
For example, explains Renee Swank, Head of Discovery at Ixxus:
“A large publishing client we’re working with at the moment has a metadata problem with their old, legacy content. They want to investigate ways of reusing that content in new ways, in order to derive new value – but because no metadata was applied to it at the point of creation, they have no idea what they actually have! But by running that mountain of content through Taxonixx, they’ve been able to get an accurate view of what’s there, and they can now start repackaging that to create new service offerings and new revenue streams.”
A lightweight, robust auto-tagging application
To tackle this problem, Ixxus have developed a new application called Taxonixx. Built upon MarkLogic, Taxonixx utilises MarkLogic’s XML database without requiring the documents themselves to be stored, making it a lightweight, cost-effective solution.
Taxonixx works by analysing your document, matching it against selected controlled vocabularies and then identifying all of the relevant entities from your document. Entities are then automatically and consistently applied as tags to your piece of content, giving you all the benefit of metadata tagging with none of the manual effort.
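To make the idea concrete, here is a minimal sketch of matching a document against a controlled vocabulary. It is an illustration only and not Taxonixx's actual implementation: the `VOCABULARY` contents and the `tag_document` function are hypothetical names invented for this example, and a real system would do far more than count exact word matches.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> accepted surface forms.
VOCABULARY = {
    "Publishing": ["publishing", "publisher", "publishers"],
    "Metadata": ["metadata", "tagging", "tags"],
    "Taxonomy": ["taxonomy", "taxonomies", "controlled vocabulary"],
}


def tag_document(text, vocabulary):
    """Return {preferred term: occurrence count} for vocabulary terms found in text."""
    lowered = text.lower()
    tags = {}
    for term, surface_forms in vocabulary.items():
        count = 0
        for form in surface_forms:
            # Word boundaries stop "tags" from matching inside a longer word.
            pattern = r"\b" + re.escape(form) + r"\b"
            count += len(re.findall(pattern, lowered))
        if count:
            tags[term] = count
    return tags


sample = "Publishers rely on metadata. Good metadata starts with a taxonomy."
tags = tag_document(sample, VOCABULARY)
print(tags)  # {'Publishing': 1, 'Metadata': 2, 'Taxonomy': 1}
```

Because every document is matched against the same vocabulary, the same preferred term is applied each time, which is where the consistency comes from.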
Sounds simple enough?
It’s actually quite complicated, for the simple reason that language and the way we use it is quite complicated!
For example, certain entities may require disambiguation: compare heteronyms such as ‘bass’ (the fish) and ‘bass’ (the instrument), or ambiguous proper nouns such as ‘Paris’ (are you talking about the city, or the B-list celebrity?). The taxonomy also needs to be intelligent enough to recognise when a word is used alone and when it is part of a compound. For example, it should treat ‘Manchester United’ as a single entity rather than as the separate entities ‘Manchester’ and ‘United’.
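Both problems can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not a description of how Taxonixx works internally: it uses greedy longest-match for compounds and counts overlapping context words for disambiguation, and the `ENTITIES` and `SENSES` tables, along with all function names, are hypothetical.

```python
import re

# Hypothetical entity list; multi-word entries should win over their parts.
ENTITIES = {
    ("manchester", "united"): "Manchester United",
    ("manchester",): "Manchester",
}

# Hypothetical context cues for each sense of an ambiguous term.
SENSES = {
    "bass": {
        "Bass (fish)": {"fishing", "river", "caught", "lake"},
        "Bass (instrument)": {"guitar", "band", "played", "amplifier"},
    }
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def match_entities(tokens, entities):
    """Greedy longest-match: at each position, try the longest phrase first."""
    max_len = max(len(key) for key in entities)
    found, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # Not enough tokens left for a phrase this long.
            key = tuple(tokens[i:i + n])
            if key in entities:
                found.append(entities[key])
                i += n
                break
        else:
            i += 1  # No entity starts at this token.
    return found


def disambiguate(term, tokens, senses):
    """Pick the sense whose cue words overlap most with the document, if any."""
    words = set(tokens)
    scores = {sense: len(cues & words) for sense, cues in senses[term].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(match_entities(tokenize("Manchester United played in Manchester"), ENTITIES))
# ['Manchester United', 'Manchester']
print(disambiguate("bass", tokenize("He played bass guitar in a band"), SENSES))
# Bass (instrument)
```

When no cue word appears at all, `disambiguate` returns `None` rather than guessing, which is the kind of case that might be handed to a human reviewer.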
By leveraging Boolean operators, phrase or proximity matching, and disambiguation tools, Taxonixx delivers categorisation with a high level of precision. This intelligent semantic scoring also filters out irrelevant keywords, while permitting human intervention where necessary.
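As a rough illustration of how a Boolean operator and a proximity test can combine into a single rule, consider the sketch below. The rule, the five-token window and the function names are all assumptions made for this example; they are not taken from Taxonixx.

```python
import re


def tokens_of(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(tokens, first, second, max_distance):
    """True if `first` and `second` occur within max_distance tokens of each other."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(a - b) <= max_distance
        for a in first_positions
        for b in second_positions
    )


def paris_city_rule(text):
    """Hypothetical rule: 'paris' NEAR 'france' AND NOT 'hilton'."""
    tokens = tokens_of(text)
    return near(tokens, "paris", "france", 5) and "hilton" not in tokens


print(paris_city_rule("Paris is the capital of France"))  # True
print(paris_city_rule("Paris Hilton flew to France"))     # False
```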
Taxonixx also offers a number of powerful customisation tools. The user can specify weighting for specific terms or types of content, tailor scoring based on document length, and set up hot zones in documents to be “paid more attention to”.
For example, you may wish entities in the heading section of a document to be considered more important, or to disregard anything beyond a certain point (particularly useful for documents containing footnotes, copyright information, indexes or other appendices).
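A hot zone can be thought of as a multiplier on matches found in a particular part of the document. The sketch below assumes a document already split into named zones and uses a marker string as the cut-off point. The weights, zone names and `score_terms` function are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical zone weights: matches in headings count three times as much.
ZONE_WEIGHTS = {"heading": 3.0, "body": 1.0}


def score_terms(zones, terms, zone_weights, cutoff_marker=None):
    """Score terms across an ordered list of (zone_name, text) pairs.

    Each occurrence contributes the weight of the zone it appears in. If
    cutoff_marker is given, the first zone containing it and everything
    after that zone are ignored.
    """
    scores = {term: 0.0 for term in terms}
    for zone_name, text in zones:
        lowered = text.lower()
        if cutoff_marker and cutoff_marker.lower() in lowered:
            break  # Disregard footnotes, indexes and other trailing matter.
        weight = zone_weights.get(zone_name, 1.0)
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            scores[term] += weight * len(re.findall(pattern, lowered))
    return scores


document = [
    ("heading", "Metadata for publishers"),
    ("body", "Consistent metadata makes content discoverable."),
    ("body", "Footnotes"),
    ("body", "1. See the metadata glossary."),
]
scores = score_terms(document, ["metadata", "glossary"], ZONE_WEIGHTS,
                     cutoff_marker="Footnotes")
print(scores)  # {'metadata': 4.0, 'glossary': 0.0}
```

Here the heading match contributes 3.0 and the body match 1.0, while the mention of "metadata" and "glossary" after the footnotes marker is ignored.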
- Automatic Categorisation
Taxonixx uses managed taxonomies to automatically identify the most suitable term(s) to describe the content provided.
- Multiple Taxonomies
Taxonixx supports multiple polyhierarchical taxonomies, allowing the user to select and intersect vocabularies.
- Taxonomy Management
Librarians and ontologists can adjust the taxonomy directly via a web-based UI, enabling them to manage categories in the form of custom controlled vocabularies or taxonomies.
- Advanced Rule Management
The integrated UI can be used to create custom scoring policies such as repeating-phrase boosting, title weighting, proximity weighting, synonym matching, wildcard matching, hot zone weighting, disambiguation handling, etc.
- Open Standards Support
Open Data or OWL-based taxonomies representing public open standards can be imported.
- Multi-Format Coverage
The number of content formats is growing by the day, and many organisations have specific file type needs. Taxonixx can handle over 200 file types, including DOC, PDF, PPT, XML, HTML and IDML.
- Lightweight Licensing
Taxonixx utilises MarkLogic, and runs effectively on a minimum MarkLogic license. It can be installed on-premises, leveraged as part of MarkLogic’s pay-per-hour AWS Marketplace service, or provided by Ixxus as a SaaS.
Tom Morris is Chief Technology Officer at Ixxus, specialising in designing and delivering content solutions to the publishing and media industry. As CTO, Tom has overseen a large number of Ixxus projects at major global publishing companies.