ETL Functions For Narrative Data Mart Population

Originally published July 28, 2005

In my last article, A Star Schema Model for Narrative Data, I introduced the concept of the Narrative Star Schema. Such a schema can potentially support very targeted search capabilities on narrative data—documents—that in their native form would need to be read by humans … a very time-consuming and sequential process.

In this article we’ll describe the types of software products that provide the ETL (Extract, Transform and Load) functions required to populate a data mart implemented using a Narrative Star Schema. These ETL functions can be summed up as tokenizing, recognizing, categorizing and linking. Figure 1 below shows a high-level flow of these functions.


 Figure 1: ETL Flow for populating the Narrative Data Mart.

There are two major functions in populating any star schema: building dimensions and building facts that reference the dimensions. In the world of narrative data, dimensions are taxonomies; facts are links. Documents, i.e., narrative data files, are the sources for facts. They may also be sources for dimensions in cases where automated taxonomy building is applied.

The web search engines with which we are all familiar are very good at tokenizing. This is where the process begins.

Tokenizing: Search Engines

Tokenizing is a process that divides up one or more narratives into distinct strings, or tokens, of characters, which are likely to have some meaning simply because of their occurrence in narrative form. Computer programs can be very good at parsing text into tokens, given a relatively simple set of rules for recognizing contiguous characters, spaces, punctuation and “stop words”—words whose sole purpose is to connect other, more meaningful character strings together in a narrative.
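As a rough illustration of the kind of simple rules described above, here is a minimal Python sketch of a tokenizer. The stop-word list and the token pattern are assumptions chosen for the example; they are not taken from any particular search engine.

```python
import re

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}

# A token is a run of letters or digits, optionally joined by apostrophes,
# so "don't" stays in one piece while spaces and punctuation split tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def tokenize(narrative):
    """Split a narrative into tokens and drop stop words."""
    tokens = TOKEN_PATTERN.findall(narrative)
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

Calling tokenize("John walked to the hospital in Washington.") would keep "John", "walked", "hospital" and "Washington" and discard the connecting words.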

Here’s a news flash: if you think you’re searching the web when you submit a search query using Yahoo!, Google or Alta Vista, you’re wrong—what you’re actually doing is querying a data mart. The data marts underlying these tools are very large, very specialized and optimized data structures—specifically, indexes. In general, all they include are cross-references between tokens and web pages. The background processes of these search engines include ETL functions to populate these indexes.
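The token-to-page cross-reference described above is commonly called an inverted index. The following Python sketch shows the idea on a toy scale; the function names and the sample URLs are hypothetical, and production search indexes are far more elaborate.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased token to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            # Trim surrounding punctuation before indexing the token.
            index[token.strip(".,;:!?\"'()")].add(url)
    index.pop("", None)  # discard tokens that were punctuation only
    return index


def search(index, query):
    """Return the pages that contain every token in the query."""
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()


pages = {
    "http://example.com/a": "Star schema for narrative data.",
    "http://example.com/b": "Narrative data mart population",
}
index = build_index(pages)
```

A query such as search(index, "narrative data") never touches the pages themselves; it only reads the index, which is the sense in which a web search is really a data mart query.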

For those readers inclined toward experimentation, a downloadable version of an interesting text processing and extraction product, WebQL, is available from QL2 Software.

The word-tokens extracted by such a search tool, rather than just being used to build an index, can be passed on to subsequent processes that can bestow meaning on the word-tokens.

Recognizing: Linguistics

The token-recognition rules underlying search engines are quite basic compared to those of more advanced software, which can apply additional rules to the word-tokens output from a search engine. Such software products typically combine sophisticated linguistic and statistical processing with extensive proprietary reference tables, dictionaries, “lexicons” or “knowledge bases,” to enable the inference of domain-specific meaning within a set of tokens. Products enabling this functionality are an outcome of studies in the fields of computational linguistics (CL) and natural language processing (NLP).

Linguistic processing enables recognition of parts of speech, e.g., the token “John” has a high likelihood of being a proper noun; “walked,” a verb in the past tense. Linguistics programs can also infer syntactical relationships among tokens in the context of a sentence (anybody remember diagramming sentences in grade school?), recognize synonyms, and adjust meanings based on the context of an entire document.

So when this step is done, we have not only a set of words, but also an indication of how they may be related syntactically and some general idea as to what they could signify, most importantly via a name designation.

Categorizing: Taxonomies

“Taxonomy building involves two tasks: the creation of the hierarchy, and the definition of the business rules that route documents to the appropriate category.”

A taxonomy can be thought of as a vocabulary arranged in a hierarchical fashion. One way to create a taxonomy is to pay a visit to Vivisimo. There you’ll find a web search engine that returns more than just a list of links—it builds a taxonomy at run time, categorizes its result set into this taxonomy and gives the user the ability to navigate the taxonomy hierarchy.

This real-time “dimensionalization” done by Vivisimo is termed clustering. Clustering can result in a different taxonomy each time it’s done, based on the search terms and input data. It’s like Forrest Gump’s box of chocolates—you never know what you’re going to get. 

On the other hand, classification, as supported by products such as Verity Collaborative Classifier, involves the construction and maintenance of pre-defined taxonomies. Taxonomy-management products also provide functions to define and manage rules for assigning extracted entities to categories, i.e., dimension members. These rules are invoked during the ETL process.
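To make the distinction concrete, the sketch below shows what a pre-defined taxonomy and a set of assignment rules might look like in Python. The categories, keywords and function names are all hypothetical and do not reflect how any of the products mentioned here work internally.

```python
# Hypothetical pre-defined taxonomy: each category maps to its parent.
TAXONOMY = {
    "Topic": None,
    "Medication": "Topic",
    "Protocol": "Topic",
    "Party": None,
    "Patient": "Party",
    "Doctor": "Party",
}

# Hypothetical routing rules: a keyword found in an entity's context
# assigns that entity to a category (a dimension member).
RULES = [
    ("mg", "Medication"),
    ("dr.", "Doctor"),
    ("admitted", "Patient"),
]


def classify(context):
    """Return the first category whose rule keyword appears in the context."""
    words = context.lower().split()
    for keyword, category in RULES:
        if keyword in words:
            return category
    return None


def path_to_root(category):
    """Walk up the hierarchy from a category to its top-level ancestor."""
    path = []
    while category is not None:
        path.append(category)
        category = TAXONOMY[category]
    return path
```

Because the taxonomy is fixed in advance, the same input always lands in the same category, in contrast to clustering, where the hierarchy itself can change from run to run.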

As pointed out in my previous article, taxonomies and dimensions are, in general, independent implementations of essentially equivalent reference data.

“There are usually a number of internal or external resources that can assist in taxonomy design or the creation of business rules, including existing … product lists …”

A strong corporate reference data management strategy can increase the effectiveness of managing reference data—on products, for example—in the various forms that may be required across an organization. As shown in figure 1, taxonomies and dimensions can, and should, be managed together as part of this strategy.

Connecting the Dots: Linking

The next transformation step in our narrative ETL flow involves recognizing and extracting links. Detecting links—connections between and among dimension members—is necessary to construct narrative facts. Just as is the case when constructing conventional fact table rows from iterative data, all dimension members referenced by a fact must have been recognized first, in earlier steps.

Link extraction builds on the prior extraction of entity instances and their categorization into dimension members. Recognizing links between members of any two dimensions (for example, a location for a person of interest) is challenging, let alone determining and building an n-dimensional fact.
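One very simple heuristic for detecting two-dimensional links is sentence-level co-occurrence: if members of two different dimensions appear in the same sentence, treat them as linked. The Python sketch below assumes that heuristic and a hypothetical set of already-recognized dimension members; the commercial products named in this article use considerably more sophisticated techniques.

```python
import re
from itertools import combinations

# Hypothetical output of the earlier steps: entity text -> dimension.
MEMBERS = {
    "John Adams": "Party",
    "Boston": "Location",
    "Quincy": "Location",
}


def extract_links(document):
    """Return pairs of members from different dimensions that occur
    in the same sentence."""
    links = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        found = sorted(m for m in MEMBERS if m in sentence)
        for a, b in combinations(found, 2):
            if MEMBERS[a] != MEMBERS[b]:
                links.append((a, b))
    return links


doc = "John Adams was born in Quincy. He later practiced law in Boston."
links = extract_links(doc)
```

On this sample the heuristic links "John Adams" to "Quincy" but misses the link to "Boston", because it cannot tell that "He" refers to John Adams. That gap is a small illustration of why link extraction is challenging.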

Verity’s Relational Taxonomies, Megaputer’s PolyAnalyst Text OLAP, and NetOwl Extractor’s Link and Event Configuration are products that deduce links between extracted entities.

“Verity's Relational Taxonomy technology lets users browse through multiple taxonomies at the same time to quickly locate highly relevant information where the taxonomies intersect.”

One step remains in our ETL process: the assignment of confidence levels. Once that is done, the construction of our link/facts will be complete.

Confidence Levels

An important concept in the extraction of narrative facts and dimensions is that of confidence levels. Typically in the world of iterative data, each data value is explicitly related to its meaning, by way of its assignment to a specific named column or field. A distinguishing characteristic of narrative data is the absence of these specific data-metadata assignments. The transformation of narrative data to iterative data is in large part the execution of educated guesses, the outcome of which is the deduction of data-metadata assignments.

For example, in a narrative data source, there will be no “Person” table, with a “Name” column in which the value of “John Adams” is stored. In our narrative-iterative transformation, based on a given set of rules, we can declare, with some level of confidence, that “John Adams” is the “Name” of a “Person”, and make this assignment at run time. Well, exactly how confident are we that “John Adams” is the name of a person?

The value of this confidence level—effectively, metadata about this assignment—can be determined, again, based upon rules. Some rule/input combinations are known to yield more consistently accurate results. For example, in an input document, if the token “D.C.” immediately follows the token “Washington,” our confidence is increased that the combination designates the name of a Location, rather than a Party such as “Washington Irving.”
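The "Washington" example can be written as a small rule in Python. This is only a sketch: the function name is hypothetical and the numeric confidence values are made-up placeholders, not calibrated scores.

```python
def location_confidence(text, name="Washington"):
    """Confidence that `name` designates a Location rather than a Party.

    Returns None if the name does not occur in the text. The numeric
    values are illustrative placeholders only.
    """
    tokens = [t.strip(",;") for t in text.split()]
    if name not in tokens:
        return None
    i = tokens.index(name)
    following = tokens[i + 1] if i + 1 < len(tokens) else ""
    if following in ("D.C.", "D.C"):
        return 0.9  # "Washington, D.C." strongly suggests a place
    if following == "Irving":
        return 0.1  # "Washington Irving" strongly suggests a person
    return 0.5      # no supporting context either way
```

The rule returns a score rather than a yes/no answer, so downstream steps can decide how much weight to give the resulting dimension assignment.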

Confidence-level values are relative, of course, rather than absolute. Each narrative fact can and should be assigned a confidence-level value based on the confidence levels of the rules by which it was created.
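The article does not prescribe how rule confidences should be combined into a fact-level value. The Python sketch below shows two common options as assumptions: multiplying the values (which treats the rules as independent) or taking the minimum (a "weakest link" view). The function name and parameters are hypothetical.

```python
import math


def fact_confidence(rule_confidences, combine="product"):
    """Combine per-rule confidences (each in [0, 1]) into one fact score."""
    if not rule_confidences:
        raise ValueError("a fact needs at least one contributing rule")
    if combine == "product":
        return math.prod(rule_confidences)  # treats rules as independent
    if combine == "min":
        return min(rule_confidences)        # weakest-link view
    raise ValueError(f"unknown combine method: {combine}")
```

Either way the result is useful mainly for ranking facts against one another, which fits the point that confidence values are relative rather than absolute.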

Business Applications

A “narrative data mart” can be of considerable value to any enterprise in which a significant percentage of data of interest appears in narrative or “unstructured” form—that is, just about any enterprise. The following table shows some examples of potential narrative dimension members for three types of enterprises. The facts that could be potentially represented and accessed effectively by users can be visualized by juxtaposing examples of dimension members from multiple cells in any column.

“Utilizing Text OLAP exploration engine, the user of PolyAnalyst can easily define dimensions of interest to be considered in text exploration and quickly dissect the results of the analysis across various combinations of these dimensions to gain insights in the investigated issue.”

| Application / Narrative Dimension | Healthcare | Law Enforcement | Customer Service |
| --- | --- | --- | --- |
| Event | Admission, Drug Reaction, Procedure, Symptom Presentation | Incident, Arrest | Help Desk Call, Service Call, Sale |
| Location | Home, Hospital | Jurisdiction, Municipal Area | Home, Office, Retail Outlet |
| Media Artifact | News Release, Government Advisory, Subscription Document | News Release, Communication Intercept | Email, Telephone transcription |
| Party | Patient, Doctor, Pharmacist | Suspect, Official | Customer, Technician, Reseller, Competitor |
| Topic | Protocol, Medication, Dosage | Weapon, Contraband, Stolen Property | Product, Feature |
| Fact Linkage | Family Relationship, Drug Interaction | Gang Membership | Product Configuration |

Table 1: Narrative Data Mart Applications.

In the future I will present the user interface for this data mart, as well as other means for agile interaction with large quantities of narrative data.

 

  • William Lewis

    William has more than 20 years’ experience delivering data-driven solutions to business challenges across the financial services, energy, healthcare, manufacturing, software and consulting industries. Bill has gained recognition as a thought leader and leading-edge practitioner in a broad range of data management and other IT disciplines including data modeling, data integration, business intelligence, meta data management, XML and XSLT, requirements structuring, automated software development tools and IT Architecture. Lewis is a Principal Consultant at EWSolutions, a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using enterprise architecture, managed meta data environment, and data warehousing technologies. Visit http://www.ewsolutions.com/. William can be reached at wlewis@ewsolutions.com.
