Best Practice Guide: Data Enrichment

Data enrichment is one of the key processes by which you can add more value to your data. It refines, improves, and enhances your dataset by adding new attributes. For example, using an address postcode/ZIP field, you can take simple address data and enrich it by adding socio-economic and demographic data, such as average income, household size, and population attributes. With data enrichment you can gain a better understanding of your customer base and potential target customers.

Enrichment Techniques

There are 6 common tasks involved in data enrichment:

1. Appending Data

By appending data to your dataset, you amalgamate multiple data sources to create a more comprehensive, precise, and cohesive dataset compared to any single source alone. For instance, integrating customer data from CRM, Financial Systems, and Marketing systems provides a more holistic understanding of your customers than relying solely on one system.

Data appending, as a method of enrichment, also involves incorporating third-party data, such as demographic or geographic data by postcode/ZIP, into your dataset. Additional examples include exchange rates, weather data, date/time hierarchies, and traffic information. Enriching location data is particularly common, given its widespread availability across most countries.
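As a minimal sketch of appending third-party data, the example below joins a small CRM extract to a demographic lookup table by postcode using pandas; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical CRM extract: one row per customer.
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["A. Chen", "B. Singh", "C. Lopez"],
    "postcode": ["3000", "2010", "3000"],
})

# Hypothetical third-party demographic data keyed by postcode.
demographics = pd.DataFrame({
    "postcode": ["2010", "3000"],
    "avg_income": [72000, 65000],
    "avg_household_size": [2.1, 2.6],
})

# Left join keeps every customer and appends the demographic attributes
# wherever a matching postcode exists.
enriched = crm.merge(demographics, on="postcode", how="left")
print(enriched)
```

The same pattern applies when appending data from internal systems: the join key might be a customer ID shared between CRM, financial, and marketing systems rather than a postcode.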

2. Data Segmentation

Data segmentation involves dividing the records for a data object (such as customers, products, or locations) into groups based on predetermined variables (for customers, variables such as age, gender, or income). Segmentation helps you categorize and describe the entity more effectively.

Common examples of customer segmentation include:

  • Demographic Segmentation: Based on gender, age, occupation, marital status, income, etc.
  • Geographic Segmentation: Based on country, state, or city of residence, with local businesses potentially segmenting by specific towns or counties.
  • Technographic Segmentation: Based on preferred technologies, software, and mobile devices.
  • Psychographic Segmentation: Based on personal attitudes, values, interests, or personality traits.
  • Behavioral Segmentation: Based on actions or inactions, spending/consumption habits, feature use, session frequency, browsing history, average order value, etc.

These segments can result in groups of customers such as Trend Setters or Tree Changers.

You can create your own segmentation by generating calculated fields in either an ETL process or within a metadata layer, leveraging the available data attributes.
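As an illustration of such a rule-based calculated field, the sketch below assigns customers to segments derived from age and income; the thresholds and segment names are assumptions for illustration, not a standard.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [24, 37, 52, 68],
    "income": [38000, 85000, 61000, 42000],
})

def segment(row):
    # Hypothetical rules; in practice these come from the business definition.
    if row["age"] < 30 and row["income"] >= 60000:
        return "Young Professional"
    if row["age"] >= 60:
        return "Retiree"
    if row["income"] >= 80000:
        return "High Income"
    return "Mainstream"

# The same logic could live in an ETL step or a metadata-layer calculation.
customers["segment"] = customers.apply(segment, axis=1)
print(customers)
```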

3. Derived Attributes

Derived attributes are fields not initially stored in the original dataset but can be computed from one or more existing fields. For instance, ‘Age’ is commonly derived from a ‘date of birth’ field. These attributes are valuable as they often encapsulate frequently used analytical logic. Creating them within an ETL process or at the metadata layer streamlines the creation of new analyses and ensures consistency and accuracy in utilized measures.

Examples of derived attributes include:

  • Counter Field: Based on a unique ID within the dataset, facilitating easy aggregations.
  • Date Time Conversions: Extracting day of the week, month of the year, quarter, etc., from a date field.
  • Time Between: Calculating elapsed periods, like response times for tickets, using two date-time fields.
  • Dimensional Counts: Counting values within a field to generate new counters for specific areas, such as counts of narcotic offenses, weapons offenses, petty crimes, enabling simpler comparative analysis at the report level.
  • Higher-order Classifications: Deriving product categories from product names, age bands from ages.

Advanced derived attributes can result from data science models applied to the dataset, such as determining customer churn risk or propensity to spend.
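A minimal pandas sketch of the first few examples above (a counter field, date/time conversions, and a "time between" calculation); the field names and values are hypothetical.

```python
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": ["T-1", "T-2", "T-3"],
    "opened_at": pd.to_datetime(["2024-03-01 09:00", "2024-03-02 14:30", "2024-03-05 11:15"]),
    "closed_at": pd.to_datetime(["2024-03-01 17:00", "2024-03-04 10:00", "2024-03-05 18:45"]),
})

# Counter field: a constant 1 per row lets SUM() behave like COUNT() in reports.
tickets["ticket_count"] = 1

# Date/time conversions: day of week and quarter extracted from the opened date.
tickets["opened_day_of_week"] = tickets["opened_at"].dt.day_name()
tickets["opened_quarter"] = tickets["opened_at"].dt.quarter

# Time between: elapsed response time in hours between two date-time fields.
tickets["hours_to_close"] = (
    (tickets["closed_at"] - tickets["opened_at"]).dt.total_seconds() / 3600
)
print(tickets)
```

Defining these fields once in the ETL process or metadata layer means every report reuses the same logic, which is what keeps the measures consistent.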

4. Data Imputation

Data imputation involves substituting values for missing or inconsistent data within fields.

Instead of considering the missing value as zero, which could distort aggregations, the estimated value aids in achieving a more precise analysis of the data.

For instance, if the value for an order is missing, it can be estimated based on past orders from that customer or for that particular bundle of goods.
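A minimal sketch of that order-value example, assuming a pandas DataFrame with hypothetical columns: each missing value is imputed with the mean of that customer's past orders rather than being treated as zero.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_value": [120.0, 80.0, np.nan, 45.0, np.nan],
})

# Estimate each missing value from that customer's other orders; the mean
# ignores missing values, so only known order values feed the estimate.
customer_mean = orders.groupby("customer_id")["order_value"].transform("mean")
orders["order_value_imputed"] = orders["order_value"].fillna(customer_mean)
print(orders)
```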

5. Entity Extraction

Entity extraction involves extracting structured data from unstructured or semi-structured data sources.

Through entity extraction, you can identify entities such as people, places, organizations, concepts, numerical expressions (currency amounts, percentages, phone numbers), and temporal expressions (dates, times, durations, frequencies).

For instance, by parsing data, you can extract a person’s name from an email address or determine the organization’s web domain to which they belong. Additionally, you can break down names, addresses, and other data elements into discrete components. For example, transforming an envelope-style address into separate data elements such as building name, unit, house number, street, postal code, city, state/province, and country.
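As a simple illustration of parsing structured elements out of semi-structured text, the sketch below splits a likely person name and an organization domain out of an email address; the "first.last" naming convention is an assumption, and real entity extraction typically relies on NLP models or dedicated address-parsing libraries.

```python
import re

def extract_from_email(email: str) -> dict:
    """Split an email address into a guessed person name and an organization domain."""
    local, _, domain = email.partition("@")
    # Heuristic: assume the local part follows a 'first.last' style convention.
    name_parts = re.split(r"[._-]+", local)
    name = " ".join(part.capitalize() for part in name_parts if part)
    return {"name": name, "domain": domain}

print(extract_from_email("jane.doe@example.com"))
# {'name': 'Jane Doe', 'domain': 'example.com'}
```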

6. Data Categorization

Data categorization entails labeling unstructured data to render it structured and amenable to analysis. This process comprises two main categories:

  • Sentiment Analysis: This involves extracting feelings and emotions from text. For instance, discerning whether customer feedback expresses frustration, delight, positivity, or neutrality.
  • Topic Modeling: This aims to determine the primary subject or theme of the text. It involves identifying whether the text discusses politics, sports, house prices, or other topics.

Both techniques enable the analysis of unstructured text, providing a deeper understanding of the underlying data.
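As a toy sketch only, the example below labels feedback text with a tiny hand-picked keyword lexicon; production sentiment analysis and topic modeling would use trained NLP models rather than word lists like these.

```python
# Hypothetical lexicon, for illustration only.
POSITIVE = {"great", "love", "delighted", "excellent"}
NEGATIVE = {"frustrated", "broken", "slow", "terrible"}

def label_sentiment(text: str) -> str:
    """Assign a coarse sentiment label based on keyword matches."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

feedback = [
    "Love the new dashboard, excellent work",
    "The export is slow and the filters are broken",
    "Invoice received",
]
for text in feedback:
    print(label_sentiment(text), "-", text)
```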

Data Enrichment Best Practices

Data enrichment is seldom a one-time endeavor. In an analytics ecosystem where fresh data continuously flows into your system, enrichment steps need to recur. Several best practices should be followed to ensure the desired outcomes are achieved and data quality remains high. These practices include:

Reproducibility and Consistency

Every data enrichment task must be reproducible, consistently yielding the expected results with each execution. Processes should be rule-based, so they can be run repeatedly with confidence in consistent outcomes.

Clear Evaluation Criterion

Each data enrichment task requires a transparent evaluation criterion to assess its success. Post-execution, comparison against previous results confirms expected outcomes.

Scalability

Data enrichment tasks should exhibit scalability in resource allocation, timeliness, and cost-effectiveness. Processes must accommodate the growth of data over time, supporting scalability through automation and adaptable infrastructure.

Completeness

Data enrichment tasks must ensure completeness, addressing all potential scenarios, including cases where results are ‘unknown.’ Anticipating all possible outcomes guarantees valid results as new data is incorporated into the system.
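A small sketch of the idea, using a hypothetical income-band rule: every input, including missing data, maps to an explicit label such as 'Unknown', so results remain valid as new data arrives.

```python
def income_band(income) -> str:
    # Hypothetical thresholds; the key point is the explicit fallback label.
    if income is None:
        return "Unknown"
    if income < 50000:
        return "Low"
    if income < 100000:
        return "Medium"
    return "High"

for value in [38000, 85000, 120000, None]:
    print(value, "->", income_band(value))
```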

Generality

Data enrichment processes should be universally applicable across different datasets. Ideally, created processes should be transferable, allowing for the reuse of logic across multiple tasks. This ensures consistent outcomes and upholds business rules across various subject domains.