Logical Knowledge Mannequin: Logical Normalization

Date:


It is a first a part of a miniseries underlining the advantages of organising a correct logical knowledge mannequin (LDM), often known as a semantic knowledge mannequin. Every weblog put up will present a selected characteristic of the GoodData LDM, which permits customers to create metrics which can be each advanced and reusable. Moreover, every characteristic gives additional details about your knowledge to the system and can assist you obtain your general aim.

The GoodData LDM options coated on this miniseries are logical normalization, which is highlighted beneath; attribute labels; and shared datasets. (For individuals who are usually not but aware of the fundamentals of an LDM, learn this introduction.)

All through the sequence, we’ll illustrate the performance by way of the next process: to create a reusable metric by computing the ratio of order quantity on the given report dimensionality to the overall order quantity for every Area. (This metric is identical one as in my earlier article about Multidimensional Analytical Question Language [MAQL]: A Information to MAQL Metrics Reusability.)

Observe: If you wish to observe alongside by constructing your individual LDM, please learn Get GoodData.CN and Create a Workspace within the documentation. For those who created a workspace and GoodData.CN is working, then execute the next command:

curl http://localhost:3000/api/entities/dataSources 
 -H "Content material-Kind: software/vnd.gooddata.api+json" 
 -H "Settle for: software/vnd.gooddata.api+json" 
 -H "Authorization: Bearer YWRtaW46Ym9vdHN0cmFwOmFkbWluMTIz" 
 -X POST 
 -d '{
     "knowledge": {
         "attributes": {
             "identify": "demo-denormalized",
             "url": "jdbc:postgresql://localhost:5432/demo",
             "schema": "demo_denormalized",
             "kind": "POSTGRESQL",
             "username": "demouser",
             "password": "demopass"
         },
         "id": "demo-denormalized",
         "kind": "dataSource"
     }
 }' | jq .

It connects the demo-denormalized knowledge supply, and it’s best to use that knowledge supply in your workspace. (In our documentation library, please learn the part on the best way to construct a logical knowledge mannequin; this describes the best way to create an LDM from the related knowledge supply.)

Afterward, it’s best to find yourself with the LDM pictured beneath. There are two reality datasets (Order strains and Marketing campaign channels) and a single Date dataset. We are going to use solely the Order strains on this first half.

LDM with two fact datasets (Order lines and Campaign channels) and a single Date dataset

Relations within the LDM

One essential property of the datasets is an implied 1:N relation between the first key (e.g., Order line id) and its attributes (e.g., Buyer id, Buyer identify, and so forth.). The system doesn’t know something extra, and by default, it assumes that the relations between attributes themselves are all M:N. In different phrases, it assumes that there are not any useful dependencies between attributes, and the dataset is within the third regular kind. In actuality, this may not be true (e.g., every State belongs to precisely one Area).

You’ll be able to verify it on the info pattern:

Data sample

The implications of this “relation” will probably be revealed later. (Observe that the info pattern is incomplete, and it’s included only for demonstration functions.)

Metric Definition

Let’s create the reusable metric Order Quantity Ratio to Area in MAQL. As described in our information to MAQL metrics reusability, set the metric format to % (rounded).

SELECT (SELECT SUM({reality/worth})) / 
(SELECT SUM({reality/worth}) BY {label/area}, ALL {label/state})

If we add our metric to the report with the Area and State dimension, we get this:

The created metric displays the same reusability capabilities as the metric noted here, even though it’s defined over a more straightforward LDM.

Observe that the created metric shows the identical reusability capabilities because the metric included right here, though it’s outlined over a extra easy LDM.

Logical Normalization

As acknowledged above, our LDM has an implicit (non-expressed) relation between Area and State. Nonetheless, we had been nonetheless capable of create a helpful and reusable metric computing ratio between States and Areas. There are two issues right here, although:

  1. We all the time have to incorporate Area in our report though it is pointless from the semantics perspective.
  2. Extra importantly, we don’t have the express 1:N relation (between Area and State) specified within the LDM, and we should simulate it within the metric utilizing the BY {label/area}, ALL {label/state}) assemble. It means additional psychological effort for the metric creators. Additionally, it in the end results in logic duplication as a result of it’s important to simulate the relation in every metric the place the property must be utilized.

We might ultimately prefer to have a brand new metric that might treatment our points — in different phrases, it could current anticipated outcomes when the report has only one attribute State, though it’s now not hardcoded within the metric.

To try this, we’ve got to encode the State-Area relation into the LDM. MAQL is wise sufficient to acknowledge the state of affairs when attributes from report and attributes from BY are from the identical hierarchy. MAQL replaces the report attributes with the BY attributes — it computes each components of the metric on completely different dimensionalities and routinely joins them collectively.

At present, the one strategy to specify attributes hierarchy within the GoodData LDM is to normalize the bodily database. (Sooner or later, we plan to enhance the state of affairs by way of so-called logical hierarchies.)

We are going to extract the State-Area attributes/columns to a separate dataset/desk and join it to Order Strains by way of State international key. Right here’s a easy transformation SQL script performing the denormalization of the supply knowledge:

CREATE TABLE states AS (SELECT DISTINCT state, area FROM order_lines);
ALTER TABLE states ADD CONSTRAINT pk_states PRIMARY KEY ("state");
ALTER TABLE order_lines ADD CONSTRAINT fk_order_lines_state FOREIGN KEY ("state") REFERENCES "states" ("state");
ALTER TABLE order_lines DROP COLUMN area;

To execute it, log in to the PostgreSQL database embedded in your working GoodData.CN utilizing the next command:

psql -h localhost -p 5432 -d demo -U demouser --password

It is going to ask you for a password, and it’s the identical one that you just used through the knowledge supply registration: demopass. Earlier than you execute the SQL script, set the search path to the proper schema demo_denormalized:

SET search_path TO demo_denormalized;

Now you can simply copy and paste the SQL script after which go to the tab Knowledge, the place it’s best to scan the bodily database once more. Don’t forget to pick out Change the prevailing mannequin:

Scan the data source

You must find yourself with the next mannequin:

Data model after scanning data source

Let’s now create the brand new metric Order Quantity Ratio to Area 2:

SELECT (SELECT SUM({reality/worth})) / 
(SELECT SUM({reality/worth}) BY {label/area})

We are able to now add the created metric to report and take away the Area. You see that it really works nice with simply State whereas additionally conserving the reusability elements. (Strive placing one other attribute to the report, reminiscent of Product, Date, and so forth.)

Report with added created metric and with region removed

Sadly, it doesn’t work appropriately if we need to evaluate Clients (as a substitute of States) order quantity to their Areas, though Area is functionally depending on the Buyer (by way of State).

It does not work correctly if we want to correct Customers (instead of States) order amount to their Regions.

We already know the best way to repair that, although: explicitly specific the Buyer-State-Area hierarchy by altering the bodily database mannequin.

CREATE TABLE clients AS (SELECT DISTINCT customer_id, customer_name, state FROM order_lines);
ALTER TABLE clients ADD CONSTRAINT pk_customers PRIMARY KEY ("customer_id");
ALTER TABLE order_lines DROP CONSTRAINT fk_order_lines_state;
ALTER TABLE order_lines ADD CONSTRAINT fk_order_lines_customer_id FOREIGN KEY ("customer_id") REFERENCES "clients" ("customer_id");
ALTER TABLE clients ADD CONSTRAINT fk_customers_state FOREIGN KEY ("state") REFERENCES "states" ("state");
ALTER TABLE order_lines DROP COLUMN customer_name;
ALTER TABLE order_lines DROP COLUMN state;

You must scan the bodily database once more. Don’t forget to pick out Change the prevailing mannequin. The outcome must be following:

The result of changing the physical database model

Now the Order Quantity Ratio to Area 2 metric lastly works as anticipated; it permits us to provide the next report solely with the Buyer id attribute.

Report only with the Customer id attribute

Additionally, be aware one essential benefit of abstracting your database tables to an LDM. Our metrics stored working even after we repeatedly modified the database schema and corresponding LDM.

Abstract

The article described the advantages of LDM methods to acquire helpful and reusable metrics. Nonetheless, our improved demo mannequin isn’t optimum but — we will’t reply all potential questions. Keep tuned for the second half.

Be taught Extra About MAQL

In the meantime, if you wish to study extra concerning the GoodData LDM and MAQL, verify our GoodData College programs. Additionally, see our neighborhood discussion board and neighborhood Slack channel for assist and additional information.

Header picture by John Schnobrich on Unsplash

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Share post:

Subscribe

spot_imgspot_img

Popular

More like this
Related

Wells Fargo CTO joins Financial institution Automation Summit US 2023

Steve Hagerman, chief expertise officer at Wells Fargo,...

A Information for 2023 Tax Season Webinar — Q&A

In our current webinar, "A Information for 2023...

Monetary literacy 101: the one information you may want to start out managing your funds like a professional

Monetary literacy 101: the one information you may...

What Is A Listicle? Write One For Content material Technique Worth

These days, listicles typically get a nasty rap...
%d bloggers like this: