The aspects-dbt from openedx

Add display names to models

Problem

The dbt models currently only have entity IDs. Superset dashboards should have the entity names, which are much more readable and would bring the dashboards in line with previous insights UX. This data currently lives in the event_sink schema in two tables: course_blocks, which has display names for entities like videos; and course_overviews, which contains the course name.

There are a couple different ways of approaching this problem and I'd love to get feedback before proceeding.

Approaches

As I see it, there are two high-level approaches:

Create new views in dbt that include display names
Create virtual datasets in Superset that build off the existing dbt models and joins in display names

The key performance difference between these two approaches is that ClickHouse will not use primary keys for right-hand side tables in joins. This means that the following query will not do any filtering of the course_blocks table despite filtering on org, which is the first part of the primary key in course_blocks:

select
    b.location as video_name,
    count(*) as num_plays
from
    xapi.video_plays vp
    join event_sink.course_blocks b
        on vp.video_id = b.location
where
    vp.org = 'edX' and b.org = 'edX'
group by 1

This could become a problem if the right-hand side table is sufficiently large, as ClickHouse defaults to reading the entire right-hand side into memory (though ClickHouse has a setting where queries could adapt to use a slower, more memory-resilient join algorithm).

The following query, however, will use the primary key in course_blocks:

with blocks as (
    select
        location,
        display_name
    from
        event_sink.course_blocks
    where org ='edX'
)
select
    b.display_name as video_name,
    count(*) as num_plays
from
    xapi.video_plays vp
    join blocks b
        on vp.video_id = b.location
where
    vp.org = 'edX'
group by 1

The advantage of Superset would be that we could use SQL templates to use a version of the second query to power dashboards. Given a dashboard filter with the name org, the templated version would look similar to this:

with blocks as (
    select
        location,
        display_name
    from
        event_sink.course_blocks
    where
        -- where_in is provided by Superset. If we want to filter on the org 'edX', where_in
        -- converts the filter value to something like this: `('edX')` 
        org in {{ filter_values('org') | where_in }}
)
select
    b.display_name as video_name,
    count(*) as num_plays
from
    xapi.video_plays vp
    join blocks b
        on vp.video_id = b.location
where
    vp.org {{ filter_values('org') | where_in }}
group by 1

Given this information, I think the dbt approach is more appealing from a developer experience perspective while sacrificing performance. The Superset approach has a performance benefit but at the cost of added complexity by defining what is essentially another dbt model but outside of dbt.

enrollments_by_day does not include the current day

I would expect there to be rows returned for the following query, but instead it is 0.

select count(*)
from reporting.enrollments_by_day
where enrollment_status_date = today()
and course_key = 'course-v1:UMONTREAL+GPT101+2023_04'
and enrollment_status = 'registered'
;

I'm able to get the expected rows by changing int_enrollment_windows window_end_date to:

dateTrunc('day', coalesce(window_end_at, today() + toIntervalDay(1))) AS window_end_date, but I'm not sure what the impacts of that change are. I think this is just a "midnight time boundary" issue?

openedx / aspects-dbt Goto Github PK

aspects-dbt's People

Contributors

Watchers

Forkers

aspects-dbt's Issues

Add display names to models

Problem

Approaches

enrollments_by_day does not include the current day

Models for enrollment reports

Models for video engagement reports

Models for problem engagement

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent