openedx / aspects-dbt Goto Github PK
View Code? Open in Web Editor NEWThe dbt project for Open edX Aspects!
License: Apache License 2.0
The dbt project for Open edX Aspects!
License: Apache License 2.0
The dbt models currently only have entity IDs. Superset dashboards should have the entity names, which are much more readable and would bring the dashboards in line with previous insights UX. This data currently lives in the event_sink
schema in two tables: course_blocks
, which has display names for entities like videos; and course_overviews
, which contains the course name.
There are a couple different ways of approaching this problem and I'd love to get feedback before proceeding.
As I see it, there are two high-level approaches:
The key performance difference between these two approaches is that ClickHouse will not use primary keys for right-hand side tables in joins. This means that the following query will not do any filtering of the course_blocks
table despite filtering on org
, which is the first part of the primary key in course_blocks
:
select
b.location as video_name,
count(*) as num_plays
from
xapi.video_plays vp
join event_sink.course_blocks b
on vp.video_id = b.location
where
vp.org = 'edX' and b.org = 'edX'
group by 1
This could become a problem if the right-hand side table is sufficiently large, as ClickHouse defaults to reading the entire right-hand side into memory (though ClickHouse has a setting where queries could adapt to use a slower, more memory-resilient join algorithm).
The following query, however, will use the primary key in course_blocks
:
with blocks as (
select
location,
display_name
from
event_sink.course_blocks
where org ='edX'
)
select
b.display_name as video_name,
count(*) as num_plays
from
xapi.video_plays vp
join blocks b
on vp.video_id = b.location
where
vp.org = 'edX'
group by 1
The advantage of Superset would be that we could use SQL templates to use a version of the second query to power dashboards. Given a dashboard filter with the name org
, the templated version would look similar to this:
with blocks as (
select
location,
display_name
from
event_sink.course_blocks
where
-- where_in is provided by Superset. If we want to filter on the org 'edX', where_in
-- converts the filter value to something like this: `('edX')`
org in {{ filter_values('org') | where_in }}
)
select
b.display_name as video_name,
count(*) as num_plays
from
xapi.video_plays vp
join blocks b
on vp.video_id = b.location
where
vp.org {{ filter_values('org') | where_in }}
group by 1
Given this information, I think the dbt approach is more appealing from a developer experience perspective while sacrificing performance. The Superset approach has a performance benefit but at the cost of added complexity by defining what is essentially another dbt model but outside of dbt.
I would expect there to be rows returned for the following query, but instead it is 0.
select count(*)
from reporting.enrollments_by_day
where enrollment_status_date = today()
and course_key = 'course-v1:UMONTREAL+GPT101+2023_04'
and enrollment_status = 'registered'
;
I'm able to get the expected rows by changing int_enrollment_windows
window_end_date
to:
dateTrunc('day', coalesce(window_end_at, today() + toIntervalDay(1))) AS window_end_date,
but I'm not sure what the impacts of that change are. I think this is just a "midnight time boundary" issue?
Reports being served:
WIP PR:
#1
Reports being served by these models:
Reports pulling from these models:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.