Commons:WMF support for Commons/Commons Impact Metrics/GLAM Prototype Data Model and Implementation Plan

This page is a work in progress, not an article or policy, and may be incomplete and/or unreliable.
Please offer suggestions on the talk page.

Overview

This document describes the data model for the GLAM metrics prototype to be presented at the GLAM conference 2023 and to be used as a base for the final implementation of the GLAM metrics. It also explains the decisions made when designing the datasets, shows how to use the proposed datasets to extract the currently existing GLAM metrics, and gives implementation suggestions for the calculations to generate those datasets.

None of the choices made here are final; the structure and naming of the datasets and their fields are tentative and open for discussion. So please add any comments you see fit!

Assumptions

  • We are focusing on providing easy access to accurate metrics that cover the existing use cases in the community's GLAM tools, like BaGLAMa2 or GLAM Wiki Dashboard.
  • Each GLAM institution has a primary category that contains all their subcategories and media files.
  • We have access to the list of GLAM primary categories, either via an allow-list or a similar method.
  • There are about 1000 GLAM primary categories, and each of those has on average about 1000 media files. This estimate was collected as a proxy from BaGLAMa2 and GLAM Wiki Dashboard.
  • The main release channels for these datasets are going to be dumps and AQS.

Datasets

This section describes 5 datasets that are intended to cover most (if not all) of the existing GLAM tool use cases. The proposed datasets are organized into two types: snapshot-based and incremental.

Snapshot-based datasets

These datasets describe the current state of things at the time the snapshot is taken. For instance, a snapshot of the state of all GLAM Commons categories and their properties at 2021-03. Or a snapshot of the state of all GLAM media files and their properties at 2023-10.

GLAM Commons categories

This dataset gives information about the GLAM Commons categories at the time the snapshot is taken. It includes both primary categories and all their subcategories, recursively down the tree.
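
For illustration only (none of the field names are final), a query against this dataset could look like the sketch below. It assumes the fields listed for the intermediate category dataset in the implementation plan (name, primary_categories, parent_categories, subcategories) plus a snapshot field:

select
    name,
    size(subcategories) as direct_subcategories
from glam_commons_categories
where
    array_contains(primary_categories, 'My Category') and
    snapshot = '2023-10'
order by name asc
limit 3;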

Data size guess

Depending on how many subcategories a tree has, the size of this dataset will vary. Let's assume there are 50 subcategories in an average tree (about 20 images per category), and that each category row is about 500 bytes. The rough uncompressed size of this dataset's snapshot should then be 1000 primary categories * 50 categories per tree * 500 B = 25 MB.

GLAM Commons media files

This dataset gives information about the GLAM Commons media files at the time the snapshot is taken. It includes category information so that the user can filter by category.

Data size guess

Since we assume there are about 1000 primary categories and 1000 media files per primary category, we expect there to be 1M media files in each snapshot. If each media file row takes 300 bytes, the size of the snapshot should be around 300 MB uncompressed.

Design considerations of snapshot-based datasets

  • Use category and media file names as unique identifiers instead of IDs. This seemed simpler and more useful when reading query results.
  • You can not aggregate metrics for these datasets. The user must always filter or group by category / media file, and must always specify a snapshot. Otherwise, the results would contain duplicate counts.
  • Does not include URLs for categories or media files. They can be easily rebuilt from category or media file names.
  • Does not include event-based metrics like media_files_viewed. These are subject to an interval of time and are not suitable for a snapshot-based dataset. To extract those, the user will have to query the incremental (event-based) datasets.
  • Release frequency should be monthly, and thus the snapshots should be monthly (YYYY-MM). The source data for these datasets (MediaWiki tables) is imported monthly and is also snapshot-based.
  • Hive should be good enough as the underlying tech for these tables. I don't think we need Iceberg here, since monthly snapshots are handled well by Hive. But Iceberg is the default tech now, and we should use it anyway (a rough DDL sketch follows this list).
  • These datasets do not allow reruns or back-filling.
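
To make the points above a bit more concrete, here is a minimal sketch of what the glam_commons_media_files table definition could look like in Spark SQL. It assumes the fields listed for the intermediate media file dataset in the implementation plan, plus a snapshot field; the schema, types and partitioning are all tentative:

create table glam_commons_media_files (
    name                string,
    media_type          string,
    primary_categories  array<string>,
    categories          array<string>,
    snapshot            string          -- snapshot month (YYYY-MM); must always be specified when querying
)
using iceberg
partitioned by (snapshot);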

Incremental datasets

These datasets contain aggregated metrics about GLAM-related events that are tied to a timeline. For instance, views of articles containing media files belonging to a GLAM Commons category. Or media file uploads made to a Commons category by a given contributor.

GLAM article views by category

This dataset gives information about the number of views GLAM articles receive across time, broken down by GLAM category. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.

Data size guess

This one is more difficult to model… Assuming we have 1M media files, let's say 900K are used. Let's also assume the number of GLAM articles equals the number of GLAM media files in use. Depending on the granularity of the data, a bigger or smaller share of the articles will receive at least 1 view (which makes them appear in the dataset); let's imagine 50% of the articles are viewed. We still need to multiply by the number of subcategories in a category tree, which we said was about 50 on average. However, the two dimensions (article and category) are not orthogonal, so their product would be reduced by some factor; let's assume it's 0.25. Finally, we multiply by the number of months in a year. Assuming each record takes 250 B, then: 900K * 0.5 * 50 * 0.25 * 12 * 250 B ≈ 17 GB/year uncompressed. This calculation is crazy and has too many assumptions and probably many mistakes on my part, so it could be very far from reality.

GLAM article views by media file

This dataset also gives information about the number of views GLAM articles receive across time. However, it breaks them down by the GLAM media file. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.

Data size guess

This one is similar to the previous one in terms of difficulty and fuzziness… We can reuse the previous data size approximation (GLAM article views by category) to calculate this one. There are two differences: 1) this dataset contains two extra array-of-string fields, and 2) here we have to multiply the cardinality of the media_file dimension by that of the article dimension (instead of category by article). Although the cardinality of the media_file dimension is much greater than that of the category dimension, the two dimensions in question are very aligned: many media files appear in articles with the same name. Still, I imagine this dataset will be a bit bigger than the previous one. Let's say it's double the size: ~34 GB/year uncompressed.

GLAM Commons edits

This dataset contains a record for each edit (creation, update, deletion) to a Commons media file. It also records the user who made the edit and the categories the media file belongs to.
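
For illustration, a query against this dataset (counting edits per user for a given category and month) could look like the sketch below; the field names used here (user_name, categories, month) are assumptions, not a final schema:

select
    user_name,
    count(*) as edits
from glam_commons_edits
where
    array_contains(categories, 'My Category') and
    month = '2021-01'
group by user_name
order by edits desc
limit 3;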

Data size guess

For this dataset we can assume that the number of edits is going to be close to the number of media files, a little over that considering updates and deletions, say +30%. Let's assume each record takes about 300 B. With 1M media files, that gives roughly 1,000,000 * 1.3 * 300 B ≈ 400 MB uncompressed.

Design considerations of incremental datasets

  • The two article views datasets have the potential to become medium-big in cluster terms. But as downloadable data via dumps, they could become quite big if we choose to release them at a daily granularity. Thus, I believe monthly granularity is better (at least to start), even if it doesn't show possible daily trends in the data.
  • We can store the edits dataset without aggregation, since I believe it will be small enough.
  • In these datasets you can aggregate across most dimensions without breaking the data (i.e. without counting duplicates). The only dimensions you can not aggregate across are `category` in “GLAM article views by category” and `media_file` in “GLAM article views by media file”.
  • Data size guesses are wild for article views datasets, take them with a bucket of salt.
  • Again, use category names, media file names and user names as unique identifiers. It’s simpler and more direct to chart.
  • Release frequency: Since the snapshot-based datasets are monthly, and these (incremental) datasets probably should have monthly granularity for size reasons, I guess the best would be to release these monthly as well.
  • Tech: In this case, I think Iceberg would be a better option. We could partition the article views datasets by both category / media file and time, since we always want to either filter or break down by category / media file. Also, because they are incremental, we can take advantage of Iceberg (a rough DDL sketch follows this list).
  • These datasets allow reruns or back-filling.
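
As a rough sketch of the Iceberg option for glam_article_views_by_category (field names mostly taken from the usage examples below; the exact fields, granularity and partition/bucket spec are all open):

create table glam_article_views_by_category (
    category            string,
    primary_categories  array<string>,
    wiki                string,
    article             string,
    month               string,         -- monthly granularity, e.g. '2021-01'
    views               bigint
)
using iceberg
partitioned by (month, bucket(64, category));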

General design considerations

Reruns and back-fills

We cannot rerun or back-fill snapshot-based datasets. We can do both with incremental datasets, but subject to the drift explained below.

Category / media file drift

We will base incremental metrics on snapshot categories and media files, so there can be a small drift from month to month (a category might change its name in the middle of a month, so we'll see it as 2 different categories: one for the first part of the month and another for the second). The drift can be bigger for reruns or back-fills.

Array fields

In some cases the array[string] fields are not strictly needed to fulfill the existing GLAM tool use cases. We might want to keep them because they are interesting and allow for extra querying options, but at the same time they add size to the data.

Category vs. category tree

I tried to store metrics for both primary GLAM categories and any subcategory within their category trees. I'm not sure how much that will be used by the GLAM people. I understand there's value in having subcategory metrics, but at the same time reporting on them makes the datasets considerably larger.

Usage examples

Total views per primary category

Get total views and time-since-tracked for all primary categories, like in the baglama2 main view.

select
    category,
    min(date),
    sum(views)
from glam_article_views_by_category
where array_contains(primary_categories, category)
group by category
order by category asc
limit 3;
category                      min(date)   sum(views)
2021 Africa Cup of Nations    2021-05-23    45689457
Aeroseum                      2013-08-11     5698677
American Currency Collection  2005-10-01      668799
  • This query would read all the data in glam_article_views_by_category. That seems OK for the cluster, but tools outside the cluster might have problems, although indexes (on is_primary?) might solve them. In any case, we could load this data into Druid and serve it via AQS or Superset querying the public Druid instance, I think.
  • To determine whether a category is primary, the query uses "array_contains(primary_categories, category)". We could maybe add an "is_primary" field instead? This would allow for indexes (see the sketch below).
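
For illustration, with a hypothetical is_primary field the query above could filter (and be partitioned / indexed) directly on it:

select
    category,
    min(date),
    sum(views)
from glam_article_views_by_category
where is_primary
group by category
order by category asc
limit 3;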

Monthly views for a given category

Get monthly views for a given category, like in the baglama2 category view or the glamwikidashboard page-views view (monthly instead of daily).

select
    month,
    sum(views)
from glam_article_views_by_category
where category = 'My Category'
group by month
order by month asc
limit 3;

month    sum(views)
2021-01    47536745
2021-02    87867334
2021-03    24565677
  • This query would benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.

Articles viewed and article views

Get the number of articles viewed and the number of article views, broken down by wiki, for a given category and month, like in the baglama2 category view.

select
    wiki,
    count(distinct article),
    sum(views)
from glam_article_views_by_category
where
    category = 'My Category' and
    date between '2021-01-01' and '2021-01-31'
group by wiki
order by wiki asc
limit 3;
wiki              count(distinct article)  sum(views)
en.wikipedia.org                       13       33475
he.wikipedia.org                        2         655
hi.wikipedia.org                        1         535
  • This query would also benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.

Breakdown by wiki and category

Get the article views for a given category, wiki and month, broken down by article, like in the baglama2 wiki view.

select
    article,
    views
from glam_article_views_by_category
where
    category = 'My Category' and
    wiki = 'my.wiki.org' and
    month = '2021-01'
order by article asc
limit 3;
article                views
Minimum support price  45456
Sonia Mann              4564
Balbir Singh Rajewal     456
  • This query would also benefit from the Iceberg table being partitioned by category and time. If queried outside the cluster, indexes on the category and time fields would help.

List category media files

Get a list of media files and their media types, belonging to a given category, like in glamorgan main view.

select
    name,
    media_type
from glam_commons_media_files
where
    array_contains(categories, 'My Category') and
    snapshot = '2021-01'
limit 3;
name                                                  media_type
Saab AJ-37 Viggen 37094 57 (7609646264).jpg           image
Inside the Aeroseum entrance tunnel (7490290326).jpg  image
Aeroseum in Gothenburg, Sweden.jpg                    image
  • This query is fine to execute from anywhere and doesn’t require any special partitions/indexes, since the data is small.

Article views by category media files

Get the total article views for a given category, broken down by media file, like in glamorgan main view.

select
    media_file,
    sum(views)
from glam_article_views_by_media_file
where
    array_contains(categories, 'My Category')
group by
    media_file
order by sum(views) desc
limit 3;
media_file                                            sum(views)
Saab AJ-37 Viggen 37094 57 (7609646264).jpg                96786
Inside the Aeroseum entrance tunnel (7490290326).jpg       42879
Aeroseum in Gothenburg, Sweden.jpg                          5467
  • This query would read all the data in the glam_article_views_by_media_file dataset. That seems OK for the cluster, but tools outside the cluster might have problems, although indexes might solve them. In any case, we could load this data into Druid and serve it via AQS or Superset querying the public Druid instance, I think.
  • If we assume that there's only 1 primary category for each media file, we could have a `primary_category` field instead of a `primary_categories` field, and we'd be able to partition/index on that (see the sketch below).
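
For illustration, with a hypothetical single-valued primary_category field the query above would become:

select
    media_file,
    sum(views)
from glam_article_views_by_media_file
where primary_category = 'My Category'
group by media_file
order by sum(views) desc
limit 3;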

Prototype implementation suggestions

This is a draft implementation plan for the GLAM Commons Impact Metrics datasets, specifically for the prototype. The steps are listed in the order in which they would need to be executed.

  1. Collect/generate a list of GLAM primary categories. In production this could be an allow-list file in some repo or, as Desiree suggested, structured data in Commons.
  2. Extract GLAM secondary categories (recursive subcategories of the primary categories), and create an intermediate category dataset (including the primary categories). In production, we'd probably need a Spark fixed-point algorithm to extract those (using the category table plus the categorylinks table); a rough sketch of one iteration is included after this list. We'll have to take into account possible duplicates and even loops in the category “tree”. In the same job, I guess we could record the parent_categories and subcategories of each category in question.
    For the prototype, we can choose a couple of convenient primary category trees and manually extract all secondary categories.
    FIELD               TYPE
    name                string
    primary_categories  array[string]
    parent_categories   array[string]
    subcategories       array[string]
  3. List all media files contained in GLAM categories (primary or secondary) and build an intermediate media file dataset with them. For this, a SparkSql query is probably enough, using the refined category information above (2), plus the image table and the imagelinks table. In the process we could include the media type and store the primary and direct categories of each media file.
    FIELD               TYPE
    name                string
    media_type          string
    primary_categories  array[string]
    categories          array[string]
  4. Calculate glam_commons_categories and glam_commons_media_files. We need the data above, (2) and (3), plus imagelinks. Probably a set of SparkSql queries is enough? They don't seem super complicated, but there are several metrics, so there is some tinkering to be done…
    Output: the tables as described in the data model.
  5. Calculate glam_article_views_by_category and glam_article_views_by_media_file. A SparkSql query that uses the data above (4) and pageviews_hourly (or mediarequests?). We probably need several queries, one for category and one for category_tree. In my head, this needs some joining and aggregating, but it doesn't seem impossible.
    Output: the datasets as described in the data model.
  6. Calculate glam_commons_edits. A SparkSql query that uses the data above (4) plus an edit-based dataset like the revision table or MediaWiki History. This one is probably the easiest?
    Output: the dataset as described in the data model.
  7. Populate the spreadsheet with the data samples, and add queries and charts to it. Apparently, Google Sheets lets you use SQL-like syntax to query data. There could be 1 sheet per initial dataset, and then more sheets as needed for the queries and charts?
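
Regarding step 2, here is a rough Spark SQL sketch of one iteration of the fixed-point extraction of subcategories. The table and field names are assumptions: glam_categories_so_far would be a working table seeded with the primary categories from step 1, and the page table is used here to resolve child category titles. The iteration would be repeated until it finds no new categories.

-- One iteration: find direct subcategories of the GLAM categories known so far.
create or replace temporary view glam_categories_next_level as
select distinct
    p.page_title as name
from categorylinks cl
join page p
    on p.page_id = cl.cl_from
   and p.page_namespace = 14                  -- the linking page is itself a category
join glam_categories_so_far g
    on cl.cl_to = g.name                       -- its parent is an already-known GLAM category
where
    cl.cl_type = 'subcat' and
    p.page_title not in (select name from glam_categories_so_far);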