Adventures in caching primary keys

Caching is one of the biggest contributing factors to Leaguepedia’s complexity. Because we have so many centrally-located data pages, user-facing pages constantly need to have their caches purged after edits are made, not to mention the cases where the expandtemplates API is used to fill static pages with generated code for quick lazy loading. This complexity is managed via a large number of automated processes stored as JS gadgets, which I refer to collectively as “Refresh Overview.” Editors are used to the idea that after a data page is updated, one must click the “Refresh Overview” button for that particular type of edit; the RO buttons are built into the UI in obvious places, and for the most part everyone is very good about properly updating content this way.

A benefit of the “Refresh Overview” practice is that I can get away with some relatively complex caching requirements without users knowing that these requirements exist - for example, in order to update a team’s roster, one must store a roster change in the Data namespace, then blank edit the player’s page to update the stored aggregated Tenures data, and then finally purge the team’s own page to display the updated Tenures data from the player page. The editor needs to know none of this, though; it’s all handled by the Roster Change Refresh Overview button.

If this all sounds incredibly complicated, it’s actually a lot worse: The reliance on RO to update global caching means that I have to ensure that it’s impossible to create unintended side effects when inserting an element, reordering elements, or rearranging data, because a single RefreshOverview click only does cache updates on its intended target pages (doing otherwise would be prohibitively expensive from a performance standpoint). When you consider that I’m using primary keys generated by indices within a page, and I can’t expect editors to respect existing orders within a page, this becomes somewhat daunting to avoid. Here’s the story of one such side effect and the process I undertook to correct it

(Note: All code in this post was originally written by me for Leaguepedia and is licensed under CC BY-SA 3.0.)

Motivation

Primary keys in RosterChanges and News

The table RosterChanges stores lines that consist of a player, a team, a direction (Join or Leave), and all relevant information about the player’s position on the team either before or after the change occurred; if direction is Join then the details are about the player’s status after the change, and if direction is Leave then the details are about the player’s status before the change. Despite the names Join and Leave, actually a roster change could be merely a position change or a modifier change (sub to starting, trainee, to sub, etc, without actually joining or leaving the team); in this case, the full change would be represented by two separate lines - the first one causes the player to “leave” the previous position and the second to “join” the new one immediately after.

In order to group something like the two halves of a role change together, each line contains a foreign key NewsId, which is the primary key of the NewsItems table (eventually this table will be renamed to News, but the current News table has a different structure and is in the process of being phased out). Roster changes with the same NewsId are considered part of the same change when queried.

The RosterChanges table also has its own primary key, RosterChangeId. There’s a subtle but incredibly impactful decision between NewsId and RosterChangeId: Any time the News key is used to join data together, that data will have been stored from the same parent template on the same page as the News item; however, this is NOT the case for the RosterChanges key, as I’ll explain below.

Complications due to Tenures

There’s a third table relevant to discuss: Tenures. Each row in the Tenures table represents a player’s tenure on a team. Therefore, there are two distinct roster changes of relevance: Both the join and the leave (technically there’s also an IsCurrent flag, and if this flag is set to True then there’s only one relevant roster change because they haven’t left yet).

How should the structure of such a table look? Well, we need to know all of the relevant information about the tenure (the position, the team name, modifiers such as sub, trainee, inactive); but also, we need to know some information about the endpoints - what’s the reference/citation for the join? For the leave? What about the dates? Note that all of these listed fields are already stored from RosterChanges on the data page. So if we were going to include them all in the Tenures table as well, we’d have a bunch of normalization violations and a ton of extra updating to do any time we wanted to add a field. For instance, say in the future we care about something called Contract Dispute Deadline for both the join and the leave (this doesn’t exist I made it up for the sake of illustration) - then instead of just modifying RosterChanges, we have to add TWO additional fields to Tenures!.

Obviously that situation is not ideal, so in order to avoid it, I decided instead to store the roster changes’ primary keys in Tenures. Then further queries of Tenures can join to two copies of RosterChanges (one for the Join and one for the Leave) and retrieve all necessary details.

Here’s a Lua example of such a query, extracted from Module:QueryTeamMembers:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


function h.getTeamMembers(team)
	local query = {
		tables = {
			'Tenures=Ten',
			'PlayerRedirects=PR',
			'Players=P',
			'RosterChanges=RCJ',
		},
		join = {
			'Ten.Player=PR.AllName',
			'PR.OverviewPage=P._pageName',
			'Ten.RosterChangeIdJoin=RCJ.RosterChangeId',
		},
		fields = {
			'P.Player=Link',
			'P.ID',
			'P.Name',
			'P.NameFull',
			'COALESCE(P.NationalityPrimary, P.Country)=Country',
			'P.Residency',
			'RCJ.Role',
			'RCJ.RoleModifier',
		},
		where = {
			'Ten.IsCurrent="1"',
			('Ten.Team="%s"'):format(team),
		},
	}
	local result = util_cargo.queryAndCast(query)
	
	-- snip some data processing here
	
	return result
end

In this example, you can see that we only care about current tenures (Ten.Current="1"), so we only join to one copy of RosterChanges, but we could just as easily join to RosterChanges=RCJ and RosterChanges=RCL in the same query. (Don’t get caught up in the two distinct meanings of join here.)

Further complications due to TenuresUnbroken

In addition to Tenures, I have another table called TenuresUnbroken, which stores tenures unbroken by things like role changes, status changes, etc. For example, if a player joins a team as a substitute, one year later moves to the starting roster, and one year later leaves, he will have two one-year-long Tenures, and one two-year-long unbroken tenure (TenuresUnbroken). This invites in a nontrivial additional complexity: There are now more than two relevant RosterChangeIds. So instead of storing just RosterChangeIdJoin and RosterChangeIdLeave, I also store RosterChangeIds, which as a List (,) of String-type field, is actually another table representing a one-to-many relation and consisting of a TenuresUnbroken row’s primary key and the relevant RosterChangeIds for that row.

Here’s how a query of TenuresUnbroken looks in Lua (this is an excerpt of Module:TeamMembers):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80


function h.getChangesQuery(team, args)
	local query = {
		tables = {
			'TeamRedirects=TR',
			'TenuresUnbroken=Tenures',
			'TenuresUnbroken__RosterChangeIds=RCID',
			'RosterChanges=RC',
			'NewsItems=News',
			
			-- this PR is for a field, not a join
			'PlayerRedirects=PR',
		},
		join = {
			'TR.AllName=Tenures.Team',
			'Tenures._ID=RCID._rowID',
			'RCID._value=RC.RosterChangeId',
			'RC.NewsId=News.NewsId',
			'Tenures.Player=PR.AllName',
		},
		where = h.getChangesWhere(team, args),
		fields = h.getChangesFields(),
		oneToMany = h.getChangesOneToMany(),
		types = {
			NextIsRetired = 'boolean',
			IsCurrent = 'boolean',
			IsApproxDate = 'boolean',
		},
	}
	return query
end

function h.getChangesWhere(team, args)
	local tbl = {
		('TR._pageName="%s"'):format(team),
		SETTINGS.where,
	}
	return util_cargo.concatWhere(tbl)
end

function h.getChangesFields()
	local ret = {
		-- PlayerKey is needed here so we have a standardized way of matching up with
		-- PR.AllName in the playerExtraInfo field later on
		-- this is due to Cargo treating modified unicode letters the same as
		-- their non-unicode varieties
		'PR.AllName=PlayerKey',
		'Tenures.Player',
		'Tenures.NameLeave=Name',
		'Tenures._ID=TenuresPrimaryKey',
		'Tenures.DateJoin=DateJoin',
		'Tenures.DateLeave=DateLeave',
		'Tenures.ContractEnd',
		'Tenures.ResidencyLeave=Residency',
		'Tenures.NameLeave=Name',
		'Tenures.NextTeam',
		'Tenures.NextIsRetired',
		'Tenures.IsCurrent',
	}
	return ret
end

function h.getChangesOneToMany()
	local oneToMany = {
		groupBy = { 'TenuresPrimaryKey' },
		fields = {
			RosterChanges = {
				'RC.Role',
				'RC.RoleModifier',
				'RC.Status',
				'RC.Direction',
				'News.Date_Display',
				'News.Date_Sort',
				'News.IsApproxDate',
				'News.Source',
				'News.Sentence',
			}
		}
	}
	return oneToMany
end

You can see the explicit join of TenuresUnbroken to its child table TenuresUnbroken__RosterChangeIds (if you’re familiar with Cargo, this join is often accomplished by the syntactic sugar of the HOLDS operator). You can also see usage of my one-to-many support, which works similarly to the #cargo_query option merge similar cells=Yes. The condition SETTINGS.where is just the condition on Tenures.IsCurrent - though now that I felt a need to explain that here, I will probably edit this code to clarify.

One other important thing to realize about this architecture is that when I said I decided to store a primary key to avoid the “inconvenience” of storing two copies of every relevant roster change in Tenures, actually the situation in TenuresUnbroken would be untenable (haha) if I weren’t allowing to store/cache roster change keys here; because there can be three or more relevant roster changes, I would need an entirely separate TenuresUnbrokenRosterChanges table that merely duplicates all of the data that’s already in RosterChanges.

So we’re stuck caching a primary key that we’ve queried from elsewhere, and it’s not an option to change our schema in the slightest; we have to just put up with the caching nightmare. In the next three sections, I’ll explain why this is a nightmare even though RefreshOverview exists.

Choosing a suitable primary key

Under normal circumstances, a primary key for RosterChanges would probably look something like {{PAGENAME}}_N, where N is a page-global indexing variable, and {{PAGENAME}} is the name of the page storing the change. It’s easy to verify this is unique: We increment N each time we store a line, so only one line per page can have any given value of N, and if two items are stored from different pages, then {{PAGENAME}} will differ. In fact, this is the original choice I made for RosterChangeId.

The line-ordering issue

As I’ve alluded to above, sometimes lines on the page will be reordered. There’s two cases:

An editor adds some lines thinking the page is recent-at-the-top but actually it’s recent-at-the-bottom; this offsets all indices by 1
News is added in an order different from it’s true chronological order, and we have to rearrange

The first is impossible to avoid because people don’t read instructions, nor should they have to; and the second is impossible to avoid because in this case everyone is following correct procedure.

Note that neither scenario causes ANY problem for the NewsId primary key in NewsItems. The only tables that we join on this key are all stored from the same place, so even if a page querying these tables doesn’t get cache-refreshed after the edit, its data is still valid: in its version of the universe, no edit has been made to the data page yet, and when we do eventually update its cache, we still have an internally-consistent post-edit state.

Of course, RosterChanges.RosterChangeId is a different story: Here we are actually re-storing the key on a dependent page in the Tenures table. So once the data page is updated, we no longer have an internally-consistent state: the stored value in Tenures does not match the stored value in RosterChanges, and when we query Tenures from a team page, we lack a consistent value to join on.

Again, the problem here isn’t that we have to update cache: it’s that we have too many side effects of the actual update for RefreshOverview to handle. “All” we have to do, then, is limit the potential side effects from a re-ordering to those that will be addressed by a RefreshOverview: namely, those that affect the target pages (involved player & team).

To accomplish this, let’s go back to the reason that we have these side effects in the first place: the global index within the page to generate the primary key. Why do we have this global index? Well, (a) we need a unique key per line, and (b) it was convenient. But (a) just requires any unique key per line, and (b) is irrelevant. In an ideal happyland world, our unique index would be just fields that have actual meaning, e.g. date+team+player+direction; the only reason we can’t do this is that we could have multiple changes per date-player-team-direction combination.

But wait…when we RO, we update both player and team, so in particular we can change as many of the cached entries for the player & team as we want without any negative side effects! What if we just change our index from a global one to a per-player-team one?

In fact, this is the solution. Here’s the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


function h.setAndGetRosterChangeId(row)
	-- we need to make this encode enough data that it's impossible
	-- to have a case where an ID was changed and the page corresponding to it
	-- was NOT blank edited after the change.
	--
	-- because RO saves both player and team page and nothing else,
	-- including these two alongside a numerical counter is both
	-- necessary and sufficient
	local tbl = {
		mw.title.getCurrentTitle().text,
		row.Player,
		row.Team or 'No Team',
		util_vars.setGlobalIndex('rosterChangePlayer' .. row.Player .. (row.Team or '_')),
	}
	return util_table.concat(tbl, '_')
end

Pretend that we have a global singleton object CURRENT_INDEX_PER_PLAYER_TEAM_KEY and we’re incrementing values in this; since we can’t literally have that, we’re achieving the same effect via a set of individual variables prefixed by rosterChangePlayer.

Great!

The page-splitting issue

This worked perfectly…for a few months. Here’s the second issue.

My NewsData pages are created with the intention of each one hosting a full week’s data. In practice, this works about 90% of the time, but for the busiest 4-5 weeks of the year, there are simply too many changes to save them all on the same page, and instead we separate them by day. (Is this a 0NF violation? No, because I have a redirect from the would-be individual-page name to the week-long page name in all cases.)

The key setup from the previous section works fine when there’s an entire week on one page, and it also works fine when there’s only one day per page. But at the point of transition between the two cases, that’s when everything breaks - remember, our keys contain the page name in them!

As an aside, in some cases it will be vitally important to include the name of a page in a primary key, particularly if you are close to a 0NF violation - since wiki pages are (mostly) siloed from each other, you could arrive at a situation where two pages are writing the same primary key for a different set of data. But as that’s not a concern here, we’ll move on.

Fortunately, there’s a relatively easy solution to this: instead of using the page name as part of the primary key, just store the date there. We know all of our roster changes from a single date are stored from the same place, so we’ll lose no information, and then splitting among dates isn’t an issue. (At least, not for now…and let’s hope we never have a situation of needing two separate pages for the same date - or if we do, that one of them is extremely static.)

There’s one crucial issue, which is that I don’t want to modify any code aside from that one function h.getAndSetRosterChangeId from Module:NewsUtil. So we’re going to have to do a bit of wrangling to determine whether we are resetting or incrementing our index. Here’s the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


function h.setAndGetRosterChangeId(row)
	-- we need to make this encode enough data that it's impossible
	-- to have a case where an ID was changed and the page corresponding to it
	-- was NOT blank edited after the change.
	--
	-- because RO saves both player and team page and nothing else,
	-- including these two alongside a numerical counter is sufficient
	
	-- furthermore, we have to guarantee that when a page is split, the RC IDs remain static
	-- so let's also add a date param
	-- we don't want anything outside of this function to be called ever so first we'll
	-- check what the date was the last time we did this; if it changed then we need to
	-- reset our index, otherwise keep incrementing index and we're fine.
	local date = util_vars.getVar('Date')
	local lastDate = util_vars.getVar('rCPDate' .. row.Player .. (row.Team or '_'))
	util_vars.setVar('rCPDate' .. row.Player .. (row.Team or '_'), date)
	local index
	if lastDate and lastDate == date then
		index = util_vars.setGlobalIndex('rCP' .. date .. row.Player .. (row.Team or '_'))
	else
		index = util_vars.resetGlobalIndex('rCP' .. date .. row.Player .. (row.Team or '_'))
	end
	local tbl = {
		date,
		row.Player,
		row.Team or 'No Team',
		index
	}
	return util_table.concat(tbl, '_')
end

And we are done (finally)!

Conclusion

The store-query-store-query pattern can be extremely dangerous, especially when generated primary keys are part of the second store. Even with automated tools to make cache updates easy for editors, this pattern can result in situations that are extremely difficult to rectify. The best solution to this problem is to avoid the pattern altogether, but some situations mandate its use; in this case, choose your stored values with extreme care, and always be on the lookout for caching issues that may arise any time your requirements change.