Today I woke up to an error report in my #bugfix-requests
discord channel that my scoreboard parser was completely broken. “Huh, that’s weird,” I thought, “Nothing changed….” Things don’t break unless something changed. But it turned out something had in fact changed, and the story was so hilarious that I’m writing a completely new blog post for this week, and delaying my planned post to next week instead.
(Note: Unless otherwise stated, all code in this post was originally written by me for Leaguepedia and is licensed under CC BY-SA 3.0.)
Backstory
A couple days ago, there was a rather innocuous reddit post that linked to Leaguepedia. This is not a rare or even notable occurrence; people link to the wiki constantly from reddit, and in fact there are links to Leaguepedia in the standard “post-match-team” template that gets posted after every major-league series as well as some minor-league ones. But in this one, the OP (original poster) didn’t escape the closing parenthesis in one of the links.
They wrote:
[Coach Kim, I mean most people on this sub probably know coach Kim.](https://lol.gamepedia.com/Kim_(Kim_Jeong-soo))
When the correct syntax was:
[Coach Kim, I mean most people on this sub probably know coach Kim.](https://lol.gamepedia.com/Kim_(Kim_Jeong-soo\))
This is an extremely understandable mistake; it’s not something that I’d expect anyone to know unless it’s literally part of their job description (hi). So I decided it was on me to make their erroneous link resolve, and in fact we should create a redirect for every page ending in a closed parenthesis from the “typo’d” link to the correct page. So yesterday one of my admins wrote some code. This turned out to be slightly harder than expected.
Writing a query to discover pages
When I first pitched this task, I thought the query was a simple join of two copies of _pageData
, one to find existing pages, and one to make sure the redirect didn’t already exist. But what I overlooked was that if we did it as a simple query, we’d instead be looking for pages without any redirects - not at all what we wanted! So in fact the full query is pretty complicated:
{{#cargo_query:tables=_pageData=_pD,_pageData=_pD2
|join on=_pD._pageName=_pD2._pageNameOrRedirect
|fields=_pD._pageName, _pD._pageNamespace,_pD._isRedirect,_pD2._pageName=PD2 Page
|where= _pD._pageName LIKE "%)" AND _pD._isRedirect="0" AND _pD._pageNamespace="0"
|group by=_pD._pageName
|having=SUM(CASE WHEN CONCAT(_pD2._pageName, ')') = _pD._pageName THEN 100 ELSE 0 END) = 0
}}
Tables
Straightforward: Two copies of _pageData
Join
Still pretty straightforward: The first copy is to look at the page name, and the second is to look at redirects to that page.
Fields
The fields I thought would be useful to see are:
- both page names - though because it’s a one-to-many relation, and I’m grouping by the one, of course the
_pageName
in the second copy is a random one _pageNamespace
- Boring in the final query because I’m setting it to0
but when I was diagnosing what was wrong with the live query, I wanted to check if main namespace was0
or the empty string, and then I just left it here_isRedirect
- I couldn’t figure out why everything was a redirect at first, so I wanted to see this field. Eventually I realized it was because I was requiring things to have NO redirects due to my erroneous_pD2._pageName IS NULL
clause that I replaced (and nearly all of our pages have at least SOME redirects)
Where
I want to look at pages that end in )
, when means _pD._pageName LIKE "%)"
. I also decided to only care about main-namespace pages: making redirects for tooltip pages was certainly out of the question, and it seemed like too much of a pain to individually exclude namespaces. Maybe I could want Self
, but that has a different namespace number on different wikis, and I might run this on wikis other than Leaguepedia, so I just didn’t want to deal with that right now. Main namespace was enough.
_pD._isRedirect="0"
was interesting. If I didn’t use this clause, it doesn’t necessarily mean I’d start making double redirects, since I could have the new page redirect to the final target. But at the same time, I’d be creating a LOT (like, a lot) of extremely low-utility redirects for scoreboard pages that end in Week 1 (2)
etc and I just didn’t want to bother with it. Also, people typically copy-paste URLs, and URLs automatically update themselves to redirect targets. So this clause went in.
Group by
In order for my having
clause to work, I needed to group by _pD._pageName
. You will see.
Having
Okay, the interesting one. So I need to make sure that this “typo-correction” redirect doesn’t already exist. What is a typo-correction redirect? It’s the same name as the page without the )
. Let’s, uh, do some algebra:
Typo-correction-redirect = Actual page - )
Typo-correction-redirect + ) = Actual page
Or in other (SQL) words, CONCAT(_pD2._pageName, ')') = _pD._pageName
.
So I need to make sure that the above statement DOES NOT HOLD for ANY of the redirects. So if I assign 100 points when it DOES hold, and 0 otherwise, and sum it all up, then I want the sum to be 0. Or in SQL:
SUM(CASE WHEN CONCAT(_pD2._pageName, ')') = _pD._pageName THEN 100 ELSE 0 END) = 0
This type of construction of group by
the left-hand-side of a one-to-many and then SUM / CASE WHEN in a having
on the right-hand-side is really, really, really powerful. I meant to do an entire blog post on it one time, uh yeah lol.
Well anyway, cool, that’s it, we’re done!
The Python script
So like I said Jibril, one of my admins, wrote the final script after I wrote the query:
|
|
And it’s nicely set up to fit into my crontab framework so it can run hourly, hooray!
The commit
So that was a lot of backstory for the actual commit that was so funny to me that I had to write this ENTIRE BLOG POST ABOUT IT. Like I said, there was an error with my scoreboard parser, and we had just created all of these typo-correction redirects with unbalanced parentheses. So, I had to change:
|
|
to:
|
|
HAHAHAHAHAHAHAHA
See, the package re
stands for “regular expressions.” What I’m doing is a regex-based find-replace here. The first parameter needs to be a valid regular expression. And all of the page names that I’m creating, with a (
but no )
is, well, not a valid regex, unless the opening parenthesis is escaped. So to fix the issue, I just escaped the parenthesis and I was fine.
Here’s the commit on Github. The actual commit message isn’t quite this long, I promise.