Escape item["id"]

Today I woke up to an error report in my #bugfix-requests discord channel that my scoreboard parser was completely broken. “Huh, that’s weird,” I thought, “Nothing changed….” Things don’t break unless something changed. But it turned out something had in fact changed, and the story was so hilarious that I’m writing a completely new blog post for this week, and delaying my planned post to next week instead.

(Note: Unless otherwise stated, all code in this post was originally written by me for Leaguepedia and is licensed under CC BY-SA 3.0.)

Backstory

A couple days ago, there was a rather innocuous reddit post that linked to Leaguepedia. This is not a rare or even notable occurrence; people link to the wiki constantly from reddit, and in fact there are links to Leaguepedia in the standard “post-match-team” template that gets posted after every major-league series as well as some minor-league ones. But in this one, the OP (original poster) didn’t escape the closing parenthesis in one of the links.

They wrote:

[Coach Kim, I mean most people on this sub probably know coach Kim.](https://lol.gamepedia.com/Kim_(Kim_Jeong-soo))

When the correct syntax was:

[Coach Kim, I mean most people on this sub probably know coach Kim.](https://lol.gamepedia.com/Kim_(Kim_Jeong-soo\))

This is an extremely understandable mistake; it’s not something that I’d expect anyone to know unless it’s literally part of their job description (hi). So I decided it was on me to make their erroneous link resolve, and in fact we should create a redirect for every page ending in a closed parenthesis from the “typo’d” link to the correct page. So yesterday one of my admins wrote some code. This turned out to be slightly harder than expected.

Writing a query to discover pages

When I first pitched this task, I thought the query was a simple join of two copies of _pageData, one to find existing pages, and one to make sure the redirect didn’t already exist. But what I overlooked was that if we did it as a simple query, we’d instead be looking for pages without any redirects - not at all what we wanted! So in fact the full query is pretty complicated:

{{#cargo_query:tables=_pageData=_pD,_pageData=_pD2
|join on=_pD._pageName=_pD2._pageNameOrRedirect
|fields=_pD._pageName, _pD._pageNamespace,_pD._isRedirect,_pD2._pageName=PD2 Page
|where= _pD._pageName LIKE "%)" AND _pD._isRedirect="0" AND _pD._pageNamespace="0"
|group by=_pD._pageName
|having=SUM(CASE WHEN CONCAT(_pD2._pageName, ')') = _pD._pageName THEN 100 ELSE 0 END) = 0
}}

Tables

Straightforward: Two copies of _pageData

Join

Still pretty straightforward: The first copy is to look at the page name, and the second is to look at redirects to that page.

Fields

Screenshot of redirects query
The fields I thought would be useful to see are:

both page names - though because it’s a one-to-many relation, and I’m grouping by the one, of course the _pageName in the second copy is a random one
_pageNamespace - Boring in the final query because I’m setting it to 0 but when I was diagnosing what was wrong with the live query, I wanted to check if main namespace was 0 or the empty string, and then I just left it here
_isRedirect - I couldn’t figure out why everything was a redirect at first, so I wanted to see this field. Eventually I realized it was because I was requiring things to have NO redirects due to my erroneous _pD2._pageName IS NULL clause that I replaced (and nearly all of our pages have at least SOME redirects)

Where

I want to look at pages that end in ), when means _pD._pageName LIKE "%)". I also decided to only care about main-namespace pages: making redirects for tooltip pages was certainly out of the question, and it seemed like too much of a pain to individually exclude namespaces. Maybe I could want Self, but that has a different namespace number on different wikis, and I might run this on wikis other than Leaguepedia, so I just didn’t want to deal with that right now. Main namespace was enough.

_pD._isRedirect="0" was interesting. If I didn’t use this clause, it doesn’t necessarily mean I’d start making double redirects, since I could have the new page redirect to the final target. But at the same time, I’d be creating a LOT (like, a lot) of extremely low-utility redirects for scoreboard pages that end in Week 1 (2) etc and I just didn’t want to bother with it. Also, people typically copy-paste URLs, and URLs automatically update themselves to redirect targets. So this clause went in.

Group by

In order for my having clause to work, I needed to group by _pD._pageName. You will see.

Having

Okay, the interesting one. So I need to make sure that this “typo-correction” redirect doesn’t already exist. What is a typo-correction redirect? It’s the same name as the page without the ). Let’s, uh, do some algebra:

Typo-correction-redirect = Actual page - )
Typo-correction-redirect + ) = Actual page

Or in other (SQL) words, CONCAT(_pD2._pageName, ')') = _pD._pageName.

So I need to make sure that the above statement DOES NOT HOLD for ANY of the redirects. So if I assign 100 points when it DOES hold, and 0 otherwise, and sum it all up, then I want the sum to be 0. Or in SQL:

SUM(CASE WHEN CONCAT(_pD2._pageName, ')') = _pD._pageName THEN 100 ELSE 0 END) = 0

This type of construction of group by the left-hand-side of a one-to-many and then SUM / CASE WHEN in a having on the right-hand-side is really, really, really powerful. I meant to do an entire blog post on it one time, uh yeah lol.

Well anyway, cool, that’s it, we’re done!

The Python script

So like I said Jibril, one of my admins, wrote the final script after I wrote the query:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


from river_mwclient.esports_client import EsportsClient
from river_mwclient.auth_credentials import AuthCredentials


def run(site: EsportsClient):
    result = site.cargo_client.query(
        tables='_pageData=Target,_pageData=Redirect',
        join_on='Target._pageName=Redirect._pageNameOrRedirect',
        where='Target._pageName LIKE "%)" AND Target._isRedirect="0" AND Target._pageNamespace="0"',
        having='SUM(CASE WHEN CONCAT(Redirect._pageName, ")") = Target._pageName THEN 100 ELSE 0 END) = 0',
        fields='Target._pageName=Link,Target._pageNameOrRedirect=Target',
        group_by='Target._pageName',
        limit='max'
    )

    for item in result:
        target = item["Target"]
        redirect = item["Link"][:-1]
        site.save_title(redirect, "#REDIRECT[[%s]]" % target, summary="creating redirects with missing parentheses")


if __name__ == '__main__':
    credentials = AuthCredentials(user_file='me')
    site = EsportsClient('lol', credentials=credentials)  # Set wiki
    run(site)

And it’s nicely set up to fit into my crontab framework so it can run hourly, hooray!

The commit

So that was a lot of backstory for the actual commit that was so funny to me that I had to write this ENTIRE BLOG POST ABOUT IT. Like I said, there was an error with my scoreboard parser, and we had just created all of these typo-correction redirects with unbalanced parentheses. So, I had to change:

1

disambiguation = re.sub(r'^' + item['ID'], '', item['DisambiguatedName'])

to:

1

disambiguation = re.sub(r'^' + re.escape(item['ID']), '', item['DisambiguatedName'])

HAHAHAHAHAHAHAHA

See, the package re stands for “regular expressions.” What I’m doing is a regex-based find-replace here. The first parameter needs to be a valid regular expression. And all of the page names that I’m creating, with a ( but no ) is, well, not a valid regex, unless the opening parenthesis is escaped. So to fix the issue, I just escaped the parenthesis and I was fine.

Here’s the commit on Github. The actual commit message isn’t quite this long, I promise.