Saturday, 15 May 2010

Python 3.x - Why does this Wikipedia MediaWiki API request not return categories for all links?




I'm trying to get the outgoing links of a given Wikipedia page to other Wikipedia articles, together with their respective categories.

Somehow, many pages are returned without a category even though they belong to some. This does not seem to be systematic, i.e. the pages returned without a category are not always the same.

The following example is as minimal as I can make it:

# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import json

def link_request(more_parameters={"continue": ""}):
    parameters = {"format": "json",
                  "action": "query",
                  "generator": "links",
                  "gpllimit": "max",
                  "gplnamespace": "0",
                  "prop": "categories",
                  "cllimit": "max",
                  "titles": urllib.parse.quote(start_page.encode("utf8"))}
    parameters.update(more_parameters)

    querystring = "&".join("%s=%s" % (k, v) for k, v in parameters.items())

    # This ensures that redirects are followed automatically, documented here:
    # http://www.mediawiki.org/wiki/api:query#resolving_redirects
    querystring = querystring+"&redirects"
    url = "http://%s.wikipedia.org/w/api.php?%s" % (wikipedia_language, querystring)
    print(url)

    # Get the JSON data from the Wikimedia API and create a dictionary out of it:
    request = urllib.request.urlopen(url)
    encoding = request.headers.get_content_charset()
    jsondata = request.read().decode(encoding)
    info = json.loads(jsondata)

    return info

def get_link_data():
    data=link_request()

    query_result=data['query']['pages']

    while 'continue' in data.keys():
        continue_dict=dict()
        for key in list(data['continue'].keys()):
            if key == 'continue':
                continue_dict.update({key: data['continue'][key]})
            else:
                val= "|".join([urllib.parse.quote(e) for e in data['continue'][key].split('|')])
                continue_dict.update({key: val})
        data=link_request(continue_dict)
        query_result.update(data['query']['pages'])

    print(json.dumps(query_result, indent=4))

start_page="albert einstein"
wikipedia_language="en"
get_link_data()

In case you are wondering: the continue stuff is explained here: http://www.mediawiki.org/wiki/api:query#continuing_queries

The problem is that, because of the way continuations work, you can't just update() the result and expect it to work.

For example, imagine you had the following linked pages with categories:

Page 1
    Category 1
Page 2
    Category 2a
    Category 2b
Page 3
    Category 3

Now, if you set both gpllimit and cllimit to 2 (i.e. each response contains at most 2 pages and at most 2 of their categories), the result is spread across three continue responses like this:

Response 1
    Page 1
        Category 1
    Page 2
        Category 2a
Response 2
    Page 1
    Page 2
        Category 2b
Response 3
    Page 3
        Category 3

If you use update() to combine these responses, the results from response 2 will overwrite the results from response 1:

Page 1
Page 2
    Category 2b
Page 3
    Category 3

So, you need to use a smarter approach to combine the responses. Or better, use one of the existing libraries to access the API.
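One possible "smarter approach" is sketched below. It merges each response page by page and extends list-valued properties such as the categories instead of replacing the whole page entry. The helper names merge_pages and category_titles are hypothetical, and the three response dictionaries are hand-written stand-ins for the example above (the page IDs are made up); they only mimic the shape of data['query']['pages'] and are not real API output.

```python
import copy


def merge_pages(merged, new_pages):
    """Merge one response's pages dict into the running result.

    Pages are keyed by page ID. List-valued properties such as
    'categories' are extended; other values are simply overwritten.
    """
    for page_id, page in new_pages.items():
        target = merged.setdefault(page_id, {})
        for key, value in page.items():
            if isinstance(value, list):
                target.setdefault(key, []).extend(copy.deepcopy(value))
            else:
                target[key] = value
    return merged


# Hand-written stand-ins for the three responses (assumed shape, made-up IDs).
responses = [
    {"1": {"title": "Page 1", "categories": [{"title": "Category 1"}]},
     "2": {"title": "Page 2", "categories": [{"title": "Category 2a"}]}},
    {"1": {"title": "Page 1"},
     "2": {"title": "Page 2", "categories": [{"title": "Category 2b"}]}},
    {"3": {"title": "Page 3", "categories": [{"title": "Category 3"}]}},
]

naive, merged = {}, {}
for pages in responses:
    naive.update(copy.deepcopy(pages))  # a later response replaces whole pages
    merge_pages(merged, pages)          # a later response adds to earlier ones


def category_titles(result):
    """Map each page title to the list of its category titles."""
    return {page["title"]: [c["title"] for c in page.get("categories", [])]
            for page in result.values()}


print(category_titles(naive))
print(category_titles(merged))
```

In the question's loop, this would mean calling merge_pages(query_result, data['query']['pages']) where query_result.update(...) is called now. With the stand-in data, the naive result loses Category 1 and Category 2a, while the merged result keeps all four categories.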

python-3.x mediawiki wikipedia-api mediawiki-api
