Tuesday, 15 April 2014

javascript - Screen scraping dynamic webpage in Python with Ghost.py




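    from ghost import Ghost

    ghost = Ghost()
    # Load the page and wait for it to finish loading.
    page, rcs = ghost.open('https://soundcloud.com/passionpit/sets/favorites')
    page, rcs = ghost.wait_for_page_loaded()
    # Run JavaScript inside the loaded page and print whatever comes back.
    songs = ghost.evaluate("document.getElementsByClassName('soundTitle__title');")
    print(songs)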

I am attempting to use the above code to find the HTML elements on the above page that have the class 'soundTitle__title'. As of right now, the output is:

    QFont::setPixelSize: Pixel size <= 0 (0)
    ({PyQt4.QtCore.QString(u'length'): 0.0}, [])

Can anyone help me see what the problem is? When I run document.getElementsByClassName('soundTitle__title') in a browser's console, the output is what I expect, so why is the Python output different?

Or is there a way for me to use Ghost.py or a similar library to get the source of the page after the JavaScript has run (the source seen when inspecting an element with the browser developer tools)?

I got this working with, and would recommend, Splinter, which runs PhantomJS and Selenium under the hood.

You'll need to run pip install splinter and install PhantomJS on your machine, either by downloading and untarring it or with npm -g install phantomjs if you have npm, etc. Overall, the installation and dependencies are minimal and straightforward.
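If you want to confirm both pieces are in place before running the scraper, something like the following might help. It is only a sketch: the helper name check_prerequisites is made up for illustration, it assumes Python 3 (shutil.which and importlib.util are not available on Python 2), and it only checks that the splinter package is importable and that a phantomjs executable is on your PATH, not that either one actually works.

```python
import importlib.util
import shutil


def check_prerequisites(module_name="splinter", executable="phantomjs"):
    """Report whether a Python package and an external binary can be found.

    Hypothetical helper: it does not import the package or launch the binary.
    """
    return {
        # find_spec returns None when the top-level package is not installed.
        module_name: importlib.util.find_spec(module_name) is not None,
        # which returns None when the executable is not on PATH.
        executable: shutil.which(executable) is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print("%s: %s" % (name, "found" if found else "missing"))
```

If either entry is reported missing, go back to the install step above before trying the scraping code.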

The following code returns 'Ryn Weaver - OctaHate', which I'm assuming is what you're looking for, although without more context I can't be totally sure.

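    from splinter import Browser

    # Drive a headless PhantomJS browser so the page's JavaScript runs.
    browser = Browser('phantomjs')
    browser.visit('https://soundcloud.com/passionpit/sets/favorites')
    # Match anchors whose class attribute contains the title class.
    songs = browser.find_by_xpath("//a[contains(@class, 'soundTitle__title')]")
    if songs:
        for song in songs:
            print(song.text)
    else:
        print("There aren't any songs")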

You'll notice that I used an XPath contains() test on the class you were looking for; you might be running into a problem when trying to access it with plain class notation. There are a span element and an anchor element that both contain 'soundTitle__title', but as far as I can tell only the 'a' element had the text, and I guess that's what you're looking for. If you want both, use browser.find_by_xpath("//*[contains(@class, 'soundTitle__title')]").
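To make the contains() behaviour a bit more concrete, here is a rough sketch that imitates it with nothing but the Python 3 standard library. It is not how Splinter or XPath is implemented. The ClassSubstringFinder class, the find_by_class_substring function and the SAMPLE markup are all invented for illustration, and the real SoundCloud markup may well look different. The point is only that a substring test on the class attribute matches both a span and an anchor, while only the anchor carries text.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE = """
<div class="soundTitle">
  <span class="soundTitle__titleContainer"></span>
  <a class="soundTitle__title sc-link-dark" href="/example">Ryn Weaver - OctaHate</a>
</div>
"""


class ClassSubstringFinder(HTMLParser):
    """Collect elements whose class attribute contains a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []  # [tag, text] pairs, in document order
        self._stack = []   # (tag, index into matches or None) per open element

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        index = None
        if self.needle in class_value:
            index = len(self.matches)
            self.matches.append([tag, ""])
        self._stack.append((tag, index))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed void elements).
        while self._stack:
            open_tag, _ = self._stack.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for _, index in self._stack:
            if index is not None:
                self.matches[index][1] += data


def find_by_class_substring(html, needle, tag=None):
    """Return (tag, text) pairs, optionally restricted to one tag name."""
    finder = ClassSubstringFinder(needle)
    finder.feed(html)
    finder.close()
    return [(t, text.strip()) for t, text in finder.matches if tag in (None, t)]


if __name__ == "__main__":
    # Roughly //*[contains(@class, 'soundTitle__title')]
    print(find_by_class_substring(SAMPLE, "soundTitle__title"))
    # Roughly //a[contains(@class, 'soundTitle__title')]
    print(find_by_class_substring(SAMPLE, "soundTitle__title", tag="a"))
```

Note that in this sample an exact comparison of the whole class attribute against 'soundTitle__title' would match neither element, because the anchor carries a second class and the span's class merely starts with the same prefix.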

javascript python screen screen-scraping ghost.py
