Monday, 15 July 2013

download - Downloading a LOT of files using python

Is there a way to download a lot of files en masse using Python? My code is plenty speedy for downloading 100 or so files, but I need to download 300,000 files. They are little files (or I wouldn't be downloading 300,000 of them :) ), so the real bottleneck seems to be the loop. Does anyone have any thoughts? Maybe using MPI or threading?

Do I just have to live with this bottleneck? Or is there a faster way, maybe one not using Python?

(I have included my code in its entirety, for completeness' sake.)

    from __future__ import division
    import pandas as pd
    import numpy as np
    import urllib2
    import os
    import linecache

    # we start with a huge file of urls
    data = pd.read_csv("edgar.csv")
    datatemp2 = data[data['form'].str.contains("14a")]
    datatemp3 = data[data['form'].str.contains("14c")]

    # data2 is the cut-down file
    data2 = datatemp2.append(datatemp3)
    flist = np.array(data2['filename'])
    print len(flist)
    print flist

    ### below we have a script to download all of the files in the data2 database
    ### here you need to create a new directory named edgar14a14c in your cwd
    original = os.getcwd()  # note: strings have no .copy(); a plain assignment is enough
    os.chdir(os.getcwd() + '/edgar14a14c')
    for i in xrange(len(flist)):
        url = "ftp://ftp.sec.gov/" + str(flist[i])
        file_name = str(url.split('/')[-1])
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()
        print i

The usual pattern with multiprocessing is to create a job() function that takes arguments and performs the potentially CPU-bound work.

Example (based on your code):

    from multiprocessing import Pool
    import urllib2

    def job(url):
        file_name = str(url.split('/')[-1])
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()

    pool = Pool()
    urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
    pool.map(job, urls)

This does a number of things:

- create a multiprocessing Pool with as many workers as you have CPU(s) or CPU core(s)
- create a list of inputs for the job() function
- map the list of input urls to job() and wait for all jobs to complete
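
One caveat worth noting: on platforms that spawn rather than fork worker processes (notably Windows), the pool creation has to live under an if __name__ == '__main__': guard, because every worker re-imports the main module. A minimal sketch of that pattern, with a stand-in job() body:

    from multiprocessing import Pool

    def job(n):
        # stand-in for the real per-url download work
        return n * n

    if __name__ == '__main__':
        # without this guard, spawned workers would re-run the module
        # top level and try to create pools of their own
        pool = Pool()
        print pool.map(job, range(5))  # prints [0, 1, 4, 9, 16]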

Python's multiprocessing.Pool.map will take care of splitting your input across the number of workers in the pool.
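
To make that splitting concrete, here is a toy sketch (the worker count and chunksize are arbitrary demo values, not anything from the code above):

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)  # explicit worker count; the default is cpu_count()
        # map() slices range(10) into chunks of 2 and hands them out to the workers
        results = pool.map(square, range(10), chunksize=2)
        print results  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]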

Another neat little thing I've found useful for this kind of work is to use the progress library, like this:

    from multiprocessing import Pool
    from progress.bar import Bar

    def job(input):
        # do some work
        pass

    pool = Pool()
    inputs = range(100)
    bar = Bar('Processing', max=len(inputs))
    for i in pool.imap(job, inputs):
        bar.next()
    bar.finish()

This gives you a nice progress bar on your console as the jobs progress, so you have some idea of overall progress, ETA, etc.
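
Putting the progress bar together with the download job looks roughly like this (a sketch assuming the progress package is installed and that flist has been built from edgar.csv as in the question; the path in flist below is only a placeholder):

    import urllib2
    from multiprocessing import Pool
    from progress.bar import Bar

    def job(url):
        file_name = url.split('/')[-1]
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()

    if __name__ == '__main__':
        flist = ["edgar/data/12345/example.txt"]  # placeholder; built from edgar.csv in the question
        urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
        pool = Pool()
        bar = Bar('Downloading', max=len(urls))
        # imap yields results as each job finishes, so the bar advances per download
        for _ in pool.imap(job, urls):
            bar.next()
        bar.finish()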

I also find the requests library very useful here; it has a much nicer set of API(s) for dealing with web resources and downloading content.
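
For example, something along these lines (note that requests speaks HTTP(S) rather than FTP, so this sketch assumes the files are also reachable over an HTTP URL; the URL below is only a placeholder):

    import requests

    def job(url):
        file_name = url.split('/')[-1]
        r = requests.get(url, stream=True)  # stream=True: don't load the whole body at once
        r.raise_for_status()
        f = open(file_name, 'wb')
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
        f.close()

    job("https://www.sec.gov/Archives/edgar/data/12345/example.txt")  # placeholder URL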

Labels: python, download, urllib2
