Monday, 15 July 2013

download - Downloading a LOT of files using python

Is there a way to download a lot of files en masse using Python? My code is plenty speedy for downloading 100 or so files, but I need to download 300,000 files. They are little files (or I wouldn't be downloading 300,000 of them :) ), so the real bottleneck seems to be the loop. Does anyone have any thoughts? Maybe using MPI or threading?

Do I just have to live with this bottleneck? Or is there a faster way, maybe one not using Python?

(I have included my code in its entirety, for completeness' sake.)

    from __future__ import division
    import pandas as pd
    import numpy as np
    import urllib2
    import os
    import linecache

    # we start with a huge file of urls
    data = pd.read_csv("edgar.csv")
    datatemp2 = data[data['form'].str.contains("14a")]
    datatemp3 = data[data['form'].str.contains("14c")]

    # data2 is the cut-down file
    data2 = datatemp2.append(datatemp3)
    flist = np.array(data2['filename'])
    print len(flist)
    print flist

    ### below we have a script to download all of the files in the data2 database
    ### here you need to create a new directory named edgar14a14c in your cwd
    original = os.getcwd()  # note: strings have no .copy(); a plain assignment is enough
    os.chdir(os.getcwd() + '/edgar14a14c')
    for i in xrange(len(flist)):
        url = "ftp://ftp.sec.gov/" + str(flist[i])
        file_name = str(url.split('/')[-1])
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()
        print i

The usual pattern with multiprocessing is to create a job() function that takes arguments and performs the potentially CPU-bound work.

Example (based on your code):

    from multiprocessing import Pool
    import urllib2

    def job(url):
        file_name = str(url.split('/')[-1])
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()

    pool = Pool()
    urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
    pool.map(job, urls)

This does a number of things:

- create a multiprocessing Pool with as many workers as you have CPU(s) or CPU core(s)
- create a list of inputs for the job() function
- map the list of input urls to job() and wait for all jobs to complete
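
One caveat worth noting: on platforms that spawn rather than fork worker processes (notably Windows), the pool creation has to live under an if __name__ == '__main__': guard, because every worker re-imports the main module. A minimal sketch of that pattern, with a stand-in job() body:

    from multiprocessing import Pool

    def job(n):
        # stand-in for the real per-url download work
        return n * n

    if __name__ == '__main__':
        # without this guard, spawned workers would re-run the module
        # top level and try to create pools of their own
        pool = Pool()
        print pool.map(job, range(5))  # prints [0, 1, 4, 9, 16]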

Python's multiprocessing.Pool.map will take care of splitting your input across the number of workers in the pool.
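
To make that splitting concrete, here is a toy sketch (the worker count and chunksize are arbitrary demo values, not anything from the code above):

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)  # explicit worker count; the default is cpu_count()
        # map() slices range(10) into chunks of 2 and hands them out to the workers
        results = pool.map(square, range(10), chunksize=2)
        print results  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]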

Another neat little thing I've found useful for this kind of work is to use the progress library, like this:

    from multiprocessing import Pool
    from progress.bar import Bar

    def job(input):
        # do some work
        pass

    pool = Pool()
    inputs = range(100)
    bar = Bar('Processing', max=len(inputs))
    for i in pool.imap(job, inputs):
        bar.next()
    bar.finish()

This gives you a nice progress bar on your console as the jobs progress, so you have some idea of overall progress, ETA, etc.
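
Putting the progress bar together with the download job looks roughly like this (a sketch assuming the progress package is installed and that flist has been built from edgar.csv as in the question; the path in flist below is only a placeholder):

    import urllib2
    from multiprocessing import Pool
    from progress.bar import Bar

    def job(url):
        file_name = url.split('/')[-1]
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        f.write(u.read())
        f.close()

    if __name__ == '__main__':
        flist = ["edgar/data/12345/example.txt"]  # placeholder; built from edgar.csv in the question
        urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
        pool = Pool()
        bar = Bar('Downloading', max=len(urls))
        # imap yields results as each job finishes, so the bar advances per download
        for _ in pool.imap(job, urls):
            bar.next()
        bar.finish()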

I also find the requests library very useful here; it has a much nicer set of API(s) for dealing with web resources and downloading content.
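
For example, something along these lines (note that requests speaks HTTP(S) rather than FTP, so this sketch assumes the files are also reachable over an HTTP URL; the URL below is only a placeholder):

    import requests

    def job(url):
        file_name = url.split('/')[-1]
        r = requests.get(url, stream=True)  # stream=True: don't load the whole body at once
        r.raise_for_status()
        f = open(file_name, 'wb')
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
        f.close()

    job("https://www.sec.gov/Archives/edgar/data/12345/example.txt")  # placeholder URL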

Labels: python, download, urllib2
