GitHub - pallih/scraperwiki-scraper-vault

This repository contains the code for all public scrapers at scraperwiki.com, as of
04:56 PM on September 30, 2013.

It was created by pallih @ gogn.in / twitter.com/pallih

Some statistics:

Count of file extensions
Extension: (none) 34
Extension: .php 2968
Extension: .tar 4189
Extension: .rb 2653
Extension: .py 14245
Extension: .txt 1
Extension: .html 529
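
A tally like the one above can be reproduced with a short script. The following is only a minimal sketch, not necessarily how the original numbers were produced: the function name `count_extensions` and the idea of walking a local checkout directory are assumptions made for illustration.

```python
import os
from collections import Counter


def count_extensions(root):
    """Walk `root` recursively and tally file extensions.

    Hypothetical helper for illustration. Files without an extension
    are counted under the empty string.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext returns (stem, extension); the extension keeps its dot
            counts[os.path.splitext(name)[1]] += 1
    return counts
```

Calling `count_extensions` on the directory that holds the downloaded scrapers returns a `Counter` keyed by extension, which can then be printed one line per extension.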

User stats:

Number of users: 4189

Users with over 100 scrapers:
robksawyer 387
tlevine 243
buttub 231
frabcus 188
owl 172
ross 171
NicolaHughes 154
lostexpectation 150
psychemedia 147
pallih 138
paulbradshaw 137
aspeakman 136
pinakighosh 135
Toxicfly 130
DragonDave 121

Usernames for which retrieval failed (at least partially):
raus81
ccbcreg
jnarra
Julian_Todd

Count of Python module imports

peerreach 1
pylab 2
html.entities 2
tables 2
mmap 2
twisted.names 2
local 2
scrapy.utils 2
urllib.parse 2
selenium.webdriver.support.ui 2
jinja2 2
hotshot 2
nltk.tag 2
elementtidy 2
importlib 2
tweepy.streaming 2
rpy2.robjects.lib 2
zlib 2
stat 2
atexit 2
twill 2
msgpack 2
gdata.spreadsheet 2
colorsys 2
SpiderMonkey 2
email 2
GeoIP 2
twitter.oauth_dance 2
xml.etree 2
PIL 2
geopy.geocoders.google 2
PyQt4.QtGui 2
rpy2.robjects.packages 2
pdb 2
PyQt4.QtWebKit 2
scipy.stats 2
cld 2
robotparser 2
fom.session 2
gdata.docs 2
sitescraper 2
networkx.algorithms 2
twisted.web 2
pygments.formatters 2
Beautifulsoup 2
htmltable2matrix 2
pattern.search 2
repr 2
mimetools 2
freesteel.freesteelpy 2
suds 2
dateutil.tz 2
icalendar.cal 2
bz2 2
freesteel.savecontours 2
bs4.element 2
rpy2 2
getopt 2
config 2
PyQt4.QtCore 2
asynchat 2
pdfminer 2
readline 2
scrapy_utils 2
html.parser 2
lmx.html 2
twisted.internet 2
formatter 3
cartodb 4
pdfminer.cmapdb 4
smtplib 4
pattern.web 4
gdata.youtube.service 4
xlwt 4
pydot 4
Levenshtein 4
webscraping 4
twitter 4
scrapy.cmdline 4
struct 4
scrapely 4
matplotlib 4
scrapy.utils.misc 4
Image 4
doctest 4
ipdb 4
freesteel 4
unittest 4
nltk.metrics 4
community 4
scrapely.extraction 4
pygments 4
imp 4
nltk.book 4
twitter.oauth 4
pdftoxml 4
worker 4
email.utils 4
tidylib 4
scraper_utils 4
stdnum.isbn 4
shutil 4
gdata.youtube 4
mimetypes 4
urllib.request 4
textwrap 4
scrapely.htmlpage 4
hmac 4
googlemaps 6
functools 6
yaml 6
ckanclient 6
ssl 6
new 6
matplotlib.cbook 6
gdata.spreadsheet.service 6
pattern.en 6
inspect 6
nltk.collocations 6
commands 6
geopy.distance 6
xmltodict 6
gasp_helper 6
exceptions 6
getpass 6
lxml.builder 6
jellyfish 6
geopy.geocoders 6
scrapely.template 8
multiprocessing 8
matplotlib.ticker 8
timeit 8
warnings 8
pycurl 8
sets 8
pyPdf 8
pdfminer.pdfdevice 8
pattern.graph 8
pygments.lexers 8
argparse 8
w3lib.html 8
rfc822 9
optparse 10
gdata.docs.service 10
matplotlib.mlab 10
yql 10
contextlib 10
tarfile 12
networkx.readwrite 12
imposm.parser 12
bitlyapi 12
ClientForm 13
scipy 14
Queue 14
matplotlib.dates 14
twill.commands 14
gzip 14
selenium 16
xml.sax.saxutils 16
ast 16
glob 16
pyparsing 16
scraperwiki.geo 16
scrapy.spider 18
pdfminer.converter 18
pandas 18
cPickle 20
chardet 20
difflib 20
scraperwiki.utils 22
subprocess 22
oauth2 22
matplotlib.pyplot 22
dateutil.relativedelta 22
os.path 22
md5 23
xml.dom 23
htmllib 23
array 24
threading 25
pdfminer.pdfparser 26
pipe2py 28
pdfminer.layout 30
pdfminer.pdfinterp 32
pytz 35
rdflib 36
xml.etree.ElementTree 40
gc 42
turtle 42
nltk.corpus 42
openpyxl 42
html5lib 42
types 44
lxml.html.soupparser 44
xml.dom.minidom 44
nltk 44
scrapy.settings 45
networkx.readwrite.gexf 48
calendar 48
locale 48
unidecode 50
xml.etree.cElementTree 52
gviz_api 53
scraperwiki.metadata 54
pickle 54
HTMLParser 58
logging 59
scraperwiki.datastore 59
feedparser 62
cStringIO 62
decimal 63
zipfile 64
htmlentitydefs 66
traceback 69
copy 70
icalendar 70
scrapy.conf 72
hashlib 73
networkx 76
scrapy.http 79
codecs 80
numpy 80
tweepy 82
dateutil 92
demjson 98
scrapy.contrib.linkextractors.sgml 103
scrapy.contrib.loader 103
httplib2 112
operator 112
resource 114
scrapy.crawler 115
lxml.html.clean 116
socket 116
scrapy.selector 117
pyquery 121
uuid 123
pygooglechart 139
lxml.cssselect 141
collections 147
scrapy.xlib.pydispatch 160
sqlite3 161
httplib 170
unicodedata 174
base64 206
scrapy.contrib.spiders 206
scrapy.contrib.loader.processor 208
cookielib 226
scrapy.item 238
tempfile 240
itertools 263
math 269
scrapy 277
scrapemark 288
pprint 304
geopy 305
StringIO 312
cgi 361
xlrd 401
random 535
bs4 601
scraperwiki.sqlite 723
os 860
dateutil.parser 943
scraperwiki.apiwrapper 987
csv 1057
lxml 1063
string 1127
requests 1355
lxml.etree 1991
json 2062
sys 2140
urlparse 2257
mechanize 2300
time 2469
BeautifulSoup 3584
urllib 3867
datetime 5144
simplejson 5613
re 7120
urllib2 7700
lxml.html 10080
scraperwiki 25309
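
Import counts of this kind can be approximated by scanning each source file for `import` and `from ... import` statements. The sketch below is an assumption about one way to do it, not the method actually used for the figures above. It uses a regular expression rather than the `ast` module because many of the scrapers are Python 2 code that a Python 3 parser would reject. The names `IMPORT_RE` and `count_imports` are hypothetical.

```python
import re
from collections import Counter

# Matches "import a.b, c" or "from a.b import x" at the start of a line.
IMPORT_RE = re.compile(
    r"^\s*(?:import\s+(.+)|from\s+([\w.]+)\s+import\b)", re.MULTILINE
)


def count_imports(source):
    """Tally imported module names found in one source string.

    Hypothetical helper: a line-based approximation that ignores
    multi-line imports and imports hidden inside strings.
    """
    counts = Counter()
    for plain, from_module in IMPORT_RE.findall(source):
        if from_module:
            counts[from_module] += 1
            continue
        # "import a, b as c  # note" -> drop the comment, split on commas,
        # and discard any "as alias" suffix
        for part in plain.split("#")[0].split(","):
            name = part.strip().split(" as ")[0].strip()
            if name:
                counts[name] += 1
    return counts
```

Summing the counters returned for every `.py` file gives a per-module total in the same shape as the list above.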
