code reorg, testing and profiling framework #1

Open: wants to merge 9 commits into base: master
103 changes: 103 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@

The EMI project
===============

Overview
---------
This project exists to profile Python code used in a Natural Language Processing project.


Organization
-------------
The directory structure is as follows:

emi/
+-- build
|
+-- data
|
+-- dependencies
|
+-- dist
|
+-- emi.egg-info
|
+-- profiling
|
+-- README
|
+-- runProf.sh
|
+-- runTests.sh
|
+-- setup.py
|
+-- src
|
+-- test
|
+-- todo.org

with the major components simply being the src/, test/, and profiling/ directories. With test/, I will attempt to follow,
at least minimally, a test-driven development style, i.e. writing a failing test, writing the minimal code necessary to make
that test pass, then moving on.
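
As a rough illustration of that cycle (a sketch only, not the project's actual code): the test cases below are written first
and fail until a function satisfies them. The `tokenize` shown here is a hypothetical minimal implementation inferred from
the expectations in the test suite; the real function in src/ may differ.

```python
import unittest


def tokenize(line):
    # Hypothetical minimal implementation: split on any whitespace
    # (including \n and \r\n) and add sentence-boundary markers.
    return ["<s>"] + line.split() + ["</s>"]


class TokenizeSketchTest(unittest.TestCase):

    def test_empty_string(self):
        # Step 1: this test fails until tokenize() exists and wraps its output.
        self.assertEqual(tokenize(""), ["<s>", "</s>"])

    def test_line_endings(self):
        # Step 2: a new failing case drives the next small change.
        produced = tokenize("What is \nthis, \r\nlatin-1?")
        self.assertEqual(produced, ["<s>", "What", "is", "this,", "latin-1?", "</s>"])


if __name__ == "__main__":
    unittest.main()
```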

For profiling/, work has gone into investigating how we should best profile code for performance. To that end, I have
included basic support for the cProfile library, which ships with Python's standard library. I have also included
two third-party libraries, line_profiler and memory_profiler, which provide finer-grained information about the runtime
behavior and memory usage of a given program.
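
For orientation, here is a minimal sketch of what basic cProfile usage can look like, assuming Python 3. The helper
`profile_call` and the sample function `count_pairs` are hypothetical names made up for this example; they are not part
of the project.

```python
import cProfile
import io
import pstats
from collections import Counter


def count_pairs(tokens):
    # Hypothetical workload: count adjacent token pairs.
    return Counter(zip(tokens, tokens[1:]))


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in memory instead of printing to stdout.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
    return result, stream.getvalue()


if __name__ == "__main__":
    counts, report = profile_call(count_pairs, "a b a b".split())
    print(report)
```

cProfile reports per-function call counts and timings; line_profiler and memory_profiler need their own setup and are
not shown here.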


Setup
-----
Treating the emi project as a package, in good Python fashion, means including a setup.py script in the root. Practically,
this means that subdirectories (with __init__.py files) can import from each other without our having to edit the PYTHONPATH
variable to tell Python where to find our libraries. This, however, is more of a side effect of the overall setup.py
philosophy, which looks further ahead to deployment and shipping logistics. Thus the main output of running
setup.py is the creation and population of the build/, dist/, and emi.egg-info/ directories, which make the root look
busier than it really is.

Although we've pushed a version of the emi project that has already been "set up", you may periodically refresh
the state of the project by running the following from the root, as with any PyPI package:

$ python setup.py build
$ python setup.py install


Usage
-----
At the outset, or after making any changes to the project, you should run the following:

$ runTests.sh

which will hopefully tell you if you broke anything. Test support is currently flimsy and more demonstrative than useful;
that is, coverage is very low.

To track how well the program is running, you will want to make use of the runProf.sh script. This has been written as a small
Unix utility that accepts a few command-line arguments (which profiler to use, which functions to profile). It simply passes
those arguments to a Python script, which is written almost identically to the test script.
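
The argument handling on the Python side might look roughly like the sketch below. This is an assumption about the shape
of such a script, not a description of the actual one: the `--profiler` flag, its choice labels, and the positional
`functions` argument are all hypothetical.

```python
import argparse

# Hypothetical labels for the three profilers mentioned in this README.
PROFILERS = ("cprofile", "line", "memory")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Profile selected functions with a chosen profiler.")
    parser.add_argument("--profiler", choices=PROFILERS, default="cprofile",
                        help="which profiler to use (default: cprofile)")
    parser.add_argument("functions", nargs="*",
                        help="names of the functions to profile")
    return parser


def parse_args(argv=None):
    # argv=None makes argparse read sys.argv[1:], i.e. whatever the
    # wrapping shell script forwards.
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.profiler, args.functions)
```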

Still working out some kinks.

Overall, we will probably home in on a single use case for the profiling, and it could be that most of the intended features
are dropped in favor of a simpler but more direct profiling methodology.


Miscellaneous
-------------
I keep track of goals and progress in the todo.org file kept in the root directory. Org-mode is an editing mode built into
Emacs that offers support for formatting book-keeping files, such as to-do lists. In Emacs, this means there's a lot of
interactivity that comes out of the box, e.g. expanding and collapsing lists and headings, moving around the file like it's
a directory editor, etc., which is lost in any other text editor, or even an older version of Emacs.

*a moment of silence for those people not using emacs*.

If you don't care about those facets of the project (which you probably don't), then feel free to ignore it.


Contact Info
-------------

+ Richard Futrell : [email protected]

+ Chad Kringen : [email protected]
Empty file added build/lib/src/__init__.py
Empty file.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file added build/lib/test/__init__.py
Empty file.
64 changes: 64 additions & 0 deletions build/lib/test/main_unittest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@

import unittest
import sys

from src import count_skipgrams as skip


class count_skipgramTest( unittest.TestCase ):

    def setUp(self):
        pass

    # # tokenize battery
    def test_tokenize_SmallString(self):
        s = "the dog ran quickly across the field"
        produced = skip.tokenize( s )
        target = ['<s>', 'the', 'dog', 'ran', 'quickly', 'across', 'the', 'field', '</s>']
        self.assertEqual( produced, target )


    def test_tokenize_EmptyString(self):
        s = ""
        produced = skip.tokenize( s )
        target = ['<s>','</s>']
        self.assertEqual( produced, target )


    def test_tokenize_NonAscii(self):
        s = "Hark, a string with an extended charset µ !"
        produced = skip.tokenize( s )
        target = ['<s>', 'Hark,', 'a', 'string', 'with', 'an', 'extended', 'charset', 'µ', '!', '</s>']
        self.assertEqual( produced, target )


    def test_tokenize_LineEndings(self):
        s = "What is \nthis, \r\nlatin-1?"
        produced = skip.tokenize( s )
        target = ['<s>', 'What', 'is', 'this,', 'latin-1?', '</s>']
        self.assertEqual( produced, target )


    # # not sure how to test this
    # def test_ichunks_SmallString(self):
    #     s = "the dog ran quickly across the field".split( )
    #     islice_generator = skip.ichunks(s,3)
    #     ans = ['<s>', 'What', 'is' ]

    #     ctr = 0
    #     for i in ans:
    #         for j in i:
    #             self.assertEqual( j, ans[ctr])
    #             ctr += 1



    def tearDown(self):
        pass



if __name__ == "__main__":

    unittest.main( )

File renamed without changes.
Binary file added data/prepro.tgz
Binary file not shown.
1 change: 1 addition & 0 deletions data/prepro/.history.buck
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
2014-11-22.17-01-48.thor rm .history.buck
11 changes: 11 additions & 0 deletions data/prepro/README
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
How to preprocess your data to match the LMs.

For all languages but English use this:

(Example Spanish)
cat el_data.txt | ./prepro_post_dedupe.sh es | ./prepro_tokenize.sh es truecasemodels/truecase-model.es

For English do this:

cat the_text.txt | ./prepro_post_dedupe_en.sh en | ./prepro_tokenize_en.sh en truecasemodels/truecase-model.en

100 changes: 100 additions & 0 deletions data/prepro/en.sample
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
df6fa1abb58549287111ba8d776733e9 0.000000 http://www.unionjackstable.com/servlet/the-408/NSF%2C-Columbia%2C-hospital%2C-school%2C/Detail
Search Catalog
Shopping Cart
Customer Service
Store Home
Shop Our Store
Stainless Steel Food Approved Products
Metal Detectable Products for Food Industry
Food-Grade Products
Winery & Vineyard Products
Solid & Vented Bins
Garden & Stable Products
Water Nozzles & Hose Adapters
Brewery & Distillery Products
Harvest Shears, Pruners & Loppers
Rakes
Paddles & Stirrers
Stainless Steel Sinks & Wash Stations
Brushes
Shovels
Squeegees
Forks
Anti-Microbial Cleaning Products
Automatic Shoe Cover Dispenser
Buckets & Pails
Cheese Making Products
Funnel
Hoe
Measuring Utensils
Sampling Dippers
Scoops
Scrapers
Strainers
Wall Tool Brackets
Media & Press
PR
Releases
Videos
In
The News
Sales
Sheets
Shopping Cart:
Empty
Small Bowl Wall Mount Stainless Sink- Knee Operated -Gooseneck Faucet
SKU: C505
Hands-free design for use in the food service and processing industry.

Features :
Double knee operated valve with connecting tube and fittings
Swivel gooseneck spout faucet with aerator
Soap dispenser, pump type, 16 oz. with holder
Strainer with basket
Wall mounting bracket
Specially designed knee valves provide easy access to working parts without disconnecting mount or plumbing.
Columbia Products
PRICE: 
$599.95
Quantity:
Return to Catalog
Stay Informed on the latest Food-Grade Tools!
Enter Email address below & receive our FREE monthly e-news
Search
 | View Cart  | Checkout
 | About
Us
 | Service
 | Policies
 | Home
winery
& orchard | stable
& farm | garden
& compost | metal detectable | food-grade | safety & hazmat
Copyright © 2012
Union Jack . All Rights Reserved.


df6fa1abb58549287111ba8d776733e9 0.000000 http://leftysporn.com/tag/japanese-porn/page/2/
Lefty’s Free Porn Sites
Free Classic Porn
Free Porn Pictures
Galleries
Contact Lefty
search
skip to content ↓
Lefty's Porn
Free Porn Movies and Pictures
Home

df6fa1abb58549287111ba8d776733e9 2.000000 http://leftysporn.com/tag/japanese-porn/page/2/
Lefty on Nov.08, 2010, under Free Porn Movies
Pardon the political incorrectness of this stereotypical observation, but I like the Japanese. Great technology, hot girls, wild game shows, overt – yet restrained -sexuality (think race queens) and now, something called Penis Worship.
Aside from sensuality of this handjob, we have the added interracial factor, which takes this already-amazing video up a notch. Click the arrow to watch. Then click here to see more smokin’ erotic massage videos at Hegre-Art .
Leave a Comment
: Asian porn , babes , beautiful women , big black cock , handjob , Japanese porn , massage
more...
Bold Japanese Public Sex
by Lefty on Sep.29, 2010, under Free Porn Movies
Everybody on the train gets to see these two fucking. Now we can watch, too!
Oh, those wacky Japanese. If they aren’t creating some kind of goofball game show, they are taking girls out in public for some wild exhibitionistic sex. Maybe you have seen some of those videos where girls are forced to have sex in public with roving gangs of thugs who take advantage of the fact that they cannot get away from them on crowded buses or trains. I know that most of them are set ups, but they still smell wrong to me. I guess rape porn just isn’t my thing.
5 changes: 5 additions & 0 deletions data/prepro/nonbreaking_prefixes/README.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
The language suffix can be found here:

http://www.loc.gov/standards/iso639-2/php/code_list.php


75 changes: 75 additions & 0 deletions data/prepro/nonbreaking_prefixes/nonbreaking_prefix.ca
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
Dr
Dra
pàg
p
c
av
Sr
Sra
adm
esq
Prof
S.A
S.L
p.e
ptes
Sta
St
pl
màx
cast
dir
nre
fra
admdora
Emm
Excma
espf
dc
admdor
tel
angl
aprox
ca
dept
dj
dl
dt
ds
dg
dv
ed
entl
al
i.e
maj
smin
n
núm
pta
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z