\chapter{Scraps}
Text that will be cleaned up and integrated somewhere.
\section{HTML Data Cleaning}
In this work we are interested in finding parallel sentences in multilingual
websites. A non-trivial step in this process is identifying sections of a website
which contain useful text, while removing ``boilerplate'' information such as
menus. This task, which is a common part of many systems that use the web as a
corpus, has been explored in the CleanEval competition \citep{CleanEval}. This
section will explore the different approaches used for boilerplate
identification in CleanEval and other work.
\subsection{CleanEval}
CleanEval \citep{CleanEval} is a shared task on cleaning web pages.
% ------------------------------------------------------------------------------
\section{Web Mining Pipeline}
\paragraph{Language Identification}
The documents from the previous step are annotated with their language, which
was identified based only on the URL. Since searching for language codes in a
URL causes many false positives (the two-letter language code for Indonesian is
``id'', which appears in many URLs for unrelated reasons), we apply a language
identification system to the text of the document to ensure that the URL
matching was correct.
The method we use for language identification is Prediction by Partial Matching
(PPM) \citep{Cleary84}.\footnote{Thanks to Paul McNamee
({\tt [email protected]}) for providing the code and data.} For each
language, we train a character $n$-gram model on text in that language. At test
time, we apply each language's model to the text of a document, and the model
that assigns the highest score to the text determines the document's language.
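The following is a minimal sketch of this scoring step. It uses a simple
add-one-smoothed character $n$-gram model rather than full PPM, and the class
and function names are hypothetical rather than taken from the pipeline itself.
\begin{verbatim}
import math
from collections import Counter

# Illustrative character n-gram scorer; a simplification of the PPM
# approach described above, not the original implementation.
class CharNgramModel:
    def __init__(self, order=4):
        self.order = order
        self.counts = Counter()          # n-gram counts
        self.context_counts = Counter()  # (n-1)-gram counts

    def train(self, text):
        padded = " " * (self.order - 1) + text
        for i in range(len(text)):
            ngram = padded[i:i + self.order]
            self.counts[ngram] += 1
            self.context_counts[ngram[:-1]] += 1

    def log_score(self, text):
        # Add-one smoothed log probability of the text under this model.
        padded = " " * (self.order - 1) + text
        vocab = len(self.counts) + 1
        score = 0.0
        for i in range(len(text)):
            ngram = padded[i:i + self.order]
            num = self.counts[ngram] + 1
            den = self.context_counts[ngram[:-1]] + vocab
            score += math.log(num / den)
        return score

def identify_language(text, models):
    # 'models' maps a language code to a trained CharNgramModel;
    # the language whose model scores the text highest wins.
    return max(models, key=lambda lang: models[lang].log_score(text))
\end{verbatim}
In practice, the per-language models are trained once on monolingual text and
then reused to score every document in the crawl.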
\NoteJS{Paul's data came from a source that can't be redistributed. I could
re-train models on Wikipedia to make the whole pipeline more reproducible, and
this would let me have a small results section here.}
\NoteJS{Given Wikipedia, we could train binary classifiers for each language,
and that may be more effective. I'll come back to this if language ID becomes a
problem.}
\paragraph{Sentence Filtering}
Since we do not perform any boilerplate removal in earlier steps, the pipeline
produces many sentence pairs that contain menu items or other fragments of text
that are not useful to an MT system. To remove this data, we discard a segment
pair unless both segments contain at least 5 tokens composed only of
alphanumeric characters and end with punctuation.
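A sketch of this filter is given below. The sentence-final punctuation set and
the restriction to ASCII alphanumeric characters are assumptions, since the
description above does not pin either down, and the helper names are
hypothetical.
\begin{verbatim}
import re

ALNUM_TOKEN = re.compile(r"^[0-9A-Za-z]+$")   # purely alphanumeric token
FINAL_PUNCT = re.compile(r"[.!?]$")           # assumed punctuation set

def keep_segment(segment):
    # Keep a segment if it has >= 5 alphanumeric-only tokens
    # and ends with sentence-final punctuation.
    tokens = segment.split()
    alnum_tokens = [t for t in tokens if ALNUM_TOKEN.match(t)]
    return len(alnum_tokens) >= 5 and bool(FINAL_PUNCT.search(segment.rstrip()))

def filter_pairs(pairs):
    # 'pairs' is an iterable of (source, target) segment strings;
    # a pair survives only if both sides pass the check.
    return [(s, t) for s, t in pairs if keep_segment(s) and keep_segment(t)]
\end{verbatim}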