kimryan/perl6-Lingua-EN-Sentence

NAME

Lingua::EN::Sentence - Module for splitting text into sentences.

SYNOPSIS

use Lingua::EN::Sentence;
add_acronyms(' Lt Gen');  ## adding support for 'Lt. Gen.' 

my $text = Q[First sentence with some abbreviations,  Mr. J. Smith, 2 Jones St. SomeTown Ariz. U.S.A. is an address.
Sentence 2: Sequences like ellipsis ... are handled. Sentence 3, numbered sections such as point 1. are ok.];
my @sentences = $text.sentences;
for @sentences -> $sent {
    say $sent;
}

Output is:

First sentence with some abbreviations,  Mr. J. Smith, 2 Jones St. SomeTown Ariz. U.S.A. is an address.
Sentence 2: Sequences like ellipsis ... are handled.
Sentence 3, numbered sections such as point 1. are ok.

DESCRIPTION

The Lingua::EN::Sentence module provides the method sentences, which splits text into its constituent sentences using regular expressions, a list of abbreviations (built in and user-supplied) and other rules.

Certain well-known exceptions, such as the abbreviations Mr., Calif. and Ave., can cause incorrect segmentation. Many of these are already built into the module and are handled correctly. Note that abbreviations are case-sensitive.

The add_acronyms method allows you to add custom abbreviations.

ALGORITHM

Before any regex processing, quotations are hidden away and reinserted after the sentences are split. This means that no sentence splitting is attempted between pairs of double quotes. Common cases of full stops that do not denote the end of a sentence are also hidden. These include the dot after the abbreviations mentioned above, acronyms and ellipses.
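The hide-and-restore step described above can be sketched as follows. This is a minimal illustration, not the module's own code: the subs hide-quotes and restore-quotes, the placeholder delimiters and the naive splitter are all assumptions made for the example.

```raku
# Illustrative only: these placeholder delimiters are an assumption.
my $open  = "\x[2]";
my $close = "\x[3]";

# Replace each "..." span with an indexed placeholder and remember the original.
sub hide-quotes(Str $text, @store --> Str) {
    $text.subst(/ '"' <-["]>* '"' /, -> $m {
        @store.push(~$m);
        $open ~ @store.end ~ $close;
    }, :g);
}

# Put the remembered spans back in place of their placeholders.
sub restore-quotes(Str $text, @store --> Str) {
    $text.subst(/ $open (\d+) $close /, -> $m { @store[+$m[0]] }, :g);
}

my @store;
my $masked = hide-quotes('He said "Stop. Now." and left. She stayed.', @store);

# Naive split on sentence-ending punctuation followed by whitespace.
# The full stop inside the quotation is hidden, so it cannot trigger a split.
my $marked    = $masked.subst(/ (<[.?!]>) \s+ /, -> $m { $m[0] ~ "\n" }, :g);
my @sentences = $marked.lines.map({ restore-quotes($_, @store) });

.say for @sentences;
```

With this input the sketch yields two sentences, and the quoted "Stop. Now." stays inside the first one.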

Basically, I use a 'brute' regular expression to split the text into sentences. (Well, nothing is yet split - I just mark the end-of-sentence). Then I look into a set of rules which decide when an end-of-sentence is justified and when it's a mistake. In case of a mistake, the end-of-sentence mark is removed.

What are such mistakes? Cases of abbreviations, for example. I have a list of such abbreviations (see the 'Acronym/Abbreviations list' section), and more general rules (for example, the abbreviations 'i.e.' and 'e.g.' need not be in the list, as a special rule takes care of all single-letter abbreviations).
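The mark-then-repair idea can be sketched in a few lines of Raku. This is a simplified illustration under stated assumptions, not the module's implementation: the sub split-sentences, the marker value and the five-entry abbreviation list are hypothetical, and quotations and ellipses are not handled here.

```raku
# Hypothetical sketch of mark-then-repair; names and values are assumptions.
my $EOS = "\x[1]";                        # marker assumed absent from the text
my @abbreviations = <Mr Mrs Dr St Ave>;   # tiny, case-sensitive sample list

sub split-sentences(Str $text --> List) {
    # Step 1: brute-force marking after '.', '?' or '!' followed by whitespace.
    my $marked = $text.subst(/ (<[.?!]>) \s+ /, -> $m { $m[0] ~ $EOS }, :g);

    # Step 2: a mark right after a listed abbreviation is a mistake, so remove it.
    for @abbreviations -> $abbr {
        $marked = $marked.subst(/ << $abbr '.' $EOS /, $abbr ~ '. ', :g);
    }

    # Step 3: general rule, a mark after a single-letter abbreviation is removed.
    $marked = $marked.subst(/ << (<[A..Za..z]>) '.' $EOS /, -> $m { $m[0] ~ '. ' }, :g);

    # Only now is the text actually split on the surviving marks.
    $marked.split($EOS).map(*.trim).grep(*.chars).List;
}

.say for split-sentences('Mr. J. Smith lives on Jones St. in SomeTown. He is happy! Are you?');
```

The sample text comes back as three sentences: the marks after "Mr.", "J." and "St." are removed in steps 2 and 3, while those after "SomeTown." and "happy!" survive.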

FUNCTIONS

$text.sentences

A very convenient extension to the Perl6 Str string type, the .sentences method allows us to natively request the sentences in a string, similarly to the Str "words" method.

The sentences method is called on a Str containing the text and returns an array of the sentences that the text has been split into.

Returned sentences are trimmed of whitespace at the beginning and end.

Strings with no alphanumeric characters in them are not returned as sentences.
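The two post-processing rules above (trimming, and dropping strings without alphanumeric characters) can be illustrated with a small standalone sub. The name clean-sentences is hypothetical and the module may implement these rules differently; here "alphanumeric" is taken to mean any Unicode letter or number.

```raku
# Hypothetical helper: trim each candidate and keep only those that
# contain at least one letter or number.
sub clean-sentences(*@candidates --> List) {
    @candidates.map(*.trim).grep(/ <:L + :N> /).List;
}

.say for clean-sentences("  Hello there.  ", "...", " !? ", "42.");
```

Only "Hello there." and "42." remain; the two punctuation-only strings are dropped.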

add_acronyms( @acronyms )

This function adds acronyms that are not already supported by this code. See the 'Acronym/Abbreviations list' section for the abbreviations already supported by this module.

get_acronyms()

This function will return the defined list of acronyms.

set_acronyms( @my_acronyms )

This function replaces the predefined acronym list with the given list.

get_EOS()

This function returns the value of the string used to mark the end of sentence. You might want to see what it is, and to make sure your text doesn't contain it. You can use set_EOS() to alter the end-of-sentence string to whatever you desire.

set_EOS( $new_EOS_string )

This function alters the end-of-sentence string used to mark the end of sentences.
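Because the marker must not occur in the input, it can be worth checking the text before splitting. The sketch below is standalone: choose-marker and the candidate strings are hypothetical, and the call that would hand the result to set_EOS is left as a comment because it needs the module to be loaded.

```raku
# Hypothetical helper: return the first candidate marker that does not
# occur in the text, or die if every candidate is present.
sub choose-marker(Str $text, *@candidates --> Str) {
    @candidates.first({ !$text.contains($_) })
        // die "every candidate marker occurs in the text";
}

my $text   = "Plain text\x[1] with a stray control character. Second sentence.";
my $marker = choose-marker($text, "\x[1]", "\x[1E]", "<EOS>");

# With Lingua::EN::Sentence loaded, the marker could then be installed:
# set_EOS($marker);
```

Here the first candidate is rejected because it appears in the text, so the second one is chosen.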

Acronym/Abbreviations list

Use the get_acronyms() function to retrieve the current list of acronyms. The list has become too long to reproduce in the documentation.

If I come across a good general-purpose list - I'll incorporate it into this module. Feel free to suggest such lists.

Limitations

Some valid cases cannot be detected. For example, "This belongs to John A. Smith" will break after "A.", because it cannot be distinguished from a valid sequence such as "So said I. Next sentence." Also, a sentence that ends in an acronym, such as "St.", does not cause a split.

AUTHOR

Deyan Ginev, 2013. Kim Ryan, 2023

Perl5 CPAN author: Shlomo Yona ([email protected])

Released under the same terms as Perl 6; see the LICENSE file for details.

About

Perl6 Port of the Lingua::EN::Sentence CPAN module
