twitter-korean-text

An open-source Korean text processor made by Twitter.

  • As of early 2017, an official fork lives at http://openkoreantext.org. All development after version 4.4 takes place in open-korean-text.

twitter-korean-text is a Scala library for processing Korean text, with a Java wrapper. It supports four features: normalization, tokenization, stemming, and phrase extraction. It handles long documents as well as short tweets.

The goal of twitter-korean-text is to extract index terms from large datasets through lightweight Korean processing. It does not aim for complete morphological analysis.

Please join our community at Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.

The sample sentence 한국어를 처리하는 예시입니다 means "This is an example of processing Korean."

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

  • 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

Tokenization

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입Adjective, 니다Eomi, ㅋㅋKoreanParticle

Stemming (입니다 -> 이다)

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle

Phrase extraction

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시
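
The repeated-character part of normalization (ㅋㅋㅋㅋㅋ -> ㅋㅋ) can be illustrated with a minimal, self-contained Java sketch. This is not the library's normalizer, which also repairs forms such as 입니닼 -> 입니다. The class name RepeatCollapseSketch and the rule "shorten any run of three or more identical characters to two" are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of twitter-korean-text.
final class RepeatCollapseSketch {
  // One character followed by two or more copies of itself.
  private static final Pattern RUN = Pattern.compile("(.)\\1{2,}");

  // Shortens every run of three or more identical characters to exactly two.
  static String collapse(String text) {
    return RUN.matcher(text).replaceAll("$1$1");
  }

  public static void main(String[] args) {
    System.out.println(collapse("예시입니다ㅋㅋㅋㅋㅋ"));
    // 예시입니다ㅋㅋ
  }
}
```

Runs of one or two characters pass through unchanged, so ordinary doubled letters are not affected.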

Introductory Presentation: Google Slides

Try it here

Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/

Open-sourced here: twitter-korean-tokenizer-api

API

scaladoc

mavendoc

Maven

To include this library in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>com.twitter.penguin</groupId>
    <artifactId>korean-text</artifactId>
    <version>4.4</version>
  </dependency>

The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the Scaladocs are at http://twitter.github.io/twitter-korean-text/scaladocs/

Support for other languages

.NET

modamoda kindly offered a .NET wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS

Node.js

Ch0p kindly offered a Node.js wrapper: twtkrjs

Youngrok Kim kindly offered a Node.js wrapper: node-twitter-korean-text

Python

Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py

Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean

  • The Python Korean NLP project KoNLPy now includes twitter-korean-text through twkorean, which makes it easy to use from Python.

Ruby

jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby

  • This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).

Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby

  • This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).

Elasticsearch

socurites built a Korean analyzer for Elasticsearch based on twitter-korean-text: tkt-elasticsearch

Get the source

Clone the Git repository and build it with Maven.

git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile

Then open 'pom.xml' in your favorite IDE.

Usage

You can find these examples in the examples folder.

from Scala

import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)
    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}

from Java

import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]
  }
}
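
Each token and phrase in the sample output prints as text(Tag: offset, length). The self-contained Java sketch below shows how those two numbers map back onto the normalized string. It does not call the library: the class TokenSpanSketch and the method surface are hypothetical names, and the offsets are copied from the sample output rather than computed.

```java
// Hypothetical illustration; it does not depend on twitter-korean-text.
final class TokenSpanSketch {
  // Normalized text taken from the usage example.
  static final String NORMALIZED = "한국어를 처리하는 예시입니다ㅋㅋ #한국어";

  // Returns the slice of the text that an (offset, length) pair covers.
  static String surface(String text, int offset, int length) {
    return text.substring(offset, offset + length);
  }

  public static void main(String[] args) {
    System.out.println(surface(NORMALIZED, 5, 2));   // 처리
    System.out.println(surface(NORMALIZED, 5, 7));   // 처리하는 예시
    System.out.println(surface(NORMALIZED, 12, 3));  // 입니다 (the stemmed token reads 이다)
    System.out.println(surface(NORMALIZED, 18, 4));  // #한국어
  }
}
```

The third call shows why a stemmed token keeps its offsets: the dictionary form 이다 still points at the three characters 입니다 in the input, which is useful for highlighting matches in the original text.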

Basics

TwitterKoreanProcessor.scala defines the central object, which exposes all of the library's features.

Running Tests

Run mvn test to execute all unit tests.

Tools

We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.

Contribution

Refer to the general contribution guide. We will add a project-specific contribution guide later.

Detailed guide to installing and modifying the project

Performance

Tested on an Intel i7 at 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse one chunk (eojeol): 0.12 ms

Tweets (average length ~50 characters)

Tweets    Time in seconds
100K      57.59
200K      112.09
300K      165.05
400K      218.11
500K      270.54
600K      328.52
700K      381.09
800K      439.71
900K      492.94
1M        542.12

Average per tweet: 0.54212 ms
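
The per-tweet average follows from the last row of the table: 542.12 seconds over 1,000,000 tweets is 0.54212 ms per tweet. A small Java sketch of that arithmetic, using only the figures listed above (the class and method names are illustrative, not part of the project):

```java
// Illustrative arithmetic over the benchmark figures listed above.
final class BenchmarkMathSketch {
  // Cumulative tweet counts and elapsed seconds, copied from the table.
  static final int[] TWEETS = {100_000, 200_000, 300_000, 400_000, 500_000,
                               600_000, 700_000, 800_000, 900_000, 1_000_000};
  static final double[] SECONDS = {57.59, 112.09, 165.05, 218.11, 270.54,
                                   328.52, 381.09, 439.71, 492.94, 542.12};

  // Average milliseconds per tweet for row i of the table.
  static double millisPerTweet(int i) {
    return SECONDS[i] * 1000.0 / TWEETS[i];
  }

  // Tweets processed per second for row i of the table.
  static double tweetsPerSecond(int i) {
    return TWEETS[i] / SECONDS[i];
  }

  public static void main(String[] args) {
    int last = TWEETS.length - 1;
    System.out.printf("%.5f ms per tweet%n", millisPerTweet(last));
    System.out.printf("%.0f tweets per second%n", tweetsPerSecond(last));
  }
}
```

Calling millisPerTweet on the earlier rows gives values between roughly 0.54 and 0.58 ms, so the per-tweet cost stays close to constant as the input grows.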

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.2/morph/

Author(s)

License

Copyright 2014 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0