
Electronic-Old-Persian-Library/Old-Persian-Dataset


Raw dataset for Old Persian cuneiform

Dear contributors, please be aware that there are several different cuneiform languages. Elamite, Babylonian and Old Persian are among the best known; this project works on Old Persian. The differences are shown below:

types of cuneiform

(Photo taken at the National Museum of Iran: the gold plate of King Darius)

Data structure:

/imagedata/

 /source/
        /king/
           source_king_001.jpg
        
  #example:
  
  /behistun/
       /darius_1/
           behistun_darius_1_001.jpg

/textdata/

  /eng_transcription_to_english/
       /metadata/
       eng_transcription_to_english_001.json
       
  /eng_transliteration_to_english/
       /metadata/
       eng_transliteration_to_english_001.json
       
  /single/
      /metadata/
      /eng_transliteration/
            eng_transliteration_001.json

              
   # "single" refers to text data that has no accompanying translation
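The naming scheme in the tree above can be sketched in Python as follows. This is only an illustration: the helper names `image_path` and `text_path` are hypothetical, and the three-digit zero-padded index is an assumption inferred from the example file names, not a documented rule.

```python
from pathlib import PurePosixPath


def image_path(source: str, king: str, index: int, root: str = "imagedata") -> PurePosixPath:
    """Build <root>/<source>/<king>/<source>_<king>_<NNN>.jpg.

    Hypothetical helper; the 3-digit padding is assumed from the examples.
    """
    file_name = f"{source}_{king}_{index:03d}.jpg"
    return PurePosixPath(root) / source / king / file_name


def text_path(kind: str, index: int, root: str = "textdata") -> PurePosixPath:
    """Build <root>/<kind>/<kind>_<NNN>.json for the paired text folders.

    Hypothetical helper; it does not cover the nested "single" folder.
    """
    file_name = f"{kind}_{index:03d}.json"
    return PurePosixPath(root) / kind / file_name


if __name__ == "__main__":
    print(image_path("behistun", "darius_1", 1))
    print(text_path("eng_transcription_to_english", 1))
```

Using `PurePosixPath` keeps the forward-slash layout shown in the tree regardless of the operating system the script runs on.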

Old Persian text can be rendered in Latin script in more than one way, for example by transliteration or by transcription. The example below shows the difference between them:

transliteration_transcription
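As a rough illustration of the sign-by-sign idea behind transliteration, the sketch below reads the Unicode character names of Old Persian signs with Python's standard `unicodedata` module. It is an assumption-laden approximation: the function name `transliterate` is hypothetical, and Unicode names (for example `SHA`) are not the same as the scholarly conventions used in this dataset.

```python
import unicodedata

SIGN_PREFIX = "OLD PERSIAN SIGN "
WORD_DIVIDER = "OLD PERSIAN WORD DIVIDER"


def transliterate(text: str) -> str:
    """Approximate a sign-by-sign transliteration from Unicode names.

    Signs within a word are joined with "-", words are separated by a
    space. Characters outside these two name patterns (for example
    Old Persian numbers) are skipped in this sketch.
    """
    words = [[]]
    for ch in text:
        name = unicodedata.name(ch, "")
        if name == WORD_DIVIDER:
            words.append([])  # start a new word
        elif name.startswith(SIGN_PREFIX):
            words[-1].append(name[len(SIGN_PREFIX):].lower())
    return " ".join("-".join(word) for word in words if word)


if __name__ == "__main__":
    # U+103A0, U+103A1, U+103A2 are the vowel signs A, I, U.
    print(transliterate("\U000103A0\U000103A1\U000103A2"))
```

A transcription, by contrast, would merge the signs into a normalized word form, which needs linguistic rules that this sketch does not attempt.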

Metadata

Each directory includes a "source.metadata.csv" file that describes its data.

Explanation of the metadata columns:

imagedata:

source: The source that I took the data from.

abbreviation: The name of the inscription.

location: The location where the inscription was originally discovered.

translation: 1 if I have a translation of the inscription, 0 if I do not.

collection: The place where the inscription is currently held.

artifact_id: The artifact_id from the CDLI reference.

asset_number: The asset_number from the British Museum collection.

museum_number: The museum_number from the British Museum collection.


textdata:

abbreviation: The name of the inscription.

reference: The reference that I took the data from.

location: The location where the inscription was originally discovered.

image: 1 if I have an image of the inscription, 0 if I do not.

artifact_id: The artifact_id from the CDLI reference.
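The 0/1 flag columns described above (translation for imagedata, image for textdata) make it easy to filter the metadata. The sketch below is a minimal example using Python's standard `csv` module. It assumes a comma-separated file with a header row that uses the column names listed above; the rows in `SAMPLE` are made-up placeholders, not real entries from the dataset, and `rows_with_flag` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows for illustration only; NOT real dataset entries.
SAMPLE = """abbreviation,reference,location,image,artifact_id
XXa,placeholder reference,placeholder location,1,PLACEHOLDER_ID
XXb,placeholder reference,placeholder location,0,
"""


def rows_with_flag(csv_file, column: str) -> list:
    """Return the rows whose 0/1 flag column (e.g. "image") equals 1."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[column].strip() == "1"]


if __name__ == "__main__":
    for row in rows_with_flag(io.StringIO(SAMPLE), "image"):
        print(row["abbreviation"], row["location"])
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.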

References

Data pipeline

In the first stage, an OCR model converts Old Persian cuneiform into English transcription text. In the second stage, that transcription text is the input to an NLP model or large language model (LLM), which acts as a machine translation model and converts it into modern languages.

data pipeline (copy)
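The two stages described above can be wired together as in the following sketch. It only shows the data flow: `run_pipeline`, `fake_ocr` and `fake_translate` are hypothetical names, and the two stub functions stand in for the OCR and translation models, which are not part of this example.

```python
from typing import Callable, Dict


def run_pipeline(
    image_path: str,
    ocr: Callable[[str], str],
    translate: Callable[[str], str],
) -> Dict[str, str]:
    """Stage 1: image -> transcription. Stage 2: transcription -> translation."""
    transcription = ocr(image_path)
    translation = translate(transcription)
    return {
        "image": image_path,
        "transcription": transcription,
        "translation": translation,
    }


def fake_ocr(image_path: str) -> str:
    # Placeholder for an OCR model; returns a dummy string.
    return f"<transcription of {image_path}>"


def fake_translate(transcription: str) -> str:
    # Placeholder for an NLP/LLM translation model; returns a dummy string.
    return f"<translation of {transcription}>"


if __name__ == "__main__":
    result = run_pipeline("behistun_darius_1_001.jpg", fake_ocr, fake_translate)
    print(result["translation"])
```

Passing the stages in as callables keeps the two models independent, so either one could be swapped without touching the other.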

Glossary

Behistun:بیستون

Susa:شوش

Persepolis:پرسپولیس(تخت جمشید)

Elamite:ایلامی

Babylonian:بابِلی

Cyrus:کوروش

Xerxes:خشایار

Artaxerxes:اردشیر

𐎠𐎢𐎼𐎶𐏀𐎡𐎠:اهورامزدا

LICENSE

This repository is released under the CC-BY-NC license; any commercial use is prohibited.