Transform raw txt data to h5 #50

Open
hellokitty753159 opened this issue Aug 4, 2020 · 5 comments

Comments

@hellokitty753159

[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]
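Each of my data items is a multi-level nested list like the one above. I need to convert the data to h5 format so that it can be used directly with your program, but np.array cannot handle this.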
def save_hdf5(vecs, filename):
    '''save the processed data into a hdf5 file'''
    f = tables.open_file(filename, 'w')
    filters = tables.Filters(complib='blosc', complevel=5)
    earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(), shape=(0,), filters=filters)
    indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
    pos = 0
    line = 1
    for x in vecs:
        print(line)
        earrays.append(numpy.array(x))
        ind = indices.row
        ind['pos'] = pos
        ind['length'] = len(x)
        ind.append()
        pos += len(x)
        line = line + 1
    f.close()
How should I modify this code? Thanks.

@guxd
Owner

guxd commented Aug 4, 2020

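The existing code only supports linear (flat) serialization, not nesting. You can treat your square brackets as tokens as well, so that each item becomes a single flat sequence.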
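One way to read the suggestion above is sketched here: every opening and closing bracket of the nested list is replaced by a reserved integer, which gives a flat sequence that np.array and a 1-D earray can hold. The marker values `OPEN` and `CLOSE` and the helper names `flatten` and `unflatten` are assumptions made for illustration, not part of the repository. Any two ids that never occur in the real data would do.

```python
# Hypothetical reserved ids standing in for '[' and ']'.
# They must not collide with real token ids and must fit the atom type (Int16 here).
OPEN, CLOSE = -1, -2

def flatten(nested):
    """Turn a nested list of ints into one flat list with bracket markers."""
    out = []
    for item in nested:
        if isinstance(item, list):
            out.append(OPEN)
            out.extend(flatten(item))
            out.append(CLOSE)
        else:
            out.append(item)
    return out

def unflatten(flat):
    """Rebuild the nested list from a sequence produced by flatten()."""
    stack = [[]]
    for tok in flat:
        if tok == OPEN:
            stack.append([])          # start a new sub-list
        elif tok == CLOSE:
            done = stack.pop()        # finish the current sub-list
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    return stack[0]

sample = [20, [8, [14, [73]]], [12, [13]]]
flat = flatten(sample)
restored = unflatten(flat)
```

Each flattened item could then be passed to save_hdf5 in place of the nested list, since it is now an ordinary 1-D sequence of integers.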

@li-car-fei

How do I convert a txt file into the .h5 format used by your dataset?

@guxd
Owner

guxd commented May 13, 2022

@li-car-fei You can refer to the binarize function in
https://github.com/guxd/DialogBERT/blob/master/prepare_data.py
which converts dialogs (a list of sequences) into an earray.
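As a rough illustration of what "a list of sequences in one array" means, the sketch below packs variable-length sequences into a single flat list plus a (pos, length) index, the same layout that the save_hdf5 snippet earlier in this thread writes with PyTables. It uses plain Python lists only, and the names `pack` and `unpack` are made up for this example. It is not the actual binarize implementation.

```python
def pack(sequences):
    """Concatenate sequences and record where each one starts and how long it is."""
    flat, index = [], []
    pos = 0
    for seq in sequences:
        flat.extend(seq)
        index.append((pos, len(seq)))
        pos += len(seq)
    return flat, index

def unpack(flat, index, i):
    """Recover the i-th sequence from the flat storage."""
    pos, length = index[i]
    return flat[pos:pos + length]

# Two token-id sequences of different lengths (made-up ids).
seqs = [[101, 7, 8, 102], [101, 9, 102]]
flat, index = pack(seqs)
second = unpack(flat, index, 1)
```

In the HDF5 version, `flat` would correspond to the extendable array and `index` to the table of positions and lengths.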

@Ashbajawed

@li-car-fei You can refer to the binarize function in https://github.com/guxd/DialogBERT/blob/master/prepare_data.py to convert the dialog (a list of sequences) into an earray.

Hey, sorry for being dumb, but can you please explain what dialogs is in
def binarize(dialogs, tokenizer, output_path)

Is it an array of sentences, or something else?

@guxd
Owner

guxd commented Dec 19, 2022

You can refer to this function, which builds the dialogs argument that is passed to the binarize function.

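import numpy as np

def get_daily_dial_data(data_path):
    dialogs = []
    # Each line of the file holds one dialog.
    with open(data_path, 'r') as f:
        dials = f.readlines()
    for dial in dials:
        utts = []
        # Utterances within a dialog are separated by ' __eou__ '.
        for i, utt in enumerate(dial.split(' __eou__ ')):
            # Speakers alternate: even turns are 'A', odd turns are 'B'.
            caller = 'A' if i % 2 == 0 else 'B'
            utts.append((caller, utt, np.zeros((1, 1))))
        dialog = {'knowledge': '', 'utts': utts}
        dialogs.append(dialog)
    return dialogs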

According to this code, dialogs is a list of dialog dictionaries. Each dictionary has a 'knowledge' string and a 'utts' entry, and utts is a list of utterances, each stored as a (caller, sentence, array) tuple.
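To make that structure concrete, here is a small sketch that builds one dialog entry from an in-memory string instead of a file. The sample text and the helper name `line_to_dialog` are invented for illustration, and the third tuple field (np.zeros((1, 1)) in the function above) is replaced by None so the example needs no extra packages.

```python
def line_to_dialog(line):
    """Build one dialog dict from a single ' __eou__ '-separated line."""
    utts = []
    for i, utt in enumerate(line.split(' __eou__ ')):
        caller = 'A' if i % 2 == 0 else 'B'   # speakers alternate by turn
        utts.append((caller, utt, None))      # None stands in for the array placeholder
    return {'knowledge': '', 'utts': utts}

# Made-up sample line with two utterances.
sample_line = "Hello , how are you ? __eou__ Fine , thanks ."
dialog = line_to_dialog(sample_line)
dialogs = [dialog]  # the list of such dicts is what the thread calls dialogs

for caller, sentence, _ in dialog['utts']:
    print(caller, ':', sentence)
```

A txt file with one dialog per line, in this separator format, could therefore be turned into the dialogs list by applying the same steps to every line.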
