tran rawtxt data to h5 #50

hellokitty753159 · 2020-08-04T02:56:39Z

[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]
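Each of my data records is a multi-level nested list like the one above. I need to convert it to h5 format so that it can be used directly with your program, but np.array cannot handle this.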

def save_hdf5(vecs, filename):
    '''save the processed data into a hdf5 file'''
    f = tables.open_file(filename, 'w')
    filters = tables.Filters(complib='blosc', complevel=5)
    earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(), shape=(0,), filters=filters)
    indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
    pos = 0
    line = 1
    for x in vecs:
        print(line)
        earrays.append(numpy.array(x))
        ind = indices.row
        ind['pos'] = pos
        ind['length'] = len(x)
        ind.append()
        pos += len(x)
        line = line + 1
    f.close()

How should I modify this code? Thanks.

guxd · 2020-08-04T06:56:49Z

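The existing code only supports flat (linear) sequences and does not support nesting. You can treat the square brackets themselves as tokens, so that each nested record becomes a single flat sequence.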
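
The bracket-as-token idea can be sketched roughly as follows. This is only an illustration, not code from the repository: flatten_nested, unflatten, open_id and close_id are hypothetical names, and the ids -1 and -2 are an assumption chosen just so they do not collide with real token ids. In practice you would pick two ids that are unused in your vocabulary and that fit the Int16Atom used by save_hdf5.

```python
def flatten_nested(tree, open_id, close_id):
    """Turn a nested list of ints into one flat list, emitting a token
    for every opening and closing bracket (hypothetical helper)."""
    flat = []

    def visit(node):
        if isinstance(node, list):
            flat.append(open_id)      # stands in for '['
            for child in node:
                visit(child)
            flat.append(close_id)     # stands in for ']'
        else:
            flat.append(node)         # an ordinary token id

    visit(tree)
    return flat


def unflatten(flat, open_id, close_id):
    """Inverse of flatten_nested: rebuild the nested list."""
    stack = [[]]
    for tok in flat:
        if tok == open_id:
            stack.append([])
        elif tok == close_id:
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    return stack[0][0]


# Assumed bracket ids; replace with ids that are free in your vocabulary.
OPEN_ID, CLOSE_ID = -1, -2

sample = [20, [8, [14, [73]]], [12, [13]]]
flat = flatten_nested(sample, OPEN_ID, CLOSE_ID)
restored = unflatten(flat, OPEN_ID, CLOSE_ID)
```

Each flat list produced this way is an ordinary one-dimensional sequence, so a list of them could be passed as vecs to a function like save_hdf5 above.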

li-car-fei · 2022-05-13T12:27:47Z

How do I convert a txt file into the .h5 format used by your dataset?

guxd · 2022-05-13T14:17:01Z

@li-car-fei You can refer to the binarize function in
https://github.com/guxd/DialogBERT/blob/master/prepare_data.py
which converts the dialogs (a list of sequences) into an earray.

Ashbajawed · 2022-12-18T18:54:07Z

@li-car-fei You can refer to the binarize function in https://github.com/guxd/DialogBERT/blob/master/prepare_data.py to convert the dialog (a list of sequences) into an earray.

Hey, sorry for the basic question, but could you please explain what dialogs is in
def binarize(dialogs, tokenizer, output_path)

Is it an array of sentences, or something else?

guxd · 2022-12-19T14:38:52Z

You can refer to this function, which builds the dialogs argument passed to the binarize function.

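import numpy as np

def get_daily_dial_data(data_path):
    dialogs = []
    # One dialog per line; the context manager closes the file afterwards.
    with open(data_path, 'r') as f:
        dials = f.readlines()
    for dial in dials:
        utts = []
        # Utterances within a line are separated by ' __eou__ '.
        for i, utt in enumerate(dial.rsplit(' __eou__ ')):
            # Speakers alternate, starting with 'A'.
            caller = 'A' if i % 2 == 0 else 'B'
            # The third tuple element is a placeholder 1x1 zero array.
            utts.append((caller, utt, np.zeros((1, 1))))
        dialog = {'knowledge': '', 'utts': utts}
        dialogs.append(dialog)
    return dialogs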

According to this code, dialogs is a list of dialog entries. Each dialog is a dictionary with a 'knowledge' string and a 'utts' list, and each item in utts is a (caller, utterance, array) tuple, one per sentence.
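
To make the shape of dialogs concrete, here is a small standard-library-only sketch that builds one dialog from an in-memory line. It is an illustration under assumptions, not the repository's code: parse_dialog_line is a hypothetical helper, the sample sentence is made up, the nested list [[0.0]] merely stands in for np.zeros((1, 1)), and the strip() call is an addition that removes the trailing newline.

```python
def parse_dialog_line(line, sep=' __eou__ '):
    """Build one dialog dict from a single line of text (hypothetical helper)."""
    utts = []
    for i, utt in enumerate(line.strip().split(sep)):
        caller = 'A' if i % 2 == 0 else 'B'   # speakers alternate, 'A' first
        placeholder = [[0.0]]                 # stand-in for np.zeros((1, 1))
        utts.append((caller, utt, placeholder))
    return {'knowledge': '', 'utts': utts}


# Made-up example line with three utterances.
line = "Hi , how are you ? __eou__ Fine , thanks . __eou__ Good to hear .\n"
dialog = parse_dialog_line(line)
dialogs = [dialog]   # the list you would pass on as the dialogs argument

for caller, utt, _ in dialog['utts']:
    print(caller, utt)
```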
