Models from a joint research project on doxing and calls to harassment.
Paper: A large-scale characterization of online incitements to harassment across platforms
The following are high-level details about each model.
More information can be found in the current manuscript.
Additional metadata can be found in the config.json for each model.
Path structures:

models/MODEL_NAME/pt/
- config.json: hyperparameters used during training, architecture information, etc.
- pytorch_model.bin: trained model weights
- special_tokens_map.json: tokenizer special token values
- tokenizer_config.json: trained tokenizer arguments
- training_args.bin: training arguments saved as a bin file
- vocab.txt: tokenizer vocabulary file (using the default distilbert vocab file)
models/MODEL_NAME/tf/
- config.json: tensorflow config file
- tf_model.h5: tensorflow HDF5 model
- saved_model/1/saved_model.pb: saved model that can be loaded in keras or tensorflow serving
- saved_model/1/variables: model variables directory
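The per-model metadata mentioned above can be inspected directly from config.json. A minimal sketch, assuming standard Hugging Face config keys such as architectures and id2label (the path is a placeholder):

import json

# Placeholder path; substitute the actual MODEL_NAME
config_path = 'models/MODEL_NAME/pt/config.json'

with open(config_path) as f:
    config = json.load(f)

# Typical Hugging Face config keys; exact contents vary by model
print(config.get('architectures'))
print(config.get('id2label'))
print(config.get('max_position_embeddings'))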
The conversion between PyTorch and TensorFlow was done using the /bin/convert_pt_tf.py script.
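The script itself is not reproduced here; as a rough sketch, the same kind of conversion can be done with the Transformers library's from_pt option (paths are placeholders):

from transformers import TFAutoModelForSequenceClassification

pt_path = 'models/MODEL_NAME/pt/'  # placeholder: directory with the PyTorch weights
tf_path = 'models/MODEL_NAME/tf/'  # placeholder: output directory for the TensorFlow model

# from_pt=True loads PyTorch weights into the TensorFlow model class
tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_path, from_pt=True)
tf_model.save_pretrained(tf_path)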
The models are provided in both PyTorch and TensorFlow formats. They were created using the Hugging Face Transformers library.
The tpu.ipynb file in the notebooks directory shows an example of how to classify a large amount of data using a TPU and a pre-trained model.
However, the general process for loading a PyTorch model is as follows:
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

model_path = 'path/to/model/file/'

config = AutoConfig.from_pretrained(
    model_path,
    num_labels=2
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    config=config
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True
)

padding = 'max_length'  # pad the input text to the maximum length
max_length = 512        # set length based on model type
example_text = 'example input text'

# Tokenize the text, returning PyTorch tensors so the output can be fed directly to the model
inputs = tokenizer(example_text, padding=padding, max_length=max_length,
                   truncation=True, return_tensors='pt')

with torch.no_grad():
    output = model(**inputs)

print(output.logits)  # run argmax over the logits to get the predicted class, or softmax for probabilities
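Concretely, and continuing from the snippet above, the logits can be turned into a predicted class and class probabilities with standard PyTorch calls:

probs = torch.softmax(output.logits, dim=-1)           # class probabilities
predicted_class = torch.argmax(output.logits, dim=-1)  # predicted label index
print(predicted_class.item(), probs.tolist())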
The TensorFlow version of the model can also be deployed via TensorFlow Serving so that input text can be sent to it via an API layer. You can start the TensorFlow Serving version of the model with the following:
./bin/tensorflow_serving.sh /models/{MODEL_NAME}/tf/saved_model/ {MODEL_NAME}
Note that this assumes you have Docker installed.
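Once the container is running, test predictions can also be sent directly to TF Serving's REST API. A minimal sketch, assuming the default REST port 8501 and that the exported serving signature expects tokenized input_ids and attention_mask (check the actual signature with saved_model_cli if unsure; paths and names are placeholders):

import requests
from transformers import AutoTokenizer

model_name = 'MODEL_NAME'  # placeholder: the name used when starting TensorFlow Serving
tokenizer = AutoTokenizer.from_pretrained('models/MODEL_NAME/pt/')  # placeholder path

encoded = tokenizer('example input text', padding='max_length', max_length=512, truncation=True)
payload = {'instances': [{'input_ids': encoded['input_ids'],
                          'attention_mask': encoded['attention_mask']}]}

# 8501 is TF Serving's default REST port; adjust if the helper script maps a different one
response = requests.post(f'http://localhost:8501/v1/models/{model_name}:predict', json=payload)
print(response.json())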
You can then make test predictions against the model using the bin/get_predictions.py file. You need to configure the file manually:
- sentence: example input text you are classifying.
- model_name: name of the model deployed above.