Example job
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I approach running code?
What software should I use?
Objectives
Run a job that is relevant for NLP work
Training available!
The training course An Introduction to Machine Learning Applications may be suitable if you would like to learn more.
Example job
The example job is from Kaggle, specifically the AG News Classification Dataset.
The specific task will be:
This dataset contains news articles from 4 different genres, namely "world news", "sports news", "business news" and "science and technology". These categories are represented by 1, 2, 3 and 4 in the class id respectively. The task is to classify the articles accurately into the different categories (genres).
Also, preprocess the data and perform EDA (exploratory data analysis) on the dataset.
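As a quick first look at the data, a minimal EDA sketch with pandas might look like the following (run wherever the unzipped train.csv is available; the Class Index, Title and Description columns are the ones used later in this example):
import pandas as pd

# Assumes train.csv has been unzipped into the current directory
train_data = pd.read_csv('train.csv')

# Sample of the articles and the balance of the four classes
print(train_data.head())
print(train_data['Class Index'].value_counts())

# Rough idea of how long the article descriptions are
print(train_data['Description'].str.len().describe())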
Setting up an environment
Let's first upload some data. This can be accomplished with FileZilla or via the command line:
$ scp train.csv.zip c.username@hawklogin.cf.ac.uk:
$ scp test.csv.zip c.username@hawklogin.cf.ac.uk:
The files can then be unzipped where required (e.g. with unzip train.csv.zip).
One method would be to use the Transformers library; to set up an environment for this we could perform the following.
$ module load anaconda/2020.02
$ . activate
$ conda create -n NLP jupyter
$ conda activate NLP
$ conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch -c conda-forge
$ conda install pandas
$ pip install transformers simpletransformers
Notice the mix of Anaconda and pip. This is not ideal, but it works in this case.
If you know which Transformers model you will be using, it will need to be downloaded in advance. The compute nodes do not have internet access by default (certain addresses can be opened up). One way to make sure Transformers downloads the correct file is to run a small portion of the code on the login node. E.g.
from simpletransformers.classification import ClassificationModel, ClassificationArgs
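# Instantiating the model downloads and caches bert-base-cased;
# use_cuda=False since no GPU is needed for this caching step on the login node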
model = ClassificationModel(
    "bert", "bert-base-cased", num_labels=4, use_cuda=False,
    args={"reprocess_input_data": True, "overwrite_output_dir": True}
)
This will cache the model for us before running the actual code on the compute/GPU nodes.
Traditional approach
We would now run this within a Slurm job or an interactive session. For example, an interactive session (you may have to wait for the resource to become free) can be started with:
$ srun -p c_gpu_diri1 --gres=gpu:2 -A scwXXXX --pty bash --login
Then run python to get a prompt.
Alternatively, if we know the code to run, we can copy the following into nlp.py:
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
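# Load the AG News train and test splits (unzipped earlier into ~/nlp_tut)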
train_data = pd.read_csv('/home/c.username/nlp_tut/train.csv')
test_data = pd.read_csv('/home/c.username/nlp_tut/test.csv')
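# Combine the article title and description into a single text field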
train_data['text'] = train_data['Title'] + ' ' + train_data['Description']
test_data['text'] = test_data['Title'] + ' ' + test_data['Description']
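# Shift the Class Index from 1-4 to 0-3 so the labels start at zero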
train_data['labels'] = train_data['Class Index']-1
test_data['labels'] = test_data['Class Index']-1
print("""
The class ids are now numbered 0-3 where:
0 represents World
1 represents Sports
2 represents Business
3 represents Sci/Tech.
""")
labels = ["World","Sports","Business","Sci/Tech"]
train_data = train_data.drop(columns=['Title', 'Description', 'Class Index'])
test_data = test_data.drop(columns=['Title', 'Description', 'Class Index'])
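# Create a BERT (bert-base-cased) classification model with 4 output labels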
model = ClassificationModel(
    "bert", "bert-base-cased", num_labels=4,
    args={"reprocess_input_data": True, "overwrite_output_dir": True}
)
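# Fine-tune the model; checkpoints and the final model are written to ./outputs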
model.train_model(train_data)
By default the trained model is saved in outputs. This saved model can then be loaded from outputs to evaluate without running the training step again:
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train_data = pd.read_csv('/home/c.username/nlp_tut/train.csv')
test_data = pd.read_csv('/home/c.username/nlp_tut/test.csv')
train_data['text'] = train_data['Title'] + ' ' + train_data['Description']
test_data['text'] = test_data['Title'] + ' ' + test_data['Description']
train_data['labels'] = train_data['Class Index']-1
test_data['labels'] = test_data['Class Index']-1
print("""
The class ids are now numbered 0-3 where:
0 represents World
1 represents Sports
2 represents Business
3 represents Sci/Tech.
""")
labels = ["World","Sports","Business","Sci/Tech"]
train_data = train_data.drop(columns=['Title', 'Description', 'Class Index'])
test_data = test_data.drop(columns=['Title', 'Description', 'Class Index'])
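# Load the fine-tuned model saved earlier in ./outputs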
model = ClassificationModel(
    "bert", "./outputs"
)
#model.train_model(train_data)
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_data)
print(result)
print(test_data.head())
example_text=test_data.iloc[0]['text']
example_label=test_data.iloc[0]['labels']
print(f"{example_text} with label {example_label} has following output")
print(model_outputs[0])
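# Classify a new, unseen sentence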
sentence = "Injury in football"
predictions, raw_outputs = model.predict([sentence])
print(f"{sentence} is labelled as {labels[predictions[0]]}")
Then write a job script, for example called nlp.sh, with the following:
#!/bin/bash --login
#SBATCH -p c_gpu_diri1
#SBATCH -A scwXXXX
#SBATCH --gres=gpu:2
#SBATCH -t 1:00:00
module load anaconda/2020.02
. activate
conda activate NLP
python nlp.py
And submit with:
$ sbatch nlp.sh
This should provide you with output in the location from which you submitted the job, e.g. slurm-20880.out and slurm-20880.err.
However, we have opportunities to use other technologies.
Jupyter
This seems to be a good opportunity to use a Jupyter Notebook. We can run it on Slurm, within the environment set up above, with:
$ srun -n 1 -p c_comsc_diri1 --gres=gpu:1 --account=scwXXXX -t 1:00:00 jupyter-lab --ip=0.0.0.0
http://ccs9202:8888/lab?token=627a2a159dd6c1d81f4d52238c6e09e952ab84f4c85a0c58
or http://127.0.0.1:8888/lab?token=627a2a159dd6c1d81f4d52238c6e09e952ab84f4c85a0c58
Make sure the srun command has the correct partition and any GPU requirements you need.
This can then be accessed either using ssh port forwarding or VNC.
Port forwarding
Port forwarding is achieved by telling ssh to redirect traffic, e.g.
$ ssh -L9009:ccs9202:8888 c.username@hawklogin.cf.ac.uk
Then the URL to connect to Jupyter will be
http://localhost:9009/lab?token=627a2a159dd6c1d81f4d52238c6e09e952ab84f4c85a0c58
VNC
Port forwarding can be tricky for some users, so VNC is also possible: connect to clvnc1.cf.ac.uk:5901 with a VNC viewer such as TigerVNC. This will give you a GNOME login prompt and eventually a Linux GNOME desktop with a browser that can access the ccs9202 URL above.
You should now be able to connect to Jupyter. Using the code above, the model has been trained on the GPU and can be used to evaluate the test.csv data held in test_data.
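For example, a minimal notebook cell along the following lines (a sketch assuming the fine-tuned model was saved to ./outputs as above) can be used to classify new text interactively:
from simpletransformers.classification import ClassificationModel

# Load the model fine-tuned earlier (written to ./outputs by train_model)
model = ClassificationModel("bert", "./outputs")

labels = ["World", "Sports", "Business", "Sci/Tech"]

# Classify an arbitrary sentence from the notebook
predictions, raw_outputs = model.predict(["Stock markets rally after strong tech earnings"])
print(labels[predictions[0]])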
Early preview
We have been looking at improving access to web interfaces such as Jupyter and VNC and discovered OnDemand. This provides a web-based approach to delivering HPC services. In Cardiff we have been testing out an instance. For example, it can:
- Provide VNC through a website.
- Provide access to Jupyter and RStudio directly, without the user having to do anything.
- Provide SSH access via a web browser.
- Provide an easy method to roll out other ways of delivering applications.
A demonstration will be provided during the tutorial.
Key Points
Once you have run one job, the others are very similar.