Classifying Dungeons & Dragons Characters with LanceDB and GPT-instruct
Is the next generation even going to know k-folds, regression or confusion matrices?
When the article Classifying Character Classes in Dungeons and Dragons with Machine Learning showed up in my inbox, I was intrigued. The author uses logistic regression to build a D&D character classifier based on a set of character traits. The approach felt oddly anachronistic and I was left wondering, “is the next generation even going to know k-folds, regression analysis or confusion matrices? Or will they just GPT?”
For those unfamiliar, Dungeons & Dragons (D&D) is a fantasy tabletop role-playing game where players create imaginary characters and embark on adventures led by a Dungeon Master. One of the core aspects of these characters is ‘traits’, which define their personalities, strengths, weaknesses, and unique abilities.
I wanted to use a vector database and retrieval-augmented generation (RAG) to see if I could match the results of the logistic regression approach in the blog post. The rest of this post goes through the steps I took, presents the results and then gives GPT-4 a chance to opine on the implications of the work.
Step 1. Loading the Data
I used the data made available by the author of the post referenced above and loaded the characters and their associated traits into a Pandas dataframe:
import pandas as pd

dnd = pd.read_csv('./data/dnd_chars_unique.tsv', sep='\t')
Step 2. Adding a text field
The embeddings database will take a string representation of the traits and store it in a multidimensional space that can be queried for similar entries. To create the embedding, I need to pack all of the relevant traits into a single ‘text’ column containing the JSON.
cols = dnd.columns[4:]
dnd['text'] = dnd[cols].to_json(orient='records', lines=True).splitlines()
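To make the packed ‘text’ field concrete, here is a minimal, pandas-free sketch of the same idea. The two character rows and the trait names (`Str`, `Wis`) are made up for illustration and are not the real columns of the dataset; the real code above takes every column from index 4 onward.

```python
import json

# Hypothetical character rows; names and trait columns are illustrative only.
rows = [
    {"name": "Aria", "class": "Cleric 20", "Str": 10, "Wis": 20},
    {"name": "Borin", "class": "Fighter 20", "Str": 20, "Wis": 8},
]

# Columns to pack into the text field (an assumed subset for this sketch).
trait_cols = ["class", "Str", "Wis"]

# One JSON object per character, stored as a plain string.
for row in rows:
    row["text"] = json.dumps({col: row[col] for col in trait_cols})

print(rows[0]["text"])
```

Each row ends up with one JSON string, and that string is what gets embedded. Note that the label (`class`) is part of the stored text, which is why it has to be stripped from the query later on.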
Step 3. Embedding with text-embedding-ada-002
I used OpenAI’s text-embedding-ada-002 model to generate an embedding for each character from its traits.
First, I created an embeddings function:
import openai

def embed_func(c):
    # Legacy (pre-1.0) openai SDK call; returns one embedding per input string
    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
    return [record["embedding"] for record in rs["data"]]
Then I used the `with_embeddings` function in LanceDB to create an Arrow Dataset containing the original dataframe with a new column for the embedding.
import lancedb
from lancedb.embeddings import with_embeddings

df = dnd  # the rest of the post refers to the dataframe as df
data = with_embeddings(embed_func, df,
                       show_progress=True,
                       batch_size=50)
Finally, I created a local vector database using LanceDB:
uri = "/Users/tnn1t1s/openai/geppetto/dnddb.3"
db = lancedb.connect(uri)
tbl = db.create_table("chatbot", data)
len(tbl)
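For intuition about what "querying for similar entries" means, here is a toy brute-force nearest-neighbour search in plain Python. It is only a sketch: the three-dimensional vectors and record texts are invented, Euclidean distance is an assumption made for this example, and a real vector database would not scan every row this way.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, table, k=3):
    """Return the k records whose vectors are closest to query_vec."""
    ranked = sorted(table, key=lambda rec: l2_distance(query_vec, rec["vector"]))
    return ranked[:k]

# Invented records standing in for embedded characters.
toy_table = [
    {"text": "cleric-like traits", "vector": [0.9, 0.1, 0.0]},
    {"text": "fighter-like traits", "vector": [0.1, 0.9, 0.0]},
    {"text": "wizard-like traits", "vector": [0.0, 0.1, 0.9]},
    {"text": "another cleric", "vector": [0.8, 0.2, 0.1]},
]

top = nearest([1.0, 0.0, 0.0], toy_table, k=2)
print([rec["text"] for rec in top])
```

The two cleric-like records come back first because their vectors sit closest to the query vector.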
Step 4. Crafting the GPT-Instruct Prompt
To guide GPT-Instruct in answering the query, I provide it with a structured prompt. This prompt will contain context from similar characters and a query about a specific character's class.
First, structure the prompt:
def create_prompt(query, context):
    limit = 3750  # character budget for the joined contexts
    prompt_start = (
        "Answer the question based on the context below.\n\n" +
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer: "
    )
    # default: use every retrieved context (also covers a single-row result)
    prompt = prompt_start + "\n\n---\n\n".join(context.text) + prompt_end
    # append contexts until hitting limit
    for i in range(1, len(context) + 1):
        if len("\n\n---\n\n".join(context.text[:i])) >= limit:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context.text[:i-1]) +
                prompt_end
            )
            break
    return prompt
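To see the truncation behaviour without pandas or a vector database, here is a small stand-alone variant that works on a plain list of strings. `build_prompt`, `SEPARATOR`, and the sample contexts are names made up for this sketch; the tiny `limit=30` is chosen only to force truncation.

```python
SEPARATOR = "\n\n---\n\n"

def build_prompt(query, texts, limit=3750):
    """Join context strings until the character budget would be reached."""
    kept = []
    for text in texts:
        if len(SEPARATOR.join(kept + [text])) >= limit:
            break  # adding this context would reach the budget
        kept.append(text)
    return (
        "Answer the question based on the context below.\n\nContext:\n"
        + SEPARATOR.join(kept)
        + f"\n\nQuestion: {query}\nAnswer: "
    )

# Invented contexts standing in for retrieved character records.
contexts = ['{"class": "Cleric 20"}', '{"class": "Monk 20"}', '{"class": "Druid 20"}']

full = build_prompt("What is the class?", contexts)
short = build_prompt("What is the class?", contexts, limit=30)
print(short)
```

With the default budget all three contexts are included; with `limit=30` only the first one fits, so the prompt still has a question and an answer slot but less supporting context.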
Next, write a function to remove ‘class’ and ‘subclass’ from a character entry that we intend to classify. This should ensure that GPT-instruct cannot read the character-to-class mapping from the query itself.
import json

def get_character_query(df, cid):
    c = df.loc[cid].text
    c_json = json.loads(c)
    # drop every field that reveals the label
    del c_json['justClass']
    del c_json['class']
    del c_json['subclass']
    return c_json
Now, after embedding the query, search the embeddings database to retrieve the three closest characters, build the prompt from them, and return GPT-instruct's completion.
def classify_character(cid):
    character = get_character_query(df, cid)
    print(character)
    query = f"What is the class for a character with {character} "
    # Embed the question
    emb = embed_func(query)[0]
    # Use LanceDB to get the top 3 most relevant contexts
    context = tbl.search(emb).limit(3).to_pandas()
    prompt = create_prompt(query, context)
    # complete() is the helper (not shown in this post) that sends the prompt to GPT-instruct
    return complete(prompt)
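One detail worth checking: the table was built from every character, so the nearest neighbours of a query may include the row of the character being classified, label and all. A minimal sketch of filtering that row out is below. It assumes each search hit carries an `id` field identifying the source row, which the table in this post does not necessarily have; `drop_self` and the sample hits are hypothetical.

```python
def drop_self(hits, query_id, k=3):
    """Return up to k hits, skipping the record whose id matches query_id."""
    return [hit for hit in hits if hit["id"] != query_id][:k]

# Hypothetical search results, nearest first; "id" is an assumed column.
hits = [
    {"id": 35, "text": '{"class": "Cleric 20"}'},
    {"id": 12, "text": '{"class": "Cleric 20"}'},
    {"id": 7, "text": '{"class": "Paladin 20"}'},
    {"id": 51, "text": '{"class": "Druid 20"}'},
]

context = drop_self(hits, query_id=35, k=3)
print([hit["id"] for hit in context])
```

In practice this would mean asking the search for one more result than needed (four instead of three) and dropping the query character's own row before building the prompt.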
Step 5. Try it out
It seems to work surprisingly well, accurately predicting a Level 20 Cleric.
(classify_character(35), df.loc[35]['class'])
('Cleric 20', 'Cleric 20')
Step 6. Analyzing the Results
Finally, test it out for more characters and compute an accuracy score. TL;DR: in this sample of 20 characters the classifier named the right class 19 times. It labeled one Cleric a Druid; the other mismatches are only differences in output formatting.
[(classify_character(i), df.loc[i]['class']) for i in range(20,40)]
[('The class for this character is a Druid.', 'Cleric 20'),
('Cleric 20', 'Cleric 20'),
('The class for this character is Fighter.', 'Fighter 20'),
('Barbarian', 'Barbarian 20'),
('Monk 20', 'Monk 20'),
('Warlock 20', 'Warlock 20'),
('Paladin 20', 'Paladin 20'),
('Druid 20', 'Druid 20'),
('Paladin 20', 'Paladin 20'),
('Rogue 20', 'Rogue 20'),
('Paladin 20', 'Paladin 20'),
('The class for this character is Monk.', 'Monk 20'),
('Wizard 20', 'Wizard 20'),
('The class for this character is Bard.', 'Bard 20'),
('Cleric 20', 'Cleric 20'),
('Cleric 20', 'Cleric 20'),
('Monk 20', 'Monk 20'),
('Cleric 20', 'Cleric 20'),
('Paladin 20', 'Paladin 20'),
('Cleric 20', 'Cleric 20')]
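Since the outputs are not uniformly formatted, an accuracy score depends on how a match is defined. The sketch below scores the 20 pairs listed above two ways: exact string match, and a lenient match that only checks whether the true class name appears as a word in the prediction. The lenient rule is my own choice for illustration and assumes single-class characters.

```python
import re

# (prediction, truth) pairs copied from the output above
results = [
    ('The class for this character is a Druid.', 'Cleric 20'),
    ('Cleric 20', 'Cleric 20'),
    ('The class for this character is Fighter.', 'Fighter 20'),
    ('Barbarian', 'Barbarian 20'),
    ('Monk 20', 'Monk 20'),
    ('Warlock 20', 'Warlock 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Druid 20', 'Druid 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Rogue 20', 'Rogue 20'),
    ('Paladin 20', 'Paladin 20'),
    ('The class for this character is Monk.', 'Monk 20'),
    ('Wizard 20', 'Wizard 20'),
    ('The class for this character is Bard.', 'Bard 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Monk 20', 'Monk 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Cleric 20', 'Cleric 20'),
]

def class_name(label):
    """'Cleric 20' -> 'cleric' (assumes the class name is the first word)."""
    return label.split()[0].lower()

def matches(prediction, truth):
    """True if the true class name appears as a word in the prediction."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", prediction)]
    return class_name(truth) in words

exact = sum(pred == truth for pred, truth in results) / len(results)
lenient = sum(matches(pred, truth) for pred, truth in results) / len(results)
print(f"exact match: {exact:.2f}, class-name match: {lenient:.2f}")
```

On these 20 pairs the exact-match score is 0.75 and the class-name score is 0.95; the single real miss is the Cleric that was labeled a Druid.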
Reflections and Conclusions
I asked GPT what it thought of my experiment and its implications for the future of machine learning and data science practitioners. As usual, GPT was careful in its response …
“While logistic regression has been a stalwart in the classification arena, the advent of GPT-instruct brings forth a paradigm shift. This methodology, as showcased in our exploration with D&D character traits, not only achieves impressive accuracy but also encapsulates the richness of language and context. However, it's essential to approach such results with a blend of enthusiasm and caution. As we stand at this intersection of traditional and modern classification techniques, it's evident that the fusion of these methodologies can lead to powerful, nuanced, and innovative solutions in the realm of data science.” - GPT4
Reference:
Classifying Character Classes in Dungeons and Dragons with Machine Learning