Classifying Dungeons & Dragons Characters with LanceDB and GPT-instruct

Is the next generation even going to know k-folds, regression or confusion matrices?

Oct 28, 2023

When the article Classifying Character Classes in Dungeons and Dragons with Machine Learning showed up in my inbox, I was intrigued. The author uses logistic regression to build a D&D character classifier based on a set of character traits. The approach felt oddly anachronistic and I was left wondering, “Is the next generation even going to know k-folds, regression analysis or confusion matrices? Or will they just GPT?”

For those unfamiliar, Dungeons & Dragons (D&D) is a fantasy tabletop role-playing game where players create imaginary characters and embark on adventures led by a Dungeon Master. One of the core aspects of these characters is ‘traits’, which define their personalities, strengths, weaknesses, and unique abilities.

I wanted to use a vector database and retrieval-augmented generation (RAG) to see if I could match the results of the logistic regression approach in the blog post. The rest of this post goes through the steps I took, presents the results, and then gives GPT-4 a chance to opine on the implications of the work.

Step 1. Loading the Data

I used the data made available by the author of the post referenced above and loaded the characters and their associated traits into a Pandas dataframe:

import pandas as pd
dnd = pd.read_csv('./data/dnd_chars_unique.tsv', sep='\t')

Step 2. Adding a text field

The embeddings database will take a string representation of the traits and store it in a multidimensional space that can be queried for similar entries. To create the embedding, I need to pack all of the relevant traits into a single ‘text’ column that holds each row serialized as JSON.

cols = dnd.columns[4:]
dnd['text'] = dnd[cols].to_json(orient='records', lines=True).splitlines()

Step 3. Embedding with `text-embedding-ada-002`

For our D&D character traits, I used OpenAI’s text-embedding-ada-002 model to generate embeddings for each character based on their traits.

First, I created an embeddings function:

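import openai  # legacy (pre-1.0) SDK interface

def embed_func(c):
    # Embed a batch of strings and return one vector per input
    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
    return [record["embedding"] for record in rs["data"]]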

Then I used the `with_embeddings` helper in LanceDB to create an Arrow Dataset containing the original dataframe with a new column for the embedding.

import lancedb
from lancedb.embeddings import with_embeddings

df = dnd  # the rest of the post refers to the dataframe as df
data = with_embeddings(embed_func, df,
                       show_progress=True,
                       batch_size=50)

Finally, I created a local vector database using LanceDB:

uri = "/Users/tnn1t1s/openai/geppetto/dnddb.3"
db = lancedb.connect(uri)
tbl = db.create_table("chatbot", data)
len(tbl)
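
To make "querying for similar entries" concrete, here is a minimal brute-force sketch of a nearest-neighbour lookup. It is not how LanceDB is implemented; the helper names `l2_distance` and `nearest`, the Euclidean metric, and the tiny 3-dimensional vectors are all assumptions made for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, rows, k=3):
    """Return the k rows whose 'vector' is closest to the query vector."""
    return sorted(rows, key=lambda r: l2_distance(query, r["vector"]))[:k]

# Made-up 3-dimensional "embeddings"; real ada-002 vectors have 1536 dimensions.
rows = [
    {"text": "Cleric A", "vector": [0.9, 0.1, 0.0]},
    {"text": "Cleric B", "vector": [0.8, 0.2, 0.1]},
    {"text": "Rogue A", "vector": [0.0, 0.9, 0.3]},
    {"text": "Wizard A", "vector": [0.1, 0.1, 0.9]},
]

top = nearest([0.88, 0.12, 0.02], rows, k=2)
print([r["text"] for r in top])
```

The two Cleric rows come back first because their vectors sit closest to the query, which is the property the retrieval step relies on.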

Step 4. Crafting the GPT-Instruct Prompt

To guide GPT-instruct in answering the query, I provide it with a structured prompt. This prompt contains context from similar characters and a question about a specific character's class.

First, structure the prompt:

def create_prompt(query, context, limit=3750):
    sep = "\n\n---\n\n"
    prompt_start = (
        "Answer the question based on the context below.\n\n"
        "Context:\n"
    )
    prompt_end = f"\n\nQuestion: {query}\nAnswer: "
    # append contexts until the joined text would hit the limit;
    # this also works when only one context row is returned
    kept = []
    for text in context.text:
        if len(sep.join(kept + [text])) >= limit:
            break
        kept.append(text)
    return prompt_start + sep.join(kept) + prompt_end

Next, write a function to remove ‘justClass’, ‘class’ and ‘subclass’ from the character entry that we intend to classify. This ensures the query itself does not reveal the character-to-class mapping to GPT-instruct.

import json

def get_character_query(df, cid):
    # Parse the stored JSON and drop every field that reveals the class
    c_json = json.loads(df.loc[cid].text)
    del c_json['justClass']
    del c_json['class']
    del c_json['subclass']
    return c_json
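
As a quick worked example of the label-stripping step, here is a small standalone sketch that runs on a single JSON string instead of the dataframe. The helper name `strip_labels` and the sample record are invented for illustration; only the three label keys come from the function above.

```python
import json

LABEL_KEYS = ("justClass", "class", "subclass")

def strip_labels(text, label_keys=LABEL_KEYS):
    """Parse a JSON record and drop the keys that would give the answer away."""
    record = json.loads(text)
    for key in label_keys:
        record.pop(key, None)  # pop with a default tolerates missing keys
    return record

# Invented record; "Wis" and "HP" are illustrative trait names only.
row = (
    '{"justClass":"Cleric","class":"Cleric 20",'
    '"subclass":"Life Domain","Wis":20,"HP":163}'
)
print(strip_labels(row))
```

Using `pop` with a default instead of `del` is a design choice of this sketch: it avoids a `KeyError` if a record happens to lack one of the label fields.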

Now, after embedding the query, search the embeddings database to retrieve the three closest characters, build the prompt from them, and return GPT-instruct's completion.

def classify_character(cid):
    character = get_character_query(df, cid)
    print(character)
    query = f"What is the class for a character with {character}"
    # Embed the question
    emb = embed_func(query)[0]
    # Use LanceDB to get the 3 most similar characters as context
    context = tbl.search(emb).limit(3).to_pandas()
    prompt = create_prompt(query, context)
    # complete() sends the prompt to GPT-instruct and returns its answer
    return complete(prompt)
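
The post calls `complete(prompt)` without showing its definition. The following is only a guess at what such a helper could look like, assuming the same legacy (pre-1.0) `openai` SDK used by `embed_func`. The model name `gpt-3.5-turbo-instruct`, the `temperature` and `max_tokens` values, and the injectable `create` parameter are all assumptions, not the author's code.

```python
def complete(prompt, create=None, model="gpt-3.5-turbo-instruct"):
    """Send the prompt to a completion endpoint and return the trimmed text."""
    if create is None:
        import openai  # legacy (pre-1.0) SDK, matching embed_func above
        create = openai.Completion.create
    # temperature=0 and max_tokens=20 are assumed values for short, stable answers
    response = create(model=model, prompt=prompt, temperature=0, max_tokens=20)
    return response["choices"][0]["text"].strip()

# Offline stand-in so the sketch runs without an API key or network access.
def fake_create(**kwargs):
    return {"choices": [{"text": " Cleric 20\n"}]}

print(complete("Context: ...\n\nQuestion: ...\nAnswer: ", create=fake_create))
```

Passing the API call in as a parameter keeps the sketch testable offline; in real use you would leave `create` unset so the function falls back to the SDK.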

Step 5. Try it out

It seems to work surprisingly well, accurately predicting a Level 20 Cleric.

(classify_character(35), df.loc[35]['class'])

('Cleric 20', 'Cleric 20')

Step 6. Analyzing the Results

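Finally, test it out on more characters and check the accuracy. TL;DR: in the sample of 20 below, the classifier named the right class for 19 characters. The one real miss predicted Druid for a Cleric; four other answers named the right class but in a different format (a full sentence, or without the level).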

[(classify_character(i), df.loc[i]['class']) for i in range(20,40)]

[('The class for this character is a Druid.', 'Cleric 20'),
 ('Cleric 20', 'Cleric 20'),
 ('The class for this character is Fighter.', 'Fighter 20'),
 ('Barbarian', 'Barbarian 20'),
 ('Monk 20', 'Monk 20'),
 ('Warlock 20', 'Warlock 20'),
 ('Paladin 20', 'Paladin 20'),
 ('Druid 20', 'Druid 20'),
 ('Paladin 20', 'Paladin 20'),
 ('Rogue 20', 'Rogue 20'),
 ('Paladin 20', 'Paladin 20'),
 ('The class for this character is Monk.', 'Monk 20'),
 ('Wizard 20', 'Wizard 20'),
 ('The class for this character is Bard.', 'Bard 20'),
 ('Cleric 20', 'Cleric 20'),
 ('Cleric 20', 'Cleric 20'),
 ('Monk 20', 'Monk 20'),
 ('Cleric 20', 'Cleric 20'),
 ('Paladin 20', 'Paladin 20'),
 ('Cleric 20', 'Cleric 20')]
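
To turn the list above into an accuracy score, here is a small sketch that scores it two ways: exact string match, and match on the class name alone. The helper `base_class` and its "last alphabetic word" rule are assumptions that happen to fit these 20 outputs; they are not a general parser for model answers.

```python
import re

# (prediction, truth) pairs copied from the output above
results = [
    ('The class for this character is a Druid.', 'Cleric 20'),
    ('Cleric 20', 'Cleric 20'),
    ('The class for this character is Fighter.', 'Fighter 20'),
    ('Barbarian', 'Barbarian 20'),
    ('Monk 20', 'Monk 20'),
    ('Warlock 20', 'Warlock 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Druid 20', 'Druid 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Rogue 20', 'Rogue 20'),
    ('Paladin 20', 'Paladin 20'),
    ('The class for this character is Monk.', 'Monk 20'),
    ('Wizard 20', 'Wizard 20'),
    ('The class for this character is Bard.', 'Bard 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Monk 20', 'Monk 20'),
    ('Cleric 20', 'Cleric 20'),
    ('Paladin 20', 'Paladin 20'),
    ('Cleric 20', 'Cleric 20'),
]

def base_class(label):
    """Return the last alphabetic word, e.g. 'Cleric 20' -> 'Cleric'."""
    return re.findall(r"[A-Za-z]+", label)[-1]

exact = sum(pred == truth for pred, truth in results)
class_hits = sum(base_class(p) == base_class(t) for p, t in results)

print(f"exact match: {exact}/{len(results)} = {exact / len(results):.2f}")
print(f"class match: {class_hits}/{len(results)} = {class_hits / len(results):.2f}")
```

On this sample the exact-match score is 15/20 and the class-only score is 19/20, so most of the gap comes from output formatting and only one answer names the wrong class.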

Reflections and Conclusions

I asked GPT what it thought of my experiment and its implications for the future of machine learning and data science practitioners. As usual, GPT was careful in its response …

“While logistic regression has been a stalwart in the classification arena, the advent of GPT-instruct brings forth a paradigm shift. This methodology, as showcased in our exploration with D&D character traits, not only achieves impressive accuracy but also encapsulates the richness of language and context. However, it's essential to approach such results with a blend of enthusiasm and caution. As we stand at this intersection of traditional and modern classification techniques, it's evident that the fusion of these methodologies can lead to powerful, nuanced, and innovative solutions in the realm of data science.” - GPT4

References:

Classifying Character Classes in Dungeons and Dragons with Machine Learning

Dungeons and Dragons as a Dialogue Challenge, CCB

Lance DB

Live Without a Skynet