Compass

Exploring Compass: An AI Conversational Agent for Personalized Skill Discovery

Compass is a generative AI agent designed to engage users in exploratory conversations to gather information and identify their unique skills, focusing on abilities over credentials, including informal and unpaid work.

Apostolos Benisis, Francesco Preta

24 Oct 2024 — 7 min read

User Testing for Compass – Tabiya's conversational skills exploration tool. Photo by Barry Christianson for Tabiya.

In the rapidly evolving landscape of generative AI applications, Compass is designed to engage users in exploratory conversations to gather information and identify their unique skills, focusing on abilities over credentials, including informal and unpaid work. This tech-focused blog post dives into how Compass leverages agentic workflows and other AI techniques to offer a personalized and engaging user experience and how Compass utilizes Tabiya's Inclusive Taxonomy to discover skills from both the formal and the unseen economy.

Agentic Workflows That Mimic Human Conversations

At the core of Compass is its ability to mimic human conversational patterns. An overarching agent breaks a conversation down into tasks for smaller, specialized agents, each with specific responsibilities and goals. This multi-agent system allows Compass to approach conversations and tasks much like a human would. For instance, one agent might engage in dialogue to gather specific information, then process that information and pass it to another agent for further use.

Each agent maintains an individual internal state across user interactions, enabling it to apply strategies to accomplish its goals effectively. Agents have access to the user’s conversation history and various tools based on Large Language Model (LLM) prompts, and they offer interfaces for other agents to interact with them.

The conversation process consists of multiple stages:

Introduction: The user is greeted and asked whether they would like to start the process. Alternatively, the user can ask questions or request general clarifications about the process.
Experience Gathering: The user provides basic information about their work experiences, including roles, locations, and periods. The agent prompts for all types of experiences, including unpaid work, informal work, and volunteering, ensuring a comprehensive view of the user’s skills.
In-depth Exploration: Compass dives into each experience with open-ended questions, encouraging the user to discuss their work in detail. This helps the system accurately extract the user's responsibilities in each position and subsequently link them to the relevant skills.
Skill Identification: Finally, Compass processes all the information obtained in the previous steps to present a list of skills associated with the user's original experiences.
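
To make the agent-and-stage structure above more concrete, here is a minimal sketch of an overarching agent that routes each user turn to the agent responsible for the current stage. It is not the Compass implementation: the names Phase, AgentReply, IntroAgent, GatherAgent, and Director are hypothetical, and simple keyword rules stand in for the LLM prompts that would drive a real agent.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Protocol


class Phase(Enum):
    """Conversation stages, in the order they are visited."""
    INTRODUCTION = auto()
    EXPERIENCE_GATHERING = auto()
    IN_DEPTH_EXPLORATION = auto()
    SKILL_IDENTIFICATION = auto()
    DONE = auto()


@dataclass
class AgentReply:
    message: str    # text shown to the user
    finished: bool  # True once the agent has reached its goal


class Agent(Protocol):
    def handle(self, user_input: str, history: list[str]) -> AgentReply: ...


class IntroAgent:
    """Greets the user and waits for consent to start (stateless)."""

    def handle(self, user_input: str, history: list[str]) -> AgentReply:
        # Assumption: a keyword check replaces the LLM call of a real agent.
        if user_input.strip().lower() in {"yes", "ok", "start"}:
            return AgentReply("Great, let's begin with your experiences.", True)
        return AgentReply("I will ask about your work experiences. Shall we start?", False)


@dataclass
class GatherAgent:
    """Collects experiences; the list is this agent's internal state."""
    experiences: list[str] = field(default_factory=list)

    def handle(self, user_input: str, history: list[str]) -> AgentReply:
        if user_input.strip().lower() == "that's all":
            return AgentReply(f"Recorded {len(self.experiences)} experience(s).", True)
        self.experiences.append(user_input.strip())
        return AgentReply("Anything else, including unpaid or informal work?", False)


@dataclass
class Director:
    """Overarching agent: routes each turn to the agent for the current phase."""
    agents: dict[Phase, Agent]
    phase: Phase = Phase.INTRODUCTION
    history: list[str] = field(default_factory=list)

    def step(self, user_input: str) -> str:
        if self.phase is Phase.DONE:
            return "The conversation is complete."
        reply = self.agents[self.phase].handle(user_input, self.history)
        self.history += [f"user: {user_input}", f"compass: {reply.message}"]
        if reply.finished:
            self.phase = self._next_phase()
        return reply.message

    def _next_phase(self) -> Phase:
        # Skip phases that have no agent registered in this sketch.
        order = list(Phase)
        for candidate in order[order.index(self.phase) + 1:]:
            if candidate in self.agents:
                return candidate
        return Phase.DONE
```

In this sketch only the first two stages have agents; a fuller version would register one agent per stage and let each keep its own state between turns.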

Compass leverages an LLM to drive the conversation, allowing for natural and adaptive interactions. However, it remains focused on the task, like a professional interviewer, while showing empathy and gracefully handling unexpected inputs. Compass attempts to maintain a balance between open conversation and setting boundaries to achieve the tasks at hand.

Leveraging the Power of an LLM in Four Ways

Compass employs large language models in four ways to enhance user interaction and data processing:

Conversational Engagement: Unlike typical applications where the LLM responds to user questions, Compass reverses this interaction. It generates questions to guide a directed and grounded conversation.
Natural Language Processing Tasks: Compass uses the LLM for tasks like clustering, named entity extraction, and classification, handling user inputs efficiently without the need for costly and time-consuming model training or fine-tuning.
Explainability and Traceability: The LLM provides reasoning for specific outputs, allowing for explanations that link discovered skills back to the user’s input. This feature, which is based on a variation of Chain of Thought reasoning, is especially noteworthy, not only for the capability it offers but also because it was an unplanned outcome: it emerged while attempting to solve a different problem, namely improving the accuracy of the LLM's tasks.
Filtering of Taxonomy output: The LLM filters relevant skills and occupations connected to the ESCO model from the conversation's output. Leveraging its advanced reasoning capabilities, the LLM efficiently processes large amounts of text, identifying the most pertinent entities. This approach combines traditional entity linking via semantic search with LLM-based filtering, creating a hybrid solution.
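
The hybrid approach in the last item can be illustrated with a small sketch: a semantic search first retrieves the closest taxonomy entries, and an LLM then filters those candidates. Everything here is an assumption made for illustration: the function names cosine and link_skills are hypothetical, the embeddings are plain lists of floats, and llm_filter is any callable standing in for the real LLM prompt.

```python
import math
from typing import Callable, Mapping, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_skills(
    query_text: str,
    query_vec: Vector,
    taxonomy: Mapping[str, Vector],
    llm_filter: Callable[[str, list[str]], list[str]],
    top_k: int = 3,
) -> list[str]:
    # Stage 1: semantic search over the taxonomy embeddings.
    ranked = sorted(taxonomy, key=lambda label: cosine(query_vec, taxonomy[label]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: the LLM (here any callable) picks the relevant candidates.
    kept = set(llm_filter(query_text, candidates))
    # Only labels that were actually retrieved can be returned, so anything
    # the filter invents outside the candidate list is dropped.
    return [label for label in candidates if label in kept]
```

One design point worth noting: because the final list is intersected with the retrieved candidates, the filter can narrow the result but cannot add entries that are not in the taxonomy.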

Reducing Hallucinations and Improving Accuracy

A common challenge with AI language models is the risk of generating irrelevant or inaccurate outputs, often referred to as "hallucinations." Compass addresses this issue through several strategies to reduce the probability of hallucinations or to mitigate their effects:

Task Decomposition: Smaller, more manageable tasks are assigned to individual agents with specific LLM prompts.
State-Induced Instructions: Agents use their internal state to generate targeted instructions during user interactions, guiding the conversation toward a specific goal. This approach reduces the size of the prompt by including only relevant segments, making it more likely that the LLM will follow the instructions accurately, thereby reducing the risk of hallucinations.
Guided Output: Instructions are carefully crafted to increase the likelihood of relevant responses. Techniques include:
- One- and few-shot learning
- Chain of Thought
- Retrieval Augmented Generation
- JSON schemas with validation and retries
- Ordering output segments to align with semantic dependencies
State Guardrails: Simple, rule-based decisions are made whenever possible, reducing reliance on the LLM and minimizing potential inaccuracies.
Taxonomy Grounding: By linking entities to a predefined occupations/skills taxonomy, Compass ensures that identified skills remain within a relevant and accurate domain.
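
As an illustration of two of the techniques listed above (validated JSON output with retries, and simple rule-based checks around the LLM), here is a minimal sketch. It uses only the Python standard library so that it runs anywhere; in a Pydantic-based backend the manual checks would more likely be expressed as a model. The function names, the "reasoning" and "work_type" fields, and the retry wording are assumptions, and call_llm is a placeholder for whatever client actually queries the model.

```python
import json
from typing import Callable

# The four work types named in this article; anything else is rejected.
ALLOWED_WORK_TYPES = {
    "wage employment",
    "self-employment",
    "unpaid trainee work",
    "unseen work",
}


class InvalidLLMOutput(ValueError):
    """Raised when the model's answer does not match the expected shape."""


def parse_classification(raw: str) -> dict:
    """Validate the raw model output against a small hand-written schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidLLMOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidLLMOutput("expected a JSON object")
    reasoning = data.get("reasoning")
    work_type = data.get("work_type")
    # The reasoning is requested before the label so the label can depend on it.
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise InvalidLLMOutput("missing 'reasoning'")
    if not isinstance(work_type, str) or work_type not in ALLOWED_WORK_TYPES:
        raise InvalidLLMOutput(f"unknown work_type: {work_type!r}")
    return {"reasoning": reasoning, "work_type": work_type}


def classify_with_retries(
    call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its output validates or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        if last_error is None:
            raw = call_llm(prompt)
        else:
            # Feed the validation error back so the model can correct itself.
            raw = call_llm(f"{prompt}\n\nYour previous answer was rejected: {last_error}")
        try:
            return parse_classification(raw)
        except InvalidLLMOutput as exc:
            last_error = exc
    raise InvalidLLMOutput(f"no valid output after {max_attempts} attempts: {last_error}")
```

The closed set of allowed labels plays the role of a guardrail here: the decision about whether an answer is acceptable is made by plain code, not by the model.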

It’s important to note that, despite these measures, there is always a residual chance that Compass may deviate from expected behavior due to the probabilistic nature of LLMs.

Balancing the flow of natural conversation with safeguards against hallucinations, while still achieving the conversation’s goals, is a delicate task that can be challenging to perfect.

Multi-Stage Pipeline for Skill Identification

After gathering the necessary information from the user, Compass processes the data through a multi-stage pipeline to identify the user’s top skills. This pipeline employs techniques such as clustering, classification, and entity linking to an occupations/skills taxonomy. The result is a refined list of skills that accurately reflects the user’s experience based on the information they provided. The pipeline is quite sophisticated and deserves a separate article, so stay tuned for a follow-up.

You can find a high-level overview of the pipeline here.

The Core Role of a Taxonomy

Tabiya's inclusive taxonomy plays a central role in Compass. It grounds the LLM’s tasks, but there are several additional aspects worth mentioning:

Standardization: The identified skills are linked to a standard taxonomy, making interpretation and comparison easier. The concepts behind these skills are well-defined and can be explained, allowing for clarification and disambiguation.
Canonicalization: Explored skills are listed with canonical names and UUIDs, enabling consistent referencing across different experiences and applications.
Network Structure: The taxonomy models the labor market by associating occupations with skills, forming a knowledge graph that can provide additional insights to users.
Unseen Economy: The taxonomy has been extended to include activities from the unseen economy, empowering young women and first-time job seekers to enter the job market.
Localization: A taxonomy can consider the specific context of a country. This includes occupations unique to certain regions, alternative names for occupations that are region-specific, and varying skill requirements for the same occupation across different countries.
Work Type Classification: All experiences are classified into four types (wage employment, self-employment, unpaid trainee work, and unseen work), which allows for a more targeted exploration of the job seeker’s skills.
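
A few of these aspects (canonical names with UUIDs, the occupation-to-skill network, region-specific alternative names, and the four work types) can be sketched as a small data structure. This is an illustrative model only, not the schema of Tabiya's taxonomy: the class and method names are hypothetical, and the labels used in any example data are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class WorkType(Enum):
    WAGE_EMPLOYMENT = "wage employment"
    SELF_EMPLOYMENT = "self-employment"
    UNPAID_TRAINEE_WORK = "unpaid trainee work"
    UNSEEN_WORK = "unseen work"


@dataclass(frozen=True)
class Skill:
    uuid: UUID
    preferred_label: str  # canonical name


@dataclass(frozen=True)
class Occupation:
    uuid: UUID
    preferred_label: str
    alt_labels: tuple[str, ...] = ()  # e.g. region-specific names


@dataclass
class Experience:
    title: str
    work_type: WorkType
    skill_uuids: set[UUID] = field(default_factory=set)  # references, not copies


@dataclass
class Taxonomy:
    skills: dict[UUID, Skill] = field(default_factory=dict)
    occupations: dict[UUID, Occupation] = field(default_factory=dict)
    edges: dict[UUID, set[UUID]] = field(default_factory=dict)  # occupation -> skills

    def add_skill(self, label: str) -> Skill:
        skill = Skill(uuid4(), label)
        self.skills[skill.uuid] = skill
        return skill

    def add_occupation(self, label: str, alt_labels: tuple[str, ...] = ()) -> Occupation:
        occupation = Occupation(uuid4(), label, alt_labels)
        self.occupations[occupation.uuid] = occupation
        self.edges[occupation.uuid] = set()
        return occupation

    def link(self, occupation: Occupation, skill: Skill) -> None:
        self.edges[occupation.uuid].add(skill.uuid)

    def skills_for(self, occupation_uuid: UUID) -> list[Skill]:
        linked = (self.skills[u] for u in self.edges.get(occupation_uuid, set()))
        return sorted(linked, key=lambda s: s.preferred_label)

    def shared_skills(self, a: UUID, b: UUID) -> list[Skill]:
        common = self.edges.get(a, set()) & self.edges.get(b, set())
        return sorted((self.skills[u] for u in common), key=lambda s: s.preferred_label)

    def find_occupation(self, label: str) -> Optional[Occupation]:
        """Case-insensitive lookup by preferred or alternative label."""
        wanted = label.strip().lower()
        for occupation in self.occupations.values():
            names = (occupation.preferred_label, *occupation.alt_labels)
            if any(wanted == name.lower() for name in names):
                return occupation
        return None
```

Because experiences store skill UUIDs instead of free text, the same skill discovered in a paid job and in unseen work resolves to one entry, which is what makes comparison across experiences possible.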

Evaluation Strategies

To ensure Compass operates effectively, we implemented an evaluation focused on five key elements:

Rigorous Embeddings Evaluation: We compared various strategies for generating embeddings from the taxonomy entities. This included deciding which properties of the entities to include when generating the embeddings, as well as determining the optimal number of entities to balance accuracy and precision. For the tests, we used established datasets from the literature and generated synthetic data to mimic Compass user queries.
Isolated Component Testing: Each agent's tools were evaluated individually using specific inputs and expected outputs. For example, classification components were tested with known inputs to verify accurate label assignments.
Scripted Conversations: Conversational agents were tested using predefined dialogues, with outputs evaluated by either automated evaluators (other LLMs) or human inspectors.
Simulated User Interactions: Compass was tested in end-to-end scenarios by simulating user interactions driven by an LLM. The simulated user was given a persona based on our UX research and additional instructions to cover specific cases of interest. These conversations were then assessed for quality and relevance by automated evaluators (other LLMs) or human inspectors.
User Testing and Trace Analysis: On a smaller scale, real user tests were conducted. By tracing the top skill outputs back to the user's input, human inspectors could assess the performance of specific agents within Compass.

A key takeaway is that evaluating this system requires a hybrid approach; it cannot be treated strictly as a back-end system or a typical ML task. It cannot be tested like a conventional back-end system because it is probabilistic by design, so evaluation results vary from run to run, and the tests are time-consuming and potentially costly. Running thousands of test repetitions using LLMs is not always efficient, though improvements in resource availability, rate limits, and costs are helping to alleviate this concern.

These tests also cannot be run like a typical ML evaluation (for example, once at the end of an iteration in a Jupyter notebook), because the iteration pace is fast, with multiple commits and changes per day. Moreover, many tests need to be run continuously as the system evolves. As a result, a combined approach is required.
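
One common way to live with non-deterministic results, shown here as a sketch and not as the Compass test suite, is to repeat a check several times and assert on the pass rate instead of on a single run. The helper names pass_rate and assert_pass_rate, the number of runs, and the threshold are all assumptions chosen for illustration.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int) -> float:
    """Run a probabilistic check several times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check())
    return passed / runs


def assert_pass_rate(check: Callable[[], bool], runs: int = 10, threshold: float = 0.8) -> float:
    """Fail only if the check passes less often than the threshold."""
    rate = pass_rate(check, runs)
    if rate < threshold:
        raise AssertionError(
            f"pass rate {rate:.0%} is below the threshold {threshold:.0%} over {runs} runs"
        )
    return rate
```

In practice, check would wrap one scripted or simulated conversation plus its evaluator. The number of runs is then a direct trade-off between confidence in the result and the time and cost of the LLM calls.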

Technical Stack Overview

Compass is built on a scalable and reliable technical foundation:

Language Models and Embeddings: Compass utilizes the gemini-1.5-flash-001 model for its LLM capabilities and the textembedding-gecko version 3 model for embeddings. The gemini-1.5-pro-preview-0409 model is used for the LLM auto-evaluator. The Gemini model was chosen for its balanced performance across task accuracy, inference speed, rate availability, and cost.
Backend Technologies: Developed with Python 3.11, FastAPI 0.111, and Pydantic 2.7 for a performant server-side environment. An asynchronous framework suited the use case well, as LLM inference endpoints can be slow. Python was chosen for its extensive AI/ML library support and because it made it easier to integrate ML scientists into the development team.
Frontend Technologies: The UI, built with React.js 19, TypeScript 5, and Material UI 5, is optimized for mobile but performs well on tablets and desktops. Additionally, we use Storybook 8.1 to showcase, visually inspect, and test UI components in isolation.
Data Persistence: Data is securely stored using MongoDB Atlas, which includes vector search capabilities. Our team was already familiar with MongoDB, and the taxonomy was already in MongoDB Atlas, so it was a natural choice.
Deployment: The entire application is deployed on Google Cloud Platform (GCP), ensuring high availability and scalability. We use Pulumi to deploy nearly all the infrastructure, as it allows us to write deployment code in Python, aligning with the rest of the backend development. Additionally, our team was already experienced with Pulumi, making it a natural choice. For error tracking and application performance monitoring, we use Sentry.
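
To illustrate why an asynchronous backend fits slow LLM endpoints, the sketch below issues several simulated model calls concurrently with asyncio. The fake_llm_call coroutine and its latency are stand-ins invented for this example; no real model API is called, and this is not code from the Compass backend.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, latency_s: float = 0.1) -> str:
    """Stand-in for a slow inference endpoint; only sleeps, then echoes."""
    await asyncio.sleep(latency_s)
    return f"response to: {prompt}"


async def run_concurrently(prompts: list[str], latency_s: float = 0.1) -> list[str]:
    # While one call waits on the network, the event loop serves the others,
    # so the total time is close to one call's latency, not the sum.
    return await asyncio.gather(*(fake_llm_call(p, latency_s) for p in prompts))


if __name__ == "__main__":
    start = time.perf_counter()
    answers = asyncio.run(run_concurrently(["a", "b", "c", "d", "e"]))
    print(len(answers), "answers in", round(time.perf_counter() - start, 2), "seconds")
```

The same property lets a single server process keep many user conversations moving while each one waits for its own model response.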

Contribute and Get Involved

Compass is built on an AI architecture that combines agentic workflows with sophisticated LLM utilization. Its capacity to mimic human conversation, minimize irrelevant outputs, and trace responses back to user inputs, together with its use of an inclusive labour taxonomy, distinguishes it in the space of personalized AI interactions.

The code is open source under the MIT license and available here. We would love to hear your feedback, questions, or contributions to our code. We are particularly interested in skills spanning AI/ML, backend, and UI to help improve and expand Compass for our partners.