Paper citation:
Jiang, Weipeng, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, and Chao Shen. “COSTELLO: Contrastive Testing for Embedding-Based Large Language Model as a Service Embeddings.” Proceedings of the ACM on Software Engineering 1, no. FSE (July 12, 2024): 906–28. https://doi.org/10.1145/3643767.

Summary:
This paper presents Costello, a contrastive-learning-based approach for testing and validating the embeddings of large language models.
Why validate the embeddings?
Embeddings are used in many downstream tasks by customers of the LLMaaS:
- The customer collects labeled training data for a downstream task.
- They convert this data into embeddings using an API call to the LLM.
- They may train a downstream ML model (like a classifier) using these embeddings as input features.
- During inference, input data is converted into embeddings using an API call, passed through the trained model, and the output (e.g., a label or prediction) is generated.
Thus, having good embeddings is important; a user of LLMaaS needs to validate that the embeddings they are accessing are of high quality.
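A minimal Python sketch of this pipeline (the `embed` stub, its 16-dimensional output, and the toy data are illustrative assumptions, not the paper's setup):

```python
# Minimal sketch of the customer-side pipeline. `embed` stands in for the
# provider's embedding API; here it is a toy stub so the sketch runs, but in
# practice it would be an HTTP call to the LLMaaS endpoint.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: a real system would call the provider's API.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

# 1. Collect labeled training data for the downstream task (e.g., sentiment).
train_texts = ["He likes the movies", "He hates the movies"]
train_labels = [1, 0]

# 2. Convert the data into embeddings via the (stubbed) API call.
X_train = np.stack([embed(t) for t in train_texts])

# 3. Train a downstream classifier on the embeddings.
clf = LogisticRegression().fit(X_train, train_labels)

# 4. Inference: embed the new input, pass it through the trained model.
print(clf.predict(embed("He loves the movies").reshape(1, -1)))
```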
What is contrastive learning:
Contrastive learning is a widely used approach in self-supervised representation learning. Its main idea is to map input samples into a high-dimensional feature space and use a similarity metric to differentiate between semantically similar (positive) and dissimilar (negative) pairs.
The key objective is to bring positive samples closer together and push negative samples farther apart by designing an appropriate loss function.
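For concreteness, one common margin-based formulation of such a loss (an assumed example; the paper does not tie itself to this exact form) can be sketched as:

```python
# A minimal margin-based contrastive loss; one common formulation,
# assumed here for illustration.
import numpy as np

def contrastive_loss(anchor: np.ndarray, other: np.ndarray,
                     is_positive: bool, margin: float = 1.0) -> float:
    """Pull positive pairs together; push negative pairs at least
    `margin` apart, using Euclidean (L2) distance."""
    d = np.linalg.norm(anchor - other)
    if is_positive:
        return d ** 2                    # positives: minimize distance
    return max(0.0, margin - d) ** 2     # negatives: penalize if within margin
```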
How does Costello work:
- Generate a seed sentence, e.g., “He likes the movies”
- Generate a test case which contains two pairs, for example:
- Similar pair: He likes the movies ↔ He loves the movies
- Dissimilar pair: He likes the movies ↔ He hates the movies
- The pairs are generated by mutators: replacing words with synonyms (for similar) and antonyms (for dissimilar)
- Provide the test case to the LLMaaS and receive an embedding for each sentence; measure the distance between the embeddings (L2 norm). A test then passes if:
distance of dissimilar pair − distance of similar pair > threshold
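A minimal sketch of this oracle, reusing the hypothetical `embed` call from the pipeline sketch above (the threshold value 0.1 is an illustrative assumption; the paper also considers an adaptive threshold):

```python
import numpy as np

def costello_test(seed: str, similar: str, dissimilar: str,
                  threshold: float = 0.1) -> bool:
    """Pass iff the dissimilar pair is farther apart than the similar
    pair by more than `threshold` (L2 distance); False means a violation."""
    e_seed = embed(seed)                                # "He likes the movies"
    d_sim = np.linalg.norm(e_seed - embed(similar))     # vs. "He loves the movies"
    d_dis = np.linalg.norm(e_seed - embed(dissimilar))  # vs. "He hates the movies"
    return (d_dis - d_sim) > threshold
```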
Summary of the research questions:
RQ1 (Accuracy): Can the tests generated by the tool actually detect whether the embeddings are adequate for the downstream task (that uses the embeddings)?
RQ2 (Components of approach): How do different design parameters (e.g., the distance metric between embeddings, the threshold) affect the accuracy?
RQ3 (Usefulness): Can the test cases be used to improve the downstream task (e.g., fine-tuning on the contrastive pairs to improve accuracy)?
RQ4 (Case study): Applying Costello to commercial LLMaaS (e.g., Ali Cloud) and finding failing test cases.
Summary of the results:
- Costello generates test suites and discovers numerous violations, over 51.22% of which cause significant issues in downstream classifiers.
- Lp-norm distance is more effective than cosine distance, and an adaptive threshold improves precision.
- Violation samples detected by Costello enhance the behavior of both language models and downstream classifiers.
- Costello is applicable to real-world commercial LLMaaS (e.g., Ali Cloud, NLPCloud), identifying typical accuracy and fairness violations.