AI sandbagging is a term used in AI safety for strategic underperformance on an evaluation: an artificial intelligence system deliberately performs below its actual ability on official evaluations in order to appear less powerful or less capable than it is.[1]
References
1. van der Weij, Teun; Hofstätter, Felix; Jaffe, Ollie; Brown, Samuel F.; Ward, Francis Rhys (2024-06-11). "AI Sandbagging: Language Models can Strategically Underperform on Evaluations". arXiv.org. Retrieved 2024-09-16.
External links