Stories about AI gone bigoted are easy to find: Microsoft’s Neo-Nazi “Tay” bot, her still-racist sister “Zo”, Google’s autocomplete function that assumed men occupy high-status jobs, and Facebook’s job-related targeted advertising, which assumed the same.

A key factor in AI bias is that the technology is trained on faulty databases. Databases are made up of existing content. Existing content comes from people interacting in society. Society has historic, entrenched, and persistent patterns of privilege and disadvantage across demographic markers. Databases reflect these structural societal patterns and thus replicate discriminatory assumptions. For example, Rashida Richardson, Jason Schultz, and Kate Crawford put out a paper this week showing how policing jurisdictions with a history of racist and unprofessional practices generate “dirty data” and thus produce dubious databases from which policing algorithms are derived. The point is that database construction is a social and political task, not just a technical one. Without concerted effort and attention, databases will be biased by default.

Ari Schlesinger, Kenton P. O’Hara, and Alex S. Taylor raise an interesting suggestion, and an interesting question, about database construction. They are interested in chatbots in particular, but their point extends easily to other forms of AI. They note that the standard practice is to manage AI databases by constructing a “blacklist”: a list of words that will be filtered out of the AI’s training data. Blacklists generally include racist, sexist, homophobic, and other forms of offensive language. The authors point out, however, that this approach is less than ideal for two reasons. First, it can eliminate innocuous terms and phrases in the name of caution. This doesn’t just limit the AI; it can also erase forms of identity and experience. The authors give the example of “Paki”, a derogatory racist term. Filtering this string of letters also filters out the word Pakistan, so an entire country and nationality gets deleted from the lexicon. Second, language is dynamic and meanings change. Blacklists are relatively static and thus quickly become dated and ineffective.
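To make the first problem concrete, here is a minimal Python sketch of how a string-matching blacklist can swallow an innocuous word. It is an illustration only, not the authors’ method or any real chatbot’s pipeline: the one-item `BLOCKLIST`, the function names, and the three-sentence `corpus` are all made up for the example. The second function shows whole-word matching as one possible patch, which is my assumption rather than something the paper proposes.

```python
import re

# Hypothetical one-item blocklist, using the example term from the paper.
BLOCKLIST = ["paki"]

def substring_filter(sentences, blocklist=BLOCKLIST):
    """Drop any sentence that contains a blocked string anywhere,
    including inside a longer, innocuous word."""
    return [
        s for s in sentences
        if not any(term in s.lower() for term in blocklist)
    ]

def whole_word_filter(sentences, blocklist=BLOCKLIST):
    """Drop a sentence only when a blocked term appears as a whole word."""
    if not blocklist:
        return list(sentences)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in blocklist) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if not pattern.search(s)]

# Made-up training sentences for illustration.
corpus = [
    "My family moved here from Pakistan.",
    "The weather is nice today.",
    "Someone yelled the word Paki at us.",
]

naive_result = substring_filter(corpus)   # loses the Pakistan sentence too
patched_result = whole_word_filter(corpus)  # keeps it, drops only the third
```

Note that even the patched version removes the third sentence, which is someone describing an experience of racism rather than expressing it. Whole-word matching also does nothing about the second problem: the list itself stays static while language changes.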

The authors suggest instead that databases be constructed proactively through modeling. Rather than tell databases what not to say (or read, hear, calculate, etc.), we ought to manufacture models of desirable content (e.g., people talking about race in a race-conscious way). I think there’s an interesting point here, and an interesting question about preventative versus proactive approaches to AI and bias. Likely, the approach has to come from both directions: omit that which is explicitly offensive and teach in ways that are socially conscious. How to achieve this balance remains an open question, both technically and socially. My guess is that constructing social models of equity will be the most complex part of the puzzle.


Jenny Davis is on Twitter @Jenny_L_Davis
