Using Data Engineering to Monitor Due Process Violations in Asylum Hearings

How I engineered legal data to analyze linguistic discrimination against asylum seekers.

WWI propaganda poster. Source: Flickr/Mennonite Church USA Archive

Currently, the United States is experiencing what the New Yorker describes as a “translation crisis at the border.” When migrants from South America arrive, they’re met by paramilitary officers and immigration agents. Interpreters are integral to the asylum process, since they facilitate communication between asylum applicants and law enforcement.

However, there aren’t enough interpreters to serve asylum seekers, especially those who are indigenous or who speak a language with a small speaking population.

As a data engineer for non-profit advocacy organization Human Rights First’s asylum case database, I was curious if there was just as much of a translation crisis in the courtroom as at the border.

So, I engineered new features in the database to study whether variables like an asylum applicant’s native language and access to an interpreter had an effect on my research target variable, the applicant’s perceived legal credibility.

Translation: “Indigenous Languages in Latin America: More than a fifth of the 557 indigenous languages spoken in Latin America are in grave danger of going extinct, according to a new map published by UNICEF this month. SOURCE: ‘Sociolinguistic Atlas of Indigenous Communities in Latin America,’ UNICEF.”


Because my career mission is to break barriers to information, I was excited to serve as a data engineer for Human Rights First’s asylum case database. The goal of the database is to give asylum lawyers a centralized place to research case records and judicial trends. (Click here to view the codebase on GitHub.)

At the start, we inherited a scraper that pulled the following information from both initial hearing and appellate decisions:

  • Applicant’s country of origin
  • Applicant’s sex
  • Date of document
  • Judge(s) hearing the case
  • Type of application
  • Protected grounds
  • Type of violence applicant experienced (if applicable)
  • Hearing outcome

While the inherited scraper gave a database user basic information about the asylum applicant, their case and the case’s outcome, it did not dig into possible under-researched biases. With my particular interest and previous experience in linguistic justice advocacy, I wanted to explore whether an applicant’s language played a role in those outcomes.

Why would a judge deny asylum?

While reviewing the 130-or-so asylum hearing decisions provided as testing data, I noticed the term “credibility” occurring over and over. How trustworthy did the judge find the asylum applicant to be? Did the judge determine that the applicant was telling the full truth?

Immigration judges determine an asylum applicant’s credibility by the consistency of their testimony, and by whether that testimony meets the burden of proof for each claim to asylum. No matter how dangerous the conditions are back home, if the applicant doesn’t present a clear, consistent narrative, then the judge will deny their case.

However, the judge’s ability to evaluate an applicant’s testimony depends on their ability to communicate effectively with the applicant. If an applicant doesn’t speak English, they must rely on an interpreter to communicate their claims for asylum. Combined with the cultural, racial and socioeconomic differences between the three parties (judge, interpreter and applicant), communication can become misconstrued, or the applicant might become wary of sharing every detail of their narrative.

(It should be noted that not all asylum applicants are non-English speakers. People from countries where English is a lingua franca or an official language also seek asylum in the United States, and they face linguistic discrimination in the courts as well. Testimony provided in varieties of English, such as African American Vernacular English (AAVE), is also not considered to be as legally credible.)

In order to study the relationship between language, racial/ethnic identity and legal credibility, I designed code to pull information from an asylum hearing document about the applicant’s perceived credibility; their native language; race/ethnicity — specifically, indigenous status — and access to an interpreter during their hearings, if applicable.

Translating the research question into code

When designing the fields, I wrote rules based on the syntactic patterns in the legal documents provided as testing data.

For the field determining the applicant’s perceived credibility, I used the following logic: if the term “credible” is found in the document and the word before it is not “not,” then the judge perceived the applicant as credible; if “credible” appears and the word “not” does precede it, then the applicant was not perceived as credible. However, when designing this field, I also had to take into account cases that did not mention an applicant’s perceived credibility.

To account for those cases, I added a rule: if the term “credibility” is not present in the document, then fill in the field as “Not applicable to the case.” Without this rule, an asylum applicant could be misidentified as not credible when their credibility was never in question.
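
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project’s actual function: the name get_credibility and the exact return labels are assumptions, and the sketch simplifies in two ways. It treats any document without the word “credible” as not applicable, and it lets a single “not credible” outweigh other mentions.

```python
import re

NOT_APPLICABLE = "Not applicable to the case"


def get_credibility(text: str) -> str:
    """Classify a decision's credibility finding with simple keyword rules."""
    # Lowercase and split into alphabetic tokens so punctuation is ignored.
    words = re.findall(r"[a-z]+", text.lower())
    positions = [i for i, word in enumerate(words) if word == "credible"]

    # Credibility is never discussed: do not label the applicant either way.
    if not positions:
        return NOT_APPLICABLE

    # Assumption: one "not credible" anywhere marks the finding as negative.
    if any(i > 0 and words[i - 1] == "not" for i in positions):
        return "Not credible"
    return "Credible"


print(get_credibility("The court finds the respondent credible."))
print(get_credibility("The respondent was not credible."))
print(get_credibility("The application is denied."))
```

A rule this simple will miss phrasings such as “lacked credibility” or “incredible,” which is one reason the field needs testing against more decisions.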

Left: Function to search for an applicant’s perceived credibility. Right: Function to search for an applicant’s access to an interpreter.

For the field determining the applicant’s native language, I used the following logic: if the term “native speaker” or “native speakers” appears in the document, then the native language must be the two subsequent words. If neither term appears, then it’s assumed that the applicant is a native speaker of English, or at least did not declare any language other than English.
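
As a rough sketch of that rule (again, not the codebase’s own function), the lookup could be written as follows. The name get_native_language, the default label and the choice to drop a leading “of” from the two captured words are all my assumptions for illustration.

```python
DEFAULT_LANGUAGE = "English (assumed)"
PUNCTUATION = ".,;:()\""


def get_native_language(text: str) -> str:
    """Return the words that follow 'native speaker(s)', if the phrase appears."""
    words = text.split()
    for i in range(len(words) - 1):
        first = words[i].lower()
        second = words[i + 1].lower().strip(PUNCTUATION)
        if first == "native" and second in {"speaker", "speakers"}:
            # Take the two subsequent words, trimmed of surrounding punctuation.
            following = [w.strip(PUNCTUATION) for w in words[i + 2:i + 4]]
            # Assumption: in "native speaker of Mam", only "Mam" is the language.
            if following and following[0].lower() == "of":
                following = following[1:]
            return " ".join(following) or DEFAULT_LANGUAGE
    # No declaration found: fall back to the English assumption.
    return DEFAULT_LANGUAGE


print(get_native_language("The respondent is a native speaker of Mam and testified."))
```

Because the rule only reads a fixed window of words, a language name longer than that window would be cut off, so the output is best treated as a hint for a human reviewer.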

For determining if the applicant is a member of an indigenous group or nation, I constructed a field using similar logic: if the term “indigenous” appears in the document, then the applicant is indigenous, and their group/nation/tribe is stated in the two previous words. Because indigeneity is an umbrella term, I decided to return the name of the group/nation/tribe, instead of just a “Yes” or “No,” so that asylum lawyers would have a more detailed understanding of their client.
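
A comparable sketch for the indigenous-status field looks backward instead of forward. The function name get_indigenous_group and both fallback labels are hypothetical; the only logic taken from the description above is “find the word indigenous, then return the two words before it.”

```python
PUNCTUATION = ".,;:()\""


def get_indigenous_group(text: str) -> str:
    """Return the two words preceding 'indigenous', or a fallback label."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(PUNCTUATION) == "indigenous":
            # The group/nation/tribe is assumed to sit in the two previous words.
            previous = [w.strip(PUNCTUATION) for w in words[max(0, i - 2):i]]
            return " ".join(previous) or "Indigenous (group not stated)"
    return "Not applicable to the case"


print(get_indigenous_group("The respondent belongs to the Maya Mam indigenous community."))
```

If the document phrases it the other way around (“an indigenous Mam woman”), this rule would return the wrong words, so the pattern only holds for documents that follow the syntax seen in the testing data.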

For the final field, an applicant’s access to an interpreter, I implemented a series of “If/Else” statements: if the term “interpreter” or “translator” appears in the file, then the scraper searches the nearby words for terms such as “granted” and “was present.” If those terms appear nearby, then the scraper determines that an interpreter was not only requested and granted for the client, but also present to interpret for the asylum applicant.
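
The if/else chain could look something like the sketch below. The 10-word window, the function name get_interpreter_access and the three return labels are assumptions I chose for illustration; the description above does not say how “nearby” was defined.

```python
KEYWORDS = ("interpreter", "translator")
CONFIRMATIONS = ("granted", "was present")


def get_interpreter_access(text: str, window: int = 10) -> str:
    """Check whether an interpreter is mentioned and confirmed nearby."""
    words = text.lower().split()
    # Substring test so "interpreter," and "interpreters" also count as hits.
    hits = [i for i, w in enumerate(words) if any(k in w for k in KEYWORDS)]
    if not hits:
        return "No interpreter mentioned"
    for i in hits:
        # Look at `window` words on either side of each mention.
        nearby = " ".join(words[max(0, i - window):i + window + 1])
        if any(term in nearby for term in CONFIRMATIONS):
            return "Interpreter granted and present"
    return "Interpreter mentioned, access unclear"


print(get_interpreter_access("The request for a Mam interpreter was granted."))
```

One known weakness of a window search is that it cannot tell “granted” from “not granted,” or an interpreter being granted from asylum being granted in the same sentence, so negation handling would be a sensible refinement.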

Left: Function to search for an applicant’s native language, as explicitly documented. Right: Function to search for an applicant’s specific tribe/group, if applicable.

Obstacles with testing feature accuracy

There are currently only around 150 case decisions populating the database. Because this quantity is lower than the 500-observation industry standard for quality model training, I cannot train the model and share predictive outcomes in good faith.

Human Rights First is in the process of forming relationships with other organizations and schools across the country to develop a steady source of training data. Once there is an established stream of data, data scientists will be able to train the model and make predictions with it.

Next Steps

Besides accruing more testing data, one of the most pressing next steps would be to improve the model training methods.

Due to lack of data, the model must rely on supervised learning. However, as Human Rights First populates the database, I highly recommend that future data scientists train the model to conduct data analysis unsupervised. Because Human Rights First is a nonprofit, an unsupervised model will allow the organization to dedicate its resources elsewhere, such as accruing more data or expanding partnerships.

Another next step would be to improve the model’s accuracy.

To continue improving the accuracy of these four natural language processing models, I would implement packages like fuzzywuzzy, which performs approximate (fuzzy) string matching, so that names containing accent marks, or spelled differently in English transliteration, would not be accidentally skipped. Additionally, I would expand the “credibility” field to count the number of times credibility-related terms, such as “burden of proof” and “inconsistencies,” appear in the document, so that an asylum lawyer can better understand the elements of the applicant’s testimony that led them to be perceived as not credible.
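
Both ideas can be prototyped with the Python standard library alone. In the sketch below, difflib’s SequenceMatcher stands in for fuzzywuzzy so the example has no third-party dependency; the 0.85 similarity threshold, the function names and the term list are assumptions for illustration, not settings from the project.

```python
import unicodedata
from difflib import SequenceMatcher

CREDIBILITY_TERMS = ("burden of proof", "inconsistencies")


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. so an accented e becomes a plain e."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Compare two names after accent stripping, tolerating small spelling gaps."""
    a_norm = strip_accents(a).lower()
    b_norm = strip_accents(b).lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


def count_credibility_terms(text: str) -> dict:
    """Count how often each credibility-related term appears in the document."""
    lowered = text.lower()
    return {term: lowered.count(term) for term in CREDIBILITY_TERMS}


print(names_match("José Hernández", "Jose Hernandez"))
print(count_credibility_terms("The burden of proof was not met due to inconsistencies."))
```

The threshold trades missed matches against false ones, so it would need tuning on real names from the case decisions before being trusted.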

So what?

While more than 350 languages are spoken in the United States, English (Standard American English, specifically) is the most legally, politically and socially powerful. Unwillingness or inability to comply risks punishment.

When claiming asylum, that punishment is deportation, potentially to a place where an adult is not expected to survive a month.

Through my work, asylum lawyers will now be able to study how linguistic differences impact a judge’s ability to trust testimony. Knowledge of this specific bias will provide an extra edge when preparing for hearings, which could break decision trends.

It could also save someone’s life.

Data scientist with a focus on advocacy and public records. Combining data and language to increase public access to information.
