You are where you email: Global migration trends discovered in email data

For the first time, comparable migration data is available for almost every country of the world. Until now, records were incompatible between nations, and breakdowns by gender and age were nonexistent. Emilio Zagheni from the Max Planck Institute for Demographic Research (MPIDR) in Rostock, Germany, has now assembled a rich migration database by compiling the global flow of millions of emails.

"Where estimates of demographic flows exist, they are often outdated and largely inconsistent," says MPIDR researcher Emilio Zagheni. Official records are difficult to use for various reasons. Emigrants tend not to register after they move to a new country or do so very late. There is also no clear agreement between nations on how to actually define a migrant.

Official migration data is outdated and inconsistent

"Global internet data does not have these drawbacks," says Zagheni. "You are where you email." Together with Ingmar Weber from Yahoo! Research, he traced emails sent from Yahoo! accounts around the world to infer the residence of their senders. Every device that sends email can be located, at least at the country level, by an internationally standardized code, the so-called IP address. Zagheni and Weber analysed the countries derived from IP addresses for a set of messages sent by 43 million anonymous Yahoo! account holders between September 2009 and June 2011.

In addition to the date and geographical origin of each message, they compiled the self-reported birthday and gender of the sender. When a person began sending email from a new location on a permanent basis, it was assumed that he or she had changed residence. In this way they were able to calculate rates of migration from and to almost every country in the world. Only anonymized data was used, so identifying individuals was impossible, and no information about the recipients, the subject, or the content of a message was accessed. The findings have now been published in the ACM Web Science Conference Proceedings.
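The residence rule described above can be illustrated with a minimal sketch. It assumes each anonymous account is reduced to a list of (date, country code) pairs; the function name infer_move, the min_messages_after threshold, and the "final unbroken run" rule are hypothetical choices made for illustration, not the criteria the researchers actually used.

```python
from datetime import date


def infer_move(messages, min_messages_after=3):
    """Return (origin, destination, first_date_in_destination) or None.

    `messages` is a list of (date, country_code) tuples for one account.
    Illustrative assumption: the account counts as having moved if its
    final unbroken run of messages comes from one country that differs
    from the country of its first message, and that run contains at
    least `min_messages_after` messages.
    """
    if not messages:
        return None
    ordered = sorted(messages)
    origin = ordered[0][1]
    destination = ordered[-1][1]
    if destination == origin:
        # The account ends where it started: a trip, not a move.
        return None
    # Walk backwards to find where the final unbroken run begins.
    start = len(ordered) - 1
    while start > 0 and ordered[start - 1][1] == destination:
        start -= 1
    if len(ordered) - start < min_messages_after:
        return None
    return origin, destination, ordered[start][0]


# Made-up message history for a single hypothetical account.
history = [
    (date(2010, 1, 5), "MX"),
    (date(2010, 2, 9), "MX"),
    (date(2010, 6, 1), "US"),
    (date(2010, 6, 20), "US"),
    (date(2010, 7, 3), "US"),
]
print(infer_move(history))
```

Requiring a sustained run from the new country is one simple way to separate a change of residence from a short trip; counting such moves per country pair, age group and gender would then give the raw material for migration rates.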

The results are more than a proof of concept. They also reveal characteristics of international migration never seen before. For the USA, Zagheni and Weber were able to produce the first-ever curve of emigration by age and sex. "In the U.S. many statistics are collected about people who move into the country, but there is no system that keeps track of people who move out," says Emilio Zagheni.

The potential of the email statistics goes far beyond calculating gross country profiles. For instance, the researchers also looked into Mexico-US cross-border mobility. The data reveals how strongly both countries are demographically integrated: most people who moved from Mexico to the United States either spent time in the USA before emigrating north, or went back to visit Mexico soon after moving to the United States. Those in their 30s have the highest rate of mobility across the Mexico-US border, while the least mobile are those 50 and older.

Only the tip of the iceberg

The strength of Zagheni and Weber's migration data comes not only from the vast number of emails available, but also from a mathematical model they set up to adjust for a typical shortcoming of email statistics: those who send email are not representative of the entire population. Some groups, like the elderly, use email less or not at all and are thus underrepresented. The researchers calculated adjustment factors for such groups by gauging their email data against migration numbers from European countries, where official data is fairly reliable.
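The idea of calibrating against reliable official figures can be sketched as follows. This is a deliberately simple stand-in, assuming one multiplicative factor per age group computed as the ratio of official to email-derived rates; it is not the researchers' actual model, and all function names and numbers below are invented for illustration.

```python
def adjustment_factors(email_rates, official_rates):
    """Per-group ratio of official to email-derived migration rates."""
    factors = {}
    for group, email_rate in email_rates.items():
        # Skip groups with no email signal or no official benchmark.
        if email_rate > 0 and group in official_rates:
            factors[group] = official_rates[group] / email_rate
    return factors


def adjust(email_rates, factors):
    """Scale email-derived rates by the calibrated factors, where available."""
    return {g: r * factors[g] for g, r in email_rates.items() if g in factors}


# Invented rates for a reference country with reliable official data.
reference_email = {"20-29": 0.020, "60+": 0.002}
reference_official = {"20-29": 0.025, "60+": 0.005}
factors = adjustment_factors(reference_email, reference_official)

# Invented email-derived rates for a country without reliable records.
target_email = {"20-29": 0.040, "60+": 0.004}
print(adjust(target_email, factors))
```

In this toy setup the older group receives the larger factor because it is more heavily underrepresented among email users, which mirrors the reasoning in the text.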

"What we addressed so far is only the tip of the iceberg," says Emilio Zagheni. With further fine-tuning of the adjustment factors and the mining of more digital data, such as Twitter messages, more difficult questions could be tackled. For instance, one could track short- and long-term mobility patterns before and after a crisis like that of the Japanese Fukushima reactors. Digital records give demographers the chance to gain a more accurate picture of population dynamics in regions where they have so far had to rely on guesswork, says Zagheni. "This research has the most potential in developing countries, where the Internet spreads much faster than registration programs develop."