Populations change through three major processes:
· Mortality, and
A useful way to express the rate at which women have children is the Total Fertility Rate (TFR). TFR is the average number of children that would be born per woman if all women lived to the end of their childbearing years and bore children according to a given set of age-specific fertility rates. If the average woman has approximately 2 children in her lifetime, this is just enough to maintain the population.
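The definition above can be made concrete with a small calculation. The following is a minimal sketch, assuming fertility rates are given per 1000 women per year for five-year age groups; the function name and the rates are our own illustrative choices and do not describe any real country.

```python
def total_fertility_rate(asfr_per_1000, interval_years=5):
    """Sum age-specific fertility rates (births per 1000 women per year)
    over age groups of equal width and convert to children per woman."""
    return interval_years * sum(asfr_per_1000) / 1000.0

# Hypothetical rates for ages 15-19, 20-24, ..., 45-49 (illustrative only).
example_asfr = [20, 90, 120, 90, 50, 20, 10]
tfr = total_fertility_rate(example_asfr)
print(f"TFR = {tfr:.2f} children per woman")
```

With these invented rates a woman who lived through all seven age groups would bear two children on average, i.e. roughly the level the text describes as just enough to maintain the population.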
Figure 1: TFR in countries in 2002.
As seen in Figure 1, TFR is high in some countries and low in others. In most European countries, TFR in 2006 was below 1.5 children per woman, which is far lower than desired. Sustained low fertility rates can lead to a rapidly aging population and, in the long run, may place a burden on the economy and the social security system, because the pool of younger workers responsible for supporting the dependent elderly population is shrinking. Tracking trends in fertility rates and the factors that influence them helps to support effective social planning and the allocation of basic resources across generations.
So far, scientific efforts in demography have been devoted mainly to exploring and defining the process of data collection and to the qualitative interpretation of statistical results, with little emphasis on new methods of data analysis. Data are typically analyzed with event history regression methods, Markov transition models and the optimal matching method, using widely used statistical packages such as SPSS, SAS, S-Plus, Stata, R, TDA, etc. Our hypothesis is that between these typical aggregate descriptions and causal analysis there is a deficit of research on complex relations. Several modern methods, including data mining, offer opportunities to fill this gap.
In recent decades, data mining tools for knowledge discovery from data (KDD) have proved successful in various fields. However, an internet search showed that these approaches have received little attention in demographic analyses. There are some publications: e.g. Blockeel et al. showed how mining frequent item sets may be used to detect temporal changes in the frequency of event sequences in the Austrian FFS data. In Billari et al., three of the authors experimented with an induction tree approach for exploring differences between Austrian and Italian life event sequences. Oris et al. initiated social mobility analysis with induction trees. Unlike the statistical modeling approach, these methods make no assumptions about an underlying process generating the data and proceed mainly heuristically. These approaches differ from ours because we study rather static data and do not yet apply sequential rule mining to historical demographic data.
Data Mining for Demography
Successful data mining relies on investigating the data in various ways, using different methods, parameters, and data selections, to find the most meaningful relations.
Basic data description
Data for machine learning and data mining are most commonly presented in attribute-class form, i.e. in a “learning matrix”, where rows represent examples and columns represent attributes. In our case, an example corresponds to one country, and the class of the country, presented in the last column, denotes its fertility rate. The first attribute is the name of the country. Altogether there are 95 basic attributes and 147 countries. Attributes and their values were partially obtained from demographic sources such as the UN, Eurostat, and the Slovenian statistical database. Several of the attributes were obtained from the internet, based on the assumption that they might show some interesting demographic relation. We tried to collect as many attributes as possible, regardless of whether their relation to the fertility rate was expected to be positive or negative.
Attributes in the demographic literature are grouped into biological and social, since human fertility is a socially formed biological process. Newer literature introduces increasingly complex structures, based on a detailed grouping of social factors. Malai divided the factors that impact fertility rate into six groups:
· anthropological and
By attribute modifications we mean eliminating some columns of the learning matrix and adding new columns, i.e. attributes. Subgroups of columns were chosen based on the demographic categories and by data mining (DM) methods. Five new attributes were added during the DM process, bringing the total number of attributes to 100. Around half of the experiments were performed on 100 attributes.
Besides the basic class discretization into two values, three values of TFR were tried as well: low (< 2), middle (2-3) and high (>3).
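The three-valued discretization can be written as a small function. This is a sketch under the thresholds stated above; the function name is ours, and the boundaries 2 and 3 are assigned to the middle class as the ranges suggest.

```python
def tfr_class(tfr):
    """Map a TFR value to the three classes: low (< 2), middle (2-3), high (> 3)."""
    if tfr < 2:
        return "low"
    if tfr <= 3:
        return "middle"
    return "high"

# Illustrative values only, not taken from the data set.
for value in (1.4, 2.0, 2.7, 3.0, 5.2):
    print(value, tfr_class(value))
```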
In another attempt, countries were classified according to the decrease or increase of TFR. First, the average TFR for the years 2000-2005 was subtracted from the average UN-predicted TFR for the years 2005-2010. The obtained value, ΔTFR, was discretized into two classes:
Or three classes:
· decreasing (ΔTFR<-0.5),
· stable (-0.5<ΔTFR<0.5), and
· increasing (ΔTFR>0.5).
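The ΔTFR class can be sketched in the same way. The code assumes that the decreasing class is meant to cover ΔTFR below -0.5, so that it is the mirror image of the increasing class; function names are ours, and values exactly at ±0.5, which the text leaves unspecified, are treated here as stable.

```python
def delta_tfr(avg_predicted_2005_2010, avg_2000_2005):
    """Predicted average TFR for 2005-2010 minus average TFR for 2000-2005."""
    return avg_predicted_2005_2010 - avg_2000_2005

def delta_tfr_class(delta, margin=0.5):
    """Three-valued trend class; the boundary values fall into 'stable' (our assumption)."""
    if delta < -margin:
        return "decreasing"
    if delta > margin:
        return "increasing"
    return "stable"

# Illustrative numbers only.
print(delta_tfr_class(delta_tfr(2.0, 3.0)))  # decline of one child per woman
print(delta_tfr_class(delta_tfr(2.1, 2.0)))  # small change
```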
Modifications of Learning Examples
Learning examples consisted of 147 countries, each represented by a row in the learning matrix. Modifications were performed by eliminating or choosing specific rows to form a new learning matrix. A typical example would be the subgroup “developed countries”, consisting only of countries with a high gross domestic product (GDP) or a low Failed States Index (FSI). GDP is defined as the total market value of all final goods and services produced within a given country or region in a given period of time (usually a calendar year). FSI, on the other hand, combines several attributes describing the strength of the central government, the provision of public services, the level of corruption and criminality, the percentage of refugees and involuntary movement of populations, and the extent of economic decline. Since 2005, the index has been published annually by the United States think-tank the Fund for Peace and the magazine Foreign Policy. Reviewing GDP yielded two groups of countries:
· well developed countries with GDP above $1000 per capita (39 countries), and
· developing countries with GDP below $1000 per capita (108 countries).
Examination of FSI revealed three groups of countries:
· developed with FSI lower than 39.45 (29 countries),
· moderately developed with FSI 39.45-61.4 (21 countries), and
· developing with FSI>61.4 (97 countries).
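Selecting rows by developmental status amounts to a simple filter over the learning matrix. The sketch below uses the FSI thresholds listed above; the three rows are hypothetical placeholders, and the dictionary keys are our own naming, not the attribute names used in the experiments.

```python
def fsi_group(fsi):
    """Group a country by its Failed States Index using the thresholds in the text."""
    if fsi < 39.45:
        return "developed"
    if fsi <= 61.4:
        return "moderately developed"
    return "developing"

# Hypothetical rows of a learning matrix (names and values are invented).
learning_matrix = [
    {"country": "A", "fsi": 21.0, "tfr_class": "low"},
    {"country": "B", "fsi": 55.3, "tfr_class": "low"},
    {"country": "C", "fsi": 92.7, "tfr_class": "high"},
]

# Keep only the rows of developed countries to form a new learning matrix.
developed_only = [row for row in learning_matrix if fsi_group(row["fsi"]) == "developed"]
print([row["country"] for row in developed_only])
```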
Machine learning and, lately, data mining are among the most successful application areas of artificial intelligence. Whenever there are many learning examples, these systems learn properties of the domain and make predictions about future cases. These systems not only compete with statistical methods in terms of accuracy, they also introduce several new approaches such as cooperation between systems and humans. The constructed knowledge is often in the form of readable, understandable trees, rules and other representations, thus enabling further study and fine tuning. Two examples of successful scientific and engineering DM tools are Weka and Orange. Both systems provide tens of DM methods and several data preprocessing and visualization tools. From the ML and DM techniques available in Weka and Orange we chose J48, an implementation of C4.5, a method for the induction of classification trees. This method is most commonly used when the emphasis is on the transparency of the constructed knowledge. In our case this was indeed so, since the task was to extract the most meaningful relations from hundreds of constructed trees.
The most meaningful relations are those that are most significant to humans and at the same time have the best classification accuracy. To estimate the accuracy of the trees, the 10-fold cross-validation built into the system was used. The estimated accuracy of a classification tree corresponds to the probability that a new example will be correctly classified.
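The idea of 10-fold cross-validation can be illustrated with a short sketch. It is not the procedure built into the DM tools: the folds are formed by plain shuffling rather than stratification, a majority-class predictor stands in for the tree learner, and the 90/57 class split is invented for the example.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def cross_validated_accuracy(labels, k=10, seed=0):
    """Estimate the accuracy of a majority-class predictor by k-fold cross-validation."""
    correct = 0
    for test_indices in k_fold_indices(len(labels), k, seed):
        held_out = set(test_indices)
        # "Train" on everything outside the held-out fold.
        training = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = majority_class(training)
        correct += sum(labels[i] == prediction for i in test_indices)
    return correct / len(labels)

# 147 hypothetical class labels, one per country (the split is invented).
labels = ["high"] * 90 + ["low"] * 57
accuracy = cross_validated_accuracy(labels)
print(f"cross-validated baseline accuracy: {accuracy:.3f}")
```

Each example is used for testing exactly once, so the estimate is based on predictions for examples the predictor has not seen during training.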
For readers not familiar with classification trees, a short description is given in this paragraph. Classification trees are built in a top-down manner. The first task is to choose the most informative attribute, which is placed at the root of the classification tree. The next step is to add branches according to the values of the attribute. For a discrete attribute, there are as many branches as there are different values. For a numeric attribute, there are only two branches: one represents values less than or equal to the threshold proposed by the system, and the other represents greater values. The set of examples is divided into subsets corresponding to the branches. The process is then repeated recursively for each branch, using only those instances that belong to that branch. If at any time all instances at a node have the same classification, further branching is stopped and the node becomes a leaf labeled with that class. Splitting is usually also stopped as soon as sufficient statistical significance is obtained, in which case the leaf classifies into the majority class. Classification is performed by starting at the top of the tree and following, at each node, the branch that matches the attribute value of the example. At each leaf, the two numbers give the number of all examples at that leaf and the number of those belonging to a different class.
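The choice of a threshold for a numeric attribute can be sketched as follows. This is a simplified illustration, not the exact C4.5 procedure: it scores candidate thresholds by plain information gain (C4.5 uses the gain ratio) and places them midway between neighboring values. The data are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best split 'value <= threshold'."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lower, upper in zip(distinct, distinct[1:]):
        threshold = (lower + upper) / 2
        left = [label for value, label in zip(values, labels) if value <= threshold]
        right = [label for value, label in zip(values, labels) if value > threshold]
        # Weighted entropy of the two branches after the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Invented example: stillbirths per 1000 births and a two-valued TFR class.
stillbirths = [3.0, 5.0, 8.0, 14.0, 20.0, 30.0]
tfr_labels = ["low", "low", "low", "high", "high", "high"]
threshold, gain = best_threshold(stillbirths, tfr_labels)
print(threshold, gain)
```

Applying the same search recursively to the examples in each branch, and stopping when a branch is pure or too small, yields a tree of the kind described above.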
Experiments were performed with various method parameters, mainly changing levels of pruning. However, it turned out that default parameters were most successful.
Tens of trees were created in a systematic way, as presented in Figure 2. Experiments were performed first with TFR and ΔTFR as the class, then with all countries and with developed countries only. Finally, several selections of the attributes were tested: all, automatically selected, direct, social, economic, and educational. These tests resulted in 24 basic trees. In addition, various further experiments were performed.
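The count of 24 basic trees follows from combining the experimental dimensions. The sketch below enumerates them, assuming two class definitions, two country sets and six attribute selections; the label "automatically selected" is our reading of the figure captions, and all names are ours.

```python
from itertools import product

class_definitions = ["TFR", "delta TFR"]
country_sets = ["all countries", "developed countries"]
attribute_selections = [
    "all", "automatically selected", "direct",
    "social", "economic", "educational",
]

# One basic tree per combination of class, country set and attribute selection.
experiments = list(product(class_definitions, country_sets, attribute_selections))
print(len(experiments))
```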
Due to lack of space, only the most interesting trees are presented in this paper, i.e. those with the relations most meaningful to humans and with the best classification accuracy at the same time.
Firstly, the analysis was based on TFR as a class, with 2, 3 or more values. Only experiments with 2 or 3 values were interesting enough to be presented in this paper.
In the first fertility rate analysis, all 147 countries and all 95 available attributes were taken into consideration. The obtained tree is presented in Figure 3, showing that the most important indicator of high TFR is the number of stillborn children per 1000 births. More than 11.5 stillborn children per 1000 births is a strong supporting factor in favor of a high TFR of the country, and vice versa. The results are consistent with practically all literature in the demographic field and with the opinions of experts, who claim that the death of newborns is tightly connected with the social and economic status of mothers, who need to have several children to compensate for those who die. According to experts, more highly educated mothers usually have fewer children and lower newborn mortality. A low percentage of stillbirths is thought to be related to the cost of raising children, the different living conditions of urbanized and industrialized society, changes in attitudes towards women, the decay of the old patriarchal community, etc., which experts cite as the main reasons for fertility decline. As the tree in Figure 3 shows, these relations are indeed statistically the most relevant. However, the tree shows additional relations in a structured way with appropriately weighted leaves, i.e. nodes at the bottom of the tree. For example, the top right leaf “high (104/16)” includes 88 countries with high TFR and 16 with low TFR. The bottom left leaf, on the other hand, covers only 2 countries with high TFR, which makes this information statistically less important. Therefore, the tree contains just one other statistically well-supported relation: when the number of stillborn children is below 11.5, the majority religion is Christianity, and there are fewer men than women, TFR is low (35/1). This relation illustrates another crucial point about interpreting the tree.
Why should a Christian majority be negative for the fertility rate when Christianity places great emphasis on families, strong marriages and devotion to children? Further analyses show, as demographic experts pointed out long ago, that the populations of these countries have high divorce rates etc., meaning that people do not follow church teachings but live according to their own desires. The bottom right part of the tree, starting with a low percentage of women in the population, is statistically rather meaningless; however, population density and number of inhabitants give some indication that they are among the relevant attributes. Therefore, reading and interpreting trees demands some understanding of statistics, trees and the demographic literature.
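Leaf annotations such as “high (104/16)” can be read mechanically: the first number is the count of examples reaching the leaf and the second the count of those belonging to a different class. The helper below is a small sketch of that reading; the function name and pattern are ours and are not part of any DM tool.

```python
import re

LEAF_PATTERN = re.compile(r"^\s*(\S+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def parse_leaf(label):
    """Split a leaf label like 'high (104/16)' into (class, covered, other_class)."""
    match = LEAF_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a leaf label: {label!r}")
    predicted, covered, other = match.groups()
    # A leaf without a second number has no examples of a different class.
    return predicted, float(covered), float(other) if other is not None else 0.0

for text in ["high (104/16)", "low (35/1)"]:
    predicted, covered, other = parse_leaf(text)
    agreeing = covered - other
    print(f"{text}: {agreeing:.0f} of {covered:.0f} examples belong to class '{predicted}'")
```

Leaves that cover many examples with few exceptions, like these two, carry the statistically strongest relations in the tree.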
Each figure caption gives the cross-validation accuracy estimate. For Figure 3 it is over 80%, which is a reasonably good result. The default accuracy obtained by classifying only into the majority class is 89.4%.
In another attempt the class values were divided into three groups:
· low,
· moderate and
· high TFR.
The experiment once again revealed the most important attribute: “number of stillborn children”. However, the branching point leading to high TFR is in this case much higher: 53.56 children per 1000 births. In this tree, there are three major groups, each containing 30 to 40 countries: high, moderate and low. The major attribute distinguishing between moderate and low TFR countries is the length of maternity leave. At this point one should be aware that such attributes are potentially misleading semantically: countries with low TFR probably introduced longer maternity leave as a consequence and not as a cause. The tree therefore shows the most important relations without revealing their nature.
After the first tree was obtained, the seemingly most important attributes were eliminated in a series of tests in order to check whether other attributes could replace them and still yield similar accuracy. Instead of “number of stillborn children”, several attributes can be used: human development index (HDI), life expectancy rate, literacy rate, etc., all denoting the same concept. It is generally accepted that TFR is high in developing countries.
For maternity leave, eliminating the attribute lowers the accuracy to 68%. Although this attribute is obviously important, we are not able to establish the type of relation. Whatever the case, countries with short maternity leave have moderate TFR, and those with long maternity leave have low TFR.
Although the rest of the relations are not so significant, they represent a bigger share than in the previous tree and they seem to have two common denominators: developmental status and value system.
Altogether, the analyses so far indicate that developed countries, e.g. most of the European and North American countries, have low TFR, that developing countries have high TFR, and that moderately developed countries like Botswana, Bolivia, Honduras, Jamaica, etc. have moderate TFR.
We further filtered attributes using the algorithms in the DM tools. Again, as seen in Figure 5, the most distinctive attribute regarding TFR appears to be the number of stillborn children per 1000 births. When this number is lower than or equal to 11.55, TFR is low (under 2), with the exception of countries that do not ensure appropriate delivery treatment and invest most of their educational funds in the primary sector.
On the other hand, TFR is low despite a high number of stillborn children when the human development index (which combines life expectancy rate, literacy rate, educational rate and standard of living) is high, abortion is allowed and the unemployment rate is low (under 13.9%), or when abortion is not allowed but the country invests most of its educational funds in the primary sector and has long maternity leave (more than 11 weeks). The discovered relations indicate a meta attribute: the developmental status of the country.
Demographic experts classify fertility attributes, i.e. factors, into direct and indirect. Direct factors have a direct influence on fertile persons. In this context a decision tree including 4 attributes was built:
· legality of abortion on demand,
· number of abortions per 1000 people,
· percent of married women (between 15 and 49 years old) who use contraception and
· percent of elders infected with HIV or AIDS.
The obtained tree, with an accuracy of 82.31%, is presented in Figure 6. Legal abortion associated with a low percentage of HIV-infected elderly relates to low TFR, while illegal abortion and a lower percentage (less than 70) of women using contraception relates to high TFR. These attributes again seem to correlate with the meta attribute, i.e. the developmental status of the country, and with the value system, which may itself play an important role. The accuracy is very high, indicating that these attributes are meaningful.
Since many experts in the field agree that direct factors alone cannot explain how the fertility rate is determined, we further examined the influence of the indirect TFR factors. Eleven attributes that express society's attitude towards general life questions were analyzed: legality of homosexuality, legality of homosexual marriages, possibility of adoption by homosexuals, number of suicides per 10000 persons (men only, women only, altogether), legality of abortion, number of abortions per 1000 people, number of divorces per 1000 persons, percent of women in the parliament.
Experts generally find low TFR strongly related to economic factors, modernization of society and liberalization. The nature of the economic relations was examined by extracting 13 economic attributes that refer to unemployment, GDP, public health and social protection expenditure, number of working hours per week and inflation rate.
The tree indicates that high GDP, a low unemployment rate and high inflation (GDP deflator) relate to a low fertility rate, while low GDP per capita usually relates to a high TFR.
According to David Heer, economic progress should have a positive influence on the fertility rate. Overall statistics clearly contradict this hypothesis, at least in the modern world, where food is not scarce. Our analysis indicates that direct economic attributes are not very relevant for fertility on their own, at least not as relevant as other groups of attributes. For example, in Figure 8 high GDP per capita leads to high TFR in some cases and to low TFR in others. Becker (1981) presents a plausible explanation of such GDP-TFR relations. He claims that TFR depends on the disposable expenses and the expected usefulness of children. To support the thesis he gives the example of the rural family that used to have more children in order to secure help for maintaining the family. Labor was urgently needed for working in the fields, in the woods, etc. Nowadays, agriculture has become more and more automated, thus reducing the need for human labor. Consequently, the benefit of children relative to their cost dropped drastically and families began to shrink. Besides, factors like a higher educational level, a lower child mortality rate, and the desire of young people to build a career push TFR even lower. This linkage between income and fertility is typical for developed countries, where despite constant income growth TFR is continually decreasing, whereas in developing countries low income does not influence the fertility rate.
Figure 8: TFR classification tree with two-valued class, considering only economic attributes (78.2%).
In any case, the tree from Figure 8 is only 78.23% accurate, which is low in comparison with trees based on other attributes. This indicates that direct economic factors are not the main cause of the distinction between countries with low and countries with high fertility rates.
Analyzing the relation between educational factors and TFR resulted in the tree presented in Figure 9. A high percentage of enrolment at the primary educational level is in general related to a high fertility rate, whereas low TFR is more related to enrolment in secondary or tertiary education. As experts have observed before, higher education, especially of women, decreases TFR.
While developing countries have problems with too high TFR, developed countries, especially in Europe, have problems with low TFR. Mark Steyn, a conservative polemicist, argues that Europe is quickly becoming a barren, ageing, enfeebled place. In the decades after the Second World War, rich countries everywhere experienced similar trends. The bonds of traditional family life began to slacken, more women got jobs, and people sought enjoyment and satisfaction more and more through individual pursuits rather than in families. This social transformation, which is also occurring in America and East Asia, led to a demographic bonus (a bulge of people working) and to what might be called “the postponement of everything”. People left school later, left home later, married later, had children later, and also died later. Even though these interpretations are not uniformly accepted, they seem to be statistically quite well grounded.
Figure 9: TFR classification tree with two-valued class, considering only educational attributes (78.2%).
Having that in mind, the relevant question is: Why do some rich countries still have high TFR? In the following experiments we denoted 39 countries with high GDP as rich.
The tree in Figure 10 indicates that the exceptions to the low fertility rate have poor education and social systems. Further analyses showed that these countries rely on natural resources such as oil.
Figure 10: TFR classification tree with two-valued class and automatically selected attributes (78.2%).
Analyses of the obtained tree presented in Figure 11 revealed that countries with oil are rich and predominantly Muslim. The relation itself, however, can be read directly as follows: when Islam is the prevailing religion of the country, TFR is most likely to be high; otherwise, TFR decline is the more likely option. The results are consistent with the previously observed relation that TFR is higher in more conservative countries, which Islamic countries certainly are.
Figure 11: TFR classification tree with two-valued class, considering only social attributes (89.7%).
The newest studies of the Worldwatch Institute conclude that there is so much variability in fertility rates that we cannot know with any confidence how many people the future holds. Indeed, it seems reasonable that ΔTFR analyses are somewhat less relevant than those with TFR, since they measure the amount of change and not the resulting situation. Even so, our next attempt was to establish factors that might influence TFR growth and decline. In the next section a few of the most interesting and accurate trees are presented.
Again, literacy seems to be an important indicator of TFR trends (see Figures 12 and 13). Countries with a low percentage of literate inhabitants generally show an increase in TFR. Countries with a high percentage of literate citizens (above 97.9%) and a low unemployment rate (below 9.6%), on the other hand, show a decreasing TFR trend.
Figure 12: ΔTFR classification tree with three-valued class (81.1%).
Figure 13: Unpruned ΔTFR classification tree with three-valued class indicator (83.2%).
Similar conclusions can be drawn from the tree in Figure 14, for which the attributes were automatically selected. This tree has surprisingly high accuracy.
Figure 14: ΔTFR classification tree with three-valued class, automatically selected attributes (85.3%).
Considering only social attributes, the same tree as in the case of the TFR class appeared (see Figure 7), again exposing the importance of the conservative politics of the country for the TFR growth trend. Countries that do not allow abortion or adoption by homosexuals show a TFR growth trend, whereas countries that allow abortion and homosexuality show a TFR decline trend. Accuracy in this case is 78.32%, much lower than that of the tree presented in Figure 14.
In this case the criterion for dividing countries by their developmental status was FSI. A country was classified as well developed if its FSI was less than 39.45, resulting in 27 countries. Analyses were performed on the attributes merged separately into smaller groups.
Figure 15: ΔTFR classification tree with two-valued class (84.6%).
Race appeared to be an important factor in the TFR trend (see Figure 15). In nations with a prevalent Asian or combined race, TFR is likely to increase, while in countries with a majority of white race, TFR is declining. The nature of this genetic relation is not clear at this point.
Figure 16: ΔTFR classification tree with two-valued class, considering only economic attributes (84.6%).
We can see that economically highly developed countries, with GDP per capita above $10100, show a TFR decline. This finding disagrees, for example, with the Worldwatch Institute study noting that the fertility rate is rising in the United States. However, that study also contradicts the age-old dictum that rich countries do not have many babies. This time the tree based on economic attributes is quite accurate. Therefore, the ΔTFR analyses gave economic attributes more statistical relevance than the analyses with TFR did.
When only social attributes were selected, the accuracy of the ΔTFR classification trees dropped drastically (to 76.9%), which means that these factors are not good indicators of TFR trends.
Conclusion and discussion
In the fertility analyses, the data mining tools again proved their major asset: the constructed knowledge is in a transparent form, enabling human comprehension of relevant relations in complex forms. In this way, an interactive process is enabled between computers and humans, exploiting the best properties of the two most advanced information machines. Computers quickly and accurately examine vast search spaces, while humans draw conclusions and guide the search with their advanced cognitive skills.
To put the problem in perspective, note that the space of all potential hypotheses for 100 binary attributes and a single binary class contains $2^{2^{100}}$ elements. This number is far larger than the number of all atoms in our universe, which according to Wikipedia is around $10^{80}$, i.e. about $2^{266}$. Therefore, there is no way humans can analyze any meaningful share of all the hypotheses. But we can examine the results of one search, draw conclusions and redo the search with specific details changed. In this way humans can “mine” for relevant hypotheses.
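The comparison of magnitudes can be checked with a few lines of arithmetic. The sketch below compares base-2 logarithms rather than the numbers themselves, since 2^(2^100) is far too large to construct; it assumes a hypothesis is any assignment of one of two classes to each combination of the 100 binary attribute values, and takes 10^80 as the atom count quoted in the text.

```python
import math

n_attributes = 100

# Number of distinct combinations of 100 binary attribute values.
n_inputs = 2 ** n_attributes

# Each hypothesis labels every combination with one of two classes, so there
# are 2 ** n_inputs hypotheses; its base-2 logarithm is simply n_inputs.
log2_hypotheses = n_inputs

# Base-2 logarithm of 10 ** 80, the approximate number of atoms.
log2_atoms = 80 * math.log2(10)

print(f"log2(number of hypotheses) = 2^{n_attributes} = {log2_hypotheses}")
print(f"log2(number of atoms)      = {log2_atoms:.2f}")
```

Even the exponent of the hypothesis count exceeds the atom count's exponent by many orders of magnitude, which is why exhaustive examination is out of the question.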
Regarding the fertility relations, the DM tools enabled the rediscovery of major properties. The authors are not experts in the fertility or demographic field; therefore, verification of our conclusions by an expert and further analyses of interesting new patterns are a matter for further research.