Zusammenhang Anzahl Splits zu Topic Zuweisung

Die abgegebenen NDC’s variieren stark in der Länge. Durch das durchgeführte Aufteilen der Dokumente in einzelne Abschnitte gibt es nun einige Länder, welche eine große Anzahl an Abschnitten aufweisen und andere Länder die lediglich durch einen einzelnen Abschnitt repräsentiert werden. Es gilt nun zu untersuchen ob die Anzahl der Splits maßgeblich für die Zuweisung zu einem Topic ist. Hierfür wird ein Ergebnis einer Topic Model Analyse aus dem iLCM benötigt und im Ordner abgelegt.

path<-"77_NDC PP nur EU_2020-03-31 13:46:38/"

load(paste0(path,"data_TM.RData"))
load(paste0(path,"meta_TM.RData"))

theta<-theta[,order(as.numeric(colnames(theta)))]
data<-cbind(meta,theta)

Anzahl an Splits

Wenn statt den 184 Ländern nur 157 angezeigt werden, liegt es daran, dass in diesem Topic Model nur die EU als einzelne Partei mit in die Berechnung einbezogen wurde. Die Beiträge der Einzelländer fehlen.

splits<-data.frame(table(data$mde1),stringsAsFactors = F)
splits[order(splits$Freq),]

##                                      Var1 Freq
## 7                                 Armenia    1
## 72                            Kazakhstan     1
## 105                                  Oman    1
## 123                            Seychelles    1
## 136                            Tajikistan    1
## 148                               Ukraine    1
## 44                     Dominican Republic    2
## 88                                 Mexico    2
## 92                             Montenegro    2
## 122                                Serbia    2
## 127                               Somalia    2
## 150              United States of America    2
## 152                           Uzbekistan     2
## 157                              Zimbabwe    2
## 4                                 Andorra    3
## 6                               Argentina    3
## 9                              Azerbaijan    3
## 19                 Bosnia and Herzegovina    3
## 35                           Cook Islands    3
## 39                                Czechia    3
## 40  Democratic People's Republic of Korea    3
## 50                               Eswatini    3
## 52                         European Union    3
## 64                               Honduras    3
## 78                                Liberia    3
## 82                               Malaysia    3
## 83                               Maldives    3
## 91                               Mongolia    3
## 94                            Mozambique     3
## 98                                  Nepal    3
## 99                            New Zealand    3
## 104                                Norway    3
## 107                                 Palau    3
## 110                              Paraguay    3
## 114                  Republic of Moldova     3
## 115                               Rwanda     3
## 120                 Sao Tome and Principe    3
## 125                             Singapore    3
## 132                                 Sudan    3
## 133                              Suriname    3
## 137                             Thailand     3
## 145                          Turkmenistan    3
## 146                                Tuvalu    3
## 154                             Venezuela    3
## 155                              Viet Nam    3
## 1                             Afghanistan    4
## 8                               Australia    4
## 14                                Belarus    4
## 17                                 Bhutan    4
## 23                                Burundi    4
## 31                                  China    4
## 34                                  Congo    4
## 41       Democratic Republic of the Congo    4
## 43                               Dominica    4
## 51                               Ethiopia    4
## 56                                Georgia    4
## 58                                Grenada    4
## 61                         Guinea Bissau     4
## 65                                Iceland    4
## 70                                  Japan    4
## 74                              Kiribati     4
## 80                            Madagascar     4
## 93                                Morocco    4
## 109                      Papua New Guinea    4
## 113                     Republic of Korea    4
## 116                 Saint Kitts and Nevis    4
## 117                           Saint Lucia    4
## 119                            San Marino    4
## 138       The Republic of North Macedonia    4
## 143                   Trinidad and Tobago    4
## 144                              Tunisia     4
## 2                                 Albania    5
## 11                                Bahrain    5
## 24                            Cabo Verde     5
## 26                               Cameroon    5
## 28               Central African Republic    5
## 37                          Côte d'Ivoire    5
## 38                                   Cuba    5
## 42                               Djibouti    5
## 45                                Ecuador    5
## 47                            El Salvador    5
## 53                                   Fiji    5
## 62                                 Guyana    5
## 69                                Jamaica    5
## 71                                 Jordan    5
## 73                                  Kenya    5
## 87                              Mauritius    5
## 97                                  Nauru    5
## 111                                  Peru    5
## 118                                 Samoa    5
## 156                               Zambia     5
## 3                                 Algeria    6
## 12                             Bangladesh    6
## 13                               Barbados    6
## 18       Bolivia (Plurinational State of)    6
## 20                              Botswana     6
## 29                                   Chad    6
## 32                               Colombia    6
## 46                                  Egypt    6
## 59                              Guatemala    6
## 76      Lao People's Democratic Republic     6
## 89                             Micronesia    6
## 90                                 Monaco    6
## 95                                Myanmar    6
## 131                    State of Palestine    6
## 149                  United Arab Emirates    6
## 151                               Uruguay    6
## 153                              Vanuatu     6
## 25                              Cambodia     7
## 27                                 Canada    7
## 30                                  Chile    7
## 48                     Equatorial Guinea     7
## 60                                Guinea     7
## 68                                Israel     7
## 75                                 Kuwait    7
## 79                          Liechtenstein    7
## 81                                Malawi     7
## 86                             Mauritania    7
## 101                                 Niger    7
## 106                             Pakistan     7
## 130         St Vincent and the Grenadines    7
## 135                Syrian Arabic Republic    7
## 139       The United Republic of Tanzania    7
## 5                     Antigua and Barbuda    8
## 10                                Bahamas    8
## 33                                Comoros    8
## 55                                 Gambia    8
## 57                                  Ghana    8
## 66                                  India    8
## 84                                   Mali    8
## 96                               Namibia     8
## 100                            Nicaragua     8
## 112                                 Qatar    8
## 124                          Sierra Leone    8
## 128                          South Africa    8
## 129                             Sri Lanka    8
## 140                           Timor-Leste    8
## 141                                  Togo    8
## 147                               Uganda     8
## 15                                 Belize    9
## 21                                 Brazil    9
## 36                             Costa Rica    9
## 49                                Eritrea    9
## 77                                Lesotho    9
## 102                               Nigeria    9
## 108                                Panama    9
## 121                          Saudi Arabia    9
## 126                       Solomon Islands    9
## 134                           Switzerland    9
## 16                                  Benin   10
## 22                           Burkina Faso   10
## 54                                  Gabon   10
## 63                                  Haiti   10
## 103                                  Niue   10
## 67                              Indonesia   11
## 85                       Marshall Islands   14
## 142                                 Tonga   14

splits<-splits[order(splits$Freq),]
labels<-as.character(paste(splits$Freq,splits$Var1,sep=""))

Farben festlegen

color = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)]
colors<-sample(x = color,size = ncol(theta))

Topiczugehörigkeit

Jeder Abschnitt weist eine Wahrscheinlichkeitsverteilung über alle 15 Topcis auf.

indiv_topic_data<-theta[1,]


data_pie<-data.frame(prob=indiv_topic_data, names=names(indiv_topic_data))
p<-plotly::plot_ly(data = data_pie,values=~prob, labels=~names,type="pie",textposition = 'inside',
                   textinfo = 'label+percent',
                   insidetextfont = list(color = '#FFFFFF'),
                   hoverinfo = 'text',
                   marker = list(colors = colors,
                                 line = list(color = '#FFFFFF', width = 1)),
                   #The 'pull' attribute can also be used to create space between the sectors
                   showlegend = TRUE)
p

1. Korrelation der Länder mit je nur einem Abschnitt

single_split_countries<- as.character(splits[which(splits$Freq<2),"Var1"])

data_single_split_countries<-theta[which(data$mde1%in%single_split_countries),]

# Korrelations Heatmap
correlations<-matrix(c(0),length(single_split_countries),length(single_split_countries))
colnames(correlations)<-single_split_countries
rownames(correlations)<-single_split_countries
for(i in 1:length(single_split_countries)){
  for(j in 1:length(single_split_countries)){
    correlations[i,j]<-cor(x = data_single_split_countries[i,],y = data_single_split_countries[j,],method = "pearson")
  }
}
# correlation mit sich selbst auf 0 setzen
diag(correlations)<-0

heatmap<-plotly::plot_ly(z=correlations,type="heatmap",x=single_split_countries,y=single_split_countries)
heatmap

subplots<-plotly::plot_ly()
row<-0
column<-0
for(k in 1:length(single_split_countries)){
  print(paste("row:",row,"  column:",column))
  data_pie<-data.frame(prob=data_single_split_countries[k,], names=colnames(theta))
  subplots<-plotly::add_pie(subplots,data = data_pie,values=~prob, labels=~names,
                            name=single_split_countries[k],
                            textposition="inside",
                            marker = list(colors = colors,
                                          line = list(color = '#FFFFFF', width = 1)),
                            #The 'pull' attribute can also be used to create space between the sectors
                            domain = list(row = row, column = column))
  if(k%%2==0){
    row=row+1
  }
  column<-(k%%2)
}

## [1] "row: 0   column: 0"
## [1] "row: 0   column: 1"
## [1] "row: 1   column: 0"
## [1] "row: 1   column: 1"
## [1] "row: 2   column: 0"
## [1] "row: 2   column: 1"

library(magrittr)
subplots <- plotly::layout(subplots,title = "Pie Charts for Countries with only 1 Split", showlegend = F,
                                grid=list(rows=row, columns=2),
                                xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                                yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
subplots

2. Korrelation aller Länder

Dazu werden für Länder mit mehr als einem Abschnitt die einzelnen Wahrscheinlichkeitsverteilungen aggregiert. Zum Aggregieren wird der gewichtete Mittelwert auf Basis der Tokens pro Abschnitt genutzt. Die Anordnung der Elemente basiert auf der Anzahl an Splits pro Land.

# Länder mit mehr als einem split aggregieren
sort_countries<-as.character(splits[order(splits$Freq),"Var1"])

data_all_splits<-matrix(c(0),length(sort_countries),ncol(theta))
for(l in 1:length(sort_countries)){
  tokens<-data$token[which(data$mde1==sort_countries[l])]
  #normalize
  tokens<-tokens/max(tokens)
  token_weights<-tokens/sum(tokens)
  lokal_theta<-theta[which(data$mde1==sort_countries[l]),]
  # add weights
  lokal_theta<-lokal_theta*token_weights
  if(!is.null(nrow(lokal_theta))){
    lokal_theta<-colSums(lokal_theta)
  }
  lokal_theta<-lokal_theta/sum(lokal_theta)
  data_all_splits[l,]<-lokal_theta
}
rownames(data_all_splits)<-sort_countries

# Korrelations Heatmap
correlations_all<-matrix(c(0),length(sort_countries),length(sort_countries))
colnames(correlations_all)<-labels
rownames(correlations_all)<-labels
for(i in 1:length(sort_countries)){
  for(j in 1:length(sort_countries)){
    correlations_all[i,j]<-cor(x = data_all_splits[i,],y = data_all_splits[j,],method = "pearson")
  }
}
# correlation mit sich selbst auf 0 setzen
diag(correlations_all)<-0


heatmap_all<-plotly::plot_ly(z=correlations_all,type="heatmap",x=sort_countries,y=sort_countries, colors= colorRamp(c("red", "green")))
heatmap_all

Cosinus Ähnlichkeit aller Länder

# Cosinus similarity Heatmap
correlations_all_cosine<-matrix(c(0),length(sort_countries),length(sort_countries))
colnames(correlations_all_cosine)<-labels
rownames(correlations_all_cosine)<-labels
for(i in 1:length(sort_countries)){
  for(j in 1:length(sort_countries)){
    correlations_all_cosine[i,j]<-lsa::cosine(x = data_all_splits[i,],y = data_all_splits[j,])
  }
}
# correlation mit sich selbst auf 0 setzen
diag(correlations_all_cosine)<-0


heatmap_all<-plotly::plot_ly(z=correlations_all_cosine,type="heatmap",x=sort_countries,y=sort_countries, colors= colorRamp(c("red", "green")))
heatmap_all

Euklidischer Abstand aller Länder

# euklidischer Abstand Heatmap
euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))

correlations_all_euclidean<-matrix(c(0),length(sort_countries),length(sort_countries))
colnames(correlations_all_euclidean)<-labels
rownames(correlations_all_euclidean)<-labels
for(i in 1:length(sort_countries)){
  for(j in 1:length(sort_countries)){
    correlations_all_euclidean[i,j]<-euc.dist(x1 = data_all_splits[i,],x2 = data_all_splits[j,])
  }
}
# correlation mit sich selbst auf 0 setzen
diag(correlations_all_euclidean)<-0


heatmap_all<-plotly::plot_ly(z=correlations_all_euclidean,type="heatmap",x=sort_countries,y=sort_countries, colors= colorRamp(c("red", "green")))
heatmap_all

Kein Muster dafür zu erkennen, dass die Anzahl der Splits ursächlich für die Topiczuweisung ist.