Part of Speech Tags Visualization

Big data is a well-known term these days. It broadly refers to data sets so large or complex that traditional data processing applications are inadequate.

One of the most important benefits of visualization is that it gives us access to huge amounts of data in an easily digestible visual form.

Recently I needed to train on a large tagged corpus for a retail domain. Building a corpus from scratch is not an easy task unless you find a way to expand the data set quickly. Whether you are training a model or just trying to understand the data better, the data needs to be visualized.

Sample corpus data set:

car_JJ seats_NNS ._.
croft_NN barrow_NN women_JJ 's_POS ._.
frozen_JJ dress_NN ._.
cold_JJ shoulder_NN ._.
vera_NN ._.
beard_JJ trimmer_NN ._.
paper_JJ towel_JJ holders_NNS ._.
sonoma_JJ cargo_JJ pants_NNS ._.
kettlebell_NN ._.
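Each token in these lines is a word_TAG pair, so a line can be broken into (word, tag) tuples with a simple split. A minimal sketch (the helper name `parse_line` is mine, not part of the project):

```python
def parse_line(line):
    """Split a tagged corpus line into (word, tag) tuples."""
    pairs = []
    for token in line.split():
        # rsplit guards against underscores inside the word itself
        word, tag = token.rsplit('_', 1)
        pairs.append((word, tag))
    return pairs

print(parse_line("croft_NN barrow_NN women_JJ 's_POS ._."))
# [('croft', 'NN'), ('barrow', 'NN'), ('women', 'JJ'), ("'s", 'POS'), ('.', '.')]
```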

My objective was to map the features together with their tagged labels/class names. This helps us clearly understand the list of features and their constraints with respect to information entropy.
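As a rough illustration of the entropy idea, the Shannon entropy of the tag distribution can be computed directly from the tag counts. This is only a sketch with made-up counts, not output from the actual corpus:

```python
from collections import Counter
from math import log2

def tag_entropy(tag_counts):
    """Shannon entropy (in bits) of a tag frequency distribution."""
    total = sum(tag_counts.values())
    return -sum((c / total) * log2(c / total) for c in tag_counts.values())

# Hypothetical tag counts, for illustration only
counts = Counter({'NN': 40, 'JJ': 30, 'NNS': 20, 'POS': 10})
print(round(tag_entropy(counts), 3))  # → 1.846
```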

There are many kinds of graphs, and each answers different questions. I chose a network graph to represent the data collection because it is the most suitable for this situation.

The first challenging task is to convert the corpus into graph-format data. Since my objective was to find the frequency of the features and show the corresponding tags for each feature, we can easily implement this as a MapReduce job. To implement it quickly, I used Python MapReduce. I will write a separate blog post explaining how to implement Python MapReduce with the help of Hadoop Streaming; for the moment, I'll share the mapper and reducer code.

#!/usr/bin/env python
__author__ = 'renienj'

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


# Add a default count to each tag
def pos_tag_count(data):
    for words in data:
        for word in words:
            # rsplit guards against underscores inside the word itself
            (value, pos) = word.rsplit('_', 1)
            print('%s\t%d' % (pos, 1))


# Add a default count to each tag pattern
def pos_tag_pattern_count(data):
    for words in data:
        pattern = []
        for word in words:
            (value, pos) = word.rsplit('_', 1)
            pattern.append(pos)
        print('%s\t%d' % (" ".join(pattern), 1))


# Add a default count to each word
def word_count(data):
    for words in data:
        for word in words:
            (value, pos) = word.rsplit('_', 1)
            print('%s\t%d' % (value, 1))


# Add a default count to each word pattern
def word_pattern_count(data):
    for words in data:
        pattern = []
        for word in words:
            (value, pos) = word.rsplit('_', 1)
            pattern.append(value)
        print('%s\t%d' % (" ".join(pattern), 1))


def main(argv):
    data = read_input(sys.stdin)
    if argv[1] == "tag-pattern":
        pos_tag_pattern_count(data)
    elif argv[1] == "tag":
        pos_tag_count(data)
    elif argv[1] == "word-pattern":
        word_pattern_count(data)
    elif argv[1] == "word":
        word_count(data)


if __name__ == "__main__":
    main(sys.argv)
(mapper.py)
#!/usr/bin/env python
__author__ = 'renienj'

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
(reducer.py)

To execute the Hadoop job, type the following command in a terminal:

cat sample-corpus.txt | python featureFrequency/mapper.py tag-pattern | sort -k1,1 | python featureFrequency/reducer.py | tee results.txt

Processed corpus data set:

DT NN .	1
DT NN VBG NNS JJ NN NN CD .	1
FW .	1
IN .	8
JJ .	11
JJ : NN .	1
JJ : NNS .	1
JJ CC JJ NNS .	1
JJ CD IN JJ JJ JJ JJ JJ JJ NN .	1
JJ CD JJ JJ NN NN .	1
JJ CD JJ NN .	1
JJ CD NN .	1
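The same map-sort-reduce flow can be checked locally without Hadoop by chaining the steps in plain Python. This is only a sketch of what Hadoop Streaming does behind the scenes; the function names here are mine:

```python
from collections import Counter

def map_tag_patterns(lines):
    """Mapper step: emit one (tag pattern, 1) pair per corpus line."""
    for line in lines:
        tags = [token.rsplit('_', 1)[1] for token in line.split()]
        yield (' '.join(tags), 1)

def reduce_counts(pairs):
    """Reducer step: sum counts per key (Counter replaces sort + group)."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

corpus = ["frozen_JJ dress_NN ._.", "cold_JJ shoulder_NN ._.", "vera_NN ._."]
print(reduce_counts(map_tag_patterns(corpus)))
# {'JJ NN .': 2, 'NN .': 1}
```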

After evaluating a few graph libraries, I picked 'vis.js'. Its network graph is very easy to work with and highly customizable (Figure 1).

<!doctype html>
<html>
<head>
    <title>Network | Basic usage</title>
    <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.0.0-beta1/jquery.min.js"></script>
    <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/vis/4.15.0/vis.js"></script>
    <script type="text/javascript" src="pos-pattern.js"></script>
    <link href="https://cdnjs.cloudflare.com/ajax/libs/vis/4.15.0/vis.css" rel="stylesheet" type="text/css" />
    <style type="text/css">
        #mynetwork {
            width: 100%;
            height: 900px;
            border: 1px solid lightgray;
        }
    </style>
</head>
<body>
    <div>
        Select the logfile.txt text file:
        <input type="file" id="fileInput" value="/retail_postagger/logfile.txt">
    </div>
    <!-- used by pos-pattern.js to report unsupported files -->
    <div id="fileDisplayArea"></div>
    <p>
        Network graph with weightage | Part of Speech Tagging
    </p>
    <div id="mynetwork"></div>
</body>
</html>
(index.html)
$(document).ready(function() {
    readFile();
});

/*
 This function builds the edge list for vis.js
 from the nodes and their connection links.
*/
var createNodeEdgeDataSet = function(edgeList, nodesList) {
    // Edges for vis.js
    edgesDataVis = [];
    for (var key in edgeList) {
        var wFreq = edgeList[key];
        wFreq.nGram.forEach(function(element, indexInner) {
            // Skip single nodes: only link an n-gram to its successor
            if (wFreq.nGram.length > indexInner + 1) {
                edgesDataVis.push({
                    from: nodesList.indexOf(element),
                    to: nodesList.indexOf(wFreq.nGram[indexInner + 1])
                });
            }
        });
    }
    return edgesDataVis;
}

/*
 This function creates the node dataset for vis.js.
*/
var createNodeDataSet = function(nodesList) {
    nodesDataVis = [];
    nodesList[1].forEach(function(element, index) {
        var customLabel = "";
        if (element in nodesList[0]) { // Customize the label with the frequency
            customLabel = element + ":" + nodesList[0][element].frequency;
        } else {
            customLabel = element;
        }
        nodesDataVis.push({id: index, label: customLabel});
    });
    return nodesDataVis;
}

/*
 Create cumulative n-grams from a pattern. These n-grams
 generate the network links between the nodes.
*/
var nGram = function(data) {
    tokens = data.split(" ");
    posPattern = [];
    for (var i = 0; i < tokens.length; i++) {
        posPattern.push(tokens.slice(0, i + 1).join(' '));
    }
    return posPattern;
}

/*
 This function converts the raw data into a structured
 data collection. It returns two values:
 1) WordFrequency object list (networkConnectionList)
 2) Unique node list
*/
var createNetWorkData = function(data) {
    // Get a unique list
    Array.prototype.getUnique = function() {
        var u = {}, a = [];
        for (var i = 0, l = this.length; i < l; ++i) {
            if (u.hasOwnProperty(this[i])) {
                continue;
            }
            a.push(this[i]);
            u[this[i]] = 1;
        }
        return a;
    }
    // Variables that hold the return values
    networkConnectionList = {};
    nodesList = [];
    dataList = data.split(/\n/);
    for (var i = 0; i < dataList.length; i++) {
        patternFrequency = dataList[i].split(/\t/);
        patternFrequency[0] = patternFrequency[0].replace(" .", "");
        var nodes = nGram(patternFrequency[0]);
        nodesList = nodesList.concat(nodes);
        networkConnectionList[patternFrequency[0]] = new WordFrequency(
            patternFrequency[0],
            patternFrequency[1], nodes);
    }
    return [networkConnectionList, nodesList.getUnique()];
}

// Read the local file
var readFile = function() {
    var fileInput = document.getElementById('fileInput');
    var fileDisplayArea = document.getElementById('fileDisplayArea');
    fileInput.addEventListener('change', function(e) {
        var file = fileInput.files[0];
        var textType = /text.*/;
        if (file.type.match(textType)) {
            var reader = new FileReader();
            // On file load, process the data and draw the graph
            reader.onload = function(e) {
                var networkList = createNetWorkData(reader.result);
                var nodes = createNodeDataSet(networkList);
                var edges = createNodeEdgeDataSet(networkList[0], networkList[1]);
                drawGraph(nodes, edges);
            }
            reader.readAsText(file);
        } else {
            fileDisplayArea.innerText = "File not supported!";
        }
    });
}

// vis.js draw function
var drawGraph = function(nodesDataVis, edgesDataVis) {
    // create a network
    var container = document.getElementById('mynetwork');
    var data = {
        nodes: nodesDataVis,
        edges: edgesDataVis
    };
    var options = {
        edges: {
            smooth: true,
            arrows: {to: true}
        }
    };
    var network = new vis.Network(container, data, options);
}

// WordFrequency class
function WordFrequency(pattern, frq, ngram) {
    this.exactPattern = pattern;
    this.frequency = frq;
    this.nGram = ngram;
}
(pos-pattern.js)
Figure 1 - Tags represented as a network graph

I hope this gives you an idea of the process of visualizing data. I used POS tag data to explain the flow, but the same approach can be applied to any data collection.