The following code was developed as part of my Master’s dissertation on data mining.
I would reiterate the fact that when it comes to parsing, extraction, finding patterns in my opinion Python takes the crown. This program uses python 2.7 because python 3 i found it to be incompatible with some of the libraries of natural language text processing.
As you can see above, writing code for mining data in python is so much easier as compared to Java. Perhaps, that could be the reason that why the industry stalwarts like Google are extensively using python as an interface for designing their robust search engine.
# This program read an excel file and extracts the data from it and encodes it in XML format import xlrd path='c:\python27\CCAgent_data.xls'# Open the XLS file as a "workbook" and select the first sheet as our source. # We'll get our tag names from the sheet's first row contents. workbook = xlrd.open_workbook(path)#sheet = workbook.sheet_by_index(0) sheet=workbook.sheet_by_name('agents') tags = [n.replace(" ", "").lower() for n in sheet.row_values(0)]# This is going to come out as a string, which will write to a file in the end. result = "\n\n"# Now, we'll just create a string that looks like an XML node for each row # in the sheet. Of course, a lot of things will depend on the prescribed XML # format but since we have no idea what it is, we'll just do this: for row in range(1, sheet.nrows): result += " \n" for i in range(len(tags)): tag = tags[i].encode("utf-8") val = sheet.row_values(row)[i].encode("utf-8") result += " " % (tag, val, tag) print val print '\n' result += "\n" print "Total number of records=",row # Close our pseudo-XML string. result += ""# Write the result string to a file using the standard I/O. storeResult=open("c:\python27\myfile.xml", "w") storeResult.write(result) storeResult.close()