Load Data from Excel Format into Database
August 24, 2009
Task: Take as input name of a directory on a Windows machine containing Microsoft Excel (xls) documents. Convert them to csv files. Process these csv files to normalize them into a single, standard csv format. Load these files into a PostgreSQL database. Tools available: Windows, Linux (Debian), Python, and PostgreSQL. Result: success (as described below).
Mount Windows Share
It is very easy to mount a Windows shared folder under Debian. Hat tip to Ubuntu Linux Quick Tip – Mount a Samba (Windows) file share to a folder and How to mount remote windows partition (windows share) under Linux. First, install smbfs.
sudo aptitude install smbfs
Create a directory where you want to mount it. Initially, I had chosen /mnt/ but reading up on Linux File System Hierarchy, the proper place is /media instead of /mnt. So I have modified the code.
sudo mkdir /mnt/ntserver
sudo mkdir /media/ntserver
Actually mount the share using the following command. It’s a single command but the formatting may make it appear as if it’s on two lines.
sudo mount -t smbfs //192.168.0.111/shared/ /mnt/ntserver -o username=user,password=password
sudo mount -t smbfs //192.168.0.111/shared/ /media/ntserver -o username=user,password=password
Now you can easily browse and use the mounted share just like any directory. You do not need to mount it every time. Just edit your /etc/fstab so upon every reboot the share is mounted automatically. At the end of my /etc/fstab file, I added the following (hat tip to Permanent mount – fstab):
//192.168.0.111/shared /media/ntserver cifs username=user,password=password,rw,mand 0 2
Then I ran the following command:
sudo mount -a
Now with every reboot the shared directory was mounted automatically.
Convert XLS to CSV
Linux has a command line solution to it, called xls2csv. To install xls2csv in Debian, run the following command.
sudo aptitude install catdoc
Now you can use the command to convert any xls file to csv. I like to do the following, using ! as delimiter because the data I am working with usually has a list, separated by comma, in one column. To differentiate between columns and the list, I use another delimiter. With this tweak, I choose to use no quotes because it would only make things difficult by having to remove quotes myself before loading data.
xls2csv -x -q0 -c! myfile.xls
Since xls in all it’s wisdom can cause date and time problems when converting to csv, Python may need to be used. xlrd is a package for Python to extract Excel data. To install it in Debian, run the following.
sudo aptitude install python-xlrd
Normalize CSV Files
Normalize the resulting csv files, with their different formats, by using standard commands, like cut, sort, uniq, etc. This normalization should be done based on the table structure into which data has to be loaded in the next step.
Load Data into Database
You can use the copy command in PostgreSQL to load csv files into it.
Update 2009-08-25: Added Python to the tools because of problems with Excel (xls) date time format.