Two useful Python tools, dimpy and tablefile, for data analysis applications

Python's 'list' array is more powerful than arrays in other languages such as C, C++, Fortran, or Java. However, in some cases it becomes tedious and complicated to construct a multidimensional 'list' type array in Python. A Python tool named 'dimpy' is discussed in this paper which can easily generate any multidimensional 'list' type array in Python. Another Python package, called tablefile, for reading and analysing column-wise data from a data-file is also discussed. How these two tools may be useful and reduce programming steps is demonstrated using some mathematics- and physics-related problems.


Introduction
Python has become popular among scientists for its simplicity and for the flexibility to enhance functionality by adding open-source packages to a program. It is new compared to other languages and still evolving. However, there are various tools contributed by Python developers in the PyPI repository that can be added to a Python program to enhance its performance. There are certain applications where the user experience may be improved by introducing new classes or functions within the program. For example, N-dimensional arrays with variable parameters are frequently required in various applications. An array in Python is in general defined by a 'list'. Other specific array types include 'tuple', 'set' and 'dict'. Although 'list' is a more powerful array type than its counterparts in other languages, owing to its flexibility to include variables of any type, it becomes a bit complicated to predefine a 'list' type array of larger dimension in Python without assigning an initial value to the elements. Those who are proficient in languages like Fortran, C, C++, Java, VB.NET, etc. may know how to create arrays of any number of dimensions of a given variable type (i.e. float, int, or str) in a more or less similar manner. If the number of dimensions or elements is very large, it may be problematic to give values to individual elements while declaring a list variable in Python. One may instead use the NumPy package (Harris et al. 2020, Oliphant et al. 2007, Cai et al. 2005), the SymPy package (Meurer et al. 2017), Awkward arrays in Python (Pivarski et al. 2020), or the array class of Python itself, all of which can create arrays similar to other languages, but none of these gives the output array in the form of the inbuilt 'list' class of Python, whereas 'list' type arrays may be advantageous in many cases (elaborated in the next section).
To achieve this goal and to make the 'list' type arrays of multiple dimensions in a way similar to arrays in other languages, the Python tool dimpy has been developed. Its installation, use, and application with examples have been discussed in this article.
Apart from the above, another useful tool is discussed in this paper that may be advantageous to those who work with tabulated data in a text file. It is often required to read data from a data-file in which data are presented in a tabulated format consisting of a set of columns. The columns are usually separated by a delimiter, which may be a comma, tab, blank space, or any other character. Users of the FORTRAN language can read such data in a very simple way because the language has been designed to do so. Although reading a file is simpler in Python than in other languages, there are still some issues that must be addressed for ease of application and to make it popular among scientists and researchers. With this aim, the Python tool tablefile was developed, which reads data from a data-file in a more convenient manner. In addition, it performs some elementary analytical tasks, such as averaging, summation, standard deviation, and the maximum and minimum values of the data columns, to assist data analysts in their work. This paper elaborates the advantages, working, and use of the above two open-source Python tools, with some applications showing how these tools may be advantageous for solving problems in science.

Scope of improvement
(i) Arrays in Python: The ways to create a real-valued array variable in different languages are summarized in Table-1. Other types (i.e. integer or string) of arrays can be created similarly. In Python, there are several ways to generate an array. Most of the methods require that each of the elements be given a value by the programmer when the array is created. The closest approach in Python which is analogous to the other languages shown in Table-1 is the numpy.empty() function, but it returns an array of type 'numpy.ndarray' rather than 'list'. Similarly, if during runtime of the program it is required to add an element to the array, it will not be possible with a 'numpy.ndarray' type variable, but it can be done with the append() function if the array type is 'list'. Another advantage of list arrays over 'numpy.ndarray' is that they can include elements of mixed data types. That is, an array consisting of some string elements, some integers, and some float variables is allowed in a list class variable, which is not the case for 'numpy.ndarray'.
While 'numpy.ndarray' may be advantageous in many cases, list type arrays may be preferable in many other applications. However, the individual elements of a list type variable must be given initial values when it is created. If the array has many dimensions, it becomes a tiresome task to give each initial input individually and to construct many complicated nested lists.
Hence, to create a list array in a manner similar to most programming languages and to the numpy.empty() function in Python, the dimpy package was developed. Its use and working are described in the subsequent section.
(ii) Reading data from a file: A basic FORTRAN code can read the data fields from some file 'FILE.TXT' in which the 1st and 3rd columns are read as float whereas the 2nd column is read as a string of maximum length 3. Such code is simple, but it leads to a runtime error if the data format in the file does not match what is being read (like that shown in Fig.1).

Fig.1. Screenshot of the file 'data.txt' shown in Notepad. The first line is a header, and below it values are separated by tabs.

Fig.1 shows an input data-file where the fields are separated by \t (i.e. tab). In between the numeric data there are strings too. In Python, it is easy to read such data-files as a list: the file can be opened and its lines read into a variable, say 'dat'. It may be noticed that 'dat' is then a 'list' in which each line is a member in the form of a string. Data analysts would like to split the fields and store them as float values in a more sequential format, so that line and column values can be easily accessed. That would require writing a few more lines of Python code to finally get the result. To simplify this task, the tablefile package was developed, which is described in the next section.
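The reading step just described can be sketched in plain Python (the file contents below are a synthetic stand-in for the Fig.1 data, not the actual values):

```python
# Write a synthetic stand-in for the tab-separated 'data.txt' of Fig.1:
with open("data.txt", "w") as f:
    f.write("Time\tLabel\tValue\n")
    f.write("1.0\tabc\t2.5\n")
    f.write("2.0\tdef\t3.5\n")

f = open("data.txt")
dat = f.readlines()   # each line of the file becomes one string element of the list
f.close()
print(dat)            # ['Time\tLabel\tValue\n', '1.0\tabc\t2.5\n', '2.0\tdef\t3.5\n']
```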

Installation and Use
(i) The dimpy package: To install, in cmd or terminal use

    $ python3 -m pip install dimpy

A 3-dimensional list array consisting of 2×3×2 elements can then be generated with a single call. The default value assigned to the elements is 0, which can be changed to anything (whether an integer, float, or string) by the use of the function dfv(). In principle, we can construct an equivalent list without the use of any package. But there is a catch: if we assign a value to a particular element, it is multiplied into each sub-element. It should be noted that the array type of A is 'list' and therefore it allows mixed variable types within it.
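The "catch" above is Python's list aliasing when a nested list is built with the repetition operator; a minimal plain-Python demonstration (dimpy itself is not needed here):

```python
# A 3x2 nested list built by repetition: all three rows are the SAME inner list
A = [[0, 0]] * 3
A[0][0] = 5
print(A)   # [[5, 0], [5, 0], [5, 0]] -- the assigned value appears in every sub-list

# An alias-free construction, which is what a call like dim(3, 2) is meant to provide:
B = [[0 for _ in range(2)] for _ in range(3)]
B[0][0] = 5
print(B)   # [[5, 0], [0, 0], [0, 0]]
```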
If one wishes to convert the type of A from 'list' to a new array C of type 'numpy.ndarray', one may use the function C=npary(A), provided the numpy package is already installed on the system. The point to note here is that, as one of the elements in A is a string, all the elements will be converted to string type in C. Further, as the intrinsic data type of C is now 'str', one cannot assign a 'float' or 'int' value to a particular element of C as we did in the case of A. In this case, the float or int value will be converted to 'str' and then stored in C. But remember that the converse is not true and will throw a ValueError if tried.
That means if C is of 'int' or 'float' type, then one cannot assign 'str' data to a given element, because 'str' to 'int' or 'float' conversion is not allowed.
(ii) The tablefile package: To install, from cmd or terminal enter

    $ python3 -m pip install tablefile

An example code to read column-wise data from 'data.txt' as shown in Fig.1:

    >>> f1=file("C:/.../data.txt", "\t")
    >>> lines=f1.read()

If the data-file separator is one or more blank spaces, then one need not specify it as the 2nd argument of the file() function. That is, in this case, we may write

    >>> f1=file("C:/.../data.txt")   # If column separator is a blank-space

It can be seen that the output 'lines' is a list that is already divided into lines and columns. The fields that cannot be converted to float remain strings, and lines starting with '#' are considered comment lines and are therefore skipped.
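The reading behaviour just described can be emulated in plain Python; the sketch below illustrates the described behaviour (comment skipping and per-field float conversion) and is not tablefile's actual implementation:

```python
def to_num(s):
    """Convert a field to float when possible; otherwise keep it as a string."""
    try:
        return float(s)
    except ValueError:
        return s.strip()

def read_table(path, sep=None):
    """Skip '#' comment lines and blank lines, split each line on sep
    (None means any whitespace), and convert fields to float where possible."""
    with open(path) as f:
        return [[to_num(x) for x in line.split(sep)]
                for line in f
                if line.strip() and not line.lstrip().startswith("#")]

# Demonstration on a small synthetic file:
with open("demo.txt", "w") as f:
    f.write("# header line\n1.0\tabc\t2.5\n2.0\tdef\t3.5\n")
print(read_table("demo.txt", "\t"))   # [[1.0, 'abc', 2.5], [2.0, 'def', 3.5]]
```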
While reading data from a table, we are sometimes interested in getting all the column values in a list element rather than lines. This can be done by

    >>> cols=f1.read("c/l")   # "c/l" says that the output should be in column/line format,
                              # i.e. the first index of 'cols' indicates the column and the second the row.

In addition to reading the data-file in sequential format, tablefile also provides some additional features, as shown below:

    >>> # Column-wise operations
    >>> average=f1.read("av")
    >>> sum=f1.read("sm")
    >>> std=f1.read("sd")     # Standard deviation for a population (i.e. large N)
    >>> stds=f1.read("sds")   # Standard deviation for a sample (i.e. small N)
    >>> max=f1.read("mx")
    >>> min=f1.read("mn")
    >>> print("Average=",average,"Sum=",sum,"Sigma_population=",std,"Sigma_sample=",stds,"Maximum=",max,"Minimum=",min)

Here, in each case above, the first element corresponds to the first column, the second element to the second column, and so on. All the strings in the columns that cannot be converted to 'float' are neglected during the calculation.
If we want to perform the above statistical calculations on a particular list, whether it is a line, a column, or any other list containing some numbers, we can do so using the tablefile functions. Note that in all such cases the string elements in List1 are neglected during calculation and do not throw any error. Python's inbuilt functions sum(), max(), and min(), and NumPy functions such as numpy.std() and numpy.average(), can do the same task, but all of these will throw errors due to the presence of string elements in the list.
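This string-tolerant behaviour can be emulated in plain Python; a minimal sketch of what functions such as av() and sd() are described as doing (not tablefile's actual code):

```python
import math

def numeric(values):
    """Keep only the entries convertible to float; string elements are skipped."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            pass
    return out

def av(values):
    """Average of the numeric entries."""
    nums = numeric(values)
    return sum(nums) / len(nums)

def sd(values):
    """Population standard deviation (i.e. large N) of the numeric entries."""
    nums = numeric(values)
    m = sum(nums) / len(nums)
    return math.sqrt(sum((x - m) ** 2 for x in nums) / len(nums))

List1 = [1.0, "bad", 2.0, 3.0, "entry"]
print(av(List1))   # 2.0 -- the two string elements were skipped, not raised as errors
```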

Some Physical Examples
In this section, some physical examples are shown which demonstrate how we can use the above packages to solve various problems of mathematics and physics.

(i) The Hessian matrix: In principle, a matrix with variable elements can be defined with the help of the Array function of the SymPy package, but it is not practically useful in cases like this one. Let us see why: suppose we have a function of five variables, so we would require a Hessian matrix of 5×5 = 25 elements.
Our logical approach would be first to define an array of 5×5 = 25 elements with dummy values and then assign to each of them the corresponding double derivative in two nested loops. A straightforward approach to construct the 5×5 Hessian matrix using the Array function would be to write out the whole array explicitly, but one would hardly prefer to follow this approach. Instead, the problem can be easily handled by the use of dimpy as follows:

    from sympy import *
    from math import pi
    from dimpy import *
    x1,x2,x3,x4,x5=symbols('x1,x2,x3,x4,x5')   # Defines the symbolic variables
    x=[x1,x2,x3,x4,x5]
    f=x1**2+x1*x2+cos(x1*x2)+x3+x4*x5**0.5     # Defines the function
    n=5                                        # Number of rows or columns
    H=dim(n,n)
    for i in range(n):
        for j in range(n):
            H[i][j]=diff(f,x[i],x[j])          # Second partial derivative w.r.t. x[i] and x[j]
    H=Matrix(H)                                # Convert the list to a SymPy Matrix

    print(det(H))           # Determinant of H
    print(H.eigenvals())    # Eigenvalues of H
    print(H.eigenvects())   # Eigenvectors of H
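Since dimpy may not be installed on every system, the construction just described can be sketched self-containedly by replacing dim(n, n) with an equivalent nested list:

```python
from sympy import symbols, cos, diff, Matrix

x1, x2, x3, x4, x5 = symbols('x1 x2 x3 x4 x5')
x = [x1, x2, x3, x4, x5]
f = x1**2 + x1*x2 + cos(x1*x2) + x3 + x4*x5**0.5   # the function of five variables
n = 5
# Alias-free n-by-n nested list, standing in for dimpy's dim(n, n):
H = [[0 for _ in range(n)] for _ in range(n)]
for i in range(n):
    for j in range(n):
        H[i][j] = diff(f, x[i], x[j])   # second partial derivative
H = Matrix(H)                           # SymPy Matrix for det/eigen operations
print(H.det())   # the x3 row is identically zero, so the determinant is 0
```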
(ii) Test for the Symplectic Condition (Canonical Transformation): In classical mechanics, we use the 'symplectic condition' to test whether a transformation of coordinates in phase space is canonical or not (Goldstein 1998).
The mathematical approach to test the canonicality of an n-particle system in phase space demands that the following condition be satisfied:

    M J M^T = J

where M is the Jacobian matrix of the transformation, M^T is the transpose matrix of M, and J is a 2n×2n antisymmetric matrix given by

    J = [ [0, I], [-I, 0] ]

where 0 is an n×n null matrix or zero matrix (i.e. a matrix whose elements are all zero) and I is an n×n unit matrix. For a two- or three-particle system the test is not difficult to carry out manually, but if the number of particles in the system is large we must do it programmatically. Let us solve a problem using a Python program for a two-particle system, which can be extended to a system of any number of particles with little modification.
Problem: Prove that the following transformation is canonical. When we run this program, we obtain the message "The symplectic condition is satisfied" at the output. Although the above program deals with a two-particle system, it can easily be used for a system of any number of particles by introducing the necessary parameters and equations in the first few lines.
The Array() function cannot be used in the above program for the same reason as in the Hessian matrix case, and therefore dimpy is the only option.
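A self-contained sketch of such a symplectic test, using the textbook one-particle transformation Q = p, P = -q as a stand-in (the paper's actual two-particle transformation is not reproduced here):

```python
from sympy import symbols, Matrix, eye, zeros, simplify

q, p = symbols('q p')
Q, P = p, -q          # illustrative canonical transformation (stand-in for the paper's problem)
n = 1                 # number of particles

# Jacobian matrix M of the transformation (q, p) -> (Q, P)
M = Matrix([[Q.diff(q), Q.diff(p)],
            [P.diff(q), P.diff(p)]])

# The 2n x 2n antisymmetric matrix J = [[0, I], [-I, 0]]
J = Matrix([[zeros(n), eye(n)],
            [-eye(n), zeros(n)]])

# Symplectic condition: M J M^T = J
if simplify(M * J * M.T - J) == zeros(2 * n):
    print("The symplectic condition is satisfied")
```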
(iii) Analysis of Astronomical data: The Hipparcos catalog (Perryman et al. 1997) is an astronomical database in the form of an ASCII table (i.e. it can be opened by any text editor, such as Notepad). It contains various astronomical data for 118218 stars, one star per line, tabulated in 77 columns called Fields (excluding Field 0). Deb and Chakraborty 2014 used a FORTRAN program to read 13 of these 77 columns in their work to identify stars with incorrect spectral classification. As a first step to investigating those wrongly classified stars further, the relevant values (for example, M_v, which can be read from the 4th column of Table-2 of Deb and Chakraborty 2014) need to be collected. To do the task programmatically, we first copy the data in Table-2 of Deb and Chakraborty 2014 to a text file and name it 'data_deb.txt'. The full Hipparcos catalog is available for download from http://cdsarc.u-strasbg.fr/ftp/cats/I/239/ in the form of an ASCII table with the file name 'hip_main.dat'. We will use Python code to extract and compare data from the above two files and finally construct the required list.
The task: First of all, we need to read columns 1, 4, and 8 from 'data_deb.txt'. Stars with presumably incorrect spectral classification are tagged 'Unchanged' in the 8th column. The corresponding datum in the 1st column is the Hipparcos identifier number. Our program can then search for this number in the H1 field of the Hipparcos catalog file 'hip_main.dat' and read the corresponding other fields to access the necessary data. We can then perform the necessary calculations and finally print the data at the output.
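The matching step described above can be sketched in plain Python with synthetic stand-ins for the two files (the identifiers and fields below are hypothetical, not actual catalog values):

```python
# Hypothetical stand-in for the relevant columns of 'data_deb.txt'
# (tab-separated: identifier ... tag):
deb_lines = [
    "101\t4.5\tUnchanged",
    "102\t3.2\tChanged",
    "103\t5.1\tUnchanged",
]
# Hypothetical stand-in for 'hip_main.dat' records ('|'-separated,
# with the identifier in the H1 field):
hip_lines = [
    "H|101|field_a",
    "H|103|field_b",
    "H|104|field_c",
]

# Step 1: collect the identifiers of stars tagged 'Unchanged'
wanted = {line.split("\t")[0] for line in deb_lines
          if line.split("\t")[-1] == "Unchanged"}

# Step 2: scan the catalog and keep the records whose H1 field matches
matches = [line.split("|") for line in hip_lines
           if line.split("|")[1] in wanted]
print(matches)   # [['H', '101', 'field_a'], ['H', '103', 'field_b']]
```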
Without adding any package to our program, the code begins as follows:

    f1=open("D:/data_deb.txt",'r+')
    data_deb=f1.readlines()
    f2=open("D:/hip_main.dat", "r+")
    data_hip=f2.readlines()

On the other hand, if we import the tablefile package into the code, the same task can be performed with the following code:

    from tablefile import *
    f1=file("D:/data_deb.txt",'\t')
    lines_deb=f1.read()        # reads 'data_deb.txt' in default line/column format
    f2=file("D:/hip_main.dat", "|")
    cols_hip=f2.read("c/l")    # reads 'hip_main.dat' in column/line format

If we compare the above two codes, it can be seen that the use of the tablefile package makes the program simpler as well as more concise. Both codes are properly commented so that readers can easily understand the steps. (See the last paragraph for file availability information.)

(iv) Analysis of Experimental Data: The temperature dependence of an ohmic conductor is given by

    R_T = R_o [1 + α (T - T_o)]

where R_T is the resistance at temperature T, R_o is the resistance at a reference temperature T_o (generally 0 °C), and α is the temperature coefficient of resistance of the material. The experimental determination of α requires temperature vs. resistance data over a wide range of temperatures. Fig.2 shows a computerized experimental setup for the determination of α. The Arduino microcontroller automatically collects data for the temperature of the oil bath (which is in thermal equilibrium with the resistance), the current through the circuit, and the voltage across the resistance R with the help of the respective sensors, and sends them to the computer through a USB connection, where they can be logged to a text file using serial-read software such as Terminal or PuTTY. Fig.3 shows a screenshot of a typical output data-set from the microcontroller, which is programmed to send 50 observations at intervals of 20 milliseconds, then wait 5 minutes for the temperature to change, and then repeat the process.
Here it can be seen that some entries are corrupted by serial-read errors and hence cannot be converted to floating-point numbers during calculation; these would lead to errors unless handled separately in the program. Complete removal of the lines containing errors is not advisable, because it would delete some correct data too. But if we use the inbuilt functions of the tablefile package, this limitation can be overcome without adding extra steps to the code. Two codes producing identical results, with and without importing tablefile, perform this task; the use of tablefile makes the code simpler and shorter. The output is shown in Fig.4.
Working and theory behind the codes: In the codes, we first read the four columns from the data-file and place them in a two-dimensional list array called 'cols', where the first index indicates the column number and the second indicates the serial number of the data. In the data-file, the 1st column gives the serial number of the observation, the 2nd column the temperature data, the 3rd the voltage data, and the 4th the current data; the corresponding data can be accessed by setting the first index of 'cols' equal to 0, 1, 2, and 3 respectively. We divide each column into groups of 50 data sets and then take their averages and sample standard deviations. This gives the average temperature (T_av), voltage (V_av), and current (I_av) at different observation times, and their corresponding sample standard deviations T_sds, V_sds, and I_sds respectively. The corresponding standard errors can be calculated from T_err = T_sds/√N, V_err = V_sds/√N, and I_err = I_sds/√N, where N is the sample size (50 in this case). Now the resistance R can be computed from R = V_av/I_av, and the uncertainty in R can be obtained from the well-known relation

    δR = R × (δV/V + δI/I)

where δV = V_err and δI = I_err are the standard errors in the V and I measurements respectively. The temperature and resistance data and their uncertainties are stored in four separate lists, which are used to plot a graph as shown in Fig.4. A least-squares fit with the help of the 'numpy.polyfit()' function is also shown in Fig.4. The (slope/intercept) ratio of this line gives the value of α.
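The error-tolerant averaging step described here can be sketched in plain Python (the data lines below are synthetic, with one entry corrupted the way a serial-read error would corrupt it):

```python
# Synthetic stand-in for the logged serial data: tab-separated columns are
# serial number, temperature, voltage, current.
raw = [
    "1\t25.0\t1.10\t0.50",
    "2\t25.1\t1.12\t0.51",
    "3\t2?.9\t1.08\t0.49",   # serial-read error: temperature is not a number
    "4\t25.2\t1.10\t0.50",
]

def safe_floats(fields):
    """Convert each field to float; keep None in place of unreadable entries."""
    out = []
    for s in fields:
        try:
            out.append(float(s))
        except ValueError:
            out.append(None)
    return out

rows = [safe_floats(line.split("\t")) for line in raw]
cols = list(zip(*rows))   # cols[1]: temperature, cols[2]: voltage, cols[3]: current

def av(col):
    """Average, silently skipping the unreadable (None) entries."""
    nums = [x for x in col if x is not None]
    return sum(nums) / len(nums)

T_av, V_av, I_av = av(cols[1]), av(cols[2]), av(cols[3])
R = V_av / I_av           # resistance from the averaged voltage and current
print(T_av, R)            # roughly 25.1 and 2.2 for this synthetic data
```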
Codes for all the four examples described in this section and the related files are available for download at https://github.com/DwaipayanDeb/dimpy-tablefileexamples.git

Discussion
This paper has discussed the limitations of array formation and of reading data from files in the present form of Python, and has introduced two new tools which eliminate these issues and improve the user experience for scientific calculation and analysis. The tablefile package simplifies reading tabulated data from a file with any kind of field separator, and some inbuilt functions are also provided to perform basic calculations such as summation, averaging, and standard deviations. On the other hand, the dimpy package can generate Python 'list' type arrays of any number of dimensions with any number of elements. Two physical examples in each case are given to show how these tools may simplify Python programming in physics.
With the use of dimpy we may greatly simplify and enhance calculations within arrays of two or more dimensions, especially for symbolic operations like differentiation and integration. These examples also demonstrate how calculus can be applied to matrix elements within nested loops, which would be a difficult or time-consuming process without the help of dimpy in a Python program. Although the given examples deal with two-dimensional arrays, a similar procedure may be applied to perform calculus operations on arrays of any number of dimensions (as in tensors). On the other hand, we see that reading data from huge files like the Hipparcos catalog or a file containing experimental data is simplified with the use of tablefile. Also, inbuilt functions like convert(), av(), sd(), etc. are more robust in the sense that they do not throw a runtime error if the input list argument contains one or more string elements.

Declarations:
Funding: No funding was received for this work

Conflict of interest/Competing interests: No conflict of interest/ competing interest applicable
Availability of data and material: Data reported in this work are available to be used by anyone without restrictions.