Code Buster: 2014

Monday, August 25, 2014

Neural network for regression (example)

Credit scoring is the practice of analysing a person's background and credit application in order to assess that person's creditworthiness. There are numerous approaches to this analysis. In the end it basically comes down to first selecting the correct independent variables (e.g. income, age, gender) that lead to a given level of creditworthiness. In other words: creditworthiness = f(income, age, gender, …). A credit scoring system can be represented by linear regression, logistic regression, machine learning or a combination of these. Neural networks are situated in the domain of machine learning. The following is a strongly simplified example. The actual procedure of building a credit scoring system is much more complex, and the resulting model will most likely not consist solely of a neural network, and may not include one at all.

If you’re unsure what exactly a neural network is, I find this a good place to start.

For this example the R package neuralnet is used. For a more in-depth view of the exact workings of the package, see neuralnet: Training of Neural Networks by F. Günther and S. Fritsch.

First, let's load the package and an example dataset.

set.seed(1234567890)

library("neuralnet")

dataset <- read.csv("creditset.csv")
head(dataset)

##   clientid income   age   loan       LTI default10yr
## 1        1  66156 59.02 8106.5 0.1225368           0
## 2        2  34415 48.12 6564.7 0.1907516           0
## 3        3  57317 63.11 8021.0 0.1399398           0
## 4        4  42710 45.75 6103.6 0.1429105           0
## 5        5  66953 18.58 8770.1 0.1309895           1
## 6        6  24904 57.47   15.5 0.0006223           0

The dataset contains information on different clients who received a loan at least 10 years ago. The variables income (yearly), age, loan (size in euros) and LTI (the loan to yearly income ratio) are available. Our goal is to devise a model which predicts, based on the input variables LTI and age, whether or not a default will occur within 10 years.
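As a quick sanity check on the LTI definition, the ratio can be recomputed from the loan and income columns. The sketch below is self-contained and uses only the six rows printed by head(dataset); the data frame name sample_rows is made up for illustration. Because the printed income and loan values are rounded, only approximate agreement is expected.

```r
## first six rows, copied from the head(dataset) output above
sample_rows <- data.frame(
  income = c(66156, 34415, 57317, 42710, 66953, 24904),
  loan   = c(8106.5, 6564.7, 8021.0, 6103.6, 8770.1, 15.5),
  LTI    = c(0.1225368, 0.1907516, 0.1399398, 0.1429105, 0.1309895, 0.0006223)
)

## recompute the ratio from the printed (rounded) loan and income values
sample_rows$LTI_check <- sample_rows$loan / sample_rows$income

## largest absolute gap between the stored and the recomputed ratio
max_gap <- max(abs(sample_rows$LTI - sample_rows$LTI_check))
print(sample_rows)
```

On the full dataset the same check would be dataset$loan / dataset$income compared against dataset$LTI.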

The dataset will be split into a subset used for training the neural network and another set used for testing. As the ordering of the dataset is completely random, we do not have to extract random rows and can just take the first x rows (here, the first 800).

## extract a set to train the NN
trainset <- dataset[1:800, ]

## select the test set
testset <- dataset[801:2000, ]
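Taking the first 800 rows only works because the rows are already in random order. If that were not the case, one possible approach is to shuffle the row indices first. The sketch below uses a made-up stand-in data frame (toy_data) rather than creditset.csv, so it runs on its own; all names in it are hypothetical.

```r
## stand-in data frame with the same number of rows as the example dataset;
## toy_data and its columns are made up for illustration
set.seed(42)
toy_data <- data.frame(id = 1:2000, x = runif(2000))

## shuffle the row indices, then cut at row 800
shuffled  <- sample(nrow(toy_data))
train_idx <- shuffled[1:800]
test_idx  <- shuffled[801:2000]

## the two subsets share no rows and together cover the whole data frame
toy_train <- toy_data[train_idx, ]
toy_test  <- toy_data[test_idx, ]
```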

Now we’ll build a neural network with 4 hidden nodes (a neural network is comprised of input, hidden and output nodes). The number of nodes is chosen here without a clear method, however there are some rules of thumb. The lifesign option refers to the verbosity. The output is not linear (linear.output = FALSE, so the activation function is also applied to the output node), and we use a threshold value of 0.1, which neuralnet applies to the partial derivatives of the error function as its stopping criterion. The neuralnet package uses resilient backpropagation with weight backtracking as its standard algorithm.

## build the neural network (NN)
creditnet <- neuralnet(default10yr ~ LTI + age, trainset, hidden = 4, lifesign = "minimal", 
    linear.output = FALSE, threshold = 0.1)

## hidden: 4    thresh: 0.1    rep: 1/1    steps:    7266   error: 0.79202  time: 9.32 secs
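To make the role of the 4 hidden nodes more concrete, here is a minimal base R sketch of the forward pass of a network with 2 inputs, 4 hidden nodes and 1 output, using the logistic activation on both layers. The weights are random stand-ins, not the ones found by the training run above, and the weight layout (bias in the first row) is an assumption made for this illustration.

```r
## logistic activation function
logistic <- function(z) 1 / (1 + exp(-z))

## forward pass of a 2-4-1 network
## W_hidden: 3 x 4 matrix (bias row + one row per input)
## w_out:    vector of length 5 (bias + one weight per hidden node)
forward <- function(inputs, W_hidden, w_out) {
  x <- cbind(1, as.matrix(inputs))          # prepend the bias column
  hidden <- logistic(x %*% W_hidden)        # n x 4 hidden activations
  logistic(cbind(1, hidden) %*% w_out)      # n x 1 outputs between 0 and 1
}

## made-up weights, NOT the ones found by the training run above
set.seed(1)
W_hidden <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)
w_out    <- rnorm(5)

## three hypothetical clients
toy_inputs <- data.frame(LTI = c(0.02, 0.14, 0.16), age = c(25.9, 40.8, 53.2))
toy_out <- forward(toy_inputs, W_hidden, w_out)
print(toy_out)
```

Training amounts to adjusting W_hidden and w_out until the outputs are close to the observed default10yr values.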

The neuralnet package can also visualize the generated model and show the weights it found.

## plot the NN
plot(creditnet, rep = "best")

Once we’ve trained the neural network we are ready to test it. We use the testset subset for this. The compute function is applied for computing the outputs based on the LTI and age inputs from the testset.

## test the resulting output
temp_test <- subset(testset, select = c("LTI", "age"))

creditnet.results <- compute(creditnet, temp_test)

The temp_test dataset contains only the columns LTI and age of the testset. Only these variables are used for input. The set looks as follows:

head(temp_test)

##               LTI         age
## 801 0.02306808811 25.90644520
## 802 0.13729704954 40.77430558
## 803 0.10456984914 32.47350580
## 804 0.15985046411 53.22813215
## 805 0.11161429579 46.47915325
## 806 0.11489364221 47.12736998

Let’s have a look at what the neural network produced:

results <- data.frame(actual = testset$default10yr, prediction = creditnet.results$net.result)
results[100:115, ]

##     actual                                 prediction
## 900      0 0.0000000000000000000000000015964854322398
## 901      0 0.0000000000000000000000000065162871249459
## 902      0 0.0000000000164043993271687692878796349660
## 903      1 0.9999999999219191249011373656685464084148
## 904      0 0.0000000000000000013810778585990359033486
## 905      0 0.0000000000000000539636283549265018946381
## 906      0 0.0000000000000000000234592312583958126923
## 907      1 0.9581419934268182725389806364546529948711
## 908      0 0.2499229633059911748205195181071758270264
## 909      0 0.0000000000000007044361454974853363648901
## 910      0 0.0006082559674722616289282983714770125516
## 911      1 0.9999999878713862200285689141310285776854
## 912      0 0.0000000000000000000000000015562211243506
## 913      1 0.9999999993455563895849991240538656711578
## 914      0 0.0000000000000000000000000000003082538282
## 915      0 0.0000000019359618836434052080615331181690

We can round to the nearest integer to improve readability:

results$prediction <- round(results$prediction)
results[100:115, ]

##     actual prediction
## 900      0          0
## 901      0          0
## 902      0          0
## 903      1          1
## 904      0          0
## 905      0          0
## 906      0          0
## 907      1          1
## 908      0          0
## 909      0          0
## 910      0          0
## 911      1          1
## 912      0          0
## 913      1          1
## 914      0          0
## 915      0          0

As you can see, it is pretty close: in this excerpt all sixteen rounded predictions match the actual values. As already stated, this is a strongly simplified example. But it might serve as a basis for you to play around with your first neural network.
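Rather than eyeballing rows, the agreement can be summarised with a confusion matrix and an accuracy figure. The sketch below is self-contained and uses only the sixteen actual and rounded prediction pairs printed above (rows 900 to 915); the name excerpt is made up for illustration, and the result says nothing about the rest of the test set.

```r
## the sixteen actual / rounded prediction pairs printed above
excerpt <- data.frame(
  actual     = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0),
  prediction = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0)
)

## cross-tabulate actual against predicted class
confusion <- table(actual = excerpt$actual, predicted = excerpt$prediction)
print(confusion)

## share of rows where the predicted class equals the actual class
accuracy <- mean(excerpt$actual == excerpt$prediction)
print(accuracy)
```

The same two calls applied to the full results data frame would give the figures for all 1200 test rows.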

Saturday, February 15, 2014

An Introduction to Object Detection

Digital image processing refers to the processing of a two-dimensional picture by a digital computer; more generally, it implies digital processing of any two-dimensional data. A digital image is an array of real or complex numbers represented by a finite number of bits. Image segmentation is a key step in digital image processing. It was developed in the 1960s for image analysis. It is the process of grouping together pixels that are semantically linked: segmentation divides an image into its constituent regions or objects. The level to which segmentation is carried out depends on the problem being solved, i.e. segmentation should stop when the objects of interest in an application have been isolated. Segmentation accuracy determines the eventual success or failure of computerized analysis procedures. For this reason, considerable care is taken to improve the probability of rugged segmentation. In some situations, such as industrial inspection applications, at least some measure of control over the environment is possible at times. In others, as in remote sensing, user control over image acquisition is limited principally to the choice of image sensors.

Image segmentation is a tool used for precise image analysis. An input image of an object is taken and preprocessed. Preprocessing converts the image into a more suitable form and removes noise. Image smoothing and binarizing are the two stages of preprocessing. Various filters, such as the median filter, spatial average filter, linear filter and Gaussian filter, are used for image smoothing. In a few cases the noise is multiplicative, and noise smoothing filters are also designed for such images. A binarized image has only two levels, i.e. black and white, and is obtained by thresholding. The next step of image segmentation is feature extraction. Feature extraction generally refers to the extraction of discontinuities such as points, lines and edges, and of pixels forming homogeneous regions. Such features differ in gray level from the background area. Region growing is based on similarity criteria. It is an iterative process in which regions are merged, starting from individual pixels or an initial segmentation, and grow until every pixel is processed. The choice between edge and region features depends on the type of data being analyzed and on the application area.
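The two preprocessing stages can be sketched in a few lines of base R. This is a minimal illustration on a made-up 8 x 8 gray-level matrix, not the scheme discussed in this post: a 3 x 3 median filter for smoothing, followed by a fixed threshold for binarizing. The function names, the toy image and the threshold of 100 are all assumptions chosen for the example.

```r
## 3 x 3 median filter; border pixels are left unchanged
median_filter3 <- function(img) {
  out <- img
  for (i in 2:(nrow(img) - 1)) {
    for (j in 2:(ncol(img) - 1)) {
      out[i, j] <- median(as.vector(img[(i - 1):(i + 1), (j - 1):(j + 1)]))
    }
  }
  out
}

## binarize: 1 (white) where the gray level exceeds the threshold, else 0 (black)
binarize <- function(img, threshold) {
  (img > threshold) * 1
}

## toy image: dark background, bright 4 x 4 object, one isolated noisy pixel
toy_img <- matrix(10, nrow = 8, ncol = 8)
toy_img[3:6, 3:6] <- 200
toy_img[2, 7] <- 255

raw_binary      <- binarize(toy_img, 100)                  # noise survives
smoothed_binary <- binarize(median_filter3(toy_img), 100)  # noise removed
print(smoothed_binary)
```

In this toy case the median filter removes the isolated pixel but also erodes the four corners of the object, which is one reason the choice of smoothing filter depends on the data.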

Image Analysis

Therefore, the final output image is a segmented image in which the features of the foreground objects are extracted so precisely that they are separated from the background. The human visual system is more sensitive to edges than to other picture elements. As a result, if one uses either region growing or edge detection alone, one may lose some information about the objects of interest. For example, if one uses region growing alone, the lack of edge information can terminate the region-growing process at the wrong place, and if the similarity criteria are too strict, many false edges are generated. In other words, the region-growing process may not stop at the contour of the object. Combining region growing and edge detection is therefore a good research direction for improving segmentation results. The integrated method can exploit the edge information obtained from edge detection to help the region-growing process determine where and when to stop. In this way, the separated objects can have accurate contours that lie on the true edges. Edge-based and region-based approaches are complementary to each other and use ancillary information to guide the image segmentation procedure. Early researchers used edge information to check the boundaries produced by running region growing on the raw input image.
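The idea of letting edge information tell region growing where to stop can be illustrated with a small base R sketch. It is a simplified stand-in, not the integrated method described here: a region grows from a seed pixel over 4-connected neighbours whose gray level is within a tolerance of the seed's, and it refuses to enter pixels marked in an edge map. The function name grow_region, the toy image and the hand-set edge map (standing in for the output of an edge detector) are all assumptions.

```r
## grow a region from a seed pixel over 4-connected neighbours;
## a neighbour joins when its gray level is within `tol` of the seed's
## and it is not marked in the (optional) logical edge map
grow_region <- function(img, seed, tol, edges = NULL) {
  if (is.null(edges)) edges <- matrix(FALSE, nrow(img), ncol(img))
  region <- matrix(FALSE, nrow(img), ncol(img))
  region[seed[1], seed[2]] <- TRUE
  seed_value <- img[seed[1], seed[2]]
  queue <- list(seed)
  steps <- list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  while (length(queue) > 0) {
    p <- queue[[1]]
    queue <- queue[-1]
    for (s in steps) {
      q <- p + s
      inside <- q[1] >= 1 && q[1] <= nrow(img) && q[2] >= 1 && q[2] <= ncol(img)
      if (!inside || region[q[1], q[2]] || edges[q[1], q[2]]) next
      if (abs(img[q[1], q[2]] - seed_value) <= tol) {
        region[q[1], q[2]] <- TRUE
        queue[[length(queue) + 1]] <- q
      }
    }
  }
  region
}

## toy image with a weak contrast step between columns 3 and 4
toy_scene <- matrix(100, nrow = 5, ncol = 6)
toy_scene[, 4:6] <- 108

## hand-set edge map marking the step (stand-in for an edge detector)
toy_edges <- matrix(FALSE, nrow = 5, ncol = 6)
toy_edges[, 4] <- TRUE

leaky   <- grow_region(toy_scene, c(1, 1), tol = 10)                     # crosses the step
guarded <- grow_region(toy_scene, c(1, 1), tol = 10, edges = toy_edges)  # stops at the edge
```

With a tolerance of 10 the similarity criterion alone cannot see the weak step, so the unguarded region covers the whole image, while the guarded region stays within the first three columns.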

Any region growing technique may produce false boundaries because the uniformity criterion may not be satisfied over a given area even if there is no clear line where a transition occurs. Furthermore, it is likely that such boundaries will reflect the data structures and traversal strategies used during region growing.
The application of any region growing process can lead to three kinds of errors:
a) A boundary is not an edge and there are no edges nearby.
b) A boundary corresponds to an edge but it does not coincide with it.
c) There exist edges with no boundaries near them.

The probability of errors of the third type can be reduced significantly, if not eliminated altogether, by the proper selection of parameters. However, this results in an over-segmented image, because such parameter settings cause errors of the first type to increase. In order to achieve a meaningful segmentation, low-level features must be extracted first and subsequently linked together using a series of opportunistic grouping algorithms. At the lowest level the only information is similarity. The main goal of the segmentation scheme presented is to combine edge and region information to achieve a stable segmentation. The scheme is designed to operate on general home and stock photographs and returns a comprehensive region-based description of the visual content of an image. It is designed to facilitate image retrieval, has been tested on several images, and has been found to be robust, rapid and free of tuning parameters. Background noise is removed, and the reliability and accuracy of image segmentation are increased. It offers precise segmentation in detecting multiple objects of different sizes and non-rigid targets. It improves static image segmentation, and the computational load is low. Stable segmentation of satellite images is achieved by this process.