Weka problem filtering instances null

Let me show an example of how it works. I will start with a simple text collection, which is a small sample of the publicly available SMS Spam Collection. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters; it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources.

I will make use of a small subset in order to better illustrate my points in this post. It contains messages like: "Available only in bugis n great world la e buffet Cine there got amore wat", "Joking wif u oni", "U c already then say", and "Tb ok!".

Another sample message: "I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can." Among the first messages of the collection, 33 are spam and the rest are legitimate ("ham"). This collection can be loaded in the WEKA Explorer, showing something similar to the following window. The point is that messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules based on the presence of particular words (e.g. classifying a message that contains "FREE" as spam).

Here is where the StringToWordVector filter comes to the rescue. Once selected, you should be able to see something like this. If you click on the name of the filter, you will get a lot of options, which I leave for another post.
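Conceptually, what StringToWordVector does can be sketched in a few lines of plain Java. This is a hypothetical, simplified mock (binary word presence, a naive tokenizer), not WEKA's actual implementation, which also supports word counts, TF-IDF weighting and configurable tokenizers:

```java
import java.util.*;

public class WordVectorSketch {
    // Lower-case and split on non-alphanumerics; WEKA's real tokenizers are configurable.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // The vocabulary becomes the attribute list of the filtered dataset.
    static List<String> buildVocabulary(List<String> messages) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String m : messages) vocab.addAll(tokenize(m));
        return new ArrayList<>(vocab);
    }

    // One row of the indexed collection: 1 if the token occurs in the message, else 0.
    static int[] toVector(String message, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(message));
        int[] v = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            v[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("Joking wif u oni", "U c already then say");
        List<String> vocab = buildVocabulary(messages);
        System.out.println(vocab);
        System.out.println(Arrays.toString(toVector(messages.get(0), vocab)));
    }
}
```

Each distinct token becomes one attribute, which is exactly why the filtered dataset suddenly has many more attributes than the original two.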

For my goals in this one, you can just apply this filter with the default options to get an indexed collection of messages, with the indexing tokens plus the class attribute, shown in the next picture. If you want to see colors showing the distribution of attributes (tokens) according to the class, you can just select the "class" attribute as the class for the collection in the bottom-left area of the WEKA Explorer.

So you can see that the attribute "Available" occurs in just one message, which happens to be a legitimate (ham) one. Now we can run our experiments in the Classify tab.

We can just select cross-validation using 3 folds (1), point to the appropriate attribute to be used as the class, which is the "spamclass" one (2), and select a rule learner like PART in the classifier area (3). This setup is shown in the next figure. The selected evaluation method, cross-validation, instructs WEKA to divide the training collection into 3 sub-collections (folds) and perform three experiments.

Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. So, if we click the "Start" button, we get the output of our experiment, featuring the classifier learnt over the full collection and the values of the typical accuracy metrics averaged over the three experiments, along with the confusion matrix.
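The fold mechanics can be sketched in plain Java. This is an illustrative mock with round-robin assignment of instances to folds; WEKA actually randomizes the order and stratifies so that each fold preserves the class proportions:

```java
import java.util.*;

public class CrossValidationSketch {
    // Split instance indices 0..n-1 into k folds, round-robin.
    static List<List<Integer>> kFolds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = kFolds(9, 3);
        // Each experiment: train on two folds, test on the remaining one.
        for (int test = 0; test < folds.size(); test++) {
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < folds.size(); f++) {
                if (f != test) train.addAll(folds.get(f));
            }
            System.out.println("experiment " + test + ": train=" + train
                    + " test=" + folds.get(test));
        }
    }
}
```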

The classifier learnt over the full collection is shown in the output. Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training, making the learner try to generalize from a token that does not occur in the test collection. And when it is in the test fold, the learner should not even know about it!
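To see the leakage concretely, here is a toy sketch (hypothetical mini-messages, naive tokenizer) comparing a vocabulary built on the full collection before splitting with one built on the training fold only:

```java
import java.util.*;

public class LeakageSketch {
    static Set<String> vocabulary(List<String> messages) {
        Set<String> vocab = new TreeSet<>();
        for (String m : messages) {
            for (String t : m.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty()) vocab.add(t);
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        // Hypothetical split: "available" occurs only in the test fold.
        List<String> trainFold = List.of("joking wif u oni", "free entry win prize");
        List<String> testFold = List.of("available only in bugis");
        List<String> all = new ArrayList<>(trainFold);
        all.addAll(testFold);

        // Indexing BEFORE cross-validation: the learner is given an attribute
        // that only exists in the test fold -- information leakage.
        System.out.println(vocabulary(all).contains("available"));
        // Indexing the training fold separately: the token is unseen,
        // as it would be for genuinely new messages.
        System.out.println(vocabulary(trainFold).contains("available"));
    }
}
```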

Moreover, what happens with attributes that are highly predictive over the full collection according to their statistics (when computing, for instance, Information Gain)? They may have better or worse statistics when a subset of their occurrences is not seen, as those occurrences can be in the test collection!
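Information Gain, for instance, can be computed from a 2x2 contingency table of token occurrence versus class. The sketch below uses made-up counts to show how a token's score changes when part of its occurrences is held out in a test fold:

```java
public class InfoGainSketch {
    // Shannon entropy (in bits) of a discrete distribution given as counts.
    static double entropy(double... counts) {
        double total = 0;
        for (double c : counts) total += c;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // IG = H(class) - sum over attribute values v of P(v) * H(class | v),
    // for a binary token attribute (message contains it or not).
    static double infoGain(double spamWith, double hamWith,
                           double spamWithout, double hamWithout) {
        double n = spamWith + hamWith + spamWithout + hamWithout;
        double with = spamWith + hamWith, without = spamWithout + hamWithout;
        return entropy(spamWith + spamWithout, hamWith + hamWithout)
                - (with / n) * entropy(spamWith, hamWith)
                - (without / n) * entropy(spamWithout, hamWithout);
    }

    public static void main(String[] args) {
        // A token occurring in 30 spam / 0 ham messages looks highly predictive...
        System.out.println(infoGain(30, 0, 10, 60));
        // ...but hide a third of its spam occurrences in a test fold and its
        // score over the remaining training data drops.
        System.out.println(infoGain(20, 0, 10, 60));
    }
}
```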

The right way to perform a correct text classification experiment with cross-validation in WEKA is to feed the indexing process into the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that we index and train on every training sub-set of the cross-validation run. In fact, this is not that difficult. Let us go back to the original test collection, which features two attributes: the message as a string, and the class.

Then you must choose the filter and the classifier you are going to apply to the collection, by clicking on the classifier name in the "Classifier" area. If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results: we catch 4 fewer spam messages, and the True Positive ratio goes down.
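In the Explorer this means choosing weka.classifiers.meta.FilteredClassifier and setting StringToWordVector as its filter and PART as its classifier. What that setup guarantees can be mimicked by a small, hypothetical plain-Java pipeline in which the "filter" state (here, the vocabulary) is learnt from the training fold only:

```java
import java.util.*;

public class FilteredPipelineSketch {
    private Set<String> vocab; // filter state, learnt from training data only

    // Fit the "filter" on the training fold; a real pipeline would then train
    // the classifier (e.g. PART) on the vectors produced with this vocabulary.
    void fit(List<String> trainMessages) {
        vocab = new TreeSet<>();
        for (String m : trainMessages) {
            for (String t : m.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty()) vocab.add(t);
            }
        }
    }

    // Apply the already-fitted filter to a test message: tokens unseen during
    // training are simply dropped, exactly as in production.
    List<String> transform(String message) {
        List<String> kept = new ArrayList<>();
        for (String t : message.toLowerCase().split("[^a-z0-9]+")) {
            if (vocab.contains(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        FilteredPipelineSketch p = new FilteredPipelineSketch();
        p.fit(List.of("joking wif u oni", "u c already then say"));
        System.out.println(p.transform("available joking wat"));
    }
}
```

Because fitting happens inside each cross-validation run, no test-fold token can ever influence training, which is precisely the point of chaining the filter and the learner.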

This setup is more realistic: it better mimics what happens in the real world, where we will find highly relevant but previously unseen events, and our statistics may change dramatically over time.

So now we can run our experiment safely, as no unseen events will be used in the classification. Moreover, the same holds if we apply any kind of Information Theory based filter (e.g. one ranking attributes by Information Gain): it will be computed on the training folds only. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

Thanks for the excellent post. The test collection is not balanced: there are more negative instances (ham) than positive instances (spam). Doesn't it affect the model performance?

I have tried to run the classifier on sample spam data. My sample contains some ham instances and 82 spam ones. Following is part of the PART classifier output. Can you explain how to read the PART decision list notation and the 'predictions on test data' part, especially for one particular instance? Thanks, Nirmala.

Dear Nirmala, first, thank you for your comment.

About your questions. Q1: "The test collection is not balanced, there are more negative instances (ham) than positive instances (spam), doesn't it affect the model performance?" Up to a point, most learning algorithms are able to handle class imbalance. For more extreme class ratios, I recommend using weighting as a variation of stratification.
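Instance weighting can be sketched as follows. This is a generic inverse-frequency scheme (a common choice, not necessarily the exact one WEKA applies), with made-up labels:

```java
import java.util.*;

public class ClassWeightSketch {
    // Weight each instance by n / (numClasses * count(class)), so both classes
    // contribute equally to training despite the imbalance.
    static Map<String, Double> classWeights(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(),
                    (double) labels.size() / (counts.size() * e.getValue()));
        }
        return weights;
    }

    public static void main(String[] args) {
        // 8 ham vs 2 spam: each spam instance gets 4x the weight of a ham one.
        List<String> labels = List.of("ham", "ham", "ham", "ham", "ham",
                "ham", "ham", "ham", "spam", "spam");
        System.out.println(classWeights(labels));
    }
}
```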

About Q2 and Q4, on reading the PART output: a decision list is evaluated top to bottom, the first rule whose conditions match classifies the instance, and the fraction after each rule shows the number of training instances the rule covers and, after the slash, how many of those it misclassifies. Thanks again for your comment. Jose Maria.

Thanks for the great posts. I am new to Weka and find your material extremely useful. I am now confused about when to use a FilteredClassifier. Is it meant to be used only in certain cases, say when cross-validation is used? What about when a percentage split or a supplied test set is used?

Specifically, for my task (similar to Text Categorization), I would like to use the StringToWordVector filter to create N-grams, then use that data with different classifiers. For instance, for 2-grams, test many classifiers (about 8), then build 3-grams and test many classifiers, then 4-grams, 5-grams, ... to find the best combination of N-grams and classifiers. I aim to do this using iteration in Java (a for loop).

Hi, the problem is that the filter converts every term in a string into an attribute, and there must be a term "review" or "sentiment" in your data section.

Therefore the attributes get duplicated. So, change the names of these two attributes to something like "myreview" and "mysentiment", or to something that is unlikely to occur in your data; it should work. I also encountered the same problem: the word "domain" appeared in the data, causing the filter to create a clashing attribute when it processed it.

My solution was to remove all occurrences of "domain" from the data and keep only the "domain" attribute. The easiest solution to avoid these attribute name clashes is to use a prefix for the generated attributes.
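The prefix idea (StringToWordVector exposes it as its attribute-name prefix option, -P) can be illustrated with a small hypothetical sketch, where "review" and "sentiment" are the dataset's own attributes:

```java
import java.util.*;

public class AttributePrefixSketch {
    // Generate word attributes with a prefix so they can never collide with
    // the dataset's own attribute names.
    static List<String> wordAttributes(String prefix, List<String> tokens,
                                       Set<String> existing) {
        List<String> names = new ArrayList<>();
        for (String t : new TreeSet<>(tokens)) {
            String name = prefix + t;
            if (existing.contains(name)) {
                throw new IllegalStateException("duplicate attribute: " + name);
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("review", "sentiment");
        // Without a prefix, the token "review" would clash with the existing
        // attribute; with one, every generated name is distinct.
        System.out.println(wordAttributes("w_", List.of("review", "great"), existing));
    }
}
```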

