Tips on writing regular expressions for searching the post list

Way to force "Accept" Subjects ONLY?

Postby Kaigi » Sat Oct 18, 2003 3:04 am

This is my first post here. I'm a 'Founding Registerer' of Newsbin (when I registered, I was told I was one of the first 100 people to ever pay the registration fee... this was back in the late V1 or early V2 stages of the program: roughly early '97). I've been with the program a LONG time, but never really learned the advancing complexities. Now I want to try at least a bit more than I have been doing.

Among other things, I collect celebrity pics. With Celeb groups, there are just such a huge number to go through when I check them (today's count? 59,511 posts) it is obviously impossible to sort through them 'by hand'.

I have been going through and manually doing searches via the "Find" bar at the top of the program. However, doing that with over 80 celebs I collect - doing each search two to three times is simply WAY too time-consuming (searching two-three times because sometimes someone will post under the whole name, sometimes only under the first or under the last, sometimes there is a hyphen or underscore between first and last, etc, so one can't count on the full name search... not to mention when people spell either the first or last names wrong).

I had the brilliant idea of creating a new filter profile with everyone I wanted in the "Accept" part for just the celeb groups! :D I put in all the assorted variations of the names I wanted (working on the file in a text editor for all the variations I thought of made life a tad easier). That came to a total of 330 lines [that's hundreds of things I used to enter MANUALLY about every few weeks!].

However, when I cranked up Newsbin, EVERY post showed up, and stayed there, even when I used the pull-down menu to 'apply filter'. :(

Is there ANY way to have an "anything that isn't in the 'Subject Accept filters' is Rejected" reject filter? In other words, is there any way to set up the program so that it will ONLY accept what is in the Subject Accept filter list and nothing else?

[And that - of course - brings me to: "Will the program recognize that I want any post where PART of the Subject line is what I have in my Accept filter?" (i.e. that it WON'T say, "oh, there's an extra space or character before or after that... gotta reject it!") Or do I have to do variations such as /whoever/, */whoever/*, */whoever/, /whoever/* for EVERY entry in the Accept Filter list?]
Postby Quade » Sat Oct 18, 2003 12:28 pm

You have to assign the filter profile to the group. Right click the group and assign it if you haven't already. Make sure "Ignore Filters" isn't checked too.

Any part of the name will match. If you're concerned about embedded spaces, you can actually handle that with the syntax of the filter. Could you post an example name and its variations, so we can post some RE examples?
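A minimal sketch of "any part will match" and of handling separators inside the pattern itself. It uses Python's `re` module as a stand-in; Newsbin's own regex dialect may differ, and the sample subjects are made up for illustration.

```python
import re

# Assumption: the filter behaves like an unanchored, case-insensitive search.
# [ ._-]? allows an optional space, dot, underscore or hyphen between the names.
pattern = re.compile(r"katherine[ ._-]?heigl", re.IGNORECASE)

# Hypothetical subject lines, not taken from a real group.
subjects = [
    "Katherine Heigl - pic01.jpg",
    "katherine_heigl_02.jpg (1/1)",
    "KatherineHeigl03.jpg",
    "Try M$ patch",
]

# Keep only the subjects where the pattern is found anywhere in the text.
matches = [s for s in subjects if pattern.search(s)]
print(matches)
```

One pattern covers the spaced, underscored and run-together spellings, so separate entries for each variation should not be needed if the filter engine supports character classes.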
Did have the filters assigned to the group... [I thought]

Postby Kaigi » Sat Oct 18, 2003 8:19 pm

Hi Quade (hope things are going well with the Mars rebuilding :wink: ),

First off, thanks for the quick response.

I thought I had the filters assigned to the group, since they were the ones present when I opened the group (I checked which filters were present before doing the 'Download Latest' just a minute ago). And I do not have 'Ignore Filters' checked in this Celeb.nbi file.

I assigned the filters via a right-click of the selected groups they are to apply to (all celeb groups are in a separate "Celebs.nbi" file, to keep it easier, so all are highlighted and it covers them all, presumably), selecting "Filter Profile" from the listing, then selecting the proper filter, then hitting "Save" [even though I have the program set to "Save on Exit", regardless I manually did the save here].

Two problems: now that filter set is assigned to ALL "_.nbi" sets of groups, and they can't be changed back to either of the other two filter sets I have, AND they don't actually FILTER anything! (i.e. things that are clearly NOT in my 'Celebs' "Accept" filter set are showing up in the Post List).

I have changed back and forth between "_.nbi" group sets this afternoon to double-check this, and I am finding that no other filter set STAYS attached to any other groups. Once I set the 'Celebs' filters for the Celebs.nbi set of newsgroups, I get the 'Celebs' filters locked in place when I move to another "_.nbi" group set, even though THEY had a specific filter group assigned to them (plus the 'not filtering' thing is still going on, too).

IMMEDIATELY after I've tried to correct THAT, and assigned the appropriate filter set to that group (via right click "Filter Profile" and selecting), start a download with JUST that group, I look at the filters that are active, and find that it is still the 'Celebs' filter set, even though I JUST changed it. [i.e. it is staying STUCK on the one filter set, even though I specifically try to apply another set to any other group, in addition to it not actually filtering anything.]

So I guess I am not getting them to stay assigned to the groups properly (though, for the life of me, I can't find/think of any other way to do so!)

A couple sample names and variations I was using in the filter file [in case that has something to do with why they aren't filtering anything]:
Katherine Heigl
Catherine Zeta Jones

"Katherine" and "Catherine" are WAY too common to use as selections on their own, so I didn't. The "K_Heigl" and others with the underscore are used because some people post with only the filename visible, and that is how many are named... also, many Catherine Zeta Jones pics are named simply "CZJ", so I included that, as well, for instance. "Zeta" and "Heigl" are used alone because they are SO unique, the chances of a 'false positive' are too small to worry about.

Obviously, with the 'celebrities' newsgroups, there is simply NO WAY to have an "exclude" list, unless I was to sit down and write out EVERY OTHER celebrity's name in it! [i.e. it simply is not possible to have an exclude list that actually does an "everything else" that is manually produced.]
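The "accept only what is listed, reject everything else" behaviour asked about here can be sketched as a whitelist check. This is only an illustration in Python, not Newsbin's actual logic; the patterns come from the variations described above and the sample subjects are invented.

```python
import re

# Accept patterns based on the names discussed above. Case-insensitive
# matching is an assumption about the desired behaviour.
ACCEPT = [re.compile(p, re.IGNORECASE) for p in ("heigl", "zeta", "czj")]

def accepted(subject: str) -> bool:
    """Accept a subject only if at least one pattern matches; reject the rest."""
    return any(p.search(subject) for p in ACCEPT)

# Hypothetical subject lines.
subjects = ["K_Heigl_014.jpg", "CZJ premiere 3.jpg", "Try M$ patch"]
kept = [s for s in subjects if accepted(s)]
print(kept)
```

With this shape no exclude list is required: anything that matches none of the accept patterns simply never makes it into `kept`.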
Postby Kaigi » Sat Oct 18, 2003 8:59 pm


Just realized/noticed/thought of something with regards (at least) to the filters: that a dot "." instead of EITHER the space " " or underscore "_" was the actual appropriate way to do these filters (so just corrected it: e.g. "Katherine.Heigl" instead of BOTH "Katherine Heigl" and "Katherine_Heigl")... no 'test subjects' yet to see if this made a difference. [Thought about what you'd said, "will accept any part"... well, if there is a space, or an underscore in any post, I started to wonder if THAT "any part" could be why the filters didn't seem to be working!]
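A quick way to check what the dot does, again using Python's `re` as a stand-in for the filter engine (an assumption; the two may not behave identically):

```python
import re

# "." matches exactly one character of any kind.
dot = re.compile(r"Katherine.Heigl")

candidates = [
    "Katherine Heigl",   # space between the names
    "Katherine_Heigl",   # underscore
    "Katherine-Heigl",   # hyphen
    "KatherineHeigl",    # nothing between the names
]

# Map each candidate to whether the pattern was found in it.
results = {text: bool(dot.search(text)) for text in candidates}
for text, hit in results.items():
    print(f"{text!r}: {hit}")
```

The first three match, but "KatherineHeigl" does not, because the dot needs one character to consume. Writing `Katherine.?Heigl` makes that character optional.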

Just a 'thinking update' here... hope that helps... and thanks again!
Postby Smite » Sun Oct 19, 2003 2:15 am

Newsbin has been known to have issues with large filter lists, though usually it's a matter of ignoring the latter filters.

If you create a filter with just the first 10 of the subject accept filters you're using, does it work correctly? Since there's only 10 there, it should be much easier to check a subject by hand to see if it's matching the filter.
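Checking a subject by hand against a short list can also be done outside the program. The sketch below is a hypothetical helper, not part of Newsbin; it assumes each filter is a regex applied as an unanchored, case-insensitive search.

```python
import re

def matching_filters(subject, filters, limit=10):
    """Return the patterns among the first `limit` filters that match `subject`."""
    hits = []
    for text in filters[:limit]:
        if re.search(text, subject, re.IGNORECASE):
            hits.append(text)
    return hits

# A few example filters in the style discussed in this thread.
filters = ["Katherine.Heigl", "Heigl", "Zeta", "CZJ"]

hits = matching_filters("katherine_heigl_02.jpg", filters)
print(hits)  # ['Katherine.Heigl', 'Heigl']
```

An empty result for a subject that still shows up in the Post List would suggest the profile is not being applied at all, rather than a pattern being wrong.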

Also, the Post List tab will list the number of posts displayed out of the total available. Is it saying all of them are displayed? (i.e. 9334/9334)

Kaigi wrote:I look at the filters that are active

I'm not sure that's actually possible. Mind telling me how you do that? You might be mis-interpreting something.

You can tell which (single) filter profile is assigned to a group, by checking what it says under the "filter profile" column beside the group names.
Ta-daaaa! [I'm hoping! ;) ]

Postby Kaigi » Sun Oct 19, 2003 5:00 am

Hi Smite,

Thanks for the response.

That did lead me to a less-hassle-filled way of doing things than I HAD been doing them, but is still a mini-drag to follow through with (though, G-d knows, a HELL of a lot less of a drag than trying to do all that manually!!!)

[Numbering below is to keep my own thinking straight... not meant to 'read' as if upset or anything, OK?]

First: NO group - when right clicked and then selecting "Filter Profile" - EVER showed a group attached to it, even when I just did it 2-3 seconds before (using the right click, selecting "Filter Profile", then choosing the profile, sometimes REPEATEDLY doing this sequence in a row). If Newsbin is supposed to show what profile is attached at that point, it wasn't/isn't doing it. I was checking which filter profile was active by selecting the profile, starting the single group I'd just 'profiled' downloading, going into "Filters" and looking at which were listed there. It remained the same (and 'wrong') regardless of what Filter Profile I'd just selected for the group. That may, indeed, be me misinterpreting something... I don't know. I'm guessing from what you say about Newsbin having trouble with long filter lists, that the reason it defaulted to "Celebs" instead of even the "Default" filter collection is because "Celebs" was the first group listed in the "FILTERS.XML" file... it quite possibly wasn't working because - between the three different filter profiles I had in it, my "FILTERS.XML" file was 996 lines long!

Second: I knew that the filter was not doing its job because things that CLEARLY were not on ANY of my 'Accept' lists were right there in front of me even when I would manually use the pull-down menu "Utilities/Apply Filters" e.g. "Try M$ patch" has NO words in it AT ALL that are anywhere in ANY of my filter sets! At that point I knew, instantly, that it wasn't working. [A strict number comparison {e.g. 63,500/63,500} wasn't really possible with most of the groups I had a different filter set for, because my ISP holds contents of some groups for three MONTHS, and I just downloaded some of the newer posts {using the Find bar} within the last 48 hours.]

Third: (and here's where your information about Newsbin having problems with long filter lists REALLY helped [THANKS!!], and your suggestion about limiting filters to 10 helped me find a 'workable' way of doing things {though not as small as 10}). I had three filter sets, including my old "Default". With all three in the 'standard' file (as Newsbin automatically puts them there), as I said, my prior "FILTERS.XML" file was 996 lines long total. I tried making THREE DIFFERENT "FILTERS.XML" files, and leaving only ONE filter group in each. Now each of the three 'subfilters' are roughly 1/2-1/3 that size! I just cranked up one [the largest, actually, at 530 lines total file length] and found that it had - indeed - filtered appropriately (from what things I checked - which was a LOT before I opted to possibly 'close' this discussion). Of course, I labelled duplicate copies of each file (such as "celebFILTERS.XML"), so that I could simply copy them, then strip the name down to "FILTERS.XML" when I want to change them. Not as efficient as if I could really get that to work otherwise (such as having one FILTERS.XML file with the three filter sets in it and being able to assign a filter set to a group or set of groups and just run things 'straight'), but at least it appears to work, and is CONSIDERABLY easier than going through and using the "Find" area to 'collect' things for download!

So: the answer to my original question appears to be "yes, it is possible, but no, you can't do it with just one 'FILTERS.XML' file if you have a lot of filtering to do, and you have to shut down Newsbin, and manually switch the 'FILTERS.XML' files out when you switch group collections".

Thanks SO INCREDIBLY much for the info about Newsbin having trouble with large filter lists! This makes life SO much easier (now I can just sit back and let the wanted files - and pretty much ONLY the wanted files! - stream in! :D
Re: Ta-daaaa! [I'm hoping! ;) ]

Postby itimpi » Sun Oct 19, 2003 6:32 am

Kaigi wrote:First: NO group - when right clicked and then selecting "Filter Profile" - EVER showed a group attached to it, even when I just did it 2-3 seconds before (using the right click, selecting "Filter Profile", then choosing the profile, sometimes REPEATEDLY doing this sequence in a row). If Newsbin is supposed to show what profile is attached at that point, it wasn't/isn't doing it.

Where were you looking to see what profile was active? It is in the Groups/Servers tab where there is a column labelled "Filter Profile" (you may have to scroll to the right to make it visible). This is set to DEFAULT when you first add a group. It is NOT shown in the Filter tab bar - that is where you apply new filters that apply to the view only and are temporary.
Re: Ta-daaaa! [I'm hoping! ;) ]

Postby Smite » Sun Oct 19, 2003 6:18 pm

Kaigi wrote:First: NO group - when right clicked and then selecting "Filter Profile" - EVER showed a group attached to it, even when I just did it 2-3 seconds before (using the right click, selecting "Filter Profile", then choosing the profile, sometimes REPEATEDLY doing this sequence in a row).

That's correct. The right-click -> Filter Profile command is for choosing a filter, it does not show you what's currently chosen. Like I said above, you have to check what's in the Filter Profile column to do that.

Kaigi wrote:If Newsbin is supposed to show what profile is attached at that point, it wasn't/isn't doing it. I was checking which filter profile was active by selecting the profile, starting the single group I'd just 'profiled' downloading, going into "Filters" and looking at which were listed there. It remained the same (and 'wrong') regardless of what Filter Profile I'd just selected for the group.

This will not tell you which filter profile is selected. See above.

Kaigi wrote:Second: I knew that the filter was not doing its job because things that CLEARLY were not on ANY of my 'Accept' lists were right there in front of me even when I would manually use the pull-down menu "Utilities/Apply Filters" e.g. "Try M$ patch" has NO words in it AT ALL that are anywhere in ANY of my filter sets! At that point I knew, instantly, that it wasn't working.

If you select the filter you want from the dropdown on the filter bar, does it then filter properly? And again, do you have "Ignore Filters" checked on the filter bar? I do not care about any of the contents of the .nbi file, since they do not affect this.

Kaigi wrote:Third: (and here's where your information about Newsbin having problems with long filter lists REALLY helped [THANKS!!], and your suggestion about limiting filters to 10 helped me find a 'workable' way of doing things {though not as small as 10}). I had three filter sets, including my old "Default". With all three in the 'standard' file (as Newsbin automatically puts them there), as I said, my prior "FILTERS.XML" file was 996 lines long total. I tried making THREE DIFFERENT "FILTERS.XML" files, and leaving only ONE filter group in each. Now each of the three 'subfilters' are roughly 1/2-1/3 that size! I just cranked up one [the largest, actually, at 530 lines total file length] and found that it had - indeed - filtered appropriately (from what things I checked - which was a LOT before I opted to possibly 'close' this discussion). Of course, I labelled duplicate copies of each file (such as "celebFILTERS.XML"), so that I could simply copy them, then strip the name down to "FILTERS.XML" when I want to change them. Not as efficient as if I could really get that to work otherwise (such as having one FILTERS.XML file with the three filter sets in it and being able to assign a filter set to a group or set of groups and just run things 'straight'), but at least it appears to work, and is CONSIDERABLY easier than going through and using the "Find" area to 'collect' things for download!

I had intended you to split your one long filter profile into multiple profiles. I don't believe splitting the different filters into different filters.xml files truly makes a difference, since most people don't have to play with the files in order to make their filters work.

As an aside, unless you know exactly what you're doing and why, it's usually counterproductive to mess with the files in the Newsbin directory, as opposed to using the Newsbin program itself to set the options. Hand editing of both the .nbi files and the filters.xml file should be completely unnecessary.

Kaigi wrote:So: the answer to my original question appears to be "yes, it is possible, but no, you can't do it with just one 'FILTERS.XML' file if you have a lot of filtering to do, and you have to shut down Newsbin, and manually switch the 'FILTERS.XML' files out when you switch group collections".

I would be very surprised if this were the actual case.
That said, the fact that you can get it to work with the exact same filter in one filters.xml file, as opposed to a different one, does make me wonder...

Since you hand edit these files, it's possible one of the other filters is in some sort of illegal format, and that's what was causing your filters to fail.
Postby Quade » Sun Oct 19, 2003 6:26 pm

I'll be honest, once more I'm not reading this because it's too long. Glad you're on the job Smite.
I think you're right...

Postby Kaigi » Sun Oct 19, 2003 7:49 pm

Thanks Smite (and Itimpi),

I appreciate the time involved in responding in such detail to the complex (and LENGTHY) stuff I've been writing. :) [I suppose it's better for me to 'overexplain' than not explain enough... I've tried to help people with computer problems who left out SO much information it was absolutely impossible to tell what they were talking about!]

I not only misinterpreted something originally with the program, I misunderstood something you'd written: I wasn't checking the "Filter Profile" column: I've had it hidden for SO long (since I'd never ventured into different filters before) that I'd COMPLETELY forgotten it even existed (so didn't think to look for it, even when you specifically said, "Filter column"... it wasn't until Itimpi's statement "you may have to scroll to see it" that I found it). [My personal icon here makes more sense, eh? I've had three head injuries and sometimes miss the obvious!] When I looked there I found that the filters WERE linked to the groups I'd assigned them to.

You asked, "If you select the filter you want from the dropdown on the filter bar, does it then filter properly? And again, do you have "Ignore Filters" checked on the filter bar?" The answers were both "no" to these. (I did/have used "Ignore Filters" AFTER the fact {after all headers were downloaded} to see if the filters did their jobs... that's how I found they were working later, because the things I didn't want were, indeed, marked 'rejected by filtering'.)

Actually, I had three filter profiles (those that I split into the three separate files). Each of those was as small as it could get to still do what I needed/wanted with the groups I was applying them to.

I was QUITE comfortable back in the DOS days, so know my way around most files that can be edited in a text editor (and know a little about HTML {as I'd built a bit of a web page before the third head injury made pursuing that further impossible as it would have then involved sales and such which I no longer had the capacity for}). HTML is similar to the basic organization in the "FILTERS.XML" file. I've also hand-edited the 'default set-up' files with computer games to change frame refresh-rate, gamma, all the movement criteria key-links, etc, so have some experience with files that open with a text editor [I, obviously, never try working with one that *doesn't* open with a text editor, and ALWAYS make a back-up of the original file in case I ever screw something up really bad... I can just pop the old one back in place if I need to give up on what I was trying to do.] I was also working with the earlier versions of Newsbin, where certain files virtually *invited* tinkering to tweak them (I think some actually REQUIRED manual editing for some things {this was in late-version 1 or during version 2 of Newsbin}).

Since I already HAD a complete list of who I was going to look for in the Celebs groups, it was MUCH easier to hand-edit the XML file to include them than to enter each individually in Newsbin's "Accept Subject Filter : Add" area. (To hand-edit, it was a simple cut-and-paste of names and name fragments that I'd edited easily in a word processor from a 'directory list' made from the Celebs directory on my hard drive... much easier, quicker, and less likely to get typos than if I added each of the hundreds of entries individually to the file through the 'normal' interface).
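The cut-and-paste step of turning a list of names into filter lines can be sketched as below. The helper name is hypothetical, and the output is only the pattern text; how those lines are placed inside the XML file is not covered here.

```python
import re

def name_to_pattern(name: str) -> str:
    """Turn 'First Last' into 'First.Last', so any single separator matches."""
    # Split on runs of whitespace, underscores or hyphens.
    parts = re.split(r"[\s_\-]+", name.strip())
    # Escape each part in case a name contains regex metacharacters.
    return ".".join(re.escape(p) for p in parts)

# Names as they might come from a directory listing.
names = ["Katherine Heigl", "Catherine Zeta Jones", "Katherine_Heigl"]

# A set removes duplicate patterns; sorting keeps the output stable.
patterns = sorted({name_to_pattern(n) for n in names})
print(patterns)  # ['Catherine.Zeta.Jones', 'Katherine.Heigl']
```

Generating the lines this way avoids retyping hundreds of entries and collapses the space and underscore variants of the same name into one pattern.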

Then - since you'd mentioned there'd been occasional problems with longer files - I tried splitting the file since the 'big one' didn't seem to be working right.

I DO think that something that I didn't think of may have been the culprit in the filter(s) not working (that Quade's original comment brought to mind, and that I think I mentioned in my second response to him): that I had *spaces* and *underscores* in my filter names. It took his statement to 'remind' me that these characters do not work in this filter file format. As you suggest here, it may very well be that THEY were the reason that the filters did not seem to be working at all, as they 'invalidated' either the filter profile they were in, OR they invalidated the entire FILTERS.XML file! [It appeared that every post showed up... I just realized that if I'd thought to check to see if the file-size limits still worked at that point, that actually would've meant that ONLY the "name" portion of the filter profile was 'broken', not the whole thing... ain't hindsight wonderful? :/]

I did try 'hand-selecting' the filter profile, then 'hand-applying' it (with the pull-down menu) and that did not change anything with the way it did (or rather, did not) act.

Since you mention that other users don't have this problem of 'too many filters so they don't work', I am 'rebuilding' the "FILTERS.XML" file (re-adding all three different profiles to the one XML file, since they have now all been tested - and worked - individually, even the one with 530 entries in the <SUBJECTACCEPT> area!). I will test that new corrected/combined one to see if it does - indeed - work now as a whole.... This would be pretty conclusive that it was not working before NOT because it was 'too big', but because I had the spaces and underscores in it which invalidated the whole file, so Newsbin (intelligently!) ignored it/them.

I will have to wait a few days to get enough 'new' posts on the different groups to check this new combined version of the XML file to see if the XML file size is - or is not - a culprit here [or rather that people like me who want it to do such extreme filtering are the culprit <sheepish wink> ].

Thanks so much for your help, again. As Quade pointed out just above: this IS a massive amount to read and take in, and I REALLY appreciate that you have been, and are, taking the time to help me out with this! :D
Postby Smite » Sun Oct 19, 2003 8:28 pm

I don't believe spaces or underscores are invalid characters, since they should be treated the same as any other characters in an XML file. I would think only < and > could cause problems. But then, I don't know the specifics of how the XML file is parsed, so I could be missing something.
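This can be checked with a generic XML parser. The sketch uses Python's standard library, not Newsbin's own parser (whose behaviour is unknown here), and borrows the `SUBJECTACCEPT` element name from earlier in the thread purely as an illustrative wrapper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def round_trip(text: str) -> str:
    """Escape &, < and >, wrap the text in an element, and parse it back."""
    xml = "<SUBJECTACCEPT>" + escape(text) + "</SUBJECTACCEPT>"
    return ET.fromstring(xml).text

# Spaces and underscores survive untouched; "<" and ">" survive once escaped.
samples = ["Katherine Heigl", "K_Heigl", "a<b>c"]
all_ok = all(round_trip(s) == s for s in samples)

# An unescaped "<" inside the text makes the fragment malformed.
try:
    ET.fromstring("<SUBJECTACCEPT>a<b</SUBJECTACCEPT>")
    raw_angle_rejected = False
except ET.ParseError:
    raw_angle_rejected = True

print(all_ok, raw_angle_rejected)
```

In a standard parser, then, spaces and underscores are ordinary text, and only a raw `<` (or `&`) inside a hand-edited entry would break the file.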

If you find that the total filter.xml filesize is indeed the issue, then a short post that Quade will actually read would probably be helpful. :)

Glad to see you've at least found something that works for you though.
Postby Kaigi » Sun Oct 19, 2003 11:34 pm

So far, the new - fully self-contained - FILTERS.XML file (with all three of the filter sets in it) seems to be working PERFECTLY! This, even though the FILTERS.XML file is now over 1000 lines long and actually has two additional {tiny} filter sets for two other group-sets in it (added since it was working 'as is' ...er, 'was' :wink: ).

I thought I remembered that - I believe in a private e-mail a year or two ago (bizarre which little things I can remember, when I often forget to EAT because of this third head injury!) - Quade told me that a space doesn't work in this type of filter (but I still have some spaces in some file filters from the Default filter set which has been working for years... go figure!), and that a "." takes the place of any single character. There are also other characters that don't work: as you pointed out, "<" and ">" might cause problems... I believe so do things like "!". In addition, "[" and "]" and such are treated as 'filter delineators' rather than as characters... I don't remember exactly which characters acted in this way. (I found out some of this from him when I wanted to eliminate any file where the filename started with an exclamation point: I'd never 'met' one that WASN'T spam! ...never could get that to work... even entering [!]*[.]jpg didn't work at that point {it ignored the "[!]" and simply filtered as if it was *[.]jpg, i.e. everything was filtered out! Who knows... it may work now}.)
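In standard regex syntax the behaviour described for `[!]*[.]jpg` is expected: `*` means "zero or more of the preceding item", so `[!]*` is satisfied by no exclamation point at all. The sketch below shows this with Python's `re`; whether Newsbin's engine applies the same rules, and whether it tests the filename or the whole subject, are assumptions.

```python
import re

# "[!]*" may match zero "!" characters, so this finds ".jpg" anywhere.
loose = re.compile(r"[!]*[.]jpg")

# "^!" requires the tested text to start with "!", then ".*" allows anything
# before the ".jpg" extension.
strict = re.compile(r"^!.*[.]jpg")

loose_hits_plain = bool(loose.search("holiday.jpg"))
strict_hits_plain = bool(strict.search("holiday.jpg"))
strict_hits_bang = bool(strict.search("!spam.jpg"))

print(loose_hits_plain, strict_hits_plain, strict_hits_bang)  # True False True
```

Used as a reject filter, the loose form would throw away every .jpg post, which matches what was observed; the anchored form only hits names that actually begin with "!".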

I really have no idea what was the problem with my initial file... maybe in my listing, I ended up with two identical filter lines and that freaked it out? (I did do line-by-line checking when I re-edited the file(s) and found a couple instances where there were two of the same entry which I then deleted one of.) That would be something that the program automatically prevents with the standard interface... (though if I'd used the standard interface, of course, I would have had to manually type in nearly 1000 lines to get the filter I now have, and have NO idea if I had typos in it!) From what I know about some programming: that one 'duplicated' line may very well have shut down everything.
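The line-by-line check for repeated entries can be automated. This is a small, hypothetical helper that works on plain pattern lines; it does not read the XML file itself, and treating entries as case-insensitive is an assumption.

```python
from collections import Counter

def duplicate_lines(lines):
    """Return entries that occur more than once, ignoring case and outer spaces."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return sorted(text for text, n in counts.items() if n > 1)

# Example entries with one repeated pattern and one blank line.
entries = ["Heigl", "Zeta", "heigl ", "", "CZJ"]
dupes = duplicate_lines(entries)
print(dupes)  # ['heigl']
```

Whether a duplicated entry could actually stop the filters from working is not established in this thread; the check only makes such entries easy to find and remove.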

At any rate... thanks for sticking with me through this... it is still so incredibly hard for me to believe what this will do to my ease of using the program! The term that came to mind when I realized how MUCH effort this filtering would save me was "elated" (particularly the all-in-one filter file, so I can just point the program at a huge list of newsgroups and say, "Fetch!" :) ) .

Thanks again!
Postby Quade » Sun Oct 19, 2003 11:39 pm

No doubt, I see a page of text on something like this and my eyes shy away. My guess is it could have been compressed down to a paragraph or two.

I imagine there are no real limits to the XML file size as long as you keep the syntax basic. The problem is the state machine for the filters. I think its size is restricted. Something I need to look at.
Postby DThor » Mon Oct 20, 2003 6:15 pm

As someone who's famous for overwriting ;) , Kaigi, trust me: I've found more people read the less you write. It's all about condensing down to the bare, necessary info. I still have trouble doing it, but I try, because it's worth it. :)


