filter for [email protected]

Tips on writing regular expressions for searching the post list

Moderators: Quade, dexter

Postby sly001 » Tue Feb 14, 2017 11:41 am

Can anyone write a regex to match any email address where the three parts repeat, like [email protected] or [email protected] or [email protected]. Each part can consist of upper/lower letters and numerals and be of any length. The only constants are the @ and the dot.
sly001
Occasional Contributor
Posts: 17
Joined: Sun Apr 03, 2005 1:28 pm

Re: filter for [email protected]

Postby tl » Tue Feb 14, 2017 4:35 pm

sly001 wrote:Can anyone write a regex to match any email address where the three parts repeat, like [email protected] or [email protected] or [email protected]. Each part can consist of upper/lower letters and numerals and be of any length. The only constants are the @ and the dot.

Newsbin uses PCRE regexps with numbered capture groups disabled.

Based on an online regexp tester, I think this should do what you want:
(?<first>[A-Za-z0-9]+)@\k<first>\.\k<first>
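For anyone who wants to try the idea outside Newsbin, here is a minimal sketch in Python. Python's re module is not PCRE: it spells the named group (?P<first>...) and the back-reference (?P=first), so the pattern below is a translation of the one above, not a string Newsbin would accept. The helper name and the sample addresses are made up for illustration.

```python
import re

# Python spelling of (?<first>[A-Za-z0-9]+)@\k<first>\.\k<first>
REPEATED = re.compile(r"(?P<first>[A-Za-z0-9]+)@(?P=first)\.(?P=first)")

def has_repeated_parts(poster: str) -> bool:
    """Return True if the text contains an address shaped like x@x.x."""
    return REPEATED.search(poster) is not None

samples = [
    "abc123@abc123.abc123",  # hypothetical spam-style address
    "alice@example.com",     # hypothetical ordinary address
]
for s in samples:
    print(s, has_repeated_parts(s))
```

Because the search is not anchored, the group may start partway into the local part, so something like "xabc@abc.abc" would also count as a hit.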
tl
Seasoned User
Posts: 114
Joined: Tue Jul 15, 2003 1:55 pm

Registered Newsbin User since: 04/01/03

Re: filter for [email protected]

Postby sly001 » Tue Feb 14, 2017 5:25 pm

Awesome. That seems to work. Thank you.

Re: filter for [email protected]

Postby sly001 » Tue Feb 14, 2017 5:50 pm

Maybe I spoke too soon. It seemed to work, but not as a filter to reject the recent influx of spam postings. Here's what I did:

1 - Created a new filter set to REJECT if POSTER contains (?<first>[A-Za-z0-9]+)@\k<first>\.\k<first>
2 - Added that as a Header Filter to the Unsorted group


But posts with the poster like 0a86be9a6 <[email protected]> still got through, even though this should have been caught by this filter. I can create a filter that says to ACCEPT and I can see the posts are filtered to show only these posters. So it works to accept them, but not as a reject pre-database write. Any ideas?

Re: filter for [email protected]

Postby dexter » Tue Feb 14, 2017 6:02 pm

The only way to effectively handle this pattern is with backreferences. Unfortunately they seem to be disabled in Newsbin. Quade has an item on his list to look into this. If we can get backreferences enabled then this is the RE you would use:

([0-9a-z]+)[ ]\<\1\@\1\.\1\>

There is a space in the square brackets to catch the space before the email portion.

Other than that, if the repeating portions are all the same length, 9 characters in your example, you could do this:

[0-9a-z]{9}[ ]\<[0-9a-z]{9}\@[0-9a-z]{9}\.[0-9a-z]{9}\>

You don't need to include A-Z because REs in Newsbin are case insensitive.
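As a rough illustration of what the back-reference version would do once it is usable, here is a sketch in Python rather than Newsbin. Assumptions: the optional backslashes before <, @ and > are dropped because they are plain literals, re.IGNORECASE stands in for Newsbin's case-insensitive matching, and the poster strings and function name are invented.

```python
import re

# ([0-9a-z]+) captures the display name; each \1 must repeat that same text.
POSTER_SPAM = re.compile(r"([0-9a-z]+) <\1@\1\.\1>", re.IGNORECASE)

def is_spam_poster(poster: str) -> bool:
    """Return True when name, mailbox, domain and suffix are all the same text."""
    return POSTER_SPAM.search(poster) is not None

# Hypothetical poster fields
print(is_spam_poster("abc123 <abc123@abc123.abc123>"))
print(is_spam_poster("someone <friend@example.com>"))
```

The length of the repeated part does not matter here, which is the advantage over the fixed-length pattern.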
dexter
Site Admin
Posts: 9514
Joined: Fri May 18, 2001 3:50 pm
Location: Northern Virginia, US

Registered Newsbin User since: 10/24/97

Re: filter for [email protected]

Postby sly001 » Tue Feb 14, 2017 6:08 pm

They are not always 9 characters. I've seen them vary from 6 to 12 - but really they can be any length. Guess I just need to wait for a NB update where backreferences are enabled.

Re: filter for [email protected]

Postby dexter » Tue Feb 14, 2017 8:37 pm

If they are 6-12, you could do:


[0-9a-z]{6,12}[ ]\<[0-9a-z]{6,12}\@[0-9a-z]{6,12}\.[0-9a-z]{6,12}\>

Re: filter for [email protected]

Postby sly001 » Tue Feb 14, 2017 9:38 pm

Thank you. In trying to future-proof this:
- What is the range I can use? Can I do from 1-99 characters with {1,99} instead of {6,12}?
- Is this currently case-insensitive, or would I need to change it to [0-9a-zA-Z]?
- Is it possible to include special characters in addition to letters/numerals?

Re: filter for [email protected]

Postby Quade » Wed Feb 15, 2017 1:40 pm

1 - yes - To me {1,99} is kinda pointless. If you mean "all" then ".*" is probably better. The power of the curly braces is being able to set minimum and maximum ranges. The spam has at least N characters of number/letters with no space so you're better off setting a minimum threshold for length. Some size smaller than the maximum but as large as possible so you don't catch too much. Filtering headers, if you filter too much you'll just lose records and won't really even know you lost them.

2 - Not case sensitive.

3 - yes but keep in mind that some characters have to be escaped.

[] needs to be escaped as \[\]

for example.

Here is the list of characters that need to be escaped to use them as normal literals:

Opening square bracket [
Backslash \
Caret ^
Dollar sign $
Period or dot .
Vertical bar or pipe symbol |
Question mark ?
Asterisk or star *
Plus sign +
Opening round bracket ( and the closing round bracket )
These special characters are often called "metacharacters".
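To see why the escaping matters, here is a small sketch in Python. re.escape is Python's own helper and is only a stand-in here; in a Newsbin filter you would type the backslashes yourself. The file names are invented.

```python
import re

literal = "file.rar[1]"

# Unescaped, "." matches any character and "[1]" is a character class.
unescaped = re.compile(literal)
# Escaped, every metacharacter is taken literally.
escaped = re.compile(re.escape(literal))

print(re.escape(literal))                       # the escaped pattern text
print(bool(unescaped.search("fileXrar1")))      # loose: matches unintended text
print(bool(escaped.search("fileXrar1")))        # strict: no match
print(bool(escaped.search("my file.rar[1]")))   # strict: matches the literal
```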


Found:
http://stackoverflow.com/questions/1296 ... -net-regex


Keep in mind with regexes you only have to match on some minimum. You don't need to match the whole string.

If you reject a 40-90 character run of numbers and letters with no spaces, that's enough to block this spam.
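A quick sketch of that "match on a minimum" point, in Python rather than Newsbin. Since the pattern is searched for anywhere in the string, the upper bound in {40,90} does not change which strings get flagged; any run of 40 or more qualifies. The threshold of 40 comes from the post above; the helper name and test strings are made up.

```python
import re

AT_LEAST_40 = re.compile(r"[0-9a-z]{40}", re.IGNORECASE)
RANGE_40_90 = re.compile(r"[0-9a-z]{40,90}", re.IGNORECASE)

def flagged(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

for length in (39, 40, 200):
    run = "x" * length  # unbroken run of letters, no spaces
    print(length, flagged(AT_LEAST_40, run), flagged(RANGE_40_90, run))
```

Both patterns give the same answer for every length, including the 200-character run that is longer than the stated maximum.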
Quade
Eternal n00b
Posts: 45003
Joined: Sat May 19, 2001 12:41 am
Location: Virginia, US

Registered Newsbin User since: 10/24/97

Re: filter for [email protected]

Postby sly001 » Wed Feb 15, 2017 4:05 pm

So maybe I'm misunderstanding. This regex:
[0-9a-z]{6,12}[ ]\<[0-9a-z]{6,12}\@[0-9a-z]{6,12}\.[0-9a-z]{6,12}\>

catches this:
spam123 <[email protected]>

But will it also catch this?
notspam <[email protected]>

If so, then it's not what I want. I need to match repeated phrases - like where spam123 is used in each 'block'. Which sounds like backreferences - which are currently unsupported. So - what do you suggest is the best way to filter out this spam where the poster email address is continually changing, yet follows the pattern of Links not allowed for unregistered users?

Re: filter for [email protected]

Postby dexter » Wed Feb 15, 2017 4:19 pm

Yeah, it will match "notspam <[email protected]>". That's why I said the best solution would be if Newsbin supported back-references.

Until that happens, the only other way would be to find some pattern in the subject that is common to all these posts.
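To make the false positive concrete, here is a hedged sketch in Python (not Newsbin). It runs the length-only pattern with capture groups added, then compares the captured parts in ordinary code, which is the check a back-reference would perform inside the regex. The poster strings and the function name are invented.

```python
import re

# Same shape as the {6,12} pattern, with each part captured for inspection.
LENGTH_ONLY = re.compile(
    r"([0-9a-z]{6,12}) <([0-9a-z]{6,12})@([0-9a-z]{6,12})\.([0-9a-z]{6,12})>",
    re.IGNORECASE,
)

def classify(poster: str) -> str:
    m = LENGTH_ONLY.search(poster)
    if m is None:
        return "no match"
    # A back-reference would require all four parts to be identical.
    parts = {p.lower() for p in m.groups()}
    return "match, parts repeat" if len(parts) == 1 else "match, parts differ"

for poster in (
    "abc1234 <abc1234@abc1234.abc1234>",  # hypothetical spam poster
    "goodguy <friend1@example.domain>",   # hypothetical legitimate poster
    "bob <bob@bob.bob>",                  # repeats, but shorter than 6
):
    print(poster, "->", classify(poster))
```

The second poster is the problem case: the length-only pattern accepts it even though nothing repeats. The third shows the opposite miss, a repeating address that falls outside the length range.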

Re: filter for [email protected]

Postby sly001 » Wed Feb 15, 2017 4:32 pm

damn

Re: filter for [email protected]

Postby Quade » Wed Feb 15, 2017 5:41 pm

catches this:
spam123 <[email protected]>


You know filtering out email addresses is failure prone. You're better off just filtering in either the posters you like or the subjects you like ("\[FULL\]" for example)

Many spammers are using random posting fields so it's impossible to match them all. Better to filter IN what you like than trying to filter OUT what you don't like.
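A minimal sketch of the filter-IN approach, written in Python only to show the logic (in Newsbin this would be an ACCEPT filter on the subject). The \[FULL\] pattern is the example from the post; the subjects and the function name are made up.

```python
import re

# Keep only subjects that carry the wanted tag; everything else is dropped.
WANTED = re.compile(r"\[FULL\]", re.IGNORECASE)

def keep(subject: str) -> bool:
    return WANTED.search(subject) is not None

subjects = [
    "[FULL] some.release.name (1/20)",  # hypothetical wanted post
    "x9f3k2 random spam subject",       # hypothetical spam post
]
kept = [s for s in subjects if keep(s)]
print(kept)
```

Spam with randomized fields never has to be described at all; it simply fails to match the allow pattern.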

