Home » SQL & PL/SQL » SQL & PL/SQL » Combining multiple regexp_like statements & correct regex syntax
Combining multiple regexp_like statements & correct regex syntax [message #679089] |
Mon, 03 February 2020 10:29 |
|
mgarret40
Messages: 1 Registered: February 2020
|
Junior Member |
|
|
Hi all,
I’m working on creating various 0/1 product flags category variables based on multiple substrings within a string field. The string values contained in the category_description field don’t stick to a consistent classification pattern. Notice for each flag variable I’m using two regexp_like statements. The two things I’m trying to accomplish are.
1. For each category flag variable combine the two regexp_like statements into only one regexp_like statement
2. I’m trying to correct some of the issues I’m running to with the regex syntax.
**Swimsuit classifying categories which actually dress suits and not swimsuits or flagging jumpsuits as swimsuits
**categories is being flagged as swimwear when it represents just shorts. Example: Men > Clothing > Shorts & Swimwear > Shorts
Situations where there are no spaces between substrings. Example: where the description Layette/Infant Unisex - Accessories – Sale is not being flagged as baby when it contains the word infant
Below is my current code, sample data, and I've attached an image on the final output I'm looking for.
All help will be greatly appreciated.
Sample data:
CREATE TABLE category_flags
(
category_description VARCHAR(80)
);
INSERT INTO category_flags
(category_description)
VALUES
("Baby > Basics > Accessories > Hats,Scarves, & Gloves")
("Men > Shoes & Accessories > Hats")
("Men > Shoes & Accessories > Ties")
("Women > Clothing > Jackets")
("Women > Shoes & Accessories > Hats")
("Kids > Boys > Jackets")
("Tall/Men/Outerwear/Cloth/Sale")
("Men > Bottoms > Swim")
("Men > Bottoms > Swimwear")
("US/Big_Tall/Men/Chinos/Flat_Front")
("men>clothing>sweatshirts")
("men>clothing>swimwear")
("Children/Girls/Outfit")
("Men > Accessories > Fragrance")
("Black_Top/Men/Dress_Shirts/Slim_Fit/Sale")
("Men > Bottoms > Suits, Sport Coats & Trousers")
("Women > Tops > Sweaters")
("Kids > Accessories > Boy's Accessories > Bags")
("boys>accessories>socks")
("Blue_Look/Women/Pants/Sale")
("women > Clothing/Knits/> Jackets")
("Women > Clothing > Jackets")
("Women > Shoes & Accessories > Hats")
("Women > Clothing > Jackets")
("Dresswear/Women > Belts")
("clothing > men >shorts & swim trunks > swimtrunks")
("Men > Clothing > Shorts & Swimwear > Shorts")
("men>clothing>swimsuits")
("boys>accessories>socks")
("Layette/Infant Unisex - Accessories")
("Layette/Infant Unisex - Accessories - Sale")
("Basics > Baby Boy & Girl> Accessories > Hats,Scarves, & Gloves")
("Women/Clothing & Shoes/Swimwear")
("Evening/Men > Shoes & Accessories > Ties")
("Summer/Child/Jumpers")
("women/fall/clothing & accessories/jumpsuits")
("Women/Clothing & Swimwear/Skirts")
);
Current script:
select
category_description,
case when regexp_like(category_description, '(.[^o]|[^w]o)men|^men','i')
and not regexp_like(category_description, 'accessories|hats|bags|belts|shoes|socks|ties' ,'i') then 1 else 0 end as Mens_Apparel,
case when regexp_like(category_description, '(.[^o]|[^w]o)men|^men','i')
and regexp_like(category_description, 'accessories|hats|bags|belts|shoes|socks|ties' ,'i') then 1 else 0 end as Mens_Accessories,
case when regexp_like (category_description, 'women','i')
and not regexp_like(category_description, 'accessories|hats|bags|belts|shoes|socks|ties' ,'i') then 1 else 0 end as Womens_Apprarel,
case when regexp_like (category_description, 'women','i')
and regexp_like(category_description, 'accessories|hats|bags|belts|shoes|socks|ties' ,'i') then 1 else 0 end as Womens_Accessories ,
case when regexp_like (category_description, 'boys?|girls?|junior|child(ren)?|kids?','i')
and not regexp_like (category_description,'infant|babys?|hidden|womens?|mens?|gift cards?|test','i') then 1 else 0 end as Kids_Products ,
case when regexp_like(category_description,'toddler|infant|babys?|','i')
and not regexp_like (category_description,'hidden|womens|mens|^gift cards','i') then 1 else 0 end as Baby_Products,
case when regexp_like(category_description, '(.[^o]|[^w]o)men|^men','i')
and regexp_like(category_description,'swim(wear)?|swim(suit)?','i')then 1 else 0 end as Mens_Swimwear ,
case when regexp_like(category_description,'womens','i')
and regexp_like(category_description,'(swim(wear|suits))','i') then 1 else 0 end as Womens_Swimwear
from category_flags
-
Attachment: output.png
(Size: 98.76KB, Downloaded 1186 times)
|
|
|
|
|
Re: Combining multiple regexp_like statements & correct regex syntax [message #679097 is a reply to message #679089] |
Tue, 04 February 2020 11:48 |
flyboy
Messages: 1903 Registered: November 2006
|
Senior Member |
|
|
Hi,
I do not know what do you expect as the whole assignment is not specifically related to Oracle.
Regular expression do not seem to be capable enough to treat these quite complex rules (can you precisely state them?) based on keyword positions among other terms, like conditional classification of the keyword "swimwear".
On the other hand, other cases (keywords "jumpsuit" and "infants") seem to be classified as per requirement in your output.
Anyway, adding another categories would be quite a pain.
Generally, this looks like a job for https://en.wikipedia.org/wiki/Lexical_analysis with some flavour of https://en.wikipedia.org/wiki/Natural_language_processing algorithms.
This approach may be quite an overkill for your task, it depends on overall requirement. Stating grammar covering all possible cases (and many others emerging later) may be difficult as well.
Just a note to your current regular expression masks: they would classify e.g. the value "garments" to "men".
This can be fixed by checking non-alpha character right before and after the word "men" in this mask.
There would be no need for checking the occurrance of "wo" in the string beginning to distinguish it from "women".
However, just mask change will not help in treating multiple occurrances of checked keywords in specific positions with different meanings.
|
|
|
|
Goto Forum:
Current Time: Sat Jun 15 16:02:20 CDT 2024
|