Project ESG+AI/[Samjong KPMG] Full-Stack Development with ESG Data

Day 44.

by GreenJin_S2 2025. 12. 15.

 

conda activate torch313

 

Enter the command above, then launch Cursor.

 

 

 

 

 

 

 

 

https://parksrazor.tistory.com/93

 

Python / NLP / 2020-05-09 / Analyzing the Samsung 2018 Report


 

 

Unzip the data file and add its three files.

 

 

 

Copy the def __init__ part over from the Emma section.

 

 

 

    def extract_noun(self):
        # e.g. "삼성전자의 스마트폰은" -> "삼성전자 스마트폰" (keep only the nouns)
        noun_tokens = []
        tokens = self.change_token(self.extract_hangeul(self.read_file()))
        for i in tokens:
            pos = self.okt.pos(i)
            temp = [j[0] for j in pos if j[1] == 'Noun']
            if len(''.join(temp)) > 1 :
                noun_tokens.append(''.join(temp))
        texts = ' '.join(noun_tokens)
        ic(texts[:100])
        return texts
    def read_stopword(self):
        self.okt.pos("삼성전자 글로벌센터 전자사업부", stem=True)
        fname = './data/stopwords.txt'
        with open(fname, 'r', encoding='utf-8') as f:
            stopwords = f.read()
        return stopwords

    def remove_stopword(self):
        texts = self.extract_noun()
        tokens = self.change_token(texts)
        # print('------- 1 nouns -------')
        # print(texts[:30])
        stopwords = self.read_stopword()
        # print('------- 2 stopwords -------')
        # print(stopwords[:30])
        # print('------- 3 filtered -------')
        texts = [text for text in tokens
                 if text not in stopwords]
        # print(texts[:30])
        return texts
    def find_freq(self):
        texts = self.remove_stopword()
        freqtxt = pd.Series(dict(FreqDist(texts))).sort_values(ascending=False)
        ic(freqtxt[:30])
        return freqtxt
    def draw_wordcloud(self):
        texts = self.remove_stopword()
        wcloud = WordCloud('./data/D2Coding.ttf', relative_scaling=0.2,
                           background_color='white').generate(" ".join(texts))
        plt.figure(figsize=(12, 12))
        plt.imshow(wcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
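
The last few methods come down to filtering tokens against a stopword list and counting what is left. Below is a minimal standard-library sketch of just that step. The names `filter_tokens` and `top_frequencies` and the sample tokens are hypothetical, not part of the class above. Note that it matches stopwords exactly via a set, whereas `text not in stopwords` on a raw string performs a substring check, so results can differ.

```python
from collections import Counter


def filter_tokens(tokens, stopwords):
    """Drop tokens that exactly match one of the stopwords."""
    stopword_set = set(stopwords)
    return [t for t in tokens if t not in stopword_set]


def top_frequencies(tokens, n=30):
    """Return the n most common tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


# Hypothetical sample data, not taken from the report.
tokens = ["삼성전자", "스마트폰", "삼성전자", "경영", "스마트폰", "삼성전자"]
stopwords = ["경영"]

kept = filter_tokens(tokens, stopwords)
print(top_frequencies(kept))
```

The list of kept tokens can then be joined with spaces and handed to the word cloud, as `draw_wordcloud` does.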
 
 
 
 

Pasted in the instructor's version (copied from the lab leader).

 

 

My __init__ was a bit different, so I pasted in Minsol's.

class SamsungWordCloud:
    """
    Generate a word cloud from the Samsung 2018 report.

    The class downloads required NLTK resources on first use and
    produces a word cloud image based on noun frequency.
    """

    def __init__(self, quiet: bool = True):
        """
        Initializer.

        Args:
            quiet: Whether to suppress NLTK download output (default: True).
        """
        # Download NLTK data (needed for word_tokenize).
        try:
            nltk.download('punkt', quiet=quiet)
            nltk.download('punkt_tab', quiet=quiet)  # required by recent NLTK versions
            nltk.download('stopwords', quiet=quiet)
        except Exception as e:
            # If a download fails, only emit a warning and carry on.
            import warnings
            warnings.warn(f"Error while downloading NLTK resources: {e}")
       
        self.okt = Okt()

 

The imports have to be added; pull them in with Tab (auto-complete).
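
Before adding the imports, it can help to see which packages are actually installed in the active environment. The sketch below is only an assumption about what the pasted code needs: the package list is inferred from the names it uses (`nltk`, `Okt`, `pd`, `FreqDist`, `WordCloud`, `plt`, `ic`), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level packages the pasted code appears to rely on (an assumption
# inferred from the names nltk, Okt, pd, WordCloud, plt, and ic).
REQUIRED = ["nltk", "konlpy", "pandas", "wordcloud", "matplotlib", "icecream"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("missing:", missing_modules(REQUIRED))
```

Anything it reports as missing would need to be installed into the torch313 environment before the imports resolve.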

 

 

 


@ai.minsol.kr/mlservice/app/nlp/data/kr-Report_2018.txt I want to download a Korean dictionary library to build a Korean BoW from this file. Which one do you recommend, and what is a strategy for setting it up once and reusing it from then on?


Not needed right now because we are using Okt; keeping it for reference later at work.


 

 

If the characters come out broken:

Fix it using the D2Coding font

 

 

 

@ai.seoeunjin.com/mlservice/app/nlp/samsung/samsung_wordcloud.py @ai.seoeunjin.com/mlservice/app/nlp/save Here, also add code for the path used to save into the save folder

 

 

@ai.minsol.kr/mlservice/app/nlp/nlp_router.py In this file, call the method in @ai.minsol.kr/mlservice/app/nlp/samsung/samsung_wordcloud.py so that it works at localhost:8080/api/mlservice/nlp/samsung, and make sure the generated word cloud file is saved to @ai.minsol.kr/mlservice/app/nlp/save

 

@ai.seoeunjin.com/mlservice/app/nlp/nlp_router.py In this file, call the method in @ai.seoeunjin.com/mlservice/app/nlp/samsung/samsung_wordcloud.py so that it works at localhost:8080/api/ml/samsung, and make sure the generated word cloud file is saved to @ai.seoeunjin.com/mlservice/app/nlp/save

 

 

Usage:

  • GET localhost:8080/api/mlservice/nlp/samsung - saves to the default path (app/nlp/save/samsung_wordcloud.png)
  • GET localhost:8080/api/mlservice/nlp/samsung?save=path/filename.png - saves to the path you specify

Important: the server must be restarted for the changes to take effect. After restarting, calling localhost:8080/api/mlservice/nlp/samsung from Postman saves the file to app/nlp/save/samsung_wordcloud.png.
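
The default-versus-custom save path described above could be resolved along these lines. This is a sketch under assumptions, not the code Cursor generated: `resolve_save_path` and its `base_dir` parameter are hypothetical, and only the default path and the optional `save` value come from the usage notes.

```python
import tempfile
from pathlib import Path

# Default target from the usage notes above.
DEFAULT_SAVE = Path("app/nlp/save/samsung_wordcloud.png")


def resolve_save_path(save=None, base_dir="."):
    """Pick the target file and make sure its parent folder exists."""
    target = Path(base_dir) / (Path(save) if save else DEFAULT_SAVE)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demo in a throwaway directory so nothing is written into the project.
with tempfile.TemporaryDirectory() as tmp:
    print(resolve_save_path(base_dir=tmp))
    print(resolve_save_path("out/custom.png", base_dir=tmp))
```

Creating the parent folder up front avoids a failure when the save folder does not exist yet at the time the image is written.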

 

 

1) Rebuild/restart only mlservice

 

conda activate torch313

Activate torch313, then run docker compose up!!


Create the folders and files, then download about 50 data files into the corpus folder.

 

https://github.com/e9t/nsmc/

 

GitHub - e9t/nsmc: Naver sentiment movie corpus


From this repository, download about 50 of the JSON files in the raw folder.
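
Once the files are downloaded, a quick way to confirm they are all readable is to load them in one pass. The sketch below makes no assumption about the fields inside each file; `load_json_files` is a hypothetical helper, and the demo writes its own tiny files instead of using the real corpus.

```python
import json
import tempfile
from pathlib import Path


def load_json_files(folder, limit=50):
    """Load up to `limit` JSON files from `folder`, sorted by file name."""
    records = {}
    for path in sorted(Path(folder).glob("*.json"))[:limit]:
        with open(path, encoding="utf-8") as f:
            records[path.name] = json.load(f)
    return records


# Demo with throwaway files standing in for the downloaded corpus.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"{i}.json").write_text(json.dumps([i]), encoding="utf-8")
    loaded = load_json_files(tmp, limit=2)
    print(sorted(loaded))
```

Pointing `folder` at the local corpus folder would return a dict keyed by file name, which makes it easy to spot a file that failed to download properly.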

 

 


https://huggingface.co/monologg/koelectra-small-v3-discriminator/tree/main

 

 

Download the 4 files; .gitattributes is not needed.