# Use Python libraries in Athena for Spark
<a name="notebooks-spark-python-library-support"></a>

**Note**  
This page describes how to use Python libraries with PySpark engine version 3. The Apache Spark version 3.5 release is based on [Amazon EMR 7.12](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7120-release.html). For the libraries included in this version, refer to EMR 7.12.

This page defines the terms used for the runtimes, libraries, and packages in Amazon Athena for Apache Spark and describes how their lifecycle is managed.

## Definitions
<a name="notebooks-spark-python-library-support-definitions"></a>
+ **Amazon Athena for Apache Spark** is a customized version of open source Apache Spark. To see the current version, run the command `print(f'{spark.version}')` in a notebook cell.
+ The **Athena runtime** is the environment in which your code runs. The environment includes a Python interpreter and PySpark libraries.
+ An **external library or package** is a Java or Scala JAR or Python library that is not part of the Athena runtime but can be included in Athena for Spark jobs. External packages can be built by Amazon or by you.
+ A **convenience package** is a collection of external packages selected by Athena that you can choose to include in your Spark applications.
+ A **bundle** combines the Athena runtime and a convenience package.
+ A **user library** is an external library or package that you explicitly add to your Athena for Spark job.
  + A user library is an external package that is not part of a convenience package. A user library requires loading and installation, as when you write some `.py` files, zip them up, and then add the `.zip` file to your application.
+ An **Athena for Spark application** is a job or query that you submit to Athena for Spark.
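
As a rough illustration of the first two definitions, the following sketch collects the interpreter details of the environment it runs in and, when a Spark session is passed in, the Spark version as well. The helper name `describe_runtime` is hypothetical. The only Athena-specific assumption is that the notebook provides a `spark` session object, as the `print(f'{spark.version}')` command implies.

```python
import platform
import sys


def describe_runtime(spark=None):
    """Return basic facts about the environment this code runs in.

    `describe_runtime` is a hypothetical helper, not part of Athena or PySpark.
    """
    info = {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "machine": platform.machine(),
    }
    # `spark` is optional so that the helper also works outside a notebook.
    if spark is not None:
        info["spark_version"] = spark.version
    return info


print(describe_runtime())
```

In a notebook cell, calling `describe_runtime(spark)` would add the Spark version to the result.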

## Lifecycle management
<a name="notebooks-spark-python-library-support-lifecycle-management"></a>

The following sections describe the versioning and deprecation policies for the runtime and convenience packages used in Athena for Spark.

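### Runtime versioning and deprecation
<a name="notebooks-spark-python-library-support-runtime-versioning-and-deprecation"></a>

The main component in the Athena runtime is the Python interpreter. Because Python is an evolving language, new versions are released regularly and support is removed for older versions. We recommend that you use the latest Athena runtime whenever possible rather than run programs with deprecated Python interpreter versions.

The Athena runtime deprecation schedule is as follows:

1. After Athena provides a new runtime, Athena continues to support the previous runtime for 6 months. During that time, Athena applies security patches and updates to the previous runtime.

1. After 6 months, Athena ends support for the previous runtime. Athena no longer applies security patches and other updates to the previous runtime. Spark applications that use the previous runtime are no longer eligible for technical support.

1. After 12 months, you can no longer update or edit Spark applications in a workgroup that uses the previous runtime. We recommend that you update your Spark applications before this time period ends. After the time period ends, you can still run existing notebooks, but any notebooks that still use the previous runtime log a warning to that effect.

1. After 18 months, you can no longer run jobs in the workgroup using the previous runtime.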
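
To make the schedule concrete, the following sketch computes the three milestone dates from the release date of a new runtime. The release date used here is invented for illustration, and the calendar-month arithmetic is an assumption; the actual cutoff dates are set by Athena.

```python
import calendar
from datetime import date

# Months after the new runtime's release at which each stage begins,
# taken from the schedule above.
MILESTONES = {
    6: "support for the previous runtime ends",
    12: "applications can no longer be updated or edited",
    18: "jobs can no longer run",
}


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def deprecation_schedule(new_runtime_release):
    """Map each milestone date to a short description of what changes."""
    return {
        add_months(new_runtime_release, months): label
        for months, label in MILESTONES.items()
    }


# Hypothetical release date, used only for this illustration.
for when, label in deprecation_schedule(date(2024, 1, 15)).items():
    print(when.isoformat(), label)
```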

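### Convenience package versioning and deprecation
<a name="notebooks-spark-python-library-support-convenience-package-versioning-and-deprecation"></a>

The contents of convenience packages change over time. Athena occasionally adds, removes, or upgrades these convenience packages.

Athena uses the following guidelines for convenience packages:
+ Convenience packages have a simple versioning scheme such as 1, 2, 3.
+ Each convenience package version includes specific versions of external packages. After Athena creates a convenience package, the convenience package's set of external packages and their corresponding versions do not change.
+ Athena creates a new convenience package version when it includes a new external package, removes an external package, or upgrades the version of one or more external packages.

Athena deprecates a convenience package when it deprecates the Athena runtime that the package uses. Athena can deprecate packages sooner to limit the number of bundles that it supports.

The convenience package deprecation schedule follows the Athena runtime deprecation schedule.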

# List of preinstalled Python libraries
<a name="notebooks-spark-preinstalled-python-libraries"></a>

Preinstalled Python libraries include the following.

```
boto3==1.24.31
botocore==1.27.31
certifi==2022.6.15
charset-normalizer==2.1.0
cycler==0.11.0
cython==0.29.30
docutils==0.19
fonttools==4.34.4
idna==3.3
jmespath==1.0.1
joblib==1.1.0
kiwisolver==1.4.4
matplotlib==3.5.2
mpmath==1.2.1
numpy==1.23.1
packaging==21.3
pandas==1.4.3
patsy==0.5.2
pillow==9.2.0
plotly==5.9.0
pmdarima==1.8.5
pyathena==2.9.6
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
requests==2.28.1
s3transfer==0.6.0
scikit-learn==1.1.1
scipy==1.8.1
seaborn==0.11.2
six==1.16.0
statsmodels==0.13.2
sympy==1.10.1
tenacity==8.0.1
threadpoolctl==3.1.0
urllib3==1.26.10
pyarrow==9.0.0
```
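
If you want to compare this list with what an environment actually provides, a small sketch along the following lines may help. It parses `name==version` lines and looks up each installed version with the standard library's `importlib.metadata`. The function names `parse_pins` and `compare_with_installed` are hypothetical, and the sample uses only three entries from the list.

```python
from importlib import metadata


def parse_pins(text):
    """Parse lines of the form name==version into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def compare_with_installed(pins):
    """Return {name: (listed_version, installed_version_or_None)}."""
    report = {}
    for name, listed in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in this environment
        report[name] = (listed, installed)
    return report


SAMPLE = """
boto3==1.24.31
numpy==1.23.1
pandas==1.4.3
"""

for name, (listed, installed) in compare_with_installed(parse_pins(SAMPLE)).items():
    print(f"{name}: listed {listed}, installed {installed}")
```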

## Notes
<a name="notebooks-spark-preinstalled-python-libraries-notes"></a>
+ MLlib (the Apache Spark machine learning library) and the `pyspark.ml` package are not supported.
+ Currently, `pip install` is not supported in Athena for Spark sessions.

For information about importing Python libraries to Amazon Athena for Apache Spark, see [Import files and Python libraries to Athena for Spark](notebooks-import-files-libraries.md).

# Import files and Python libraries to Athena for Spark
<a name="notebooks-import-files-libraries"></a>

This document provides examples of how to import files and Python libraries to Amazon Athena for Apache Spark.

## Considerations and limitations
<a name="notebooks-import-files-libraries-considerations-limitations"></a>
+ **Python version** – Currently, Athena for Spark uses Python version 3.9.16. Note that Python packages are sensitive to minor Python versions.
+ **Athena for Spark architecture** – Athena for Spark uses Amazon Linux 2 on ARM64 architecture. Note that some Python libraries do not distribute binaries for this architecture.
+ **Binary shared objects (SOs)** – Because the SparkContext [addPyFile](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addPyFile.html) method does not detect binary shared objects, it cannot be used in Athena for Spark to add Python packages that depend on shared objects.
+ **Resilient Distributed Datasets (RDDs)** – [RDDs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html) are not supported.
+ **Dataframe.foreach** – The PySpark [DataFrame.foreach](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.foreach.html) method is not supported.
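
Because packages that depend on binary shared objects cannot be added with `addPyFile`, it can be useful to check a `.zip` file before you upload it. The following is a minimal sketch that lists archive members that look like compiled binaries. The suffix list is an assumption and is not exhaustive, and `find_binary_members` is a hypothetical helper name.

```python
import io
import zipfile

# File suffixes that usually indicate compiled code (assumed, not exhaustive).
BINARY_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def find_binary_members(zip_source):
    """Return the archive members that look like binary shared objects."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            name
            for name in archive.namelist()
            # ".so." also catches versioned names such as libfoo.so.1
            if name.lower().endswith(BINARY_SUFFIXES) or ".so." in name.lower()
        ]


# Build a small in-memory archive to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("pure/__init__.py", "")
    archive.writestr("fast/_speedups.cpython-39-aarch64-linux-gnu.so", b"\x7fELF")

print(find_binary_members(buffer))
```

An empty result suggests, but does not prove, that the archive contains only pure Python code.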

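## Examples
<a name="notebooks-import-files-libraries-examples"></a>

The examples use the following conventions.
+ The placeholder Amazon S3 location `s3://amzn-s3-demo-bucket`. Replace this with your own S3 bucket location.
+ All code blocks that execute from a Unix shell are shown as *directory_name* `$`. For example, the command `ls` in the directory `/tmp` and its output are displayed as follows: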

  ```
  /tmp $ ls
  ```

  **Output**

  ```
  file1 file2
  ```

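## Import text files for use in calculations
<a name="notebooks-import-files-libraries-importing-text-files"></a>

The examples in this section show how to import text files for use in calculations in your notebooks in Athena for Spark.

### Add a file to a notebook after writing it to a local temporary directory
<a name="notebooks-import-files-libraries-adding-a-file-to-a-notebook-temporary-directory"></a>

The following example shows how to write a file to a local temporary directory, add it to a notebook, and test it.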

```
import os
from pyspark import SparkFiles
from pyspark.sql.functions import col, udf

# Write a file that contains the value 5 to a local temporary directory
tempdir = '/tmp/'
path = os.path.join(tempdir, "test.txt")
with open(path, "w") as testFile:
    _ = testFile.write("5")

# Distribute the file with the Spark job
sc.addFile(path)

def func(iterator):
    # Read the multiplier from the distributed file, then repeat each element
    with open(SparkFiles.get("test.txt")) as testFile:
        fileVal = int(testFile.readline())
        return [x * fileVal for x in iterator]

# Test the file
udf_with_import = udf(func)
df = spark.createDataFrame([(1, "a"), (2, "b")])
df.withColumn("col", udf_with_import(col('_2'))).show()
```

**Output**

```
Calculation completed.
+---+---+-------+
| _1| _2|    col|
+---+---+-------+
|  1|  a|[aaaaa]|
|  2|  b|[bbbbb]|
+---+---+-------+
```
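
The `col` column shows `[aaaaa]` because the UDF passes each value of column `_2` to `func` as a plain string. Iterating over a one-character string yields that character, and multiplying a string by the integer read from `test.txt` repeats it. The following Spark-free sketch reproduces that logic. It reads the file from a temporary directory instead of using `SparkFiles.get`, and `make_func` is a hypothetical helper introduced only for this illustration.

```python
import os
import tempfile


def make_func(path):
    """Build a function like the example's `func` that reads its multiplier from `path`."""
    def func(iterator):
        with open(path) as test_file:
            file_val = int(test_file.readline())
        return [x * file_val for x in iterator]
    return func


with tempfile.TemporaryDirectory() as tempdir:
    path = os.path.join(tempdir, "test.txt")
    with open(path, "w") as test_file:
        test_file.write("5")
    func = make_func(path)
    single_char = func("a")    # a string is iterated character by character
    two_chars = func("ab")
    numbers = func([1, 2])     # numbers are multiplied instead of repeated

print(single_char)
print(two_chars)
print(numbers)
```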

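### Import a file from Amazon S3
<a name="notebooks-import-files-libraries-importing-a-file-from-s3"></a>

The following example shows how to import a file from Amazon S3 into a notebook and test it.

**To import a file from Amazon S3 into a notebook**

1. Create a file named `test.txt` that has a single line that contains the value `5`.

1. Add the file to a bucket in Amazon S3. This example uses the location `s3://amzn-s3-demo-bucket`.

1. Use the following code to import the file to your notebook and test the file.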

   ```
   from pyspark import SparkFiles
   from pyspark.sql.functions import col, udf

   # Distribute the file from Amazon S3 with the Spark job
   sc.addFile('s3://amzn-s3-demo-bucket/test.txt')

   def func(iterator):
       # Read the multiplier from the distributed file, then repeat each element
       with open(SparkFiles.get("test.txt")) as testFile:
           fileVal = int(testFile.readline())
           return [x * fileVal for x in iterator]

   # Test the file
   udf_with_import = udf(func)
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   df.withColumn("col", udf_with_import(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation completed.
   +---+---+-------+
   | _1| _2|    col|
   +---+---+-------+
   |  1|  a|[aaaaa]|
   |  2|  b|[bbbbb]|
   +---+---+-------+
   ```
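
Note that the example passes the full S3 URI to `sc.addFile` but only the file name to `SparkFiles.get`. If you derive both from a single value, a sketch like the following can extract the file name. `spark_file_name` is a hypothetical helper that uses only the Python standard library.

```python
import posixpath
from urllib.parse import urlparse


def spark_file_name(uri):
    """Return the file name portion of a URI such as s3://bucket/prefix/file.txt."""
    return posixpath.basename(urlparse(uri).path)


print(spark_file_name("s3://amzn-s3-demo-bucket/test.txt"))
```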

## Add Python files
<a name="notebooks-import-files-libraries-adding-python-files"></a>

The examples in this section show how to add Python files and libraries to your Spark notebooks in Athena.

### Add Python files and register a UDF
<a name="notebooks-import-files-libraries-adding-python-files-and-registering-a-udf"></a>

The following example shows how to add Python files from Amazon S3 to your notebook and register a UDF.

**To add Python files to your notebook and register a UDF**

1. Using your own Amazon S3 location, create the file `s3://amzn-s3-demo-bucket/file1.py` with the following contents:

   ```
   def xyz(input):
       return 'xyz  - udf ' + str(input)
   ```

1. In the same S3 location, create the file `s3://amzn-s3-demo-bucket/file2.py` with the following contents:

   ```
   from file1 import xyz
   def uvw(input):
       return 'uvw -> ' + xyz(input)
   ```

1. In your Athena for Spark notebook, run the following commands.

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/file1.py')
   sc.addPyFile('s3://amzn-s3-demo-bucket/file2.py')
   
   def func(iterator):
       from file2 import uvw
       return [uvw(x) for x in iterator]
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   udf_with_import = udf(func)
   
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   
   df.withColumn("col", udf_with_import(col('_2'))).show(10)
   ```

   **Output**

   ```
   Calculation started (calculation_id=1ec09e01-3dec-a096-00ea-57289cdb8ce7) in (session=c8c09e00-6f20-41e5-98bd-4024913d6cee). Checking calculation status...
   Calculation completed.
   +---+---+--------------------+
   | _1| _2|                 col|
   +---+---+--------------------+
   |  1|  a|[uvw -> xyz  - ud...|
   |  2|  b|[uvw -> xyz  - ud...|
   +---+---+--------------------+
   ```
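
Because `file2.py` imports from `file1.py`, both files must be available before `uvw` can be imported. The following local sketch shows the same dependency chain without Spark. It writes the two files to a temporary directory and puts that directory on `sys.path`, which here stands in for the role that `sc.addPyFile` plays in the notebook. The parameter name `value` replaces `input` only to avoid shadowing the Python built-in.

```python
import importlib
import os
import shutil
import sys
import tempfile

FILE1 = "def xyz(value):\n    return 'xyz  - udf ' + str(value)\n"
FILE2 = "from file1 import xyz\n\ndef uvw(value):\n    return 'uvw -> ' + xyz(value)\n"

workdir = tempfile.mkdtemp()
for name, source in (("file1.py", FILE1), ("file2.py", FILE2)):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(source)

# Locally, adding the directory to sys.path makes both modules importable.
sys.path.insert(0, workdir)
importlib.invalidate_caches()

from file2 import uvw  # also imports file1, which file2 depends on

# The modules are loaded now, so the temporary files can be removed.
sys.path.remove(workdir)
shutil.rmtree(workdir)


def func(iterator):
    return [uvw(x) for x in iterator]


print(func("a"))
```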

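### Import a Python .zip file
<a name="notebooks-import-files-libraries-importing-a-python-zip-file"></a>

You can use the SparkContext `addPyFile` method and the Python `import` statement to import a Python .zip file to your notebook.

**Note**  
The `.zip` files that you import to Athena Spark may include only Python packages. For example, including packages with C-based files is not supported.

**To import a Python `.zip` file to your notebook**

1. On your local computer, in a working directory such as `/tmp`, create a directory called `moduletest`.

1. In the `moduletest` directory, create a file named `hello.py` with the following contents:

   ```
   def hi(input):
       return 'hi ' + str(input)
   ```

1. In the same directory, add an empty file with the name `__init__.py`.

   If you list the directory contents, they should now look like the following.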

   ```
   /tmp $ ls moduletest
   __init__.py       hello.py
   ```

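1. From the parent directory (`/tmp` in this example), use the `zip` command to place the `moduletest` directory and its two module files into a file called `moduletest.zip`. Keeping the directory inside the archive is what allows the `from moduletest.hello import hi` statement in the final step to resolve.

   ```
   /tmp $ zip -r9 moduletest.zip moduletest
   ```

1. Upload the `.zip` file to your bucket in Amazon S3.

1. Use the following code to import the Python `.zip` file into your notebook.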

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/moduletest.zip')
   
   from moduletest.hello import hi
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   hi_udf = udf(hi)
   
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=6ec09e8c-6fe0-4547-5f1b-6b01adb2242c) in (session=dcc09e8c-3f80-9cdc-bfc5-7effa1686b76). Checking calculation status...
   Calculation completed.
   +---+---+----+
   | _1| _2| col|
   +---+---+----+
   |  1|  a|hi a|
   |  2|  b|hi b|
   +---+---+----+
   ```
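
For `from moduletest.hello import hi` to resolve, the archive needs to contain the `moduletest/` directory itself, not only the two files. The following sketch builds such an archive with the standard `zipfile` module and then imports from it locally by adding the `.zip` file to `sys.path`. The temporary working directory is an assumption made for this illustration, and the parameter name `value` replaces `input` only to avoid shadowing the Python built-in.

```python
import os
import sys
import tempfile
import zipfile

# Recreate the moduletest package in a temporary working directory.
workdir = tempfile.mkdtemp()
package_dir = os.path.join(workdir, "moduletest")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "__init__.py"), "w"):
    pass  # empty file that marks the directory as a package
with open(os.path.join(package_dir, "hello.py"), "w") as handle:
    handle.write("def hi(value):\n    return 'hi ' + str(value)\n")

zip_path = os.path.join(workdir, "moduletest.zip")
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
    for file_name in sorted(os.listdir(package_dir)):
        # arcname keeps the "moduletest/" prefix inside the archive.
        archive.write(
            os.path.join(package_dir, file_name),
            arcname=f"moduletest/{file_name}",
        )

with zipfile.ZipFile(zip_path) as archive:
    members = archive.namelist()
print(members)

# Python can import directly from a .zip file that is on sys.path.
sys.path.insert(0, zip_path)
from moduletest.hello import hi

print(hi("a"))
```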

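### Import two versions of a Python library as separate modules
<a name="notebooks-import-files-libraries-importing-two-library-versions"></a>

The following code example shows how to add and import two different versions of a Python library from a location in Amazon S3 as two separate modules. The code adds each library file from S3, imports it, and then prints the library version to verify the import.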

```
sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_15.zip')
sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_17_6.zip')

import simplejson_v3_15
print(simplejson_v3_15.__version__)
```

**Output**

```
3.15.0
```

```
import simplejson_v3_17_6
print(simplejson_v3_17_6.__version__)
```

**Output**

```
3.17.6
```
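
The example imports the modules as `simplejson_v3_15` and `simplejson_v3_17_6`, which implies that each `.zip` file contains a top-level package with a version-specific name. How those archives were built is not shown here. The following sketch illustrates one possible approach using a made-up package called `demolib`. It is an assumption, not the documented procedure, and renaming a real library this way works only if the library's internal imports are relative.

```python
import os
import sys
import tempfile
import zipfile


def build_versioned_zip(workdir, module_name, version):
    """Create <module_name>.zip containing a one-file package named <module_name>."""
    zip_path = os.path.join(workdir, f"{module_name}.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(
            f"{module_name}/__init__.py", f"__version__ = '{version}'\n"
        )
    return zip_path


workdir = tempfile.mkdtemp()
# demolib is hypothetical; each version gets its own top-level package name.
for module_name, version in (("demolib_v1_0", "1.0.0"), ("demolib_v2_0", "2.0.0")):
    sys.path.insert(0, build_versioned_zip(workdir, module_name, version))

import demolib_v1_0
import demolib_v2_0

print(demolib_v1_0.__version__, demolib_v2_0.__version__)
```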

### Import a Python .zip file from PyPI
<a name="notebooks-import-files-libraries-importing-a-python-zip-file-from-a-github-project"></a>

This example uses the `pip` command to download the [bpabel/piglatin](https://github.com/bpabel/piglatin) project from the [Python Package Index (PyPI)](https://pypi.org/) and package it as a Python .zip file.

**To import a Python .zip file from PyPI**

1. On your local desktop, use the following commands to create a directory called `testpiglatin` and create a virtual environment.

   ```
   /tmp $ mkdir testpiglatin
   /tmp $ cd testpiglatin
   testpiglatin $ virtualenv .
   ```

   **Output**

   ```
   created virtual environment CPython3.9.6.final.0-64 in 410ms
   creator CPython3Posix(dest=/private/tmp/testpiglatin, clear=False, no_vcs_ignore=False, global=False)
   seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/user1/Library/Application Support/virtualenv)
   added seed packages: pip==22.0.4, setuptools==62.1.0, wheel==0.37.1
   activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
   ```

1. Create a subdirectory named `unpacked` to hold the project.

   ```
   testpiglatin $ mkdir unpacked
   ```

1. Use the `pip` command to install the project into the `unpacked` directory.

   ```
   testpiglatin $ bin/pip install -t $PWD/unpacked piglatin
   ```

   **Output**

   ```
   Collecting piglatin
   Using cached piglatin-1.0.6-py2.py3-none-any.whl (3.1 kB)
   Installing collected packages: piglatin
   Successfully installed piglatin-1.0.6
   ```

1. Check the contents of the directory.

   ```
   testpiglatin $ ls
   ```

   **Output**

   ```
   bin lib pyvenv.cfg unpacked
   ```

1. Change to the `unpacked` directory and display the contents.

   ```
   testpiglatin $ cd unpacked
   unpacked $ ls
   ```

   **Output**

   ```
   piglatin piglatin-1.0.6.dist-info
   ```

1. Use the `zip` command to place the contents of the piglatin project into a file called `library.zip`.

   ```
   unpacked $ zip -r9 ../library.zip *
   ```

   **Output**

   ```
   adding: piglatin/ (stored 0%)
   adding: piglatin/__init__.py (deflated 56%)
   adding: piglatin/__pycache__/ (stored 0%)
   adding: piglatin/__pycache__/__init__.cpython-39.pyc (deflated 31%)
   adding: piglatin-1.0.6.dist-info/ (stored 0%)
   adding: piglatin-1.0.6.dist-info/RECORD (deflated 39%)
   adding: piglatin-1.0.6.dist-info/LICENSE (deflated 41%)
   adding: piglatin-1.0.6.dist-info/WHEEL (deflated 15%)
   adding: piglatin-1.0.6.dist-info/REQUESTED (stored 0%)
   adding: piglatin-1.0.6.dist-info/INSTALLER (stored 0%)
   adding: piglatin-1.0.6.dist-info/METADATA (deflated 48%)
   ```

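1. (Optional) Use the following commands to test the import locally.

   1. Set the Python path to the `library.zip` file location and start Python. The `export` keyword makes the variable visible to the `python3` process.

      ```
      /home $ export PYTHONPATH=/tmp/testpiglatin/library.zip
      /home $ python3
      ```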

      **Output**

      ```
      Python 3.9.6 (default, Jun 29 2021, 06:20:32)
      [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      ```

   1. Import the library and run a test command.

      ```
      >>> import piglatin
      >>> piglatin.translate('hello')
      ```

      **Output**

      ```
      'ello-hay'
      ```

1. Use commands like the following to add the `.zip` file from Amazon S3, import it into your notebook in Athena, and test it.

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/library.zip')
   
   import piglatin
   piglatin.translate('hello')
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   hi_udf = udf(piglatin.translate)
   
   df = spark.createDataFrame([(1, "hello"), (2, "world")])
   
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=e2c0a06e-f45d-d96d-9b8c-ff6a58b2a525) in (session=82c0a06d-d60e-8c66-5d12-23bcd55a6457). Checking calculation status...
   Calculation completed.
   +---+-----+--------+
   | _1|   _2|     col|
   +---+-----+--------+
   |  1|hello|ello-hay|
   |  2|world|orld-way|
   +---+-----+--------+
   ```
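
If the `zip` command is not available, the archiving step can also be approximated in Python. The following sketch uses `shutil.make_archive` to archive the contents of a directory, not the directory itself, which is the same layout that `zip -r9 ../library.zip *` produces when run inside `unpacked`. The temporary directory and the stand-in `piglatin` package are assumptions made for this illustration.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory_contents(source_dir, zip_base):
    """Archive the contents of source_dir as zip_base + '.zip' and return its path."""
    return shutil.make_archive(zip_base, "zip", root_dir=source_dir)


# Stand-in for the unpacked directory that pip populated in the steps above.
workdir = tempfile.mkdtemp()
unpacked = os.path.join(workdir, "unpacked")
os.makedirs(os.path.join(unpacked, "piglatin"))
with open(os.path.join(unpacked, "piglatin", "__init__.py"), "w") as handle:
    handle.write("# stand-in for the installed package\n")

zip_path = zip_directory_contents(unpacked, os.path.join(workdir, "library"))
with zipfile.ZipFile(zip_path) as archive:
    names = archive.namelist()

print(os.path.basename(zip_path))
print(names)
```

The package directory appears at the top level of the archive, so `import piglatin` can resolve after the `.zip` file is added.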

### Import a Python .zip file from PyPI that has dependencies
<a name="notebooks-import-files-libraries-importing-a-python-zip-file-with-dependencies"></a>

This example imports the [md2gemini](https://github.com/makeworld-the-better-one/md2gemini) package from PyPI. The package converts text in markdown to the [Gemini](https://gemini.circumlunar.space/) text format and has the following [dependencies](https://libraries.io/pypi/md2gemini):

```
cjkwrap
mistune
wcwidth
```

**To import a Python .zip file that has dependencies**

1. On your local computer, use the following commands to create a directory called `testmd2gemini` and create a virtual environment.

   ```
   /tmp $ mkdir testmd2gemini
   /tmp $ cd testmd2gemini
   testmd2gemini $ virtualenv .
   ```

1. Create a subdirectory named `unpacked` to hold the project.

   ```
   testmd2gemini $ mkdir unpacked
   ```

1. Use the `pip` command to install the project into the `unpacked` directory.

   ```
   testmd2gemini $ bin/pip install -t $PWD/unpacked md2gemini
   ```

   **Output**

   ```
   Collecting md2gemini
     Downloading md2gemini-1.9.0-py3-none-any.whl (31 kB)
   Collecting wcwidth
     Downloading wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
   Collecting mistune<3,>=2.0.0
     Downloading mistune-2.0.2-py2.py3-none-any.whl (24 kB)
   Collecting cjkwrap
     Downloading CJKwrap-2.2-py2.py3-none-any.whl (4.3 kB)
   Installing collected packages: wcwidth, mistune, cjkwrap, md2gemini
   Successfully installed cjkwrap-2.2 md2gemini-1.9.0 mistune-2.0.2 wcwidth-0.2.5
   ...
   ```

1. Change to the `unpacked` directory and check the contents.

   ```
   testmd2gemini $ cd unpacked
   unpacked $ ls -lah
   ```

   **Output**

   ```
   total 16
   drwxr-xr-x  13 user1  wheel   416B Jun  7 18:43 .
   drwxr-xr-x   8 user1  wheel   256B Jun  7 18:44 ..
   drwxr-xr-x   9 user1  staff   288B Jun  7 18:43 CJKwrap-2.2.dist-info
   drwxr-xr-x   3 user1  staff    96B Jun  7 18:43 __pycache__
   drwxr-xr-x   3 user1  staff    96B Jun  7 18:43 bin
   -rw-r--r--   1 user1  staff   5.0K Jun  7 18:43 cjkwrap.py
   drwxr-xr-x   7 user1  staff   224B Jun  7 18:43 md2gemini
   drwxr-xr-x  10 user1  staff   320B Jun  7 18:43 md2gemini-1.9.0.dist-info
   drwxr-xr-x  12 user1  staff   384B Jun  7 18:43 mistune
   drwxr-xr-x   8 user1  staff   256B Jun  7 18:43 mistune-2.0.2.dist-info
   drwxr-xr-x  16 user1  staff   512B Jun  7 18:43 tests
   drwxr-xr-x  10 user1  staff   320B Jun  7 18:43 wcwidth
   drwxr-xr-x   9 user1  staff   288B Jun  7 18:43 wcwidth-0.2.5.dist-info
   ```

1. Use the `zip` command to place the contents of the md2gemini project into a file called `md2gemini.zip`.

   ```
   unpacked $ zip -r9 ../md2gemini.zip *
   ```

   **Output**

   ```
     adding: CJKwrap-2.2.dist-info/ (stored 0%)
     adding: CJKwrap-2.2.dist-info/RECORD (deflated 37%)
     ....
     adding: wcwidth-0.2.5.dist-info/INSTALLER (stored 0%)
     adding: wcwidth-0.2.5.dist-info/METADATA (deflated 62%)
   ```

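1. (Optional) Use the following commands to test that the library works on your local computer.

   1. Set the Python path to the `md2gemini.zip` file location and start Python. The `export` keyword makes the variable visible to the `python3` process.

      ```
      /home $ export PYTHONPATH=/tmp/testmd2gemini/md2gemini.zip
      /home $ python3
      ```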

   1. Import the library and run a test.

      ```
      >>> from md2gemini import md2gemini
      >>> print(md2gemini('[abc](https://abc.def)'))
      ```

      **Output**

      ```
      => https://abc.def abc
      ```

1. Use the following commands to add the `.zip` file from Amazon S3, import it into your notebook in Athena, and perform a non-UDF test.

   ```
   # (non udf test)
   sc.addPyFile('s3://amzn-s3-demo-bucket/md2gemini.zip')
   from md2gemini import md2gemini
   print(md2gemini('[abc](https://abc.def)'))
   ```

   **Output**

   ```
   Calculation started (calculation_id=0ac0a082-6c3f-5a8f-eb6e-f8e9a5f9bc44) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status...
   Calculation completed.
   => https://abc.def abc
   ```

1. Use the following commands to perform a UDF test.

   ```
   # (udf test)
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   from md2gemini import md2gemini
   
   
   hi_udf = udf(md2gemini)
   df = spark.createDataFrame([(1, "[first website](https://abc.def)"), (2, "[second website](https://aws.com)")])
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=60c0a082-f04d-41c1-a10d-d5d365ef5157) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status...
   Calculation completed.
   +---+--------------------+--------------------+
   | _1|                  _2|                 col|
   +---+--------------------+--------------------+
   |  1|[first website](h...|=> https://abc.de...|
   |  2|[second website](...|=> https://aws.co...|
   +---+--------------------+--------------------+
   ```
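
When a package has dependencies, each one must end up in the same `.zip` file. The following sketch checks an archive for the expected top-level names before you upload it. The helper names `top_level_names` and `missing_modules` are hypothetical, and the in-memory archive deliberately omits `wcwidth` to show what a failed check looks like.

```python
import io
import zipfile


def top_level_names(zip_source):
    """Return the top-level package and module names found in a zip archive."""
    names = set()
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            first, _, rest = member.partition("/")
            if first.endswith(".dist-info"):
                continue  # packaging metadata, not importable code
            if rest:
                names.add(first)       # a file inside a top-level directory
            elif first.endswith(".py"):
                names.add(first[:-3])  # a single-file module such as cjkwrap.py
    return names


def missing_modules(zip_source, required):
    """Return the required names that the archive does not contain, sorted."""
    return sorted(set(required) - top_level_names(zip_source))


# In-memory archive that mimics the layout above but leaves out wcwidth.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("md2gemini/__init__.py", "")
    archive.writestr("mistune/__init__.py", "")
    archive.writestr("cjkwrap.py", "")
    archive.writestr("md2gemini-1.9.0.dist-info/METADATA", "")

REQUIRED = ["md2gemini", "cjkwrap", "mistune", "wcwidth"]
print(missing_modules(buffer, REQUIRED))
```

An empty list means that every required name is present at the top level of the archive.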