INTUIA/Programa final/spacy/tests/tokenizer/__pycache__/test_tokenizer.cpython-312.pyc
131 lines
30 KiB
Plaintext
2026-03-15 13:27:50 +00:00
Ë
?û gÉJãó
ddlZddlZddlZddlmZddlmZddlmZddl m
Z
ddl m Z ddl
mZddlmZmZmZmZdd lmZej,j/d
«d «Zej,j/d «ej,j3d
¬«ej,j5ddgd¢fdgd¢fdgd¢fdgd¢fdgd¢fdgd¢fdgd¢fg«d«««Zej,j/d«d „«Zej,j/d!«d"„«Zej,j3d#¬«ej,j/d$«d%„««Zej,j/d&«d'„«Zej,j/d(«d)„«Z ej,j/d*«d+„«Z!ej,j/d,«d-„«Z"ej,j/d.«d/„«Z#ej,j3d0¬«ej,j/d1«d2„««Z$ej,j/d3«d4„«Z%ej,j5d5d6d7g«ej,j/d8«d9„««Z&ej,j/d:«d;„«Z'ej,j/d<«d=„«Z(ej,j/d>«d?„«Z)ej,j3d@¬«ej,j/dA«dB„««Z*ej,j5dCdDgdE¢fdFdFgfg«dG„«Z+dH„Z,ej,j5d5dIg«dJ„«Z-dK„Z.dL„Z/dM„Z0ej,j5d5gdN¢«dO„«Z1ej,j5d5dPg«dQ„«Z2ej,j5d5gdR¢«dS„«Z3dT„Z4ej,j5dUdVg«dW„«Z5dX„Z6ej,j5ddIdYdZidYd[igfg«d\„«Z7ej,j5ddIdYdZidYd]igfdIdZd^d_œdYd[igfg«d`„«Z8ej,j5ddIdZdadbœdYd[igfg«dc„«Z9dd„Z:de„Z;df„Z<dg„Z=dh„Z>di„Z?dj„Z@dk„ZAdl„ZBej,j/dm«dn„«ZCdo„ZDy)péN)ÚGerman)ÚEnglish)ÚORTH)Ú Tokenizer)ÚDoc)ÚExample)Úcompile_infix_regexÚcompile_prefix_regexÚcompile_suffix_regexÚ ensure_path)ÚVocabiçcóztt«ddg«}|d}t|g«}t|«}|d|usJy)helloÚworldr)rr
ÚsetÚlist)ÚdocÚtokenÚitemss úeC:\Users\garci\AppData\Roaming\Python\Python312\site-packages\spacy/tests/tokenizer/test_tokenizer.pyÚ
test_issue743rsDä
Œeg˜ Ð
*€CØ ‰F€EÜ ˆUˆG‹ €AÜ ‹G€EØ ‰8 ÐÑ ói!zECan not be fixed unless with variable-width lookbehinds, cf. PR #3218)Úreasonz text,tokensz"deserve,"--and)údeservez,"--Úandzexception;--exclusive)Ú exceptionz;--Ú exclusivezday.--Is)Údayz.--ÚIszrefinement:--just)Ú
refinementz:--Újustz
memories?--To)Úmemoriesz?--ÚTozUseful.=--Therefore)ÚUsefulú.=--Ú Thereforez=Hope.=--Pandora)úHoper'ÚPandoracóŽ||«}t|«t|«k(sJ|Dcgc]}|jŒc}|k(sJycc}w)z;Test that special characters + hyphens are split correctly.N©ÚlenÚtext)Ú en_tokenizerr/ÚtokensrÚts rÚ
test_issue801r3sHñ$ 
€CÜ ˆs8”s˜6“{Ò  Ó ™CqˆAFF˜  FÒ  *ùÒ s¦Ai%cód}t«j}||«}d|Dcgc]}|jŒc}vsJd|Dcgc]}|jŒc}vsJ|jdtdig«||«}d|Dcgc]}|jŒc}vsJd|Dcgc]}|jŒc}vsJt«j}|jdtdig«||«}d|Dcgc]}|jŒc}vsJd|Dcgc]}|jŒc}vsJycc}wcc}wcc}wcc}wcc}wcc}w)z>Test special-case works after tokenizing. Was caching problem.zTI like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.ÚMATHÚ_MATH_N)rÚ tokenizerr/Úadd_special_caser)r/r7rÚws rÚtest_issue1061r:5sOð b€DÜ“ ×#€IÙ
D/€CØ ¡cÓ*¡c a—f“f cÑ  ©CÓ0©C q˜AŸFF¨CÑ  
×јx¬4°Ð*:Ð);Ô
D‹/€CØ ©Ó 1˜Ÿ¨Ñ  ©#Ó.©# Q˜!Ÿ&&¨#Ñ  “ ×#€IØ
×јx¬4°Ð*:Ð);Ô
D‹/€CØ ©Ó 1˜Ÿ¨Ñ  ©#Ó.©# Q˜!Ÿ&&¨#Ñ  .ùò+ùÚ0ùò-ùÚ.ùò -ùÚ.s#¤D*ÁD/ÂD4ÂD9Ã2D>ÄEcó*|d«}tjt|«dfd¬«|_|j «5}|j |dd«ddd«t|«dk(sJ|jj d k(sJy#1swYŒ5xYw)
z(Test that doc.merge() resizes doc.tensorza b c dé€Úf)Údtyperé)r@r<)ÚnumpyÚonesr.ÚtensorÚ
retokenizeÚmergeÚshape)r0rÚ retokenizers rÚtest_issue1963rHKs~ñ 
!€CÜœS ›X s˜O°3Ô7€C„JØ Ô ˜[Ø×ј#˜a ˜(Ô
ä ˆs‹8qŠ=Ј=Ø :‰:× Ñ ˜  
Ð ús ÁB Â BzICan not be fixed without variable-width look-behind (which we don't want)iÓcót«}d}||«}t|«dk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|d jd
k(sJ|d jdk(sJy )
z@Test that g is not split of if preceded by a number and a letterz
e2g 2g 52gérÚe2géÚ2r?Úgr@Ú52éN)rr.r/)ÚnlpÚ testwordsrs rÚtest_issue1235rSVs—ô ‹)€CØ€IÙ
ˆi‹.€CÜ ˆs‹8qŠ=Ј=Ø ˆq‰6;‰;˜%Ò ÐÐ Ø ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜$Ò ÐÐ Ø ˆq‰6;‰;˜#Ò ÐÑ rcóÌt«}|d«}t|«dk(sJt|jddg««}t|d«dk(sJt|d«dk(sJy)rrrL)rr.rÚpipe)rQrÚdocss rÚtest_issue1242rXgseä
)€CÙ
ˆb'€CÜ ˆs‹8qŠ=Ј ˜"˜g˜Ó (€DÜ ˆtA‰w<˜ ÐÐ Ü ˆtA‰w‹<˜1Ò ÐÑ rcó”tt«gd¢¬«}tt«gd¢¬«}|d|dk7sJ|d|dk(rJy)z#Test that tokens compare correctly.©ÚÚwords)r[r]ÚerN)rr
)Údoc1Údoc2s rÚtest_issue1257rcqsOô Œu‹wšoÔ .€DÜ Œu‹wšoÔ .€DØ ‰7d˜1‘gÒ ÐÐ ØA‰w˜$˜q™'Ò !ri_cóÒtt«gd¢¬«}tjt«5|dj d«sJ ddd«|dj d«j dk(sJtjt«5|dj d«sJ ddd«|dj d«j d k(sJy#1swYŒ‰xYw#1swYŒ9xYw)
zBTest that token.nbor() raises IndexError for out-of-bounds access.)Ú1rMr^réÿÿÿÿNrLrer?rM)rr
ÚpytestÚraisesÚ
IndexErrorÚnborr/)rs rÚtest_issue1375rlzô Œe‹gš_Ô
-€CÜ ”zÕ "Ø1‰v{‰{˜Ð÷
ˆq‰6;‰;r?× Ñ    ”zÕ "Ø1‰v{‰{˜1Œ~Љ~÷
ˆq‰6;‰;q>× Ñ   
#Ð "ú÷
#Ð "ús±CÂ
CÃCÃC&có,tjd«Štjd«Štjd«Štjd«Šˆˆˆˆfd}t«}||«|_|d«}|D]}|jrŒJy)zBTest that tokenizer can parse DOT inside non-whitespace separatorsz[\[\("']z[\]\)"']z[-~\.]z
^https?://cóŠt|jijjjj¬«S)N)Ú
prefix_searchÚ
suffix_searchÚinfix_finditerÚ token_match)rÚvocabÚsearchÚfinditerÚmatch)rQÚinfix_reÚ prefix_reÚ
simple_url_reÚ suffix_res €€€€rÚ my_tokenizerz$test_issue1488.<locals>.my_tokenizerŽs>ø€ÜØ I‰IØ Ø#×%×

ð
rzThis is a test.N©ÚreÚcompilerr7r/)r{rQrrrwrxryrzs @@@@rÚtest_issue1488rsyû€ô
˜+€IÜ
˜?Ó+€IÜz‰z˜-Ó(€HÜ—J‘JÐ1€M÷
ô ‹)€CÙ  Ó%€C„MÙ
ÐÓ
€CÛˆØz‹zЈzñrcóòtjd«Šdgd¢fdddgfdgd¢fg}ˆfd „}t«}||«|_|D]*\}}||«Dcgc]}|jŒc}|k(rŒ*Jy
cc}w) z&Test if infix_finditer works correctlyz[^a-z]z
token 123test)rrfrMÚtestz token 1testrÚ1testz hello...test)rú.r„r„rcóHt|jij¬«S)N)rq)rrsru)rQrws €rÚ
new_tokenizerz%test_issue1494.<locals>.new_tokenizer©sø€Ü˜Ÿ B°x×7HÑ7HÔIrNr|)Ú
test_casesr†rQr/Úexpectedrrws @rÚtest_issue1494r‰Ÿø€ôz‰z˜-Ó(€Hà Ò˜ Ò€Jô Jô )€CÙ! &€C„MÛ$‰ˆˆhÙ(+¨D¬ Ó2© ˜u
¨ Ñ2°hÓ%ùÚ2sÁA4zJCan not be fixed without iterative looping between prefix/suffix and infixicóHt«}|d«}t|«dk(sJy)zRTest that checks that a dot followed by a quote is handled
appropriately.
z.First sentence."A quoted sentence" he said ...é N)rr.©rQrs rÚtest_issue2070r²s&ô )€CÙ
Ð
?€CÜ ˆs8rŠ>Љ>rin cót|d«}t|«dk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|d jd
k(sJ|d jdk(sJ|d jd
k(sJ|djdk(sJ|djdk(sJy)zdTest that the tokenizer correctly splits tokens separated by a slash (/)
ending in a digit.
z"Learn html5/css3/javascript/jqueryérÚLearnrLÚhtml5r?ú/r@Úcss3rPrJÚ
javascriptééÚjqueryNr-)Ú fr_tokenizerrs rÚtest_issue2926r™Âñ
Ð
<€CÜ ˆs8qŠ=Ј ˆq‰6;‰;˜  ˆq‰6;‰;˜  ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜&Ò Ð Ð Ø ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜  ˆq‰6;‰;˜ ÐÐ Ø ˆq‰6;‰;˜  "rr/u ABLEItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of VolumeOver Registration of VolumeUnder Registration of VolumeOver Registration of VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volume TABLE ItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of VolumeOver Registration of VolumeUnder Registration of VolumeOver Registration of VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volume ItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of VolumeOver Registration of VolumeUnder Registration of VolumeOver Registration of VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volumezØoow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:iB
có||«}|sJy)zDCheck that sentence doesn't cause an infinite loop in the tokenizer.N©)r0r/rs rÚtest_issue2626_2835rœÓsñ 
€CÙ €J‰3ri`
cóì|d«}t|«dk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|d jd
k(sJ|d jd k(sJ|d
jdk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|djdk(sJy)z`Test that tokenizer correctly splits off punctuation after numbers with
decimal points.
z&I went for 40.3, and got home by 10.0.rrÚIrLÚwentr?Úforr@z40.3rPú,rJrr•ÚgotrÚhomerÚbyé z10.0é
r„Nr-)r0rs rÚtest_issue2656r§ásñ
Ð
@€CÜ ˆs8rŠ>Ј ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜&Ò Ð Ð Ø ˆq‰6;‰;˜%Ò ÐÐ Ø ˆq‰6;‰;˜&Ò Ð Ð Ø ˆq‰6;‰;˜#Ò ÐÐ Ø ˆq‰6;‰;˜%Ò ÐÐ Ø ˆq‰6;‰;˜%Ò ÐÐ Ø ˆq‰6;‰;˜&Ò Ð Ð Ø ˆq‰6;‰;˜$Ò ÐÐ Ø ˆq‰6;‰;˜&Ò Ð Ð Ø ˆr‰7<‰<˜ ÐÑ r
cót|d«}|djdk(sJ|d«}|djdk(sJy)zFTest that words like 'a' and 'a.m.' don't get exceptional norm values.r[rÚamN)Únorm_)r0r[s rÚtest_issue2754r«õsDñ €AØ ˆQ‰4:‰:˜Ò ÐÐ Ù €BØ
ˆa‰5;‰;˜$Ò ÐÑ r cóHt«}|d«}t|«dk(sJy)z;Test that the tokenizer doesn't hang on a long list of dotszW880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange ZahlrJN)rr.s rÚtest_issue3002r­þs*ô ‹(€CÙ
Ø €Cô ˆs8qŠ=Љ=rz;default suffix rules avoid one upper-case letter before dotiy
cóît«}|jd«d}d}d}||«}||«}||«}|djdk(sJ|djdk(sJ|djdk(sJy) sentencizerz>He gave the ball to I. Do you want to go to the movies with I?z?He gave the ball to I. Do you want to go to the movies with I?z>He gave the ball to I.
Do you want to go to the movies with I?rJ)rÚadd_piper/)rQÚtext1Útext2Útext3Út1Út2Út3s rÚtest_issue3449r·ô ‹)€C؇LLÔØ L€EØ M€EØ M€EÙ ˆU‹€BÙ ˆU‹€BÙ ˆU‹€BØ
ˆa‰5:‰:˜Ò ÐÐ Ø
ˆa‰5:‰:˜Ò ÐÐ Ø
ˆa‰5:‰:˜Ò ÐÑ rz
text,wordszA'B C)ÚCzA-BcóD||«}tj|d|i«y)Nr_)rÚ from_dict)r0r/r_rs rÚtest_gold_misalignedr¾s#ñ 
€CÜ ×Ñc˜G ,rcó4|d«}t|«dk(sJy)NrUr©r.)r7r1s rÚtest_tokenizer_handles_no_wordrÁ sÙ
r‹]€FÜ ˆv‹;˜!Ò ÐÑ rÚloremcó<||«}|dj|k(sJy)Nr©r/©r7r/r1s rÚ"test_tokenizer_handles_single_wordrÆ%s!á
t_€FØ !‰9>‰>˜  !rcóØd}||«}t|«dk(sJ|djdk(sJ|djdk(sJ|djdk(sJ|djdk7sJy) Nz
Lorem, ipsum.rPrÚLoremrLr?Úipsumr-s rÚtest_tokenizer_handles_punctrÊ+szØ €DÙ
t‹_€FÜ ˆv‹;˜!Ò ÐÐ Ø !‰9>‰>˜  !‰9>‰>˜ Ð Ð Ø !‰9>‰>˜  !‰9>‰>˜  $rcó8d}||«}t|«dk(sJy)NzLorem, (ipsum).r•s rÚ#test_tokenizer_handles_punct_bracesrÌ5s"Ø €DÙ
t‹_€FÜ ˆv‹;˜!Ò ÐÑ rcó´ddg}d}||«}|dj|vr9t|«dk(sJ|djdk(sJ|djdk(sJyy) NÚhuÚbnzLorem ipsum: 1984.rrJr@Ú1984)Úlang_r.r/)r7Ú
exceptionsr/r1s rÚtest_tokenizer_handles_digitsrÓ;slؘ€JØ €DÙ
t_€Fà
ˆay‡˜jÑ6‹{˜aÒÐÐØa‰y~‰~ Òa‰y~‰~ Ò)r)z
google.comz
python.orgzspacy.ioz explosion.aizhttp://www.google.comcó4||«}t|«dk(sJy©NrLs rÚtest_tokenizer_keep_urlsrÖFsñ
t‹_€FÜ ˆv‹;˜ ÐÑ rz NASDAQ:GOOGcó4||«}t|«dk(sJy)Nr@s rÚtest_tokenizer_colonsrØOsá
t‹_€FÜ ˆv‹;˜!Ò ÐÑ r)zhello123@example.comzhi+there@gmail.itzmatt@explosion.aicó4||«}t|«dk(sJys rÚtest_tokenizer_keeps_emailrÚUsñt‹_€FÜ ˆv‹;˜!Ò ÐÑ rcó8d}||«}t|«dkDsJy)NaÓLorem ipsum dolor sit amet, consectetur adipiscing elit
Cras egestas orci non porttitor maximus.
Maecenas quis odio id dolor rhoncus dignissim. Curabitur sed velit at orci ultrices sagittis. Nulla commodo euismod arcu eget vulputate.
Phasellus tincidunt, augue quis porta finibus, massa sapien consectetur augue, non lacinia enim nibh eget ipsum. Vestibulum in bibendum mauris.
"Nullam porta fringilla enim, a dictum orci consequat in." Mauris nec malesuada justo.rJs rÚ test_tokenizer_handles_long_textrÜ]s%ð Z€Dñt‹_€FÜ ˆv;˜Š?Љ?rÚ file_namezsun.txtcóütt«j|z }|jdd¬«5}|j «}ddd«t «dk7sJ||«}t |«dkDsJy#1swYŒ2xYw)utf8)Úencodingréd)r Ú__file__ÚparentÚopenÚreadr.)r7ÚlocÚinfiler/r1s rÚ$test_tokenizer_handle_text_from_fileréksmä
”hÓ
×
Ñ
2€CØ # ˆÔ '¨6Ø{‰{}ˆ÷
ˆt9˜Š>Ј
t‹_€FÜ ˆv‹;˜Ò ÐÑ ÷
(Ð 'ús °A2Á2A;có|d}d}||«}||«}|djdk(sJ|djdk(sJy)Nz2Lorem dolor sit amet, consectetur adipiscing elit.z8Lorem ipsum dolor sit amet, consectetur adipiscing elit.r)r7Útokens1Útokens2s rÚ(test_tokenizer_suspected_freeing_stringsríusLØ @€EØ F€EÙ˜Ó€GÙ˜Ó€GØ 1‰:?‰?˜gÒ  1‰:?‰?˜  %rÚorthÚloÚremcó |j||«||«}|dj|ddk(sJ|dj|ddk(sJy)NrrL©r8r/)r7r/r1rs rÚtest_tokenizer_add_special_caseró~sYà
×јt 
D/€CØ ˆq‰6;‰;˜& ™)   ˆq‰6;‰;˜& ™)   +rr}r¸)Útagcó„tjt«5|j||«ddd«y#1swYyxYw)N)rhriÚ
ValueErrorr8s rÚ$test_tokenizer_validate_special_caser÷s,ô
”zÕ "Ø×" Ô
#× "Ñ "úsš6?ÚLO)Únormcót«}t|iddd«}|j||«||«}|dj|ddk(sJ|dj|ddk(sJ|dj|ddk(sJy)NrrL)r